The effects of a Corpus on isiZulu Spellcheckers based on N-grams


Peer-reviewed Proceedings

Ndaba, B., Hussein, S., Keet, C.M. & Khumalo, L.

Ndaba, B., Hussein, S., Keet, C.M. & Khumalo, L. 2016. The effects of a Corpus on isiZulu Spellcheckers based on N-grams. Proceedings of the IST-Africa 2016 Conference. Durban. South Africa.

Publication year: 2016

Correct spelling contributes to the accessibility and readability of textual documents. However, there are few spellcheckers for Bantu languages such as isiZulu, the major language in South Africa. The objective of this research is to investigate the development of spellcheckers for isiZulu and, more generally, an approach that can be reused across Bantu languages. To fill this gap in an extensible way, we used data-driven statistical language models with trigrams and quadrigrams. The models were trained on three different isiZulu corpora: Ukwabelana, a selection of the isiZulu National Corpus, and a small corpus of news items. The system performed better with trigrams than with quadrigrams, and performance depended on the training and testing corpora. When the system was trained on old text (the Bible in isiZulu), it did not perform well when tested on the two corpora that contain more recent texts, such as the constitution and news items. The highest accuracy obtained was 89%. Given that data-driven statistical language models constitute a language-independent approach, we conclude that data-driven spellcheckers for all Bantu languages are indeed feasible. They are, however, sensitive to the training and testing data. This approach is less resource-intensive than the manual specification of rules, and spellcheckers for Bantu languages are therefore now practically within reach. The potential societal impact of spellchecker-supported tools and apps is incalculable.
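To make the idea of an n-gram spellchecker concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes character-level n-grams, a simple count threshold, and a rule that flags a word when any of its n-grams falls below that threshold; the abstract does not specify these details. The function names (char_ngrams, train, is_probably_correct, accuracy) and the toy word list are hypothetical and chosen only for illustration.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the character n-grams of a word (no padding)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def train(corpus_words, n=3):
    """Count every character n-gram seen in the training words."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_ngrams(word.lower(), n))
    return counts


def is_probably_correct(word, counts, n=3, threshold=1):
    """Accept a word only if each of its n-grams was seen at least
    `threshold` times in training (assumed decision rule).
    Words shorter than n have no n-grams and are accepted."""
    return all(counts[g] >= threshold for g in char_ngrams(word.lower(), n))


def accuracy(counts, labelled_words, n=3, threshold=1):
    """Fraction of (word, is_correct) pairs the checker classifies correctly."""
    hits = sum(
        is_probably_correct(word, counts, n, threshold) == is_correct
        for word, is_correct in labelled_words
    )
    return hits / len(labelled_words)


# Toy training data; a real model would be trained on a full corpus.
model = train(["umuntu", "abantu", "isiZulu"], n=3)
print(is_probably_correct("umuntu", model))  # every trigram was seen in training
print(is_probably_correct("umxntu", model))  # "umx" was never seen in training
```

Switching between trigrams and quadrigrams is a matter of changing n in both train and is_probably_correct. Because the checker only knows the n-grams present in its training words, a model trained on one kind of text will flag valid words whose character sequences appear only in another kind, which is one way the corpus sensitivity described above can arise.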

Langa Khumalo

Professor & Executive Director

