Review Article

A Comparative Study of Some Automatic Arabic Text Diacritization Systems

Table 2

Some statistics about the corpus splits used in this study.

| | Train set | Validation set | Test set |
|---|---|---|---|
| Characters | 16082164 | 784570 | 820022 |
| Tokens | 2460405 | 120075 | 125220 |
| Numbers | 33260 | 1648 | 1637 |
| Digits | 75963 | 3794 | 3774 |
| Arabic words | 2103156 | 102479 | 107291 |
| Arabic letters | 8356030 | 407434 | 426469 |
| Diacritics | 7290312 | 355666 | 371726 |
| Undiacritized forms | 105720 | 19515 | 20520 |
| Diacritized forms | 163237 | 26129 | 27298 |
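
Counts of the kind reported in Table 2 could be computed along the following lines. This is only a minimal sketch under stated assumptions, not the procedure used in the study: the function names (`corpus_stats`, `strip_diacritics`), whitespace tokenization, the Unicode ranges chosen for letters (U+0621 to U+064A, excluding tatweel) and diacritics (U+064B to U+0652), and the definition of a "number" as an all-digit token are all illustrative choices that the source does not specify.

```python
# Illustrative sketch only; definitions below are assumptions, not the study's.

def is_diacritic(ch: str) -> bool:
    # Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
    return "\u064B" <= ch <= "\u0652"

def is_arabic_letter(ch: str) -> bool:
    # Assumed letter range: hamza (U+0621) through yeh (U+064A),
    # excluding tatweel (U+0640), which is an elongation mark.
    return "\u0621" <= ch <= "\u064A" and ch != "\u0640"

def strip_diacritics(word: str) -> str:
    # Remove every character in the assumed diacritic range.
    return "".join(c for c in word if not is_diacritic(c))

def corpus_stats(text: str) -> dict:
    tokens = text.split()  # assumed: whitespace tokenization
    # Assumed: an "Arabic word" is a token with at least one Arabic letter.
    arabic_words = [t for t in tokens if any(is_arabic_letter(c) for c in t)]
    return {
        "characters": len(text),
        "tokens": len(tokens),
        "numbers": sum(1 for t in tokens if t.isdigit()),
        "digits": sum(1 for c in text if c.isdigit()),
        "arabic_words": len(arabic_words),
        "arabic_letters": sum(1 for c in text if is_arabic_letter(c)),
        "diacritics": sum(1 for c in text if is_diacritic(c)),
        # Distinct word types without and with their diacritics.
        "undiacritized_forms": len({strip_diacritics(w) for w in arabic_words}),
        "diacritized_forms": len(set(arabic_words)),
    }

if __name__ == "__main__":
    # Two diacritized variants of the same three-letter skeleton, plus a number.
    sample = "\u0643\u064E\u062A\u064E\u0628\u064E \u0643\u064F\u062A\u0650\u0628\u064E 12"
    for name, value in corpus_stats(sample).items():
        print(f"{name}: {value}")
```

Under these assumptions, one undiacritized form can map to several diacritized forms, which is consistent with the diacritized counts in Table 2 exceeding the undiacritized counts in every split.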