Review Article
A Comparative Study of Some Automatic Arabic Text Diacritization Systems
Table 2
Some statistics about the corpus splits used in this study.
| Statistic | Train set | Validation set | Test set |
| --- | --- | --- | --- |
| Characters | 16082164 | 784570 | 820022 |
| Tokens | 2460405 | 120075 | 125220 |
| Numbers | 33260 | 1648 | 1637 |
| Digits | 75963 | 3794 | 3774 |
| Arabic words | 2103156 | 102479 | 107291 |
| Arabic letters | 8356030 | 407434 | 426469 |
| Diacritics | 7290312 | 355666 | 371726 |
| Undiacritized forms | 105720 | 19515 | 20520 |
| Diacritized forms | 163237 | 26129 | 27298 |