Research Article

Parsing of Research Documents into XML Using Formal Grammars

Table 1

Literature of information extraction from various types of documents.

| S/N | Document type | Technique | Approach | Authors |
|---|---|---|---|---|
| 1 | Invoices | (i) Bidirectional LSTM deep neural network and trained data extracted end-to-end from invoice | Machine-based | [2] |
| | | (ii) Named entity recognition using BERT (bidirectional encoder representations from transformers) | Machine-based | [54] |
| | | (iii) Optical character recognition and graph convolution network from invoice images | Machine-based | [53] |
| 2 | Financial reports | (i) Detection of key performance indicators (KPI) from a report using the density of alpha-numeric characters in a rule-based fashion | Rule-based | [16] |
| 3 | Medical clinical notes | Parse meaningful critical values from clinical notes and perform a semantic lookup | Rule-based | [21, 55] |
| 4 | Legal documents: (i) Court record docs (CRDs) | (i) Bidirectional LSTM for training and extracting information | Machine-based | [17] |
| | (ii) Compliance documents | (ii) Context-free grammar for complex rule interpretation | Rule-based | [56] |
| 5 | Software requirements documents | Syntactic and semantic analysis approach to align with standard writing best practices | Rule-based | [15] |
| 6 | CVs | Rule-based text extraction from CV | Rule-based | [49, 57] |
| 7 | Academia: literature research | Optical character recognition and graph convolution network from invoice images | Machine-based | [19, 20] |