June 2006
Table form document analysis based on the document structure grammar
INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION
- Volume
- 8
- Issue
- 2-3
- First page
- 201
- Last page
- 213
- Language
- English
- Publication type
- Research paper (academic journal)
- DOI
- 10.1007/s10032-005-0008-3
- Publisher
- SPRINGER HEIDELBERG
Given a set of low-quality line-delimited tabular documents of the same layout, we present a robust zoning algorithm which exploits both intra- and inter-document consensus to extract the structure of the table. The structure is captured in the form of a document template, which can then be snapped to a new document to perform automated "cookie cutter" data extraction. We also report a companion consensus-based algorithm for the classification of zone content as either machine print, handwriting or empty. Using scanned Census records from 1841 to 1881, the template is recovered with an efficiency of 0.076 [0, 1). Using consensus over about 10 documents from each data set, this error was reduced to 0.0076, or by 90%, which amounts to two missing line segments and one false positive. Similarly, the error for coverage was reduced from 0.098 to 0.016, or by 83%. Use of consensus also resulted in machine print classification accuracy of 100% for two of the three data sets. The classification error for handwriting averaged 0.1225 per document. By exploiting consensus within and between documents, automated zoning and labeling are greatly improved, providing field-level indexing of document content.
- Link information
- ID information
- DOI : 10.1007/s10032-005-0008-3
- ISSN : 1433-2833
- eISSN : 1433-2825
- Web of Science ID : WOS:000247734500010