Paper

Peer-reviewed
June 2006

Table form document analysis based on the document structure grammar

INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION
  • Akira Amano
  • Naoki Asada
  • Masayuki Mukunoki
  • Masahito Aoyama

Volume
8
Issue
2-3
Start page
201
End page
213
Language
English
Publication type
Research paper (academic journal)
DOI
10.1007/s10032-005-0008-3
Publisher
SPRINGER HEIDELBERG

Given a set of low-quality line-delimited tabular documents of the same layout, we present a robust zoning algorithm that exploits both intra- and inter-document consensus to extract the structure of the table. The structure is captured in the form of a document template that can then be snapped to a new document to perform automated "cookie cutter" data extraction. We also report a companion consensus-based algorithm for the classification of zone content as either machine print, handwriting or empty. Using scanned Census records from 1841 to 1881, the template is recovered with an efficiency of 0.076 [0, 1). Using consensus over about 10 documents from each data set, this error was reduced to 0.0076, or by 90%, which amounts to two missing line segments and one false positive. Similarly, the error for coverage was reduced from 0.098 to 0.016, or by 83%. Use of consensus also resulted in machine print classification accuracy of 100% for two of the three data sets. The classification error for handwriting averaged 0.1225 per document. By exploiting consensus within and between documents, automated zoning and labeling are greatly improved, providing field-level indexing of document content.

Links
DOI
https://doi.org/10.1007/s10032-005-0008-3
Web of Science
https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000247734500010&DestApp=WOS_CPL
Identifiers
  • DOI : 10.1007/s10032-005-0008-3
  • ISSN : 1433-2833
  • eISSN : 1433-2825
  • Web of Science ID : WOS:000247734500010
