Table form document analysis based on the document structure grammar

INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION

Akira Amano
Naoki Asada
Masayuki Mukunoki
Masahito Aoyama

Volume: 8
Issue: 2-3
First page: 201
Last page: 213
Language: English
Publication type: Research paper (scientific journal)
DOI: 10.1007/s10032-005-0008-3
Publisher: SPRINGER HEIDELBERG

Given a set of low-quality line-delimited tabular documents of the same layout, we present a robust zoning algorithm which exploits both intra- and inter-document consensus to extract the structure of the table. The structure is captured in the form of a document template that can then be snapped to a new document to perform automated "cookie cutter" data extraction. We also report a companion consensus-based algorithm for the classification of zone content as either machine print, handwriting, or empty. Using scanned Census records from 1841 to 1881, the template is recovered with an efficiency of 0.076 [0, 1). Using consensus over about 10 documents from each data set, this error was reduced to 0.0076, or by 90%, which amounts to two missing line segments and one false positive. Similarly, the error for coverage was reduced from 0.098 to 0.016, or by 83%. Use of consensus also resulted in machine print classification accuracy of 100% for two of the three data sets. The classification error for handwriting averaged 0.1225 per document. By exploiting consensus within and between documents, automated zoning and labeling are greatly improved, providing field-level indexing of document content.
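
The percentage reductions quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and is not taken from the paper: the function name `relative_reduction` is a hypothetical helper, and it assumes each percentage is the drop in error relative to the starting value, using the before and after figures stated above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its starting value.

    Hypothetical helper for checking the abstract's figures; not from the paper.
    """
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures as quoted in the abstract.
template_reduction = relative_reduction(0.076, 0.0076)  # template error
coverage_reduction = relative_reduction(0.098, 0.016)   # coverage error

print(f"template error reduced by {template_reduction:.1%}")
print(f"coverage error reduced by {coverage_reduction:.1%}")
```

Under this assumption the template figure comes out at 90%, matching the abstract, and the coverage figure comes out slightly above the quoted 83%, which suggests the abstract truncated rather than rounded.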

Links

DOI: https://doi.org/10.1007/s10032-005-0008-3
Web of Science: https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000247734500010&DestApp=WOS_CPL

Identifiers

DOI : 10.1007/s10032-005-0008-3
ISSN : 1433-2833
eISSN : 1433-2825
Web of Science ID : WOS:000247734500010
