Mar, 1997

Retrieval Methods for Text with Missrecognized Characters

Research bulletin of the National Center for Science Information System

Ohta Manabu
Takasu Atsuhiro
Adachi Jun

Volume: 9
Number: 9
First page: 161
Last page: 172
Language: Japanese
Publisher: National Institute of Informatics

This paper presents three probabilistic text retrieval methods for full-text search of Japanese documents that contain OCR errors. Because every query term is searched on the assumption that the recognized text contains errors, the methods tolerate such errors, and no manual post-editing is required after OCR. Confusion matrices store, for each character, the characters it is likely to be misrecognized as, together with the probability of each substitution. A 2-gram matrix stores character-connection probabilities derived from 2-gram statistics, i.e., how likely one character is to follow another. For an input query term, multiple search terms are generated by consulting the confusion matrices, and a full-text search is run for each of them. The validity of each retrieved term is then computed from the error-occurrence and character-connection probabilities, and terms whose validity exceeds a given threshold are judged to satisfy the query. Retrieval effectiveness is evaluated experimentally in terms of recall and precision; the results show a marked improvement over exact matching.
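The pipeline outlined in the abstract (expand the query with a confusion matrix, search for each variant, score the hits, apply a threshold) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the three methods combine the two probabilities, so the plain product used for the validity score is an assumption, as are the toy matrices, the Latin-alphabet sample text, the `floor` value, and every function name.

```python
from itertools import product
from math import prod

# Hypothetical confusion matrix: CONFUSION[c][d] = P(OCR outputs d | true char is c).
CONFUSION = {
    "c": {"c": 0.90, "e": 0.10},
    "a": {"a": 0.95, "o": 0.05},
    "t": {"t": 0.90, "l": 0.10},
}

# Hypothetical 2-gram matrix: BIGRAM[(x, y)] = P(y follows x) in clean text.
BIGRAM = {("c", "a"): 0.30, ("a", "t"): 0.20}


def expand_query(term, confusion):
    """Yield (variant, error_probability) for each OCR variant of `term`."""
    # Characters missing from the matrix are assumed to be recognized correctly.
    options = [list(confusion.get(ch, {ch: 1.0}).items()) for ch in term]
    for combo in product(*options):
        variant = "".join(ch for ch, _ in combo)
        yield variant, prod(p for _, p in combo)


def connection_probability(term, bigram, floor=1e-3):
    """Product of 2-gram probabilities over adjacent character pairs."""
    return prod(bigram.get(pair, floor) for pair in zip(term, term[1:]))


def search(text, term, confusion, bigram, threshold):
    """Return sorted (position, variant, validity) hits above `threshold`."""
    connection = connection_probability(term, bigram)
    hits = []
    for variant, error_prob in expand_query(term, confusion):
        # Assumption: validity is the product of the two probabilities.
        validity = error_prob * connection
        if validity <= threshold:
            continue
        start = text.find(variant)
        while start != -1:
            hits.append((start, variant, validity))
            start = text.find(variant, start + 1)
    return sorted(hits)


def precision_recall(retrieved, relevant):
    """Precision and recall for sets of retrieved and relevant positions."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


text = "the eat sat on the cal mat near a cat"
hits = search(text, "cat", CONFUSION, BIGRAM, threshold=0.004)
```

With these toy values, exact matching finds only the final "cat", whereas the expanded search also retrieves "eat" and "cal"; the variant "cot" and the multi-error variants fall below the threshold. If hypothetical ground truth marks positions 4 and 34 as genuine occurrences of "cat", `precision_recall({h[0] for h in hits}, {4, 34})` gives a precision of 2/3 and a recall of 1.0, which shows the trade-off that the threshold controls.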

Link information

CiNii Articles: http://ci.nii.ac.jp/naid/110000466528
CiNii Books: http://ci.nii.ac.jp/ncid/AN10015340
URL: http://id.ndl.go.jp/bib/4207495

ID information

ISSN : 0913-5022
CiNii Articles ID : 110000466528
CiNii Books ID : AN10015340
