Utilization of Multiple Sequence Analyzers for Bibliographic Information Extraction

PATTERN RECOGNITION APPLICATIONS AND METHODS, ICPRAM 2014

Atsuhiro Takasu
Manabu Ohta

Volume: 9443
First page: 222
Last page: 236
Language: English
Publishing type: Research paper (international conference proceedings)
DOI: 10.1007/978-3-319-25530-9_15
Publisher: SPRINGER INT PUBLISHING AG

This paper discusses the problems of analyzing title page layouts and extracting bibliographic information from academic papers. Information extraction is an important function for digital libraries to offer, providing versatile and effective access paths to library content. Sequence analyzers, such as those based on a conditional random field, are often used to extract information from object pages. Recently, digital libraries have grown and can now handle a large number and wide variety of papers. Because of the variety of page layouts, it is necessary to prepare multiple analyzers, one for each type of layout, to achieve high extraction accuracy. This makes rule management important. For example, at what stage should we invest in a new analyzer, and how can we acquire it efficiently, when receiving papers with a new layout? This paper focuses on the detection of layout changes and how we learn to use a new sequence analyzer efficiently. We evaluate the confidence metrics for sequence analyzers to judge whether they would be suited to title page analysis by testing three academic journals. The results show that they are effective for measuring suitability. We also examine the sampling of training data when learning how to use a new analyzer.

Link information

DOI: https://doi.org/10.1007/978-3-319-25530-9_15
Web of Science: https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000374104100015&DestApp=WOS_CPL

ID information

DOI: 10.1007/978-3-319-25530-9_15
ISSN: 0302-9743
Web of Science ID: WOS:000374104100015
