Papers

Peer-reviewed
2015

Utilization of Multiple Sequence Analyzers for Bibliographic Information Extraction

PATTERN RECOGNITION APPLICATIONS AND METHODS, ICPRAM 2014
  • Atsuhiro Takasu
  • Manabu Ohta

Volume
9443
Number
First page
222
Last page
236
Language
English
Publishing type
Research paper (international conference proceedings)
DOI
10.1007/978-3-319-25530-9_15
Publisher
SPRINGER INT PUBLISHING AG

This paper discusses the problems of analyzing title page layouts and extracting bibliographic information from academic papers. Information extraction is an important function for digital libraries to offer, providing versatile and effective access paths to library content. Sequence analyzers, such as those based on a conditional random field, are often used to extract information from object pages. Recently, digital libraries have grown and can now handle a large number and wide variety of papers. Because of the variety of page layouts, it is necessary to prepare multiple analyzers, one for each type of layout, to achieve high extraction accuracy. This makes rule management important. For example, at what stage should we invest in a new analyzer, and how can we acquire it efficiently, when receiving papers with a new layout? This paper focuses on the detection of layout changes and how we learn to use a new sequence analyzer efficiently. We evaluate the confidence metrics for sequence analyzers to judge whether they would be suited to title page analysis by testing three academic journals. The results show that they are effective for measuring suitability. We also examine the sampling of training data when learning how to use a new analyzer.
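The abstract does not say which confidence metrics the authors evaluate, so the following is only an illustrative sketch of the general idea: score each available analyzer on a title page, use the best one if its confidence is high enough, and otherwise treat the page as a possible new layout. It assumes each analyzer returns per-token label probabilities (marginals) and uses the mean of the highest marginal per token as the confidence. That metric, the function names, the analyzer names, and the threshold are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

# One probability distribution over labels per token (e.g. per line) of a title page.
Marginals = List[Dict[str, float]]


def mean_max_marginal(marginals: Marginals) -> float:
    """Average, over tokens, of the probability of the most likely label.

    This is an assumed confidence measure, not necessarily one used in the paper.
    """
    if not marginals:
        return 0.0
    return sum(max(dist.values()) for dist in marginals) / len(marginals)


def pick_analyzer(
    scores: Dict[str, float], threshold: float
) -> Tuple[Optional[str], float]:
    """Return the best-scoring analyzer, or None if even the best is below threshold.

    A None result is read here as a hint that the page may have a new layout.
    """
    best = max(scores, key=lambda name: scores[name])
    if scores[best] < threshold:
        return None, scores[best]
    return best, scores[best]


# Hypothetical marginals produced by two analyzers for the same two-token page.
page_by_analyzer = {
    "journal_a": [{"title": 0.9, "author": 0.1}, {"title": 0.2, "author": 0.8}],
    "journal_b": [{"title": 0.6, "author": 0.4}, {"title": 0.5, "author": 0.5}],
}
scores = {name: mean_max_marginal(m) for name, m in page_by_analyzer.items()}
choice, score = pick_analyzer(scores, threshold=0.7)
print(choice, round(score, 2))
```

With these made-up numbers the first analyzer is selected. Raising the threshold above its score would return None instead, which in this sketch stands for the decision to consider training a new analyzer.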

Link information
DOI
https://doi.org/10.1007/978-3-319-25530-9_15
Web of Science
https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000374104100015&DestApp=WOS_CPL
ID information
  • DOI : 10.1007/978-3-319-25530-9_15
  • ISSN : 0302-9743
  • Web of Science ID : WOS:000374104100015
