2015
Utilization of Multiple Sequence Analyzers for Bibliographic Information Extraction
PATTERN RECOGNITION APPLICATIONS AND METHODS, ICPRAM 2014
- Volume : 9443
- First page : 222
- Last page : 236
- Language : English
- Publishing type : Research paper (international conference proceedings)
- DOI : 10.1007/978-3-319-25530-9_15
- Publisher : SPRINGER INT PUBLISHING AG
This paper discusses the problems of analyzing title page layouts and extracting bibliographic information from academic papers. Information extraction is an important function for digital libraries to offer, providing versatile and effective access paths to library content. Sequence analyzers, such as those based on a conditional random field, are often used to extract information from object pages. Recently, digital libraries have grown and can now handle a large number and wide variety of papers. Because of the variety of page layouts, it is necessary to prepare multiple analyzers, one for each type of layout, to achieve high extraction accuracy. This makes rule management important. For example, at what stage should we invest in a new analyzer, and how can we acquire it efficiently, when receiving papers with a new layout? This paper focuses on the detection of layout changes and how we learn to use a new sequence analyzer efficiently. We evaluate the confidence metrics for sequence analyzers to judge whether they would be suited to title page analysis by testing three academic journals. The results show that they are effective for measuring suitability. We also examine the sampling of training data when learning how to use a new analyzer.
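The idea of using a confidence metric to judge whether an existing analyzer suits a title page can be illustrated with a minimal sketch. The abstract does not state which metrics the paper evaluates, so everything here is an assumption for illustration: the metric (the mean of the per-token probabilities that an analyzer assigns to its own predicted labels), the function names `page_confidence` and `select_analyzer`, and the threshold value of 0.8.

```python
from statistics import mean


def page_confidence(token_marginals):
    """Average per-token probability of the predicted labels for one page.

    Assumed metric for illustration only; the paper's actual metrics
    are not given in the abstract.
    """
    return mean(token_marginals)


def select_analyzer(marginals_by_analyzer, threshold=0.8):
    """Pick the most confident analyzer for a page.

    marginals_by_analyzer maps a (hypothetical) analyzer name to the list
    of per-token probabilities that analyzer produced for the page.
    Returns (name, score); name is None when even the best score falls
    below the threshold, which is read here as a possible layout change.
    """
    scored = ((name, page_confidence(m)) for name, m in marginals_by_analyzer.items())
    best_name, best_score = max(scored, key=lambda pair: pair[1])
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```

Under these assumptions, a page on which no existing analyzer is confident would be set aside as a candidate for a new layout, and pages collected this way could serve as training samples for a new analyzer.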
- Link information
- ID information
- DOI : 10.1007/978-3-319-25530-9_15
- ISSN : 0302-9743
- Web of Science ID : WOS:000374104100015