Papers

Peer-reviewed
2011

Semi-supervised Bibliographic Element Segmentation with Latent Permutations

DIGITAL LIBRARIES: FOR CULTURAL HERITAGE, KNOWLEDGE DISSEMINATION, AND FUTURE CREATION
  • Tomonari Masada
  • Atsuhiro Takasu
  • Yuichiro Shibata
  • Kiyoshi Oguri

Volume
7008
Number
First page
60
Last page
+
Language
English
Publishing type
Research paper (international conference proceedings)
DOI
10.1007/978-3-642-24826-9_11
Publisher
SPRINGER-VERLAG BERLIN

This paper proposes a semi-supervised method for bibliographic element segmentation. Our input data is a large-scale set of bibliographic references, each given as an unsegmented sequence of word tokens. Our problem is to segment each reference into bibliographic elements, e.g., authors, title, journal, and pages. We solve this problem with an LDA-like topic model by assigning each word token to a topic so that the word tokens assigned to the same topic refer to the same bibliographic element. Topic assignments should satisfy a contiguity constraint, i.e., the constraint that the word tokens assigned to the same topic should be contiguous. To this end, we proposed a topic model in our preceding work [8] based on the topic model devised by Chen et al. [3]. That model extends LDA and realizes unsupervised topic assignments satisfying the contiguity constraint. The main contribution of this paper is a semi-supervised learning method for that model. We assume that at most one third of the word tokens are already labeled. In addition, we assume that a few percent of the labels may be incorrect. The experiment showed that our semi-supervised learning improved on the unsupervised learning by a large margin and achieved a segmentation accuracy of over 90%.
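
The contiguity constraint described in the abstract can be illustrated with a small sketch. This is not the authors' model or learning procedure; it only checks whether a given sequence of per-token topic labels satisfies the constraint and groups the tokens into segments. The function names `is_contiguous` and `segment`, the label names, and the sample tokens are all hypothetical, chosen for illustration.

```python
from itertools import groupby


def is_contiguous(topics):
    """Return True if every topic label occupies one contiguous run.

    Assumption: `topics` is a list with one label per word token.
    """
    # Collapse consecutive duplicates; a label that appears in two
    # separate runs shows up twice in `runs`.
    runs = [label for label, _ in groupby(topics)]
    return len(runs) == len(set(runs))


def segment(tokens, topics):
    """Group tokens into (label, tokens) segments, one per run of equal labels."""
    if len(tokens) != len(topics):
        raise ValueError("tokens and topics must have the same length")
    pairs = zip(tokens, topics)
    return [
        (label, [tok for tok, _ in group])
        for label, group in groupby(pairs, key=lambda p: p[1])
    ]


# Hypothetical reference tokens and two candidate labelings.
tokens = ["T.", "Masada", "Semi-supervised", "Segmentation", "2011"]
good = ["author", "author", "title", "title", "year"]
bad = ["author", "title", "author", "title", "year"]

print(is_contiguous(good))
print(is_contiguous(bad))
print(segment(tokens, good))
```

In the `bad` labeling the label `author` appears in two separate runs, so it violates the constraint; a model that enforces contiguity would never produce such an assignment.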

Link information
DOI
https://doi.org/10.1007/978-3-642-24826-9_11
DBLP
https://dblp.uni-trier.de/rec/conf/icadl/MasadaTSO11
Web of Science
https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000306259000011&DestApp=WOS_CPL
URL
http://dblp.uni-trier.de/db/conf/icadl/icadl2011.html#conf/icadl/MasadaTSO11
ID information
  • DOI : 10.1007/978-3-642-24826-9_11
  • ISSN : 0302-9743
  • DBLP ID : conf/icadl/MasadaTSO11
  • Web of Science ID : WOS:000306259000011
