Papers

Peer-reviewed
2012

Clustering Documents with Maximal Substrings

ENTERPRISE INFORMATION SYSTEMS, ICEIS 2011
  • Tomonari Masada
  • Atsuhiro Takasu
  • Yuichiro Shibata
  • Kiyoshi Oguri

Volume
102
Number
First page
19
Last page
34
Language
English
Publishing type
Research paper (international conference proceedings)
DOI
10.1007/978-3-642-29958-2_2
Publisher
SPRINGER-VERLAG BERLIN

This paper provides experimental results showing that maximal substrings can serve as the elementary building blocks of documents in place of the words extracted by a state-of-the-art supervised word extraction method. A maximal substring is a substring whose number of occurrences decreases whenever even a single character is appended to its head or tail. The main feature of maximal substrings is that they can be extracted efficiently in an unsupervised manner. We extract maximal substrings from a document set and represent each document as a bag of maximal substrings. We also obtain a bag-of-words representation by applying a state-of-the-art supervised word extraction method to the same document set. We then apply the same document clustering method to both representations and compare the quality of the two clustering results. To avoid overfitting, we adopt a Bayesian document clustering method based on Dirichlet compound multinomials. Our experiment shows that the clustering quality achieved with maximal substrings is acceptable for using them in place of the words extracted by supervised word extraction.
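
The definition of a maximal substring in the abstract can be illustrated with a small brute-force sketch. This is not the efficient extraction method the abstract refers to; it simply counts every substring up to a length limit and keeps those whose count drops under any one-character extension. The function names and the `max_len` and `min_count` parameters are assumptions introduced for illustration and do not come from the paper.

```python
from collections import Counter


def substring_counts(docs, max_len):
    """Count occurrences (overlaps included) of every substring up to max_len."""
    counts = Counter()
    for doc in docs:
        n = len(doc)
        for i in range(n):
            for j in range(i + 1, min(n, i + max_len) + 1):
                counts[doc[i:j]] += 1
    return counts


def maximal_substrings(docs, max_len=8, min_count=2):
    """Return substrings whose count strictly drops under any 1-char extension.

    max_len and min_count are illustrative assumptions, not paper settings.
    """
    # Count one character past max_len so extensions of the longest
    # candidates are also visible.
    counts = substring_counts(docs, max_len + 1)

    # For each substring, record the largest count among its
    # one-character extensions (prepended or appended).
    best_ext = Counter()
    for s, c in counts.items():
        if len(s) >= 2:
            for shorter in (s[1:], s[:-1]):
                if c > best_ext[shorter]:
                    best_ext[shorter] = c

    return {
        s
        for s, c in counts.items()
        if len(s) <= max_len and c >= min_count and best_ext[s] < c
    }


def bag_of_maximal_substrings(doc, vocab, max_len=8):
    """Represent one document as a bag (multiset) of maximal substrings."""
    bag = Counter()
    n = len(doc)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            piece = doc[i:j]
            if piece in vocab:
                bag[piece] += 1
    return bag


docs = ["abcabc", "xabcx"]
vocab = maximal_substrings(docs)
bags = [bag_of_maximal_substrings(d, vocab) for d in docs]
```

In this toy collection, "ab" is not maximal because extending it to "abc" leaves the count unchanged, while "abc" is maximal because every extension occurs less often. The resulting bags can then be fed to any clustering method that accepts count vectors.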

Link information
DOI
https://doi.org/10.1007/978-3-642-29958-2_2
DBLP
https://dblp.uni-trier.de/rec/conf/iceis/MasadaTSO11
Web of Science
https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000345339600002&DestApp=WOS_CPL
URL
http://dblp.uni-trier.de/db/conf/iceis/iceis2011.html#conf/iceis/MasadaTSO11
ID information
  • DOI : 10.1007/978-3-642-29958-2_2
  • ISSN : 1865-1348
  • DBLP ID : conf/iceis/MasadaTSO11
  • Web of Science ID : WOS:000345339600002
