Papers

Peer-reviewed
Jun 15, 2007

Accuracy of Document Classification with Dirichlet Mixtures

  • MASADA TOMONARI
  • TAKASU ATSUHIRO
  • ADACHI JUN

Volume
48
Number
SIG11(TOD34)
First page
14
Last page
26
Language
Japanese
Publishing type
Research paper (scientific journal)
Publisher
Information Processing Society of Japan (IPSJ)

The naive Bayes classifier is a well-known method for document classification. However, it achieves satisfactory classification accuracy only after its smoothing parameter has been tuned appropriately, and appropriate parameter values must be found separately for each document set. In this paper, we focus on Dirichlet mixtures, an effective probabilistic framework for document classification that requires no parameter tuning and provides satisfactory classification accuracy across various document sets. Many studies in image processing and natural language processing use Dirichlet mixtures. In natural language processing in particular, many experiments have been conducted on real document data sets. However, most of these studies use perplexity as the evaluation measure. Whereas perplexity is a purely theoretical measure, accuracy is the measure commonly used for document classification in information retrieval and text mining. Accuracy is computed by comparing the correct labels with the predictions made by the classifier. In this paper, we conduct an evaluation experiment on the 20 newsgroups data set and on Korean Web newspaper articles, with a view to using Dirichlet mixtures in multilingual applications. In the experiment, we compare the naive Bayes classifier with a classifier based on Dirichlet mixtures and clarify their qualitative and quantitative differences.
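
As an illustration of the baseline and the evaluation measure mentioned in the abstract, the following is a minimal sketch of a multinomial naive Bayes classifier with an additive smoothing parameter, together with the accuracy computation. It is not the paper's Dirichlet mixture classifier, and it is not taken from the paper: the function names (`train_naive_bayes`, `predict`, `accuracy`), the parameter name `alpha`, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model (hypothetical helper).

    docs   : list of token lists
    labels : list of class labels, one per document
    alpha  : additive smoothing parameter; must be > 0 so that unseen
             words do not produce log(0)
    """
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        # Smoothed log-probability of each vocabulary word given the class.
        log_lik = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
        model[label] = (log_prior, log_lik)
    return model


def predict(model, doc):
    """Return the class with the highest posterior log-score for doc."""
    def score(label):
        log_prior, log_lik = model[label]
        # Words outside the training vocabulary are ignored.
        return log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(model, key=score)


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the correct labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy data invented for illustration only.
train_docs = [
    ["ball", "game", "team"],
    ["game", "score", "team"],
    ["vote", "election", "party"],
    ["party", "policy", "vote"],
]
train_labels = ["sports", "sports", "politics", "politics"]
test_docs = [["team", "score"], ["election", "policy"]]
test_labels = ["sports", "politics"]

model = train_naive_bayes(train_docs, train_labels, alpha=1.0)
predictions = [predict(model, doc) for doc in test_docs]
print(predictions, accuracy(test_labels, predictions))
```

In this sketch, `alpha` plays the role of the smoothing parameter that, according to the abstract, has to be tuned separately for each document set; rerunning the evaluation with several values of `alpha` on held-out data is one simple way to observe that sensitivity.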

Link information
CiNii Articles
http://ci.nii.ac.jp/naid/110006317681
CiNii Books
http://ci.nii.ac.jp/ncid/AA11464847
URL
http://id.ndl.go.jp/bib/8862207
URL
http://hdl.handle.net/10069/16317
URL
http://id.nii.ac.jp/1001/00017426/
ID information
  • ISSN : 1882-7799
  • CiNii Articles ID : 110006317681
  • CiNii Books ID : AA11464847
