Accuracy of Document Classification with Dirichlet Mixtures

MASADA TOMONARI
TAKASU ATSUHIRO
ADACHI JUN

Volume: 48
Number: SIG11(TOD34)
First page: 14
Last page: 26
Language: Japanese
Publishing type: Research paper (scientific journal)
Publisher: Information Processing Society of Japan (IPSJ)

The naive Bayes classifier is a well-known method for document classification. However, it achieves satisfactory classification accuracy only after its smoothing parameter has been tuned appropriately, and appropriate parameter values must be found separately for each document set. In this paper, we focus on Dirichlet mixtures, an effective probabilistic framework for document classification that requires no parameter tuning and provides satisfactory classification accuracy across various document sets. Many studies in image processing and natural language processing use Dirichlet mixtures. In natural language processing in particular, many experiments have been conducted on real document data sets. However, most of these studies use perplexity as the evaluation measure. While perplexity is a purely theoretical measure, accuracy is the measure commonly used for document classification in information retrieval and text mining. Accuracy is computed by comparing the correct labels with the predictions made by the classifier. In this paper, we conduct an evaluation experiment on the 20 newsgroups data set and on Korean Web newspaper articles, with a view to using Dirichlet mixtures in multilingual applications. In the experiment, we compare the naive Bayes classifier with the classifier based on Dirichlet mixtures and clarify their qualitative and quantitative differences.
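
As a rough illustration of the baseline and the evaluation measure named in the abstract, the following Python sketch fits a multinomial naive Bayes classifier with an additive smoothing parameter and computes accuracy by comparing correct labels with predictions. It is not the authors' implementation: the function names, the parameter name `alpha`, and the toy documents are all assumptions made for this example, and it does not cover the Dirichlet mixture classifier.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha):
    """Fit multinomial naive Bayes with additive smoothing (alpha must be > 0)."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocab)
        log_prior = math.log(class_docs[c] / len(docs))
        # Smoothed per-class word log probabilities over the training vocabulary.
        log_lik = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
        model[c] = (log_prior, log_lik)
    return model


def predict(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    def score(c):
        log_prior, log_lik = model[c]
        return log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(model, key=score)


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label equals the correct label."""
    correct = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Hypothetical toy data, invented for this sketch only.
train_docs = [["ball", "game", "team"], ["game", "score"],
              ["vote", "party"], ["party", "election", "vote"]]
train_labels = ["sports", "sports", "politics", "politics"]
test_docs = [["team", "score"], ["election", "party"]]
test_labels = ["sports", "politics"]

model = train_naive_bayes(train_docs, train_labels, alpha=1.0)
acc = accuracy(model, test_docs, test_labels)
print(acc)
```

Here `alpha` plays the role of the smoothing parameter that, according to the abstract, has to be tuned separately for each document set; rerunning the sketch with different `alpha` values on held-out data is one simple way to see that sensitivity.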

Link information

CiNii Articles: http://ci.nii.ac.jp/naid/110006317681
CiNii Books: http://ci.nii.ac.jp/ncid/AA11464847
URL: http://id.ndl.go.jp/bib/8862207
URL: http://hdl.handle.net/10069/16317
URL: http://id.nii.ac.jp/1001/00017426/

ID information

ISSN: 1882-7799
CiNii Articles ID: 110006317681
CiNii Books ID: AA11464847
