Jun 15, 2007
Accuracy of Document Classification with Dirichlet Mixtures
- Volume : 48
- Number : SIG11(TOD34)
- First page : 14
- Last page : 26
- Language : Japanese
- Publishing type : Research paper (scientific journal)
- Publisher : Information Processing Society of Japan (IPSJ)
The naive Bayes classifier is a well-known method for document classification. However, it achieves satisfactory classification accuracy only after its smoothing parameter has been tuned appropriately, and appropriate parameter values must be found separately for each document set. In this paper, we focus on Dirichlet mixtures, an effective probabilistic framework for document classification that requires no parameter tuning and provides satisfactory classification accuracy across various document sets. Many studies in image processing and natural language processing use Dirichlet mixtures. In natural language processing in particular, many experiments have been conducted on real document data sets. However, most of these studies use perplexity as the evaluation measure. While perplexity is a purely theoretical measure, accuracy is the popular measure for document classification in information retrieval and text mining. Accuracy is computed by comparing the correct labels with the predictions made by the classifier. In this paper, we conduct an evaluation experiment using the 20 newsgroups data set and Korean Web newspaper articles, with the intention of using Dirichlet mixtures for multilingual applications. In the experiment, we compare the naive Bayes classifier with the classifier based on Dirichlet mixtures and clarify their qualitative and quantitative differences.
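To make the baseline and the evaluation measure in the abstract concrete, the following is a minimal sketch of a multinomial naive Bayes classifier with a smoothing parameter, together with the accuracy computation. It is not the paper's implementation: the additive form of smoothing, the function names (`train_naive_bayes`, `predict`, `accuracy`), the parameter name `alpha`, and the toy documents are all assumptions made for illustration. The Dirichlet mixture classifier itself is not sketched here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model.

    alpha is the smoothing parameter (assumed here to be additive
    smoothing); it must be positive so that no log(0) occurs.
    """
    vocab = {word for doc in docs for word in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in docs_per_class.items():
        total_words = sum(word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha) / denom)
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, doc):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        # Words never seen in training are skipped in this sketch.
        return log_prior + sum(
            log_likelihood[word] for word in doc if word in log_likelihood
        )
    return max(model, key=score)


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the correct labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical toy data, not taken from the paper's experiments.
train_docs = [
    ["ball", "game", "team"],
    ["game", "score", "team"],
    ["vote", "election", "party"],
    ["party", "vote", "law"],
]
train_labels = ["sports", "sports", "politics", "politics"]
test_docs = [["team", "score"], ["election", "law"], ["game", "vote", "vote"]]
test_labels = ["sports", "politics", "politics"]

model = train_naive_bayes(train_docs, train_labels, alpha=1.0)
predictions = [predict(model, doc) for doc in test_docs]
test_accuracy = accuracy(test_labels, predictions)
```

Rerunning `train_naive_bayes` with different values of `alpha` on a held-out set is one simple way to carry out the per-data-set tuning that the abstract describes as a drawback of the naive Bayes classifier.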
- Link information
- ID information
- ISSN : 1882-7799
- CiNii Articles ID : 110006317681
- CiNii Books ID : AA11464847