March 6, 2003

Application of Word Clustering to the Word Sense Disambiguation Problem

Information Processing Society of Japan (IPSJ) SIG Technical Reports, Natural Language Processing (NL)

Minoru Sasaki (佐々木稔)
Hiroyuki Shinnou (新納浩幸)

Volume: 2003
Issue: 23
Start page: 145
End page: 152
Language: Japanese
Publisher: Information Processing Society of Japan (IPSJ)

This paper proposes a clustering method for large word sets, with the goal of automatically constructing a thesaurus to support document summarization. Existing work on word clustering applies a variety of similarity measures and clustering algorithms to a term-document matrix. In such a matrix, the distribution of an index term describes statistically the kinds of document content in which the term appears, so the semantic relation it captures between words in the same document is not very strong. As a result, the output clusters tend to gather words that share a topic. To build clusters whose members are semantically related, we extract pairs of words that stand in a co-occurrence relation, represent statistically which words are likely to be semantically connected to a given word, and cluster these representations. We further show that this clustering is effective for word sense disambiguation in comparison with clustering based on a term-document matrix.
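The contrast drawn in the abstract can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not the authors' method: the toy corpus, the use of (noun, verb) pairs as the co-occurrence relation, cosine similarity, the greedy threshold grouping, and the names `build_vectors`, `cosine`, and `cluster`. The abstract does not specify which similarity measure or clustering algorithm the paper uses.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy corpus: each document is a list of (noun, verb) pairs
# that co-occur in it. Documents 0 and 2 mention coffee and cars together,
# documents 1 and 3 mention tea and buses together.
docs = [
    [("coffee", "drink"), ("car", "drive")],
    [("tea", "drink"), ("bus", "ride")],
    [("coffee", "brew"), ("car", "park")],
    [("tea", "brew"), ("bus", "drive")],
]

def build_vectors(docs):
    """Return (term-document vectors, co-occurrence vectors) for the nouns."""
    term_doc = defaultdict(Counter)  # noun -> counts per document id
    cooc = defaultdict(Counter)      # noun -> counts per co-occurring verb
    for doc_id, pairs in enumerate(docs):
        for noun, verb in pairs:
            term_doc[noun][doc_id] += 1
            cooc[noun][verb] += 1
    return term_doc, cooc

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold):
    """Greedy grouping: a word joins the first cluster that already holds a
    sufficiently similar word, otherwise it starts a new cluster."""
    clusters = []
    for word in sorted(vectors):
        for members in clusters:
            if any(cosine(vectors[word], vectors[m]) >= threshold for m in members):
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters

term_doc, cooc = build_vectors(docs)
print("term-document clusters:", cluster(term_doc, 0.4))
print("co-occurrence clusters:", cluster(cooc, 0.4))
```

On this toy data the term-document vectors group words that appear in the same documents (coffee with car, tea with bus), while the co-occurrence vectors group words that are used with the same verbs (coffee with tea, car with bus). This mirrors the abstract's distinction between topical and semantic clusters, under the assumptions stated above.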

Links

CiNii Articles: http://ci.nii.ac.jp/naid/110002911604
CiNii Books: http://ci.nii.ac.jp/ncid/AN10115061
URL: http://id.ndl.go.jp/bib/6547678
URL: http://id.nii.ac.jp/1001/00048315/

Identifiers

ISSN : 0919-6072
CiNii Articles ID : 110002911604
CiNii Books ID : AN10115061
