Chinese Word Segmentation and Unknown Word Extraction by Mining Maximized Substring

Journal of Natural Language Processing (自然言語処理)

Mo Shen
Daisuke Kawahara
Sadao Kurohashi

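Volume: 23
Issue: 3
First page: 235
Last page: 266
Language: English
Publication type: Research paper (academic journal)
DOI: 10.5715/jnlp.23.235
Publisher: The Association for Natural Language Processing (一般社団法人言語処理学会)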

Chinese word segmentation is an important first step in Chinese language processing. Recent advances in machine learning techniques have boosted the performance of Chinese word segmentation systems, yet the identification of out-of-vocabulary words remains a major problem in this field. Recent research has attempted to address this problem by exploiting characteristics of frequent substrings in unlabeled data. We propose a simple yet effective approach for extracting a specific type of frequent substring, called maximized substrings, which provide good estimates of unknown word boundaries. In the task of Chinese word segmentation, we use maximized substrings extracted from large-scale unlabeled data to improve segmentation accuracy. The effectiveness of this approach is demonstrated through experiments on data sets from different domains. In the task of unknown word extraction, we apply post-processing techniques that effectively reduce the noise in the extracted substrings. We demonstrate the effectiveness and efficiency of our approach by comparing its results with those of a widely applied Chinese word recognition method from a previous study.

Links

DOI: https://doi.org/10.5715/jnlp.23.235
CiNii Articles: http://ci.nii.ac.jp/naid/130005411025
CiNii Books: http://ci.nii.ac.jp/ncid/AN10472659
URL: http://id.ndl.go.jp/bib/027492546

Identifiers

DOI: 10.5715/jnlp.23.235
ISSN: 1340-7619
CiNii Articles ID: 130005411025
CiNii Books ID: AN10472659
