Paper

Peer-reviewed
June 2016

A comparative study of dictionaries and corpora as methods for language resource addition

LANGUAGE RESOURCES AND EVALUATION
  • Shinsuke Mori
  • Graham Neubig

Volume
50
Issue
2
First page
245
Last page
261
Language
English
Publication type
Research paper (scientific journal)
DOI
10.1007/s10579-016-9354-7
Publisher
SPRINGER

In this paper, we investigate the relative effect of two strategies for language resource addition for Japanese morphological analysis, a joint task of word segmentation and part-of-speech tagging. The first strategy is adding entries to the dictionary and the second is adding annotated sentences to the training corpus. The experimental results showed that the addition of annotated sentences to the training corpus is better than the addition of entries to the dictionary. In particular, adding annotated sentences is efficient when we add new words with contexts of several real occurrences as partially annotated sentences, i.e. sentences in which only some words are annotated with word boundary information. Based on this finding, we performed real annotation experiments on invention disclosure texts and observed word segmentation accuracy. Finally, we investigated various language resource addition cases and introduced the notions of non-maleficence, asymmetricity, and additivity of language resources for a task. In the word segmentation (WS) case, we found that language resource addition is non-maleficent (adding new resources causes no harm in other domains) and sometimes additive (adding new resources helps other domains). We conclude that it is reasonable for us, NLP tool providers, to distribute only one general-domain model trained from all the language resources we have.

Links
DOI
https://doi.org/10.1007/s10579-016-9354-7
Web of Science
https://gateway.webofknowledge.com/gateway/Gateway.cgi?GWVersion=2&SrcAuth=JSTA_CEL&SrcApp=J_Gate_JST&DestLinkType=FullRecord&KeyUT=WOS:000377898300004&DestApp=WOS_CPL
Identifiers
  • DOI : 10.1007/s10579-016-9354-7
  • ISSN : 1574-020X
  • eISSN : 1574-0218
  • Web of Science ID : WOS:000377898300004
