June 2016
A comparative study of dictionaries and corpora as methods for language resource addition
LANGUAGE RESOURCES AND EVALUATION
- Volume: 50
- Issue: 2
- First page: 245
- Last page: 261
- Language: English
- Publication type: Research paper (academic journal)
- DOI: 10.1007/s10579-016-9354-7
- Publisher: SPRINGER
In this paper, we investigate the relative effect of two strategies for language resource addition in Japanese morphological analysis, a joint task of word segmentation and part-of-speech tagging. The first strategy is adding entries to the dictionary; the second is adding annotated sentences to the training corpus. The experimental results showed that adding annotated sentences to the training corpus is more effective than adding entries to the dictionary. In particular, adding annotated sentences is especially efficient when new words are added together with the contexts of several real occurrences, as partially annotated sentences, i.e., sentences in which only some words are annotated with word boundary information. Based on this finding, we performed real annotation experiments on invention disclosure texts and observed the resulting word segmentation accuracy. Finally, we investigated various language resource addition cases and introduced the notions of non-maleficence, asymmetricity, and additivity of language resources for a task. In the word segmentation (WS) case, we found that language resource addition is non-maleficent (adding new resources causes no harm in other domains) and sometimes additive (adding new resources helps other domains). We conclude that it is reasonable for us, as NLP tool providers, to distribute only one general-domain model trained on all the language resources we have.
- Link information
- ID information
- DOI: 10.1007/s10579-016-9354-7
- ISSN: 1574-020X
- eISSN: 1574-0218
- Web of Science ID: WOS:000377898300004