Research Blog


2019/09/11

Making a Scientific Research Article Word List

by masayakanzaki
Scientific Research Article Word List (SRAWL)
http://bit.ly/SRAWL

Note on the "List" column: the numbers refer to the base word lists in Nation's BNC/COCA word lists (e.g. 1 = the first 1000 words, 2 = the second 1000 words, 3 = the third 1000 words). The other codes are:
31 = proper nouns
32 = marginal words
33 = transparent compounds
34 = abbreviations
NL = not in the lists

Purpose
To create a word list that helps non-English-speaking scientists learn the words that are frequently used in scientific research articles.

Background

Coxhead, A. (2000). A New Academic Word List. TESOL Quarterly, 34(2), 213-238. doi:10.2307/3587951

 

Hyland, K., & Tse, P. (2007). Is There an "Academic Vocabulary"? TESOL Quarterly, 41(2), 235-253. Retrieved from http://www.jstor.org/stable/40264352

 

Reference

Nation, I. S. P. (2016). Making and using word lists for language learning and testing. Amsterdam: John Benjamins Publishing Company.

https://benjamins.com/catalog/z.208

 

Method

First, the titles and abstracts of 12,968 research articles and reports published in Science between 2000 and 2016 were collected to create a mini corpus of 1.7 million words.

 

Science

https://www.sciencemag.org/

 

Second, lexical coverage and word frequency were investigated using vocabulary analysis software.

 

Range

https://www.victoria.ac.nz/lals/about/staff/paul-nation#vocab-programs

The Range program compares a text against vocabulary lists to show which words in the text are and are not in the lists, and what percentage of the words in the text the lists cover.
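
The kind of comparison Range performs can be illustrated with a minimal Python sketch. This is not the Range program itself: the function name `coverage_by_list`, the tokenizer, and the toy word lists are all assumptions made for illustration, and the sketch matches exact word forms, whereas the BNC/COCA lists group words into families.

```python
import re
from collections import Counter

def coverage_by_list(text, word_lists):
    """Count how many running words of `text` fall in each word list.

    `word_lists` maps a list label (e.g. "1", "2") to a set of word forms.
    Words found in no list are counted under "NL" (not in the lists).
    Returns (counts per label, percentage of running words per label).
    """
    # Very simple tokenizer (an assumption): lowercase alphabetic strings.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

    # Map each word form to the first list that contains it.
    lookup = {}
    for label, words in word_lists.items():
        for word in words:
            lookup.setdefault(word.lower(), label)

    counts = Counter(lookup.get(token, "NL") for token in tokens)
    total = len(tokens)
    percentages = {label: 100 * n / total for label, n in counts.items()}
    return counts, percentages

# Toy lists invented for this example; they are not the real BNC/COCA lists.
toy_lists = {"1": {"the", "of", "in", "is", "a"}, "2": {"cell", "protein"}}
counts, pct = coverage_by_list("The protein is in the cell nucleus", toy_lists)
for label in sorted(counts):
    print(label, counts[label], round(pct[label], 2))
```

With real data, `word_lists` would be loaded from the BNC/COCA list files and `text` would be the corpus of titles and abstracts.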

Coverage
https://drive.google.com/open?id=1q4bJbSiso5TjiOtcJkInctcfAKmCASpK

AntConc

https://www.laurenceanthony.net/software/antconc/

 

Third, a list of high-frequency words in the corpus was compiled and then rearranged to make it learner-friendly.

 

95% coverage

The corpus has 1,673,791 running words.

7,850 lemmas appear 13 times or more.

Those 7,850 lemmas account for 95.22% of the total running words.
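
The coverage figure for a frequency cut-off can be computed as in the following sketch. The function name `coverage_at_threshold` is hypothetical, and the toy frequencies are invented for illustration; they are not taken from the SRA corpus.

```python
from collections import Counter

def coverage_at_threshold(freqs, min_freq):
    """Return (number of items kept, percentage of running words covered)
    when only items occurring at least `min_freq` times are kept."""
    total = sum(freqs.values())
    kept = {item: n for item, n in freqs.items() if n >= min_freq}
    covered = sum(kept.values())
    return len(kept), 100 * covered / total

# Invented lemma frequencies that add up to 100 running words.
toy_freqs = Counter({"cell": 50, "protein": 30, "gene": 13,
                     "quark": 5, "zeolite": 2})
n_items, pct_covered = coverage_at_threshold(toy_freqs, 13)
print(n_items, pct_covered)
```

Applied to the real lemma frequency list with a cut-off of 13, the same calculation would give the number of lemmas kept and their share of the running words.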

 

Removing unimportant items

Of these lemmas, 952 are proper nouns, marginal words, transparent compounds or abbreviations.

They were removed, leaving 6,898 words in the list.

 

Dividing the list

The 6,898 words were divided into four groups: BASIC 1965, ESSENTIAL 1305, CORE 2389, and ADVANCED 1239.

 

BASIC 1965

The first and second 1000 words of the BNC/COCA lists

 

ESSENTIAL 1305

Words from the third to 25th 1000 words of the BNC/COCA lists, and “not in the lists” words, that appear 80 times or more in the SRA corpus

 

CORE 2389

Words from the third to 25th 1000 words of the BNC/COCA lists, and “not in the lists” words, that appear between 20 and 79 times in the SRA corpus

 

ADVANCED 1239

Words from the third to 25th 1000 words of the BNC/COCA lists, and “not in the lists” words, that appear between 13 and 19 times in the SRA corpus
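
The grouping rules above can be written as a small function. This is only a sketch of the rules as described in this post: the function name `assign_band` and the way list codes are represented (integers for the BNC/COCA lists, the string "NL" for words not in the lists) are assumptions.

```python
REMOVED_LISTS = {31, 32, 33, 34}  # proper nouns, marginal words,
                                  # transparent compounds, abbreviations
MIN_FREQ = 13                     # cut-off used for the 95% coverage list

def assign_band(list_code, freq):
    """Return the group for a word, or None if the word is not in the list.

    list_code: 1-25 for the BNC/COCA base lists, 31-34 for the removed
               categories, or "NL" for words not in the lists.
    freq:      frequency of the word in the corpus.
    """
    if list_code in REMOVED_LISTS or freq < MIN_FREQ:
        return None
    if list_code in (1, 2):
        return "BASIC"
    # Remaining words: lists 3-25 and "NL", split by corpus frequency.
    if freq >= 80:
        return "ESSENTIAL"
    if freq >= 20:
        return "CORE"
    return "ADVANCED"

# The group sizes reported above add up to the size of the final list.
group_sizes = {"BASIC": 1965, "ESSENTIAL": 1305, "CORE": 2389, "ADVANCED": 1239}
print(sum(group_sizes.values()))
print(assign_band("NL", 80), assign_band(5, 79), assign_band(12, 13))
```

The boundaries are inclusive on both sides, so a frequency of 80 falls in ESSENTIAL, 20 and 79 in CORE, and 13 and 19 in ADVANCED.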

Presentation slides (ppt)
http://bit.ly/MK20190925

 

 

 



