In the reference corpus, we trust

People will always ask (and rightly so) how we can trust a corpus to be representative of the language we are studying. The answer is that we can’t. But we can make it as unbiased as possible by carefully setting selection criteria, which at least ensure that it is reproducible and somewhat representative.

Take the British National Corpus (BNC), for example.

It was compiled in the early 1990s, mostly from texts dating from the 1980s and early 1990s. It is 100 million tokens (words) in size: 90 million of those tokens are written language and the remaining 10 million are spoken language. The samples were taken from as wide a variety of sources as possible. In my opinion it is a representative sample for almost all the words we want to investigate, although it would be impossible to say it is representative of every word. The words that are not well represented are small in number as well as low in frequency.

And perhaps it is because of their low frequency that they so readily become unrepresentative. A word that does not occur often (less than once per million tokens) will almost inevitably be missing from some genres. Also, small changes in its frequency will make it stand out as different from higher-frequency words, which need many more occurrences to shift their frequency by the same proportion. So these unrepresentative low-frequency words really do not affect the overall corpus as much as people sometimes think.
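
To put rough numbers on this, here is a minimal Python sketch. The corpus size is the BNC figure given above; the two word counts and the 25 extra occurrences are made up purely for illustration, and the helper names `per_million` and `relative_change` are my own, not part of any corpus toolkit.

```python
# Corpus size in tokens, as stated above for the BNC.
CORPUS_SIZE = 100_000_000


def per_million(count, corpus_size=CORPUS_SIZE):
    """Normalised frequency: occurrences per one million tokens."""
    return count * 1_000_000 / corpus_size


def relative_change(count, extra):
    """Proportional change in frequency if `extra` occurrences are added."""
    return extra / count


# Hypothetical raw counts, chosen only to illustrate the argument.
rare_count = 50         # a low-frequency word
common_count = 50_000   # a higher-frequency word
extra = 25              # the same handful of extra occurrences for both

print(per_million(rare_count))      # occurrences per million, rare word
print(per_million(common_count))    # occurrences per million, common word

# The same 25 extra occurrences move the rare word far more
# than they move the common word.
print(f"{relative_change(rare_count, extra):.2%}")
print(f"{relative_change(common_count, extra):.2%}")
```

In a 100-million-token corpus, "less than once per million" means fewer than 100 occurrences in total, so a word like this simply does not have enough instances to be spread evenly over every genre, and a few extra hits in one text are enough to shift its frequency noticeably.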