A glossary of corpus types

There are many types of corpus, depending on their use. Below is a list of some of the main types.

diachronic – a corpus which looks at changes across a timeframe.

learner – a corpus of L2 learner writing or speech.

monitor – a type of diachronic corpus which may continue to grow with new texts added over time.

monolingual – includes only one language.

multilingual – a corpus with two or more languages.

parallel – a corpus containing texts in a first language (L1) alongside their translations in a target language (L2).

reference – a corpus against which other corpora are compared, usually through statistical analysis.

synchronic – a corpus that has been constructed at a certain time (like a snapshot) to represent a language.

raw – a corpus with no annotation.

tagged – a corpus with annotation (for example, Parts-Of-Speech tags).

target – a corpus that is compared to a reference corpus.

In the reference corpus, we trust

People will always ask (and rightly so) how we can trust a corpus to be representative of the language we are studying. The answer is we can’t. But we can make sure it is as unbiased as possible by carefully setting criteria which ensure that it is at least reproducible and somewhat representative.

Take the British National Corpus (BNC), for example.

It was by and large built in the early 1990s. It is 100 million tokens (words) in size: 90 million of those tokens are written language and the remaining 10 million spoken language. The samples were taken from as wide a variety of sources as possible. In my opinion it is a representative sample for almost all the words we want to investigate. It would be impossible to say it is representative of all words. The words which are not well represented are small in number as well as low in frequency.

And perhaps because of their low frequency they readily become unrepresentative. A word which does not occur often (fewer than once per million tokens) will necessarily not appear across all genres. Also, small changes in its frequency will make it stand out in a way that would not happen with a higher-frequency word (where many more occurrences are needed to shift the figure). So these unrepresentative low-frequency words really do not affect the overall corpus as much as people sometimes think.
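To make the point about low-frequency words concrete, here is a minimal sketch of frequency normalised per million tokens. The function name and the example counts are made up for illustration; only the corpus size of 100 million tokens comes from the description above.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalised frequency: occurrences per one million tokens."""
    return count * 1_000_000 / corpus_size

BNC_SIZE = 100_000_000  # corpus size in tokens, as given above

# Hypothetical raw counts for a rare word and a common word.
rare = per_million(50, BNC_SIZE)        # below the 1-per-million line
common = per_million(50_000, BNC_SIZE)  # well above it

# Ten extra occurrences: a large relative jump for the rare word,
# a negligible one for the common word.
rare_change = 10 / 50
common_change = 10 / 50_000

print(rare, common)
print(rare_change, common_change)
```

The same ten occurrences move the rare word by a fifth of its frequency and the common word by a tiny fraction, which is why rare words look unstable without saying much about the corpus as a whole.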

Frequency is everything

In our minds we tend to think of things as universal or generic without relating them to the wider world. We say things like, “the sun rises from the east”, without seeing it in the context in which it occurs. We probably even have in our heads a perfect, literally unclouded image of a single sunrise that represents all sunrises.

But the sun rises from the east with a frequency and regularity that is often not taken into account when it should be. It rises once a day. Or, to be more precise, the earth, covered in a protective “lubricating” atmosphere, turns once a day to give the illusion of the sun rising. We are so easily duped, and we’re duped on a daily basis by all kinds of illusions.

The reliability of this event, like that of all other events, is what gives us our understanding and our rhythm. We often choose to have a rhythm in order to have a regularity that helps us through the day. So in this sense frequency is something important. It may be everything.

As I get older, things are no longer singular mental objects but repeated objects with a certain frequency. Understanding that frequency is what gives sense to the world. Otherwise there are only perfect mental objects, which is not a true picture at all.

Yes, frequency is everything.

“A Corpus Linguistics Glossary” is now complete

The page A Corpus Linguistics Glossary is now complete. Check it out. Please feel free to contact me if there are any terms which you think should be included on the list, or if there are any corrections needed. Cheers.

word list

A list of all the types in a corpus. Usually arranged by frequency with the highest frequency at the top.

Taken from a reference corpus, a word list can tell you which are the most common words in a language. Placed against a word list from a corpus of a different period (or one that is marked with usage information), it can tell you how language has changed.
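A word list of this kind can be sketched in a few lines of Python. This is only an illustration: the tokenisation rule (lowercase runs of letters, with an optional straight apostrophe) and the sample sentence are my own assumptions, and real tools let you define what counts as a token.

```python
import re
from collections import Counter

def word_list(text: str) -> list[tuple[str, int]]:
    """Return (type, frequency) pairs, highest frequency first."""
    # Assumed token definition: lowercase letters, optional internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common()

sample = "The cat sat on the mat. The mat was flat."
for word, freq in word_list(sample):
    print(word, freq)
```

Here "the" comes out on top with three occurrences, followed by "mat" with two.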


Case – lower and uppercase – serves to aid reading, and therefore meaning, in written texts. It changes nothing substantial in the pronunciation of a word. It is therefore a wholly written feature of language that is not apparent in spoken form.

Concordancing software often allows you to choose whether or not searches are case-sensitive. At times it may be desirable to distinguish between uppercase and lowercase in corpus linguistic analysis. One example is a text in which the token Will, the nickname for William, is abundant: without case sensitivity its occurrences are counted together with the modal auxiliary will, inflating the frequency of the latter.
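The effect of the case setting can be seen in a small sketch. The sentence is invented for the purpose and the punctuation stripping is deliberately crude; it is not how any particular concordancer tokenises.

```python
from collections import Counter

# Invented example text containing both the name and the modal.
text = "Will said he will go. Will Will come? He will."
tokens = [t.strip(".,?!") for t in text.split()]

case_sensitive = Counter(tokens)                       # "Will" and "will" kept apart
case_insensitive = Counter(t.lower() for t in tokens)  # both folded into "will"

print(case_sensitive["Will"], case_sensitive["will"])
print(case_insensitive["will"])
```

Case-sensitive counting keeps the two forms apart, while folding case merges them into one inflated figure. It is not a complete fix, though: in "Will Will come?" the first Will is the modal, capitalised only because it starts the sentence.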

Keywords List – AntConc

The keywords list in AntConc is, as the name suggests, a tool to create a list of keywords. To do this your target corpus is compared to a reference corpus. The target and reference corpora do not need to be of the same size. The comparison is then done statistically. The statistic AntConc uses for this task is either chi-squared or log-likelihood.
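For a feel of what the log-likelihood comparison involves, here is a sketch of the two-term log-likelihood commonly used for keyness in corpus linguistics. I am not claiming this is exactly how AntConc computes it internally; the function name, parameter names and the example counts are my own.

```python
import math

def log_likelihood(freq_target: int, size_target: int,
                   freq_ref: int, size_ref: int) -> float:
    """Two-term log-likelihood for one word in a target vs. a reference corpus."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    # Expected frequencies if the word were spread evenly over both corpora.
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical counts: 50 hits in a 1,000-token target corpus,
# 100 hits in a 10,000-token reference corpus.
score = log_likelihood(50, 1_000, 100, 10_000)
print(score > 3.84)  # 3.84 is the chi-squared critical value for p < 0.05, 1 d.f.
```

Because the expected frequencies are scaled by corpus size, the two corpora can differ in size, as noted above. A word with the same relative frequency in both corpora scores zero.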

In AntConc, load your corpus or corpora. Go to the Word List tab, then click Start.


Select the Tools Preference menu.


AntWordProfiler short review

I have been playing with Laurence Anthony’s AntWordProfiler for a bit now. It is a corpus linguistics tool to “profile” the coverage of texts in terms of comprehensibility, particularly for reading in a second language. To understand how it works, one must understand its predecessor, Range, by Paul Nation.

Paul Nation is a researcher in Vocabulary Acquisition, a subfield of Applied Linguistics. His interest was mainly in how much coverage of a text is needed before vocabulary can be acquired from reading without the aid of dictionaries, from textual context alone. To this end he created the Range program. The Range program has two main functions: 1) to show the distribution of words across multiple files or texts, and 2) to show how much of the text is covered by carefully designed wordlists based on frequency or knowledge. It does this by number crunching and presents the results as statistics.
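The two functions can be sketched roughly as follows. This is a toy illustration, not the Range program’s actual method: the function names, the tiny word list and the sample texts are hypothetical, and words are matched as exact forms rather than grouped into word families.

```python
import re

def tokenize(text: str) -> list[str]:
    # Assumed token definition: lowercase runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def coverage(text: str, known_words: set[str]) -> float:
    """Proportion of tokens in the text that appear in the word list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in known_words for t in tokens) / len(tokens)

def text_range(word: str, texts: list[str]) -> int:
    """Number of texts in which the word occurs at least once."""
    return sum(word in set(tokenize(t)) for t in texts)

base_list = {"the", "a", "cat", "sat", "on", "mat"}  # hypothetical word list
texts = ["The cat sat on the ornate mat", "A cat on a mat", "The dog barked"]

print(round(coverage(texts[0], base_list), 3))  # 6 of 7 tokens are on the list
print(text_range("cat", texts))                 # "cat" occurs in 2 of the 3 texts
```

In the first text only "ornate" falls outside the list, so a reader who knows the list would meet one unknown word in seven.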

AntWordProfiler essentially does the same thing but is an upgrade in terms of functionality. Instead of just statistics, we can now look at coverage in the text itself. And with a little tweaking of the word definition you can make AntWordProfiler mirror the Range program’s (and AntConc’s) results.

Again, Mr Anthony has come up with a slick and easy-to-use product. The controls are less intuitive than AntConc’s (but more so than those of the Range program), but it still does not take much to figure out the functions.

The selling points, for me, are:

  1. the ability to create results identical to those of other products, thus making research results compatible and comparable;
  2. transferability of its results in plain text to other platforms;
  3. speed (not as fast as AntConc, especially the old versions, but still fast); and
  4. the ability to process large volumes of text (Range crashes at about 250,000 tokens).

This is the tool to use if you need to profile texts or look at type occurrence over multiple files.

[Update] Mr Anthony has kindly pointed out two omissions to me – that AntWordProfiler is free (yes, free!) to use, and that it is available on different platforms (Windows, Mac OS X and Linux) which Range is not.

type-token ratio

The type-token ratio (or TTR) is used to compare two corpora in terms of lexical complexity. The formula is the number of types divided by the number of tokens. The closer to 1, the greater the complexity; the closer to 0, the greater the repetition of words. No specific ratio can be said to be ideal as such. Rather, one corpus can be said to be more or less complex than another, and only when they are of similar size, because TTR tends to fall as a corpus grows. Approximate size equivalence is therefore an important criterion in using TTR.
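A minimal sketch of the calculation, assuming a simple token definition of lowercase runs of letters (the function name and sample text are mine):

```python
import re

def type_token_ratio(text: str) -> float:
    """Number of distinct types divided by number of tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

short = "the cat sat on the mat"
longer = " ".join([short] * 3)  # same vocabulary, three times the tokens

print(round(type_token_ratio(short), 3))   # 5 types / 6 tokens
print(round(type_token_ratio(longer), 3))  # 5 types / 18 tokens
```

The longer text uses exactly the same words but scores much lower, which is the reason for comparing only corpora of similar size.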

Playing with the ix500

Bought a Fujitsu ix500 scanner last week. Wow! I don’t know how I had lived without this incredible machine for so long.

The scanning is so quick – 30 pages double-sided in a minute. By default it saves as PDF, but with one click the scan is converted into a Word document. Scanning a novel (for research purposes) used to take me the better part of a day. Now it is finished in just 10 minutes. You do have to use destructive scanning, but that is a small price to pay (if the book is a commercial paperback, of course) for having the scanning done in literally a fraction of the time.

I had planned to use it for corpus building only but the use of this machine goes well beyond that. Very useful for work, research and teaching.

More on this later.