Corpus Linguistics – A Short Introduction

What is Corpus Linguistics?
Corpus linguistics is the use of digitized text (a corpus) or texts, usually naturally occurring material, in the analysis of language (linguistics). Techniques include generating word frequency lists, concordance lines (keyword in context, or KWIC), and collocate, cluster and keyness lists. The plural of corpus is corpora.

What does one need to do corpus linguistics?
A personal computer (Windows, Mac, Linux, etc.) is usually enough for small corpora. With it one can use a concordance program, or concordancer, to analyse plain-text files (extension “.txt”).

What does one need to know to do corpus linguistics?
Knowing the language you want to study is, of course, important. You also need to know some of the basic ideas in corpus linguistics, such as word list, frequency, type, token and concordance. Since these are the most basic and important concepts, let us have a quick look at them.

The first thing you would want to do is make a word list. It is usually arranged from highest to lowest frequency of types. A type is a unique form of a word, counted here without regard to capitalization (so “To” and “to” are the same type). A word is defined as a run of letters separated by spaces or punctuation. Thus the sentence:

“To be or not to be; that is the question.”

has 8 types (to, be, or, not, that, is, the and question). The types “to” and “be” each have a frequency of 2 (that is, they occur twice in our example). If we count every word (do a word count, in layman’s terms), we have 10 tokens. Below is an example of a word list made by a concordance program (AntConc).

In theory, nothing prevents a corpus from containing just ten words, as in the sentence above. What we did by hand is what a corpus program does, except that it can process millions of tokens in a matter of seconds.
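
The hand count above can be reproduced with a short Python sketch. This is only an illustration of the idea, not how AntConc is implemented: the function names (tokenize, word_list) are made up for this example, and it assumes that a word is any run of letters and that capitalization is ignored.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase tokens: runs of letters
    separated by spaces or punctuation (assumed definition of a word)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def word_list(text):
    """Return (type, frequency) pairs, highest frequency first."""
    counts = Counter(tokenize(text))
    # Sort by descending frequency, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

sentence = "To be or not to be; that is the question."
tokens = tokenize(sentence)
types = word_list(sentence)

print("tokens:", len(tokens))  # 10
print("types:", len(types))    # 8
for word, freq in types:
    print(freq, word)
```

The list starts with “be” and “to” (frequency 2 each), followed by the six types that occur once.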

The frequency count of types that we did above is useful only to a certain extent. To see what lies behind a frequency we need to look at the type in context, that is, we need to make a concordance of the type in question. A concordance puts the word in the middle of each line and shows the surrounding text. Usually the concordance lines are arranged by a sorting criterion (for example, by the first word to the right of the main word, then by the second word to the right). This way we can quickly see patterns in the lines. When the type in question is placed in the middle to make concordance lines, it is called keyword in context, or KWIC. Here are example concordance lines for “Harry” in Harry Potter and the Philosopher’s Stone.
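
Separately from the AntConc example, the mechanics of a concordance can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AntConc's actual method: the function names (kwic, sort_by_right), the context width of 20 characters and the sample text are all invented for this example.

```python
import re

def kwic(text, keyword, width=20):
    """Return (left context, keyword, right context) for each hit."""
    text = " ".join(text.split())  # collapse line breaks and extra spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(), right))
    return lines

def sort_by_right(lines):
    """Sort by the first word to the right of the keyword, then the second."""
    def key(line):
        return re.findall(r"[^\W\d_]+", line[2].lower())[:2]
    return sorted(lines, key=key)

# Made-up sample text, used only to demonstrate the layout.
sample = "The cat sat on the mat. The dog saw the cat. A cat is not a dog."
lines = kwic(sample, "cat")
sorted_lines = sort_by_right(lines)

for left, kw, right in sorted_lines:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} [{kw}] {right}")
```

Sorting on the words to the right groups similar continuations together, which is what makes patterns easy to spot in a real concordance.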

Where can I get a concordance program?
The concordance program I recommend for beginners and veterans alike is AntConc by Laurence Anthony. It is free, fast and incredibly intuitive in design. With a little knowledge you can do almost anything with it. Alternatively, here is a list of other concordance programs available. Once you have a concordance program you will need to make a corpus, which is easier than you might think.

How to make a corpus?
To make a corpus really means to make a plain-text file. In Windows, open a text editor, in my case a program called Notepad (it can be found in All Programs > Accessories). Type in some text, then save it in a place where you can find it again. All you need to do now is open the file in AntConc and you are ready to have some fun.
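
For those who prefer a script to a text editor, a plain-text corpus file can also be written from Python. The sketch below is only one possible way to do it: the helper names (save_corpus, load_corpus) and the file name my_corpus.txt are hypothetical, and UTF-8 is assumed as the encoding.

```python
from pathlib import Path

def save_corpus(text, path):
    """Write text to a plain-text file (UTF-8 assumed) and return its path."""
    path = Path(path)
    path.write_text(text, encoding="utf-8")
    return path

def load_corpus(path):
    """Read the corpus file back as a single string."""
    return Path(path).read_text(encoding="utf-8")

# Hypothetical file name; the file is created in the current folder.
corpus_path = save_corpus("To be or not to be; that is the question.\n",
                          "my_corpus.txt")
print(corpus_path.name)
print(load_corpus(corpus_path))
```

The resulting “.txt” file can then be opened in a concordance program like any file saved from a text editor.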

The operating functions of AntConc should be self-evident. A couple of minutes of playing with it should be enough to get you going. But if you still need or want guidance, here is a guide I made to simple operations in AntConc.

Older guides are still available here:
A Simple Guide to Using AntConc (English)
Un Guide Simple Pour Utiliser AntConc (French, translated by Stefania Solofrizzo)

