Tag Archives: corpus glossary

These are entries in my Glossary of Corpus Linguistic Terms.

word list

A list of all the types in a corpus, usually arranged by frequency with the most frequent at the top.

Drawn from a reference corpus, a word list can tell you which words are the most common in a language. Placed against another corpus from a different period (or one that is marked with usage information), it can tell you how language has changed.
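As a rough illustration, a frequency-ordered word list can be built in a few lines of Python. This is only a sketch: the function name `word_list`, the simple regular-expression tokenizer and the sample sentence are my own assumptions, not the behaviour of any particular concordancer.

```python
from collections import Counter
import re

def word_list(text):
    """Return (type, frequency) pairs, most frequent first."""
    # Assumption: lowercase everything and treat runs of letters
    # and apostrophes as words. Real tools tokenize more carefully.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()

sample = "the cat sat on the mat and the dog sat too"
top = word_list(sample)
print(top[:2])  # [('the', 3), ('sat', 2)]
```

Comparing two such lists, one per corpus, is the starting point for seeing which words have become more or less frequent.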


case sensitivity

Case (lowercase and uppercase) helps reading, and therefore meaning, in written texts. It does not substantially change the pronunciation of a word. It is therefore a wholly written feature of language that is not apparent in spoken form.

Concordancing software often allows you to choose whether a search is case-sensitive or not. At times it is desirable to distinguish uppercase from lowercase in corpus linguistic analysis. One example is a text that is abundant with the token Will, the nickname for William. In a case-insensitive count these tokens are merged with the modal auxiliary will, and the inflated frequency may be mistaken for that of the modal alone.
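The Will/will problem can be seen in a small Python sketch. The function `count_tokens`, its `case_sensitive` flag and the sample text are invented for illustration; they are not options of any real concordancing program.

```python
from collections import Counter
import re

def count_tokens(text, case_sensitive=True):
    """Count word forms, optionally folding everything to lowercase."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)

# Made-up sample: "Will" is a name, "will" is the modal auxiliary.
sample = "Will said he will go. Then Will left, and Will will return."

sensitive = count_tokens(sample, case_sensitive=True)
insensitive = count_tokens(sample, case_sensitive=False)

print(sensitive["Will"], sensitive["will"])  # 3 2
print(insensitive["will"])                   # 5
```

Case is only a rough guide, though: a sentence-initial modal ("Will he go?") would still be counted together with the name.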

type-token ratio

The type-token ratio (or TTR) is used to compare two corpora in terms of lexical complexity. The formula is the number of types divided by the number of tokens. The closer the ratio is to 1, the greater the complexity; the closer to 0, the greater the repetition of words. There is no specific ratio that can be called ideal. Rather, one corpus can be said to be more or less complex than another, and only when the two are of similar size, because TTR tends to fall as a corpus grows. Approximate size equivalence is therefore an important criterion in using TTR.
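The calculation itself is simple enough to sketch in Python. The function name `type_token_ratio`, the tokenizer and the two six-token samples are assumptions made for the example.

```python
import re

def type_token_ratio(text):
    """Number of distinct word forms divided by number of running words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on an empty text
    return len(set(tokens)) / len(tokens)

varied = "the cat sat on the mat"        # 6 tokens, 5 types
repetitive = "the the the the the the"   # 6 tokens, 1 type

ttr_varied = type_token_ratio(varied)          # 5 / 6
ttr_repetitive = type_token_ratio(repetitive)  # 1 / 6
print(round(ttr_varied, 2), round(ttr_repetitive, 2))
```

Both samples are the same length on purpose, in keeping with the point that TTR comparisons only make sense between texts of similar size.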


collocation

How words (collocates) relate to a particular word (keyword or node). In corpus linguistics, this usually means occurring within a certain distance of the node, for example ±5 words to either side. The collocates found are then collated and counted for quick comprehension.

Words often come together with greater-than-chance regularity. This can either be within the same phrase, clause, sentence or even between sentences, that is, over sentence boundaries.
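A window-based count of collocates might look like the following Python sketch. The function `collocates`, its `span` parameter and the sample text are hypothetical. The window deliberately ignores sentence boundaries, as described above.

```python
from collections import Counter
import re

def collocates(text, node, span=5):
    """Count words occurring within `span` tokens either side of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Made-up sample text, not taken from any real corpus.
sample = ("Domestic violence is rising. TV violence worries parents. "
          "Victims of domestic violence need support.")

near = collocates(sample, "violence", span=2)
print(near["domestic"])  # 2
```

This only counts co-occurrences. Deciding whether a pairing is more frequent than chance needs a further statistical measure, which the sketch does not attempt.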


KWIC

Short for Key Word In Context. It is a way of viewing a search term (type) in a concordance program with the keyword centred, so that the patterns created by the surrounding words, its context, become visible.

Below is an example of a concordance search of the term ‘violence’ in a corpus.

[Image: KWIC concordance lines for ‘violence’]

The words ‘domestic’, ‘TV’ and ‘of’ seem to stand out and warrant further investigation. This is apparent even before the surrounding text has been sorted.
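For readers who want to see how such a display is put together, here is a minimal Python sketch of a KWIC view. The functions `kwic` and `format_kwic`, the `width` parameter and the sample text are all assumptions for illustration; real concordancers offer far more (sorting, character-based windows and so on).

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def format_kwic(lines):
    """Right-align the left context so the keyword falls in one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in lines]

# Made-up sample text, not the corpus behind the example above.
sample = ("Domestic violence is rising. TV violence worries parents. "
          "Victims of domestic violence need support.")

hits = kwic(sample, "violence", width=2)
display = format_kwic(hits)
for line in display:
    print(line)
```

Sorting `hits` by the left or right context before formatting would group recurring neighbours together, which is what makes patterns easier to spot.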

type

The unique forms of the tokens (words) in a corpus, often accompanied by frequency data.

Meaning is treated as secondary. Corpus linguistic analysis does not directly reveal the various meanings of a word; these must be inferred from its usage. In corpus linguistics this is usually done through concordancing, collocations, clusters, etc.


token

The individual forms (words) of a corpus. The total number of tokens is the size of the corpus. The term contrasts with type in order to distinguish how we are observing the form: as one instance in the corpus (token), or as the combined instances that make up its frequency within the corpus (type).
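The distinction between the two ways of counting can be made concrete with a short Python sketch. The function `corpus_stats` and the sample sentence are invented for the example.

```python
from collections import Counter
import re

def corpus_stats(text):
    """Count running words (tokens) and distinct forms (types)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    frequencies = Counter(tokens)
    return {
        "tokens": len(tokens),       # size of the corpus
        "types": len(frequencies),   # number of distinct forms
        "frequencies": frequencies,  # tokens per type
    }

stats = corpus_stats("the cat sat on the mat")
print(stats["tokens"], stats["types"])  # 6 5
print(stats["frequencies"]["the"])      # 2
```

Here "the" is one type realised by two tokens, and the frequencies of all the types add up to the corpus size.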