A glossary of corpus types

There are many types of corpus depending on their use. Below is a list of some of the main types.


diachronic – a corpus designed to show changes in language across a timeframe.

learner – a corpus of L2 learner writing or speech.

monitor – a type of diachronic corpus which may continue to grow with new texts added over time.

monolingual – includes only one language.

multilingual – a corpus with two or more languages.

parallel – a corpus containing the same texts in two languages, such as a first language (L1) and a target language (L2), usually as originals and their translations.

reference – a corpus against which other corpora are compared, usually through statistical data analysis.

synchronic – a corpus that has been constructed at a certain time (like a snapshot) to represent a language.

raw – a corpus with no annotation.

tagged – a corpus with annotation (for example, Parts-Of-Speech tags).

target – a corpus that is compared to a reference corpus.

“A Corpus Linguistics Glossary” is now complete

The page A Corpus Linguistics Glossary is now complete. Check it out. Please feel free to contact me if there are any terms which you think should be included on the list, or if there are any corrections needed. Cheers.

word list

A list of all the types in a corpus. Usually arranged by frequency with the highest frequency at the top.

A word list from a reference corpus can tell you which words are the most common within a language. Compared with a word list from a corpus of a different period (or one that is marked with usage information), it can tell you how language has changed.
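As a rough illustration, a frequency-ordered word list can be sketched in a few lines of Python. The function name word_list, the sample sentence and the very simple tokenisation rule (lowercase runs of letters and apostrophes) are assumptions made for this example only; real corpus tools tokenise far more carefully.

```python
from collections import Counter
import re


def word_list(text):
    """Return (type, frequency) pairs, highest frequency first."""
    # Assumption: a token is a lowercase run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # most_common() sorts the types by descending frequency.
    return Counter(tokens).most_common()


# Toy text, invented for illustration.
sample = "the cat sat on the mat and the dog sat too"

for word, freq in word_list(sample)[:3]:
    print(word, freq)
```

The same function could be run over two corpora and the resulting lists set side by side to see which words have risen or fallen in rank.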

type-token ratio

The type-token ratio (or TTR) is used to compare two corpora in terms of lexical complexity. The formula is the number of types divided by the number of tokens. The closer the ratio is to 1, the greater the complexity; the closer to 0, the greater the repetition of words. No specific ratio can be said to be ideal as such. Rather, one corpus can be said to be more or less complex than another, and only when the two are of similar size. Approximate size equivalence is therefore an important criterion in using TTR.
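The calculation can be sketched as follows. This is a minimal example, assuming a simple regular-expression tokeniser; the function name type_token_ratio and both sample texts are invented for illustration. The two samples are deliberately the same length (nine tokens each), in line with the point about size equivalence.

```python
import re


def type_token_ratio(text):
    """Number of types divided by number of tokens (0.0 for empty text)."""
    # Assumption: a token is a lowercase run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    types = set(tokens)  # each unique form counts once
    return len(types) / len(tokens)


# Two toy texts of equal size (9 tokens each).
varied = "the quick brown fox jumps over a lazy dog"
repetitive = "the dog saw the dog and the dog ran"

print(round(type_token_ratio(varied), 3))      # 9 types / 9 tokens
print(round(type_token_ratio(repetitive), 3))  # 5 types / 9 tokens
```

With these samples the first text scores 1.0 and the second about 0.556, so the second shows more repetition.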

collocation

How words (collocates) relate to a particular word (keyword or node). In corpus linguistics, this usually means within a certain distance from the node, for example ±5 words to either side. The collocates found in that window are then collated and counted for quick comprehension.

Words often come together with greater-than-chance regularity. This can be within the same phrase, clause or sentence, or even across sentence boundaries.
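The window-and-count step can be sketched like this. The function name collocates, its span parameter and the toy sentence are assumptions for the example. Note that it only produces raw counts; deciding whether a pairing is greater than chance would need a statistical measure on top, which is not shown here.

```python
from collections import Counter


def collocates(tokens, node, span=5):
    """Count the words found within +/- span tokens of each node occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]    # up to span words before
            right = tokens[i + 1:i + 1 + span]   # up to span words after
            counts.update(left + right)
    return counts


# Toy sentence, invented for illustration; a narrow span keeps it readable.
tokens = "domestic violence and tv violence are both forms of violence".split()
counts = collocates(tokens, "violence", span=2)

for word, freq in counts.most_common(3):
    print(word, freq)
```

Because the window ignores punctuation and sentence breaks, this sketch will also pick up collocates across sentence boundaries.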

KWIC

Short for Key Word In Context. It is a way of displaying a search term (type) in a concordance program with the keyword centred, so that the patterns created by the surrounding words (its context) can be seen.

Below is an example of a concordance search of the term ‘violence’ in a corpus.

[Image: KWIC concordance of ‘violence’]

The words ‘domestic’, ‘TV’ and ‘of’ seem to stand out and warrant further investigation. This is even before the surrounding text has been sorted.
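For readers without a concordance program, a bare-bones KWIC display can be sketched in Python. The function names kwic and format_kwic, the width parameter and the sample sentence are all assumptions for this example; it is not the output of the search described above.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left context so the keyword lines up in one column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left.rjust(pad)}  {kw}  {right}" for left, kw, right in lines]


# Toy sentence, invented for illustration.
tokens = "reports of domestic violence rose while tv violence fell".split()

for line in format_kwic(kwic(tokens, "violence", width=2)):
    print(line)
```

Sorting the hits by the word immediately to the left or right of the keyword, as concordance programs allow, would make recurring patterns easier to spot.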

Other Corpus Linguistic Terms:

type

The unique form of the tokens (words) in a corpus. Often accompanied by frequency data.

Meaning is treated as secondary. Corpus linguistic analysis does not directly reveal the various meanings of a word. This must be inferred from its usage. In corpus linguistics this is usually done by concordancing, collocations, clusters, etc.

token

The individual forms (words) of a corpus. The sum of the tokens is the size of the corpus. The term contrasts with type in order to distinguish how we are observing the form, whether as one instance in the corpus (token), or as combined instances relating to its frequency within a corpus (type).