A Corpus Linguistics Glossary

cluster – a sequence of consecutive tokens that includes a particular node.

collocation – the tendency of words to co-occur: ‘You shall know a word by the company it keeps’ (Firth 1957). Collocates are usually shown in tabular form to show which are salient for a particular node.
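As a rough illustration, collocates can be gathered by counting the tokens that fall within a fixed span either side of the node. This is only a minimal sketch: the function name `collocates`, the `span` parameter and the sample sentence are made up for the example, and it reports raw co-occurrence counts rather than a salience statistic of the kind corpus tools usually apply.

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count tokens occurring within `span` positions of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Tokens to the left and right of the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

tokens = "she drinks strong tea and he drinks strong tea too".split()
tea_collocates = collocates(tokens, "tea", span=2)

# Show the collocates in tabular form, most frequent first.
for word, count in tea_collocates.most_common():
    print(f"{word:<10}{count}")
```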

concordance – a display of the lines in which a node occurs, with the node centred in order to show patterns with the surrounding tokens. See also KWIC.
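A concordance display can be sketched in a few lines of Python. The names `kwic` and `format_kwic`, the `width` parameter and the sample text are assumptions made for this illustration and do not reflect how any particular concordancer works internally.

```python
def kwic(tokens, node, width=3):
    """Return one (left context, node, right context) tuple per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def format_kwic(lines):
    """Right-align the left context so every node starts in the same column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {node}  {right}" for left, node, right in lines]

text = "the cat sat on the mat and the dog sat on the rug"
concordance_lines = kwic(text.split(), "sat", width=2)
for line in format_kwic(concordance_lines):
    print(line)
```

Aligning the node in a single column is what makes recurring patterns to its left and right easy to scan.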

concordance plot – a visual representation of the distribution of a type.

corpus – a “body” of electronic text(s) used for analysis in corpus linguistics. The plural of corpus is corpora.

frequency – the number of times a type occurs in a corpus.

keyword – a type which is salient within a corpus when compared statistically to another corpus.

KWIC – short for “KeyWord In Context”. The keyword is the search term for a concordance; the words which surround it within the concordance view can be thought of as its context.

lemma – the headword to which the related forms of a word can be considered to belong. The lemma play groups the word forms play, plays, played and playing. The plural of lemma is lemmata (or lemmas).

n-gram – a contiguous sequence of a predetermined number (n) of tokens; n-gram tools produce a complete list of all such sequences within a corpus. A sequence of two tokens is called a 2-gram, three tokens a 3-gram, and so forth.
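Extracting n-grams amounts to sliding a window of length n over the tokens. The sketch below assumes simple whitespace tokenisation, and the helper name `ngrams` is chosen for the example only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()

# Count how often each 2-gram occurs in the token sequence.
bigram_counts = Counter(ngrams(tokens, 2))
for gram, count in bigram_counts.most_common():
    print(" ".join(gram), count)
```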

node – the central type or sequence of types which is the focus of analysis in corpus linguistics.

part-of-speech tag or POS tag – the morpho-grammatical label given to a token to mark the role it plays within its context.

token – a single occurrence of a “word” within a corpus. Not necessarily unique in the corpus. Used most often to talk about word count and the size of a corpus.

type – a unique word form in a corpus. Types are placed in a word list arranged most often in order of frequency or alphabetical order, and usually shown with frequency count.

type-token ratio – the number of types divided by the number of tokens, used to show how “varied” one corpus is compared to another corpus of similar size. A corpus is more varied than another if the ratio is closer to 1.
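The calculation itself is a single division. The following sketch assumes whitespace tokenisation and case-sensitive types, and the function name `type_token_ratio` is illustrative.

```python
def type_token_ratio(tokens):
    """Number of unique word forms (types) divided by the number of tokens."""
    if not tokens:
        return 0.0  # Avoid dividing by zero for an empty corpus.
    return len(set(tokens)) / len(tokens)

tokens = "to be or not to be".split()
ttr = type_token_ratio(tokens)  # 4 types / 6 tokens
print(round(ttr, 3))
```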

wildcard – “open” search metacharacters used to find various combinations of user-defined characters.
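Wildcard syntax differs from tool to tool. As one possible illustration, Python's standard `fnmatch` module uses `*` for any run of characters and `?` for exactly one character; the list of types and the helper `wildcard_search` below are invented for the example.

```python
from fnmatch import fnmatchcase

types = ["play", "plays", "played", "playing", "player", "display"]

def wildcard_search(pattern, types):
    """Return the types matching a shell-style wildcard pattern (case-sensitive)."""
    return [t for t in types if fnmatchcase(t, pattern)]

starts_with_play = wildcard_search("play*", types)   # "play" followed by anything
play_plus_two = wildcard_search("play??", types)     # "play" plus exactly two characters
ends_with_play = wildcard_search("*play", types)     # anything followed by "play"

print(starts_with_play)
print(play_plus_two)
print(ends_with_play)
```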

word list – a complete list of types in a corpus usually shown in frequency or alphabetical order.
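A word list can be built by counting every token and then sorting the resulting types. The sketch below is a minimal version under the assumption of whitespace tokenisation; the function name `word_list` and the sample sentence are made up for the example.

```python
from collections import Counter

def word_list(tokens):
    """Return (type, frequency) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

tokens = "the cat and the dog and the bird".split()
by_frequency = word_list(tokens)
alphabetical = sorted(Counter(tokens).items())

for word, freq in by_frequency:
    print(f"{word:<8}{freq}")
```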
