cluster – a sequence of consecutive tokens that includes a particular node.
collocation – the tendency of certain words to occur near one another: ‘You shall know a word by the company it keeps’ (Firth 1957). Collocates are usually shown in tabular form to show which are salient for a particular node.
concordance – a display of the lines in which a node occurs, with the node centred in order to show patterns with the surrounding tokens. See also KWIC.
concordance plot – a visual representation of the distribution of a type.
corpus – a “body” of electronic text(s) used for analysis in corpus linguistics. Plural of corpus is corpora.
frequency – refers to the number of times a type occurs in a corpus.
keyword – a type which is salient within a corpus when compared statistically to another corpus.
KWIC – Short for “KeyWord In Context”. The search term for a concordance. The words which surround it within the concordance view can be thought of as its context.
lemma – the forms of a type which can be considered to be related to a headword. The headword play has plays, played and playing as its lemmata (plural of lemma).
n-gram – a sequence of n consecutive types within a corpus; an n-gram list is a complete list of all such sequences of a predetermined length. A sequence of two types is called a 2-gram, a sequence of three types a 3-gram, and so forth.
node – the central type or sequence of types which is the focus of analysis in corpus linguistics.
part-of-speech tag or POS tag – the morpho-grammatical label given to a type to mark the role it plays within its context.
token – a “word” within a corpus. Not necessarily unique in the corpus. Used most often to talk about word count and the size of a corpus.
type – a unique word form in a corpus. Types are placed in a word list arranged most often in order of frequency or alphabetical order, and usually shown with frequency count.
type-token ratio – the number of types divided by the number of tokens, used to show how “varied” one corpus is compared to another corpus of similar size. A corpus is more varied than another if the number is closer to 1.
wildcard – “open” search metacharacters used to find various combinations of user-defined characters.
word list – a complete list of types in a corpus usually shown in frequency or alphabetical order.
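
Several of the terms above (token, type, word list, type-token ratio, n-gram, KWIC, wildcard) describe simple counting or searching operations, so a short sketch may help make them concrete. The Python below is illustrative only and is not taken from any particular concordancing tool. The function names, the sample sentence, the lowercase letters-only tokenizer, and the shell-style wildcard syntax (* and ?) are all assumptions made for this example; real tools differ in how they tokenize and in which wildcards they accept.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase

def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def word_list(tokens):
    """Return (type, frequency) pairs, most frequent first, ties alphabetical."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def type_token_ratio(tokens):
    """Number of types divided by number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def kwic(tokens, node, window=3):
    """Return concordance lines as (left context, node, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

def wildcard_search(tokens, pattern):
    """Return the types matching a shell-style wildcard pattern (assumed syntax)."""
    return sorted(t for t in set(tokens) if fnmatchcase(t, pattern))

# Hypothetical two-sentence "corpus" used only for illustration.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

print(len(tokens))                      # number of tokens
print(word_list(tokens))                # types with their frequencies
print(round(type_token_ratio(tokens), 3))
print(Counter(ngrams(tokens, 2)).most_common(2))  # most frequent 2-grams
for left, node, right in kwic(tokens, "sat", window=2):
    print(f"{left:>10}  {node}  {right}")
print(wildcard_search(tokens, "?og"))
```

In this sample there are 12 tokens but only 7 types, so the type-token ratio is 7/12. The KWIC function pads the left context so that the node lines up in a column, which is what makes recurring patterns on either side easy to see.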