Developing a more functional function word list

Here is my paper on developing a list of function words for use in corpus linguistics. The list consists of 387 word types and covers 43.3% of 43.4% (or 99.9% coverage and differentiation) of all function and content words (tokens). I am currently working on a slimmer list of 196 words, which will hopefully…More

A glossary of corpus types

There are many types of corpus, depending on their use. Below is a list of some of the main types.
diachronic – a corpus which looks at changes across a timeframe.
learner – a corpus of L2 learner writing or speech.
monitor – a type of diachronic corpus which may continue to grow with new texts added…More

In the reference corpus, we trust

People will always ask (and rightly so) how we can trust a corpus to be representative of the language we are studying. The answer is we can’t. But we can make sure it is as unbiased as possible by carefully setting criteria which will ensure it is at least reproducible and somewhat representative. Take the…More

Frequency is everything

Within the mind we tend to think of things as universal or generic without relating them to the wider world. We say things like, “the sun rises from the east”, without seeing it in the context in which it occurs. We probably even have a perfect, literally unclouded image of a singular sunrise that represents all…More

word list

A list of all the types in a corpus, usually arranged by frequency with the highest frequency at the top. When built from a reference corpus, a word list can tell you which words are the most common in a language. Placed against another corpus from a different period (or one that is marked with usage information)…More
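As a rough illustration of the idea, here is a minimal sketch of building a frequency-ordered word list in Python. The function name `word_list`, the simple letter-based tokenizer, and the `case_sensitive` switch are all assumptions made for this example; real concordancing tools tokenize in their own ways.

```python
import re
from collections import Counter


def word_list(text, case_sensitive=False):
    """Return (type, frequency) pairs, most frequent first.

    Hypothetical helper for illustration only: tokens are runs of
    letters, optionally with one internal apostrophe (e.g. "don't").
    """
    if not case_sensitive:
        text = text.lower()
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text)
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = "The cat saw the other cat. The end."
for word, freq in word_list(sample):
    print(freq, word)
```

With `case_sensitive=True`, "The" and "the" would be counted as separate types, which changes both the ranking and the number of types.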

case

Case – lower and uppercase – serves the purpose of aiding reading, and therefore meaning, in written texts. It does not substantially change the pronunciation of a word. It is therefore a wholly written feature of language that is not apparent in spoken form. Concordancing software often allows you to choose between being case-sensitive or not. At times,…More

Keywords List – AntConc

The keywords list in AntConc is, as the name suggests, a tool to create a list of keywords. To do this your target corpus is compared to a reference corpus. The target and reference corpora do not need to be of the same size. The comparison is then done statistically. The statistics in AntConc used…More
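To make the statistical comparison more concrete, here is a minimal sketch of one widely used keyness measure, log-likelihood. The excerpt above is cut off before it names the statistic, so the choice of log-likelihood here is an assumption, and the function name and parameters are hypothetical rather than anything taken from AntConc itself.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness score for one word (illustrative sketch).

    freq_*: occurrences of the word in each corpus.
    size_*: total number of tokens in each corpus.
    The corpora may differ in size; expected frequencies scale with size.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_target > 0:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


# Same relative frequency in both corpora: score is (near) zero.
print(log_likelihood(10, 1_000, 100, 10_000))
# Word is relatively much more frequent in the target corpus.
print(log_likelihood(50, 1_000, 100, 10_000))
```

A higher score means the word's frequency differs more between the two corpora than corpus size alone would predict; the score by itself does not say in which corpus the word is overused.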

AntWordProfiler short review

I have been playing with Laurence Anthony’s AntWordProfiler for a bit now. It is a corpus linguistics tool to “profile” coverage of texts in terms of comprehensibility, particularly for reading in a second language. To understand how it works, one must understand its predecessor, Range, by Paul Nation. Paul Nation is a researcher in Vocabulary…More

type-token ratio

The type-token ratio (or TTR) is used to compare two corpora in terms of lexical complexity. The formula is the number of types divided by the number of tokens. The closer to 1, the greater the complexity. The closer to 0, the greater the repetition of words. There is no specific ratio which can…More
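The formula can be sketched in a few lines of Python. The function name `type_token_ratio` and the whitespace tokenization in the usage lines are assumptions for illustration; how tokens are split and whether case is folded will change the result.

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(tokens)) / len(tokens)


varied = "the cat sat on the mat".split()      # 5 types, 6 tokens
repetitive = "the the the the the the".split()  # 1 type, 6 tokens

print(round(type_token_ratio(varied), 3))
print(round(type_token_ratio(repetitive), 3))
```

Because longer texts inevitably repeat more words, the ratio tends to fall as the number of tokens grows, so comparisons are most meaningful between samples of similar length.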