ICEWeb – the web as corpus

I had never worked with the web as a corpus before. That is, until now.

People say that the web is unreliable as a source of corpus linguistic data. But they said the same thing when the first corpora were made all those years ago. Undoubtedly, the quality of the sample is an important factor, and one can say the same about any scientific experiment with a small sample size. Thus the choice of sample matters as much as its size.

All language is language. We can use literature as the yardstick, or some other medium. So why not the web?

Martin Weisser was nice enough to tell me about his work on ICEWeb, a program for web corpus analysis. It has an easy-to-use interface with a simple help menu that explains the basics. How one chooses and analyses a web corpus is something else, something which I have yet to master.

I recommend that you try it if you are interested in studying language and the web.

WordNet 3.0 Vocabulary Helper

WordNet 3.0 Vocabulary Helper seems like an interesting tool. Wikipedia defines WordNet as something which “groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets.”

WordNet was created at Princeton University and has been used in research on machine translation. An offline version can be downloaded from the official Princeton University website.
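
To make the idea of a synset a little more concrete, here is a minimal sketch in Python. It is a toy model, not WordNet itself: the three entries, their identifiers, their definitions, and the helper names `synonyms` and `hypernym_chain` are all made up for illustration, and the real database is far larger and richer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """A toy synset: a set of synonymous words with one shared definition."""
    name: str                      # hypothetical identifier, invented for this sketch
    words: List[str]               # the words that share this meaning
    definition: str                # a short, general definition
    hypernyms: List[str] = field(default_factory=list)  # names of broader synsets


# Hand-written sample data (an assumption, not taken from the real WordNet files).
SYNSETS: Dict[str, Synset] = {
    "car.n.01": Synset(
        "car.n.01",
        ["car", "auto", "automobile"],
        "a motor vehicle with four wheels",
        ["motor_vehicle.n.01"],
    ),
    "motor_vehicle.n.01": Synset(
        "motor_vehicle.n.01",
        ["motor vehicle", "automotive vehicle"],
        "a self-propelled wheeled vehicle that does not run on rails",
        ["vehicle.n.01"],
    ),
    "vehicle.n.01": Synset(
        "vehicle.n.01",
        ["vehicle"],
        "a conveyance that transports people or objects",
    ),
}


def synonyms(word: str) -> List[str]:
    """Return every other word that shares a synset with `word`, sorted."""
    found = set()
    for synset in SYNSETS.values():
        if word in synset.words:
            found.update(w for w in synset.words if w != word)
    return sorted(found)


def hypernym_chain(name: str) -> List[str]:
    """Follow the first 'is a kind of' link upwards until none is left."""
    chain = []
    current = SYNSETS.get(name)
    while current is not None and current.hypernyms:
        parent = current.hypernyms[0]
        chain.append(parent)
        current = SYNSETS.get(parent)
    return chain


if __name__ == "__main__":
    print(synonyms("car"))
    print(hypernym_chain("car.n.01"))
```

Looking up "car" gives the other words in its synset, and the chain shows one kind of semantic relation between synsets: a car is a kind of motor vehicle, which is a kind of vehicle.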