1. I understood your point about meta text. As far as I know, OOo Writer uses a scheme similar to HTML's to mark bold, headings, etc., that is, tags in angle brackets. I noticed this while working on odt and doc files in OmegaT.
    And the idea of eliminating these tags using Notepad is fairly simple. I do the same myself while building a corpus: copy text from a doc file, paste it into Notepad, save, go to the next file….



    1. Cutting and pasting is very tedious. I use a freeware program called Zilla Word to Text Converter, but the encoding is Indian, so the quotation marks and apostrophes are not read correctly by the concordancer I use. This problem has annoyed me for some time now. I cannot get it corrected in the Perl programs I have written either.

      Perhaps you can help me understand what the problem is (I see you are from Pakistan and may be able to help).
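A minimal Python sketch of one way to approach the quotation mark problem described above. The converter's actual output encoding is not stated in this thread, so the `cp1252` default below is only an assumption (it is a common source of curly quotes in text derived from Word files), and `normalize_quotes` is a hypothetical helper name. The idea is to decode the bytes with the correct encoding first, then map typographic quotes to plain ASCII ones that a concordancer is more likely to read.

```python
# Map typographic quotation marks and apostrophes to plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark, also used as apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Decode raw bytes and replace curly quotes with straight ones.

    source_encoding is an assumption; set it to whatever encoding the
    converter actually writes. Undecodable bytes become U+FFFD.
    """
    text = raw.decode(source_encoding, errors="replace")
    return text.translate(QUOTE_MAP)


if __name__ == "__main__":
    # In cp1252, bytes 0x93/0x94 are curly double quotes and 0x92 is a curly apostrophe.
    sample = b"\x93It\x92s fine,\x94 she said."
    print(normalize_quotes(sample))
```

If the decoded text still shows replacement characters or wrong letters, the assumed encoding is probably not the one the converter used, and a different `source_encoding` value should be tried.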


  2. Corpus files should be in plain text (rule no. 1).
    Want to remove meta text, i.e. text within angle brackets? Use regular expressions (rule no. 2).
    I've learned these two rules over the last few years while working with corpora.
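As a rough illustration of rule no. 2, here is a small Python sketch that deletes anything enclosed in angle brackets. The function name `strip_angle_bracket_text` is made up for this example, and the pattern is deliberately simple: it is not an HTML parser.

```python
import re

# Matches "<", then any run of characters that are not angle brackets, then ">".
TAG_PATTERN = re.compile(r"<[^<>]*>")


def strip_angle_bracket_text(text: str) -> str:
    """Remove every span of text enclosed in angle brackets."""
    return TAG_PATTERN.sub("", text)


if __name__ == "__main__":
    print(strip_angle_bracket_text("<b>Bold</b> and <h1>heading</h1> text"))
```

Note that a pattern like this also removes ordinary text that happens to sit between a `<` and a later `>`, for example in "a < b and c > d", so it is worth checking the output on a sample before running it over a whole corpus.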



    1. Hello Muhammad,
      Thank you for your comment. I am fully aware of what you are trying to say. Please see the link to understand the meaning of the term meta-text in this context. The problem many of my readers are facing is how to get rid of information hidden in Word documents, not meta-tags (HTML, XML, etc.).

