4 thoughts on “What is meta-text information … and how to get rid of it?”

    Cutting and pasting is very tedious. I use a freeware program called Zilla Word to Text Converter, but the encoding is in Indian, so the quotation marks and apostrophes are not read correctly by the concordancer I use. This problem has annoyed me for some time now. I cannot get it corrected in the Perl programs I have written either.

    Perhaps you can help me understand what the problem is (I see you are from Pakistan and may be able to help).
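
The exact cause of the quotation mark problem is not stated above, so the following is only a minimal sketch under two assumptions: that the converter writes typographic ("curly") quotes, and that the file is in a Windows code page such as cp1252. The names `normalize_quotes` and `read_text` and the default encoding are hypothetical choices for illustration, not part of any tool mentioned here.

```python
# Map typographic quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Replace curly quotes and apostrophes with ASCII ones."""
    return text.translate(QUOTE_MAP)


def read_text(path: str, encoding: str = "cp1252") -> str:
    """Read a converted file and normalize its quotes.

    The cp1252 default is an assumption; pass the real encoding if known.
    Undecodable bytes are replaced rather than raising an error.
    """
    with open(path, encoding=encoding, errors="replace") as handle:
        return normalize_quotes(handle.read())
```

If the concordancer expects a particular encoding, saving the normalized text in that encoding (UTF-8, for example) may be all that is needed.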


    I understood your point about meta-text. As far as I know, OOo Writer uses a scheme similar to the one in HTML, namely angle brackets, to mark bold, headings, etc. I noticed this while working on odt and doc files in OmegaT.
    And the idea of eliminating these tags using Notepad is fairly simple. I do the same myself while building a corpus: copy text from doc files, paste it into Notepad, save, go to the next file….


    Hello Muhammad,
    Thank you for your comment. I am fully aware of what you are trying to say. Please see the link to understand the meaning of the term meta-text in this context. The problem many of my readers are facing is how to get rid of information hidden in Word documents, not meta-tags (HTML, XML, etc.).


    Corpus files should be in plain text: rule no. 1.
    Wanna remove meta-text, i.e. text within angle brackets? Use regular expressions (rule no. 2).
    I’ve learned these two rules over the last few years while working with corpora.
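
Rule no. 2 could be sketched roughly as follows. This is one possible pattern, not a complete markup parser: it assumes a tag starts with a letter or a slash right after the opening bracket, so that a stray "<" in running text is left alone. The names `TAG_PATTERN` and `strip_tags` are made up for this example.

```python
import re

# A tag: "<", optional "/", a letter, then anything up to the next ">".
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_tags(text: str) -> str:
    """Remove angle-bracket tags, keeping the text between them."""
    return TAG_PATTERN.sub("", text)


if __name__ == "__main__":
    print(strip_tags("<b>Hello</b> world"))
```

Comments, entities such as `&amp;`, and tags split across odd places are not handled; for heavily marked-up files a real HTML or XML parser is probably the safer choice.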


