4 thoughts on “What is meta-text information … and how to get rid of it?”

  1. Corpus files should be in plain text: rule no. 1.
    Want to remove meta text, i.e. text within angle brackets? Use regular expressions (rule no. 2).
    I’ve learned these two rules over the last few years while working with corpora.
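
A minimal sketch of "rule no. 2" in Python, assuming the tags are simple, non-nested spans between `<` and `>`. The function name `strip_angle_tags` and the sample string are made up for illustration. Note that this pattern also removes stretches of ordinary text that happen to sit between a literal `<` and `>`, so it is only safe on files where angle brackets are used for tags alone.

```python
import re

# Matches anything between "<" and the next ">", e.g. <b>, </h1>, <f0>.
TAG_PATTERN = re.compile(r"<[^<>]*>")


def strip_angle_tags(text: str) -> str:
    """Remove angle-bracket tags, then collapse runs of spaces or tabs."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", without_tags).strip()


sample = "<h1>Corpus notes</h1> This is <b>bold</b> text."
print(strip_angle_tags(sample))
```

Applied to each file of a corpus before saving it as plain text, this would leave only the running text for the concordancer.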

    • Hello Muhammad,
      Thank you for your comment. I am fully aware of what you are trying to say. Please see the link to understand the meaning of the term meta-text in this context. The problem many of my readers face is how to get rid of information hidden in Word documents, not meta-tags (HTML, XML, etc.).

  2. I understand your point about meta text. As far as I know, OOo Writer used a scheme similar to the one in HTML to show bold, headings, etc., that is, angle brackets. I noticed this while working on odt and doc files in OmegaT.
    And the idea of eliminating these tags using Notepad is fairly simple. I do the same myself while building a corpus: copy the text from the doc files, paste it into Notepad, save, go to the next file….

    • Cutting and pasting is very tedious. I use a freeware program called Zilla Word to Text Converter, but the encoding is in Indian and so the quotation marks and apostrophes are not read correctly in the concordancer I use. This problem has annoyed me for some time now. I cannot get it corrected in the Perl programs I have written either.

      Perhaps you can help me understand what the problem is (I see you are from Pakistan and may be able to help).
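
One possible cause of the quotation-mark problem described above is that the converter writes typographic ("curly") quotes in an encoding the concordancer does not expect. The sketch below is only a guess at a workaround: it assumes the converted file is in Windows-1252 (`cp1252`), which may well not be what the converter actually produces, and it replaces curly quotes and apostrophes with plain ASCII ones. The function name `normalize_quotes` and the sample bytes are hypothetical.

```python
# Map typographic quotes and apostrophes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(raw: bytes, source_encoding: str = "cp1252") -> str:
    """Decode raw bytes (the encoding is an assumption) and flatten curly quotes."""
    text = raw.decode(source_encoding, errors="replace")
    return text.translate(QUOTE_MAP)


# Sample bytes as a cp1252 file would store: It’s a “test”
raw = "It\u2019s a \u201ctest\u201d".encode("cp1252")
print(normalize_quotes(raw))
```

If the output still shows odd characters, the `source_encoding` guess is probably wrong; trying `"utf-8"` or another candidate is the next step, and the result can then be saved as UTF-8 or plain ASCII for the concordancer.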
