Tools

Here you can find information about pre-processing digital texts for NLP research purposes. Often, you need to clean up your texts before feeding them to other NLP tools such as the Natural Language Toolkit (NLTK), and the information below will be useful in such cases. Pre-processing includes removing scanning artifacts (OCR errors) from your texts, removing unwanted information such as HTML or XML tags, changing the character encoding (e.g. to/from Unicode), and so on. When you have large or numerous text files to deal with, batch-processing software completes the job much faster than manual editing.

Regular Expression

When you want to search and replace chunks of text that follow a “pattern” (e.g. all HTML/XML tags are enclosed in < >), regular expressions (regex or regexp for short) come in handy. A regular expression is a pattern, written in a compact notation, that describes which strings of text to match. Here is an example for removing HTML tags (the pattern matches any text in < >, including the brackets, and removes all the matches):

/<(.|\n)*?>//g

A regex pattern is placed between slashes (as in /SEARCH_PATTERN/REPLACE_PATTERN/) and an optional flag is placed after the last /. In this example, the g flag means global match, i.e. find all possible matches in the source text. The pattern <(.|\n)*?> means 0 or more instances of any character (including new lines) enclosed by < >; the ? after * makes the match stop at the first closing > rather than the last one. When you do a search and replace using regex, you can use this pattern in the search field and leave the replace field blank in order to strip out all the tags from your text. You can find more regex examples of stripping out HTML/XML tags here. Also, search the web for “strip tags regular expression” for more examples.
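
The same search and replace can also be tried in a script. The following is a minimal Python sketch using the standard re module; the helper name strip_tags and the sample string are made up for illustration. The empty replacement string plays the role of the blank replace field, and re.sub replaces every match by default, so no g flag is needed.

```python
import re

# The pattern from the text: "<", then as few characters as possible
# (any character or a newline), then ">".
TAG_PATTERN = re.compile(r"<(.|\n)*?>")


def strip_tags(text: str) -> str:
    """Remove everything that looks like an HTML/XML tag (hypothetical helper)."""
    return TAG_PATTERN.sub("", text)


# Made-up sample; the last tag is deliberately split across two lines.
sample = '<p class="intro">Call me <b>Ishmael</b>.</p>\n<br\n/>Some years ago'
print(strip_tags(sample))
```

This is a rough tool rather than a full HTML parser: it will be fooled by a > inside an attribute value, for example, so it is worth checking the output on a sample before processing a whole collection.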

Regex resources

Text processing on Desktop

Below is a list of desktop software (Windows/Mac and some Linux) that you can use to pre-process texts. You should prepare your texts in plain text files (.txt) first, not Word .doc or Adobe .pdf files. Make sure to keep a copy of all original files elsewhere before you make changes to them.
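
If you are comfortable with a little scripting, the advice above (work on plain .txt files and leave the originals alone) can be sketched in Python as well. This is only an illustration under stated assumptions: the function name clean_directory and its parameters are hypothetical, the cleaning step is the tag-stripping regex, and the cleaned copies are always written as UTF-8, which also converts the encoding if the sources use a different one.

```python
import re
from pathlib import Path

# "<", then as few characters as possible (newlines included), then ">".
TAG_PATTERN = re.compile(r"<.*?>", re.DOTALL)


def clean_directory(src_dir: str, dst_dir: str, encoding: str = "utf-8") -> int:
    """Write tag-stripped copies of every .txt file in src_dir into dst_dir.

    The originals are only read, never modified. `encoding` is the encoding
    of the source files; the copies are written as UTF-8.
    Returns the number of files written. (Hypothetical helper.)
    """
    src, dst = Path(src_dir), Path(dst_dir)
    if src.resolve() == dst.resolve():
        raise ValueError("dst_dir must differ from src_dir so originals stay untouched")
    dst.mkdir(parents=True, exist_ok=True)

    count = 0
    for path in sorted(src.glob("*.txt")):
        text = path.read_text(encoding=encoding)
        cleaned = TAG_PATTERN.sub("", text)
        (dst / path.name).write_text(cleaned, encoding="utf-8")
        count += 1
    return count


# Example call with made-up directory names:
# clean_directory("raw_texts", "cleaned_texts", encoding="cp1252")
```

The desktop programs listed next do the same kind of job through a graphical interface and need no programming.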
 

  1. Text batch processing (Find and replace, regular expression support): Use these programs to search for, replace, or remove text in your files. Batch processing software makes it easy to process multiple texts at once (e.g. strip out all tags in all files in a particular directory).
    • TextCrawler (freeware, Windows): Find and replace words and phrases across multiple files and folders, create Regular Expressions (comes with a built-in regular expression tester so that you can test your patterns on a sample text before modifying your files), perform batch operations (process multiple files at once), and more.
    • Here is a link to some other text batch processing tools. This linked article is a bit outdated (circa 2009), but the list is still very useful. There are some Mac utilities as well.
  2. Character encoding conversion:
    • iconv (freeware, Windows/Mac/Linux): Convert the character encoding of text files (e.g. ANSI ↔ Unicode/UTF). Command-line interface. On Wikipedia: Iconv
  3. Compare files and/or folders and merge differences (Visual “diff” and merge tools): These are useful when you want to compare the original with the modified texts and check the changes made visually, side-by-side.
    • WinMerge (freeware, Windows): Compare both files and folders, check the differences visually, and merge them.
    • Meld (freeware, Linux): Visual diff/merge tool for Linux desktop users.
    • DiffMerge (freeware, Windows/Mac/Linux): Like WinMerge, but cross-platform. Check its homepage for a very detailed pdf manual.
  4. Misc: