Text Pre-Processing - Cleaning, Stemming & Lemmatization

❌ Raw text is messy and inconsistent.

e.g., “WOw!!! The new iphone 18 pro is SOOO good! I luvv it… best phone ever? Check it out at someCoolSite.com #tech #apple”

This messy text needs some pre-processing before it can be used for model training.
There are 2 main steps:

Cleaning: Removing punctuation, lowercasing, removing stop words, and stripping special characters.
Stemming/Lemmatization: Reducing words to their root form.

Cleaning: Removing punctuation, lowercasing, removing stop words, and stripping special characters.
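As a rough sketch, the cleaning step could look like this in Python. The stop word list, the URL pattern, and the `clean_text` helper are illustrative assumptions rather than a standard recipe; real pipelines usually rely on a much longer stop word list.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "at", "i", "is", "it", "of", "the", "to"}


def clean_text(text: str) -> list[str]:
    """Lowercase, drop bare domains, strip special characters, remove stop words."""
    text = text.lower()
    # Drop bare domains such as "somecoolsite.com" (simplified pattern).
    text = re.sub(r"\S+\.(?:com|org|net)\b\S*", " ", text)
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on whitespace and filter out stop words.
    return [token for token in text.split() if token not in STOP_WORDS]


raw = (
    "WOw!!! The new iphone 18 pro is SOOO good! I luvv it… best phone ever? "
    "Check it out at someCoolSite.com #tech #apple"
)
print(clean_text(raw))
```

On the messy example from the top of the post, this keeps tokens such as "wow", "iphone", "18" and "luvv", and drops the punctuation, the hashtag symbols, the site name and stop words like "the" and "is". Fixing spellings such as "luvv" or "sooo" is a separate normalization problem that this sketch does not attempt.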

Stemming: Reduces words to a root form by chopping off suffixes, which often yields non-dictionary roots; very fast.

Input: “Running was considered better than going to gym.”
Porter Stemmer Output: [“run”, “wa”, “consid”, “better”, “than”, “go”, “to”, “gym”]
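To make the "chopping off suffixes" idea concrete, here is a deliberately simplified suffix stripper. It is not the Porter algorithm (which applies several ordered rule phases); the suffix list, the minimum stem length, and the `toy_stem` name are assumptions made for illustration. In practice a library implementation such as NLTK's `PorterStemmer` would be the usual choice.

```python
# Suffixes tried in order (illustrative only, not Porter's rule set).
SUFFIXES = ("ing", "ed", "es", "ly", "s")
MIN_STEM = 2  # do not strip if fewer than this many characters would remain
VOWELS = set("aeiou")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, then undouble a final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            stem = word[: -len(suffix)]
            # "runn" -> "run": collapse a doubled trailing consonant.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                stem = stem[:-1]
            return stem
    return word


tokens = ["running", "was", "considered", "better", "than", "going", "to", "gym"]
print([toy_stem(t) for t in tokens])
# ['run', 'wa', 'consider', 'better', 'than', 'go', 'to', 'gym']
```

Even this toy version produces non-dictionary roots ("wa", and "studi" for "studies"), and it needs no dictionary lookup, which is why stemming is fast. It differs from the Porter output above on "considered", since the real algorithm strips further suffixes.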

Lemmatization: Reduces words to their dictionary base form (the lemma).

Input: “Running was considered better than going to gym.”
WordNet Lemmatizer Output (given the correct part-of-speech tag for each word): [“run”, “be”, “consider”, “good”, “than”, “go”, “to”, “gym”]
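A lemmatizer is essentially a dictionary lookup keyed by the word and its part of speech. The sketch below uses a tiny hand-written lexicon to show the idea; the `LEMMAS` table and the `toy_lemmatize` helper are hypothetical stand-ins, whereas a real lemmatizer such as NLTK's `WordNetLemmatizer` consults the full WordNet database.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma.
# "v" = verb, "a" = adjective, "n" = noun.
LEMMAS = {
    ("running", "v"): "run",
    ("was", "v"): "be",
    ("considered", "v"): "consider",
    ("better", "a"): "good",
    ("going", "v"): "go",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


tagged = [
    ("Running", "v"), ("was", "v"), ("considered", "v"), ("better", "a"),
    ("than", "n"), ("going", "v"), ("to", "n"), ("gym", "n"),
]
print([toy_lemmatize(word, pos) for word, pos in tagged])
# ['run', 'be', 'consider', 'good', 'than', 'go', 'to', 'gym']

# Without the right part of speech the lookup misses:
print(toy_lemmatize("better"))  # 'better'
```

The last call shows why the part-of-speech tag matters: "better" only maps to "good" when it is treated as an adjective. That extra lookup and tagging work is what makes lemmatization slower than stemming.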

When to use Lemmatization or Stemming: