Text Pre-Processing

❌ Raw text is messy and inconsistent.

e.g., “WOw!!! The new iphone 18 pro is SOOO good! 😍 I luvv it… best phone ever? Check it out at someCoolSite.com #tech #apple”

So, before this messy text can be used for model training, it needs some pre-processing.
There are 2 main steps:

  • Cleaning: Lowercasing, removing punctuation and stop words, and stripping special characters.
  • Stemming/Lemmatization: Reducing words to their root form.

Cleaning

Lowercase the text, remove punctuation and stop words, and strip special characters such as emojis.

  • Input: “🙋‍♂️ Hello, together we will learn NLP (Natural Language Processing)!!!”
  • Output: “hello learn nlp natural language processing”
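
The cleaning steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production cleaner: the function name clean and the tiny STOP_WORDS set are assumptions made for this example, and real stop-word lists are much longer and differ between libraries.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"together", "we", "will", "the", "is", "a", "an", "it", "to"}


def clean(text: str) -> str:
    """Lowercase, strip punctuation/special characters, and drop stop words."""
    text = text.lower()
    # Replace everything except a-z, 0-9 and whitespace with a space.
    # This removes punctuation, brackets and emoji in one pass.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on whitespace and keep only tokens that are not stop words.
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean("🙋‍♂️ Hello, together we will learn NLP (Natural Language Processing)!!!")
print(cleaned)  # hello learn nlp natural language processing
```

Which words count as stop words is a design choice: the output above drops “together” only because this sketch's list includes it.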

Stemming

Reduce words to a root form by chopping off suffixes with simple rules. This is very fast, but often yields non-dictionary roots (e.g., “consid”).

  • Input: “Running was considered better than going to gym.”
  • Porter Stemmer Output: [“run”, “wa”, “consid”, “better”, “than”, “go”, “to”, “gym”]
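
To make the suffix-chopping idea concrete, here is a minimal sketch of a naive stemmer. It is a toy with three hypothetical suffix rules (the names SUFFIXES, tokenize and naive_stem are assumptions for this example) and is not the Porter algorithm, so its output differs slightly: it returns “consider” where Porter returns “consid”.

```python
import re

# Hypothetical suffix rules for this toy; Porter uses many more, applied in phases.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, with no dictionary lookup."""
    for suffix in SUFFIXES:
        # Only strip if at least two characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


stems = [naive_stem(tok) for tok in tokenize("Running was considered better than going to gym.")]
print(stems)  # ['run', 'wa', 'consider', 'better', 'than', 'go', 'to', 'gym']
```

Note how “was” becomes the non-word “wa”: the rule strips the final “s” without knowing anything about the word. For real work, a library implementation such as NLTK's PorterStemmer is the usual choice.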
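
Lemmatization

Reduce words to their root form, i.e., the dictionary base form (lemma). This is slower than stemming, because it relies on a vocabulary and each word's part of speech.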

  • Input: “Running was considered better than going to gym.”
  • WordNet Lemmatizer Output (with part-of-speech tags supplied): [“run”, “be”, “consider”, “good”, “than”, “go”, “to”, “gym”]
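
The key difference from stemming is that lemmatization looks words up instead of chopping them. The sketch below illustrates this with a hypothetical miniature lemma table keyed by (word, part of speech); the names LEMMAS and lemmatize, the hand-assigned tags, and the "other" tag are all assumptions for this example, standing in for a real vocabulary such as WordNet and a real part-of-speech tagger.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
# "v" = verb, "a" = adjective; a real lemmatizer uses a full vocabulary.
LEMMAS = {
    ("running", "v"): "run",
    ("was", "v"): "be",
    ("considered", "v"): "consider",
    ("better", "a"): "good",
    ("going", "v"): "go",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


# Hand-assigned tags (an assumption); in practice a POS tagger supplies them.
tagged = [
    ("Running", "v"), ("was", "v"), ("considered", "v"), ("better", "a"),
    ("than", "other"), ("going", "v"), ("to", "other"), ("gym", "n"),
]

lemmas = [lemmatize(word, pos) for word, pos in tagged]
print(lemmas)  # ['run', 'be', 'consider', 'good', 'than', 'go', 'to', 'gym']
```

The part of speech matters: looked up as a noun, “better” is not in the table and comes back unchanged. NLTK's WordNetLemmatizer behaves similarly in that its lemmatize method takes a pos argument that defaults to noun, so verbs and adjectives need explicit tags to reach the output shown above.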

When to use Lemmatization or Stemming:

  • ✅ Lemmatization: when accuracy matters most; e.g., chatbots
  • ✅ Stemming: when speed matters most; e.g., searching massive datasets