Text Pre-Processing
❌ Raw text is messy and inconsistent.
e.g., “WOw!!! The new iphone 18 pro is SOOO good! 😍 I luvv it… best phone ever? Check it out at someCoolSite.com #tech #apple”
So, before this messy text can be used for model training, we first need to pre-process it.
There are 2 main steps:
- Cleaning: Lowercasing, removing punctuation and stop words, and stripping special characters.
- Stemming/Lemmatization: Reducing words to their root form.
Cleaning
Lowercase the text, remove punctuation and stop words, and strip special characters such as emojis.
- Input: “🙋♂️ Hello, together we will learn NLP (Natural Language Processing)!!!”
- Output: “hello learn nlp natural language processing”
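The cleaning step above can be sketched with the Python standard library alone. This is a minimal illustration, not a production cleaner: the `STOP_WORDS` set and the `clean` function are hypothetical names, and the stop-word list is deliberately tiny (real lists contain hundreds of words and differ between libraries).

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "at", "we", "will", "together"}

def clean(text: str) -> str:
    # 1. Lowercase everything.
    text = text.lower()
    # 2. Replace anything that is not a lowercase ASCII letter, digit,
    #    or whitespace (punctuation, emojis, symbols) with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 3. Split on whitespace and drop stop words.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean("🙋♂️ Hello, together we will learn NLP (Natural Language Processing)!!!"))
# -> hello learn nlp natural language processing
```

Note that this sketch keeps only ASCII letters and digits, which suits English text; other languages would need a different character filter.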
Stemming
Reduce words to a root form by chopping off suffixes. This is very fast, but it often produces roots that are not dictionary words.
- Input: “Running was considered better than going to gym.”
- Porter Stemmer Output: [“run”, “wa”, “consid”, “better”, “than”, “go”, “to”, “gym”]
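To make the idea of suffix chopping concrete, here is a rough sketch of a naive stemmer. It is not the Porter algorithm (which applies many more rules in several passes), so its output differs from the list above: it returns “consider” where Porter returns “consid”. The names `SUFFIXES` and `naive_stem` are hypothetical. In practice you would more likely reach for an existing implementation such as NLTK's `PorterStemmer`.

```python
import re

# Suffixes to strip, checked in this order (a tiny, hypothetical rule set).
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least two characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

sentence = "Running was considered better than going to gym."
tokens = re.findall(r"[a-z]+", sentence.lower())
print([naive_stem(t) for t in tokens])
# -> ['run', 'wa', 'consider', 'better', 'than', 'go', 'to', 'gym']
```

The “wa” result shows the trade-off: the rules know nothing about the language, so “was” loses its final “s” just like a plural noun would.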
Lemmatization
Reduce words to their root form, i.e., dictionary base form (lemma).
- Input: “Running was considered better than going to gym.”
- WordNet Lemmatizer Output: [“run”, “be”, “consider”, “good”, “than”, “go”, “to”, “gym”]
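Conceptually, lemmatization is a dictionary lookup rather than suffix chopping. The sketch below uses a small hand-written table, `LEMMAS`, as a hypothetical stand-in for a real lexical database such as WordNet; `toy_lemmatize` is likewise an illustrative name. With a real tool such as NLTK's `WordNetLemmatizer`, the `lemmatize` method treats words as nouns unless a part of speech is passed via its `pos` argument, so results like “be” and “good” depend on supplying the right part of speech.

```python
import re

# Hypothetical hand-written lemma table standing in for a real dictionary.
LEMMAS = {
    "running": "run",
    "was": "be",
    "considered": "consider",
    "better": "good",
    "going": "go",
}

def toy_lemmatize(text: str) -> list:
    # Lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Look each token up; words missing from the table are left unchanged.
    return [LEMMAS.get(t, t) for t in tokens]

print(toy_lemmatize("Running was considered better than going to gym."))
# -> ['run', 'be', 'consider', 'good', 'than', 'go', 'to', 'gym']
```

Unlike the stemmer, every output here is a real word, but the lookup is only as good as the dictionary behind it, which is part of why lemmatization is slower.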
When to use lemmatization vs. stemming:
- ✅ Lemmatization: when accuracy matters; e.g., chatbots
- ✅ Stemming: when speed matters; e.g., searching massive datasets