Cleaning text data for Machine Learning using Python

We have loads of text data waiting to be examined and analysed. But we cannot use raw text data as-is for our machine learning and deep learning models; it first needs to be cleaned and preprocessed.

When we say the “data needs to be cleaned”, we mean that it contains unnecessary elements that have to be removed, and that useful information needs to be extracted from the raw text to help us build the model more effectively.

We’ll see some of the most common methods to clean the text data and prepare it for machine learning.

What you’ll get in the tutorial

We’ll start with my personal favourite: the Twitter sentiment analysis dataset available on Kaggle.

Download both dataset files (train and test) and place them in your working directory. For this project specifically we’ll only use the train dataset; you can apply the same methods you learn here to the test dataset for practice.

So far we know that raw text data cannot be fed directly into a machine learning model; it needs cleaning first.

Different types of text data call for different preprocessing and cleaning steps, so the methods applied here may differ for some other dataset.

We will be performing two main tasks on our Twitter dataset:

Feature Extraction

  • Loading a file
  • Counting number of words
  • Counting number of characters
  • Average characters per word
  • Stop word counting
  • Counting the number of #Hashtags and @Mentions
  • Counting number of numeric digits
  • Counting Uppercase words

Preprocessing and cleaning

  • Converting to lowercase
  • Expanding words
  • Counting the number of Emails and removing them
  • Counting the number of URLs and removing them
  • Removing special characters
  • Removing multiple spaces between words
  • Removing HTML tags
  • Removing Accented words
  • Removing Stop Words
  • Converting words to base form
  • Removing common words
  • Removing rare words
  • Word Cloud
  • Spelling correction
  • Tokenization
  • Lemmatization
  • NER-Named Entity Recognition
  • Noun Detection
  • Language Detection
  • Sentence Translation

Let’s Get Started !!

Assuming that you have basic Python knowledge and have Python installed on your machine, let’s begin.

Feature Extraction

Loading a file

Use the following code to import pandas and then load the train dataset placed in your working directory:

import pandas as pd
df = pd.read_csv('train.csv', encoding = 'latin1')
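As an optional sanity check (not part of the original steps), you can preview the loaded data and confirm the column names, including the twitts column used throughout this tutorial:

# Optional: confirm the data loaded and inspect the columns
print(df.shape)
print(df.columns)
print(df.head())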

Counting number of words

Use the following code to create a new column named word_counts, which stores the number of words in each row of the twitts column:

df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))

Counting number of characters

Use the following Python function to count the number of characters in each row, applying it to every row with a lambda function:

def char_counts(x):
    # join the words back together without spaces so whitespace is not counted
    s = x.split()
    x = ''.join(s)
    return len(x)

df['char_counts'] = df['twitts'].apply(lambda x: char_counts(str(x)))

Average characters per word

Use the following code to create a new column that gives the average number of characters per word, using the two columns created in the previous steps:

df['avg_word_len'] = df['char_counts']/df['word_counts']
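One caveat the code above does not guard against: if a tweet ends up with zero words, the division yields inf. A minimal defensive sketch, assuming you want such rows to be 0 instead:

import numpy as np

# Assumption: rows with zero words should get an average word length of 0
df['avg_word_len'] = np.where(df['word_counts'] > 0,
                              df['char_counts'] / df['word_counts'],
                              0)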

Stop word counting

You need to install the spacy library before we can count the stop words in the text. We also download spaCy’s pretrained statistical model for English (small):

pip install spacy
python -m spacy download en_core_web_sm

Stop words are typically the most common words in a language (is, the, etc.).

Import the stop words from spacy:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
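As a small optional check (not part of the original steps), you can inspect the imported stop word set:

# Optional: peek at the spaCy stop word list
print(len(stopwords))      # typically a few hundred entries
print('the' in stopwords)  # True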

Use the following code to create a new column, stop_words_len, which stores the number of stop words in each tweet:

df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in stopwords]))

Counting the number of #Hashtags and @Mentions

Use the following code to create separate columns for hashtags and mentions respectively.

For Hashtags(#):

df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))

For Mentions(@):

df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))

Counting number of numeric digits

Use the following code to create a new column, numerics_count, which contains the number of purely numeric tokens in the corresponding twitts entry:

df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))

Counting Uppercase words

Use the following code to create a new column, upper_counts, which contains the number of fully uppercase words in the corresponding twitts entry:

df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper()]))
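That completes the feature extraction step. An optional preview of the new feature columns side by side (the column names are exactly those created above):

# Optional: preview the extracted features together
feature_cols = ['word_counts', 'char_counts', 'avg_word_len',
                'stop_words_len', 'hashtags_count', 'mentions_count',
                'numerics_count', 'upper_counts']
print(df[feature_cols].head())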

Preprocessing and Cleaning

Converting to lowercase

Use the code below to convert the twitts to lowercase:

df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())

Expanding words

We need to convert contractions like don’t into expansions like do not. For this we provide a dictionary of contractions that commonly appear in text; you can add more entries as you come across them.

contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and ",
"won't": "would not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'}

The following function uses the key/value pairs to replace each contraction with its expansion:

def cont_to_exp(x):
    if type(x) is str:
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x

Use a lambda function to apply the contraction-to-expansion conversion to the twitts column:

df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))
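To see the function in action, here is a quick illustrative call (the sample sentence is made up, not from the dataset):

# Quick sanity check on a made-up sentence
print(cont_to_exp("i'm sure he'll come, don't worry"))
# -> 'i am sure he will come, do not worry'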

Counting the number of Emails and removing them

We need to import the regular expressions library for this. We then create a new column, emails, holding the email addresses found in each tweet:

import re

df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x))

We use the newly created emails column to generate another column, emails_count, which gives the number of emails found:

df['emails_count'] = df['emails'].apply(lambda x: len(x))
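An optional way to eyeball only the rows where emails were actually found:

# Optional: inspect tweets that contain at least one email address
print(df[df['emails_count'] > 0][['twitts', 'emails']].head())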

Use the code below to remove the emails from the twitts column:

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', '', x))

This removes all the emails from the twitts column.

Counting the number of URLs and removing them

The text will contain URLs that must be removed. Use the following code to count the URLs in each tweet of the twitts column:

df['url_flags'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))

The code below removes the matched URLs from the twitts column:

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x))

Removing special characters

Special characters like $ % ^ & * ( ) are present in the text and need to be removed before we move on to model building.

The following code removes all special characters from the twitts column, again using regular expressions:

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'[^\w ]+', "", x))
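To see what the pattern keeps and strips, here is a tiny illustrative example (the sample string is made up):

import re

# Everything that is not a word character or a space is removed
sample = 'great day!! #blessed @friend :)'
print(re.sub(r'[^\w ]+', '', sample))
# -> 'great day blessed friend ' (note the leftover trailing space)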

Removing multiple spaces between words

We also need to handle situations where multiple consecutive spaces appear in the text, like below:

x =  'hi    hello     how are you'

The following code fixes the issue of multiple spaces in the twitts column:

df['twitts'] = df['twitts'].apply(lambda x: ' '.join(x.split()))
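Applied to the earlier example, the split/join trick collapses the repeated spaces:

x = 'hi    hello     how are you'
# split() breaks on any run of whitespace; ' '.join() rebuilds with single spaces
print(' '.join(x.split()))
# -> 'hi hello how are you'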
