Cleaning text data for Machine Learning using Python


What you’ll get in the tutorial

We’ll start with my personal favourite: the Twitter sentiment analysis dataset available on Kaggle.

Feature Extraction

  • Loading a file
  • Counting number of words
  • Counting number of characters
  • Average characters per word
  • Stop word counting
  • Counting the number of #Hashtags and @Mentions
  • Counting number of numeric digits
  • Counting Uppercase words

Preprocessing and cleaning

  • Converting to lowercase
  • Expanding words
  • Counting the number of Emails and removing them
  • Counting the number of URLs and removing them
  • Removing special characters
  • Removing multiple spaces between words
  • Removing HTML tags
  • Removing Accented words
  • Removing Stop Words
  • Converting words to base form
  • Removing common words
  • Removing rare words
  • Word Cloud
  • Spelling correction
  • Tokenization
  • Lemmatization
  • NER-Named Entity Recognition
  • Noun Detection
  • Language Detection
  • Sentence Translation

Let’s Get Started !!

Assuming you have basic Python knowledge and have Python installed on your machine, let’s dive in!

Feature Extraction

Loading a file

import pandas as pd

df = pd.read_csv('train.csv', encoding='latin1')

# Number of words per tweet
df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))

# Number of characters per tweet, excluding spaces
def char_counts(x):
    s = x.split()
    x = ''.join(s)
    return len(x)

df['char_counts'] = df['twitts'].apply(lambda x: char_counts(str(x)))

# Average characters per word
df['avg_word_len'] = df['char_counts'] / df['word_counts']
Install spaCy and its English model from the command line:

pip install spacy
python -m spacy download en_core_web_sm

Then import the stop-word list:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in stopwords]))
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))
df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))
df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper()]))
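Before applying these extractors to the whole DataFrame, it helps to sanity-check the counting logic on a single tweet. The sample text below is made up for illustration, not taken from the dataset:

```python
# Hypothetical sample tweet for checking the feature-extraction logic above
sample = "OMG I love #Python and #NLP thanks @kaggle 100 times"
words = sample.split()

word_count = len(words)                                # whitespace tokens
char_count = len(''.join(words))                       # characters, spaces excluded
avg_word_len = char_count / word_count
hashtag_count = len([t for t in words if t.startswith('#')])
mention_count = len([t for t in words if t.startswith('@')])
numeric_count = len([t for t in words if t.isdigit()])
upper_count = len([t for t in words if t.isupper()])   # OMG, I and #NLP

print(word_count, hashtag_count, mention_count, numeric_count, upper_count)
# → 10 2 1 1 3
```

Note that `str.isupper()` is True for any token whose cased characters are all uppercase, so '#NLP' is counted as an uppercase word here.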

Preprocessing and Cleaning

Converting to lowercase

df['twitts'] = df['twitts'].apply(lambda x: str(x).lower())
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and ",
"won't": "would not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'}
def cont_to_exp(x):
    if type(x) is str:
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))
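One caveat: `str.replace` substitutes anywhere inside a string, so the 'dis' → 'this' rule above would also corrupt words like 'distance'. A safer alternative, sketched here as my own variant rather than the article's code, compiles the mapping into a single regex with word boundaries:

```python
import re

# Small subset of the contractions mapping above, for illustration
contractions = {"can't": "cannot", "i'm": "i am", "dis": "this"}

# \b word boundaries ensure only whole tokens are replaced; longest keys
# go first so e.g. "can't've" would win over "can't" in the full mapping
pattern = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(contractions, key=len, reverse=True))
    + r")\b"
)

def cont_to_exp_safe(x):
    if isinstance(x, str):
        return pattern.sub(lambda m: contractions[m.group(1)], x)
    return x   # non-string values pass through unchanged

print(cont_to_exp_safe("i'm sure dis distance can't change"))
# → i am sure this distance cannot change
```

With this version, 'distance' is left alone because 'dis' inside it has no word boundary after it.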
import re

df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+\b)', x))
df['emails_count'] = df['emails'].apply(lambda x: len(x))
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', '', x))
df['url_flags'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x))
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'[^\w ]+', "", x))
# ' '.join(x.split()) collapses runs of whitespace into single spaces,
# e.g. 'hi    hello     how are you' -> 'hi hello how are you'
df['twitts'] = df['twitts'].apply(lambda x: ' '.join(x.split()))
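The cleaning steps covered so far can be chained into a single helper. This is a minimal sketch under my own simplifications: the function name is mine, and the URL pattern is shortened from the article's longer regex:

```python
import re

def clean_tweet(text):
    text = str(text).lower()                                               # lowercase
    text = re.sub(r"[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+", "", text)  # drop emails
    text = re.sub(r"(http|https|ftp|ssh)://\S+", "", text)                 # drop URLs
    text = re.sub(r"[^\w ]+", "", text)                                    # drop special chars
    return " ".join(text.split())                                          # collapse spaces

print(clean_tweet("Check   THIS out: https://example.com  or mail me@test.com !!"))
# → check this out or mail
```

The ordering matters: emails and URLs must be stripped before the special-character pass, since that pass would otherwise destroy the '@', ':' and '/' characters the earlier patterns rely on.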
