IMDB Movie Reviews Sentiment Analysis with Machine Learning.

Damanpreets
6 min read · Jan 14, 2021

We have IMDB data containing movie reviews labelled with their sentiment, positive or negative. We'll use machine learning to build a binary classifier for these reviews.

We'll start by downloading the IMDB dataset:

https://www.dropbox.com/s/mdvgzifpfdd05iv/IMDB-Dataset.csv?dl=0

Importing Libraries

Let's start by importing the libraries we'll need:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
from nltk.stem import PorterStemmer
import spacy
nlp=spacy.load("en_core_web_sm")
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import pickle

Once the dataset is downloaded, we read it and print the first 5 rows:

df=pd.read_csv("IMDB-Dataset.csv")
df.head()

Next, we need to check whether the dataset is balanced, meaning the numbers of positive and negative reviews are equal or close to equal.

df.sentiment.value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

And the dataset is balanced: we have an equal number of positives and negatives!
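Since matplotlib and seaborn are already imported, a quick bar plot is an easy way to eyeball the balance as well. A minimal sketch (using the original "sentiment" column name, before any renaming):

# Quick visual check of the class balance
sns.countplot(x="sentiment", data=df)
plt.title("Positive vs negative reviews")
plt.show()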

Cleaning of Data

Renaming Columns

df=df.rename(columns={"sentiment":"sent","review":"rev"})

Contractions to Expansions

We need to handle contractions: for example, "don't" needs to be converted to "do not". This and many similar cases can be handled by the dictionary and function below:

contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and ",
"won't": "would not",
'dis': 'this',
'bak': 'back',
'brng': 'bring'}

def cont_exp(x):
    for key in contractions:
        x = x.replace(key, contractions[key])
    return x

df["clean"]=df.rev.apply(lambda x: cont_exp(x))

Stemming

Stemming is an important part of text cleaning. It is the process of reducing an inflected word to its word stem, base, or root form, typically by stripping suffixes and prefixes.

We are using PorterStemmer, which we imported above, to perform stemming here.

def porter(x):
    stemmer = PorterStemmer()
    return " ".join(stemmer.stem(word) for word in x.split())

df["clean"]=df.clean.apply(lambda x: porter(x))

We can perform lemmatization instead through the code snippet and function below (not applied to the data here):

def lemmer(x):
    val = []
    doc = nlp(x)
    for token in doc:
        if (token.lemma_ == "-PRON-") or (token.lemma_ == "be"):
            val.append(token.text)
        else:
            val.append(token.lemma_)
    return " ".join(val)

df.clean=df.clean.apply(lambda x: lemmer(x))
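For comparison (the exact output depends on the spaCy model version), lemmatization maps words to dictionary forms rather than truncated stems, while the checks above keep pronouns and forms of "be" as they appear in the text:

# e.g. "movies" -> "movie"; pronouns and "be"-forms stay as written
lemmer("the movies were amazing and entertaining")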

Once stemming or lemmatization has been performed on the dataset, we can move on to the next step of text cleaning.

Removing HTML tags, special characters, Stopwords and lowercasing

The text is full of HTML tags (e.g. <br>), special characters (e.g. %$#@), and stopwords (commonly used words such as "the", "a", "an"). We need to remove all of them and also convert the text to lowercase.

def worder(x):
    val="!@#$%^&*()1234'\"5<>:;?/67890.,-_"
    x=x.lower()                     # lowercase the text
    x=re.sub(r"<.*?>"," ",x)        # strip HTML tags
    val1=[]
    for char in x:                  # drop punctuation and digits
        if char not in val:
            val1.append(char)
    x="".join(val1)
    x=" ".join([t for t in x.split() if t not in stopwords])   # drop stopwords
    return x
df["clean"]=df.clean.apply(lambda x: worder(x))

Once all the cleaning and removal tasks are done, we can proceed with the next step of converting words to vectors using CountVectorizer (TfidfVectorizer can be used as well).

Creating a bag of words

Use the code below to convert the cleaned reviews into vectors:

y=np.array(df.sent.values)
cv=CountVectorizer(max_features=1000)
X=cv.fit_transform(df.clean).toarray()
print(X.shape,y.shape)
(50000, 1000) (50000,)
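If you prefer TF-IDF weighting over raw counts, the swap is a one-liner with the TfidfVectorizer we already imported. A sketch:

# Alternative: TF-IDF features instead of raw counts
tfidf=TfidfVectorizer(max_features=1000)
X_tfidf=tfidf.fit_transform(df.clean).toarray()
print(X_tfidf.shape)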

Train-Test Split

Use the code snippet below to split the dataset into training and test data:

xtrain, xtest, ytrain, ytest=train_test_split(X, y, test_size=0.3, random_state=123)
print(xtrain.shape,xtest.shape, ytrain.shape,ytest.shape)
(35000, 1000) (15000, 1000) (35000,) (15000,)

The dataset has been divided into 70:30 ratio of training and test data.
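Optionally, passing stratify=y makes sure both splits keep the same 50/50 class balance. A sketch:

# Optional: stratified split keeps the positive/negative ratio in train and test
xtrain, xtest, ytrain, ytest=train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)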

Model creation

We are going to create different models, train them on the training set, and then make predictions on the test set so we can compare accuracy and other metrics and pick the best one.

sv=SVC()                                    #SupportVectorClassifier
rf=RandomForestClassifier() #RandomForestClassifier
mn=MultinomialNB(alpha=1.0,fit_prior=True) #MultinomialNB

Fitting the models on the training dataset (this might take time depending on the type of machine you have, as the data is quite large):

mn.fit(xtrain, ytrain)
rf.fit(xtrain, ytrain)
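The SVC created above can be fit and evaluated the same way if you want to compare all three, though it can be noticeably slower on a 35,000 x 1,000 dense array. A sketch:

# Optional: fit the support vector classifier too (this can take a while)
sv.fit(xtrain, ytrain)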

Once fitting is done, we predict the outcomes on the test data:

predict_rf=rf.predict(xtest)
predict_mn=mn.predict(xtest)

I created a function to display the accuracy score, confusion matrix, and classification report for the predicted labels vs the actual labels:

def acc(y,pred):
    print("--Accuracy-->", accuracy_score(y,pred))
    print("--ConfuMat-->\n","\n",confusion_matrix(y,pred))
    print("\n--ClassRep-->\n","\n", classification_report(y,pred))

Output for RandomForestClassifier

acc(ytest,predict_rf)

--Accuracy--> 0.829
--ConfuMat-->

 [[6214 1255]
  [1310 6221]]

--ClassRep-->

precision recall f1-score support

negative 0.83 0.83 0.83 7469
positive 0.83 0.83 0.83 7531

accuracy 0.83 15000
macro avg 0.83 0.83 0.83 15000
weighted avg 0.83 0.83 0.83 15000

MultinomialNB

acc(ytest,predict_mn)

--Accuracy--> 0.8264
--ConfuMat-->

 [[6100 1369]
  [1235 6296]]

--ClassRep-->

precision recall f1-score support

negative 0.83 0.82 0.82 7469
positive 0.82 0.84 0.83 7531

accuracy 0.83 15000
macro avg 0.83 0.83 0.83 15000
weighted avg 0.83 0.83 0.83 15000
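Before testing on new reviews, the fitted model and the vectorizer can be saved with pickle (imported at the top) so they can be reloaded later; the file names here are just placeholders:

# Persist the trained model and the CountVectorizer for later reuse
with open("rf_model.pkl","wb") as f:
    pickle.dump(rf,f)
with open("count_vectorizer.pkl","wb") as f:
    pickle.dump(cv,f)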

Testing with a new REVIEW !!

I have taken one good review from one of my favourite movies, The Shawshank Redemption, and a bad review from one of the worst Bollywood movies, Coolie No. 1. I know I don't need a machine learning model to tell me whether these reviews are going to be positive or negative :D

The Shawshank Redemption:

shawshankRedemtion  =  "I have never seen such an amazing film since I saw The Shawshank Redemption. Shawshank encompasses friendships, hardships, hopes, and dreams. And what is so great about the movie is that it moves you, it gives you hope. Even though the circumstances between the characters and the viewers are quite different, you don't feel that far removed from what the characters are going through.It is a simple film, yet it has an everlasting message. Frank Darabont didn't need to put any kind of outlandish special effects to get us to love this film, the narration and the acting does that for him. Why this movie didn't win all seven Oscars is beyond me, but don't let that sway you to not see this film, let its ranking on the IMDb's top 250 list sway you, let your friends recommendation about the movie sway you.Set aside a little over two hours tonight and rent this movie. You will finally understand what everyone is talking about and you will understand why this is my all time favorite movie."

Let's test with the Shawshank Redemption review:

f1=cont_exp(shawshankRedemtion)
f2=lemmer(f1)
f3=worder(f2)
f4=cv.transform([f3])
rf.predict(f4)

array(['positive'], dtype=object)

POSITIVE

And performing the same operations on the Coolie data:

coolie="Sense less comedy, poor acting by Sara Ali Khan and Varun Dhawan. Old cooli no1 is very good.. But this one crossed limit in Bakwass.. My Suggestion is don't watch this movie.. Waste of 2 hours"f1=cont_exp(coolie)f2=lemmer(f1)f3=worder(f2)f4=cv.transform([f3])rf.predict(f4)
array(['negative'], dtype=object)

NEGATIVE
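To avoid repeating the f1/f2/f3/f4 steps for every new review, the whole pipeline can be wrapped in a small helper (the function name is mine, it just chains the same steps):

def predict_review(text, model=rf):
    # expand contractions, lemmatize, clean, vectorize, then predict
    cleaned=worder(lemmer(cont_exp(text)))
    vec=cv.transform([cleaned])
    return model.predict(vec)[0]

predict_review(coolie)   # same result as the step-by-step version above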

You can get the same data and code from my GitHub account, listed below:

Happy Binge watching !!!
