1. Data

Data source: https://www.cs.cornell.edu/people/pabo/movie-review-data/

The data was released with the paper "Thumbs up? Sentiment classification using machine learning techniques". The dataset contains 1386 movie reviews, each labeled with one of two sentiment classes: positive or negative. The reviews are stored as .txt files in folders named after the class label, so we have two folders, neg and pos, containing 692 and 694 .txt files respectively.

Naturally, this is a supervised classification problem: we need a machine learning model that learns to predict the polarity (positive or negative) of a movie review from its text.

Our final objective is to build an efficient and accurate ML model. The possible target values are negative and positive. Many metrics could be used to evaluate the model’s performance; in our case, we will use accuracy.
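As a minimal sketch of the metric, accuracy is the fraction of reviews whose predicted label matches the true label. The label vectors below are made up purely for illustration and are not results from this dataset. Since the two classes are almost balanced (692 vs. 694 reviews), accuracy is a reasonable summary here.

```r
# Hypothetical labels for five reviews (illustrative only, not real model output)
actual    <- factor(c("pos", "neg", "pos", "neg", "pos"), levels = c("neg", "pos"))
predicted <- factor(c("pos", "neg", "neg", "neg", "pos"), levels = c("neg", "pos"))

# Accuracy: share of predictions that agree with the true label
accuracy <- mean(predicted == actual)

# Confusion matrix: rows are predictions, columns are true labels
confusion <- table(Predicted = predicted, Actual = actual)

print(accuracy)
print(confusion)
```

The diagonal of the confusion matrix holds the correct predictions, so accuracy can equally be computed as sum(diag(confusion)) / sum(confusion).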

The reviews are already pre-processed, lower-cased text files. This is both a blessing and a curse. It’s a curse because capitalization can be an informative feature, and a blessing because lower-casing limits the number of input features to the ML model. So, we continue with the lower-cased text files.

We can lower the number of input features further using Stemming/Lemmatization. Redundant features like stop words can also be removed.

Since the dataset is organized as a folder structure, we start by loading it into our notebook using the DirSource function. DirSource reads the documents inside a directory and exposes them as a single source. Next, we pass this DirSource object to the VCorpus function. VCorpus creates a volatile corpus, i.e. one held entirely in memory, and it is the standard tm object on which most tm functions operate.

library(tm) # text-mining framework; attaching it also loads NLP
## Loading required package: NLP
neg_reviews = VCorpus(DirSource("data/mix20_rand700_tokens_0211/tokens/neg/"), readerControl = list(language = "en"))
pos_reviews = VCorpus(DirSource("data/mix20_rand700_tokens_0211/tokens/pos/"), readerControl = list(language = "en"))
neg_reviews # dimension of the corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 692
inspect(neg_reviews[1]) # first document of the corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 3217
inspect(neg_reviews[1:3]) # first three documents of the corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 3217
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 6143
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1767
neg_reviews[[1]]$content # text of the first negative review
## [1] "tristar / 1 : 30 / 1997 / r ( language , violence , dennis rodman ) cast : jean-claude van damme ; mickey rourke ; dennis rodman ; natacha lindinger ; paul freeman director : tsui hark screenplay : dan jakoby ; paul mones ripe with explosions , mass death and really weird hairdos , tsui hark's \" double team \" must be the result of a tipsy hollywood power lunch that decided jean-claude van damme needs another notch on his bad movie-bedpost and nba superstar dennis rodman should have an acting career . actually , in \" double team , \" neither's performance is all that bad . i've always been the one critic to defend van damme -- he possesses a high charisma level that some genre stars ( namely steven seagal ) never aim for ; it's just that he's never made a movie so exuberantly witty since 1994's \" timecop . \" and rodman . . . well , he's pretty much rodman . he's extremely colorful , and therefore he pretty much fits his role to a t , even if the role is that of an ex-cia weapons expert . it's the story that needs some major work . van damme plays counter-terrorist operative jack quinn , who teams up with arms dealer yaz ( rodman ) to rub out deadly gangster stavros ( mickey rourke , all beefy and weird-looking ) in an antwerp amusement park . the job is botched when stavros' son gets killed in the gunfire , and quinn is taken off to an island known as \" the colony \" -- a think tank for soldiers \" too valuable to kill \" but \" too dangerous to set free . \" quinn escapes and tries to make it back home to his pregnant wife ( natacha lindinger ) , but stavros is out for revenge and kidnaps her . so , what's a kickboxing mercenary to do ? quinn looks up yaz and the two travel to rome so they can rescue the woman , kill stavros , save the world and do whatever else the screenplay requires them to do . with crazy , often eye-popping camera work by peter pau and rodman's lite brite locks , \" double team \" should be a mildly enjoyable guilty pleasure . 
but too much tries to happen in each frame , and the result is a movie that leaves you exhausted rather than exhilarated . the numerous action scenes are loud and headache-inducing and the frenetic pacing never slows down enough for us to care about what's going on in the movie . and much of what's going on is just wacky . there's a whole segment devoted to net-surfing monks that i have yet to figure out . and the climax finds quinn going head-to-head with a tiger in the roman coliseum while yaz circles them on a motorcycle , trying to avoid running over land mines and hold on to quinn's baby boy ( who's in a bomb equipped basket ) -- all this while stavros watches shirtless from the bleachers . did i mention \" double team \" is strange ? when it all comes down , this is just another rarely entertaining formula killathon , albeit one that feels no need to indulge in gratuitous profanity . rodman juices things up with his blatantly vibrant screen persona , though , leading up to a stunt where he kicks an opponent between the legs . but we didn't need \" double team \" to tell us he could do that , did we ? <a9> 1997 jamie peck e-mail : jpeck1@gl . umbc . edu visit the reel deal online : http : //www . gl . umbc . edu/~jpeck1/ "

Negative reviews are stored in the neg_reviews variable and positive ones in pos_reviews. As mentioned above, they are VCorpus objects. One can peel off the abstraction by browsing the objects in the environment pane. The inspect function helps in understanding the data type of an object and its contents. It shows that each document has two main components: metadata and content. To look at the text of a particular document, we can use neg_reviews[[1]]$content.

This expression takes the first document of neg_reviews and accesses its content attribute, which is where the textual data is stored.
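The same access pattern can be tried without the dataset. The sketch below builds a tiny in-memory corpus with VectorSource; the two sentences are invented for illustration, and toy is a hypothetical variable name.

```r
library(tm)

# Toy corpus built from a character vector (sentences are made up)
toy <- VCorpus(VectorSource(c("a dull , lifeless film .",
                              "a warm and funny film .")))

doc <- toy[[1]]        # double brackets return a single document
print(class(doc))      # the document's class

# Two equivalent ways of reading the text of the document
text_a <- doc$content
text_b <- content(doc)
print(text_a)

# Document-level metadata (id, language, ...)
print(meta(doc))
```

Note the difference between toy[1], which returns a corpus containing one document, and toy[[1]], which returns the document itself.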

2. Preprocessing

Preprocessing: A series of operations performed to normalize the dataset. These operations include, but are not limited to, lower-casing, removing unwanted characters, stemming, stopword removal, etc.

Since these operations must be the same for both positive and negative reviews, we combine the two corpora into a single variable called reviews and perform the operations on this object. We combine them using the c() function. The resulting variable reviews contains 1386 documents.

The tm_map function applies a function to every document of a corpus. Specifically, it takes a corpus and a function as input, applies the function to each document, and returns the transformed corpus.

getTransformations() returns the list of available transformations: ‘removeNumbers’, ‘removePunctuation’, ‘removeWords’, ‘stemDocument’, and ‘stripWhitespace’ (see ?getTransformations for details). The names of the functions are intuitive enough to understand what they do. We can also apply custom functions, but they must first be converted to a form suitable for tm_map by wrapping them in content_transformer.
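A minimal sketch of a custom transformation follows. The helper name replace_pattern and the toy sentences are hypothetical; content_transformer wraps an ordinary character function so that it operates on the content of each document, and additional named arguments given to tm_map are forwarded to that function.

```r
library(tm)

# Toy corpus (sentences are made up for illustration)
toy <- VCorpus(VectorSource(c("great film -- see it twice",
                              "dull plot -- skip it")))

# Hypothetical helper: replace every match of `pattern` with `replacement`
replace_pattern <- content_transformer(
  function(x, pattern, replacement) gsub(pattern, replacement, x, fixed = TRUE)
)

# Extra named arguments are passed on to the wrapped function
toy_clean <- tm_map(toy, replace_pattern, pattern = "--", replacement = " ")
toy_clean <- tm_map(toy_clean, stripWhitespace)

print(content(toy_clean[[1]]))
```

Wrapping the function keeps the result a proper tm document, so further tm_map steps can be chained on it.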

To pass arguments to a transformation, supply them as additional named arguments to tm_map, which forwards them to the function. For example,

removeNumbers has an argument called ucp - a logical specifying whether to use Unicode character properties for determining digit characters. If FALSE (default), characters in the ASCII [:digit:] class (i.e., the decimal digits from 0 to 9) are taken; if TRUE, the characters with Unicode general category Nd (Decimal_Number).

So, we can use tm_map(reviews, removeNumbers, ucp = FALSE) to pass this argument.

reviews = c(neg_reviews, pos_reviews) # concatenate both corpora into one
reviews_post = tm_map(reviews, removeNumbers, ucp = FALSE) # remove digits

reviews_post=tm_map(reviews_post,removePunctuation) # Remove punctuations
reviews_post=tm_map(reviews_post, content_transformer(tolower)) # convert to lowercase

Perusing the dataset shows that web links are pre-tokenized, i.e. http://www.google.com has been split into http: / / www. google. com. This makes it hard to match whole URLs with a regular expression. So, we add these fragments to the list of stop words that will be removed.

# stopwords() function returns the list of stop words for a given language.
# We use this list
english_stopwords = stopwords("english") # list of english stopwords
english_stopwords = append(english_stopwords, c("http","http:", "https:","/","www.",".edu",".com",".in",".eu"))
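The removal step itself could look like the sketch below, which uses the built-in removeWords transformation on a toy corpus. The sentence and the variable name custom_stopwords are invented for illustration, and the link fragments are written as bare words (http, www, com) on the assumption that punctuation has already been stripped at this point.

```r
library(tm)

# Toy document with ordinary stop words and leftover link fragments (made up)
toy <- VCorpus(VectorSource(
  "the plot is thin visit http www example com for the full review"
))

# English stop words plus assumed link fragments
custom_stopwords <- c(stopwords("english"), "http", "www", "com")

toy <- tm_map(toy, removeWords, custom_stopwords) # drop the listed words
toy <- tm_map(toy, stripWhitespace)               # collapse the gaps left behind

cleaned <- content(toy[[1]])
print(cleaned)
```

Running stripWhitespace afterwards is useful because removeWords leaves the surrounding spaces in place.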


Lemmatization and stemming are special cases of normalization. The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, the outputs are different.

Stemming is more rudimentary: it chops off suffixes, often producing out-of-vocabulary words. Lemmatization relies on morphological analysis, aiming to remove inflectional endings only and to return the base or dictionary form of a word, known as the lemma. Lemmatization therefore tends to produce cleaner, more interpretable tokens than stemming, so we will lemmatize the corpus.
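To see the difference on a few words, the sketch below puts a stemmer and a lemmatizer side by side. It assumes the SnowballC package (which provides wordStem) is installed alongside textstem; the word list is arbitrary and only meant for illustration.

```r
library(SnowballC) # assumption: installed; provides wordStem()
library(textstem)  # provides lemmatize_words()

words <- c("studies", "running", "movies", "better")

comparison <- data.frame(
  word  = words,
  stem  = wordStem(words, language = "english"), # rule-based suffix stripping
  lemma = lemmatize_words(words),                # dictionary-based base forms
  stringsAsFactors = FALSE
)

print(comparison)
```

Stems such as those in the stem column need not be real words, whereas the lemma column is meant to contain dictionary forms.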

The textstem library provides a function called lemmatize_strings, which performs lemmatization. Below we apply it without tm_map, using a naive approach: we go through each document, update it, and move on to the next.

reviews_post=tm_map(reviews_post,stripWhitespace) # To remove any extra white spaces

library(textstem) # provides lemmatize_strings()
## Loading required package: koRpus.lang.en
## Loading required package: koRpus
## Loading required package: sylly
## For information on available language packages for 'koRpus', run
##   available.koRpus.lang()
## and see ?install.koRpus.lang()
## Attaching package: 'koRpus'
## The following object is masked from 'package:tm':
##     readTagged
# Lemmatize the data, one document at a time
for (i in seq_along(reviews_post)) {
  reviews_post[[i]]$content <- lemmatize_strings(reviews_post[[i]]$content)
}

3. Featurize

After pre-processing the movie review text files, we need to convert them into a format suitable for ML models. Machine learning models take numeric inputs, mostly in the form of matrices. So, we need to convert our text files into a matrix in which each row represents a document/review and each column (feature) a word. Bag-of-words is a representation in which a document is stored as a vector whose entries are word frequencies.
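The bag-of-words idea can be illustrated in base R on two toy documents. This is only a sketch of the concept, not the approach used on the real corpus; the documents and variable names are invented for illustration.

```r
# Two toy "reviews" (made up), already lower-cased and tokenized by spaces
docs <- c(d1 = "good movie good cast", d2 = "bad movie")

tokens <- strsplit(docs, " ", fixed = TRUE) # list of word vectors, one per document
vocab  <- sort(unique(unlist(tokens)))      # every distinct word becomes a feature

# Count how often each vocabulary word occurs in each document
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

bow <- t(counts)       # rows: documents, columns: words
colnames(bow) <- vocab
rownames(bow) <- names(docs)

print(bow)
```

Each row of bow is the feature vector of one review. Word order is discarded; only the counts remain.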