NLP: A Simple Python program to analyze reviews/comments and classify the review as Negative or Positive :)

in #nlp, 7 years ago (edited)

I was killing time with some NLP stuff recently and came across a few articles saying that machine learning has grown very powerful and will keep growing many times over (because the corpus/training data is increasing day by day).

Then I thought of trying a basic program that trains on a set of positive reviews and a set of negative reviews independently, and of testing it against a totally unseen set of reviews to check whether the algorithm can predict the type of review (+ive or -ive). So I came up with the Naive Bayes code below and tested it... To my surprise it works well, so well, sooo well.
I mean, given the size of my training data and the outputs I checked at random, the accuracy for Positive reviews was 92-100% and for Negative reviews 89-99%... yayyyy !!! :D
It was a good learning experience, so I am sharing my code below, just in case there are NLP-curious people around ^_^ :D
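
If you want to check numbers like these yourself, here is a minimal sketch of one way to measure per-class accuracy (the share of reviews of each class that got the right label). It assumes you have the true labels for your test reviews, which the script below does not read; the function name `per_class_accuracy` and the toy lists are made up for illustration.

```python
from collections import Counter

def per_class_accuracy(gold, predicted):
    """Return {label: share of reviews with that true label that were predicted correctly}."""
    total = Counter(gold)  # number of reviews per true label
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)  # correct predictions per label
    return {label: correct[label] / total[label] for label in total}

# Toy data (hypothetical): true labels vs. labels produced by a classifier
gold = ["POS", "POS", "NEG", "NEG", "NEG"]
predicted = ["POS", "NEG", "NEG", "NEG", "NEG"]
print(per_class_accuracy(gold, predicted))  # {'POS': 0.5, 'NEG': 1.0}
```

Reporting the two classes separately matters here, because a classifier that always answers "NEG" would still score 100% on the negative reviews.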


Feel free to reuse the code or send me edits or tips (y)


import re
import math
from collections import Counter

inputFile = open("../sample.txt").readlines() # Give unknown (unlabeled) data here as input
outputFile = open("../testdata.txt", 'w') # Expect this file to be your output :)
posReview = open("../hotelPosTive_train.txt", 'r').readlines() # +ive training corpus, download from the internet :P
negReview = open("../hotelNegTive_train.txt", 'r').readlines() # -ive training corpus, download from the internet :P

positiveWords = [] # all positive words from corpus
negativeWords = [] # all negative words from corpus

probPos = {} # dictionary with initial word likelihood probabilities in +ive review
probNeg = {} # dictionary with initial word likelihood probabilities in -ive review

for a1 in posReview:
    a1 = a1.strip().split()
    for b1 in range(1, len(a1)):
        S = re.sub('[^A-Za-z0-9]+', '', a1[b1]) # remove all special characters from the word
        positiveWords.append(S)
for a2 in negReview:
    a2 = a2.strip().split()
    for b2 in range(1, len(a2)):
        S = re.sub('[^A-Za-z0-9]+', '', a2[b2]) # remove all special characters from the word
        negativeWords.append(S)

positivePriorprob = len(posReview) / (len(posReview) + len(negReview)) # prior Probability for +ive reviews
negativePriorprob = len(negReview) / (len(posReview) + len(negReview)) # prior Probability for -ive reviews

freqPositiveWords = Counter(positiveWords) # positiveReviews words and their freq
freqNegativeWords = Counter(negativeWords) # negativeReviews words and their freq
uniquePositiveWords = set(positiveWords) # Positive unique words in +ive review corpus
uniqueNegativeWords = set(negativeWords) # Negative unique words in -ive review corpus
uniqueAllWords = set(positiveWords + negativeWords) # Total unique words in the corpus

def nBayesAlgorithm(review):
    probReviewPos = 0
    probReviewNeg = 0
    probFinalPos = 1
    probFinalNeg = 1

    for a3 in review:
        probPos[a3] = (freqPositiveWords[a3] + 1) / (
            len(positiveWords) + len(uniqueAllWords))  # smoothed +ive probability
        probNeg[a3] = (freqNegativeWords[a3] + 1) / (
            len(negativeWords) + len(uniqueAllWords))  # smoothed -ive probability
    for a4 in review:
        if a4 in uniqueAllWords:
            probReviewPos += math.log(probPos[a4])
            probReviewNeg += math.log(probNeg[a4])
        else:
            continue  # words never seen in training are ignored
    probFinalPos = probReviewPos + math.log(positivePriorprob)
    probFinalNeg = probReviewNeg + math.log(negativePriorprob)

    if float(probFinalPos) > float(probFinalNeg):
        return "POS"
    else:
        return "NEG"

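# ***************** Printing To Output File ***************
for a5 in inputFile:
    reviewList = a5.split()
    if not reviewList:
        continue  # skip blank lines in the input
    outputFile.write(reviewList[0] + " " + nBayesAlgorithm(reviewList[1:len(reviewList)]) + "\n")
outputFile.close()  # make sure all predictions are flushed to disk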
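
The script above needs the corpus files to run, so here is a small self-contained sketch of the same idea (word counts, add-one smoothing, log probabilities plus a log prior) on a made-up four-review corpus. It is only an illustration, not a drop-in replacement: the names `tokenize`, `train` and `classify` are hypothetical, and unlike the script it lowercases words and applies the same cleanup to the test text as to the training text, which is an assumption on my side.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed cleanup: lowercase, strip non-alphanumerics, drop empty tokens
    cleaned = (re.sub(r"[^a-z0-9]+", "", t) for t in text.lower().split())
    return [w for w in cleaned if w]

def train(pos_reviews, neg_reviews):
    pos_counts = Counter(w for r in pos_reviews for w in tokenize(r))
    neg_counts = Counter(w for r in neg_reviews for w in tokenize(r))
    n_reviews = len(pos_reviews) + len(neg_reviews)
    return {
        "counts": {"POS": pos_counts, "NEG": neg_counts},
        "totals": {"POS": sum(pos_counts.values()), "NEG": sum(neg_counts.values())},
        "priors": {"POS": len(pos_reviews) / n_reviews, "NEG": len(neg_reviews) / n_reviews},
        "vocab": set(pos_counts) | set(neg_counts),
    }

def classify(model, text):
    vocab_size = len(model["vocab"])
    scores = {}
    for label in ("POS", "NEG"):
        score = math.log(model["priors"][label])  # log prior
        for w in tokenize(text):
            if w in model["vocab"]:  # unseen words are ignored, as in the script
                smoothed = (model["counts"][label][w] + 1) / (model["totals"][label] + vocab_size)
                score += math.log(smoothed)  # sum of logs instead of a product of tiny numbers
        scores[label] = score
    return "POS" if scores["POS"] > scores["NEG"] else "NEG"

# Toy corpus (made up for illustration)
pos = ["great room and friendly staff", "lovely clean hotel"]
neg = ["dirty room and rude staff", "terrible noisy hotel"]
model = train(pos, neg)
print(classify(model, "friendly and clean"))  # POS
print(classify(model, "rude, noisy place"))   # NEG
```

Summing logs instead of multiplying the raw probabilities keeps long reviews from underflowing to zero, and the "+ 1" in the numerator keeps a word seen in only one class from zeroing out the other class entirely.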

Image Credits: Internet(Google image search)


I'm interested, I'm going to take a look at it