
A quick sentiment analysis

  • staniszradek
  • 21 Sep 2023
  • 3 min read

Updated: 25 Apr 2024


Today I am going to use the reviews I collected via web scraping in the previous post and try to evaluate their sentiment. For this purpose I will employ a pre-trained sentiment analyzer called VADER from the NLTK library.



If you recall the last post, we ended up with a dataframe that looks like this:



It is clear that some basic cleaning is required before any analysis. Let's start by removing HTML elements like '<br>' and '<br/>', as they are just residue from scraping the website. Then I will extract from each row the part that starts with 'Reviewed' and ends with a date. As a result we get a new column with the date stored as text (a string), while our 'reviews' column contains only the actual review.
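A minimal sketch of that cleaning step in pandas (the original code isn't shown here, so the exact regex pattern and the 'date' column name are my assumptions; only the 'reviews' column comes from the previous post):

```python
import pandas as pd

# assumption: df is the dataframe from the previous post,
# with the raw scraped text in a 'reviews' column

# drop leftover <br> / <br/> tags from scraping
df['reviews'] = df['reviews'].str.replace(r'<br\s*/?>', ' ', regex=True)

# pull the trailing 'Reviewed <Month> <day>, <year>' part into its own column
# (the exact pattern is an assumption about how the scraped text looks)
pattern = r'(Reviewed\s+\w+\.?\s+\d{1,2},\s+\d{4})'
df['date'] = df['reviews'].str.extract(pattern, expand=False)
df['reviews'] = df['reviews'].str.replace(pattern, '', regex=True).str.strip()
```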


Typically I would convert the date into datetime format so that I can manipulate it later on. In addition, let's pull the month and year out of the date. Now our df looks like this:
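Roughly like this, continuing with the column names assumed above:

```python
# convert the text date to a proper datetime
# ('Reviewed' prefix is stripped first; format inference is left to pandas)
df['date'] = pd.to_datetime(
    df['date'].str.replace('Reviewed', '', regex=False).str.strip(),
    errors='coerce',
)

# pull out month and year for later grouping
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
```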




The next step is to create a Sentiment Intensity Analyzer object and use it in a function to see which reviews are positive, negative or neutral. A description of how the analyzer assigns scores can be found here: https://vadersentiment.readthedocs.io/en/latest/pages/about_the_scoring.html
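Creating the analyzer is straightforward with NLTK's VADER (the vader_lexicon resource has to be downloaded once; the sample sentence is just an illustration):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of VADER's lexicon

sia = SentimentIntensityAnalyzer()

# polarity_scores returns 'neg', 'neu' and 'pos' proportions plus a
# normalized 'compound' score between -1 (most negative) and +1 (most positive)
print(sia.polarity_scores("The flight was cancelled and nobody helped us."))
```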




Threshold values for classifying reviews as positive, neutral, or negative are up to you. I want to spot the vast majority of negative reviews, so I set the threshold for negative reviews at a compound score below -0.05.
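A minimal classification function under those thresholds (the post only states the negative cut-off; mirroring it at +0.05 for positive reviews matches VADER's own suggested defaults and is my assumption here):

```python
def classify(review: str) -> str:
    """Label a review using VADER's compound score and the chosen thresholds."""
    compound = sia.polarity_scores(review)['compound']
    if compound < -0.05:
        return 'negative'
    if compound > 0.05:
        return 'positive'
    return 'neutral'

# store the label next to each review
df['sentiment'] = df['reviews'].apply(classify)
```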


Now let's pick 10 random reviews and see how the analyzer managed to classify them:
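In pandas that is just a random sample (the random_state is only there so the sample is reproducible):

```python
# inspect 10 random reviews together with their assigned labels
df.sample(10, random_state=1)[['reviews', 'sentiment']]
```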


After reading these 10 random reviews I must say I am really surprised at how well the analyzer performed. Out of 10 reviews I have only one doubt, regarding the last one (1508); however, that review is genuinely ambiguous, if I understand it correctly. The others seem to be classified really well.


Next, let's see the overall distribution of positive, negative and neutral reviews in our dataset.
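One way to get that distribution, assuming the 'sentiment' column sketched above:

```python
# percentage share of each sentiment class in the whole dataset
df['sentiment'].value_counts(normalize=True).mul(100).round(1)
```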



Roughly 40% of all reviews were classified as negative. If we look at the website hosting these reviews, we see an average rating for this company of 3.9/5, with 46% of reviews rated 1 star. Compared with the 40% classified as negative by the VADER analyzer, I would say that's not bad, especially since some reviews were left without any star rating at all.


source: https://www.consumeraffairs.com/travel/cheapoair.html?page=9#sort=recent&filter=none


Now that we have classified our reviews, we could focus, for example, on the negative ones and try to figure out the reasons behind all these complaints. We could count the most frequent words or phrases, etc. But that is a topic for another post.

Today, let's look at whether the rate of negative reviews is more or less constant across the years, or whether we can spot any trend.
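A sketch of that aggregation, using the 'year' and 'sentiment' columns assumed earlier:

```python
# percentage of negative reviews per year
negative_rate = (
    df.groupby('year')['sentiment']
      .apply(lambda s: (s == 'negative').mean() * 100)
)
print(negative_rate)
```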




And let's visualize our findings:
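For example with matplotlib (chart type and labels are my choices; the original figure may differ):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
negative_rate.plot(ax=ax, marker='o')
ax.set_xlabel('Year')
ax.set_ylabel('% of negative reviews')
ax.set_title('Share of negative reviews per year')
plt.tight_layout()
plt.show()
```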



Despite significant fluctuations, the chart above shows a downward trend in the percentage of negative reviews per year. That could be a good starting point for further analysis leading to valuable conclusions.


To sum up: in some cases web scraping can be a helpful way to collect data in an automated manner. Instead of copying and pasting roughly 8000 reviews, we can use the Beautiful Soup library to do it for us. By applying other packages like NLTK, we are able to conduct a detailed analysis of our customers' experiences directly from the reviews they posted on the internet.

One important thing to keep in mind before drawing any serious conclusions is that customers who review online are not fully representative: they are a self-selected sample of people who volunteered to leave a review (most probably they had either a very good or a very bad experience and wanted to share it). You can read more about this here:

https://en.wikipedia.org/wiki/Self-selection_bias


