COMP 5070: Sentiment Analysis for Summarizing Yelp Online Reviews
Added on 2023/06/11
Homework Assignment
AI Summary
This assignment undertakes sentiment analysis of Yelp reviews for restaurants and cafes using the Yelp_reviews.csv dataset, which contains over 1.5 million rows of data. The analysis is performed using R software, employing both qualitative and quantitative methods. The average review rating is found to be satisfactory, with positive words used more frequently than negative words. However, the average net sentiment score is low, indicating a potential area of concern. The analysis also explores the distribution of positive and negative words, overall sentiment, and the relationship between review length and star ratings. It identifies businesses with the highest and lowest ratings and examines the correlation between ratings, useful votes, and review length. The conclusion highlights that positive reviews are more prevalent than negative reviews, but the average sentiment is not satisfactory, and the usefulness of votes is statistically significant with review length. The appendix includes the R code used for the analysis.

Running head: SUMMARISING ONLINE REVIEWS VIA SENTIMENT MINING
Yelp Reviews
Name of Student:
Name of University:
Course ID:

Table of Contents
Answer 1
Answer 2
Answer 3
Answer 4
Answer 5
Answer 6
Answer 7
Answer 9
Appendix

Answer 1.
Introduction and Background:
“Yelp” is an online business that accepts and publishes reviews of local businesses and
daily life. Yelpers have written 71 million reviews to date. “Yelp” has become a crucial site,
especially for small businesses, which can succeed or close down on the strength of their
online reviews.
This analysis covers Yelp reviews of restaurants and cafes and summarises the “Online
Reviews” via sentiment analysis. It is based on the Yelp_reviews.csv file, which contains
1569264 rows and 12 columns. The file size, including Yelp_reviews.csv, is 122.5 MB. The data
are analysed with the “R” software. The analysis is both qualitative and quantitative.
Answer 2.
The average rating is satisfactory (mean = 3.743, SD = 1.311468). Hence, overall, the reviews
indicate satisfaction.
The number of characters per review ranges from 0 to 1047. The average number of characters
per review is 126 with SD = 115.498.
The number of positive words used in a review is 7.07 on average, with a standard deviation
of 5.927. It ranges from 0 to 94.
The number of negative words used in a review is 2.55 on average, with a standard deviation
of 3.25. It ranges from 0 to 65.
Hence, positive words are far more frequent than negative words in the Yelp reviews.
The average net sentiment score is very low, with mean = 4.522 and standard deviation = 4.522.
Its range is -59 to 80. The low average net sentiment is a cause for concern.
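The summary statistics above can be reproduced with a few lines of base R. The sketch below runs on a small invented data frame, not on Yelp_reviews.csv, and the helper name describe is hypothetical. It also assumes that net sentiment is the number of positive words minus the number of negative words; the report does not state this, although the reported means are consistent with it (7.07 - 2.55 = 4.52).

```r
# Toy stand-in for the review data; all values are invented for illustration.
reviews <- data.frame(
  stars     = c(5, 4, 1, 3, 5, 2),
  pos_words = c(9, 6, 1, 4, 12, 2),
  neg_words = c(1, 2, 5, 3, 0, 4)
)

# Assumption: net sentiment = positive words minus negative words.
reviews$net_sentiment <- reviews$pos_words - reviews$neg_words

# Hypothetical helper: mean, standard deviation, minimum and maximum of a vector.
describe <- function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# One row per variable, one column per statistic.
stats <- t(sapply(reviews, describe))
print(round(stats, 3))
```

Under this assumption the mean of net_sentiment always equals the mean of pos_words minus the mean of neg_words, which gives a quick consistency check on the reported figures.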
Answer 3.
Positive Words:
Among the first twenty sampled reviews, the modal number of positive words is 3, which
occurs 4 times.
[Figure: Frequency distribution of positive words (x-axis: positive words; y-axis: frequency)]
The distribution of the first twenty cases is positively (right) skewed.
Negative Words:
Among the first twenty sampled reviews, the modal number of negative words is 0, which
occurs 8 times, followed by 2 and 3 negative words, which occur 4 times each.
[Figure: Frequency distribution of negative words (x-axis: negative words; y-axis: frequency)]
The distribution of the first twenty cases is strongly positively (right) skewed.
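The mode and the direction of skew described above can be checked numerically. The following is a minimal sketch: the vector neg_counts is illustrative, chosen to be consistent with the frequencies reported for the first twenty reviews rather than read from the data file, and the moment-based skewness function is one common definition, not necessarily the one used in the original analysis.

```r
# Illustrative negative-word counts for 20 reviews (not read from the data file).
neg_counts <- c(rep(0, 8), rep(2, 4), rep(3, 4), 1, 4, 7, 8)

# Frequency table and modal value (the most frequent count).
freq <- table(neg_counts)
modal_value <- as.numeric(names(freq)[which.max(freq)])

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

print(freq)
print(modal_value)
print(skewness(neg_counts))
```

A positive skewness value corresponds to a right-skewed distribution, that is, a long tail towards large counts.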
Answer 4.
Out of the first 20 cases of net sentiment, only 3 values are positive and the remaining 17
values are negative. The values of the first 20 cases lie in the interval from -4 to 12.
[Figure: Frequency distribution of 'Overall Sentiment' (x-axis: overall sentiment; y-axis: frequency)]
Among the first 20 cases, the net sentiment value “3” occurs most often (5 times), followed
by the value “5” (3 times). Overall, the distribution is slightly negatively skewed.
Answer 5.
The samples indicate that:
Reviews with low ratings (stars 1 and stars 2) have the highest average review length, while
the highest rating (stars 5) has the lowest average review length.
Reviews with stars 2 have the highest median review length, while the highest rating (stars 5)
has the highest review length.
Review lengths are most widely scattered for the lowest rating (stars 1) and least scattered
for the highest rating (stars 5).
The interquartile range of review length is highest for the low ratings.
[Figure: Levels of stars vs. average review length (x-axis: levels of stars; y-axis: average review length)]
The bar plot shows that average review length is highest for low ratings (stars 1 and 2).
Beyond a rating of 2, the average review length decreases as the rating improves.
Yes, on average, positive reviews are lengthier than negative reviews (127 > 118).
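The per-rating comparison of review length can be sketched with tapply, which applies a function to review_length within each level of stars. The data frame below is invented for illustration and does not reproduce the reported figures.

```r
# Toy data; review lengths are invented for illustration.
reviews <- data.frame(
  stars         = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  review_length = c(150, 90, 160, 120, 130, 110, 125, 95, 100, 80)
)

# Mean, median and interquartile range of review length for each star level.
mean_len   <- tapply(reviews$review_length, reviews$stars, mean)
median_len <- tapply(reviews$review_length, reviews$stars, median)
iqr_len    <- tapply(reviews$review_length, reviews$stars, IQR)

print(rbind(mean = mean_len, median = median_len, IQR = iqr_len))
```

Calling barplot(mean_len) would draw a bar plot of the averages, and boxplot(review_length ~ stars, data = reviews) would show the medians and spreads side by side.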
Answer 6.
Businesses with highest and lowest rating:
Star1.
The lowest rating (stars 1) is most frequent for the business “6LM_Klmp3hOP0JmsMCKRqQ” and
least frequent for the business “--D12rW_xO8GuYBomlg9zw”.
Star2.
The rating (stars 2) is most frequent for the business “Xhg93cMdemu5pAMkDoEdtQ” and
least frequent for the business “--lemggGHgoG6ipd_RMb-g”.
Star3.
The moderate rating (stars 3) is most frequent for the business “Xhg93cMdemu5pAMkDoEdtQ” and
least frequent for the business “--4Pe8BZ6gj57VFL5mUE8g”.
Star4.
The high rating (stars 4) is most frequent for the business “4bEjOyTaDG24SY5TxsaUNQ” and
least frequent for the business “--4Pe8BZ6gj57VFL5mUE8g”.
Star5.
The highest rating (stars 5) is most frequent for the business “2e2e7WgqU1BnpxmQL5jbfw” and
least frequent for the business “--qeSYxyn62mMjWvznNTdg”.
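One way to find such businesses is to cross-tabulate business_id against stars and look up the row with the largest or smallest count in each column. This is only a sketch under that reading of the results: the business identifiers below are made up, and the original analysis may have ranked the businesses differently.

```r
# Toy data with made-up business identifiers.
reviews <- data.frame(
  business_id = c("biz_a", "biz_a", "biz_a", "biz_b", "biz_b",
                  "biz_c", "biz_c", "biz_c"),
  stars       = c(1, 1, 5, 1, 4, 5, 5, 2)
)

# Rows are businesses, columns are star levels, cells are review counts.
counts <- table(reviews$business_id, reviews$stars)

# Business with the most and the fewest reviews at each star level.
most_frequent  <- sapply(colnames(counts),
                         function(s) rownames(counts)[which.max(counts[, s])])
least_frequent <- sapply(colnames(counts),
                         function(s) rownames(counts)[which.min(counts[, s])])

print(counts)
print(most_frequent)
print(least_frequent)
```

Note that which.max and which.min return only the first match, so when several businesses tie (for example, many businesses with zero reviews at a given star level) the business reported depends on the row order.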
Answer 7.
Correlation:
The ratings (stars) and the number of useful votes (votes_useful) are essentially uncorrelated
(correlation coefficient = -0.04897). “votes_useful” is moderately and positively correlated
with the length of the reviews (correlation coefficient = 0.3258).
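Correlation coefficients of this kind can be obtained with cor, which by default computes Pearson's coefficient. The sketch below uses invented values, so its output will not match the coefficients reported above.

```r
# Toy data; all values are invented for illustration.
reviews <- data.frame(
  stars         = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  review_length = c(200, 150, 120, 100, 80, 60, 90, 140, 40, 30),
  votes_useful  = c(4, 3, 2, 2, 1, 1, 1, 3, 0, 0)
)

# Pairwise Pearson correlations between the three variables.
cor_matrix <- cor(reviews[, c("stars", "votes_useful", "review_length")])
print(round(cor_matrix, 4))

# A single coefficient: useful votes against review length.
r_votes_length <- cor(reviews$votes_useful, reviews$review_length)
print(r_votes_length)
```

If a significance test for an individual coefficient is needed, cor.test(reviews$votes_useful, reviews$review_length) reports a p-value and a confidence interval.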
Regression:
A linear regression model is fitted with “votes_useful” as the dependent variable and
“stars” and “review_length” as independent variables. The p-values show that review
length has a significant association with votes_useful, but “stars” does not have a
significant linear association with votes_useful.
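A model of this form can be fitted with lm, and the p-values read from the coefficient table of its summary. The sketch below uses invented data and only illustrates the mechanics; it does not reproduce the reported result.

```r
# Toy data; all values are invented for illustration.
reviews <- data.frame(
  stars         = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  review_length = c(200, 150, 120, 100, 80, 60, 90, 140, 40, 30),
  votes_useful  = c(4, 3, 2, 2, 1, 1, 1, 3, 0, 0)
)

# votes_useful as the dependent variable; stars and review_length as predictors.
fit <- lm(votes_useful ~ stars + review_length, data = reviews)

# The coefficient table holds estimates, standard errors, t values and p-values.
coef_table <- summary(fit)$coefficients
p_values   <- coef_table[, "Pr(>|t|)"]

print(round(coef_table, 4))
print(p_values < 0.05)  # TRUE where a term is significant at the 5% level
```

The 5% threshold used in the last line is a conventional choice and an assumption here; the report does not state which significance level was applied.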
Answer 9.
Conclusion:
The overall analysis shows that positive reviews outnumber negative reviews, which indicates
an optimistic rather than a pessimistic attitude of people towards “lodging and food”. Low
ratings are found to be more heavily reviewed than high ratings. However, the average
sentiment of people towards the restaurants and cafes is not satisfactory. The number of useful
votes also has a statistically significant association with review length. Negative ratings are
discussed at greater length than positive ratings. Nevertheless, positive feedback is slightly
more frequent than negative feedback in almost all the cases discussed.

Appendix:
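# Load the review data (opens a file picker; choose Yelp_reviews.csv)
my_data <- read.csv(file.choose())
head(my_data)  # preview the first rows rather than printing the whole table
my_data <- as.data.frame(my_data)
library(dplyr)
# Extract the columns used in the analysis by position
business_id <- my_data[, 2]
stars <- my_data[, 4]
review_length <- my_data[, 5]
votes_useful <- my_data[, 7]
pos_words <- my_data[, 10]
neg_words <- my_data[, 11]
net_sentiment <- my_data[, 12]
# Summary statistics and standard deviation of each variable
summary(stars)
sd(my_data$stars)
summary(review_length)
sd(my_data$review_length)
summary(pos_words)
sd(my_data$pos_words)
summary(neg_words)
sd(my_data$neg_words)
summary(net_sentiment)
sd(my_data$net_sentiment)
# ----------------------------------
# Frequency distribution of positive words in the first 20 reviews
t1 <- table(my_data[1:20, 10])
t1
as.data.frame(t1)
# Plot the frequency table t1
barplot(t1, xlab = "positive words", ylab = "frequency",
        main = "Frequency distribution of Positive words", col = "green")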





