Analysis of MovieLens Database Using SQL Queries - Data Science

Verified

Added on  2020/09/22

|7
|1319
|264
Homework Assignment
AI Summary
This assignment focuses on analyzing the MovieLens dataset using SQL queries. It begins with an overview of the database schema, including entity-relationship diagrams and relational schema design, ensuring adherence to normalization principles. The core of the assignment involves formulating and executing SQL queries to extract meaningful insights from the dataset. These queries address a range of questions, such as identifying users who contributed tags, counting movies with specific genre combinations, analyzing rating distributions, listing movies released within a particular time frame, and determining the most highly rated movies based on different criteria, including weighted ratings. The solutions provided demonstrate the ability to write effective SQL queries, interpret query results, and derive valuable conclusions from the data. The assignment covers a wide range of SQL functionalities, including string manipulation, aggregation, joining tables, and ranking data. The document is contributed by a student to be published on the website Desklib, which provides all the necessary AI based study tools for students.
Document Page
Part 1:
ERD Relationship
Relational Schema
i) There are two trains running at 2PM on a weekday.
ii) There are Three Trains depart form the Flinders st ,Frankston and Pakenham
The design has been made for be no partial dependences of any of the columns on the
primary key. it’s in the first normal form and all the non-key attributes are fully
functional dependent on the primary key. it is satisfying the conditions of the second
normal form.
tabler-icon-diamond-filled.svg

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
Part 2-SQL queries
Question Answer Code
1 How many users contributed tags? 58 select count(distinct userid)
from bus5dwr.dbo.movielens_tags;
2 How many movies had exactly 3 genres
associated with them? (Comedy|Drama|
Romance would be one example.)
2338 select count(distinct movieid)
from
(
select movieid,title,genres,count(value) as cnt
from
(
select *
from bus5dwr.dbo.movielens_movies
cross apply string_split(genres,'|')
) as M
group by movieid,title,genres
) as N
where cnt= 3;
3 How many users gave ratings that are
not whole numbers (e.g. 3.5, 4.5)?
369 SELECT COUNT(DISTINCT userid)
from bus5dwr.dbo.movielens_ratings
WHERE rating LIKE '%.%';
4 List the movies whose title (and year)
appear more than once in the
movielens_movies table.
Title
1-"Confessions of a
Dangerous Mind (2002)"
2-"Emma (1996)"
3-"Eros (2004)"
SELECT title
from bus5dwr.dbo.movielens_movies
group by title
having COUNT(title)> 1;
Document Page
4-"Saturn 3 (1980)"
5-"War of the Worlds
(2005)"
5 How many movies were released
between 1985 and 1995 inclusive?
1879 SELECT title
FROM bus5dwr.dbo.movielens_movies
where right(title, 6) >='(1985)'and right(title,6)
<='(1995)';
6 Which movies in the movielens_movies
table did not receive a rating? Write a
query to list five of these movies.
# title
1-"Twentieth Century
(1934)"
2-"This Gun for Hire
(1942)"
3-"Parallax View, The
(1974)"
4-"For All Mankind
(1989)"
5-"I Know Where I'm
Going! (1945)"
SELECT TOP 5 title FROM dbo.movielens_movies
LEFT JOIN movielens_ratings ON
movielens_ratings.movieId=movielens_movies.movieId
where rating is NULL
7 How many users tagged three or more
different movies?
26 select count (DISTINCT userid)
from
(
select userid,count(DISTINCT movieid) as cnt
from bus5dwr.dbo.movielens_tags
group by userid
) as M
where cnt >=3
Document Page
8 What are the names of the five movies
that had the most ratings? Show the title
and number of ratings in two columns.
# Title Num_ratings
1-"Forrest Gump (1994)"
"329"
2-"Shawshank
Redemption, The (1994)"
"317"
3-"Pulp Fiction (1994)"
"307"
4-"Silence of the Lambs,
The (1991)" "279"
5-"Matrix, The (1999)"
"278"
select title ,Num_ratings
from
(
select title,Num_ratings,rank() over(order by
Num_ratings DESC) Rnk
from
(
select movieid ,count(userId) as
Num_ratings
from
bus5dwr.dbo.movielens_ratings
group by movieid
) as Rating
Left join bus5dwr.dbo.movielens_movies as
Movies
on Rating.movieid = Movies.movieId
) as M
where RNk < 6
9 What are the five tags that had the most
distinct users contributing them? Show
the tag and number of users in two
columns.
# Tag Num_users
1-"atmospheric" "10"
2-"funny" "10"
3-"sci-fi" "10"
4-"Comedy" "9"
5-"dark comedy" "9"
SELECT top 5 tag, count(distinct userId)
FROM bus5dwr.dbo.movielens_tags group by tag having
count(distinct userid) > 1 order by count(distinct
userid) desc;
10 What are the three highest rated movies
by average rating where one of the
genres is ‘Mystery’ and
Title Average_rating
1-"Memories of Murder
(Salinui chueok) (2003)"
SELECT top 3 a.title as Title, avg(b.rating) as
Average_rating
FROM bus5dwr.dbo.movielens_movies a
tabler-icon-diamond-filled.svg

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
the number of ratings for the movie is at
least five? Show the movie name and
average rating in two columns.
"4.7"
2-"Incendies (2010)"
"4.4"
3-"Witness for the
Prosecution (1957)"
"4.388889"
left join bus5dwr.dbo.movielens_ratings b
on a.movieid = b.movieid where a.genres like ('%Mystery
%')
group by a.title having count(b.rating) >= 5 order by
avg(b.rating) desc;
11 Find the five genres with the most
distinct tags. Show the genre and tag
count in two columns. For example, a
movie having two genres Action|Thriller
will have its tags contribute to each of
the Action and Thriller genres.
Hint: see example B in
https://docs.microsoft.com/en-us/sql/t-
sql/functions/string-split-transact-sql?
view=sql-server-2017 to see how to split
the genres field.
#-Genres Tag_Count
1-"Drama" "977"
2-"Comedy" "644"
3-"Thriller" "578"
4-"Action" "482"
5-"Crime" "478"
select top 5 value as genres, count(distinct b.tag)
Tag_Count
from bus5dwr.dbo.movielens_movies a CROSS APPLY
STRING_SPLIT(a.genres, '|')
left join bus5dwr.dbo.movielens_tags b
on a.movieId = b.movieId
group by value order by count(distinct b.tag) desc;
12 Let us attempt to find the highest rated
movies, but not simply by computing the
average rating, since that would favour
movies with a small number of ratings.
We will make use of the following
formula.
Weighted rating of a movie =
(V*R+10*C)/(V+10), where
V = number of votes for the
movie
R = average (mean) rating for the
movie
# -Title Weighted_rating
1-"Pulp Fiction (1994)"
-"4.175128"
2-"Guardians of the
Galaxy (2014)"
-"3.97124"
3-"Django Unchained
SELECT top 5 a.title ,(count(b.rating)* avg(b.rating) +
10 * (select avg(rating) FROM
bus5dwr.dbo.movielens_ratings))/ (count (b.rating) + 10)
as weighted_rating
FROM bus5dwr.dbo.movielens_movies a LEFT JOIN
bus5dwr.dbo.movielens_ratings b on a.movieid=b.movieId
where b.movieId in (select distinct movieid FROM
bus5dwr.dbo.movielens_tags where tag ='funny')
GROUP by a.title ORDER BY (COUNT(b.rating)* AVG(b.rating)
+10* (SELECT AVG(rating) FROM
bus5dwr.dbo.movielens_ratings))/(COUNT(b.rating) + 10)
desc;
Document Page
C = the average rating across all
movies
This formula represents the equivalent of
adding 10 additional ratings to each
movie, each equal to the average rating
across all movies, then finding the
resulting average. Hence a movie with a
single rating will not have too high a
weighted rating.
Find the names and ratings of the five
highest rated movies according to the
formula above having the tag ‘funny’.
(2012)"- "3.889081"
4-"Big Lebowski, The
(1998)" - "3.8880653"
5-"Big Short, The (2015)"
"3.8337657"
Document Page
chevron_up_icon
1 out of 7
circle_padding
hide_on_mobile
zoom_out_icon
[object Object]