logo

Assignment on Different Types of Recommendation Systems

8 Pages3049 Words287 Views
   

Added on  2019-10-12

Assignment on Different Types of Recommendation Systems

   Added on 2019-10-12

ShareRelated Documents
Assignment OverviewThis assignment contains two parts. First, you will implement an LSH algorithm, using bothCosine and Jaccard similarity measurement, to find similar products. Second, you will implementa collaborative-filtering recommendation system. The datasets you are going to use are theAmazon Review datasets. The task sections below explain the assignment instructions in detail.The goal of the assignment is to make you understand how different types of recommendationsystems work and more importantly, try to find a way to improve the accuracy of therecommendation system yourself. Environment RequirementsPython: 2.7 Scala: 2.11 Spark: 2.2.1IMPORTANT: We will use these versions to compile and test your code. If you use otherversions, there will be a 20% penalty since we will not be able to grade it automatically.You can only use Spark RDD.Submission DetailsFor this assignment you will need to turn in a Python, Java, or Scala program depending on yourlanguage of preference.Your submission must be a .zip file with name: <Firstname>_<Lastname>_hw3.zip. The structure ofyour submission should be identical as shown below. The Firstname_Lastname_Description.pdf filecontains helpful instructions on how to run your code along with other necessary information asdescribed in the following sections. The OutputFiles directory contains the deliverable output files foreach problem and the Solution directory contains your source code.
Assignment on Different Types of Recommendation Systems_1
DatasetsWe are continually using Amazon Review data. This time we use the Amazon Instant Videocategory. We have already transferred the string id of user and product to integers for yourconvenience. You should download two files from Blackboard:1.video_small_num.csv2.video_small_testing_num.csvTask1: LSH (50%)LSH AlgorithmIn this task, you will need to develop the LSH technique using the video_small_num.csv file. Thedataset is provided to you under the /data folder of the bundled.zip file of the assignment. Thegoal of this task is to find similar products according to the ratings of the users. In order to solvethis problem, you will need to read carefully the sections 3.3 – 3.5 from Chapter 3 of the Miningof Massive Datasets book.In this problem, we will focus on 0-1 ratings rather than the actual ratings of the users. To be morespecific, if a user has rated a product, then his contribution to the characteristic matrix is 1 while ifhe hasn’t rated a product his contribution is 0. Our goal is to identify similar products whoseSimilarity is greater or equal to 0.5.In task 1, you’re required to implement two approach using different similarity measurements:1.Jaccard based LSH (30%) Implementation Guidelines - ApproachThe original characteristic matrix must be of size [users] x [products]. Each cell contains a 0 or1 value depending on whether the user has rated the product or not. Once the matrix is built,you are free to use any collection of hash functions that you think would result in a moreconsistent permutation of the row entries of the characteristic matrix.Some potential hash functions could be of type:f(x)= (ax + b) % morf(x) = ((ax + b) % p) % mwhere p is any prime number and m is the number of bins.You can use any value for the a, b, p or m parameters of your implementation.Once you have computed all the required hash values, you must build the Signature Matrix. Once the Signature Matrix is built, you must divide the Matrix into b bands with r rows each, where
Assignment on Different Types of Recommendation Systems_2
bands x rows = n (n is the number of hash functions), in order to generate the candidate pairs.Remember that in order for two products to be a candidate pair their signature must agree (i.e.,be identical) with at least one band.Once you have computed the candidate pairs, your final result will be the candidate pairs whoseJaccard Similarity is greater than or equal to 0.5. After computing the final similar items, youmust compare your results against the provided ground truth dataset using the precision andrecall metrics. The ground truth dataset contains all the products pairs that have Jaccardsimilarity above or equal to 0.5. The ground truth dataset is located under the /data folder of theassignment’s bundled .zip file and named as video_small_ground_truth_jaccard.csv.Example of Jaccard Similarity:user1user2user3user4product10111product20100Jaccard Similarity (product1, product2) =#products_intersection / #products_union = 1/3Execution ExampleThe program that you will implement should take two parameters as input and generate one file as an output. The first parameter must be the location of the ratings.csv file and the second onemust be the path to the output file followed by the name of the output file. The name of the output file must be Firstname_Lastname_SimilarProducts_Jaccard.txt. The content of thefile must follow the same directions of question 1 in the Questions & Grades Breakdownsection below. If your program does not generate this file or it does not follow the specificationsas described in the following section, there would be a penalty of 50%.Java/Scala Execution Example:Please use JaccardLSH as class namePython Execution Example:Grading:In order to get full credit for this question you should have precision >= 0.9 and recall >= 0.8.If not, then you will get partial credit based on the formula:(Precision / 0.9) * 15 + (Recall / 0.8) * 15And your run time should be less than 120 seconds, or there’ll be 20% penalty.2.Cosine based LSH (20%)Please refer to https://github.com/soundcloud/cosine-lsh-join-spark and use the library to implement.
Assignment on Different Types of Recommendation Systems_3

End of preview

Want to access all the pages? Upload your documents or become a member.

Related Documents
Assignment - Girvan-Newman Algorithm | Spark Framework
|5
|1250
|54

Deliverables. There are two different programs to submi
|2
|413
|220

Programming Assignment | Program Creation
|4
|1061
|515

R-Programming Assignment
|5
|1206
|84