Assignment on Different Types of Recommendation Systems
Added on - 12 Oct 2019
Showing pages 1 to 3 of 8 pages
Assignment OverviewThis assignment contains two parts. First, you will implement an LSH algorithm, using bothCosine and Jaccard similarity measurement, to find similar products. Second, you will implementa collaborative-filtering recommendation system. The datasets you are going to use are theAmazon Review datasets. The task sections below explain the assignment instructions in detail.The goal of the assignment is to make you understand how different types of recommendationsystems work and more importantly, try to find a way to improve the accuracy of therecommendation system yourself.Environment RequirementsPython: 2.7 Scala: 2.11 Spark: 2.2.1IMPORTANT: We will use these versions to compile and test your code. If you use otherversions, there will be a 20% penalty since we will not be able to grade it automatically.You can only use Spark RDD.Submission DetailsFor this assignment you will need to turn in a Python, Java, or Scala program depending on yourlanguage of preference.Your submission must be a .zip file with name:<Firstname>_<Lastname>_hw3.zip. The structure ofyour submission should be identical as shown below. The Firstname_Lastname_Description.pdf filecontains helpful instructions on how to run your code along with other necessary information asdescribed in the following sections. TheOutputFilesdirectory contains the deliverable output files foreach problem and theSolutiondirectory contains your source code.
DatasetsWe are continually using Amazon Review data. This time we use the Amazon Instant Videocategory. We have already transferred the string id of user and product to integers for yourconvenience. You should download two files from Blackboard:1.video_small_num.csv2.video_small_testing_num.csvTask1: LSH (50%)LSH AlgorithmIn this task, you will need to develop the LSH technique using thevideo_small_num.csvfile. Thedataset is provided to you under the/datafolder of the bundled.zip file of the assignment. Thegoal of this task is to find similar products according to the ratings of the users. In order to solvethis problem, you will need to read carefully the sections 3.3 – 3.5 from Chapter 3 of the Miningof Massive Datasets book.In this problem, we will focus on 0-1 ratings rather than the actual ratings of the users. To be morespecific, if a user has rated a product, then his contribution to the characteristic matrix is 1 while ifhe hasn’t rated a product his contribution is 0.Our goal is to identify similar products whoseSimilarity is greater or equal to 0.5.In task 1, you’re required to implement two approach using different similarity measurements:1.Jaccard based LSH (30%)Implementation Guidelines - ApproachThe original characteristic matrix must be of size [users] x [products]. Each cell contains a 0 or1 value depending on whether the user has rated the product or not. Once the matrix is built,you are free to use any collection of hash functions that you think would result in a moreconsistent permutation of the row entries of the characteristic matrix.Some potential hash functions could be of type:f(x)= (ax + b) % morf(x) = ((ax + b) % p) % mwherepisany prime numberandmis thenumber of bins.You can use any value for thea, b, p or mparameters of your implementation.Once you have computed all the required hash values, you must build the Signature Matrix. Oncethe Signature Matrix is built, you must divide the Matrix intobbands withrrows each, where
bands x rows = n (n is the number of hash functions),in order to generate the candidate pairs.Remember that in order for two products to be a candidate pair their signature must agree (i.e.,be identical) with at least one band.Once you have computed the candidate pairs, your final result will be the candidate pairs whoseJaccard Similarity is greater than or equal to 0.5. After computing the final similar items, youmust compare your results against the provided ground truth dataset using the precision andrecall metrics. The ground truth dataset contains all the products pairs that have Jaccardsimilarity above or equal to 0.5. The ground truth dataset is located under the/datafolder of theassignment’s bundled .zip file and named asvideo_small_ground_truth_jaccard.csv.Example of Jaccard Similarity:user1user2user3user4product10111product20100Jaccard Similarity (product1, product2)=#products_intersection / #products_union = 1/3Execution ExampleThe program that you will implement should take two parameters as input and generate one fileas an output. The first parameter must be the location of theratings.csvfile and thesecond onemust be the path to the output file followed by the name of the output file. The name ofthe output file must beFirstname_Lastname_SimilarProducts_Jaccard.txt. The content ofthefile must follow the same directions of question 1 in theQuestions & Grades Breakdownsectionbelow. If your program does not generate this file or it does not follow thespecificationsas described in the following section, there would be a penalty of 50%.Java/Scala Execution Example:Please useJaccardLSHas class namePython Execution Example:Grading:In order to get full credit for this question you should haveprecision >= 0.9 and recall >= 0.8.If not, then you will get partial credit based on the formula:(Precision / 0.9) * 15 + (Recall / 0.8) * 15And your run time should be less than120 seconds, or there’ll be20% penalty.2.Cosine based LSH (20%)Please refer tohttps://github.com/soundcloud/cosine-lsh-join-sparkand use the library to implement.