Friendship Network Updates and Logging
In this assignment you will complete the following tasks:
1. Answer 5 questions that exercise material we covered in class
General instructions:
This assignment does not have a coding section. Please answer the following questions. Please
be concise and clear in your answers.
Important: Make sure you view this file while logged in as your nyu.edu account.
Step 1: Make a copy of this file and rename it <your full name> ex4.
Step 2: In the new document answer the following questions:
Part 1: Consistency, sharding, MapReduce, FileSystem, APIs
Q1: Sharding
Think back to the data model you created in ex2 for the subletting site. Assume that there are
only 3 models:
● User: A registered user
● Apartment
● Listing: An open or closed listing. A closed listing always has a booker field pointing to a
user.
The time has come to scale-out your site and shard your data model.
Q1a) Describe how you would shard your data: What field would you shard by? If there is any
hierarchical structure - describe it. (Example in Scalica, Posts are nested under and co-located
with the user who posted them).
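The Scalica example above (posts co-located with their author) can be illustrated with a minimal routing sketch. The shard count, the hashing choice, and the function names below are assumptions made for illustration only; they are not taken from Scalica itself.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for this example


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a user id to a shard index using a stable hash of the id."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_for_post(author_id: int, post_id: int) -> int:
    """Route a post by its author's id, so it is stored on the author's shard.

    The post id is deliberately ignored for routing: this is what makes
    posts "nested under" the user who wrote them.
    """
    return shard_for(author_id)
```

Because the child record is routed by its parent's key, reading a user together with all of that user's posts touches a single shard.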
Your answer goes here:
Q1b) Which of the following three views will be costly to produce under your sharding scheme?
Why?
1) Display all bookings for a user (past, current and future)
2) Display the apartments a user owns
3) Display all listings for an apartment (open and closed)
Your answer goes here:
Q1 bonus for 5 points) How do you propose to solve the problem in Q1b)?
Q2: Fan in/Fan out
Scalica has 3 models. All are sharded by user id: ScalicaUser (information about a user), Post
(stores the posts the user authored), and Following (the user ids of users this user is following).
In Scalica, the timeline of a user (say, Alice) is constructed as follows: The list of users Alice follows is
read. These users are “bucketed” into the different shards they are in. For each shard a single
read gets all the posts for the people Alice follows in that shard. The results from all shards are
combined and sorted by time of posting, and then displayed. This is done every time Alice loads
her home page. This technique is called fan-in (you fan in the data on demand).
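The fan-in read path described above can be sketched as follows. This is a minimal in-memory illustration, not Scalica's code: the modulo shard function, the representation of a shard as a dict, and the newest-first ordering are all assumptions.

```python
from collections import defaultdict

NUM_SHARDS = 3  # hypothetical shard count


def shard_of(user_id):
    """Hypothetical shard function: a plain modulo on the user id."""
    return user_id % NUM_SHARDS


def fan_in_timeline(following, shards):
    """Build a timeline on demand.

    following: list of user ids that the reader follows.
    shards: list of dicts, each mapping user id -> list of (timestamp, text).
    """
    # Bucket the followed users by the shard that holds their posts.
    buckets = defaultdict(list)
    for user_id in following:
        buckets[shard_of(user_id)].append(user_id)

    # One read per shard fetches the posts of every followed user in it.
    posts = []
    for shard_index, user_ids in buckets.items():
        shard = shards[shard_index]
        for user_id in user_ids:
            for timestamp, text in shard.get(user_id, []):
                posts.append((timestamp, user_id, text))

    # Combine and sort by time of posting (newest first is an assumption).
    posts.sort(key=lambda post: post[0], reverse=True)
    return posts
```

The cost of this approach grows with the number of shards touched on every page load, which is the motivation for precomputing the timeline.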
To improve the home page’s load time, you decide to switch to fan-out. This means that you
will construct beforehand and store the timeline for each user, and then simply read it when the
user loads their home page (a single read) and display it. Give a general description of how you
would implement fan-out. In your description include:
● Any changes to the data model (if any are needed)
● Storage systems you propose to use
● What the system does when a user posts a new post
● What the system does to read the timeline
Keep your answer shorter than half a page.
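To make the terminology concrete, here is one possible shape of a fan-out write path, as a rough in-memory sketch only. The class name, the reverse `followers` index, the timeline cap, and the assumption that posts arrive in time order are all hypothetical; a real design would also have to choose actual storage systems.

```python
from collections import defaultdict, deque

MAX_TIMELINE = 100  # hypothetical cap on stored timeline entries


class FanOutStore:
    """Toy store that keeps one precomputed timeline per user."""

    def __init__(self):
        # Reverse index (author id -> follower ids). Scalica's Following
        # model stores the opposite direction, so this is an added structure.
        self.followers = defaultdict(set)
        self.timelines = defaultdict(lambda: deque(maxlen=MAX_TIMELINE))

    def follow(self, follower_id, author_id):
        self.followers[author_id].add(follower_id)

    def publish(self, author_id, timestamp, text):
        # Write path: push the new post onto every follower's timeline.
        # Assumes posts are published in increasing timestamp order.
        for follower_id in self.followers[author_id]:
            self.timelines[follower_id].appendleft((timestamp, author_id, text))

    def read_timeline(self, user_id):
        # Read path: a single lookup, with no per-shard gathering or sorting.
        return list(self.timelines[user_id])
```

The trade-off is that the work moves from read time to write time: a post by a user with many followers triggers many timeline writes.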
Your answer goes here:
Q3: MapReduce
Your service has a social graph. That is, you store friendship relationships. Friendship is
symmetric, so if Alice is Bob’s friend, then Bob is Alice’s friend. In your database, you store for
each user a list of her friends. So if Alice and Bob are friends, Bob’s user id should be stored in
Alice’s friends list and vice-versa.
Your user base is very large, so your database is sharded by user ids. As a result, when your
system marks two users (Alice and Bob) as friends, it adds each to the other’s friends list. But
because the two users’ data can be in different shards, the updates are NOT done inside a
transaction (because distributed transactions are expensive and complicated).
Due to machine failures and other issues, sometimes the process only updates the friends list of
the first user. This introduces inconsistency. People can see Alice as Bob’s friend, but Bob is
not Alice’s friend. To mitigate this problem, you decide to run a daily MapReduce job that finds
these inconsistencies, and advises how to correct them.
A few things to note:
● For simplicity, assume two users can become friends only once and ‘un-friend’ only
once. Assume befriending and un-friending are mutual decisions.
● When two users are un-friended, this event is marked as a ‘tombstone’ in the friendship
list of both users.
● A friendship list for user id x is a list of pairs <Integer,Boolean> where for each pair the
Integer is a user id (e.g. y) and the boolean is True if x and y are friends and False if x
and y were friends but unfriended (tombstone).
● The input to each Map function is a user id and her friend list.
● The output of the Reduce function should be statements of the form “Add|Delete user id
Z To|From the friends list of user id W”. Examples: “Add user id i To the friends list of
user id j” or “Delete user id m from the friends list of user id n”. Assume that these
directives are fed to another system that knows how to execute them. Also note that
“Delete x from y” can mean changing <x, True> to <x, False> in y’s friend list or adding
<x, False> to y’s friends list (if <x, True> does not exist there). Verify that you
understand why the latter case can occur. In any event, you don’t need to worry about
how these two cases are executed. You just need to produce the “Delete …” directive.
Write the Map and Reduce function for this job, in pseudo-code. You can use any syntax, but
make it easy to understand what you are doing.
Your answer goes here:
def Map(Integer user_id, List<Pair<Integer, Boolean>> friend_list):
    // user_id: the id of the user whose friends list is currently processed.
    // friend_list: the friend list. Assume no duplicates.
    // TODO: Your logic goes here.
    // TODO: Change this to the Key and Value you wish to emit.
    // Emit can be called more than once inside a mapper function.
    Emit(change_me, change_me)

def Reduce( /* TODO: put the types of key and list of values here*/):
    // TODO: implement me.
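For reference, one possible approach to this kind of reconciliation job is sketched below in Python. It is not the official solution. It assumes the key is the unordered pair of user ids and that a tombstone on either side means the pair was unfriended (which follows from the "befriend once, un-friend once" rule). The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, and `run_job` merely simulates the shuffle phase in memory.

```python
from collections import defaultdict


def map_fn(user_id, friend_list):
    """Emit one record per list entry, keyed by the unordered user pair."""
    for friend_id, is_friend in friend_list:
        key = (min(user_id, friend_id), max(user_id, friend_id))
        yield key, (user_id, is_friend)


def reduce_fn(key, values):
    """values holds at most one (owner id, flag) record per side of the pair."""
    a, b = key
    state = dict(values)  # owner id -> True (friend) or False (tombstone)
    unfriended = False in state.values()
    for owner, other in ((a, b), (b, a)):
        if unfriended:
            # A tombstone wins: any side without one must record the delete.
            # This covers both a live <other, True> and a missing entry.
            if state.get(owner) is not False:
                yield f"Delete user id {other} From the friends list of user id {owner}"
        elif owner not in state:
            # Only one side recorded the friendship: add the missing half.
            yield f"Add user id {other} To the friends list of user id {owner}"


def run_job(friend_lists):
    """In-memory stand-in for the MapReduce framework's shuffle and reduce."""
    grouped = defaultdict(list)
    for user_id, friend_list in friend_lists.items():
        for key, value in map_fn(user_id, friend_list):
            grouped[key].append(value)
    directives = []
    for key in sorted(grouped):
        directives.extend(reduce_fn(key, grouped[key]))
    return directives
```

Keying by the pair brings both sides of a relationship to the same reducer, so each reducer call can compare the two views of one friendship without any cross-shard lookup.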
Q4: Storing and Analyzing logs
You decide to add a simple user-interaction analysis feature to your site. You reason that if you
log the IP address of the client on each request, along with all the HTTP headers that were
included in the request, you could derive interesting results. For example, you could use geo-
location tools to see the geographical distribution of your users, for each hour of the day. Or you
could look at the HTTP_USER_AGENT in the request to see what browsers your users use
(and maybe correlate that with the country they are in). These analyses will allow you to
improve your service to better suit your users. In Django you write a new Django Middleware
that has access to these data on each request so it can log them.
Describe how you would implement this logging feature. Specifically, describe where you plan to
write that data to. If you plan to copy/move that data around - describe that too. How would you
make this data available for analysis pipelines? (you don’t need to describe any MapReduce
jobs). How would you make sure that this logging step does not add latency to your page
serving?
Hint: you’d like to collect as much data as possible, but losing a few records is not catastrophic.
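As background, the sketch below shows what such a middleware might look like, under stated assumptions. It follows the shape of a Django middleware (a callable constructed with `get_response`) and reads `request.META`, but it does not import Django so that it stays self-contained. The bounded in-process queue, the drop-on-full policy, and the names `RequestLogMiddleware`, `drain`, and `start_writer` are hypothetical choices for illustration, not a prescribed design.

```python
import json
import queue
import threading
import time

LOG_QUEUE = queue.Queue(maxsize=10000)  # hypothetical bound on buffered records


class RequestLogMiddleware:
    """Django-style middleware that buffers one log record per request."""

    def __init__(self, get_response, log_queue=LOG_QUEUE):
        self.get_response = get_response
        self.log_queue = log_queue

    def __call__(self, request):
        record = {
            "ts": time.time(),
            "ip": request.META.get("REMOTE_ADDR"),
            # Django exposes request headers in META with an HTTP_ prefix.
            "headers": {k: v for k, v in request.META.items()
                        if k.startswith("HTTP_")},
        }
        try:
            # Never block the request: if the buffer is full, drop the record.
            self.log_queue.put_nowait(record)
        except queue.Full:
            pass
        return self.get_response(request)


def drain(log_queue, sink, stop_event):
    """Move buffered records to a line-oriented sink, one JSON object per line."""
    while not (stop_event.is_set() and log_queue.empty()):
        try:
            record = log_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        sink.write(json.dumps(record) + "\n")


def start_writer(log_queue, sink):
    """Run drain() on a background thread so writes stay off the request path."""
    stop_event = threading.Event()
    worker = threading.Thread(target=drain, args=(log_queue, sink, stop_event),
                              daemon=True)
    worker.start()
    return worker, stop_event
```

The request handler only performs a non-blocking enqueue, so slow storage cannot delay page serving; the cost is that records are lost when the buffer overflows or the process dies, which the hint above treats as acceptable.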
Your answer goes here:
Q5: Public APIs
In this question you will write a public RESTful API for two Scalica operations. You will not need
to document your API; instead, you will describe it by example. That is, you will write a sample
request and response. Nothing else is required or allowed.
Things to note:
● For this question, ignore authentication or authorization. Assume everyone is authorized.
● Use HTTP and JSON (where it is needed)
● No need to specify optional HTTP headers
● Assume the hostname is www.nyuscalica.net
● Other than that you are free to choose your API structure.
● Do not document/annotate/explain the examples.
Operation a) Read all posts of a user (whose id you know):
In the example, assume the user has only ever posted two posts. However, remember that some
users can have many thousands of posts, so consider that when devising the request/response
format.
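The concern about users with many thousands of posts is usually addressed by returning results in pages. The sketch below illustrates the idea with a simple offset cursor; the function name and the `posts`, `limit`, and `next_cursor` fields are hypothetical and are not part of any Scalica API.

```python
def paginate_posts(posts, limit=2, cursor=0):
    """Return one page of posts plus the cursor for the next page.

    posts: the full, already ordered list of a user's posts.
    limit: maximum number of posts per page.
    cursor: offset of the first post to return.
    next_cursor is None when there are no further pages.
    """
    page = posts[cursor:cursor + limit]
    next_cursor = cursor + limit if cursor + limit < len(posts) else None
    return {"posts": page, "next_cursor": next_cursor}
```

A client keeps requesting pages, passing back the returned cursor each time, until the cursor is None, so no single response has to carry the full post history.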
Request
GET ...
Response
Operation b) Make a user (whose id you know) start following another user (whose id you
know too):
Request
POST ...
Response