ITECH 2201 Cloud Computing
School of Science, Information Technology & Engineering
Workbook for Week 6 (Big Data)
Please note: Every effort has been taken to ensure the given web links are
accessible. However, if they are broken, please use any appropriate
video/article and reference it in your answer.
Part A (4 Marks)
Exercise 1: Data Science (1 mark)
Read the article at http://datascience.berkeley.edu/about/what-is-data-science/ and
answer the following:
What is Data Science?
Data science is the study of what information represents, where it comes from, and how it can
be turned into a valuable resource for creating IT strategies and businesses. By mining huge
amounts of structured and unstructured data, patterns can be identified that help businesses
achieve cost efficiencies, gain competitive advantage and find new market opportunities.
According to IBM estimation, what is the percent of the data in the world today that has
been created in the past two years?
According to IBM's estimate, 90 percent of the data in the world today has been created in
the past two years.
What is the value of petabyte storage?
One petabyte is 10^15 bytes, i.e. 1,000 terabytes or 1,000,000 gigabytes.
CRICOS Provider No. 00103D Insert file name here Page 1 of 23
tabler-icon-diamond-filled.svg

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
For each course, both foundation and advanced, that you find at
http://datascience.berkeley.edu/academics/curriculum/, briefly state (in 2 to 3 lines)
what it offers, based on the given course description as well as the video.
The purpose of this question is to understand the different streams available in
Data Science.
In the foundation courses, students who are proficient in object-oriented programming (OOP)
complete 12 units of coursework, whereas students who are less experienced in OOP complete
15 units.
The foundation coursework includes 3-unit courses in Python for Data Science, Research
Design and Applications for Data and Analysis, Statistics for Data Science, Fundamentals of
Data Engineering, and Applied Machine Learning.
The advanced courses include Experiments and Causality; Behind the Data: Humans and
Values; Scaling Up! Really Big Data; Statistical Methods for Discrete Response, Time Series
and Panel Data; Machine Learning at Scale; Natural Language Processing with Deep
Learning; and Data Visualization.
Exercise 2: Characteristics of Big Data (2 marks)
Read the following research paper from IEEE Xplore Digital Library
Ali-ud-din Khan, M.; Uddin, M. F.; Gupta, N., "Seven V's of Big Data: understanding Big
Data to extract value," Proceedings of the 2014 Zone 1 Conference of the American Society
for Engineering Education (ASEE Zone 1), pp. 1-5, 3-5 April 2014
and answer the following questions.
Summarise the motivation of the author (in one paragraph)
The motivation of the author is the simple fact that big data is now everywhere in our lives
and is the solution to many problems currently facing industry. Big data has all the right
ingredients for building the technology of the future. It is used in every aspect of our lives,
such as in small and large businesses, film making, law enforcement and entertainment, and
by large corporations like Amazon, Facebook and Google. To really use big data to its full
advantage, the web experience needs to be chosen as the data pool, as most people nowadays
access the web and mobile apps most of the time. The largest generator of data is arguably
Google, which has changed the market by introducing big data technologies such as
MapReduce, Hadoop and Google BigTable. The author stresses that big data will help to
revolutionise other fields such as biological research, politics, sustainability, environmental
research, finance and education.
What are the 7 v’s mentioned in the paper? Briefly describe each V in one paragraph.
The 7 Vs mentioned in the paper are volume, velocity, variety, veracity, validity, volatility
and value.
The volume aspect of the 7 Vs points to the fact that big data is created from many sources
such as natural disasters, forecasting, weather, crime reports, space images, medical and
research studies, networking, images, social media, video, audio and text. Volumes of data
can be extracted from social media, GPS trails, government documents, telemetry, web
pages and so on.
The second aspect is velocity. Big data needs to be transferred at an optimum velocity so
that it can be processed; the system taking in the data should have infrastructure capable of
handling it, and the feedback loop from input to decision should be fast.
The third aspect is variety: big data comes from many different sources and can be in the
form of names, images, text, video and audio, and users use different browsers and upload
their data to different clouds.
The fourth V is veracity, the trustworthiness and accuracy of the data. Validity refers to
whether the data is correct and appropriate for its intended use, volatility to how long the
data remains valid and needs to be retained, and value to the useful insight that can
ultimately be extracted from the data.
Explore the author's future work by using reference [4] in the research paper.
Summarise your understanding of how Big Data can improve the healthcare sector.
Big data can improve healthcare through:
Personalised treatments and medicines
Better treatment outcomes
Prevention of malicious behaviour
Exercise 3: Big Data Platform (1 mark)
In order to build a big data platform, one has to acquire, organise and analyse the big
data. Go through the following links and answer the questions that follow:
http://www.infochimps.com/infochimps-cloud/how-it-works/
http://www.youtube.com/watch?v=TfuhuA_uaho
http://www.youtube.com/watch?v=IC6jVRO2Hq4
http://www.youtube.com/watch?v=2yf_jrBhz5w
Please note: You are encouraged to watch all the videos in the series from Oracle.
How can big data be acquired for enterprises and how can it be used?
Big data is being used to improve operational efficiency, and the ability to make appropriate
decisions based on the very latest up-to-the-minute information is rapidly becoming the
norm. The main objective of big data is to enable organisations to make more informed
business decisions by empowering data scientists, predictive modellers and other analytics
professionals to analyse large volumes of transaction data, as well as other forms of data
that may be untapped by conventional business intelligence programs.
How to organize and handle the big data?
To organise and handle big data efficiently, the initial step is to bring the data down to its
core dataset and reduce the amount of data to be managed. Next, use the power of
virtualisation technology (Baldini et al. 2016). Organisations must virtualise this single
dataset so that not only can different applications reuse the same data footprint, but the
smaller data footprint can also be stored on any vendor-neutral storage device.
Virtualisation is the tool that organisations can use to combat the big data management
challenge.
By reducing the data, virtualising its reuse and storage, and centralising the management of
the dataset, big data is ultimately turned into small data and managed like virtual data.
What are the analyses that can be done using big data?
There are several analyses that can be conducted with the help of big data. Banks and
financial institutions use big data analytics to analyse risks in areas such as anti-money
laundering, fraud mitigation and know-your-customer initiatives. Media and entertainment
companies need to deliver real-time content to meet the growing demands of customers
across different formats and a variety of devices such as billboards, TV, YouTube and many
more; their main initiative is to use big data to deliver relevant content across different media
in real time, with Wimbledon sentiment analysis, Amazon Prime and Spotify as live
examples. The healthcare industry is in most need of big data analytics: it holds a huge
amount of data, from blood test results and transaction records to prescriptions and media
commentary. Due to a lack of proper analytics, the health sector has historically failed to use
this data to control costs and realise health benefits. Humedica, Obamacare and Cerner are a
few examples of such initiatives.
Part B (4 Marks)
Part B answers should be based on well cited article/videos – name the references used
in your answer. For more information read the guidelines as given in Assignment 1.
Exercise 4: Big Data Products (1 mark)
Google is a master at creating data products. Below are a few examples from Google.
Describe the products below and explain how large-scale data is used effectively in
these products.
a. Google’s PageRank
PageRank is what Google uses to determine the importance of a web page. It is one of
many factors used to decide which pages appear in search results. PageRank attempts to
measure a web page's importance, and it does not stop at the popularity of a link: it also
estimates the importance of the page that contains the link. Pages with higher PageRank
carry more weight when "voting" with their links than pages with lower PageRank. It also
takes into account the number of links on the page casting the "vote": pages with more
outgoing links pass on less value through each individual link.
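To make the "voting" idea concrete, the sketch below implements a minimal power-iteration version of the PageRank idea. The damping factor, iteration count and the tiny example link graph are illustrative assumptions, not Google's actual production algorithm or parameters.

```python
# Minimal PageRank sketch: iteratively redistribute each page's score
# across its outgoing links, with a damping factor for random jumps.
# The link graph and parameters below are illustrative assumptions.

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}              # start with a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                        # dangling page: share evenly
                share = rank[page] / n
                for p in pages:
                    new_rank[p] += damping * share
            else:
                # Each outlink gets an equal share, so more outlinks means
                # less value passed on through each individual link.
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += damping * share
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {                                   # hypothetical 4-page web
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 3))
```

In this toy graph the heavily linked-to page C ends up with the highest score, while each page's influence is diluted across however many links it casts.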
b. Google’s Spell Checker
Google's spell check is a very old feature that Google has been constantly improving.
Google uses both its web index and its query-processing algorithms to decide whether the
word you typed needs a correction. Sometimes Google does not give the original query any
chance at all: it searches directly for the "correct" spelling. Such queries may have the
lowest QR.
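Google's actual spell-correction pipeline is proprietary, but the general idea of ranking candidate corrections by how often they occur in a large text corpus can be sketched as below. The tiny word-frequency table and the edit-distance-1 candidate generation are assumptions made purely for illustration.

```python
# Toy spell-correction sketch: generate candidate words one edit away and
# pick the candidate that is most frequent in a (tiny, assumed) corpus.
from collections import Counter

# Hypothetical word frequencies; a real system would mine these from web-scale text.
WORD_FREQ = Counter({"search": 900, "engine": 700, "speed": 300, "spell": 250, "spear": 40})

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one insert, delete, replace or transpose away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    if word in WORD_FREQ:                           # already a known word
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get, default=word)

print(correct("serch"))   # -> "search"
print(correct("spel"))    # -> "spell"
```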
c. Google’s Flu Trends
Google Flu Trends no longer publishes current estimates. The service was operated by
Google and provided estimates of influenza activity for more than 25 countries. By
aggregating Google Search queries, it attempted to make accurate estimates of flu activity.
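The core idea behind Flu Trends was to fit a statistical relationship between aggregated search-query volumes and officially reported flu activity, and then use query volumes alone to estimate current activity. The sketch below illustrates that idea with made-up numbers and a simple least-squares fit; the real system used many query terms and far more careful modelling.

```python
# Illustrative sketch of the Flu Trends idea: regress reported flu activity on
# the weekly volume of flu-related searches, then estimate activity for a new
# week from its search volume alone. All numbers below are made up.
import numpy as np

weekly_query_volume = np.array([120, 150, 310, 480, 520, 300, 180], dtype=float)
reported_flu_cases  = np.array([ 40,  55, 130, 210, 230, 120,  70], dtype=float)

# Fit a straight line: cases ~ slope * query_volume + intercept
slope, intercept = np.polyfit(weekly_query_volume, reported_flu_cases, deg=1)

new_week_volume = 400.0
estimated_cases = slope * new_week_volume + intercept
print(f"estimated flu cases for a week with {new_week_volume:.0f} searches: {estimated_cases:.0f}")
```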
d. Google’s Trends
Google Trends is a trend-searching application that shows how frequently a given search
term is entered into Google's search engine relative to the site's total search volume over a
given period of time. It can be used for comparative keyword research and to discover
event-triggered spikes in keyword search volume.
It provides keyword-related data including a search volume index and geographical
information about search engine users.
Like Google, Facebook and LinkedIn also use large-scale data effectively. How?
LinkedIn and Facebook use large-scale data by logging user behaviour and interactions.
This large-scale data is then analysed to find patterns that drive features such as friend and
connection recommendations, news-feed ranking and targeted advertising.
Exercise 5: Big Data Tools (2 marks)
Briefly explain why a traditional relational database (RDBMS) is not effectively used
to store big data?
An RDBMS is not usually used for storing big data for the following reasons.
First, the size of the data has grown enormously, into the range of petabytes. An RDBMS
finds it challenging to handle such huge data volumes; to address this, more CPUs or more
memory must be added to the database management system to scale up vertically.
Second, most of the data arrives in a semi-structured or unstructured format from social
media, audio, video, emails and messages. This unstructured data is outside the domain of
an RDBMS, because relational databases cannot categorise unstructured data: they are
designed and structured to accommodate structured data such as weblog, sensor and
financial data.
Third, big data is generated at high velocity. An RDBMS is lacking in this respect because it
is designed for steady data retention rather than rapid growth. Even if an RDBMS is used to
manage and store big data, it becomes extremely expensive. Thus, the inability of relational
databases to handle big data led to the rise of new technologies.
What is NoSQL Database?
A NoSQL database provides a mechanism for the storage and retrieval of data that is
modelled in ways other than the tabular relations used in relational databases. It is an
approach to database design that can accommodate a wide variety of data models, including
key-value, document, columnar and graph formats.
Name and briefly describe at least 5 NoSQL Databases
Five NoSQL database types, with example products, are described as follows:
Wide-column stores (Cassandra, HBase) - Data is stored in columns rather than in rows
as in a traditional SQL system. Any number of columns (and therefore many different
types of data) can be grouped or aggregated as required for queries or data views.
Document databases (MongoDB, CouchDB) - Data is stored as JSON structures or
documents, where the data could be anything from numbers to strings to free text. There is
no inherent need to specify what fields, if any, a document will contain.
Multi-model databases (Cosmos DB, OrientDB) - Support more than one data model,
such as document and graph, within a single engine.
Graph databases (Neo4j) - Data is represented as a network or graph of entities and
their relationships, with each node in the graph a free-form chunk of data.
Key-value stores (Riak, Redis) - Values, ranging from simple numbers or strings to
complex JSON documents, are accessed in the database by means of keys.
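To show how differently two of these models are accessed, the sketch below stores the same record in a key-value store (Redis, via the redis-py client) and in a document database (MongoDB, via pymongo). It assumes both servers are running locally on their default ports; the database, collection and key names are made up for illustration.

```python
# Sketch: the same user record in a key-value store vs. a document database.
# Assumes a local Redis server (port 6379) and MongoDB server (port 27017).
import json
import redis
from pymongo import MongoClient

user = {"name": "Alice", "email": "alice@example.com", "interests": ["big data", "nosql"]}

# --- Key-value store (Redis): the value is an opaque blob looked up by key ---
kv = redis.Redis(host="localhost", port=6379)
kv.set("user:1001", json.dumps(user))                   # store serialised JSON under a key
print(json.loads(kv.get("user:1001"))["name"])          # retrieval is by exact key only

# --- Document database (MongoDB): the document's fields are queryable ---
mongo = MongoClient("mongodb://localhost:27017")
users = mongo["demo_db"]["users"]                       # hypothetical database/collection names
users.insert_one(dict(user))
print(users.find_one({"interests": "nosql"})["email"])  # query by a field inside the document
```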
What is MapReduce and how does it work?
MapReduce is a programming paradigm that allows for massive scalability across hundreds
or thousands of servers in a Hadoop cluster. Mappers process the input and emit
intermediate key-value pairs, which are then shuffled: the process by which intermediate
data from the mappers is transferred to the reducers. Each reducer receives one or more keys
and their associated values, depending on the number of reducers (for a balanced load), and
the values associated with each key are locally sorted.
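The word-count example below walks through the same three phases in plain Python, without Hadoop: map each line into (word, 1) pairs, shuffle the pairs so that all values for a key are grouped together, and reduce each key's values to a total. It is a minimal sketch of the programming model, not of Hadoop's distributed implementation.

```python
# Plain-Python sketch of the MapReduce flow: map -> shuffle/sort -> reduce.
from collections import defaultdict

def map_phase(line):
    """Mapper: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: combine all values for one key into a final (word, count) pair."""
    return (key, sum(values))

if __name__ == "__main__":
    lines = ["big data needs big tools", "map reduce handles big data"]
    intermediate = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(intermediate)
    results = [reduce_phase(k, v) for k, v in sorted(grouped.items())]
    print(results)   # e.g. [('big', 3), ('data', 2), ...]
```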
Briefly describe some notable MapReduce products (at least 5)
Some products and applications of MapReduce are as follows:
Distributed grep is used to search for a given pattern in a very large number of
files. For instance, a web administrator can use distributed grep to search web
server logs in order to find the most requested pages that match a given pattern.
With the technological advances in location-based services, there is a huge surge
in the amount of geospatial data. Geospatial queries (nearest-neighbour queries
and reverse nearest-neighbour queries) consume a lot of computational resources,
and it is observed that their processing is inherently parallelisable.
Digital Elevation Models (DEMs) are digital or 3D representations of terrain, where each
(X, Y) position is represented by a single elevation value. DEMs are also referred to as
Digital Terrain Models (DTM) or Digital Surface Models (DSM). A DEM can be
represented as a raster (a grid of squares) or as a triangular irregular network (TIN), and
can be generated from remotely sensed (satellite) or directly surveyed elevation data.
Count of URL access frequency - The map function processes logs of web page requests
and emits <URL, 1>. The reduce function adds together all values for the same URL and
emits a <URL, total count> pair.
Inverted index - The map function parses each document and emits a sequence of
<word, document ID> pairs; the reduce function collects, for each word, the list of
documents that contain it.
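The inverted-index item above can be sketched in the same plain-Python style: the map step emits <word, document ID> pairs and the reduce step collects, for each word, the sorted set of documents containing it. The document contents below are made up.

```python
# Sketch of the inverted-index MapReduce job described above.
from collections import defaultdict

docs = {                      # hypothetical document collection: id -> text
    "doc1": "big data platforms",
    "doc2": "relational data stores",
    "doc3": "big relational systems",
}

# Map: emit (word, doc_id) for every word occurrence.
pairs = [(word, doc_id) for doc_id, text in docs.items() for word in text.split()]

# Shuffle + reduce: gather the set of documents for each word.
index = defaultdict(set)
for word, doc_id in pairs:
    index[word].add(doc_id)

for word in sorted(index):
    print(word, sorted(index[word]))   # e.g. big -> ['doc1', 'doc3']
```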
Amazon's S3 service lets you store large chunks of data on an online service. List
some 5 features of Amazon's S3 service.
The features are:
Unmatched durability
Comprehensive security
In-place query
Flexible management
Vendor support
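As a small illustration of working with S3 programmatically, the sketch below uploads and reads back an object using the boto3 client. It assumes AWS credentials are configured in the environment and that the bucket name shown (a placeholder) already exists.

```python
# Sketch: store and retrieve an object in Amazon S3 with boto3.
# Assumes AWS credentials are configured and the bucket already exists.
import json
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"          # placeholder bucket name

record = {"sensor": "temp-01", "reading": 21.4}

# Upload a small JSON object under a key.
s3.put_object(Bucket=bucket, Key="readings/2024-01-01.json",
              Body=json.dumps(record).encode("utf-8"))

# Read it back.
response = s3.get_object(Bucket=bucket, Key="readings/2024-01-01.json")
print(json.loads(response["Body"].read()))

# List keys under the prefix to confirm the upload.
listing = s3.list_objects_v2(Bucket=bucket, Prefix="readings/")
print([obj["Key"] for obj in listing.get("Contents", [])])
```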
Getting concise, valuable information from a sea of data can be challenging. We
need statistical analysis tools to deal with Big Data. Name and describe some (at
least 3) statistical analysis tools.
Three statistical tools are described as follows:
MS Excel offers a wide assortment of tools for visualisation and statistical analysis of your
physiological data. Importing data from text files is as simple as generating summary
statistics and customisable charts and figures.
Advantages:
• It offers a great deal of control and flexibility.
• It is widely available and relatively inexpensive for students and private entities.
• It does not require learning new methods for manipulating data and drawing charts.
MATLAB is a general analysis framework, which requires programming skills to a much
greater degree than Excel or SPSS.
Advantages:
• MATLAB offers specialised toolboxes for the analysis of data coming from eye tracking,
EEG, ECG, EMG and so on, as well as facial expression analysis.
• In MATLAB, analyses, processing steps and results can be fully customised.
• It offers academic licences at a reduced cost.
SPSS is statistics software offering a wide range of statistical and non-statistical tests. SPSS
plots are commonly found in academic papers and commercial research reports.
Advantages:
• SPSS has efficient data management and offers a great deal of control over data
organisation.
• It offers a wide variety of methods, charts and graphs.
• SPSS ensures that the output is kept separate from the data itself, producing well-organised
reports and worksheets containing the results.
Exercise 6: Big Data Application (1 mark)
Name 3 industries that should use Big Data
Industries such as healthcare, banking and finance, and media and entertainment should use
big data; technology companies such as Google, Facebook and Twitter already rely on it
heavily.
From your lecture and also based on the below given video link:
https://www.youtube.com/watch?v=_sXkTSiAe-A
Write a paragraph about memory virtualization.
Memory virtualization allows networked servers to share a pool of memory to
overcome physical memory limitations, a common bottleneck in software
performance. The memory pool may be accessed at the application level or at the
operating system level. At the application level, the pool is accessed through an
API or as a networked file system to create a high-speed shared memory cache.
At the operating system level, a page cache can use the pool as a very large
memory resource that is much faster than local or networked storage. Memory
virtualization implementations are distinguished from shared memory systems.
Watch the below mentioned YouTube link:
https://www.youtube.com/watch?v=wTcxRObq738
Based on the video answer the following questions:
What is RAID 0?
Disk striping, or RAID 0, is a technique that splits up a file and spreads the data
across all the disk drives in a RAID group. The advantage of RAID 0 is that it
improves performance: since striping spreads data across more physical drives,
multiple disks can access the contents of a file, allowing writes and reads to be
completed more quickly. A disadvantage of RAID 0 is that it has no parity; if a
drive should fail, there is no redundancy and all data would be lost.
Describe Striping, Mirroring and Parity.
Disk striping is a technique in which multiple smaller disks act as a single large
disk. The process divides large data into blocks and spreads them across multiple
storage devices. Disk striping provides the advantage of extremely large databases
or large single-table tablespaces using just a single logical device.
Data mirroring is the ongoing process of copying data from one location to a local
or remote storage medium. In short, a mirror is an exact copy of a dataset. Most
commonly, it is used when multiple exact copies of the data are required in
multiple locations.
A parity drive is a hard drive used in a RAID array to provide fault tolerance. For
example, RAID 3 uses it to create a system that is both fault tolerant and, thanks
to data striping, fast. The XOR of all of the data drives in the RAID array is
written to the parity drive.
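The sketch below illustrates striping and XOR parity with plain Python byte strings: a block of data is striped across three in-memory "data drives", a parity block is computed as the XOR of the stripes, and the contents of a failed drive are reconstructed from the survivors. Real RAID controllers work at the block-device level; the chunk size, drive count and example data here are arbitrary assumptions.

```python
# Sketch: striping plus XOR parity over in-memory "drives" (byte strings).
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR across a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, byte_tuple) for byte_tuple in zip(*blocks))

def stripe(data, n_drives, chunk=4):
    """Round-robin the data across n_drives in fixed-size chunks (striping)."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), chunk):
        drives[(i // chunk) % n_drives] += data[i:i + chunk]
    return [bytes(d) for d in drives]

# 24 bytes of example data, chosen so every drive receives equal-sized stripes.
data = b"ABCDEFGHIJKLMNOPQRSTUVWX"
drives = stripe(data, n_drives=3)           # three equally loaded data drives
parity = xor_blocks(drives)                 # parity drive = XOR of all data drives

# Simulate losing drive 1 and rebuilding it from the survivors plus parity.
lost = 1
survivors = [d for i, d in enumerate(drives) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == drives[lost]
print("rebuilt drive", lost, ":", rebuilt)
```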
Exercise 2: Storage Design (2 marks)
Summarize storage repository design based on the following video link:
https://www.youtube.com/watch?v=eVQH7C3nulY
In the mentioned video, a storage repository on a LUN is attached to a clustered
server pool, because of the nature of the OCFS2 file system it uses. Thus, a server
pool must exist with clustering enabled, and at least one server must be present in
the clustered environment. Local server storage with a repository also belongs in
this category, since local disks are always discovered as LUNs.
The YouTube link below describes the Intelligent Storage System
https://www.youtube.com/watch?v=raTIRsMi7zk
Based on the watched video answer the following questions:
What is ISS?
Feature-rich RAID arrays that provide highly optimised I/O processing
capabilities are generally referred to as Intelligent Storage Arrays or Intelligent
Storage Systems (ISS). These storage systems have the ability to meet the
requirements of today's I/O-intensive, cutting-edge applications, which demand
high levels of performance, availability, security and scalability. Accordingly, to
meet the requirements of these applications, many intelligent storage system
vendors now support SSDs, deduplication, compression and encryption, along
with new architectures.
What are the 4 main components of the ISS?
The front end provides the interface between the host and the storage system. It
consists of two parts: front-end ports and front-end controllers. The front-end
ports enable hosts to connect to the intelligent storage system; each front-end port
has processing logic that executes the appropriate transport protocol, for example
SCSI, iSCSI or Fibre Channel, for storage connections.
Cache is a critical component that improves I/O performance in an intelligent
storage system. Cache is semiconductor memory where data is placed temporarily
to reduce the time required to service I/O requests from the host. Cache improves
storage system performance by isolating hosts from the mechanical delays
associated with physical disks, which are the slowest components of an intelligent
storage system.
The back end provides an interface between the cache and the physical disks. It
consists of two parts: back-end ports and back-end controllers.