ITECH 2201 Cloud Computing: Big Data Products and Data Science Task

ITECH 2201 Cloud Computing
School of Science, Information Technology & Engineering
Workbook for Week 6 (Big Data)
Please note: All efforts were taken to ensure the given web links are accessible. However,
if they are broken, please use any appropriate video/article and refer to it in your answer.
Part A (4 Marks)
Exercise 1: Data Science (1 mark)
Read the article at http://datascience.berkeley.edu/about/what-is-data-science/ and
answer the following:
What is Data Science?
_____Data science is one of the fastest-growing fields today, with significant demand for data
professionals across public and private organizations. The article also highlights the limited
supply of professionals who can work with data at scale, which is reflected in rapidly rising
salaries for data analysts, statisticians, data engineers, and others. Because the field is still
emerging, the main challenge is finding techniques to use all of this data more effectively.
According to IBM estimation, what is the percent of the data in the world today that has been
created in the past two years?
______According to IBM, about 90% of the data in the world today has been created in the last
two years. The report also notes that new sensor technologies and devices will further
accelerate this growth rate. A key challenge for marketers is customers' rising expectation
that their needs and preferences be understood at every interaction and transaction.
__________________________________________________________________
What is the value of petabyte storage?
CRICOS Provider No. 00103D Insert file name here Page 1 of 10
In enterprise storage, systems have largely left the terabyte behind and are moving to
petabyte, and eventually exabyte, storage. One petabyte (PB) is 10^15 bytes, equal to
1,000 terabytes (TB) or 1,000,000 gigabytes (GB). Vendors selling storage systems at
this scale include IBM Scale Out Network Attached Storage (SONAS), the Hitachi NAS
Platform (HNAS), and Panasas ActiveStor.
_______________________________________________________________________
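The unit conversions quoted above can be checked with a few lines of arithmetic, using the decimal (SI) definitions the answer relies on:

```python
# Decimal (SI) byte units, matching the figures quoted in the answer above.
BYTES_PER_GB = 10 ** 9
BYTES_PER_TB = 10 ** 12
BYTES_PER_PB = 10 ** 15

terabytes_per_petabyte = BYTES_PER_PB // BYTES_PER_TB   # 1,000 TB
gigabytes_per_petabyte = BYTES_PER_PB // BYTES_PER_GB   # 1,000,000 GB
```

(Note that binary units differ: one pebibyte is 2^50 bytes, i.e. 1,024 tebibytes.)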
For each course, both foundation and advanced, found at
http://datascience.berkeley.edu/academics/curriculum/ briefly state (in 2 to 3 lines) what
they offer, based on the given course description as well as the video. The purpose
of this question is to understand the different streams available in Data Science.
__According to the article, the foundation courses offer knowledge of and proficiency in
object-oriented programming, alongside the other units of foundational coursework that
prepare students for the advanced units.
_______________________________________________________________________
_The advanced courses, in contrast, offer deeper treatment of causality and
experimentation, and of working with human- and values-related data. They also cover
statistical techniques for time series, panel data, and discrete responses.
Exercise 2: Characteristics of Big Data (2 marks)
Read the following research paper from IEEE Xplore Digital Library
Ali-ud-din Khan, M.; Uddin, M.F.; Gupta, N., "Seven V's of Big Data: understanding Big
Data to extract value," 2014 Zone 1 Conference of the American Society for Engineering
Education (ASEE Zone 1), pp. 1-5, 3-5 April 2014
and answer the following questions:
Summarise the motivation of the author (in one paragraph)
_In this article, the authors explain their motivation for the paper by outlining the
arguments surrounding Big Data. They discuss deriving better results from the raw
material of Big Data across the Internet and technology world, and note that large
amounts of data must be processed and queried, a task that traditional techniques
such as SQL struggle to handle.
What are the 7 v’s mentioned in the paper? Briefly describe each V in one paragraph.
_______________________________________________________________________
The seven V's mentioned in the article are described below:
1. Volume: Volume refers to the sheer size of the data, including audio, video, records of
natural disasters, weather forecasting data, and so on. A key difference from traditional
data is that traditional data can be accessed with a simple SQL query, whereas Big Data
volumes cannot.
2. Velocity: Data velocity is discussed from two perspectives: the speed at which data
arrives, and the speed at which data moves through the system.
3. Variety: Another difference between traditional data and Big Data is the variety of
shapes Big Data takes, much of it acquired directly from user interfaces.
4. Veracity: Veracity concerns data reliability. Traditional data is normalized before use,
whereas Big Data is acquired directly from users and is therefore less reliable.
Establishing veracity is thus an important stage in processing Big Data.
5. Validity: Validity concerns the accuracy and correctness of data for its intended use.
Data may be true yet still not be valid or suitable for a given situation.
6. Volatility: Volatility concerns the data retention policy, for both Big Data and traditional
data. Retention is straightforward to implement for traditional data, whereas the variety,
volume, and velocity of Big Data raise additional retention issues.
7. Value: After the six V's above, value is the desired outcome of Big Data analysis,
building on the features and approaches of the preceding characteristics. Researchers
note that the value of data needs to exceed its ownership and management costs.
Explore the authors' future work by using reference [4] in the research paper.
Summarise your understanding of how Big Data can improve the healthcare sector in
300 words.
The discussion below provides an overview of how implementing Big Data can improve
the healthcare sector:

Big Data is generating considerable interest in the healthcare sector, since technology is
ever-changing and the volume of stored data grows exponentially every day. It has
become necessary to adopt Big Data and data analytics in healthcare as well. The use
cases prevailing in the healthcare industry are well suited to Big Data and analytics. For
example, electronic medical records (EMRs) alone collect a humongous amount of data,
and that data is highly varied. Since the amount of data is likely to spiral out of control,
Big Data and analytics are needed to detect patterns and group similar data objects.
Without Big Data, many healthcare organizations find themselves swamped by pedestrian
problems such as regulatory reporting and operational dashboards. Before the basic use
cases are even addressed, newer ones pile up. Implementing Big Data is therefore
essential for the healthcare industry: it can improve the situation by arranging the huge
pile of data in a much more organized way, saving time and resources.
Exercise 3: Big Data Platform (1 mark)
In order to build a big data platform - one has to acquire, organize and analyse the big
data. Go through the following links and answer the questions that follow the links:
http://www.infochimps.com/infochimps-cloud/how-it-works/
http://www.youtube.com/watch?v=TfuhuA_uaho
http://www.youtube.com/watch?v=IC6jVRO2Hq4
http://www.youtube.com/watch?v=2yf_jrBhz5w
Please note: You are encouraged to watch all the videos in the series from Oracle.
How to acquire big data for enterprises and how it can be used?
_Before implementing Big Data, the acquisition phase is one of the main challenges for
infrastructure development, because Big Data streams in with greater variety and at
greater velocity. The infrastructure must therefore support acquisition with predictable,
low latency, both in capturing the data and in executing simple, short queries. In
addition, NoSQL databases are widely used to collect and store data such as social
media feeds from the various available sources.
How to organize and handle the big data?
Organizing and handling Big Data involves filtering, transforming, and sorting data into
data warehouses. Oracle's platform, for example, enables end-to-end handling of both
unstructured and structured content.
_______________________________________________________________________
What are the analyses that can be done using big data?
The infrastructure needed to analyse Big Data must be able to support deeper statistics,
such as data mining over vast data sets and core statistical analysis. This helps to
automate the whole decision process and to deliver faster response times.
Part B (4 Marks)
Part B answers should be based on well-cited articles/videos – name the references used
in your answer. For more information, read the guidelines given in Assignment 1.
Exercise 4: Big Data Products (1 mark)
Google is a master at creating data products. Below are a few examples from Google.
Describe the below products and explain how the large scale data is used effectively in
these products.
a. Google’s PageRank
______Google uses the PageRank algorithm to rank websites in its search engine
results, as a measure of the importance of web pages. PageRank is not the only
algorithm Google uses to rank pages by importance, but it was the first and remains the
best known. At web scale, the ranks are computed by a simple iterative algorithm: the
rank vector corresponds to the principal eigenvector of the normalised link matrix of the
web.___________
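The iterative computation described above can be sketched in plain Python. This is a textbook power-iteration sketch, not Google's implementation; the three-page link graph and the damping factor of 0.85 are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a link graph: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start from a uniform rank vector
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Tiny illustrative web: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

The iteration converges to the principal eigenvector mentioned above: pages with many (or highly ranked) in-links accumulate the most rank.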
b. Google’s Spell Checker
__Google's spell checker is a tool that identifies and corrects misspelled words based on
user behaviour. Google applies both indexing and query-processing techniques to work
out the refinement required for a typed word; the spell check is not a simple lookup
against the search index. The query-processing approach weighs signals such as the
query's rank (QR) and frequency (QF), along with user satisfaction (US) with the
query.______
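Google's production spell checker is proprietary, but the general idea of correcting against observed word frequencies can be illustrated with a small edit-distance sketch in the style of Peter Norvig's well-known spelling corrector. The frequency table here is an invented example, not real query data.

```python
import string

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, freq):
    """Pick the most frequently observed known word within one edit."""
    candidates = [w for w in edits1(word) | {word} if w in freq]
    return max(candidates, key=freq.get) if candidates else word

# Hypothetical observed-word frequencies standing in for query logs.
freq = {"velocity": 120, "veracity": 80, "value": 300}
```

For example, `correct("velocty", freq)` returns `"velocity"`, because inserting the missing letter yields the only known candidate.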
c. Google’s Flu Trends
__Google Flu Trends is a tool that provides influenza activity trends covering some 25
countries. It draws on CDC monitoring, which collects data from multiple sources through
the FluSurv-NET surveillance system. The usable information is broken into five
categories: viral surveillance, mortality, hospitalizations, outpatient illness surveillance,
and the geographic spread of illness. __
d. Google’s Trends
____The online tool Google Trends lets a user find out how frequently specific keywords,
phrases, or subjects have been queried over a specified period of time. It does not
present absolute search numbers, and it works best alongside the Keyword Planner. It
shows a 'normalized', or relative, level of interest for a given phrase or keyword, and
allows that level of interest to be compared against other phrases and keywords that are
potential targets. ____
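The 'normalized' reporting described above can be sketched as a simple rescaling: raw counts are mapped so the moment of peak interest reads 100. The weekly query counts below are hypothetical, purely to show the transformation.

```python
def normalize(series):
    """Scale raw counts so the period of peak interest maps to 100,
    mirroring how Trends reports relative rather than absolute interest."""
    peak = max(series)
    return [round(100 * value / peak) for value in series]

weekly_queries = [120, 300, 60, 240]   # hypothetical raw query counts
relative_interest = normalize(weekly_queries)
```

This is why two very different search volumes can both show a "100" in Trends: the scale is relative to each term's own peak.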
Like Google, Facebook and LinkedIn also use large-scale data effectively. How?
____Facebook and LinkedIn can use large-scale data just as effectively as Google. The
primary difference is that Google works from each user's individual searches and from
'best guessed' data inferred from the sites the user visits and the search terms used.
Facebook and LinkedIn, by contrast, do not infer the user's browsing habits: they ask
users directly about their zones of interest and other specific details._____
Exercise 5: Big Data Tools (2 marks)
Briefly explain why a traditional relational database (RDBMS) is not effective for storing
big data.
__The Relational Database Management System (RDBMS) long served as the default
solution for database needs, but the rapid growth in the volume and velocity of business
data limits its effectiveness for storing the humongous volumes of Big Data. Big Data
ranges into petabytes (in binary terms, one petabyte is 1,024 terabytes). An RDBMS
cannot handle petabytes well: scaling it means adding ever more Central Processing
Units (CPUs) and memory to grow the database management system vertically. __
What is NoSQL Database?
_A NoSQL ('Not Only SQL') database is a database design approach that accommodates
a wide variety of data models, including document, key-value, graph, and columnar
formats. It is an alternative to traditional relational databases, where data is placed in
tables and the schema is carefully designed before the database is built. NoSQL is
essentially used to work with large sets of distributed data._
Name and briefly describe at least 5 NoSQL Databases
__
1. Key-value store NoSQL database: This is the simplest of the NoSQL data stores. The
client gets the value for a key, assigns a value to a key, or deletes a key from the data
store. The value is just an opaque blob: the store alters or returns the value of a key
without knowing anything about its contents.
2. Document store NoSQL database: These are similar to key-value databases in that
each record is stored as a value, with a unique identifier in the form of a key; the value,
however, is a structured document.
3. Column store NoSQL database: In these databases, data is stored in cells grouped by
column, unlike the traditional way of storing it as rows. Columns are logically grouped
into column families, and reads and writes are performed on columns rather than rows.
4. Graph-based NoSQL database: This type follows an entity-attribute-value model.
Entities, also known as nodes, store the data for each entity in the database. ___
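The key-value semantics described in item 1 can be sketched as a minimal in-memory store. The class and the example key are illustrative only; real stores such as Redis or Riak add persistence, replication, and distribution on top of the same interface.

```python
class KeyValueStore:
    """Minimal in-memory sketch of key-value store semantics:
    the store treats every value as an opaque blob."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Assign a value to a key (overwrites any existing value).
        self._data[key] = value

    def get(self, key, default=None):
        # Return the blob for a key, without interpreting its contents.
        return self._data.get(key, default)

    def delete(self, key):
        # Remove a key from the store; silently ignore missing keys.
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", b'{"name": "Ada"}')   # the store never parses this blob
```

Note that the store cannot query inside the blob; that is exactly the capability a document store (item 2) adds.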
What is MapReduce and how does it work?
______MapReduce is the processing framework at the heart of Apache Hadoop: it
processes large data sets in parallel across a Hadoop cluster.
It works in two steps, 'map' and 'reduce'. The analysis is supplied by the job
configuration, and the Hadoop framework provides the parallelization, distribution,
and scheduling services. ___
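The two-step 'map and reduce' flow can be illustrated with a word count, the canonical MapReduce example. This pure-Python sketch only mimics, in one process, what Hadoop distributes across a cluster; the shuffle step stands in for the framework's grouping of intermediate keys between the two phases.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework
    # would do between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    # Reduce: combine all values for each key into a final result.
    for key, values in grouped:
        yield (key, sum(values))

docs = ["big data big value", "data velocity"]   # stand-ins for input splits
counts = dict(reduce_phase(shuffle(map_phase(docs))))
```

In Hadoop, each function would run on many machines at once, with the job configuration naming the mapper and reducer classes.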
Briefly describe some notable MapReduce products (at least 5)
____Apache Hadoop: open-source software for Big Data, providing distributed and
scalable computing.
CouchDB: an open-source document database that uses MapReduce for querying and
has a scalable architecture.
Disco Project: a lightweight, open-source MapReduce framework for distributed
computing.
Infinispan: developed by Red Hat, a distributed cache and key-value NoSQL store
capable of holding huge amounts of data.
Riak: a NoSQL database that is scalable, highly available, and very easy to operate
in a distributed environment. ____
Amazon’s S3 service lets to store large chunks of data on an online service. List some 5
features for Amazon’s S3 service.
____1. Retrieving objects over BitTorrent.
2. Versioning, which retains earlier copies of stored objects.
3. Hosting static websites directly from a bucket.
4. Lifecycle management, which transitions or expires objects automatically.
5. Server-side encryption of stored objects.____
Getting concise, valuable information from a sea of data can be challenging. We need
statistical analysis tools to deal with Big Data. Name and describe some (at least 3)
statistical analysis tools.
___KNIME: a leading open-source solution for data-driven innovation, used to discover
the potential hidden in data.
OpenRefine: a tool for dealing with messy data: cleaning it, transforming it, and
extending it with web services and external data.
Orange: a data-visualization tool that aims to be friendly to novice users, providing a
large toolbox for building interactive workflows to analyse and visualize data.___
Exercise 6: Big Data Application (1 mark)
Name 3 industries that should use Big Data – justify your claim in 250 words for each
industry using proper references.
___A few industries have not yet included Big Data or data analytics in their business
processes. Three of them are listed below, with justification for each claim:
1. Non-profit organizations: One of the biggest non-profit organizations that could benefit
from Big Data and analytics is Wikipedia.org. Wikipedia is indispensable, and its loss
would make a huge difference to the online world. It has a huge amount of data at its
disposal, which Big Data and analytics could use to boost fundraising efforts. Putting Big
Data and analytics into action would give these organizations a much better
understanding of how prospects are acquired, renewed, converted, and upgraded.
2. Sales: Not using Big Data in the sales industry is a disservice. Big Data could help a
sales team determine which areas to focus on. It would further help companies engage,
retain, and develop their workforce, ranking salespeople according to their sales skills
and matching the workforce to their respective zones of expertise.
3. Insurance: The insurance industry is one of the most data-driven industries in the
world, so it is most unfortunate that Big Data and analytics remain under-used there.
Most insurance companies stand on the shoulders of individual experts, and big
opportunities are often missed. Big Data would allow insurers to identify overlooked
variables, such as links between mobile phone usage and the number of impending
accidents.
___