Data Lake Concept and AWS Application in Data Management

Added on 2021/10/03

AI Summary

This report provides an introduction to the concept of a data lake, defining it as a repository for storing vast amounts of unstructured data from various sources. It emphasizes that data lakes maintain data in its original format without a rigid hierarchy, only using the data when needed. The report then highlights Amazon Web Services (AWS) as a current application of data lakes, detailing how AWS offers services to store, move, and analyze data in real-time, securely storing gigabytes to exabytes of data. It explains AWS's analytical tools, including support for open formats, and how they enable interactive SQL queries, data warehousing, and real-time analytics. The report further discusses the flexibility of AWS in storing data in various formats and its cost-effectiveness, emphasizing its long-term backup and archiving capabilities. The report also mentions AWS's secure services, including encryption and access control, along with the integration of partner tools to improve efficiency.

Running head: INTRODUCTION TO DATABASE DESIGN AND MANAGEMENT
Introduction to Database Design and Management
Name of the Student
Name of the University
Author’s note


Concept of a Data Lake and a Current Application
A data lake can be defined as a storage repository that holds a very large
amount of data in its native format. A data lake stores data from various
sources without imposing a structure on it (Terrizzano et al., 2015). There is
no hierarchy or other organised arrangement of the individual pieces of data;
each item is kept in the format in which it arrived. The data lake accepts and
retains data from many kinds of sources and preserves their existing schemas
and data types. The data are processed only when they are needed for use.
Data lakes are thus helpful wherever data are generated and later processed.
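The idea of keeping data in its original format and interpreting it only at the point of use is often called schema-on-read. The following is a minimal sketch of that idea, not an AWS API. The object keys, the sample payloads and the function `read_with_schema` are hypothetical names introduced for illustration.

```python
import csv
import io
import json

# Hypothetical raw payloads from two sources, kept exactly as received.
RAW_OBJECTS = {
    "sensors/2021-10-03.csv": "device,temp_c\na1,21.5\nb2,19.0\n",
    "clicks/2021-10-03.json": '[{"user": "u1", "page": "/home"}]',
}


def read_with_schema(key, raw):
    """Parse a raw object only when it is read, based on its stored format."""
    if key.endswith(".csv"):
        rows = csv.DictReader(io.StringIO(raw))
        # Types are applied here, at read time, not when the data was stored.
        return [{"device": r["device"], "temp_c": float(r["temp_c"])} for r in rows]
    if key.endswith(".json"):
        return json.loads(raw)
    raise ValueError(f"no reader registered for {key}")


sensor_key = "sensors/2021-10-03.csv"
sensor_rows = read_with_schema(sensor_key, RAW_OBJECTS[sensor_key])
print(sensor_rows)
```

Both payloads sit side by side in the same store without a shared structure; each is given a shape only when a reader asks for it.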
One current application of the data lake concept is the Amazon Web Services
(AWS) data lake. The AWS platform provides an agile set of services to store,
move and analyse data (Madduri et al., 2014). AWS supports importing data in
real time and stores the data securely at scales ranging from gigabytes to
exabytes. AWS can also analyse the data through a broad selection of search
engines and analytic tools. Machine learning can be applied to the analysed
data to forecast future outcomes and to prescribe actions based on those
forecasts (Gupta et al., 2015).
AWS offers a broad set of analytic tools. These tools analyse data
efficiently using open formats and standards. Raw data can be stored in the
format the organization chooses (Kim, Trimi & Chung, 2014). The possible
formats for storing data in the data lake include CSV, Grok, ORC, Parquet and
Avro. AWS provides the flexibility to analyse the data in a number of ways,
including interactive SQL queries, data warehousing, big data processing and
real-time analytics. The breadth of the analytics services is intended to meet
the needs of organizations for both existing and future analytics use cases
(Wong & Kerkez, 2016). The AWS data lake can store and retrieve any quantity
of data with a high degree of stability and durability. The AWS platform
stores data across multiple data centres in three Availability Zones within a
single AWS Region. The AWS storage platform also offers replication of data
between Regions.
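To illustrate what an interactive SQL query over raw data in an open format looks like, the sketch below loads a small CSV payload into an in-memory SQLite database and aggregates it. SQLite is only a local stand-in chosen for this example; it is not one of the AWS services named above, and the table name `sales` and the sample rows are assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV object as it might sit in a data lake.
RAW_CSV = "region,orders\neu-west,120\nus-east,340\neu-west,80\n"


def load_csv_into_sqlite(raw_csv):
    """Copy CSV rows into an in-memory table so they can be queried with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, orders INTEGER)")
    rows = [
        (r["region"], int(r["orders"]))
        for r in csv.DictReader(io.StringIO(raw_csv))
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


conn = load_csv_into_sqlite(RAW_CSV)
totals = conn.execute(
    "SELECT region, SUM(orders) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The raw file stays in its original format; the SQL layer is only a view used for the duration of the analysis.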
The AWS data lake services are also highly secure. They provide policy-based
access control, logging and auditing. AWS also provides server-side
encryption, key-based encryption and other protections. Data lakes built on
the AWS platform are mostly cost-effective. The AWS data lake supports
long-term backup and can archive data at extremely low cost (Gupta et al.,
2015). Analytic services supported by AWS, such as Amazon Athena and Amazon
Redshift, are built for fast interactive query performance and can support a
large number of interactive queries. Returning only the subset of an object's
data that a query needs can make queries run up to 400% faster and at a much
lower cost. Glacier Select, one of the services provided by AWS, offers a
similar capability for archived data, which extends the analytical capability
of the data lake to include archival storage (Duggal & Paul, 2013). The other
analytical services supported by AWS use dynamic, pay-as-you-go pricing based
on the resources consumed.
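The benefit of returning only a subset of an object, combined with pay-as-you-go pricing, can be shown with a small local sketch. This is not the Glacier Select API or any other AWS interface. The function `select_subset`, the sample object and the rate `PRICE_PER_BYTE_RETURNED` are hypothetical and chosen only for illustration; the rate is not a real AWS price.

```python
import csv
import io

# Hypothetical object contents; not real data.
RAW_OBJECT = (
    "order_id,region,amount,notes\n"
    "1,eu-west,10.0,gift wrap requested\n"
    "2,us-east,25.5,leave at front desk\n"
    "3,eu-west,7.25,call before delivery\n"
)
PRICE_PER_BYTE_RETURNED = 0.000001  # assumed rate for illustration only


def select_subset(raw, columns, where):
    """Return only the requested columns of the rows for which `where` is true."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.DictReader(io.StringIO(raw)):
        if where(row):
            writer.writerow([row[c] for c in columns])
    return out.getvalue()


subset = select_subset(
    RAW_OBJECT, ["order_id", "amount"], lambda r: r["region"] == "eu-west"
)
full_bytes = len(RAW_OBJECT.encode("utf-8"))
subset_bytes = len(subset.encode("utf-8"))

print(subset)
print(f"returned {subset_bytes} of {full_bytes} bytes")
print(f"cost under the assumed rate: {subset_bytes * PRICE_PER_BYTE_RETURNED:.6f}")
```

Because the charge in this sketch scales with the bytes returned, filtering rows and columns before the data leaves storage reduces both the transfer and the cost.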
The AWS Partner Network has also formed a range of partner integrations that
distinguishes it from other data lake platforms (Kim, Trimi & Chung, 2014).
AWS has collaborated with consulting partners and independent software
vendors from all over the globe. These partner services improve working
efficiency and integrate helpful tools with the data lake.

References
Duggal, P. S., & Paul, S. (2013, November). Big Data Analysis: Challenges and
Solutions. In International Conference on Cloud, Big Data and Trust (Vol. 15,
pp. 269-276).
Gupta, A., Agarwal, D., Tan, D., Kulesza, J., Pathak, R., Stefani, S., & Srinivasan, V.
(2015, May). Amazon Redshift and the case for simpler data warehouses.
In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data (pp. 1917-1923). ACM.
Kim, G. H., Trimi, S., & Chung, J. H. (2014). Big-data applications in the government
sector. Communications of the ACM, 57(3), 78-85.
Madduri, R. K., Sulakhe, D., Lacinski, L., Liu, B., Rodriguez, A., Chard, K., ... & Foster,
I. T. (2014). Experiences building Globus Genomics: a next‐generation
sequencing analysis service using Galaxy, Globus, and Amazon Web
Services. Concurrency and Computation: Practice and Experience, 26(13),
2266-2279.
Terrizzano, I. G., Schwarz, P. M., Roth, M., & Colino, J. E. (2015, January). Data
Wrangling: The Challenging Journey from the Wild to the Lake. In CIDR.
Wong, B. P., & Kerkez, B. (2016). Real-time environmental sensor data: An
application to water quality using web services. Environmental Modelling &
Software, 84, 505-517.