Running head: BIG DATA ANALYSIS
Big Data Analysis
Name of the Student
Name of the University
Author Note
Executive summary
This report discusses big data technologies and recommends an architecture for managing health data records. Brief use cases illustrate who uses big data in healthcare. The technologies are reviewed to identify their advantages and disadvantages, and a big data architecture that healthcare organizations could implement for data management is then recommended. Big data is usually characterized along three dimensions: the velocity at which data is processed, the variety of the data, and its extreme volume. It is also often described as the systematic analysis and extraction of information from data sets that are too large or complex to be handled by traditional data processing software. The report closes with a conclusion that summarizes the findings.
Table of Contents
Introduction
Big data use case
Critical Analysis of Big Data Technology
    Different types of Big Data Technology
        Hadoop
        Spark
        Data Lakes
        R technology
Big data architecture solution
Conclusion
References
Introduction:
The term “Big Data” describes a large volume of data that can be unstructured, semi-structured, or structured. Such data is mined for information that is used in advanced projects or in analytics. Big data is usually characterized along three dimensions: the velocity at which data is processed, the variety of the data, and its extreme volume. Many authors also treat big data as the systematic analysis and extraction of information from data sets that are too large or complex to be handled by traditional data processing software. Big data matters because larger data sets offer greater statistical power. The drawback is that highly complex data, with many attributes, can lead to a higher false discovery rate. Because big data technology has both positive and negative aspects for data processing, this report studies it in more detail. It presents a critical analysis of big data technologies and then discusses an architecture solution.
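The false discovery point can be made concrete with a small calculation. The sketch below assumes that every statistical test is independent and has the same false-positive rate alpha. These assumptions, the function name, and the default of 0.05 are illustrative and are not taken from the cited sources. It shows how the chance of at least one spurious finding grows as more attributes are tested.

```python
def chance_of_false_discovery(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    so the chance that no test fires is (1 - alpha) ** num_tests.
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Compare a single test with progressively wider data sets.
    for n in (1, 10, 100):
        print(n, round(chance_of_false_discovery(n), 3))
```

Under these assumptions, testing 100 unrelated attributes at alpha = 0.05 gives a probability above 0.99 of at least one false discovery, which is why wide data sets need some correction for multiple comparisons.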
Big data use case
Figure 1: Use case of Big data in healthcare
Critical Analysis of Big Data Technology:
Different types of Big Data Technology:
Several big data technologies are currently available on the market. The following sections describe some of the most important ones.
Hadoop
Hadoop is an open source software framework for storing data and running applications on clusters of commodity hardware (Chen and Zhang 2014). It offers vast storage for all kinds of data, large processing power, and the ability to handle a virtually limitless number of concurrent jobs or tasks. The Hadoop ecosystem may not be as dominant as some other big data technologies, but it is one of the most important open source frameworks for distributed processing of large data sets. Hadoop has also matured to the point where commercial big data solutions are built on it (Gandomi and Haider 2015). The important vendors in the Hadoop ecosystem are Hortonworks, Cloudera and MapR. Some of the major public cloud platforms also offer Hadoop.
Some of the benefits of Hadoop technology are:
The benefits of using Hadoop can be significant: it can store and process vast amounts of any type of data, and it offers immense computing power, fault tolerance, flexibility, low cost, and scalability (Oussous et al. 2018).
- With data volumes and varieties constantly increasing, especially from social media and the Internet of Things, significant processing power and computing capability are required. Healthcare data needs to be analyzed in real time and to yield quick results that help doctors and management take quick, informed decisions (Khan et al. 2014). Hadoop's distributed computing model processes big data quickly and easily: the more computing nodes are used, the more processing power the organization has.
- Data and application processing are protected against hardware failure. If a node stops working properly, its jobs are automatically redirected to other nodes so that the distributed computation does not fail (Chen et al. 2014). Multiple copies of all data are maintained automatically.
- Hadoop gives the organization significant flexibility in storing and managing data. Unlike conventional relational databases, data does not have to be preprocessed before it is stored (Kim, Trimi and Chung 2014). An organization can store as much data as it needs and decide later what to do with it. This includes unstructured data such as text, videos, and images.
- The open source framework is free and uses commodity hardware to store large amounts of data (Watson 2014).
- With Hadoop, an organization can grow its system to handle more data simply by adding nodes. Because of this, little administration is required (Vitolo et al. 2015).
Some of the issues that are observed within Hadoop are:
- MapReduce programming might not work efficiently for all problems: It is appropriate for simple information requests and for problems that can be divided into independent units, but it is not well suited to iterative and interactive analytics tasks. MapReduce is file intensive (Jin et al. 2015). Because the nodes do not communicate with each other except through shuffles and sorts, iterative algorithms need several map and shuffle phases to complete. It creates
multiple files between MapReduce phases, which makes it inefficient for advanced analytic computing (Hashem et al. 2015).
- Knowledge gap: It can be difficult to find entry-level programmers with enough Java skills to be productive with MapReduce.
- Data security: A major challenge with Hadoop is fragmented data security, although new tools and technologies are emerging. Proper use of the Kerberos authentication protocol is one security measure for Hadoop environments (Pääkkönen and Pakkala 2015).
- Extensive data management and data governance: Hadoop does not have easy-to-use, full-featured tools for data management, data cleansing, metadata, and governance. Tools for data quality and standardization are especially lacking (Bhatt, Dey and Ashour 2017).
Spark
Apache Spark is part of the Hadoop ecosystem, but it is also used on its own in other settings. It is typically used to process big data within a Hadoop environment. Its main advantage over traditional Hadoop MapReduce is speed: it can be up to a hundred times faster, mainly for workloads that fit in memory. Interest in this technology is currently growing noticeably (Zhang 2016). Apache Spark is described as a fast, in-memory data processing engine with elegant and expressive development APIs that let data workers efficiently run streaming, SQL, and machine learning workloads that need fast, iterative access to data sets. With Spark running on Apache Hadoop YARN, developers everywhere can create applications that exploit the power of Spark, derive insights, and enrich their data science workloads within a single, shared data set in Hadoop (Suthaharan 2014). The Hadoop YARN architecture provides the foundation that lets Spark and other applications share a common cluster and data set while ensuring consistent levels of service and response. Spark is currently one of many data access engines that work with YARN in HDP. Spark was designed for data science, and its abstractions make data science simpler (Zhang et al. 2015). Data scientists commonly use machine learning, a set of methods and algorithms that can learn from data. These algorithms are frequently iterative, and Spark's ability to cache a data set in memory greatly speeds up iterative processing, which makes Spark an ideal engine for implementing them.
Some of the benefits of using Apache Spark are:
- Faster data processing through fewer disk reads and writes. Spark starts from the same idea as MapReduce and can run the same kinds of jobs, except that it first places data into RDDs (Resilient Distributed Datasets) so that the data is kept in memory and can be accessed at any time (Demchenko, De Laat and Membrey 2014).
- Access to real-time data, which supports the day-to-day working of organizations. As the amount of data captured within organizations grows exponentially, all of it needs to be processed swiftly, with results delivered in real time. Spark helps analyze real-time data as it is collected. Applications include fraud detection, electronic trading data, and log processing in live streams (Zomaya and Sakr 2017).
- Graph processing is a major capability of Spark. Apart from stream processing, Spark can be used for graph processing in areas from advertising to social data analysis. Graph processing captures the relationships among entities in the data (Abuín et al. 2015).
Although Apache Spark has several benefits, it also has limitations that should be considered and analysed before use. Some of the major drawbacks of Apache Spark are:
- Absence of an in-house file management system: Apache Spark depends on a third-party system for file management, which makes it less efficient than some other platforms. When it is not combined with HDFS (the Hadoop Distributed File System), it has to be used with another cloud-based data platform (Baaziz and Quoniam 2014).
- Huge number of small files: This is another file management drawback. When Apache Spark is used with Hadoop, developers often run into the small files problem, because HDFS is designed for a limited number of large files rather than a large number of small files.
- Inefficient real-time processing: In Spark Streaming, the incoming stream is divided into batches at predefined intervals, and each batch is processed as an RDD (Resilient Distributed Dataset). After the operations have been applied to a batch, the results are also returned in batches. Processing data in batches therefore does not qualify as
real-time processing; however, because the operations are fast, Apache Spark can be regarded as a near real-time processing platform (Hashem et al. 2016).
Data Lakes
Data lakes are one of the major big data technologies and are used by many enterprises to access vast stores of data. A data lake is a huge repository that collects different kinds of data from different sources and stores the data in its natural state. Although it collects and stores many types of data, a data lake is quite different from a conventional data warehouse. A data warehouse processes and structures data before storing it, whereas a data lake stores data in its raw form and leaves the structuring until the data is used. Data lakes are therefore useful for enterprises that want to store data effectively but are not yet sure how they will use it. Much Internet of Things data falls into this category, and current IoT trends are one of the main reasons for the rapid growth of data lakes.
Some of the major benefits of data lakes include:
- Ability to derive value from unrestricted kinds of data
- Ability to store all kinds of structured and unstructured data, from social media data to CRM data
- Increased flexibility
- Unrestricted methods to query the data
- Application of a variety of tools to gain insight into the meaning of the data
- Elimination of data silos
- Democratized access to data through a single, combined view across the organization when an effective data management platform is used
Although data lakes provide several benefits, they also present challenges that need to be considered and analysed. Some of the challenges include:
- Understanding the limitations and the purpose of the technology: The Hadoop environment offers a vendor-independent tool chain for data storage, analytics, and management. It can manage a virtually unlimited quantity of data and, with add-ons, makes real-time and data science use cases available to any enterprise.
- In the present world of information security threats and large data breaches, the security structure is an important consideration when a data lake is introduced.
- Data governance within data lakes raises significant issues. Defining security policies and procedures can be problematic for organizations, which makes data management difficult.
R technology
R is also an open source technology for big data. It is a programming language and software environment designed for statistical computing. The R language includes functions that support linear modelling, non-linear modelling, classical statistics, classification, and clustering. It has remained popular in academic settings because of its strong features and because it is free to download in source code form under the terms of the Free Software Foundation's GNU General Public License. It compiles and runs on Unix platforms as well as other systems, including Linux, Windows, and macOS. R has been developed around a standard command line interface. Users work at the command line to read data and load it into a workspace, specify commands, and obtain results. The commands can be of various kinds and can run complicated functions, such as linear regressions and other advanced calculations. Users can also write their own functions in R. The environment lets users combine individual operations, such as joining separate files into a single document, extracting a single variable, and running a regression on the resulting data set, into a single function that can be used repeatedly. This open source software is currently managed by the R Foundation, and users can use it under the GPL 2 license. Many important IDEs, including Visual Studio and Eclipse, support the language, which increases its usability. R is currently one of the most popular languages for statistical work. Its popularity stems from the fact that it can be used in almost every big data project, which underlines the importance of R in the field of big data technology.
Some of the major benefits of the R programming language are:
- Can execute anywhere: The R core development team has put significant effort into making R available and accessible on various kinds of software and hardware. R is therefore available for Windows, Unix systems, and Mac.
- R supports extensions: R provides various kinds of functions, such as data manipulation, statistical modelling, and graphics. Its major advantage, however, is extensibility. Developers can easily write their own software and distribute it in the form of add-on packages.
Some of the issues of the R programming language are:
- Native R is significantly slower than its major competitors. Optimizing R code is harder, and R packages tend to be slower than the alternatives.
- R users gain immensely from packages, which offer extra features and convenience functions for splitting up data and conducting statistical analysis. However, the maintenance of packages relies on the goodwill and altruism of R users. R scripts have been observed to fail with a newer version of the same package.
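The idea described in the R section of wrapping several steps (joining separate files, extracting one variable, and running a regression) into a single reusable function is not specific to R. The sketch below illustrates the pattern in Python using only the standard library. The column names, sample values, and function names are hypothetical, and the least-squares fit is a simple hand-written one that only stands in for what a statistics package would provide.

```python
import csv
import io
from typing import Dict, List, Sequence, Tuple


def read_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text (standing in for one input file) into row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regress_from_files(csv_texts: Sequence[str], x_col: str, y_col: str) -> Tuple[float, float]:
    """Join several inputs, extract two columns, and fit a line in one call."""
    rows = [row for text in csv_texts for row in read_rows(text)]
    xs = [float(row[x_col]) for row in rows]
    ys = [float(row[y_col]) for row in rows]
    return fit_line(xs, ys)


# Made-up sample data for two hypothetical wards.
WARD_A = "age,stay_days\n30,2\n40,3\n"
WARD_B = "age,stay_days\n50,4\n60,5\n"

if __name__ == "__main__":
    slope, intercept = regress_from_files([WARD_A, WARD_B], "age", "stay_days")
    print(slope, intercept)
```

Once the steps live in one function, the same analysis can be repeated on new inputs by changing only the arguments, which is the benefit the section attributes to user-written functions.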
Big data architecture solution
Figure 2: Hadoop architecture
Figure 3: Apache Spark
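To make the Hadoop part of the architecture more concrete, the map, shuffle, and reduce phases discussed in the Hadoop section can be imitated in a few lines of plain Python. This is a single-process sketch and not Hadoop code: a real cluster would spread the map and reduce work across nodes and write intermediate results to files. The record layout, diagnosis codes, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Made-up visit records: (patient_id, diagnosis_code).
RECORDS: List[Tuple[str, str]] = [
    ("p1", "E11"),
    ("p2", "I10"),
    ("p3", "E11"),
    ("p1", "I10"),
    ("p4", "E11"),
]


def map_phase(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Map: emit one (key, 1) pair per record, independently of other records."""
    for _patient_id, code in records:
        yield code, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_diagnosis(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Run the three phases in order on an in-memory data set."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_by_diagnosis(RECORDS))
```

Because each record is mapped on its own and each key is reduced on its own, both phases can be distributed across nodes. An iterative algorithm would have to repeat the whole map, shuffle, and reduce cycle on every pass, which is the inefficiency noted in the Hadoop section.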
Figure 4: Data lake reference architecture
Conclusion
In conclusion, implementing big data technology could help healthcare organizations manage their data more efficiently and handle requests with improved functionality. Hadoop is an open source software framework for storing data and running applications on clusters of commodity hardware. It offers vast storage for all kinds of data, large processing power, and the ability to handle a virtually limitless number of concurrent jobs or tasks. Apache Spark is part of the Hadoop ecosystem, but it is also used on its own in other settings. It is typically used to process big data within a Hadoop environment, and its main advantage over traditional Hadoop MapReduce is that it can be up to a hundred times faster, mainly for workloads that fit in memory. Data lakes are one of the major big data technologies and are used by many enterprises for
accessing vast stores of data. A data lake is a huge repository that collects different kinds of data from different sources and stores the data in its natural state.
References
Abuín, J.M., Pichel, J.C., Pena, T.F. and Amigo, J., 2015. BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies. Bioinformatics, 31(24), pp.4003-4005.
Baaziz, A. and Quoniam, L., 2014. How to use Big Data technologies to optimize operations in Upstream Petroleum Industry. arXiv preprint arXiv:1412.0755.
Bhatt, C., Dey, N. and Ashour, A.S. eds., 2017. Internet of things and big data technologies for next generation healthcare.
Chen, C.P. and Zhang, C.Y., 2014. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, pp.314-347.
Chen, M., Mao, S., Zhang, Y. and Leung, V.C., 2014. Big data: related technologies, challenges and future prospects.
Demchenko, Y., De Laat, C. and Membrey, P., 2014, May. Defining architecture components of the Big Data Ecosystem. In 2014 International Conference on Collaboration Technologies and Systems (CTS) (pp. 104-112). IEEE.
Gandomi, A. and Haider, M., 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), pp.137-144.
Hashem, I.A.T., Chang, V., Anuar, N.B., Adewole, K., Yaqoob, I., Gani, A., Ahmed, E. and Chiroma, H., 2016. The role of big data in smart city. International Journal of Information Management, 36(5), pp.748-758.
Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A. and Khan, S.U., 2015. The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, pp.98-115.
Jin, X., Wah, B.W., Cheng, X. and Wang, Y., 2015. Significance and challenges of big data research. Big Data Research, 2(2), pp.59-64.
Khan, N., Yaqoob, I., Hashem, I.A.T., Inayat, Z., Ali, M., Kamaleldin, W., Alam, M., Shiraz, M. and Gani, A., 2014. Big data: survey, technologies, opportunities, and challenges. The Scientific World Journal, 2014.
Kim, G.H., Trimi, S. and Chung, J.H., 2014. Big-data applications in the government sector. Communications of the ACM, 57(3), pp.78-85.
Oussous, A., Benjelloun, F.Z., Lahcen, A.A. and Belfkih, S., 2018. Big Data technologies: A survey. Journal of King Saud University-Computer and Information Sciences, 30(4), pp.431-448.
Pääkkönen, P. and Pakkala, D., 2015. Reference architecture and classification of technologies, products and services for big data systems. Big Data Research, 2(4), pp.166-186.
Suthaharan, S., 2014. Big data classification: Problems and challenges in network intrusion prediction with machine learning. ACM SIGMETRICS Performance Evaluation Review, 41(4), pp.70-73.
Vitolo, C., Elkhatib, Y., Reusser, D., Macleod, C.J. and Buytaert, W., 2015. Web technologies for environmental Big Data. Environmental Modelling & Software, 63, pp.185-198.
Watson, H.J., 2014. Tutorial: Big data analytics: Concepts, technologies, and applications. CAIS, 34, p.65.
Zhang, Y., 2016. GroRec: a group-centric intelligent recommender system integrating social, mobile and big data technologies. IEEE Transactions on Services Computing, 9(5), pp.786-795.
Zhang, Y., Qiu, M., Tsai, C.W., Hassan, M.M. and Alamri, A., 2015. Health-CPS: Healthcare cyber-physical system assisted by cloud and big data. IEEE Systems Journal, 11(1), pp.88-95.
Zomaya, A.Y. and Sakr, S. eds., 2017. Handbook of big data technologies. Berlin: Springer.