Abstract:Dataisconsideredasanassetinmostofthe commercial business organizations all around the world. It can be said that the growth and progress of a commercial business organization depends hugely on the management of the huge volumes of data. There are different categories of concepts which are directly related with management of data such as reporting of the data, analytics of data, and managing the insightsofthedata.DifferentaspectsofHDFSshallbe discussed, critically analyzed and elaborated in this report from different perspectives. The data which was considered in this report will be from secondary sources and only latest peer reviewed journals will be selected to gather information about HDFS. Keywords:Hadoop Distributed File System, POSIX style,and Traffic Management I.INTRODUCTION There are diverse categories of data threats which can have a huge impact on the net profitability and business sales of corporateestablishmentssuchasthesocialengineering attacks, software attacks, identity threat, theft of intellectual property and extortion of information. The address these diverse categories of threats there are diverse categories of data storage systems which are created anddeployedbycorporateestablishmentssuchasthe secondary and primary data storage systems [1]. The diverse categories of secondary data storage systems are Blu-ray discs, and USB memory sticks. The categories of primary data storage systems are register and the diverse categories of distributed systems. This report shall focus onHadoop Distributed File System (HDFS) which is one of the most significant primary data storage systems which are increasingly used in most of the global organizations all around the world. Implementation of this distributed file systemis a hugechallengeforthe strategicplannersandtheITexpertsofthebusiness organizations. The concept of HDFS was introduced in the year2006,anditwasusedintheApachesoftware foundationproject[2].Thisprimarystoragesystemsis widely used in most of the big data analytics projects all around the world. In the year 2012, Hadoop and HDFS was available in a unified version of 1.0.The following version of the software was there in the year 2013 with the This report shall be having numerous sections and each of the section will be very much significant to enhance the originality and effectiveness of the report. The following unit of this report shall be introducing how HDFS works in most of the corporate establishments all over the world. II.DETAILSOFHDFS A.Working principles of HDFS Therearehugevolumesofdatawhicharetransferred betweeneachbusinessunittoanotherinacorporate environment and managing the security of those data is a huge challenge for most of these organization as a result this distributed file became popular. HDFS is very much useful to transfer huge volumes of data between the compute nodes of a business [3]. There are diverse categoriesof frameworkswhich can be closely aligned with this file storage systems such as the Map Reduceprogrammaticframeworkwhichareverymuch useful to process bulk volumes of data with a seconds. HDFS is very much useful to manage the integrity as well as the originality of the data as it breaks down the longer and complex data into smaller segments. Each of these divided segments are distributed equally to different nodes in a cluster. This primary storage system is very well aligned with the concept of parallel processing [4]. Usually there are diverse categories of faults when huge volumes of data are managed however, the efficiency of this distributed file system is very much on the higher side as a result there are norealfaulttolerantissuesrelatedwiththissystem. Recovery of the data can be done in a more organized manner with the help of HDFS. Server tracking is one of the other significant functionality of HDFS. Even if data is at fault, HDFS is very much beneficial as it ensures processing of the data. Thearchitectureof HDFS is very much dynamic in nature it can be customized according to the business requirements of its end users. Master slave architecture is one of the most popular HDFS architecture and this architecture consists of a single Name Node. It is can be said that larger datasets can be managed in the first place with the help of this distributed system. The master node is also termed as the data chunking architecture and it consists of elements from the Google File System as well as general parallel file system [5]. This data storage system works on the basis of different design style such as the POSIX style. The following diagram will be verymuchusefulto understandthehow NameNodes functions with the help of the Data nodes as well as the Meta data. Figure 1: HDFS architecture Source: (Bende and Shedge 2016) From the above diagram it can be said replication of data can be done in an organized manner with the help of the Name Nodes. This data storage system is very much useful and this system was first introduced by Yahoo [6]. The marketing team of Yahoo used this system in the first place, most of the search engine requirements of this organization were successfully addressed in the first place with the help of HDFS[25].Mostofthelargescaleenactmentofthe hardware are supported with the help of HDFS, at the same timeitcanalsobesaidthatthousandsofnodesand hundreds of petabytes can be managed in the first place with the help of this HDFS [7]. Both log processing and machine learning languages are aligned beautifully with the help of HDFS.
The version of this system kept n changing over the years, Apache Hadoop 3.0.0 was available for business use by the endoftheyear2017.AdditionalNameNodeswas intrioduced in each of the versions [8]. There are other technologieswhichwerewidelyincorporatedwiththis distributed file system such as the dynamometer and the diverse categories of performance testing tools [26]. There are complexities related with the incorporation procedure of the different technologiesThere are diverse categories of features which are directly related with use of the HDFS features which can be understood from the below table. FeaturesDescription Minimum interventionMost of the nodes which do not have their operator can be intervened with the help coming from the HDFS. Integrity of the dataMost of the corrupted data are replicated on numerous occasions. RollbackReturningtotheprevious versions is one of the prime functionality of HDFS. Scaling outScaling up is also one of the key features of scaling of the data. ComputingBothparallelaswellas distributedcommutingis supported with the help of HDFS. Table 1: Description of the key features of HDFS III.ADVANTAGESANDLIMITATIONS A.Advantages Thedatastoragecapabilityofthecommercial organizations using the HDFS. Similar data which are stored in multiple sets can be managed in the first place with the help coming from HDFS. Thedata seek time of the organizations can be enhanced in the first place using HDFS. Aggregation of the data bandwidth is also considered as one of the prime benefits of HDFS [9]. Standardization of the data sets can be enhanced with the help of HDFS as well. The data availability is also enhanced in the first place with the help of the HDFS [10]. The data loads of the streaming format can also be managed with the help coming from HDFS. Data processing can be done is a smaller amount of time with the help of HDFS. Checkingthestatus ofthe clustersistheother benefit of this file storage system. The data security can also be enhanced in the first placewiththehelpoftheHDFS.Mostofthe commodity hardware which are increasingly used in ourcityaresignificantlybenefittedifHDFSis incorporated [11]. HDFS usually do not have any sort of compatibility issues with the other file storage systems or software. Inspite of the malfunction of numerous nodes HDFS is very much useful and reliable. Data coherency is also one of the most significant contributions of this file distributed system. Thehigherfaulttoleranceistheothermajor advantage of using HDFS. B.Disdvantages Lowlatencydataaccessisoneoftheprime limitations of HDFS. The cost of managing the latency is very much on the higher side which is one of the most significant drawback of this storage system. This storage system is very much useful to store andmanagelargerchunksorvolumesofdata rather than smaller data, thus it must not be used in smaller organizations and it is also one of the limitations of this storage system. Inefficientdataaccesspatternistheother limitation related with HDFS. IV.ASPECTSOFHDFS A.Goals of HDFS There are diverse categories of objectives which are related with the use of HDFS such as the followings: Recovery and fault detection:There are diverse categories of hardware as well as software which are directly related with HDFS, as a result fault occurs sometimes, failure of components tends to occur sometimes, and however the incorporationofHDFSisverymuchusefulasitcan successfully deal with these kinds of faults [12]. Most of the data security issues which occurs due to the presence of thousands of servers can be restricted or managed in the first place with the assistance of HDFS. Traffic Management: Managing the outgoing as well as the incoming trafficis one of the most significantsuccess factors in global establishments and HDFS can be very muchusefultomonitortheseoutgoingaswellasthe incoming traffic [13]. It can also be said that management of thethroughputofthedataisalsooneofthenotable objectives of HDFS. Most of the streaming data can be accessed in the first place with the help of HDFS. At the same time, it can also be said that Huge datasets:The bigger datasets which are managed by the commercial organizations can be managed with the help of HDFS as well. Most of the business applications which are used in our society have different categories of data sets and these data sets are very much vulnerable to different data security issues which can be sorted in the first place with the help of HDFS [14]. Scaling of thousands can be done in an organized manner with the help of HDFS as well. The other goal of this storage system is to facilitate adoption procedure. Multiple hardware platforms can be accessed with the help of HDFS. HDFS do not have any sort of issues with any of the operating systems which are used both in corporate environments and in our society.
Hardware at data: There are diverse categories of data computations which are conducted in most of the business organizations all over the world, however, it can be seen that there are human errors in most of those computations. The introduction of HDFS is very much useful to manage those computations. B.HDFS Clients This section of the paper shall illustrating how HDFS clients creates a new file. In most of the cases, a library is used which exports the HDFS file system interface. This is very much beneficial as it supports delete, write, read operations which are generally conducted on the data which are stored and managed. The necessity of the multiple replicas can be identified in the first place with the help of the HDFS. In the initial stages, the HDFS client asks for the Name Node to get the list of the Data nodes, the network topology is very much useful sort the network topology [15]. The client then contacts the node directly, after that a new pipeline is organized from one to another then sends data accordingly. When the first block is filed a new block is requested,asaresultanewblockgetscreated.The procedure which are used to create a new file can be understood from the below diagram. Figure 2: Creation of a new file using HDFS client (Source:Mahmoud, Hegazy and Khafagy 2018) C.Data Nodes Each block which are there in the data node is there due to the presence of two files of the file system. The initial or the first file contains the data which is accessed by HDFS. The metadata of that particular file is in their in the second file. The actual length of the block is directly proportional to the size of the data file. Each Name Node gets associated with the data node in the initial stages, after that a handshaking protocol is practiced [16]. Verification of the namespace ID is the main objective behind the handshaking protocol. The interiority of the file system has to be maintained under every circumstances. After the handshake procedure the data node registers themselves with the Name node. The data node is very much useful as it successfully stores the data and there is a unique identified related with the data which is stored. The storage ID is them assigned with the Data node [17]. Most of the block replicas are managed in the first place with the help of the Data nodes and after that a block report is sent which contains a general stamp and a block ID. The block report is very much useful to block differenttypesofreplica[19].Inthefollowingstep heartbeats are send to the Name node by the Data node. The name node is the responsible to send new replicas to the other blocks of the Data nodes. Heartbeats is considered as theintegralpartyofadatanode.Theloadbalancing decisions as well as the space allocation of the Name node are the other major operations of the data nodes. The role of theNamenodeisverymuchsignificantintheentire procedure as is replies heartbeats to the instructions which aresendtothedatanodes.Thediversecategories commands which are directly related with the Name nodes are as followings: Sending of immediate bock report. Removing of the local block replicas. Replication of blocks to the other nodes. Shutting down of a node. Re-registering of the node. a)I/O operationsd and Replica Management Write, Read of the file:After the creation of new file with the help of HDFS, and after the file is closed, the bytes cannot be altered in the first place.HDFS is very much useful regarding the enactment of multiple reader model and single writer model [18]. When one new file is opened, no other file cannot be opened unless it is altered or removed. A heart best is said to the Name node, at the same time, it can be said that the when the file will be closed the lease will be revoked. The duration of the lease is bounded by the soft limit as well as the hard limit. If the soft limit expires or the consumers of thus storage devices fails to close the file another client can preempt the deal. In less than one hour the hard limit expires and if the client has failed to renew the lease the HDFS shall automatically close on behave of the writer. HDFS DEAMOINS Name NodesData Nodes High RAM is required for this Name Nodes.It runs mostly on the slave nodes.Itcanrunonamaster mode. Metadata can be stored in first place with the help of the namenodes.Higher memory is required to store the data with the help of the Data Nodes. Storing meta data can be a lotmoreeasierwiththe help of this Namenodes. Data storage is one of the most significant characteristic features of HDFS. The following diagram is very much useful to understand how HDFS stores the bulk volumes of data in a commercial business environment. Figure 3: Data storage facility of HDFS
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
(Source:Scholz,et al. 2017) Career Opportunities:There are diverse categories of jobs which are related with HDFS. The average annual salary of a Hadoop developer is $USD102000, whereas the annual salary of experienced Hadoop developer is $USD131000. Apart from this job there are jobs in terms of the design of the HDFS according to the business requirements, planners of this type of distributed systems and enactment of HDFS. There are expert jobs available in Europe and United States where the real life problems are solved.There are diverse advantages related with studying HDFS in high school in terms of the huge amount of knowledge about the use of different fault tolerant technology [20]. The significance of maintaining these technologies can also be understood from the different types of academic courses which are provided from different educational institutions. Job in the Hadoop clusters can be easily obtained if most of the Hadoop concepts are understood by the graduates. Literature Review AccordingtoCiritogluetal.(2018),therearediverse categories of components related with Hadoop architecture suchasthehadoopcluster,namenode,datanode, secondary nodes like HDFS client, diverse categories of framework,andflexibleschema.Thepaperhelpedin understanding that checkpoint is very much significant for the enactmentof the secondaryname mode [21]. The investigator of this peer reviewed journal was very much useful to identify that maintenance of the Name Mode is very much required in most of the HDFS which are used in commercial establishments [23]. The significance of the backupnodeinHDFSwasalsoelaboratedinthe discussions of this paper. On the other hand, as revealed byAbead, Khafagy and Omara (2016), there are few limitations related with the use of HDFS such as its inability to deal with the small files, its slow processing speed, it have no real time data processing capability, latency issues of the HDFS, and the different types of security issues [22]. The paper focused on the fact that only batch processing is performed by HDFS whereas itissupposedtoprocessdifferenttypesoflargeand complex data sets. The journal was very much useful to identifythatiterativeprogressiveisoneofthemost significant limitations of HDFS. CONCLUSION Management of the data is very much required for the growth and progress of both society as well as corporate establishmentsastherearediversecategoriesofdata security threats such as theft of intellectual properties and the data security attacks [24]. Most of the data management challengeswhicharefacedinthecommercial establishments all around the world can be addressed in the first place using HDFS. The main components of the HDFS architecture are Name Node, Data Nodes, Meta data and the different types of blocks. There are diverse versions of Hadoop versions which are available for commercial use. TheprimespecificationsofHDFSareitscomputing capability,maintainingtheintegrityofthedataad controlling the intervention of the nodes. Standardization of bulk volumes of data can be done using HDFS, at the same time,HDFSisalsoverymuchusefultomaintainthe security and the availability of the data. Data coherency can be maintained using Hadoop as well [26]. Low latency data access, cost of managing the low latency and inefficient data access pattern are the prime drawbacks of HDFS. The prime goals of HDFS is detection of the faults and recovering those data, traffic management and management of large chunks of data sets. The procedure of creation of a new file with the help of the HDFS client can also be understood from this paper, the paper was very much useful as it provided a detailed description about data nodes and replica management. The career opportunities of Hadoop was also discussed in the concluding section of this report. RECOMMENDATION The list of recommendations related with the use of HDFS in different business environments are as followings: The robustness of the data has to be identified in the first place before storing the data in HDFS. The communication protocols has to be understood asitcansuccessfullyminimizeddifferent categories of faults. Accessibility of the data has to be known to each and every stakeholders who works with this file storage system. REFERENCE [1] S. Bende, and R. Shedge. Dealing with small files probleminhadoopdistributedfilesystem.Procedia Computer Science,79, pp.1001-1012, 2016 [2] P. Satheesh, B. Srinivas, P.R.S. Naidu and B.P. Kumar. Study on Efficient and Adaptive Reproducing Management in Hadoop Distributed File System. InInternet of Things andPersonalizedHealthcareSystems(pp.121-132). Springer, Singapore, 2012 [3] H. E, Ciritoglu, T. Saber, T.S. Buda, J. Murphy, and C. Thorpe. Towards a better replica management for hadoop distributedfilesystem.In2018IEEEInternational Congress on Big Data (BigData Congress)(pp. 104-111). IEEE, 2018 [4] Y. Xie, A.M.Q. Farhan and M. Zhou. Performance Analysis of Hadoop Distributed File System Writing File Process. In2018 International Conference on Intelligent Autonomous Systems (ICoIAS)(pp. 116-120). IEEE, 2018 [5] M.M. Fahmy, I. Elghandour, and M. Nagi. CoS-HDFS: co-locating geo-distributed spatial data in hadoop distributed filesystem.InProceedingsofthe3rdIEEE/ACM InternationalConferenceonBigDataComputing, Applications and Technologies(pp. 123-132). ACM. [6] C.Y. Lin, and Y.C. Lin. An overall approach to achieve loadbalancingforHadoopdistributedfilesystem. International Journal of Web and Grid Services,13(4), pp.448-466, 2017
[7]C.Liao,A.Squicciarini,andD.Lin.Last-hdfs: Location-aware storage technique for hadoop distributed file system. In2016 IEEE 9th International Conferenceon Cloud Computing (CLOUD)(pp. 662-669). IEEE, 2016 [8] N. Jyoti. The role of Hadoop Distributed Files System (HDFS) in Technological era, 2017 [9]P.P.Deshpande.HadoopDistributedFileSystem: Metadata Management, 2017 [10] V.B. Bobade. Survey paper on big data and Hadoop. InternationalResearchJournalofEngineeringand Technology,3(1), pp.2395-0056, 2016 [11] H. Mahmoud, A, Hegazy, and M.H. Khafagy. An approach for big data security based on Hadoop distributed file system. In2018 International Conference on Innovative Trends in Computer Engineering (ITCE)(pp. 109-114). IEEE, 2018 [12] N. Bhojwani and A. P.V. Shah. A survey on hadoop hbase system.Development,3(1), 2016 [13] B. Tian, Y. Tian, B. Huebert, Y. Sun, T. Hurt, W. Ho, Y. Zhang, D. Chen, and H. Lee. SecHDFS: a secure data allocation scheme for heterogenous Hadoop systems. In 2016IEEEInternationalConferenceonNetworking, Architecture and Storage (NAS)(pp. 1-2). IEEE, 2016 [14] S. Maneas, and B. Schroeder. The evolution of the hadoop distributed file system. In2018 32nd International ConferenceonAdvancedInformationNetworkingand Applications Workshops (WAINA)(pp. 67-74). IEEE, 2018 [15]R.Scholz,N.Tcholtchev,P.LämmelandI. Schieferdecker. A CKAN Plugin for Data Harvesting to the Hadoop Distributed File System. InCLOSER(pp. 19-28), 2017 [16] G. Manogaran, C. Thota, D. Lopez, V. Vijayakumar, K.M.AbbasandR.Sundarsekar.Bigdataknowledge system in healthcare. InInternet of things and big data technologies for next generation healthcare(pp. 133-157). Springer, Cham, 2017 [17] T.L.S.R. Krishna, T. Ragunathan, and S. K. Battula. Customized web user interface for hadoop distributed file system.InProceedingsoftheSecondInternational Conference on Computer and Communication Technologies (pp. 567-576). Springer, New Delhi, 2016 [18] N. V. Patil. Apache Hadoop Distributed File System. Asian Journal For Convergence In Technology (AJCT),3, 2017 [19] B. Rangaswamy, N. Geethanjali, T. Ragunathan and B.S. Kumar. Performance Improvement of Read Operations in Distributed File System Through Anticipated Parallel Processing.InProceedingsoftheSecondInternational Conference on Computer and Communication Technologies (pp. 275-287). Springer, New Delhi, 2016 [20]E.S.Abead,M.H.KhafagyandF.A.Omara.An efficient replication technique for hadoop distributed file system.Int. J. Sci. Eng. Res,7(1), pp.254-261, 2016 [21] W.G. Choi and S. Park. A write-friendly approach to manage namespace of Hadoop distributed file system by utilizingnonvolatilememory.TheJournalof Supercomputing, pp.1-31, 2019 [22] K. Arnepalli and K.S. Rao. Application of Hadoop Distributed File System in Digital Libraries. Basha, S.J., Kumar, P.A. and Babu, S.G., 2016. Storage and processingspeedforknowledgefromenhancedcloud computing with Hadoop frame work: A survey.IJSRSET, 2(2), pp.126-132, 2016 [23] W. Ryu. Flexible Management of Data Nodes for Hadoop Distributed Filesystem.ALLDATA 2017, p.10, 2017 [24] S.Y. Inamdar,A.H. Jadhav, R.B. Desai, P.S. Shinde, I. M. Ghadage and A.A. Gaikwad. Data Security in Hadoop Distributed File System, 2016 [25]S.Niazi,M.Ismail,S.Haridi,J.Dowling,S. GrohsschmiedtandM.Ronström.Hopsfs:Scaling hierarchical file system metadata using newsql databases. In 15th{USENIX}ConferenceonFileandStorage Technologies ({FAST} 17)(pp. 89-104), 2017 [26] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko and A. Rojbi.Performanceevaluationofdistributedcomputing environments with Hadoop and Spark frameworks. In2017 IEEEInternationalYoungScientistsForumonApplied Physics and Engineering (YSF)(pp. 80-83). IEEE, 2017