Upload and categorize faculty list using Pig in Hadoop
This tutorial explains how to upload and categorize a faculty list using Pig in Hadoop. It covers steps to copy the dataset to HDFS, create new datasets based on criteria such as degree level, years of teaching, and last degree obtained, and copy the files from HDFS to the local file system.
1. Upload the dataset 'CIS_FacultyList.csv' into HDFS storage on the cluster, in your designated storage space. If the copy is being made from Windows instead of Linux, the WinSCP software can be used; it presents a Linux-style file view so that files can be dragged and dropped efficiently between the Windows and Linux operating systems.

Before copying the CSV file to HDFS, it is important to ensure the file is present in the Linux local file system. To check this, use the command below.

>ls

Then copy the file into HDFS:

>hadoop fs -copyFromLocal CIS_FacultyList.csv

2. Use Pig to create new datasets from the source file that categorise the instructors using the following criteria:

a. The degree level – Bachelors, Masters or Doctorate
b. Number of years of teaching – less than 5 years, or more than 5 years
c. Whether the last degree was obtained from North America, Europe or elsewhere

HINT: Consider using the Pig Latin SPLIT (partition), FOREACH and GROUP statement constructs.

>pig

The above command takes us to the Grunt shell, where all the Pig commands can be executed.

>CIS_Faculty = LOAD 'CIS_FacultyList.csv' USING PigStorage(',') AS (ID:int, Name:chararray, Location:chararray, Grade:chararray, Title:chararray, Join_Date:chararray, LWD:chararray, Type:chararray, Division:chararray, Reports_To:chararray, Highest:chararray, Highest_Qualification:chararray, Major:chararray, University:chararray, All_Qualifications_From_Profile:chararray, Courses_Taught_Term_201510:chararray, Major_Teaching_Field:chararray, Document_Other_Professional_Certification_Criteria_Five_Years_Work_Experience_Teaching_Excellence_Professional:chararray, Criteria:chararray);

The above command loads the data. Column names that contain spaces (for example "Join Date" and "Reports To") are rewritten with underscores, because Pig field names cannot contain spaces, and the chararray type is used in place of string, which is not a Pig type. To inspect the contents of a relation at any point, use the DUMP command.

>DUMP CIS_Faculty;

Create a new dataset that categorises the instructors based on the degree level – Bachelors, Masters or Doctorate – using SPLIT. (A GROUP/FOREACH sketch that summarises the same categories follows below.)

>SPLIT CIS_Faculty INTO Bachelor_Level IF Highest == 'Bachelor', Master_Level IF Highest == 'Masters', Doctorate_Level IF (Highest == 'Doctorate' OR Highest == 'Ph.D');
>DUMP Bachelor_Level;
>DUMP Master_Level;
>DUMP Doctorate_Level;
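The hint also mentions the GROUP and FOREACH constructs. As an optional illustration (not one of the seven required datasets), the minimal sketch below, which assumes only the CIS_Faculty relation loaded above, uses them to count how many instructors fall under each value of Highest:

>Degree_Groups = GROUP CIS_Faculty BY Highest;
>Degree_Counts = FOREACH Degree_Groups GENERATE group AS Highest, COUNT(CIS_Faculty) AS Num_Instructors;
>DUMP Degree_Counts;

DUMP Degree_Counts then prints one row per distinct degree level together with the number of instructors holding it, which is a quick way to sanity-check the SPLIT above.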
Create a new dataset that categorises the instructors based on the number of years of teaching (this assumes the source file also provides a Number_of_Years_of_Teaching column declared in the LOAD schema):

>SPLIT CIS_Faculty INTO Less_Than_5 IF Number_of_Years_of_Teaching < 5, More_Than_5 IF Number_of_Years_of_Teaching > 5;
>DUMP Less_Than_5;
>DUMP More_Than_5;

Create a new dataset that categorises the instructors based on where the last degree was obtained:

>SPLIT CIS_Faculty INTO NorthAmerica IF University == 'North America', Elsewhere IF University != 'North America';
>DUMP NorthAmerica;
>DUMP Elsewhere;

3. Copy the files from HDFS to the local file system storage.

After executing the above commands, seven different datasets have been created in total. The seven outputs can either be placed in one directory and that directory copied in a single step, or they can be copied one by one using the commands below. (Each relation must first be written out to HDFS; see the STORE sketch at the end of this section.)

>hadoop fs -copyToLocal Bachelor_Level ~/Documents
>hadoop fs -copyToLocal Master_Level ~/Documents
>hadoop fs -copyToLocal Doctorate_Level ~/Documents
>hadoop fs -copyToLocal Less_Than_5 ~/Documents
>hadoop fs -copyToLocal More_Than_5 ~/Documents
>hadoop fs -copyToLocal NorthAmerica ~/Documents
>hadoop fs -copyToLocal Elsewhere ~/Documents

In this case the destination directory is ~/Documents, which can be changed as required.
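The hadoop fs -copyToLocal commands in step 3 can only find the seven datasets on HDFS once they have been materialised, because SPLIT only defines relations inside Pig. Below is a minimal sketch of the STORE statements that would be run from the Grunt shell first; the output directory names simply mirror the relation names and are illustrative rather than required:

>STORE Bachelor_Level INTO 'Bachelor_Level' USING PigStorage(',');
>STORE Master_Level INTO 'Master_Level' USING PigStorage(',');
>STORE Doctorate_Level INTO 'Doctorate_Level' USING PigStorage(',');
>STORE Less_Than_5 INTO 'Less_Than_5' USING PigStorage(',');
>STORE More_Than_5 INTO 'More_Than_5' USING PigStorage(',');
>STORE NorthAmerica INTO 'NorthAmerica' USING PigStorage(',');
>STORE Elsewhere INTO 'Elsewhere' USING PigStorage(',');

Each STORE writes a directory of part-* files on HDFS, and it is those directories that the copyToLocal commands bring back to ~/Documents.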