Upload and categorize faculty list using Pig in Hadoop
This tutorial explains how to upload and categorize a faculty list using Pig in Hadoop. It covers steps to copy the dataset to HDFS, create new datasets based on criteria such as degree level, years of teaching, and last degree obtained, and copy the files from HDFS to the local file system.
1. Upload the dataset 'CIS_FacultyList.csv' into HDFS storage on the cluster, in your designated storage space. If the copy is being made from Windows instead of Linux, the WinSCP software can be used; it presents a Linux-style file view so that files can be dragged and dropped efficiently between the Windows and Linux operating systems.

Before copying the CSV file to HDFS, it is important to ensure the file is present in the Linux local file system. To check this, use the command below.

>ls

Then copy the file into HDFS:

>hadoop fs -copyFromLocal CIS_FacultyList.csv

2. Use Pig to create new datasets from the source file that categorise the instructors using the following criteria:

a. The degree level – Bachelors, Masters or Doctorate
b. Number of years of teaching – less than 5 years, or more than 5 years
c. Whether the last degree was obtained from North America, Europe or elsewhere

HINT: Consider using the Pig Latin SPLIT (partition), FOREACH and GROUP statement constructs.

>pig

The above command takes us to the Grunt shell, where all the Pig commands can be executed.

>CIS_Faculty = LOAD 'CIS_FacultyList.csv' USING PigStorage(',') AS (ID:int, Name:chararray, Location:chararray, Grade:chararray, Title:chararray, Join_Date:chararray, LWD:chararray, Type:chararray, Division:chararray, Reports_To:chararray, Highest:chararray, Highest_Qualification:chararray, Major:chararray, University:chararray, All_Qualifications_From_Profile:chararray, Courses_Taught_Term_201510:chararray, Major_Teaching_Field:chararray, Document_Other_Professional_Certification_Criteria_Five_Years_Work_Experience_Teaching_Excellence_Professional:chararray, Criteria:chararray);

The above command loads the data. Column names that contain spaces (for example "Join Date" and "Reports To") are rewritten with underscores, because Pig field names cannot contain spaces, and the chararray type is used in place of string, which is not a Pig type. To inspect the contents of a relation at any point, use the DUMP command.

>DUMP CIS_Faculty;

Create a new dataset that categorises the instructors based on the degree level – Bachelors, Masters or Doctorate – using SPLIT. (A GROUP/FOREACH sketch that summarises the same categories follows below.)

>SPLIT CIS_Faculty INTO Bachelor_Level IF Highest == 'Bachelor', Master_Level IF Highest == 'Masters', Doctorate_Level IF (Highest == 'Doctorate' OR Highest == 'Ph.D');
>DUMP Bachelor_Level;
>DUMP Master_Level;
>DUMP Doctorate_Level;
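The hint also mentions the GROUP and FOREACH constructs. As an optional illustration (not one of the seven required datasets), the minimal sketch below, which assumes only the CIS_Faculty relation loaded above, uses them to count how many instructors fall under each value of Highest:

>Degree_Groups = GROUP CIS_Faculty BY Highest;
>Degree_Counts = FOREACH Degree_Groups GENERATE group AS Highest, COUNT(CIS_Faculty) AS Num_Instructors;
>DUMP Degree_Counts;

DUMP Degree_Counts then prints one row per distinct degree level together with the number of instructors holding it, which is a quick way to sanity-check the SPLIT above.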
Create a new dataset that categorises the instructors based on the number of years of teaching (this assumes the source file also provides a Number_of_Years_of_Teaching column declared in the LOAD schema):

>SPLIT CIS_Faculty INTO Less_Than_5 IF Number_of_Years_of_Teaching < 5, More_Than_5 IF Number_of_Years_of_Teaching > 5;
>DUMP Less_Than_5;
>DUMP More_Than_5;

Create a new dataset that categorises the instructors based on where the last degree was obtained:

>SPLIT CIS_Faculty INTO NorthAmerica IF University == 'North America', Elsewhere IF University != 'North America';
>DUMP NorthAmerica;
>DUMP Elsewhere;

3. Copy the files from HDFS to the local file system storage.

After executing the above commands, seven different datasets have been created in total. The seven outputs can either be placed in one directory and that directory copied in a single step, or they can be copied one by one using the commands below. (Each relation must first be written out to HDFS; see the STORE sketch at the end of this section.)

>hadoop fs -copyToLocal Bachelor_Level ~/Documents
>hadoop fs -copyToLocal Master_Level ~/Documents
>hadoop fs -copyToLocal Doctorate_Level ~/Documents
>hadoop fs -copyToLocal Less_Than_5 ~/Documents
>hadoop fs -copyToLocal More_Than_5 ~/Documents
>hadoop fs -copyToLocal NorthAmerica ~/Documents
>hadoop fs -copyToLocal Elsewhere ~/Documents

In this case the destination directory is ~/Documents, which can be changed as required.
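The hadoop fs -copyToLocal commands in step 3 can only find the seven datasets on HDFS once they have been materialised, because SPLIT only defines relations inside Pig. Below is a minimal sketch of the STORE statements that would be run from the Grunt shell first; the output directory names simply mirror the relation names and are illustrative rather than required:

>STORE Bachelor_Level INTO 'Bachelor_Level' USING PigStorage(',');
>STORE Master_Level INTO 'Master_Level' USING PigStorage(',');
>STORE Doctorate_Level INTO 'Doctorate_Level' USING PigStorage(',');
>STORE Less_Than_5 INTO 'Less_Than_5' USING PigStorage(',');
>STORE More_Than_5 INTO 'More_Than_5' USING PigStorage(',');
>STORE NorthAmerica INTO 'NorthAmerica' USING PigStorage(',');
>STORE Elsewhere INTO 'Elsewhere' USING PigStorage(',');

Each STORE writes a directory of part-* files on HDFS, and it is those directories that the copyToLocal commands bring back to ~/Documents.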