Clustering Analysis of Rock Samples: Data Integration and K-Means

Added on 2022/12/14

AI Summary

This assignment solution performs a clustering analysis of rock samples from two laboratories, Chemplus and Star. The analysis begins with data preprocessing: importing data from Excel files, checking data types, handling missing values, and cleaning the data. The solution then integrates the two data sets, standardizes the variables, and runs a correlation analysis. The core of the assignment is K-means clustering, exploring different numbers of clusters and evaluating the results with the Pseudo F statistic and the Cubic Clustering Criterion (CCC). The solution uses SAS Enterprise Miner to conduct the cluster analysis and provides a report detailing the process, findings, and interpretation of the clusters, including canonical discriminant analysis and the SGPLOT procedure for visualization. The goal is to identify the different types of rocks and their characteristics from the analyzed data.

/* ********************** File No 1 (Chemplus Lab) ********************** */
/* (a) Before doing cluster analysis, explore the data set and preprocess it
   if necessary. */
/* Import the file 3338430_437973566_rockchemplus.xls. A FILENAME statement
   defines the path to the raw data file. */
filename MyLib '/folders/myfolders/Project-15/3338430_437973566_rockchemplus.xls';
proc import datafile=MyLib
    dbms=xls
    out=chemplus_lab;
    getnames=YES;
run;
/* Check the variable names, types, lengths, formats, informats, and labels of
   the data set chemplus_lab */
proc contents data = chemplus_lab varnum;
run;
/*Check Original Imported data */
*proc print data = chemplus_lab;
*run;
proc format;
value location_type (multilabel)
1 = 'Central'
2 ='East'
3 ='North'
4 ='South'
5 ='West';
run;
/* Keep only the necessary columns (the extra variables K, L and M are dropped)
   and delete rows where Carbornates_Content is missing (NA). */
data process_chemplus_lab;
    /* RETAIN before SET fixes the column order of the variables read in */
    retain Sample_ID Locations Volumetric_Density Volume Mass Hardness Sands_Content
           Carbornates_Content Clays_Content Surface_Area K L M;
    set chemplus_lab;
    /* Convert the character column to numeric; text such as NA becomes missing */
    Carbornates_Content_New = input(Carbornates_Content, best18.);
    Carbornates_Content_New = Carbornates_Content_New * 0.001; /* Convert milligrams to grams */
    format Carbornates_Content_New best18. Locations location_type.;
    /* Test the converted numeric value rather than the original character column */
    if Carbornates_Content_New = . then delete;
    drop Carbornates_Content K L M;
    rename Carbornates_Content_New = Carbornates_Content;
run;
proc print data = process_chemplus_lab;
var Sample_ID Locations Volumetric_Density Volume Mass Hardness Sands_Content
Carbornates_Content Clays_Content Surface_Area;
run;
proc contents data = process_chemplus_lab varnum;
run;
/* PROC MEANS: report the mean, standard deviation, minimum, maximum, and count (n)
   of each numeric variable */

proc means data=PROCESS_CHEMPLUS_LAB chartype mean std min max n vardef=df
nonobs;
run;
/* ************************ File No 2 (Star Lab) ************************ */
/* Import the file 3338431_218657016_rockstar.xls. A FILENAME statement
   defines the path to the raw data file. */
filename MyLib '/folders/myfolders/Project-15/3338431_218657016_rockstar.xls';
proc import datafile=MyLib
dbms=xls
out=star_lab;
getnames=YES;
run;
/* Check the variable names, types, lengths, formats, informats, and labels of
   the data set star_lab */
proc contents data = star_lab varnum;
run;
data star_lab_new;
    /* RETAIN before SET fixes the column order of the variables read in */
    retain Sample_ID Locations Volumetric_Density Volume Mass Hardness Sands_Content
           Carbornates_Content Clays_Content Surface_Area;
    set star_lab;
    *if Volumetric_Density = 'NA' then delete;
    /* Convert the character variables to numeric and format them */
    Volumetric_Density_New = input(Volumetric_Density, best17.);
    Mass_New = input(Mass, best14.);
    Hardness_New = input(Hardness, best14.);
    Sands_Content_New = input(Sands_Content, best14.);
    Carbornates_Content_New = input(Carbornates_Content, best18.);
    Clays_Content_New = input(Clays_Content, best14.);
    Carbornates_Content_New = Carbornates_Content_New * 0.001; /* Convert milligrams to grams */
    format Volumetric_Density_New best17. Mass_New best14. Hardness_New best14.
           Sands_Content_New best14. Carbornates_Content_New best18.
           Clays_Content_New best14. Locations location_type.;
    /* Drop the old character columns */
    drop Volumetric_Density Mass Hardness Sands_Content Carbornates_Content
         Clays_Content;
    /* Rename the new columns to the original names */
    rename Volumetric_Density_New = Volumetric_Density
           Mass_New = Mass
           Hardness_New = Hardness
           Sands_Content_New = Sands_Content
           Carbornates_Content_New = Carbornates_Content
           Clays_Content_New = Clays_Content;
    /* Delete observations with any missing value */
    if cmiss(of _all_) then delete;
run;
/* Check the changed data types and formats */
proc contents data = star_lab_new varnum;
run;
/*View the whole data file */
proc print data = star_lab_new ;
run;
ods graphics on;

/* PROC MEANS: report the mean, standard deviation, minimum, maximum, and count (n)
   of each numeric variable */
proc means data=star_lab_new chartype mean std min max n vardef=df nonobs;
run;
/* ********************** Combine both data sets ********************** */
/* Integrate the data sources and transform the data. Volumetric_Density is
   checked against Mass/Volume; where the two differ, the original value is
   replaced with the computed density. */
data full_dataset;
set process_chemplus_lab star_lab_new;
density = (Mass/Volume);
difference = (density - Volumetric_Density);
if Sample_ID = 2320 then Volumetric_Density = 2.59949;
if Sample_ID = 4777 then Volumetric_Density = 9.14461;
if Sample_ID = 1179 then Volumetric_Density = 3.37017;
if Sample_Id = 5141 then delete;
drop density difference;
run;
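/* Sketch (an addition, not part of the original solution): one way to list the
   samples whose reported Volumetric_Density disagrees with Mass/Volume, which is
   how the hard-coded Sample_ID corrections above could be located. The data set
   name density_check and the 0.01 tolerance are hypothetical choices. */
data density_check;
    set process_chemplus_lab star_lab_new;
    density    = Mass / Volume;
    difference = density - Volumetric_Density;
    /* Keep only rows where the two densities differ by more than the tolerance */
    if abs(difference) > 0.01;
run;
proc print data=density_check;
    var Sample_ID Volumetric_Density density difference;
run;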
options date pageno=4 number; /*Starts from page number 4 with datestamp */
proc print data = full_dataset;
var Sample_ID Locations Volumetric_Density Volume Mass Hardness Sands_Content
Carbornates_Content Clays_Content Surface_Area;
run;
/* Check for duplicate records and remove them: 1012 records in total and 847
   after removing duplicates, so 1012 - 847 = 165 duplicate records were removed */
proc sort data=full_dataset out=fresh_datafile nodupkey;
by Sample_ID;
run;
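/* Sketch (an addition, not part of the original solution): the duplicate count
   quoted above can be cross-checked without modifying any data set. The column
   aliases total_rows, unique_ids and duplicate_rows are hypothetical names. Note
   that COUNT(DISTINCT ...) ignores missing Sample_ID values. */
proc sql;
    select count(*) as total_rows,
           count(distinct Sample_ID) as unique_ids,
           calculated total_rows - calculated unique_ids as duplicate_rows
    from full_dataset;
quit;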
proc print data= fresh_datafile;
var Sample_ID Locations Volumetric_Density Volume Mass Hardness Sands_Content
Carbornates_Content Clays_Content Surface_Area;
run;
/* Standardize the data to mean = 0 and standard deviation = 1 */
proc standard data=fresh_datafile out= standard_data mean=0 std=1;
var Volumetric_Density Volume Mass Hardness Sands_Content Carbornates_Content
Clays_Content Surface_Area;
run;
/* Check the correlation between all independent variables */
proc corr data = standard_data pearson;
var Volumetric_Density Volume Mass Hardness Sands_Content Carbornates_Content
Clays_Content Surface_Area;
run;
/* Perform K-means clustering analysis:
   select maxclusters=9 and analyse the behaviour of the clusters */
proc fastclus data=standard_data maxclusters=9 converge=0 out=cluster_result1;
var Volumetric_Density Volume Mass Hardness Sands_Content
Clays_Content Surface_Area Carbornates_Content;
run;
/* Output analysis for maxclusters=9:
   Pseudo F = 135.01 and Cubic Clustering Criterion (CCC) = 41.286.
   Both values are lower than those obtained with 7 clusters, so a cluster
   size of 9 is not selected. */
/* Perform K-means clustering analysis:
   select maxclusters=7 and analyse the behaviour of the clusters */
proc fastclus data=standard_data maxclusters=7 converge=0 out=cluster_result2;
var Volumetric_Density Volume Mass Hardness Sands_Content
Clays_Content Surface_Area Carbornates_Content;
run;
/* Output analysis for maxclusters=7:
   Pseudo F = 161.75 and Cubic Clustering Criterion (CCC) = 52.170.
   Both criteria are higher than for maxclusters=9, so a cluster size of 7 is
   chosen for the model.
   maxclusters=6 behaves like maxclusters=9: both criterion values decrease. */
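/* Sketch (an addition, not part of the original solution): the comparison of
   cluster sizes can be run in one pass with a macro loop, so that the Pseudo F
   and CCC values printed by PROC FASTCLUS can be read side by side. The macro
   name try_clusters, its parameters kmin and kmax, and the output data set
   names trial_k6 ... trial_k9 are hypothetical. */
%macro try_clusters(kmin=6, kmax=9);
    %do k = &kmin %to &kmax;
        title "FASTCLUS with maxclusters=&k";
        proc fastclus data=standard_data maxclusters=&k converge=0 out=trial_k&k;
            var Volumetric_Density Volume Mass Hardness Sands_Content
                Clays_Content Surface_Area Carbornates_Content;
        run;
    %end;
    title; /* Clear the title after the loop */
%mend try_clusters;
%try_clusters(kmin=6, kmax=9)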
/* Cross-tabulate Sample_ID against the assigned cluster to inspect cluster membership */
proc freq data= cluster_result2 compress;
tables Sample_ID*cluster;
run;
/*Perform a canonical discriminant analysis */
proc candisc data=cluster_result2 out=Can noprint;
class Cluster;
var Locations Volumetric_Density Volume Mass Hardness Sands_Content
Clays_Content Surface_Area Carbornates_Content;
run;
/* The SGPLOT procedure plots canonical variables Can1 (x axis) and Can4 (y axis),
   grouped by cluster */
proc sgplot data=Can;
scatter y=Can4 x=Can1 / group=Cluster;
run;
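/* Sketch (an addition, not part of the original solution): Can1 and Can2 are the
   first two canonical variables and have the largest canonical correlations, so
   plotting them gives the usual two-dimensional view of cluster separation.
   This assumes the data set Can produced by PROC CANDISC above. */
proc sgplot data=Can;
    scatter y=Can2 x=Can1 / group=Cluster;
run;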
