IRIS Dataset Classification Using KNN Algorithm: A Practical Approach

Homework Assignment
Summary
This assignment focuses on implementing a K-Nearest Neighbors (KNN) classifier for the IRIS dataset. The solution involves loading the IRIS dataset, randomizing the data order, and splitting it into training and testing sets. The Euclidean distance is computed between each test data point and all training data points. The 'k' nearest neighbors are identified, and a class label is assigned based on the majority vote among these neighbors, with ties resolved randomly. The code provides a function to perform KNN classification, calculate classification accuracy, and generate a confusion matrix. The assignment includes a detailed explanation of the KNN algorithm, the code implementation, and the steps involved in data preparation, distance calculation, and label prediction. The aim is to classify the IRIS dataset using KNN with different parameter settings and evaluate the performance of the classifier.
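The core of the method the summary describes can be sketched in a few lines of MATLAB for a single test point; the variable names below (train_X, train_y, x_test) are illustrative placeholders and are not part of the assignment code that follows.

k = 3;                                       % number of nearest neighbours
dists = sqrt(sum((train_X - x_test).^2, 2)); % euclidean distance from x_test to every training row (implicit expansion, R2016b+)
[~, order] = sort(dists);                    % nearest training points first
nn_labels = train_y(order(1:k));             % labels of the k nearest neighbours
predicted = mode(nn_labels);                 % majority vote (note: mode breaks ties by smallest value, not randomly)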
% 1: Load the iris.mat file, which contains the Iris data and its labels
% separately.
% 2: Randomize the order of the data for each iteration so that new sets of
% training and test data are formed.
%
% The training data has size Nxd, where N is the number of
% measurements and d is the number of variables of the training data.
%
% Similarly, the test data has size Mxd, where M is the number of
% measurements and d is the number of variables of the test data.
% 3: For each observation in the test data, compute the euclidean distance
% to each observation in the training data.
% 4: Evaluate the 'k' nearest neighbours among them and store them in an
% array.
% 5: Apply the label that occurs most often among the k neighbours (majority vote).
% 5.1: In case of a tie, pick one of the tied classes at random.
% 6: Return the class label.
% 7: Compute the confusion matrix.
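% Worked example for step 3 (illustrative only): for two flower measurements
% x = [5.1 3.5 1.4 0.2] and y = [4.9 3.0 1.4 0.2], the euclidean distance is
% sqrt((5.1-4.9)^2 + (3.5-3.0)^2 + 0^2 + 0^2) = sqrt(0.04 + 0.25) ~ 0.5385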
clear all;
clc;
% step 1: load the Iris data and labels
load iris.mat;
% step 2: randomize the data and divide it into a 0.8:0.2 ratio for training
% and testing.
split=0;
count=0;
p=0.80;
while(count~=1)
numofobs=length(irisdata);
rearrangement= randperm(numofobs);
newirisdata=irisdata(rearrangement,:);
newirislabel=irislabel(rearrangement);
split = ceil(numofobs*p); % training fraction p = 0.8 defined above
count=count+1;
end
iristrainingdata = newirisdata(1:split,:);
iristraininglabel = newirislabel(1:split);
iristestdata = newirisdata(split+1:end,:);
originallabel = newirislabel(split+1:end);
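% With the standard 150-observation Iris set in iris.mat (an assumption) and
% p = 0.8, split = ceil(150*0.8) = 120, so iristrainingdata holds 120 rows and
% iristestdata the remaining 30.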
numoftrainingdata = size(iristrainingdata,1);
numoftestdata = size(iristestdata,1);
% steps 3-6: classify every test observation with the KNN_ function defined below (k = 3)
[predicted_labels,nn_index,accuracy] = KNN_(3,iristrainingdata,iristraininglabel,iristestdata,originallabel);
% step 7: confusion matrix of true vs predicted labels (confusionmat is in the Statistics and Machine Learning Toolbox)
confusionmatrix = confusionmat(originallabel,predicted_labels);
function [predicted_labels,nn_index,accuracy] = KNN_(k,data,labels,t_data,t_labels)
%KNN_: classification using the k-nearest neighbours algorithm. The nearest
%neighbour search uses the euclidean distance.
%Usage:
% [predicted_labels,nn_index,accuracy] = ...
%     KNN_(3,iristrainingdata,iristraininglabel,iristestdata,originallabel)
% predicted_labels = KNN_(3,iristrainingdata,iristraininglabel,iristestdata)
%Input:
% - k: number of nearest neighbors
% - data: (NxD) training data; N is the number of samples and D is the
% dimensionality of each data point
% - labels: training labels
% - t_data: (MxD) testing data; M is the number of data points and D
% is the dimensionality of each data point
% - t_labels: testing labels (default = [])
%Output:
% - predicted_labels: the predicted labels based on the k-NN
% algorithm
% - nn_index: the index of the nearest training data point for each
% testing sample (Mx1).
% - accuracy: if the testing labels are supplied, the accuracy of
% the classification is returned; otherwise it will be zero.
%checks
if nargin < 4
error('Too few input arguments.')
elseif nargin < 5
t_labels=[];
accuracy=0;
end
if size(data,2)~=size(t_data,2)
error('data should have the same dimensionality');
end
if mod(k,2)==0
error('to reduce the chance of ties, please choose odd k');
end
%initialization
predicted_labels=zeros(size(t_data,1),1);
ed=zeros(size(t_data,1),size(data,1)); %ed: (MxN) euclidean distances
ind=zeros(size(t_data,1),size(data,1)); %corresponding indices (MxN)
k_nn=zeros(size(t_data,1),k); %k-nearest neighbors for testing sample (Mxk)
%calc euclidean distances between each testing data point and the training
%data samples
for test_point=1:size(t_data,1)
for train_point=1:size(data,1)
%calc and store sorted euclidean distances with corresponding indices
ed(test_point,train_point)=sqrt(...
sum((t_data(test_point,:)-data(train_point,:)).^2));
end
[ed(test_point,:),ind(test_point,:)]=sort(ed(test_point,:));
end
%find the nearest k for each data point of the testing data
k_nn=ind(:,1:k);
nn_index=k_nn(:,1);
%get the majority vote (step 5); ties between classes are broken at random (step 5.1)
for i=1:size(k_nn,1)
options=unique(labels(k_nn(i,:)'));
votes=zeros(length(options),1);
for j=1:length(options)
votes(j)=length(find(labels(k_nn(i,:)')==options(j)));
end
winners=options(votes==max(votes));
%if several classes share the maximum vote count, pick one of them at random
predicted_labels(i)=winners(randi(length(winners)));
end
%calculate the classification accuracy
if ~isempty(t_labels)
accuracy=length(find(predicted_labels==t_labels))/size(t_data,1);
end
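For the "different parameter settings" mentioned in the summary, a minimal sketch of how the script section above (before the KNN_ definition) could be extended to compare several odd values of k, reusing the variables created there:

for k = [1 3 5 7]
[~,~,acc] = KNN_(k,iristrainingdata,iristraininglabel,iristestdata,originallabel);
fprintf('k = %d, accuracy = %.3f\n', k, acc);
end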