
Objective of the Data Collection Process

Machine Learning

Part 1: Understanding the data

For this research two data sets were given, namely the training data and the test data. The training data is used first, and the information learned from it is then applied to the test data.

a) The main objective of the data collection process is to identify the actions carried out by a person in order to understand their behaviour. The collected data can be used to understand the behaviour of humans and their social context. Since it is not feasible to do this manually for a large data set, different statistical tools on a computer system are used.

b) For the current research the data set includes different types of human activities: walking, walking upstairs, sitting, standing and lying down. The data also shows that 30 subjects performed these activities.

c) The total number of instances in the current data set is 7352 for the training data and 2947 for the test data, and there are 561 features representing each instance. To find the number of instances and features the following code was used in Python:

    # number of observations & features
    print train.shape
    print test.shape

    (7352, 563)
    (2947, 563)

(The shape output reports 563 columns: the 561 features plus, presumably, the subject identifier and the activity label.)

d) In this paper a multiclass Support Vector Machine approach was used for the analysis. This method was used to classify the data collected from the smartphone. Data was collected from 30 volunteers aged 19 to 48, and the volunteers were instructed to follow the protocol of activities. The results of the paper show that the maximum accuracy achieved is 96%.

Part 2: K-Nearest Neighbour Classification

The K-Nearest Neighbour classification in Python was performed using the following process. As mentioned in the assignment, 10-fold cross validation was used with the value of K lying between 1 and 50. First a classifier with K = 10 was fitted:

    n_neighbors = 10
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights='uniform')
    clf.fit(X_train, y_train)

    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
        metric_params=None, n_neighbors=10, p=2, weights='uniform')
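For reference, here is a minimal, self-contained sketch of the same steps in current Python 3 / scikit-learn. It assumes the data is available as CSV files named train.csv and test.csv, with the activity label in an 'Activity' column and the subject identifier in a 'subject' column; these file and column names are assumptions, not taken from the assignment.

    # Minimal sketch (assumed file/column names): load the HAR data and
    # fit the baseline K = 10 classifier with the current scikit-learn API.
    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    train = pd.read_csv('train.csv')   # assumed file name
    test = pd.read_csv('test.csv')     # assumed file name
    print(train.shape)                 # e.g. (7352, 563)
    print(test.shape)                  # e.g. (2947, 563)

    # Separate the 561 features from the label/identifier columns (assumed names).
    X_train = train.drop(columns=['subject', 'Activity'])
    y_train = train['Activity']
    X_test = test.drop(columns=['subject', 'Activity'])
    y_test = test['Activity']

    clf = KNeighborsClassifier(n_neighbors=10, weights='uniform')
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))   # mean accuracy on the test set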
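One caveat before tuning on the f1 metric: the activity label is multiclass, and current versions of scikit-learn raise an error if f1_score is called on multiclass targets without an explicit averaging scheme (older releases defaulted to a weighted average). A multiclass-safe version of the scoring call would look like this:

    # f1 for multiclass targets needs an explicit averaging scheme.
    from sklearn.metrics import f1_score

    pred_y = clf.predict(X_test)
    print(f1_score(y_test, pred_y, average='weighted'))  # per-class f1, weighted by support

The same applies to the grid search below: on a current install, scoring='f1' would need to become scoring='f1_weighted' (or another multiclass variant).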
Now, for choosing the best value of K on the f1 metric, the following code was run:

    # f1
    n_neighbors = 10
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights='uniform')
    clf.fit(X_train, y_train)

    from sklearn.metrics import f1_score
    from sklearn.grid_search import GridSearchCV

    pred_y = clf.predict(X_train)
    print f1_score(y_train, pred_y)
    0.848535189907

    pred_y = clf.predict(X_test)
    print f1_score(y_test, pred_y)
    0.0

    parameters = {'n_neighbors': range(1, 51)}
    knn = sklearn.neighbors.KNeighborsClassifier()
    clf = sklearn.grid_search.GridSearchCV(knn, parameters, cv=10, scoring='f1')
    clf.fit(X_train, y_train)

    GridSearchCV(cv=10, error_score='raise',
        estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
            metric='minkowski', metric_params=None, n_neighbors=5, p=2,
            weights='uniform'),
        fit_params={}, iid=True, loss_func=None, n_jobs=1,
        param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
            13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
            29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
            45, 46, 47, 48, 49, 50]},
        pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='f1',
        verbose=0)

Above, the GridSearchCV module from sklearn is used to cross-validate the model and tune its parameters. All arguments for the KNN model are passed to the grid search function through the parameter grid, and the model was tuned for K ranging from 1 to 50 with 10-fold cross validation. The best estimator was then retrieved as follows:

    print(clf.best_estimator_)

    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
        metric_params=None, n_neighbors=1, p=2, weights='uniform')
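Note that the sklearn.grid_search module used above was removed in scikit-learn 0.20 in favour of sklearn.model_selection, and the grid_scores_ attribute used for plotting below was replaced by cv_results_. A minimal sketch of the same search and plot with the current API, assuming the X_train/y_train arrays from above and the weighted f1 scorer discussed earlier:

    # Same 10-fold search over K = 1..50 with the current scikit-learn API.
    import matplotlib.pyplot as plt
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    parameters = {'n_neighbors': range(1, 51)}
    clf = GridSearchCV(KNeighborsClassifier(), parameters, cv=10,
                       scoring='f1_weighted')
    clf.fit(X_train, y_train)
    print(clf.best_estimator_)
    print(clf.best_params_)            # e.g. {'n_neighbors': 1}

    # Per-K mean cross-validation scores now live in cv_results_.
    mean_scores = clf.cv_results_['mean_test_score']
    plt.plot(range(1, 51), mean_scores)
    plt.xlabel('K')
    plt.ylabel('mean CV f1 (weighted)')
    plt.show()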
The plot of the cross-validation accuracy with respect to K can be obtained with the following code:

    # plot of cross validation accuracy with respect to K
    neighbour_list = []
    mean_list = []
    std_list = []
    for neighbour, mean, std in clf.grid_scores_:
        neighbour_list.append(neighbour)
        mean_list.append(mean)
        std_list.append(np.std(std))

    x = np.arange(1, len(neighbour_list) + 1)
    plt.plot(mean_list)
    plt.show()

Now, on the basis of the cross-validation accuracy, the value of K has been selected as 1, and the f1 score, confusion matrix and recall score were obtained using the following code in Python:

    n_neighbors = 1
    clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors, weights='uniform')
    clf.fit(X_train, y_train)

    from sklearn.metrics import f1_score
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import recall_score

    pred_y = clf.predict(X_test)
    print f1_score(y_test, pred_y)
    print confusion_matrix(y_test, pred_y)

Part 3: Multiclass Logistic Regression with Elastic Net

In this section the elastic-net-regularised logistic regression classifier was built using the following procedure:

    # part 3: logistic regression with elastic net
    from sklearn import linear_model

    parameters = {'alpha': [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03],
                  'l1_ratio': [0, 0.15, 0.5, 0.7, 1]}
    logistic = linear_model.SGDClassifier(loss='log')
    clf_2 = sklearn.grid_search.GridSearchCV(logistic, parameters, cv=10,
                                             scoring='f1')
    clf_2.fit(X_train, y_train)

    GridSearchCV(cv=10, error_score='raise',
        estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None,
            epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15,
            learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,
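The captured output above is cut off. For reference, a minimal sketch of the same elastic-net search against the current scikit-learn API; note that loss='log' has been renamed loss='log_loss', and that penalty='elasticnet' must be set explicitly for the l1_ratio values in the grid to take effect (SGDClassifier defaults to an L2 penalty):

    # Elastic-net-regularised logistic regression via SGD, current API.
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    parameters = {'alpha': [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03],
                  'l1_ratio': [0, 0.15, 0.5, 0.7, 1]}
    # penalty='elasticnet' makes l1_ratio meaningful; loss='log_loss'
    # gives logistic regression.
    logistic = SGDClassifier(loss='log_loss', penalty='elasticnet')
    clf_2 = GridSearchCV(logistic, parameters, cv=10, scoring='f1_weighted')
    clf_2.fit(X_train, y_train)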
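Once the search finishes, the tuned model would typically be evaluated on the held-out test set in the same way as in Part 2. A short sketch, again using the weighted f1 average for the multiclass labels:

    # Evaluate the refit best estimator on the test data.
    from sklearn.metrics import confusion_matrix, f1_score

    pred_y = clf_2.best_estimator_.predict(X_test)
    print(clf_2.best_params_)                         # tuned alpha and l1_ratio
    print(f1_score(y_test, pred_y, average='weighted'))
    print(confusion_matrix(y_test, pred_y))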
