
Running head: INTERNET TECHNOLOGY AND STRATEGY

Dimensionality Reduction Using PCA and SVD in Big Data:

A Comparative Case Study

Name of the Student

Name of the University

Author’s Note


Answer to Question 1:

There is a need to manage big data set dimensions because data is growing at a rate of about 40 percent per year. Researchers projected a tenfold increase in data by 2020, with the main sources being sales and financial transactions (making up about 56 percent of databases), followed by leads and sales contracts generated from customer databases (about 51 percent). On average, each enterprise's data grows at a rate of 33 percent. The healthcare industry currently accounts for a large share of the digital universe and is expected to grow exponentially, so current storage and processing technology must be updated to manage this growth and minimize the complexity of the situation (Tanwar, Ramani, & Tyagi, 2017). The algorithms currently available for data management would not be able to process and manage Big Data in its present form. Data redundancy also needs to be managed by performing cleaning operations and maintaining the quality of the data.

To clean the data, the Dimensionality Reduction (DR) technique needs to be applied. DR is the procedure of converting a data set with a vast number of dimensions into a data subset with fewer dimensions, while ensuring that no significant information is lost. Dimension reduction is mainly used to improve the prediction accuracy of classifiers and to decrease computation cost. It also helps machine learning problems obtain quality features for regression and classification.

Answer to Question 2:

The two techniques used in the paper for dimension reduction are as follows:

PCA (Principal Component Analysis), and

SVD (Singular Value Decomposition)


PCA – PCA takes a data set comprising a set of tuples and treats each tuple as a point in a high-dimensional space. PCA also searches for the directions along which the tuples line up, forming a table of data that contains the vital information. Only the important information is added to the table, compressing the data set size. The description of the data set is also simplified for analysing the factors and the structure of the data. To create the table, a matrix is considered and a search is performed for the eigenvector that maximizes the variance of the raw data (Tanwar, Ramani, & Tyagi, 2017). The axis associated with the second eigenvector is taken orthogonal to the first axis. The higher-dimensional data is then displayed by projecting it onto the essential axes, which are associated with the largest eigenvalues. Finally, the data is estimated by comparing the lower-dimensional data with the raw data.
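The PCA procedure described above can be illustrated with a minimal numpy sketch (this is not the paper's own code; the toy data set and the choice of two retained components are assumptions for illustration): centre the data, build the covariance matrix, keep the eigenvectors with the largest eigenvalues, project, and reconstruct to compare against the raw data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data set: 100 tuples with 5 correlated attributes.
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))

# Centre the data and build the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix (eigh returns ascending order).
eigvals, eigvecs = np.linalg.eigh(cov)

# Keep the axes associated with the largest eigenvalues (here k = 2).
k = 2
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]

# Project the raw data onto the essential axes ...
Z = Xc @ top                       # compressed table, shape (100, 2)

# ... then reconstruct and compare against the raw data.
X_hat = Z @ top.T + X.mean(axis=0)
mse = np.mean((X - X_hat) ** 2)
```

Because the toy data is constructed to have rank 2, two components reconstruct it almost exactly; with real data the discarded axes carry some variance and the error is non-zero.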

SVD – SVD is used to distinguish the dimensions along which the data shows the highest variation. It permits obtaining the best estimation of the raw data with fewer dimensions. An exact representation of any matrix is possible, and the less essential dimensions are eliminated to create an approximate representation with the desired number of dimensions. In this methodology an m x n matrix is decomposed into U, S and V:

U matrix = m x r

S = r x r

V = n x r

This is used to reduce the number of vectors to those needed to capture the required variance. By diminishing the number of vectors, noise can be eliminated from the raw data set.
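The truncated decomposition above can be sketched in numpy (again an illustrative assumption, not the paper's code): a noisy but nearly low-rank m x n matrix is decomposed, the factors are cut down to U (m x r), S (r x r) and V (n x r), and their product gives the best rank-r estimation, discarding the noise carried by the smaller singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 8, 6, 2
# A hypothetical m x n matrix that is low rank plus a little noise.
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
A_noisy = A + 0.01 * rng.normal(size=(m, n))

# Full decomposition, then truncate to the r largest singular values.
U, s, Vt = np.linalg.svd(A_noisy, full_matrices=False)
U_r = U[:, :r]          # m x r
S_r = np.diag(s[:r])    # r x r
V_r = Vt[:r, :].T       # n x r

# Best rank-r estimation of the raw data; the discarded vectors
# carry mostly noise.
A_hat = U_r @ S_r @ V_r.T
```

By the Eckart–Young theorem this truncated product is the closest rank-r matrix to the input, which is why dropping the trailing vectors removes noise while keeping the dimensions of highest variation.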


Answer to Question 3:

The paper compares the two dimension-reduction techniques on the basis of accuracy and mean square error. The accuracy of both SVD and PCA decreases as the number of attributes increases, but the processing time of PCA is much higher than that of SVD. In the case of mean square error, SVD has a higher mean square error than PCA (Tanwar, Ramani, & Tyagi, 2017). SVD is nonetheless found to be more efficient than PCA, as it does not require computing the covariance matrix, which can introduce numerical error.
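The point about the covariance matrix can be made concrete with a small numpy comparison (an illustrative sketch on random data, not the paper's experiment): PCA obtains the projection axes by eigendecomposing the covariance matrix, while SVD obtains the same axes directly from the centred data, skipping that intermediate step.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
Xc = X - X.mean(axis=0)
k = 3

# Route 1 (PCA): form the covariance matrix, then eigendecompose it.
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
W = vecs[:, np.argsort(vals)[::-1][:k]]
mse_pca = np.mean((Xc - Xc @ W @ W.T) ** 2)

# Route 2 (SVD): the same subspace straight from the centred data,
# with no covariance matrix to compute.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
mse_svd = np.mean((Xc - Xc @ Vt[:k].T @ Vt[:k]) ** 2)
```

On well-conditioned data the two reconstruction errors agree to machine precision; the difference the paper measures shows up in processing time and in the numerical error that forming the covariance matrix can add on harder inputs.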

Answer to Question 4:

In the future, further modifications can be made to SVD and PCA to increase their accuracy and reliability for managing large-scale growing data. In the applications used in information technology, the rank of a matrix is often much smaller than its size, and in such cases PCA and SVD can be applied to obtain approximate results. A split-and-combine methodology can also be applied for managing the growing data.


Bibliography

Azar, A. T., & Hassanien, A. E. (2015). Dimensionality reduction of medical big data using neural-fuzzy classifier. Soft Computing, 19(4), 1115-1127.

Chen, C. P., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

Franke, B., Plante, J. F., Roscher, R., Lee, E. S. A., Smyth, C., Hatefi, A., ... & Hoffman, M. M. (2016). Statistical inference, learning and models in big data. International Statistical Review, 84(3), 371-389.

García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data preprocessing: methods and prospects. Big Data Analytics, 1(1), 9.

Kuang, L., Hao, F., Yang, L. T., Lin, M., Luo, C., & Min, G. (2014). A tensor-based approach for big data representation and dimensionality reduction. IEEE Transactions on Emerging Topics in Computing, 2(3), 280-291.

Tanwar, S., Ramani, T., & Tyagi, S. (2017, August). Dimensionality reduction using PCA and SVD in big data: A comparative case study. In International Conference on Future Internet Technologies and Trends (pp. 116-125). Springer, Cham.
