Comparative Analysis of PCA and SVD for Big Data Reduction

Summary
This report presents a comparative case study of PCA (Principal Component Analysis) and SVD (Singular Value Decomposition) for dimensionality reduction in big data. The study highlights the need for dimensionality reduction due to the exponential growth of data and the limitations of existing algorithms. The report discusses how PCA and SVD are employed to compress datasets by reducing dimensions while preserving essential information, thereby improving the accuracy of classifiers and decreasing computational costs. The analysis compares PCA and SVD based on accuracy and mean square error, revealing that SVD is often more efficient due to the avoidance of covariance matrix computations. Furthermore, the report suggests future modifications for enhancing the accuracy and reliability of these techniques in managing large-scale data, emphasizing the applicability of PCA and SVD in information technology applications where the matrix rank is smaller than the matrix size. The report concludes with a bibliography of relevant research papers.
Dimensionality Reduction Using PCA and SVD in Big Data:
A Comparative Case Study
Name of the Student
Name of the University
Author’s Note
Answer to Question 1:
There is a need to manage the dimensions of big data sets because data is growing at a rate of
roughly 40 percent per year. Researchers have projected a ten-fold increase in data by 2020,
with the main sources being sales and financial transactions (about 56 percent of databases),
followed by the leads and sales contracts generated from customer databases (about 51
percent). On average, the data held by an enterprise grows at a rate of 33 percent. The
healthcare industry already accounts for a large part of the digital universe and is expected to
grow exponentially, so current storage and processing technology needs to be updated to
manage this growth and reduce the complexity of the situation (Tanwar, Ramani, & Tyagi,
2017). The algorithms currently available for data management are not able to process and
manage the present scale of Big Data. Data redundancy also needs to be managed by
performing cleaning operations and maintaining data quality.
To clean the data, the Dimensionality Reduction (DR) technique is applied. DR is the
procedure of converting a dataset with a very large number of dimensions into a subset with
fewer dimensions while ensuring that as little information as possible is lost. Dimensionality
reduction is mainly used to improve the predictive accuracy of classifiers and to decrease
computational cost, and it yields better-quality features for machine learning problems such
as regression and classification.
Answer to Question 2:
The two techniques that are used in the paper for dimension reduction are as follows:
PCA (Principal Component Analysis), and
SVD (Singular Value Decomposition)
PCA – PCA takes a dataset consisting of a set of tuples and treats each tuple as a point in a
high-dimensional space. It searches for the directions along which the tuples line up in
order to form a table of data that holds only the vital information. Only the important
information is added to this table, which compresses the size of the dataset, and the
description of the dataset is simplified for analysing the factors and the structure of the
data. To create the table, the data matrix is considered and a search is performed for the
eigenvector that maximizes the variance of the raw data (Tanwar, Ramani, & Tyagi, 2017).
The axis associated with the second eigenvector is taken orthogonal to the first axis. The
higher-dimensional data is then represented by projecting it onto the essential axes, which
are those associated with the largest eigenvalues. Finally, the approximation is assessed by
comparing the lower-dimensional data with the raw data.
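
As an illustration of these steps (this sketch is not taken from the paper; it is a minimal
NumPy example, and names such as pca_reduce are hypothetical), the data is centred, the
covariance matrix is decomposed into eigenvectors, and the data is projected onto the k
eigenvectors with the largest eigenvalues:

import numpy as np

def pca_reduce(X, k):
    # X is an (n_samples x n_features) data matrix; k is the number of dimensions to keep.
    # Centre the data so that every feature has zero mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (features vary along columns).
    cov = np.cov(X_centered, rowvar=False)
    # Eigen-decomposition; eigh is appropriate because the covariance matrix is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Order the axes by decreasing eigenvalue, i.e. by decreasing variance.
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:k]]
    # Project the centred data onto the k most important axes.
    return X_centered @ components, components

# Example: compress 50-dimensional data down to 5 dimensions.
X = np.random.rand(200, 50)
X_reduced, components = pca_reduce(X, k=5)
print(X_reduced.shape)   # (200, 5)
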
SVD – SVD is used to identify the dimensions along which the data shows the highest
variation. It makes it possible to obtain the best approximation of the raw data with fewer
dimensions. It gives an exact representation of any matrix and allows the less essential
dimensions to be eliminated so that an approximate representation with the desired number
of dimensions can be created. In this methodology an m x n matrix A is decomposed into
three matrices U, S and V such that A = U S V^T, where
U is an m x r matrix,
S is an r x r diagonal matrix of singular values, and
V is an n x r matrix.
The number of retained singular vectors is then reduced to keep only as much variance as is
actually needed, and by discarding the smallest singular values noise can be eliminated from
the raw set of data.
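
A minimal sketch of this truncated decomposition is shown below (again an illustration
only, assuming NumPy; the helper name svd_reduce and the chosen rank are hypothetical):

import numpy as np

def svd_reduce(A, r):
    # Approximate an m x n matrix A by a rank-r matrix using truncated SVD.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
    U_r = U[:, :r]            # m x r
    S_r = np.diag(s[:r])      # r x r, singular values on the diagonal
    V_r = Vt[:r, :].T         # n x r
    # Rank-r approximation of the original matrix.
    A_approx = U_r @ S_r @ V_r.T
    return U_r, S_r, V_r, A_approx

A = np.random.rand(100, 40)
U_r, S_r, V_r, A_approx = svd_reduce(A, r=10)
print(A_approx.shape)   # (100, 40), but only of rank 10
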
Answer to Question 3:
The paper compares the two dimension reduction techniques on the basis of accuracy and
mean square error. The accuracy of both SVD and PCA decreases as the number of attributes
increases, but the processing time of PCA is much higher than that of SVD. In the case of
mean square error, SVD has a higher mean square error than PCA (Tanwar, Ramani, &
Tyagi, 2017). Overall, SVD is found to be more efficient than PCA because it does not
require computation of the covariance matrix, which can introduce numerical error.
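
The reported accuracy and mean square error figures come from the authors' own
experiments. Purely as an illustration of how such a comparison can be set up (synthetic
random data, a hypothetical rank k, NumPy only; the numbers printed do not reproduce the
paper's results), the mean square error between the raw data and each low-rank
reconstruction can be computed as follows:

import numpy as np

def mse(A, B):
    # Mean square error between two matrices of the same shape.
    return np.mean((A - B) ** 2)

rng = np.random.default_rng(0)
X = rng.random((500, 60))   # synthetic data: 500 samples with 60 attributes
k = 10                      # hypothetical number of dimensions to keep

# PCA: eigen-decomposition of the covariance matrix, then project and reconstruct.
Xc = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
W = eigenvectors[:, np.argsort(eigenvalues)[::-1][:k]]
X_rec_pca = (Xc @ W) @ W.T + X.mean(axis=0)

# SVD: rank-k approximation applied directly to the raw matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_rec_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("PCA reconstruction MSE:", mse(X, X_rec_pca))
print("SVD reconstruction MSE:", mse(X, X_rec_svd))
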
Answer to Question 4:
In future, further modifications can be made to SVD and PCA to increase their accuracy
and reliability for managing large-scale, growing data. In the applications used in
information technology, the rank of the matrix is usually smaller than the size of the matrix,
and in such cases PCA and SVD can be applied to obtain good approximate results. A
split-and-combine methodology can also be applied for the management of growing data.
Bibliography
Azar, A. T., & Hassanien, A. E. (2015). Dimensionality reduction of medical big data using
neural-fuzzy classifier. Soft Computing, 19(4), 1115-1127.
Chen, C. P., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and
technologies: A survey on Big Data. Information Sciences, 275, 314-347.
Franke, B., Plante, J. F., Roscher, R., Lee, E. S. A., Smyth, C., Hatefi, A., ... & Hoffman, M.
M. (2016). Statistical inference, learning and models in big data. International
Statistical Review, 84(3), 371-389.
García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J. M., & Herrera, F. (2016). Big data
preprocessing: methods and prospects. Big Data Analytics, 1(1), 9.
Kuang, L., Hao, F., Yang, L. T., Lin, M., Luo, C., & Min, G. (2014). A tensor-based
approach for big data representation and dimensionality reduction. IEEE Transactions
on Emerging Topics in Computing, 2(3), 280-291.
Tanwar, S., Ramani, T., & Tyagi, S. (2017, August). Dimensionality Reduction Using PCA
and SVD in Big Data: A Comparative Case Study. In International Conference on
Future Internet Technologies and Trends (pp. 116-125). Springer, Cham.