Advanced Data Mining Project Report: Banking Sector Data Analysis


Added on 2022/10/11

AI Summary

This report delves into the comparison of various data processing methods, feature detection, EDM, and algorithms relevant to the banking sector. With the rapid growth of data, data mining has become strategically important, necessitating the analysis of data processing, feature detection, EDM, and algorithms. The report discusses data processing techniques, including the Apriori algorithm and k-Means clustering, and their applications. It also explores Enterprise Data Management (EDM) and feature selection methods. The CRISP-DM methodology and Convolutional Neural Networks (CNN) are also covered, highlighting their roles in data mining projects and image classification. The report examines the application of these techniques within the context of the banking industry, emphasizing customer retention, fraud detection, and credit card approval. The provided information covers data processing, EDM, and feature selection techniques, along with the methodology and results of the analysis.

Running head: ADVANCED DATA MINING
Advanced Data Mining
Name of the Student
Name of the University
Author Note

Executive Summary
Data has become one of the most valuable assets of modern organizations, and its volume grows
substantially every day, whether in banking or on social media platforms. Consider banking
worldwide: there are more than thirty thousand banks safeguarding money, and they need to store
huge amounts of data, which is why data mining is required. Data mining is the practice of
examining large pre-existing databases in order to generate new information. This report compares
data processing, feature selection, Enterprise Data Management (EDM) and the related algorithms.
The comparison will also help in supporting finance and banking systems.

Table of Contents
Introduction
Discussion
    Related Work
    Data processing
    EDM
    Feature Selection
    Data Mining Methodology
    Result
Conclusion
References

Introduction
This report compares various approaches to data processing, feature detection and EDM, as well
as the associated algorithms. For organizations such as banks, data mining techniques are
becoming strategically important because of the massive growth in the amount of data, whether in
the banking industry or on social media platforms. Over time, the banking sector has undergone
wide-ranging changes and has realized the importance of creating a knowledge base (Akerkar
2013). To manage the growing amount of data, data mining should be taken into consideration, and
the feature detection, data processing and EDM algorithms are therefore analyzed.
Discussion
Related Work
Over the past few years, much research has been done on classification and recognition
approaches. The use of nearest-neighbor classification has also been considered. Feeding the
input images directly into the network improved the recognition rate and is suitable for
multiple identification. An efficient recognition approach using CNN has been proposed which
included feature extraction, had rising values, and adapted the Faster Region-based CNN (Faster
R-CNN) method. As a result, deployment of the recognition process became faster.
The concepts of data mining are important for all sectors, including banking, where they are
becoming more strategically important day by day. They are also important for many kinds of
business organizations (Armbrust et al. 2015). Data mining involves analyzing data from
different perspectives and summarizing it into valuable information.
Data processing
A data processing system is a combination of processes, people and machines that produces a
defined set of outputs for a given set of inputs (Batchu, Mishra and Rege 2014). The inputs and
outputs are termed data or information. Data can also be described as facts whose meaning
depends on the relation of the interpreter to the system (Bolón-Canedo et al. 2013). Data
processing involves the collection of data as well as data input. The major sources of abundant
data include the web, e-commerce, remote sensing and transactions.
Apriori algorithm: The Apriori algorithm performs association analysis. It carries out market
basket analysis by discovering items that co-occur, known as the frequent itemsets within a
given set of transactions (Caggiano et al. 2016). The algorithm is also responsible for finding
association rules. It finds rules whose confidence is higher than a specified minimum confidence
and whose support is higher than a specified minimum support.
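The idea of minimum support and minimum confidence can be sketched in a few lines of Python.
This is a minimal illustration only, not the implementation used in this report: the function
names, the thresholds and the banking product baskets are all made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every individual item
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep candidates contained in a large enough share of transactions
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset
        k += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules above min_confidence."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules


# Hypothetical baskets: products held by five customers (illustrative data only)
baskets = [
    {"savings", "credit_card"},
    {"savings", "credit_card", "loan"},
    {"savings", "loan"},
    {"credit_card"},
    {"savings", "credit_card"},
]
frequent = apriori(baskets, min_support=0.3)
found = association_rules(frequent, min_confidence=0.7)
```

With these example thresholds, the pair of credit card and loan is discarded because it appears
in only one of the five baskets, while the rule "loan implies savings" is kept because every
customer with a loan also holds a savings account.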
k-Means clustering: The k-Means algorithm performs clustering. It is a distance-based clustering
algorithm in which the data is divided into a predetermined number of clusters (Caggiano et al.
2015). Each cluster has a single centroid, and the cases in a cluster are those situated closest
to that centroid.
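A minimal sketch of this procedure in plain Python is given below. It assumes squared Euclidean
distance and random initial centroids; the function name, the parameters and the sample points
are hypothetical and serve only to illustrate the assign-and-update loop.

```python
import random


def k_means(points, k, iterations=100, seed=0):
    """Cluster tuples of numbers into k groups; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


# Illustrative two-dimensional data with two visibly separate groups
low = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
high = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(low + high, k=2)
```

The number of clusters k has to be chosen before the algorithm runs, which is what
"predetermined" refers to in the description above.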

Data mining is used for customer retention, fraud prevention, fraud detection, credit card
approval and many more.
EDM
EDM is defined as the ability of an organisation to define its data precisely, integrate it
easily and retrieve it effectively for both internal and external communications and
applications (Chandrashekar and Sahin 2014). It focuses on creating content that is consistent,
accurate and transparent.
EDM stands for Enterprise Data Management and comprises a host of capabilities (Hazen et al.
2014). Enabling these capabilities requires several components. The first is a data management
vision, in which the organisation describes the core values on which its enterprise data
management program is based. The second is the set of data management goals (Kreinovich et al.
2013), that is, the goals of the enterprise data management program, which should be related to
the strategic business goals.
The remaining components are the governance model, issue management and resolution, and control
and monitoring (Li et al. 2018). The capabilities of enterprise data management include critical
data inventory, data integration, data profiling, data quality, metadata management (metadata
being information about the data), master data management, reference data management and data
privacy (Macfarlane et al. 2017). Banking sector data is kept according to EDM, including the
various steps mentioned above.

Feature Selection
Feature selection is a method in which the features that contribute most to the variable of
interest or the predicted output are selected, either manually or automatically (Peralta et al.
2015). Feature selection is very important because irrelevant features in the data can decrease
the accuracy of the model by making it learn from features that are irrelevant.
Feature selection is also regarded as a vital pre-processing step. It is used in text
classification, where it helps to solve problems of dimensionality (Pradeep, Das and
Kizhekkethottam 2015). However, most metrics evaluate each feature separately and completely
ignore the redundancy among features. The most important steps in the feature selection process
are subset generation, subset evaluation, a stopping criterion and validation.
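The steps above can be sketched as a greedy forward search. The code below is an illustrative
assumption rather than the method used in this report: it scores each candidate subset with the
leave-one-out accuracy of a 1-nearest-neighbour classifier, and the function names, the
`min_gain` threshold and the toy data are all hypothetical. The final validation step, which
would re-score the chosen subset on held-out data, is not shown.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `subset` columns."""
    correct = 0
    for i, row in enumerate(rows):
        # Find the closest other row using only the selected features
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in subset),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def forward_selection(rows, labels, evaluate, min_gain=1e-9):
    """Greedily add the feature that most improves the evaluation score."""
    n_features = len(rows[0])
    selected, best_score = [], 0.0
    while len(selected) < n_features:
        # Subset generation: extend the current subset by one unused feature
        candidates = [selected + [f] for f in range(n_features) if f not in selected]
        # Subset evaluation: score every candidate subset
        scored = [(evaluate(rows, labels, c), c) for c in candidates]
        score, subset = max(scored, key=lambda pair: pair[0])
        # Stopping criterion: stop when no candidate improves the score
        if score - best_score < min_gain:
            break
        selected, best_score = subset, score
    return selected, best_score


# Toy data: column 0 separates the two classes, column 1 is irrelevant noise
rows = [(1.0, 5.0), (1.2, 1.0), (0.9, 9.0), (3.0, 5.2), (3.1, 0.8), (2.9, 9.1)]
labels = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(rows, labels, loo_accuracy)
```

On this toy data the search keeps only column 0, because adding the noisy column 1 lowers the
score. Unlike the single-feature metrics mentioned above, a subset-based search of this kind
evaluates features jointly, at the cost of more computation.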
Banking sector data will go through data processing (Stark 2015). It will also go through
enterprise data management, and feature selection will be applied to select the features under
which the data is stored.
Data Mining Methodology
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The CRISP-DM methodology
provides a structured approach to planning a data mining project, and it is a robust
methodology. The model can be regarded as an idealised sequence of events: in practice, tasks
can be executed in a different order, and it is often necessary to backtrack to previous tasks
and repeat certain actions. The model does not attempt to capture every viable route through the
data mining process. Its phases are business understanding, data understanding, data
preparation, modelling, evaluation and deployment.
The first phase of the CRISP-DM process focuses on determining the business objectives, that
is, what is to be accomplished from a business perspective. The desired outputs of this phase
are the stated objectives, the business success criteria and a project plan. CRISP-DM is a
dominant process framework for data mining.
CNN stands for convolutional neural network, also called ConvNet. It is a special kind of
multi-layered neural network designed to recognize visual patterns directly from pixel images
with a minimal amount of preprocessing. CNNs are similar to ordinary neural networks. A CNN
consists of a structured series of layers of three kinds: the convolution layer, the rectified
linear unit (ReLU) and the pooling layer. The convolution layer extracts the features of an
image using filters.
The rectified linear unit replaces all negative values in the feature map with zero. The
pooling layer then downsamples the feature map, which reduces its dimensionality. CNNs consist
of neurons that have learnable weights and biases. The CNN architecture makes the explicit
assumption that the inputs are images, which allows certain properties to be encoded into the
architecture.
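The three layer types can be illustrated on a tiny greyscale image in plain Python. This is a
sketch under simplifying assumptions (a single channel, a single hand-written filter, stride 1
for the convolution and non-overlapping 2 by 2 max pooling); the function names and the example
values are made up, and a real network would learn its filters during training.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products.

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace every negative value in the feature map with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Downsample by keeping the maximum of each non-overlapping size x size block."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# A 5x5 image containing a vertical bright stripe (illustrative values)
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# A hand-written 2x2 filter that responds to bright-to-dark vertical edges
kernel = [[1, -1], [1, -1]]

features = convolve2d(image, kernel)  # 4x4 feature map
activated = relu(features)            # negative responses set to zero
pooled = max_pool(activated)          # 2x2 map after downsampling
```

Here the convolution turns the 5 by 5 image into a 4 by 4 feature map, and pooling reduces it
further to 2 by 2, which shows that the reduction in size comes from the pooling step while the
rectified linear unit leaves the shape unchanged.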
CNN is a branch of neural networks that is very effective in areas such as classification and
image recognition. At the end of the CNN, the output of the last pooling layer acts as the input
to the fully connected layer. One or more such layers can exist; fully connected means that
every node in the first layer is directly connected to every node in the second layer. In the
field of machine learning, CNNs are complex feed-forward neural networks. They are used for
image classification and image recognition because of their very high accuracy. The dataset has
over fifty-five thousand images, along with a test set of over ten thousand images and a
validation set of over five thousand images.
In a CNN, the convolution layers extract the features. CNNs are also used in speech
recognition, natural language processing and video analysis. In CNN image classification, an
input image is taken, processed and classified under categories such as dogs, cats or lions.
The computer sees an image as an array of pixels whose size depends on the resolution of the
image.
CNN improves the accuracy of image classification. Instead of the entire image being fed in as
a single array of numbers, the image is broken into a number of tiles, and the machine makes a
prediction for each tile. A CNN follows a hierarchical model that builds the network like a
funnel and ends in a fully connected layer, where the neurons are connected to one another and
the output is processed.
Result
For the CNN experiment, the input image size is set to 100 by 100 by 3 pixels because of
computer memory constraints. An image from the database is displayed together with its
identified category. The label is shown in green if the identification is true and in blue if
it is false. The network takes a raw colour image as input, and the features are extracted
automatically by the layers. For performance comparison, further layers can be stacked. In the
convolution layer, the stride size gives the number of columns that the sliding window skips;
this value is varied because it tends to affect recognition performance. The maxepochs value
specifies the maximum number of epochs for the training process. The initial learning rate
determines how strongly the weights are adjusted during training.
The validation accuracy is 100%, which gives a final accuracy of 1. The time taken to display
the output images is 5 seconds. Training options need to be specified for the CNN. An epoch is
one full training cycle over the whole dataset. The maximum number of epochs defined for the CNN
is 10, with an initial learning rate of 0.001. The frequency for the CNN is set to 30
iterations.
Conclusion
Nowadays, data is available in various forms such as text, pictures, charts and graphs. Data
mining is used to turn raw data into useful information. The data is processed in various ways
using various algorithms (Tang, Alelyani and Liu 2014). The banking sector uses a huge amount of
data from different individuals, and this data therefore needs to be processed (Zaharia et al.
2016). Banking sector data must be managed according to the components and capabilities of
enterprise data management. Features also need to be selected in order to store such large
volumes of data. This can further be used to enhance the recognition of various categories in
future.

References
Akerkar, R. ed., 2013. Big data computing. CRC Press.
Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T.,
Franklin, M.J., Ghodsi, A. and Zaharia, M., 2015, May. Spark SQL: Relational data processing in
Spark. In Proceedings of the 2015 ACM SIGMOD international conference on management of
data (pp. 1383-1394). ACM.
Batchu, S.K., Mishra, A.K. and Rege, O.U., Mobile Iron Inc, 2014. Selective management of
mobile device data in an enterprise environment. U.S. Patent 8,695,058.
Bolón-Canedo, V., Sánchez-Maroño, N. and Alonso-Betanzos, A., 2013. A review of feature
selection methods on synthetic data. Knowledge and information systems, 34(3), pp.483-519.
Caggiano, A., Perez, R., Segreto, T., Teti, R. and Xirouchakis, P., 2016. Advanced sensor signal
feature extraction and pattern recognition for wire EDM process monitoring. Procedia CIRP, 42,
pp.34-39.
Caggiano, A., Teti, R., Perez, R. and Xirouchakis, P., 2015. Wire EDM Monitoring for Zero-
Defect Manufacturing based on Advanced Sensor Signal Processing. Procedia CIRP, 33,
pp.315-320.
Chandrashekar, G. and Sahin, F., 2014. A survey on feature selection methods. Computers &
Electrical Engineering, 40(1), pp.16-28.
Hazen, B.T., Boone, C.A., Ezell, J.D. and Jones-Farmer, L.A., 2014. Data quality for data
science, predictive analytics, and big data in supply chain management: An introduction to the
problem and suggestions for research and applications. International Journal of Production
Economics, 154, pp.72-80.
Kreinovich, V., Lakeyev, A.V., Rohn, J. and Kahl, P.T., 2013. Computational complexity and
feasibility of data processing and interval computations (Vol. 10). Springer Science & Business
Media.
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R.P., Tang, J. and Liu, H., 2018. Feature
selection: A data perspective. ACM Computing Surveys (CSUR), 50(6), p.94.
Macfarlane, R., Muir, D.W., Boicourt, R.M., Kahler III, A.C. and Conlin, J.L., 2017. The NJOY
Nuclear Data Processing System, Version 2016 (No. LA-UR-17-20093). Los Alamos National
Lab. (LANL), Los Alamos, NM (United States).
Peralta, D., del Río, S., Ramírez-Gallego, S., Triguero, I., Benitez, J.M. and Herrera, F., 2015.
Evolutionary feature selection for big data classification: A mapreduce approach. Mathematical
Problems in Engineering, 2015.

Pradeep, A., Das, S. and Kizhekkethottam, J.J., 2015, February. Students dropout factor
prediction using EDM techniques. In 2015 International Conference on Soft-Computing and
Networks Security (ICSNS) (pp. 1-7). IEEE.
Stark, J., 2015. Product lifecycle management. In Product lifecycle management (Volume 1) (pp.
1-29). Springer, Cham.
Tang, J., Alelyani, S. and Liu, H., 2014. Feature selection for classification: A review. Data
classification: Algorithms and applications, p.37.
Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J.,
Venkataraman, S., Franklin, M.J. and Ghodsi, A., 2016. Apache Spark: a unified engine for big
data processing. Communications of the ACM, 59(11), pp.56-65.