Dublin Business School Telecom Churn Analysis and Classification

Added on 2022/08/15

AI Summary

This report presents a data mining project focused on analyzing customer churn within the telecommunications industry. Utilizing a dataset obtained from Kaggle, the project aims to classify customer churn rates and identify key factors influencing customer loyalty. The analysis employs the CRISP-DM methodology, encompassing business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The project uses R-studio for data analysis, including data cleaning, exploration, and the application of logistic regression. The report details the selection of variables, model building, and evaluation of different logistic regression models to determine the most accurate classification algorithm. The goal is to provide insights for the telecommunication company to retain customers and increase revenue by focusing on at-risk customer segments. The project also acknowledges the limitations of the chosen approach and references relevant literature to support its methodology.

Proposed Research Project Title:
TELECOMMUNICATION SERVICE CONSUMPTION
Business Problem
The dataset, a CSV file found through the Kaggle search
https://www.kaggle.com/search?q=customer+churn+dataset+datasetFileTypes%3Acsv, records how the
customers of a specific telecommunication company have responded to its services. The response
variable is the Churn column, which takes one of two values showing whether a customer has
churned or stayed loyal to the firm. Customers will be classified on the Churn column to see
which group each falls into (Kumar and Yadav, 2020).
The stakeholders include the customers whose records are to be analyzed, the employees and the
owners of the telecommunication company, and the competitors of that company.
The company's customer problem suits an analytics solution because classification lets the
company learn how many of its customers are not loyal. With that knowledge, the company can
rally its resources and its business acumen to keep the customers who would otherwise be lost
if nothing were done (Davis and Ryan, 2020).
Classifying customers brings several business benefits. First, the company will know which
customers are loyal to its services and which are not. It can then direct its resources to the
group that needs them, which avoids waste. This in turn helps retain customers and protect
revenue.

Analytics Problem
As pointed out above, the business aim is to retain the revenue that would otherwise be lost if
no effort is made to keep the customers who are likely to churn. Analytically, the response
(target) variable will be classified using the relevant independent variables to find out which
customers should receive the most attention. The classification algorithms will be evaluated
and tuned to improve how correctly they assign customers to the two categories (Meghyasi and
Rad, 2020).
The ABT (analytics base table) attributes are the variables available for developing the
classification models. They are: customerID, gender, SeniorCitizen, Partner, Dependents,
tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup,
DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling,
PaymentMethod, MonthlyCharges and TotalCharges. Not all of them will be used, as some will be
dropped while the models are built.
The assumption behind the algorithms built for this business problem is that the independent
variables, taken together, will give the most accurate classification of the cases under
consideration (Rai, Khandelwal and Boghey, 2020).
Success will be measured in two ways once the classification algorithm for the Churn column is
deployed. The first is the revenue gained and retained when customers who were planning to
churn are kept. The second is whether the classification algorithms reach a high level of
accuracy on the response column. Higher accuracy (measured through the confusion matrix) means
more cases are classified correctly and therefore less money is wasted on efforts to save
customers from churning (Martínez et al., 2020).
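As an illustration of the accuracy measure, the sketch below builds the four confusion-matrix
counts for a binary label and derives accuracy from them. It is written in Python for
illustration only (the report's own analysis is in R), and the function names and the six
example labels are hypothetical, not taken from the dataset.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, FP, TN and FN for a binary label such as Churn."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

def accuracy(counts):
    """Share of all cases that were classified correctly."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0

# Hypothetical labels for six customers (assumption, not real data)
actual = ["Yes", "No", "No", "Yes", "No", "No"]
predicted = ["Yes", "No", "Yes", "No", "No", "No"]

cm = confusion_counts(actual, predicted)
print(dict(cm), round(accuracy(cm), 3))
```

Here one churner is caught, one is missed and one loyal customer is wrongly flagged, so four of
the six cases are correct.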
Data
The data source, as indicated above, is
https://www.kaggle.com/search?q=customer+churn+dataset+datasetFileTypes%3Acsv. The data is a
CSV file with 21 columns and 7044 rows. The dataset needs cleaning because some cells are
empty, and these need to be filled.
Visual exploration of the data will be carried out in the chosen analysis software. It will
look at how many cells have missing entries and therefore need to be filled, and it will show
graphically how the numerical columns are distributed. The distributions will be examined
through univariate and bivariate analysis (Momin, Bohra and Raut, 2020).
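The check for missing entries can be sketched as a count of empty cells per column. The example
is in Python for illustration only (the report's own analysis is in R); the function name and
the three-row extract are hypothetical, and a real run would read the downloaded CSV file
instead of the in-memory sample.

```python
import csv
import io

def missing_per_column(rows):
    """Return {column: number of empty or whitespace-only cells}."""
    missing = {}
    for row in rows:
        for column, value in row.items():
            missing.setdefault(column, 0)
            if value is None or value.strip() == "":
                missing[column] += 1
    return missing

# Hypothetical extract (assumption); replace with open(path, newline="") for a real file
sample = io.StringIO(
    "customerID,tenure,TotalCharges,Churn\n"
    "A1,1,29.85,No\n"
    "B2,0, ,No\n"
    "C3,2,108.15,Yes\n"
)

rows = list(csv.DictReader(sample))
missing = missing_per_column(rows)
print(missing)
```

Columns with a non-zero count are the ones that need filling before modeling.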
The graphical findings will be reported first, covering the univariate and bivariate plots,
followed by the findings from the logistic classification. The accuracy of the classification
algorithms will also be tested using the confusion matrix as well as the ROC curve and the AUC.
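The AUC can be read as the probability that a randomly chosen churner receives a higher
predicted score than a randomly chosen non-churner. The sketch below computes it directly from
that definition, counting ties as half. It is in Python for illustration only (the report's own
analysis is in R), and the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical data (assumption): 1 = churned, scores are predicted churn probabilities
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

value = auc(labels, scores)
print(round(value, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is
quadratic, which is fine for a sketch but slow for large test sets.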
The main business and analytics problem is to classify customers and find out which category
needs more attention and which does not, which in turn helps save revenue for the
telecommunication firm.

Methodology Selection
This section sets out the methods chosen for the classification. Algorithms are picked
selectively because the main task is classifying the response variable. The main dataset will
be split into two datasets: a train set and a test set.
The analysis software will be RStudio. It will read the CSV dataset, a format that R can read,
and the data will then be split into train and test sets. Variables will be picked selectively,
as two logistic regression models will be developed from different sets of variables and will
have different accuracy levels. The better of the two logistic regression models will then be
chosen (Wong, Yang and Chen, 2020).
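The train and test split can be sketched as a seeded shuffle followed by a cut. The example is
in Python for illustration only (the report's own analysis is in R). The 30% test fraction, the
seed and the function name are assumptions, since the report does not state the split ratio.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows reproducibly and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for the customer rows (assumption)
rows = list(range(10))

train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split identical between runs, so both logistic regression models are
compared on the same test cases.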
For selecting the variables to be used, linear regression will be included to help check which
variables are important so that they can be picked accordingly.
Logistic regression is chosen as the classification algorithm because the response variable is
binary.
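To show why logistic regression fits a binary response, the sketch below fits a one-variable
model that maps a predictor to a probability between 0 and 1. It is in Python for illustration
only (the report's own analysis is in R), uses plain gradient descent rather than the fitting
routine of any particular library, and the tenure values and churn labels are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss w.r.t. the linear term
            g0 += error
            g1 += error * x
        b0 -= learning_rate * g0 / n
        b1 -= learning_rate * g1 / n
    return b0, b1

# Hypothetical data (assumption): x is tenure in years, y is 1 if the customer churned
xs = [0.1, 0.3, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 1, 1, 0, 1, 0, 0, 0]

b0, b1 = fit_logistic(xs, ys)
p_short = sigmoid(b0 + b1 * 0.2)  # predicted churn probability for a short tenure
p_long = sigmoid(b0 + b1 * 4.5)   # predicted churn probability for a long tenure
print(round(p_short, 2), round(p_long, 2))
```

A customer would be classified as a churner when the predicted probability exceeds a chosen
threshold such as 0.5.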
Model Building
After the methodology is settled, the chosen models and algorithms need to be developed. In R,
both linear regression and logistic regression will be used. The linear regression will help
find the most viable variables to use.

The first logistic regression will be run and evaluated, and then the second will be run on a
different set of variables and evaluated using the same measures.
The findings will then report which of the two logistic regression models is the better one.
This will show clearly which set of variables the company should use to classify the response
variable. A limitation is that several variables will have to be dropped for the classification
to reach high accuracy.

References
Davis, C. and Ryan, J.T., 2020. Building Codes: The Foundation for Resilient Communities. In
Optimizing Community Infrastructure (pp. 211-220). Butterworth-Heinemann.
Kumar, H. and Yadav, R.K., 2020. Rule-Based Customer Churn Prediction Model Using
Artificial Neural Network Based and Rough Set Theory. In Soft Computing: Theories and
Applications (pp. 97-108). Springer, Singapore.
Meghyasi, H. and Rad, A., 2020. Customer Churn Prediction in Irancell Company by Using Data
Mining.
Rai, S., Khandelwal, N. and Boghey, R., 2020. Analysis of Customer Churn Prediction in
Telecom Sector Using CART Algorithm. In First International Conference on Sustainable
Technologies for Computational Intelligence (pp. 457-466). Springer, Singapore.
Martínez, A., Schmuck, C., Pereverzyev Jr, S., Pirker, C. and Haltmeier, M., 2020. A machine
learning framework for customer purchase prediction in the non-contractual setting. European
Journal of Operational Research, 281(3), pp.588-596.
Momin, S., Bohra, T. and Raut, P., 2020. Prediction of Customer Churn Using Machine
Learning. In EAI International Conference on Big Data Innovation for Sustainable Cognitive
Computing (pp. 203-212). Springer, Cham.
Wong, T.T., Yang, N.Y. and Chen, G.H., 2020. Hybrid Classification Algorithms Based on
Instance Filtering. Information Sciences.