BUS 105 Business Information Systems: Data Mining & Management Report

Added on 2023/04/21

AI Summary

This report explores data mining and data management within the context of Business Information Systems (BUS 105). It defines data mining, outlines the phases of the Cross-Industry Standard Process for Data Mining (CRISP-DM), and discusses the importance and key elements of data mining, such as accuracy, relevancy, and specificity. The report also addresses common problems in data mining, including poor data quality and privacy concerns. The second part focuses on data management, presenting a data dictionary and analyzing book data, including sorting by weeks on the list, identifying popular authors, and calculating average weeks on the list. The analysis reveals insights into book popularity trends and author performance, highlighting the practical application of data mining techniques for understanding information and making business decisions.

BUSINESS INFORMATION SYSTEMS
Authors name
BUS 105
Name of the professor
Institution name
City/state
Date

Part A
Question 1
Definition of data mining
Data mining is the process of sorting through large data sets to identify patterns and
relationships that support analytical problem solving. With data mining tools, enterprises can
extrapolate past trends into the future.
Phases of the Cross-Industry Standard Process for Data Mining
Business understanding
This is the first phase. Its main objective is to understand the objectives and requirements
of a project from a business perspective. This knowledge is then converted into a data mining
problem definition (IBM Corporation 2012).
Data understanding
This phase begins with data collection and then proceeds to activities that build familiarity
with the data set. It helps to detect subsets of interest that are useful in formulating
hypotheses.
Data preparation
This phase covers all the activities needed to construct the final dataset from the raw data.
The tasks include table, record and attribute selection. Data transformation and cleaning also
form part of this phase.
Modelling
This stage involves selecting and applying a number of modelling techniques and calibrating
their parameters to optimal values.
Evaluation
After the models are built, this stage evaluates them and reviews the steps that were executed
to create them (Chapman et al. 2000). In this way, the ability of the model to solve the
business problem is verified.
Deployment

After the model is created, the knowledge gained needs to be organized and presented in a way
that is useful to its consumers. This phase therefore consists of the deployment tasks, which
are carried out by the client.
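The six phases above can be written down as an ordered sequence. The following is only a minimal sketch in Python; the names `Phase` and `next_phase` are hypothetical and do not come from the CRISP-DM guide. It captures the nominal order only, whereas a real project may move back to an earlier phase.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """CRISP-DM phases in the nominal order described in this report."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELLING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that nominally follows `current`, or None after deployment."""
    if current is Phase.DEPLOYMENT:
        return None
    return Phase(current + 1)


# Walk through the phases in order and print their names.
for phase in Phase:
    print(phase.value, phase.name)
```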
Question 2
Importance of data mining
Data mining supports informed decision making by using available past data to project the
future.
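As a simple illustration of projecting the future from past data, the sketch below fits a straight-line trend by ordinary least squares and extends it one period ahead. This is only one of many possible techniques and is not taken from the report's sources; the function names and the sales figures are hypothetical.

```python
def fit_linear_trend(values):
    """Fit y = intercept + slope * t by ordinary least squares, with t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    s_tt = sum((t - mean_t) ** 2 for t in range(n))
    slope = s_ty / s_tt
    intercept = mean_y - slope * mean_t
    return intercept, slope


def forecast(values, steps_ahead=1):
    """Extrapolate the fitted trend `steps_ahead` periods past the last observation."""
    intercept, slope = fit_linear_trend(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


# Hypothetical sales figures for four past periods (not data from this report).
past_sales = [10, 12, 14, 16]
print(forecast(past_sales))
```

A straight line is a strong assumption; real data mining tools would normally compare several models before a forecast is used for a decision.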
Elements of data mining
The following elements need to be in place when carrying out data mining:
Accuracy: a resource is only valuable if its quality is reliable. Hence, when preparing a tool
for data mining, it is important to ensure that the sources from which the information is
gathered are accurate.
For example, a retail trader may be interested in the product pricing offered by a number of
competitors. To ensure that the information gathered is accurate, he or she needs to consider
the season, the source and the products of concern.
The tools used to gather the data need to be able to differentiate the nature of the data that
is to be collected. In this way, data accuracy is aligned with the reason for harvesting.
Relevancy: for information to be useful for decision making, it is vital that it takes context
into account. A simple tool designed to support data harvesting may fail to capture enough
context to ensure that the data source is relevant. The use of machine learning has helped to
bridge this context gap. According to IBM, there are three areas that context needs to cover:
industry, data and transfer.
For example, when writing a MySQL query, the code cannot recognise context. A more
sophisticated tool needs to be applied to learn about the data context by using the
accumulated data and logical expressions.
Specificity: in the real world, huge volumes of data become available on a daily basis. When
collecting data, it is therefore vital that the researcher mines only the information that is
in line with the business strategy and an awareness of the industry (Cerami 2018).
Question 3

Problems in data mining
Poor data quality hinders data mining: noisy data, missing values, dirty data, incorrect
values and poorly representative data samples all reduce the quality of the results. Redundant
data sources also cause problems (Big Data Made Simple 2015). Lastly, there are confidentiality
and privacy concerns from organizations, individuals and government agencies.
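Two of the quality problems named above, missing values and redundant records, can be counted before any mining starts. The sketch below is a minimal illustration; the function names, field names and sample records are hypothetical and are not taken from the report's data.

```python
def is_missing(value):
    """Treat None and blank text as missing; zero is a valid value."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def quality_report(records, required_fields):
    """Count records with a missing required value and exact duplicate records."""
    seen = set()
    missing = 0
    duplicates = 0
    for record in records:
        if any(is_missing(record.get(field)) for field in required_fields):
            missing += 1
        # Two records are duplicates when every field and value matches.
        key = tuple(sorted(record.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"total": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical records used only to exercise the checks.
sample = [
    {"author": "A", "weeks": 3},
    {"author": "A", "weeks": 3},     # exact duplicate of the first record
    {"author": "", "weeks": 5},      # blank author
    {"author": "B", "weeks": None},  # missing number of weeks
]
print(quality_report(sample, ["author", "weeks"]))
```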

Part B: Data Management
1.
2. Data dictionary
Field Name      Data Type
Publisher       Text
Author          Text
Primary isbn10  Text
date            Date
Book title      Text
Weeks on list   Number
Data Dictionary
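A data dictionary like this one can also be used to check records automatically. The sketch below assumes a mapping from the report's Text, Date and Number types to Python's `str`, `date` and `int`; that mapping and the names `DATA_DICTIONARY` and `validate` are illustrative choices. Storing the ISBN as text matters because a value such as 144947425X, which appears in the book data, cannot be held as a number.

```python
from datetime import date

# Field names follow the data dictionary; the Python types are an assumed mapping.
DATA_DICTIONARY = {
    "Publisher": str,
    "Author": str,
    "Primary isbn10": str,
    "date": date,
    "Book title": str,
    "Weeks on list": int,
}


def validate(record):
    """Return a list of problems found when checking `record` against the dictionary."""
    problems = []
    for field, expected in DATA_DICTIONARY.items():
        if field not in record:
            problems.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems


# First row of the book data; the two-digit year 17 is read as 2017.
good = {
    "Publisher": "Riverhead",
    "Author": "Paula Hawkins",
    "Primary isbn10": "1594634025",
    "date": date(2017, 2, 19),
    "Book title": "THE GIRL ON THE TRAIN",
    "Weeks on list": 102,
}
# The same row with the ISBN wrongly stored as a number.
bad = dict(good, **{"Primary isbn10": 1594634025})

print(validate(good))
print(validate(bad))
```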
3. Sort by number of weeks
Publisher          Author             Primary isbn10  date        Book title                   Weeks on list
Riverhead          Paula Hawkins      1594634025      2/19/17     THE GIRL ON THE TRAIN        102
Scribner           Anthony Doerr      1501173219      05/07/2017  ALL THE LIGHT WE CANNOT SEE  81
Vintage            E L James          525431888       03/05/2017  FIFTY SHADES DARKER          66
St. Martin's       Kristin Hannah     1466850604      10/29/17    THE NIGHTINGALE              63
Penguin Group      Kathryn Stockett   1440697663      04/08/2012  THE HELP                     58
Washington Square  Fredrik Backman    1476738025      7/23/17     A MAN CALLED OVE             56
Andrews McMeel     Rupi Kaur          144947425X      03/11/2018  MILK AND HONEY               52
Bantam             George R R Martin  553897845       9/17/17     A GAME OF THRONES            51
Berkley            Liane Moriarty     425274861       5/21/17     BIG LITTLE LIES              38
Ballantine         Lisa Wingate       425284697       06/09/2018  BEFORE WE WERE YOURS         35
Scout              Ruth Ware          1501132954      10/08/2017  THE WOMAN IN CABIN 10        35
Ten most popular authors (eleven are listed because two authors tie at 35 weeks)
Author
Paula Hawkins
Anthony Doerr
E L James
Kristin Hannah
Kathryn Stockett
Fredrik Backman
Rupi Kaur
George R R Martin
Liane Moriarty
Lisa Wingate
Ruth Ware
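The sort and the author ranking can be reproduced with a few lines of Python. This is a sketch rather than the spreadsheet steps actually used for the report; the function names are hypothetical, and the rows are the author, title and weeks values from the table above, entered in a different order so that the sort has work to do. Ties at the cut-off are kept, which is why eleven authors are returned for a top ten.

```python
# (author, book title, weeks on list), copied from the table in this report.
BOOKS = [
    ("Liane Moriarty", "BIG LITTLE LIES", 38),
    ("Paula Hawkins", "THE GIRL ON THE TRAIN", 102),
    ("Rupi Kaur", "MILK AND HONEY", 52),
    ("Lisa Wingate", "BEFORE WE WERE YOURS", 35),
    ("Kathryn Stockett", "THE HELP", 58),
    ("E L James", "FIFTY SHADES DARKER", 66),
    ("Ruth Ware", "THE WOMAN IN CABIN 10", 35),
    ("Anthony Doerr", "ALL THE LIGHT WE CANNOT SEE", 81),
    ("George R R Martin", "A GAME OF THRONES", 51),
    ("Kristin Hannah", "THE NIGHTINGALE", 63),
    ("Fredrik Backman", "A MAN CALLED OVE", 56),
]


def sort_by_weeks(rows):
    """Sort rows by weeks on the list, longest first; ties keep their input order."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def top_authors(rows, n=10):
    """Return the authors of the top `n` rows, keeping every row tied with the n-th."""
    ranked = sort_by_weeks(rows)
    if len(ranked) <= n:
        return [row[0] for row in ranked]
    cutoff = ranked[n - 1][2]
    return [row[0] for row in ranked if row[2] >= cutoff]


for author in top_authors(BOOKS):
    print(author)
```

The sketch assumes one row per author, as in the table; an author with several qualifying books would appear more than once.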
4.
a. Authors with more than 10 books
Authors Number of books
Christina Lauren 13
Christine Feehan 28
Danielle Steel 32
David Baldacci 20
Dean Koontz 17
Debbie Macomber 26

Number of weeks on the list
Author Weeks on the list
Christina Lauren 8
Christine Feehan 31
Danielle Steel 38
David Baldacci 121
Dean Koontz 14
Debbie Macomber 23
Grand Total 235
b. Average number of weeks in the list
Author Average of Weeks on list
Christina Lauren 0.6154
Christine Feehan 1.1071
Danielle Steel 1.1875
David Baldacci 6.0500
Dean Koontz 0.8235
Debbie Macomber 0.8846
Grand Total 1.7279
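Each average in this table is the author's total weeks on the list divided by the author's number of books, and the grand total is the total weeks divided by the total number of books (235 / 136), not the mean of the six averages. The sketch below recomputes the figures from the two tables in item 4a; the variable and function names are illustrative.

```python
# Number of books and total weeks on the list per author, from the tables above.
BOOK_COUNTS = {
    "Christina Lauren": 13,
    "Christine Feehan": 28,
    "Danielle Steel": 32,
    "David Baldacci": 20,
    "Dean Koontz": 17,
    "Debbie Macomber": 26,
}
WEEKS_ON_LIST = {
    "Christina Lauren": 8,
    "Christine Feehan": 31,
    "Danielle Steel": 38,
    "David Baldacci": 121,
    "Dean Koontz": 14,
    "Debbie Macomber": 23,
}


def average_weeks(counts, weeks):
    """Average weeks on the list per book for each author."""
    return {author: weeks[author] / counts[author] for author in counts}


averages = average_weeks(BOOK_COUNTS, WEEKS_ON_LIST)
# Grand total: all weeks divided by all books, i.e. a book-weighted average.
grand_total = sum(WEEKS_ON_LIST.values()) / sum(BOOK_COUNTS.values())

for author, value in averages.items():
    print(f"{author}: {value:.4f}")
print(f"Grand Total: {grand_total:.4f}")
```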
5. Column Chart
3 authors

Findings
The data dictionary helps the user of the information to understand the nature of the data. In
the data given there are three types of data: text, date and number.
When the data is sorted by the number of weeks that the books have been on the New York Times
bestseller list, the most popular book is seen to have spent 102 weeks on the list. This was
‘The Girl on the Train’ by Paula Hawkins, published by Riverhead.
Analysing the authors by popularity, Paula Hawkins is the most popular on the New York Times
list, followed by Anthony Doerr, E L James, Kristin Hannah, Kathryn Stockett, Fredrik
Backman, Rupi Kaur, George R R Martin, Liane Moriarty, Lisa Wingate and Ruth Ware. Eleven
authors are named because Lisa Wingate and Ruth Ware tie at 35 weeks for tenth place.
Furthermore, a number of authors had several book titles on the list even though none of
their books appeared among the most popular titles. One of these authors is Danielle Steel,
who had a total of 32 different book titles listed; Christine Feehan had 28.

The total number of weeks that the books stayed on the list indicates that David Baldacci's
books dominated the list, accounting for 121 weeks between 2011 and 2018. On average, each of
his books stayed on the list for about 6 weeks (121 weeks across 20 titles). The column chart
gives a visual view of how long the books stayed on the list.
The focus of data mining is to use past information to generate a model that can be
generalised to a population in order to gauge the trends in a data set. As described in Part A,
the practical analysis here concentrated on sampling the information through sorting so that
conclusions could be drawn about the data (Cerami 2018). This gives only a glimpse of how
crucial data mining is for business managers, and for any other parties with an interest in
the data, when it comes to understanding information.

References
Big Data Made Simple 2015, ‘Top 12 common problems in data mining’, Big Data Made Simple,
February 3, viewed 3 January 2018,
<http://bigdata-madesimple.com/12-common-problems-in-data-mining/>.
Cerami, G 2018, ‘3 major Elements of Data Mining’, Connotate, April 24, viewed 4 January 2018,
<https://www.connotate.com/three-major-elements-data-mining/>.
Chapman, P, Clinton, J, Kerber, R, Khabaza, T, Reinartz, T, Shearer, C & Wirth, R 2000,
‘Step-by-step data mining guide’, Crisp-DM 1.0, viewed 4 January 2018,
<https://www.the-modeling-agency.com/crisp-dm.pdf>.
IBM Corporation 2012, Crisp-DM Help Overview, viewed 4 January 2018,
<https://www.ibm.com/support/knowledgecenter/en/SS3RA7_15.0.0/com.ibm.spss.crispdm.help/crisp_overview.htm>.
SAS Institute Inc 2018, ‘Managing the analytics life cycle for decisions at scale’, White
paper, viewed 4 January 2018,
<https://www.sas.com/content/dam/SAS/en_us/doc/whitepaper1/manage-analytical-life-cycle-continuous-innovation-106179.pdf>.