Introduction to Data Mining Project


Added on 2019/09/19



Introduction to Data Mining: Project Overview
1) Read a delimited file (pipe- or comma-delimited) into a data frame.
Consider using Hospital Compare data as a data source:
https://data.medicare.gov/data/hospital-compare (click on “download csv flat files”)
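
As a minimal sketch of this step, the base R readers handle both delimiters. The file below is a hypothetical stand-in written to a temporary path, not real Hospital Compare data, and the column names and the "Not Available" missing-value marker are assumptions; check how your own file codes missing values.

```r
# Hypothetical stand-in data: a tiny pipe-delimited file written to a temp path.
# In the real assignment this would be the file you downloaded.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Provider ID|Hospital Name|State|Score",
  "10001|Alpha General|AL|4",
  "10002|Beta Medical|AK|",
  "10003|Gamma Clinic|AZ|Not Available"
), path)

# read.delim() is read.table() with header = TRUE; sep = "|" makes it read
# pipe-delimited input. For comma-delimited files, read.csv() is the analogous call.
# na.strings lists the values to treat as missing (an assumption for this example).
hospitals <- read.delim(path, sep = "|", stringsAsFactors = FALSE,
                        na.strings = c("", "NA", "Not Available"))

# Inspect column names and types. By default R rewrites names that are not
# syntactically valid (e.g. "Provider ID" becomes "Provider.ID").
str(hospitals)
```

Declaring the missing-value markers at read time lets R type the Score column as numeric rather than character, which simplifies the validation step that follows.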
BONUS CREDIT: Create a table or tables in Postgres, populate the table(s) with insert statements,
and read the data into a data frame using R. Submit the DDL and insert statements with the
assignment. The more elaborate the database, the more bonus credit you are likely to receive
(e.g., creating two tables and joining them together is worth more than a single table).
2) Apply some cursory validations (checking for nulls and blanks) and rename your columns if
necessary.
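
One way to approach the validation step is sketched below in base R. The data frame, its column names, and the choice to drop incomplete rows are all assumptions made for illustration; depending on your data, imputing or flagging missing values may be more appropriate than dropping them.

```r
# Hypothetical stand-in data frame; names and values are invented for illustration.
df <- data.frame(
  Provider.ID   = c(10001, 10002, 10003, 10004),
  Hospital.Name = c("Alpha General", "", "Gamma Clinic", "   "),
  Score         = c(4, NA, 3, 5),
  stringsAsFactors = FALSE
)

# Rename columns to short, consistent names.
names(df) <- c("provider_id", "hospital_name", "score")

# Count NA values per column.
na_counts <- colSums(is.na(df))

# Count blank (empty or whitespace-only) strings per character column.
blank_counts <- sapply(df, function(col) {
  if (is.character(col)) sum(trimws(col) == "", na.rm = TRUE) else 0L
})

# Recode blanks as NA so both kinds of missing value are handled the same way.
is_chr <- sapply(df, is.character)
df[is_chr] <- lapply(df[is_chr], function(col) {
  col[!is.na(col) & trimws(col) == ""] <- NA
  col
})

# Keep only rows with no missing values (one possible policy, not the only one).
clean <- df[complete.cases(df), ]

print(na_counts)
print(blank_counts)
print(clean)
```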
3) Split your data into training and testing datasets (80% training and 20% testing).
Hint: Use the “subset” function in R.
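
A minimal sketch of the split, following the hint, is shown below. It uses the built-in mtcars dataset purely as a stand-in for the project data; the seed value and the helper column name is_train are arbitrary choices for this example.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# mtcars is a built-in dataset used only as a stand-in for the project data.
dat <- mtcars

# Draw 80% of the row indices at random and flag them as training rows.
n <- nrow(dat)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
dat$is_train <- seq_len(n) %in% train_idx

# subset() filters rows by a logical condition; select drops the helper flag.
train <- subset(dat, is_train, select = -is_train)
test  <- subset(dat, !is_train, select = -is_train)

nrow(train)
nrow(test)
```

Sampling row indices at random, rather than taking the first 80% of rows, guards against any ordering in the source file leaking into the split.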
4) Using a library, implement an algorithm that we’ve discussed in class using 80% of the data. Model
options include:
• Regression (Linear, Logistic)
• Naive Bayes (Bernoulli, Multinomial, MLE)
• Clustering (Hierarchical, k-Means)
• k-Nearest Neighbors (as a classifier or predictor)
• TF-IDF
• Other (approval needed)
5) Apply the model to the held-out 20% of the data and provide some measure of model performance.
Note that for clustering, a testing/training split is not necessary. Options include:
• Z-test
• Confusion Matrix
• ROC Curve
• Inter-cluster SS (sum of squares)
• Precision/Recall, Specificity & Sensitivity
6) Visualize the model in some way with a simple plot. Options include:
• Scatterplots
• Correlation Matrix
• Histograms
7) Write a one-paragraph description of the business problem being solved with your project and why
the model was selected.
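
The modeling and evaluation steps above could be sketched as follows for one of the listed options, logistic regression with a confusion matrix. This is an illustration under stated assumptions, not a required approach: it uses two species from the built-in iris dataset as stand-in data, a single predictor, and an arbitrary 0.5 probability cutoff.

```r
set.seed(1)  # arbitrary seed for a reproducible split

# Stand-in data: two iris species, recoded as a 0/1 outcome.
dat <- subset(iris, Species != "setosa")
dat$is_virginica <- as.integer(dat$Species == "virginica")

# 80/20 split by random row index.
train_idx <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit logistic regression on the training rows only.
fit <- glm(is_virginica ~ Petal.Length, data = train, family = binomial)

# Score the held-out rows; the 0.5 cutoff is an assumption for this example.
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Confusion matrix with fixed levels so both classes always appear.
cm <- table(
  actual    = factor(test$is_virginica, levels = c(0, 1)),
  predicted = factor(pred, levels = c(0, 1))
)

tp <- cm["1", "1"]; tn <- cm["0", "0"]
fp <- cm["0", "1"]; fn <- cm["1", "0"]

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  precision   = tp / (tp + fp),
  sensitivity = tp / (tp + fn),  # also called recall
  specificity = tn / (tn + fp)
)

print(cm)
print(round(metrics, 3))
```

Computing the metrics from the held-out rows only is what makes them an estimate of performance on unseen data; metrics computed on the training rows would be optimistic.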
BONUS CREDIT: Use R-Shiny to present the data in a browser. The more elaborate the UI (from a
functionality and style perspective), the more bonus credit you are likely to receive.
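
For the clustering option and the basic plotting step in the list above (without Shiny), a minimal base R sketch might look like the following. The iris measurements are stand-in data, k = 3 is an assumption made for the example, and the output file name is hypothetical.

```r
set.seed(7)  # k-means starts from random centers, so fix the seed

# Stand-in data: the four numeric iris measurements, standardized.
x <- scale(iris[, 1:4])

# k = 3 is an assumption made for this example.
km <- kmeans(x, centers = 3, nstart = 20)

# Between-cluster sum of squares as a share of the total sum of squares.
between_share <- km$betweenss / km$totss

# Correlation matrix of the raw numeric columns.
cors <- cor(iris[, 1:4])

# Write three simple plots to a PDF file in the session's temp directory.
out <- file.path(tempdir(), "model_plots.pdf")
pdf(out, width = 12, height = 4)
par(mfrow = c(1, 3))

# Scatterplot colored by assigned cluster.
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster, pch = 19,
     xlab = "Petal length", ylab = "Petal width", main = "k-means clusters")

# Histogram of a single variable.
hist(iris$Sepal.Length, xlab = "Sepal length", main = "Histogram")

# Correlation matrix drawn as a heat map.
image(1:4, 1:4, cors, zlim = c(-1, 1), axes = FALSE, xlab = "", ylab = "",
      main = "Correlation matrix")
axis(1, at = 1:4, labels = colnames(cors))
axis(2, at = 1:4, labels = colnames(cors))

dev.off()

print(round(between_share, 3))
```

No training/testing split appears here, in line with the note that clustering does not require one.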
Submission Instructions:


The completed project should be submitted through Canvas. The attachments should include:
• The data file (or a sample of the data file if it is large). If you are choosing the bonus credit
option, the DDL and insert statements should be included.
• The R code that splits your data into a training and testing dataset and applies your model
• The R code that evaluates your model and visualizes the results. If you are choosing the bonus
credit option, the UI and Server files should be included.
• A screen shot of the visualization (i.e. a plot). If you are choosing the bonus credit, a screen shot
of the Shiny UI should be included.
• A simple one-paragraph description of the project and the business problem that you are
solving.
Submitted code should be functional such that I can copy it into my IDE and produce the same results.
IMPORTANT: I will be conducting a similarity search on each
assignment (using advanced NLP techniques) to find plagiarized
code, whether it is shared between students or copied
from the Internet. Plagiarized code includes code that is
structurally the same despite renamed objects and
variables. Students submitting unoriginal work will receive a
failing grade in the course.
