Comprehensive Report: Analyzing and Cleaning Messy Data Challenges


Added on  2023/06/10

This report addresses the pervasive issue of "messy data," particularly within healthcare, where data resides in multiple locations and formats. It highlights the challenges posed by Electronic Medical Record (EMR) systems, where inconsistent data capture hinders analysis. The report suggests using tools like Winpure for data cleaning and NoSQL databases for handling unstructured data. Key steps in data cleaning, such as removing unwanted observations, are emphasized. Furthermore, the report discusses data quality assessment, focusing on accuracy, consistency, and completeness.
Running head: MESSY DATA
Messy Data
[Name of the Student]
[Name of the University]
[Author note]
Example of “Messy Data”
Example 1: Much of the data is placed in multiple locations:
Healthcare data generally resides in multiple locations and is received from different
sources, including EMRs, HR software and many more. This data comes from across the entire
organization. Aggregating all of it into a single, central system can help make the data
accessible and actionable. In addition, the data is generally received in different formats,
which makes it messier (Gerard et al., 2018). Beyond the variety of data types, the same kind
of data often exists in different systems but in different formats.
Solution:
The Winpure software would be helpful in eliminating this type of problem. It is a popular
and affordable data cleaning tool that can clean large amounts of data, remove duplicate or
near-duplicate records, make corrections and standardize values with little effort. It can
also clean data held in databases, CRMs and many other sources. It is generally used with
databases such as Access and SQL Server (Wickham, 2014). Other key features of Winpure
include advanced data cleaning with fuzzy matching, fast data scrubbing, and many more.
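The kind of standardization and fuzzy matching described above can be illustrated with a minimal Python sketch. This is not how Winpure works internally; it only uses the standard library, and the record fields, date formats and similarity threshold are assumptions chosen for the example.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Hypothetical date formats used by the different source systems (assumption).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def standardize_date(raw):
    """Return the date as an ISO string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_name(raw):
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(raw.lower().split())


def is_fuzzy_match(a, b, threshold=0.85):
    """Treat two names as the same person when their similarity ratio is high."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def deduplicate(records, threshold=0.85):
    """Standardize each record and keep only the first of each fuzzy-matching group."""
    kept = []
    for rec in records:
        clean = {"name": normalize_name(rec["name"]), "dob": standardize_date(rec["dob"])}
        duplicate = any(
            k["dob"] == clean["dob"] and is_fuzzy_match(k["name"], clean["name"], threshold)
            for k in kept
        )
        if not duplicate:
            kept.append(clean)
    return kept


# Made-up records, as they might arrive from three different systems.
records = [
    {"name": "Jane  Smith", "dob": "1980-03-05"},
    {"name": "jane smyth", "dob": "05/03/1980"},
    {"name": "Robert Jones", "dob": "05.03.1980"},
]
unique = deduplicate(records)
```

Here the first two records are treated as one person because their standardized birth dates are equal and their names are nearly identical, while the third record is kept because only the birth date matches. The threshold of 0.85 is arbitrary and would need tuning on real data.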
Example 2:
Electronic Medical Record (EMR) software is intended to supply a platform for capturing
data in a consistent way. In the last few years it has been noticed that documenting clinical
facts and findings on paper trained an industry to capture data in whatever way is most
convenient for the person providing care. This is generally done without considering how the
data could eventually be analyzed or aggregated (Singer et al., 2015). The EMR attempts to
standardize the process of capturing data. Despite this, care providers are very reluctant to
adopt a one-size-fits-all approach to documentation. As a result, unstructured data capture is
often allowed in order to appease frustrated EMR users and to avoid hindering the process of
providing care. For all these reasons, most of the data captured this way is difficult to
aggregate and analyze in a consistent way. As EMRs improve, users receive more training so
that they follow a standard workflow.
Solution: Structured and unstructured forms of data
Traditional SQL databases handle large amounts of structured data effectively, whereas
NoSQL databases make unstructured data easy to handle. A NoSQL database stores unstructured
data without requiring any particular schema (Mutz, Pemantle, & Pham, 2017). Each row can have
its own set of column values. NoSQL databases can provide better performance when huge
amounts of data are being stored, and there are also many open-source NoSQL databases that
help in the process of analysis.
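The idea that each row carries its own set of columns can be sketched in plain Python, using dictionaries as a stand-in for the documents of a schema-less store. No real NoSQL database is involved here, and the records and field names are invented for illustration.

```python
from collections import Counter

# Each "document" has its own fields; no shared schema is enforced.
# All values below are made up for the example.
documents = [
    {"patient_id": 1, "blood_pressure": "120/80", "note": "routine visit"},
    {"patient_id": 2, "allergies": ["penicillin"]},
    {"patient_id": 3, "blood_pressure": "135/90", "allergies": []},
]


def field_usage(docs):
    """Count how many documents contain each field."""
    usage = Counter()
    for doc in docs:
        usage.update(doc.keys())
    return usage


def find(docs, field):
    """Return only the documents that actually carry the given field."""
    return [doc for doc in docs if field in doc]


usage = field_usage(documents)
with_bp = find(documents, "blood_pressure")
```

Counting field usage in this way is one simple check of how uniform the stored data really is before any aggregation is attempted.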
Important Step in "Cleaning" Data:
The most important step in the data cleaning process is the removal of unwanted
observations. This is considered the first step to follow when cleaning data. Unwanted
observations generally include duplicate or irrelevant observations (John Walker, 2014).
Removing them eliminates a major portion of the messy data and yields clean data without
repetitions or irrelevant information.
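A minimal sketch of this step in Python is given below. It assumes that observations are flat dictionaries with hashable values and that relevance can be expressed as a simple test function; the sample data and the relevance rule are hypothetical.

```python
def remove_unwanted(observations, is_relevant):
    """Drop exact duplicates and observations that fail the relevance test."""
    seen = set()
    cleaned = []
    for obs in observations:
        # A sorted tuple of items identifies an observation regardless of key order.
        key = tuple(sorted(obs.items()))
        if key in seen:
            continue  # duplicate observation
        seen.add(key)
        if is_relevant(obs):
            cleaned.append(obs)
    return cleaned


# Made-up observations for a study that only concerns cardiology visits.
observations = [
    {"patient_id": 1, "department": "cardiology", "visits": 3},
    {"patient_id": 1, "department": "cardiology", "visits": 3},  # exact duplicate
    {"patient_id": 2, "department": "cafeteria", "visits": 1},   # irrelevant
    {"patient_id": 3, "department": "cardiology", "visits": 2},
]
cleaned = remove_unwanted(observations, lambda o: o["department"] == "cardiology")
```

Only exact duplicates are caught here. Near-duplicates, such as the same patient entered with two spellings, would need a matching rule of their own.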
Data Quality Assessment Dependent on the Intended Use of the Data
Accuracy:
Accuracy refers to the degree to which data correctly represents the “real-life” objects it
is intended to model. Accuracy is generally measured by how well the values agree with an
identified source of correct information. This correct information can come from various
sources, such as the result of a manual process, dynamically computed values, a database of
record and many more.
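One way to turn this definition into a number is to compare each value with a trusted reference and report the share that agrees. The following Python sketch assumes both data sets are keyed by a shared identifier; the identifiers, field name and values are invented for the example.

```python
def accuracy_rate(dataset, source_of_record, field):
    """Share of verifiable records whose field value agrees with the source of record."""
    checked = 0
    agreeing = 0
    for key, record in dataset.items():
        if key not in source_of_record:
            continue  # cannot be verified, so it is left out of the measure
        checked += 1
        if record.get(field) == source_of_record[key].get(field):
            agreeing += 1
    return agreeing / checked if checked else None


# Made-up data: the working data set and the trusted source of record.
dataset = {
    "p1": {"birth_year": 1980},
    "p2": {"birth_year": 1975},
    "p3": {"birth_year": 1990},
    "p4": {"birth_year": 2001},  # has no counterpart in the source of record
}
source_of_record = {
    "p1": {"birth_year": 1980},
    "p2": {"birth_year": 1957},  # the data set has transposed digits here
    "p3": {"birth_year": 1990},
}
rate = accuracy_rate(dataset, source_of_record, "birth_year")
```

Records without a counterpart in the reference are excluded rather than counted as wrong, which is a design choice; depending on the intended use of the data, they could equally be reported separately.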
Consistency:
In its most basic form, consistency refers to data values in one data set being consistent
with the values in another data set. A strict definition of consistency specifies that two
data values drawn from separate data sets must not conflict with one another (Gould, Sunbury,
& Dussault, 2014). However, consistency does not necessarily imply correctness.
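The strict definition can be checked mechanically by comparing the values that two data sets hold for the same record. The sketch below is a minimal Python illustration; the two data sets, their keys and their fields are assumptions made for the example.

```python
def find_conflicts(set_a, set_b, fields):
    """List (key, field, value_a, value_b) wherever the two data sets disagree."""
    conflicts = []
    for key in sorted(set_a.keys() & set_b.keys()):
        for field in fields:
            a = set_a[key].get(field)
            b = set_b[key].get(field)
            # Missing values are a completeness issue, not a conflict.
            if a is not None and b is not None and a != b:
                conflicts.append((key, field, a, b))
    return conflicts


# Made-up extracts from two systems that describe the same patients.
emr = {
    "p1": {"blood_type": "A+", "sex": "F"},
    "p2": {"blood_type": "O-", "sex": "M"},
}
lab = {
    "p1": {"blood_type": "A+", "sex": "F"},
    "p2": {"blood_type": "B+", "sex": "M"},
}
conflicts = find_conflicts(emr, lab, ["blood_type", "sex"])
```

An empty result would only show that the two systems agree. Both could still hold the same wrong value, which is why consistency does not imply correctness.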
Completeness:
An expectation of completeness indicates that certain attributes are expected to be
assigned values in a data set. Completeness rules can be assigned to a data set at three
levels of constraints:
1. Mandatory attributes, which require a value,
2. Optional attributes, which may have a value depending on some set of conditions, and
lastly,
3. Inapplicable attributes, which may not have a value.
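The three levels listed above can be sketched as a small rule check in Python. The attribute names and the rule that a discharge date is inapplicable while a patient is still admitted are hypothetical, chosen only to show one attribute at each level.

```python
MANDATORY, OPTIONAL, INAPPLICABLE = "mandatory", "optional", "inapplicable"


def attribute_level(record, attribute):
    """Hypothetical rules deciding which constraint level applies to an attribute."""
    if attribute == "patient_id":
        return MANDATORY
    if attribute == "discharge_date":
        # Assumed rule: no discharge date may exist while the patient is admitted.
        return INAPPLICABLE if record.get("status") == "admitted" else MANDATORY
    return OPTIONAL


def check_completeness(record, attributes):
    """Return a list of completeness violations for one record."""
    violations = []
    for attr in attributes:
        level = attribute_level(record, attr)
        has_value = record.get(attr) not in (None, "")
        if level == MANDATORY and not has_value:
            violations.append(f"{attr}: mandatory value missing")
        elif level == INAPPLICABLE and has_value:
            violations.append(f"{attr}: value present but not applicable")
    return violations


ATTRIBUTES = ["patient_id", "phone", "discharge_date"]

# Made-up records covering one violation of each kind and one clean record.
still_admitted = {"patient_id": "p1", "status": "admitted", "discharge_date": "2023-01-02"}
missing_id = {"patient_id": "", "status": "discharged", "discharge_date": "2023-01-05"}
complete = {"patient_id": "p3", "status": "discharged", "discharge_date": "2023-01-05"}

report = {
    "still_admitted": check_completeness(still_admitted, ATTRIBUTES),
    "missing_id": check_completeness(missing_id, ATTRIBUTES),
    "complete": check_completeness(complete, ATTRIBUTES),
}
```

The optional attribute, phone, never produces a violation in this sketch, whether or not it has a value.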
Bibliography:
Gerard, D., Ferrão, L. F. V., Garcia, A. A. F., & Stephens, M. (2018). Harnessing Empirical
Bayes and Mendelian Segregation for Genotyping Autopolyploids from Messy
Sequencing Data. bioRxiv, 281550.
Gould, R., Sunbury, S., & Dussault, M. (2014). In praise of messy data. The Science
Teacher, 81(8), 31.
Halloran, K. M., Murdoch, J. D., & Becker, M. S. (2015). Applying computer-aided
photo-identification to messy datasets: a case study of Thornicroft's giraffe (Giraffa
camelopardalis thornicrofti). African Journal of Ecology, 53(2), 147-155.
John Walker, S. (2014). Big data: A revolution that will transform how we live, work, and think.
Mutz, D. C., Pemantle, R., & Pham, P. (2017). The perils of balance testing in experimental
design: Messy analyses of clean data. Forthcoming in The American Statistician.
Singer, J. M., André, C. D., Rocha, F. M., & Zerbini, T. (2015). Fitting non-linear mixed models
to messy longitudinal data: an example.
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.