ENRP20001: Imputation of Hydrological Data using R-Language

Added on 2022/11/29

AI Summary

This report focuses on the imputation of hydrological data, a crucial aspect of environmental engineering, using the R-language. The study addresses the challenges posed by missing data in hydrological datasets and explores various imputation techniques to ensure data completeness. The report begins with an introduction to the significance of hydrological data in understanding climate change and its impact on water resources, including rainfall, water retention, and evaporation. It highlights the problems caused by floods and excessive evaporation and the need for data analysis to address these issues. The objectives are clearly stated, emphasizing the importance of data cleaning and imputation methods. The literature review covers various data imputation methods such as Kriging, nearest neighbour, linear interpolation, arithmetic mean, artificial intelligence techniques, and regression methods, with a focus on regression analysis due to its ease of explanation. The report discusses the benefits for stakeholders, including the hydrological department and communities affected by floods and evaporation. The methodology involves the use of the R-language for data imputation, particularly focusing on regression analysis to establish the amount of water lost through floods and evaporation. The conclusion underscores the significance of the findings in informing decisions and actions related to water conservation and management. The report provides a comprehensive overview of data imputation techniques and their application in environmental engineering, offering valuable insights for stakeholders.

IMPUTATION 1
Imputation of Hydrological data using R-Language
Name of Author
Name of Class
Name of Professor
Name of School
State and City of School
Date

Abstract
Rainfall, water retention in reservoirs, percolation of water through the soil to plant roots,
and evaporation are key factors to consider when studying climate change and deciding which
actions to take to conserve more water, so that people and vegetation can continue to benefit
from it. These parameters are central to the study of climate trends. They largely dictate how
to respond to extreme rainfall and water conditions, such as those that cause floods, and to
intense evaporation of water back into the atmosphere when strong sunshine leaves the ground
dry. The extreme water behaviours that I consider in this report result from the soil type
present in an area, the availability of reservoirs to hold excess water when it rains, and
intense sunshine that shortens the time water stays on the ground after rain (Woodrow, Lindsay
and Berg, 2016). The presence or absence of these factors manifests itself in different ways.
One is flooding, which occurs where there are no adequate drainage systems or reservoirs and
where the soil is clay, which does not allow water to percolate easily into the deeper soil
layers. Another is excessive heat from the sun (now that more of the sun's rays reach the earth
through a damaged ozone layer), which makes rainwater that has settled on the ground evaporate
more quickly than expected. This leaves plants without enough water for their growth, whether
for producing food or for cooling the surrounding air by releasing moisture, and causes many
plants to wither.
Introduction
The extreme effects of the parameters mentioned above have led different stakeholders to raise
concerns. These concerns prompted the hydrological department to begin research to help the
affected groups solve their problems. Likewise, I have been directed to carry out the same
research on the same parameters that the hydrological department set out, to see whether I can
arrive at solutions that match those of the department. If my solutions do not match at all,
the questions become how mine are relevant and how they can be combined with those of the
hydrological department to improve the results of implementing the solutions that the two
groups have come up with (Gerlinger et al., 2016).
The entire process uses hydrological data on each parameter. In most cases such data are
missing specific series of values, which need to be imputed. The software that we are to use
for the imputation is R (Heeringa, West and Berglund, 2017).

Missing data is acceptable, because missing values can be imputed and replaced. What is not
acceptable is a dataset that has both missing values and wrong entries in the columns where
data was recorded during collection (Cox, 2018).
After data collection, values may end up missing for several reasons: data may be judged
unimportant and left out during collection or entry; the equipment used for collection may be
faulty; data entry may be abandoned altogether because a set of data is too hard to understand;
and data may be incompatible with the other data it is to be related to (Van Buuren, 2018). The
methods used to fill in missing data vary greatly, because they depend on the proportion and
pattern of missing values, the mechanism behind the missingness, and the type and number of
variables in the dataset in question. The mechanism gives rise to three types of missing data:
missing completely at random, missing at random, and non-ignorable missing data, also called
missing not at random (Little and Rubin, 2019). Data are missing completely at random when the
probability that a value is missing is independent of both the observed and the unobserved
data. Data are missing at random when that probability is not independent of the observed
data: it is related to observed values, which can therefore be used to predict the missing ones
(Gonzalez-Ocantos and LaPorte, 2019). The methods of missing data imputation are discussed in
the literature review section, with details addressed to stakeholders.
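The difference between the first two mechanisms can be illustrated with a small simulation. The following is only a sketch in base R: the variables temperature and evaporation, the sample size and the missingness rates are all assumptions made for illustration and do not come from the hydrological department's data.

```r
# Hypothetical daily records; all names and values here are illustrative only.
set.seed(42)
n <- 200
temperature <- rnorm(n, mean = 25, sd = 5)        # fully observed covariate
evaporation <- 2 + 0.3 * temperature + rnorm(n)   # variable that will lose values

# Missing completely at random: every value has the same 20% chance of being
# lost, regardless of any observed or unobserved quantity.
evap_mcar <- evaporation
evap_mcar[runif(n) < 0.2] <- NA

# Missing at random: the chance of loss depends only on the observed
# temperature (for example, a sensor that fails more often on hot days).
p_miss <- plogis((temperature - 25) / 2 - 1.5)
evap_mar <- evaporation
evap_mar[runif(n) < p_miss] <- NA

# Compare mean temperature on days with and without a missing reading.
mcar_gap <- mean(temperature[is.na(evap_mcar)]) - mean(temperature[!is.na(evap_mcar)])
mar_gap  <- mean(temperature[is.na(evap_mar)])  - mean(temperature[!is.na(evap_mar)])
```

Under the first mechanism the two groups of days should have similar temperatures, so mcar_gap should be close to zero. Under the second, the days with missing readings are systematically hotter, which is why the observed temperature can help predict the values that were lost.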
Objectives
Hydrological data is scientific data concerned with the movement, partitioning, distribution
and quality of water on Earth; this includes the water cycle, precipitation, watersheds and the
sustainability of water resources (Beck et al., 2017). The hydrological department collects
datasets that may have missing values for the reasons stated above. The datasets are to be used
to address different challenges raised by different stakeholders. In its raw form the dataset
is considered dirty and needs thorough cleaning. There are different cleaning methods; one is
deleting every row that has missing data. Deleting entire rows is generally discouraged,
because doing so throws away the other data points in those rows, which are needed in the
analysis. For this reason we will focus on imputation methods for cleaning the data before
using it to run the analysis, so that the results can be used to create solutions for the
issues raised by stakeholders.
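The cost of deleting incomplete rows can be counted directly. The sketch below uses base R on a small made-up table; the column names and values are hypothetical and only serve to show how many observed values are discarded along with the missing ones.

```r
# Hypothetical hydrological records; column names and values are illustrative.
hydro <- data.frame(
  rainfall_mm    = c(12.0, NA, 30.5, 0.0, 8.2, 15.1),
  evaporation_mm = c(4.1, 3.8, NA, 5.0, 4.4, NA),
  soil_moisture  = c(0.31, 0.28, 0.40, NA, 0.25, 0.33)
)

# Count missing entries per column.
missing_per_column <- colSums(is.na(hydro))

# Listwise deletion: keep only the rows that have no missing value.
complete_rows <- na.omit(hydro)

# Observed values thrown away together with the incomplete rows.
observed_total   <- sum(!is.na(hydro))
observed_kept    <- sum(!is.na(complete_rows))
observed_dropped <- observed_total - observed_kept
```

In this made-up table only 4 of the 18 entries are missing, yet deleting incomplete rows leaves 2 of the 6 rows and discards 8 values that were actually observed.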
Stakeholders and Benefits
There are different stakeholders to be addressed by this exercise of hydrological data
imputation. The first group is the hydrological department, which is expected to understand
how water circulates with respect to the variables mentioned in the abstract. If regions are
experiencing floods, the department should understand what soil type is present in those
regions and how open the pores of that soil are to the percolation of water. If the soil is
clay, which cannot be changed into anything better, are there drainage channels that carry the
excess water out of the flooded areas into reservoirs? (Sharp, 2016). Imputing the missing data
values would help management respond with confidence to the issues raised by affected people.
After analysis of the imputed datasets, management will be able to answer affected people not
only in words but also in action: the volumes of water that would otherwise be wasted in floods
can be channelled to reservoirs and stored for later use in irrigation.
Stakeholders who complain about excess evaporation caused by heat from climate change also
benefit. Once a dataset that describes the amount of water lost through evaporation has been
imputed, the relevant authorities will be able to advise on how to protect crops from excess
evaporation. Either the water harvested in the reservoirs can be used, or mulching can be done
to reduce the effects of excess evaporation on plants and young crops (Malaiya, Arora and
Arora, 2017).
Literature Review
This section discusses the cleaning of the data, with a focus on the R software. As mentioned
earlier, there are several ways of cleaning datasets that are not yet fit for analysis. I
pointed out one that should be discouraged: deleting rows that have missing entries throws away
other data points that would help a great deal in analysing the dataset.
The methods that scientists have proposed for cleaning data by imputing missing values include
Kriging, the nearest neighbour method, linear interpolation, the arithmetic mean method,
artificial intelligence techniques and the regression method (Kim, Gao and Rzhetsky, 2018).
These are statistical and machine learning methods, and all of them can be run in both R and
Python. Since Python is a multi-purpose programming language that many people find somewhat
complex, we will use R for the imputation and data analysis (Das et al., 2016).
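Two of the simpler methods in this list, the arithmetic mean and linear interpolation, can be compared in a few lines of base R. The rainfall series below is made up for illustration; approx() is the base R interpolation function.

```r
# Hypothetical daily rainfall series (mm) with gaps; values are illustrative.
rainfall <- c(10, NA, 14, 16, NA, NA, 22)
day <- seq_along(rainfall)
observed <- !is.na(rainfall)

# Arithmetic mean imputation: every gap gets the mean of the observed values.
mean_filled <- rainfall
mean_filled[!observed] <- mean(rainfall, na.rm = TRUE)

# Linear interpolation: each gap is filled along the straight line joining
# its nearest observed neighbours.
interp_filled <- approx(x = day[observed], y = rainfall[observed], xout = day)$y
```

Here mean_filled places 15.5 in every gap, while interp_filled gives 12, 18 and 20, following the upward trend in the series. Mean imputation ignores the ordering of the observations, which matters for data recorded over time.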
The purpose of my hydrological data imputation is to establish the amount of water lost through
floods (due to poor soil drainage) and through evaporation due to excess heat after it rains.
Of all the methods of data imputation listed, the easiest to explain to a stakeholder group
with little or no knowledge of data analysis or data imputation is regression analysis. This is
because, in the field of climate, quantities such as rainfall amounts, temperature changes and
solar radiation are commonly predicted using regression. Choosing regression as the main method
for my data imputation process therefore helps a great deal in explaining the end results to
stakeholders (Jones, 2017).

Regression method: there are two types of linear regression analysis. The first is simple
linear regression, which has one dependent variable and exactly one independent variable. The
second is multiple linear regression, which has one dependent variable and two or more
independent variables. The dependent variable is also called the response or target variable,
and an independent variable is also called an explanatory or predictor variable. A linear
regression model assumes a linear relationship between the target variable and the predictor
variable or variables, and is written as:
Y_i = B_0 + B_1 X_i1 + … + B_p X_ip + E_i
where Y_i is the ith observation of the target variable, B_0 is the intercept, B_1, …, B_p are
the coefficients, X_ij is the ith observation of the jth independent variable and E_i is the
error term (Harrell, 2015). The model is linear because the expected response changes by B_j
for every unit increase in the jth predictor, with the other predictors held fixed. If all the
predictor values are zero, the expected response equals the intercept B_0.
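Both kinds of regression can be fitted in R with lm(). The following is a minimal sketch on simulated data: the variables temperature, radiation and evaporation and the coefficients used to generate them are assumptions for illustration only.

```r
# Simulated data; variable names and coefficients are assumptions for illustration.
set.seed(1)
n <- 100
temperature <- runif(n, 15, 35)
radiation   <- runif(n, 100, 300)
evaporation <- 1 + 0.2 * temperature + 0.01 * radiation + rnorm(n, sd = 0.5)
records <- data.frame(evaporation, temperature, radiation)

# Simple linear regression: one predictor.
simple_fit <- lm(evaporation ~ temperature, data = records)

# Multiple linear regression: two predictors.
multiple_fit <- lm(evaporation ~ temperature + radiation, data = records)

coef(multiple_fit)                # estimates of the intercept and both coefficients
summary(multiple_fit)$r.squared   # share of the variance explained by the model

# With every predictor set to zero, the prediction is the estimated intercept.
at_zero <- predict(multiple_fit,
                   newdata = data.frame(temperature = 0, radiation = 0))
```

The value of at_zero equals the first element of coef(multiple_fit), which matches the reading of the intercept as the expected response when all predictors are zero.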
Since a regression model is fitted only on complete rows, the missing entries must first be
filled in. In R this can be done with the mice() function from the mice package, for example
impute <- mice(dataset[, c(7, 8, 13)], m = 3, seed = 123), where 7, 8 and 13 are arbitrary
column numbers standing for the columns to be included, m = 3 requests three imputed datasets
and seed makes the result reproducible. Once this step is done, the dataset is clean and can be
run through a regression model to obtain the R-squared value and the p-values, which indicate
how well the model fits. From the fitted coefficients we can then write out the model and use
it to predict the amount of water expected to be lost during floods and through evaporation.
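The idea of filling gaps by regression and then inspecting the fit can be sketched in base R without any extra package. This is a simplified stand-in, not the mice() procedure itself: it performs a single deterministic regression imputation on simulated data, and every name and value in it is an assumption for illustration.

```r
# Simulated records; names, sizes and coefficients are assumptions for illustration.
set.seed(123)
n <- 120
temperature <- runif(n, 15, 35)
evaporation <- 1 + 0.2 * temperature + rnorm(n, sd = 0.5)
records <- data.frame(temperature, evaporation)

# Remove 25 evaporation readings to mimic gaps in the record.
gap_rows <- sample(n, 25)
records$evaporation[gap_rows] <- NA

# Step 1: fit the regression on the rows where evaporation was observed.
# By default lm() drops rows that have NA in the model variables.
fit <- lm(evaporation ~ temperature, data = records)

# Step 2: predict the missing readings from their observed temperature.
imputed <- records
imputed$evaporation[gap_rows] <- predict(fit, newdata = records[gap_rows, ])

# Step 3: inspect the fit that produced the imputed values.
fit_summary <- summary(fit)
fit_summary$r.squared                                  # R-squared
fit_summary$coefficients["temperature", "Pr(>|t|)"]    # p-value of the slope
```

Because every imputed value lies exactly on the fitted line, this approach understates the natural scatter in the data. Multiple imputation, as requested by m = 3 in the mice() call, addresses this by producing several completed datasets that differ in their imputed values.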
Conclusion
Once a model for making predictions has been obtained, appropriate decisions and actions can
follow, because there are numbers to support them. Decisions on how to conserve the water that
is lost in various ways are most likely to benefit all stakeholders when they are backed by
well-understood numbers.

References
Beck, H.E., Vergopolan, N., Pan, M., Levizzani, V., van Dijk, A.I., Weedon, G.P., Brocca, L.,
Pappenberger, F., Huffman, G.J. and Wood, E.F., 2017. Global-scale evaluation of 22
precipitation datasets using gauge observations and hydrological modeling. Hydrology and
Earth System Sciences, 21(12), pp.6201-6217.
Cox, D.R., 2018. Analysis of binary data. Routledge.
Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A.E., Kwong, A., Vrieze, S.I., Chew, E.Y.,
Levy, S., McGue, M. and Schlessinger, D., 2016. Next-generation genotype imputation service
and methods. Nature genetics, 48(10), p.1284.
Gerlinger, C., Bamber, L., Leverkus, F., Schwenke, C., Haberland, C., Schmidt, G. and Endrikat,
J., 2016. Comparing The EQ-5D-5L Value Sets Across Different Countries–Impact on
Interpretation of Clinical Study Results. Value in Health, 19(7), p.A389.
Gonzalez-Ocantos, E. and LaPorte, J., 2019. Process Tracing and the Problem of Missing Data.
Sociological Methods & Research, p.0049124119826153.
Harrell Jr, F.E., 2015. Regression modeling strategies: with applications to linear models,
logistic and ordinal regression, and survival analysis. Springer.
Heeringa, S.G., West, B.T. and Berglund, P.A., 2017. Applied survey data analysis. Chapman
and Hall/CRC.
Jones, N., 2017. How machine learning could help to improve climate forecasts. Nature News,
548(7668), p.379.
Kim, J.S., Gao, X. and Rzhetsky, A., 2018. RIDDLE: Race and ethnicity Imputation from
Disease history with Deep LEarning. PLoS computational biology, 14(4), p.e1006106.
Little, R.J. and Rubin, D.B., 2019. Statistical analysis with missing data (Vol. 793). Wiley.
Malaiya, A., Arora, A. and Arora, B.B., 2017. Performance comparison of cuboical box type
solar still deployed with different basin profiles. International Journal, 5(3), pp.362-365.
Sharp, L., 2016. Ridding the world of hunger cannot be separated from the need to curb the
harmful effects of climate change on food security and nutrition. Impact, 2016(1), pp.18-19.
Van Buuren, S., 2018. Flexible imputation of missing data. Chapman and Hall/CRC.
Woodrow, K., Lindsay, J.B. and Berg, A.A., 2016. Evaluating DEM conditioning techniques,
elevation source data, and grid resolution for field-scale hydrological parameter extraction.
Journal of hydrology, 540, pp.1022-1029.