Imputation of Hydrological Data Using R Language

Added on  2022/12/16

AI Summary
This article discusses the process of imputing hydrological data using R language. It explains the importance of data imputation and different missing mechanisms. The objectives of the study, stakeholders involved, and the benefits of imputing hydrological data are also discussed.

Hydrological 1
Name of Student
Name of Class
Name of Professor
Name of School
City and State of School
Date

Table of Contents
1. Abstract
2. Introduction
2.1. Missing completely at Random:
2.2. Missing at Random:
2.3. Missing depending on unobserved predictors:
2.4. Missing dependent on the Value itself:
3. Objectives
4. Stakeholders and Benefits
5. Literature Review
6. Conclusion
References
1. Abstract
Hydrological data are scientific data on the movement, partitioning, distribution and
quality of water on Earth; this includes the water cycle, how water falls from the sky, water
divides and the sustainability of water resources (Woodrow, et al. 2016). Hydrological data are
mostly used for prediction, forecasting and understanding past and future hydrological
circulation. Like most datasets used for statistical analysis, hydrological datasets often
contain missing entries. A dataset with missing entries is incomplete and may need to be
completed by imputation. Imputation helps analysts fit informative models and obtain reliable
results for the statistical question at hand. Datasets with missing entries not only give poorer
models and results; in some cases the analysis code fails to run at all. That is why analysis
software such as R and Python provides functions for data wrangling and data imputation
(Bhardwaj et al, 2015). One common approach to handling missing records is deletion, where
every row that contains a missing value is dropped. This throws away the other, valid data
points in those rows, which would have increased the accuracy of the analysis (Yang, 2018).
For this reason we explore other data imputation algorithms in R.
Data imputation and infilling may be tedious, but they are necessary to support analysis
and water resources management. Given the reasons why data cleaning is needed, imputation or
infilling should never be done carelessly. Infilling a dataset matters for analysis, but what
matters even more is that the infilled values are as appropriate and relevant as possible.
Poorly infilled data points can distort the results and, eventually, the decisions that are
based on them (Liu, et al. 2016).
Key Words: Hydrological data, water cycle, imputation, R language.
2. Introduction
Several missingness mechanisms can cause data points to be missing from a dataset.
Understanding these mechanisms shows why particular data points are missing in the first place
and therefore how the missing data should be handled (Miao, et al. 2016). The mechanisms below
are ordered from the most restrictive to the most general.
2.1. Missing completely at Random:
A variable is missing completely at random if the probability of missingness is the same for
every entry. Take, for example, a question whose answer is recorded according to the roll of a
die. If the face 5 shows up, the entry is left blank; if any other face shows up, the entry is
filled. When data are missing completely at random, throwing away the cases with missing data
does not bias the results of the analysis (Li, et al. 2015).
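The die-rolling rule can be tried out on simulated data. The sketch below is only an illustration: the variable `rainfall`, its mean and its spread are made-up assumptions, not values from the study. It blanks out roughly one entry in six and compares the complete-case mean with the mean of the full data.

```r
# Toy illustration of MCAR using the die-rolling rule described above.
# All values are simulated; nothing here comes from the study data.
set.seed(1)
n <- 6000
rainfall <- rnorm(n, mean = 50, sd = 10)   # hypothetical complete variable

# Roll a die for every entry; a 5 means the entry is left blank.
die <- sample(1:6, n, replace = TRUE)
rainfall.obs <- ifelse(die == 5, NA, rainfall)

# Share of missing entries (expected to be near 1/6).
prop.missing <- mean(is.na(rainfall.obs))

# Complete-case mean versus the mean of the full data.
mean.full <- mean(rainfall)
mean.cc   <- mean(rainfall.obs, na.rm = TRUE)
print(c(prop.missing = prop.missing, full = mean.full, complete.case = mean.cc))
```

Because the die ignores the rainfall values, the two means should differ only by sampling noise, which is the sense in which deletion is unbiased under this mechanism.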
2.2. Missing at Random:
In this case, the dataset has several variables and the probability that a variable is missing
depends only on available information, that is, on other variables that are fully observed.
This is a more general missingness mechanism than missing completely at random. Missingness of
this kind is often modelled with a logistic regression in which the outcome variable is coded 1
for observed cases and 0 for missing cases. As long as the analysis regression controls for all
the variables that affect the missingness of the respective variable, the missing cases can be
excluded by being marked NA (Little and Rubin, 2019).
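The logistic model of missingness can be sketched on simulated data. Everything in this block is assumed for illustration: the variables `solar` and `evap`, the coefficients, and the rule that links missingness to solar radiation. It codes the outcome as 1 for observed and 0 for missing, as described above.

```r
# Toy MAR example: missingness in evap depends only on the observed solar.
set.seed(2)
n <- 2000
solar <- rnorm(n, mean = 20, sd = 5)           # fully observed predictor (hypothetical)
evap  <- 2 + 0.5 * solar + rnorm(n, sd = 1)    # variable that will have gaps

# Higher solar radiation makes an evaporation record more likely to be missing.
p.miss <- plogis(-6 + 0.25 * solar)
evap.obs <- ifelse(runif(n) < p.miss, NA, evap)

# Outcome coded 1 for observed cases and 0 for missing cases.
observed <- as.integer(!is.na(evap.obs))
miss.model <- glm(observed ~ solar, family = binomial)
print(coef(summary(miss.model)))
```

A clearly nonzero coefficient on `solar` indicates that missingness depends on an observed variable, so an analysis of `evap.obs` would need to control for `solar`.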
2.3. Missing depending on unobserved predictors:
Missingness is no longer at random if it depends on information that has not been recorded,
especially when this information also predicts the variable that is missing. Take, for example,
a study of the rate at which water evaporates from the soil after rain, in an area where the
ground dries quickly and crops wither as a result. Suppose a group of farmers facing excessive
evaporation is asked whether the evaporation rate affects their farming returns negatively or
positively. A farmer who does not understand the importance of the study, and who is angry
about how evaporation reduces his agricultural proceeds, might refuse to take part in the
survey. This leaves an unanswered entry. Here the data are not missing at random: they are
missing because of information that was not recorded (the farmer's attitude), and this
information strongly affects the missing answer (Zhang and Wang, 2017).

2.4. Missing dependent on the Value itself:
In this scenario, the probability that a value is missing depends on the value of the variable
itself. For example, a farmer who has been harvesting water for use on the farm during drought
seasons does not report how much water has to be harvested for a given farm size
(Van Buuren, 2018).
3. Objectives
The objective of the report is to carry out an imputation of hydrological data using the R
language. Imputation means finding and replacing missing data in a dataset. As discussed above,
data can be missing for several reasons. The hydrological department of the country's
meteorological agency has been receiving complaints from many citizens. The complaints that
reach the hydrological department relate to water and its circulation to and from the earth's
surface. This led the department to study the movement of water from the sky to the earth's
surface and from the earth's surface back to the sky. Another issue to be monitored was the rate
at which water percolates into the deeper layers of the soil, and the soil pore structure that
determines how well a soil drains was to be investigated. Soil drainage was investigated because
it largely determines whether a region floods when it rains. One of the complaints received by
the department concerns floods, and flooding after a heavy downpour depends strongly on the soil
type and on how well the soil drains.
Climate change parameters are commonly measured in terms of rainfall amounts and water surface
elevations. Another parameter that should not be forgotten is the amount of solar radiation that
reaches the earth's surface and the water surface. It strongly affects the water surface
elevation and how long water stays on the surface before it evaporates. The complaints that have
reached the hydrological department concern excessive evaporation in some parts of the country
and too much flooding in other, fewer parts (Doyle, 2016). The department was aware of these
problems before the complaints were raised, because hydrological matters are its responsibility,
so it is understandable that it had first-hand information before hearing from the general
public. Water can kill and it can also be of benefit: floods take lives, yet the same water
sustains life in many ways. The complaints led the hydrological department to analyze the
hydrological data it had collected in order to make informed decisions that can be used to
resolve them.
The collected dataset, however, was missing data series that needed infilling. This called
for machine learning algorithms. Since we had a mixed set of stakeholders, we employed two
different data infilling (imputation) algorithms. The methods available for filling missing data
include nearest neighbour, linear regression, artificial neural networks, fuzzy rules, ordinary
kriging, multiple linear regression (MLR), copula-based estimation, MLR using the EM algorithm,
and simple random imputation (Jones et al, 2018). For simplicity, two of these methods were
chosen to see whether they give the same results: regression and simple random imputation.
Daily rainfall amounts for 2008 to 2018, measured with rain gauges in the areas that experience
too much flooding, were taken from the respective weather stations. The rate at which floods
occur was to be determined, and the independent variables were the availability of proper
drainage systems, the availability of reservoirs into which excess flood water can be drawn, and
the soil drainage type. The second set of data covered the amounts of water evaporation that
affected the soil's moisture status (Armstrong, McHale, et al. 2019). The independent variable
in this case was the amount of solar radiation. Data on the rate of evaporation that left the
soil drier were also taken for the period 2008 to 2018.
The most significant aim of this study was to find the rate of water evaporation from the
ground. The second aim was to find how water in flood-prone areas could be diverted to areas
with reservoirs, limiting the spread of water and preventing frequent flooding. It was evident
that soil drainage was poor and could not allow large amounts of water to percolate into the
deeper layers (Gramlich, et al. 2018).
Multiple R, R-squared and the standard error were used to compare the results obtained.
The observed results indicate that the standard error for the normalized data was lower than
for the regular data.
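The three comparison measures can be computed in R as sketched below. The report does not say how the data were normalized, so z-score scaling with `scale()` is an assumption here, and the data, the variable names and the helper `fit.stats` are invented for illustration.

```r
# Compare Multiple R, R-squared and residual standard error on raw and
# z-score normalized data (simulated; the normalization method is assumed).
set.seed(7)
n <- 120
solar.radiation <- runif(n, 10, 30)
evaporation <- 1 + 0.3 * solar.radiation + rnorm(n, sd = 0.5)
raw  <- data.frame(evaporation, solar.radiation)
norm <- as.data.frame(scale(raw))    # z-scores: mean 0, sd 1

# Hypothetical helper that collects the three measures for one data frame.
fit.stats <- function(d) {
  s <- summary(lm(evaporation ~ solar.radiation, data = d))
  c(multiple.R = sqrt(s$r.squared), R2 = s$r.squared, std.error = s$sigma)
}
stats <- rbind(raw = fit.stats(raw), normalized = fit.stats(norm))
print(round(stats, 4))
```

Under this kind of scaling, Multiple R and R-squared are unchanged, while the standard error is expressed in standard-deviation units of the outcome. A lower standard error on normalized data may therefore reflect the change of units rather than a better fit.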
4. Stakeholders and Benefits
There are two groups of stakeholders. The first is the hydrological department, which is
trying to solve the hydrological problems experienced by the second group. The second group
consists of the complainants: ordinary residents who complain of frequent flooding, and farmers
who are troubled by high rates of evaporation.
Solving these two hydrological problems would help both groups of stakeholders. The data
models can be used to make predictions that guide how excess water is channelled into
reservoirs. They can also inform ways of preventing excess evaporation from the ground, either
by using mulch or by planting more trees so that the land surface is not left bare. Farmers who
depend on moist soils for their crops would benefit, water would be stored in the reservoirs for
future use, and lives would no longer be lost to floods after heavy downpours. In addition, the
scientists of the hydrological department would improve their data analysis skills and their
ability to relate results to the interpretation and solution of social problems.
5. Literature Review
Imputation is performed more formally whenever more than a trivial fraction of the data is
missing. To keep the imputation easy to follow, a simple approach is considered first, in which
the missing values of the amount of water lost via evaporation are imputed from the observed
data for this variable. The hydrological department, which wants to answer the complainants who
frequently come to it on hydrological matters, needs to impute the hydrological data before
analyzing them to generate the intended answers. The best way to do this is to build up step by
step, from simpler code to more complicated code that illustrates the whole process of imputing
datasets for later analysis.
The simplest function for understanding imputation is given below:
R Code:
random.imp <- function (a) {
  missing <- is.na(a)                 # TRUE where an entry is missing
  n.missing <- sum(missing)           # number of missing entries
  a.obs <- a[!missing]                # observed values only
  imputed <- a
  # fill each gap with a random draw from the observed values
  imputed[missing] <- sample (a.obs, n.missing, replace=TRUE)
  return (imputed)
}
To see how the function works, a small dataset was taken and the function above was run
line by line. random.imp was then used to create a complete data vector for water lost through
evaporation, represented by WLTE.
R code:
WLTE.imp <- random.imp (WLTE)
However, simple random imputation has limited value, because it ignores the information
available in the other variables of the dataset.
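The random imputation step described above can be tried on a toy vector. In this sketch the function is repeated so that the block runs on its own, and the WLTE values are made up for illustration rather than taken from the study.

```r
# Simple random imputation: fill each gap with a draw from the observed values.
random.imp <- function(a) {
  missing <- is.na(a)
  n.missing <- sum(missing)
  a.obs <- a[!missing]
  imputed <- a
  imputed[missing] <- sample(a.obs, n.missing, replace = TRUE)
  imputed
}

set.seed(3)
WLTE <- c(4.1, NA, 3.8, 5.0, NA, 4.4, 4.9, NA)   # made-up evaporation values
WLTE.filled <- random.imp(WLTE)
print(WLTE.filled)
```

The observed entries are left untouched and every gap receives one of the observed values. Note that `sample()` treats a single number as a range to draw from, so the function should not be used when only one value is observed.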
A better approach is to fit a regression model that predicts the missing cases from the
observed cases, as the following steps show.
A simple and general imputation procedure that uses individual-level information fits a
regression to the nonzero values of WLTE. We begin by setting up a data frame with all the
variables we shall use in our analysis:

SIS <- data.frame (cbind (WLTE, WLTE.top, solar.radiation))
and then fit a regression to the positive values of WLTE:
lm.imp.1 <- lm (WLTE ~ solar.radiation,
data=SIS, subset=WLTE>0)
We then go through the steps needed to create deterministic and random imputations. We first
get predictions for all the data:
pred.1 <- predict (lm.imp.1, SIS)
The data frame SIS is passed to predict so that predictions are returned for every row of the
dataset, not only for the rows used to fit the model.
Next we need a function that puts the predictions into the missing values:
impute <- function (a, a.impute) {ifelse (is.na(a), a.impute, a)}
We then use it to fill in the missing WLTE (water lost through evaporation) entries
(Bartlett and Morris, 2015):
WLTE.imp.1 <- impute (WLTE, pred.1)
With this step the missing data points have been imputed, and the same regression model can be
used to predict the amount of water lost due to high-intensity solar radiation. The high
intensity of solar radiation is attributed here to climate change.
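Deterministic and random regression imputation can be compared side by side. The sketch below uses simulated data with invented coefficients. The random version adds normally distributed noise with the model's residual standard deviation, which is one common way to build it and is an assumption here, not necessarily the procedure used in the report.

```r
# Regression imputation on simulated data: deterministic versus random.
set.seed(4)
n <- 200
solar.radiation <- runif(n, 10, 30)
WLTE <- 1 + 0.3 * solar.radiation + rnorm(n, sd = 0.5)
WLTE[sample(n, 40)] <- NA                      # knock out 40 entries
SIS <- data.frame(WLTE, solar.radiation)

# lm() drops the rows with missing WLTE when fitting.
lm.imp <- lm(WLTE ~ solar.radiation, data = SIS)

# Deterministic imputation: the fitted value for every row.
pred.det <- predict(lm.imp, newdata = SIS)

# Random imputation: fitted value plus noise at the residual scale.
sigma.hat <- summary(lm.imp)$sigma
pred.rnd <- pred.det + rnorm(n, mean = 0, sd = sigma.hat)

impute <- function(a, a.impute) ifelse(is.na(a), a.impute, a)
WLTE.det <- impute(SIS$WLTE, pred.det)
WLTE.rnd <- impute(SIS$WLTE, pred.rnd)

print(c(sd.deterministic = sd(WLTE.det), sd.random = sd(WLTE.rnd)))
```

Deterministic imputation places every imputed point exactly on the regression line, which tends to understate the spread of the completed variable. Adding noise is meant to restore that variability.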
Simple linear regression involves one dependent variable and one independent variable.
When both are drawn on a Cartesian plane, the dependent variable is on the y-axis and the
predictor variable is on the x-axis. The general regression equation is
$y_i = b_0 + b_1 x_{i1} + \dots + b_p x_{ip} + e_i$,
where $e_i$ is the error term, the $b_j$ are the coefficients of the independent variables and
$y_i$ is the dependent variable. In simple linear regression there is only one independent
variable, so all coefficients other than $b_0$ and $b_1$ are zero. If $b_1$ were also zero, the
model would predict the intercept $b_0$ for every observation. The R-squared and adjusted
R-squared values play a significant role in data analysis: high values indicate that the
predictor variables explain most of the variation in the dependent variable and are useful for
further prediction, while very low values give little confidence in the predictors involved. In
my case, both values were above 0.9, meaning that the predictor explained more than 90% of the
variation in the dependent variable, which is a good sign. The p-values should also be below the
0.05 significance level for the model to be considered statistically significant. R-squared and
the overall p-value are related: for a fixed sample size and number of predictors, the
F-statistic increases with R-squared, so a higher R-squared corresponds to a smaller p-value
(Faraway, 2016).
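The quantities discussed above can be read from a fitted model in R. The sketch below uses simulated data with invented coefficients, so the printed values are not the report's results. It also recomputes the F-statistic from R-squared to show how the overall p-value follows from it.

```r
# R-squared, adjusted R-squared and the overall p-value from summary(lm).
set.seed(5)
n <- 100
solar.radiation <- runif(n, 10, 30)
WLTE <- 1 + 0.3 * solar.radiation + rnorm(n, sd = 0.5)

fit <- lm(WLTE ~ solar.radiation)
s <- summary(fit)

r2     <- s$r.squared
adj.r2 <- s$adj.r.squared
f      <- s$fstatistic            # named vector: value, numdf, dendf
p.val  <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# The same F-statistic recomputed from R-squared and the degrees of freedom.
f.from.r2 <- (r2 / f[["numdf"]]) / ((1 - r2) / f[["dendf"]])

print(c(R2 = r2, adj.R2 = adj.r2, F = f[["value"]],
        F.from.R2 = f.from.r2, p = p.val))
```

The two F values agree, which is why a high R-squared and a small overall p-value appear together once the sample size and the number of predictors are fixed.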
Once the linear model has been fitted in R, the amount of water lost via evaporation can be
predicted. Since complaints had been raised, the amounts of evaporation from the soil are
expected to be very high. This could be addressed by storing rainwater in reservoirs for
irrigation, by planting trees, or by mulching to prevent the excess evaporation that causes
crops and other plants to wither.
For the second imputation, on the issue of floods, we turn to multiple linear regression,
where the single dependent variable is the amount of water lost through floods when it rains
(that is, water that is not captured in reservoirs and that affects people's lives). The
predictor variables are the availability of reservoirs, the availability of drainage and the
soil's drainage. Enough reservoirs to trap water that would otherwise be lost to floods would
greatly reduce the amount of flood water that affects people's lives. Proper drainage would help
by channelling more water to the reservoirs. Good soil drainage is also advantageous, as it lets
much of the water percolate into the lower layers. The absence of any of these would mean more
water lost to floods (McElreath, 2018).
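A multiple linear regression of this kind, used for imputation, might look like the sketch below. The data are simulated, and the variable names (`flood.loss`, `reservoirs`, `drainage`, `soil.drainage`), their coding and the coefficients are all assumptions made for illustration.

```r
# Multiple linear regression imputation for flood losses (simulated data).
set.seed(6)
n <- 300
reservoirs    <- rbinom(n, 1, 0.5)     # 1 = reservoir available (hypothetical coding)
drainage      <- rbinom(n, 1, 0.6)     # 1 = proper drainage system available
soil.drainage <- runif(n, 0, 1)        # 0 = poorly drained, 1 = well drained
flood.loss <- 100 - 20 * reservoirs - 15 * drainage -
  30 * soil.drainage + rnorm(n, sd = 5)
flood.loss[sample(n, 50)] <- NA        # knock out 50 entries
floods <- data.frame(flood.loss, reservoirs, drainage, soil.drainage)

# Fit on the observed rows; lm() drops rows with a missing outcome.
mlr <- lm(flood.loss ~ reservoirs + drainage + soil.drainage, data = floods)
print(round(coef(mlr), 2))

# Fill the gaps with the model's predictions.
pred <- predict(mlr, newdata = floods)
floods$flood.loss.imp <- ifelse(is.na(floods$flood.loss), pred, floods$flood.loss)
```

In this simulation the negative coefficients mirror the argument in the text: the presence of reservoirs, drainage systems and well-drained soil each goes with less water lost to floods.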
6. Conclusion
Imputation of data is important, and just as in engineering and financial markets, it is
good to have complete data in the hydrological department, as complete data support informed
decisions and help solve the problems that arise from hydrological issues. Water-related events
can seem to be a matter of chance, but when all the relevant data are available, the problems
can be addressed. The hydrological department should therefore aim to collect all the relevant
data without missing entries, to avoid the future challenges that come with data that need
imputation.
References
Armstrong, S., McHale, G., Ledesma-Aguilar, R.A. and Wells, G.G., 2019. Pinning-Free
Evaporation of Sessile Droplets of Water from Solid Surfaces. Langmuir.
Bartlett, J.W. and Morris, T.P., 2015. Multiple imputation of covariates by substantive-model
compatible fully conditional specification. The Stata Journal, 15(2), pp.437-456.
Bhardwaj, A., Deshpande, A., Elmore, A.J., Karger, D., Madden, S., Parameswaran, A.,
Subramanyam, H., Wu, E. and Zhang, R., 2015. Collaborative data analytics with DataHub.
Proceedings of the VLDB Endowment, 8(12), pp.1916-1919.
Doyle, J., 2016. Mediating climate change. Routledge.
Faraway, J.J., 2016. Linear models with R. Chapman and Hall/CRC.
Gramlich, A., Stoll, S., Stamm, C., Walter, T. and Prasuhn, V., 2018. Effects of artificial land
drainage on hydrology, nutrient and pesticide fluxes from agricultural fields–A review.
Agriculture, Ecosystems & Environment, 266, pp.84-99.
Jones, A., Keatley, A.C., Goulermas, J.Y., Scott, T.B., Turner, P., Awbery, R. and Stapleton, M.,
2018. Machine learning techniques to repurpose Uranium Ore Concentrate (UOC) industrial
records and their application to nuclear forensic investigation. Applied Geochemistry, 91,
pp.221-227.
Li, P., Stuart, E.A. and Allison, D.B., 2015. Multiple imputation: a flexible tool for handling
missing data. JAMA, 314(18), pp.1966-1967.
Little, R.J. and Rubin, D.B., 2019. Statistical analysis with missing data (Vol. 793). Wiley.
Liu, H., Tk, A.K., Thomas, J.P. and Hou, X., 2016, March. Cleaning framework for bigdata: An
interactive approach for data cleaning. In 2016 IEEE Second International Conference on Big
Data Computing Service and Applications (BigDataService) (pp. 174-181). IEEE.
McElreath, R., 2018. Statistical rethinking: A Bayesian course with examples in R and Stan.
Chapman and Hall/CRC.
Miao, W., Ding, P. and Geng, Z., 2016. Identifiability of normal and normal mixture models with
nonignorable missing data. Journal of the American Statistical Association, 111(516),
pp.1673-1683.
Van Buuren, S., 2018. Flexible imputation of missing data. Chapman and Hall/CRC.
Woodrow, K., Lindsay, J.B. and Berg, A.A., 2016. Evaluating DEM conditioning techniques,
elevation source data, and grid resolution for field-scale hydrological parameter extraction.
Journal of Hydrology, 540, pp.1022-1029.

Yang, J.G., 2018. The Advantages and Disadvantages of Internet Commerce in China. In
Entrepreneurship, Collaboration, and Innovation in the Modern Business Era (pp. 278-290). IGI
Global.
Zhang, Q. and Wang, L., 2017. Moderation analysis with missing data in the predictors.
Psychological Methods, 22(4), p.649.