Data modeling in R: Conditional Probability, Entropy, Correlation and Covariance Coefficient

Data_Modelling
March 31, 2018
Title: Data modeling in R
Assignment type: Analytical computation and simulation, in R
Student Name: XXX
Tutor’s Name: XXX
Due date: XX-XX

Question 1
Conditional probability
A die is tossed 7 times. Let A be the event that a given value appears at least once, and B the event that the outcomes alternate between odd and even numbers. Then:
Analytical
The conditional probability of A given B is Pr(A|B), defined through
Pr(A and B) = Pr(A|B) · Pr(B),
where Pr(A and B) is the joint probability of A and B. Therefore:
Pr(A|B) = Pr(A and B) / Pr(B)
Probability of a number on a die occurring at least once, i.e. event A (the computation below gives the probability of exactly six occurrences in seven tosses):
k = 7 (k = number of tosses)
p = 0.167 (p = probability of the event occurring on one toss)
q = 0.833 (q = probability of the event not occurring)
Hence, given the binomial probability
b(x; k, p) = [ k! / (x!(k − x)!) ] · p^x · (1 − p)^(k−x),
b(6; 7, 0.167) = [ 7! / (6!·1!) ] · 0.167^6 · 0.833^1 = 0.000126
Probability of a number occurring with alternating parity, i.e. event B:
Given that the probability a number occurs is 0.000126, the outcomes alternate if the sequence is either odd followed by even or even followed by odd. Hence:
Probability of even = [ 7! / (6!·1!) ] · 0.167^6 · 0.833^1 = 0.000126
And
Probability of odd = [ 7! / (6!·1!) ] · 0.167^6 · 0.833^1 = 0.000126
Since the odd-first and even-first cases are mutually exclusive, the probability of one or the other is P(odd) + P(even).
But P(odd) = 0.000126 and P(even) = 0.000126. Therefore:
0.000126 + 0.000126 = 0.000252
Alternatively:
0.000126 × 2 = 0.000252
Hence:
Pr(B) = 0.000252
Therefore, calculating the conditional probability:
Pr(A|B) = Pr(A and B) / Pr(B)
But, treating A and B as independent:
Pr(A and B) = Pr(A) · Pr(B) = 0.000126 × 0.000252 = 0.0000000318
Hence:
Pr(A|B) = Pr(A and B) / Pr(B) = 0.0000000318 / 0.000252 = 0.000126
Pr(A|B) = 0.000126
Simulation
#calculating the probability of the event occurring when rolling a die once, i.e. p
dicex<-6
n<-1
dice.prob<-function(dices) prod(1/dices)
dice.prob(c(n,dicex))
## [1] 0.1666667
Document Page
#calculating the probability of the event not occurring when rolling a die, i.e. q
dicex<-6
n<-1
dice.prob<-function(dices) 1-prod(1/dices)
dice.prob(c(n,dicex))
## [1] 0.8333333
#calculating the probability of a number occurring exactly six times in 7 tosses, i.e. event A
eventA<-dbinom(6, size=7, prob=0.1666667)
eventA
## [1] 0.0001250287
#calculating probability of event B
evendice<-(dbinom(6, size=7, prob=0.1666667))
odddice<-(dbinom(6, size=7, prob=0.1666667))
EventB<-evendice+odddice
EventB
## [1] 0.0002500574
#calculating the conditional probability of event A given event B, P(A|B)
conditprob<-(EventB*eventA)/EventB
conditprob
## [1] 0.0001250287
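As a cross-check (not part of the original submission), the dbinom() value for event A can be reproduced by direct simulation; the sketch below assumes the face of interest is 1:
#Monte Carlo cross-check: estimate the probability that a fixed face
#appears exactly 6 times in 7 rolls of a fair die
set.seed(1000)
hits<-replicate(1000000, sum(sample(1:6, 7, replace=TRUE)==1)==6)
mean(hits)   #should be close to dbinom(6, size=7, prob=1/6) = 0.0001250287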
Question 2
Entropy
Imputing missing data (handling NAs)
#raw data
#imputed data
#summary of imputed and raw data
#histogram of X variables
library(mice)   #needed for mice() below
assignment<-read.table("c:/data.csv",
header=TRUE,
sep=",")
assignment
newdata<-mice(assignment, m=10, maxit=40, meth='pmm', seed=1000)
fulldata<-complete(newdata, 1)
fulldata
summary(fulldata)

## X Y
## Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :1.00
## Mean :0.44 Mean :0.55
## 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :1.00 Max. :1.00
Classdata1<-fulldata$X
hist(Classdata1)
hist(Classdata1, main="Histogram of class assignment data for X variables",
xlab="X variables", border="blue", col="skyblue",
xlim=c(0,1), las=1, breaks=5)
#raw data
#imputed data
#summary of imputed and raw data
#histogram of Y variables
assignment<-read.table("c:/data.csv",
header=TRUE,
sep=",")
assignment
library(mice)
md.pattern(assignment)
## X Y
## 78 1 1 0
## 9 0 1 1
## 12 1 0 1
## 1 0 0 2
## 10 13 23
newdata<-mice(assignment, m=10, maxit=40, meth='pmm', seed=1000)
fulldata<-complete(newdata, 1)
fulldata
summary(fulldata)
## X Y
## Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :1.00
## Mean :0.44 Mean :0.55
## 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :1.00 Max. :1.00
Classdata2<-fulldata$Y
hist(Classdata2)
hist(Classdata2, main="Histogram of class assignment data for Y variables",
xlab="Y variables", border="blue", col="skyblue",
xlim=c(0,1), las=1, breaks=5)
(i) Probability
#calculating probability distributions for X and Y
prop.table(table(fulldata))
## Y
## X 0 1
## 0 0.24 0.32
## 1 0.21 0.23
        Y
X       0      1      pi
0       0.23   0.32   0.55
1       0.24   0.21   0.45
pj      0.47   0.53   1.00
Hence from the table:
P(x) =0.47
P(y) =0.55

P(x|y) = P(x and y) / P(y) = 0.23 / 0.55 = 0.41818
P(y|x) = P(x and y) / P(x) = 0.23 / 0.47 = 0.48936
P(x)·P(y) = 0.47 × 0.55 = 0.2585
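The same conditional probabilities can be read off the joint table in R. A minimal sketch, assuming the imputed fulldata from the previous step:
#conditional probabilities from the joint distribution of X and Y
joint<-prop.table(table(fulldata$X, fulldata$Y))
sweep(joint, 2, colSums(joint), "/")   #columns give P(X = x | Y = y)
sweep(joint, 1, rowSums(joint), "/")   #rows give P(Y = y | X = x)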
(ii) Entropy
Simulation
Table for calculating entropy for X variables
#raw data
#imputed data
#Calculating entropy
assignment<-read.table("c:/data.csv",
header=TRUE,
sep=",")
library(mice)
## Loading required package: lattice
md.pattern(assignment)
## X Y
## 78 1 1 0
## 9 0 1 1
## 12 1 0 1
## 1 0 0 2
## 10 13 23
newdata<-mice(assignment, m=10, maxit=40, meth='pmm', seed=1000)
fulldata<-complete(newdata, 1)
summary(fulldata)
## X Y
## Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :1.00
## Mean :0.44 Mean :0.55
## 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :1.00 Max. :1.00
Classdata1<-fulldata$X
entr <- function(CLS.FRQ){   #defining the entropy function
  classfrequency <- CLS.FRQ
  entr <- 0
  for(i in 1:length(classfrequency)){
    if(classfrequency[[i]] != 0){   #class with nonzero frequency
      entrp <- -sum(classfrequency[[i]] * log2(classfrequency[[i]]))   #class entropy term
    } else {
      entrp <- 0
    }
    entr <- entr + entrp
  }
  return(entr)
}
freqs <- table(Classdata1)/length(Classdata1)
freqs
## Classdata1
## 0 1
## 0.56 0.44
entr(freqs)
## [1] 0.9895875
Entropy:
H(X) = −Σᵢ pᵢ · log2(pᵢ),
where the sum runs over the classes of X and the class frequencies pᵢ are estimated from all n = 100 observations.
H(X) = −(0.56 · log2(0.56) + 0.44 · log2(0.44)) = 0.98959
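As a quick check (not in the original code), the same value falls out of a base-R one-liner applied to the frequency table:
#one-line equivalent of the entr() function, using freqs from above
-sum(freqs * log2(freqs))   #matches entr(freqs): 0.9895875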
Simulation
Table for calculating entropy for Y variables
#Calculating entropy
assignment<-read.table("c:/data.csv",
header=TRUE,
sep=",")
library(mice)
md.pattern(assignment)
## X Y
## 78 1 1 0
## 9 0 1 1
## 12 1 0 1
## 1 0 0 2
## 10 13 23
newdata<-mice(assignment, m=10, maxit=40, meth='pmm', seed=1000)
fulldata<-complete(newdata, 1)
summary(fulldata)
## X Y
## Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :1.00
## Mean :0.44 Mean :0.55
## 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :1.00 Max. :1.00
Classdata2<-fulldata$Y
entr <- function(CLS.FRQ){   #defining the entropy function
  classfrequency <- CLS.FRQ
  entr <- 0
  for(i in 1:length(classfrequency)){
    if(classfrequency[[i]] != 0){   #class with nonzero frequency
      entrp <- -sum(classfrequency[[i]] * log2(classfrequency[[i]]))   #class entropy term
    } else {
      entrp <- 0
    }
    entr <- entr + entrp
  }
  return(entr)
}
freqs <- table(Classdata2)/length(Classdata2)
freqs
## Classdata2
## 0 1
## 0.45 0.55
entr(freqs)
## [1] 0.9927745
H(Y) = −(0.45 · log2(0.45) + 0.55 · log2(0.55)) = 0.99278

H(X|Y) = −Σ_{x,y} p(x, y) · log2 p(x|y) = 0.96204
H(Y|X) = −Σ_{x,y} p(x, y) · log2 p(y|x) = 0.99962
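No code was given for the conditional entropies; a minimal sketch (an assumption, reusing the imputed fulldata from above) computes them directly from the joint table:
#conditional entropies from the joint distribution
joint<-prop.table(table(fulldata$X, fulldata$Y))
pXgY<-sweep(joint, 2, colSums(joint), "/")   #P(X | Y)
pYgX<-sweep(joint, 1, rowSums(joint), "/")   #P(Y | X)
-sum(joint * log2(pXgY))   #H(X|Y)
-sum(joint * log2(pYgX))   #H(Y|X)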
Question 3
Correlation and covariance coefficient
Given Gaussian r.v.’s U = X − Y and V = 2X + 3Y.
a) Analytical
Solution
Given X and Y are independent standard Gaussian r.v’s
Step 1:
A Gaussian distribution is such that
X ~ N(0, 1), Y ~ N(0, 1), with PDFs:
f_X(x) = (1/√(2π)) · e^(−x²/2)
f_Y(y) = (1/√(2π)) · e^(−y²/2)
Since X and Y are standard normal, each has mean 0 and variance 1.
Hence:
Step 2:
Calculating covariance
The covariance of X and Y is given as:
Cov(X, Y) = E[XY] − E[X]·E[Y]
But given that the sample mean is calculated as:
E[Y] = ȳ = (1/2) · (2 + 3) = 2.500
E[X] = x̄ = (1/2) · (1 − 1) = 0.000
Therefore:
Step 3:
Sample covariance is calculated as in:
𝓸xy = 1
n1
i=1
n
( xi¿ x )( yi y ) ¿
Hence from values from the distribution and sample means of both x and y
1
21
1
2
(10.000)(22.5)+ 1
21
1
2
(10.000)(32.500)= -1
Thus cov of x and y= -1
*This is within range, since the correlation coefficient is always bounded within [−1, 1].
Correlation
The correlation between two random variables is given as:
Corr(X, Y) = Cov(X, Y) / (σ_X · σ_Y)
The covariance was previously calculated as −1, and by the definition of a standard normal distribution the variance is 1.
Therefore, from the formula:
Corr(X, Y) = −1 / 1 = −1
As noted in * above, the random variables are negatively correlated.
Hence corr(X, Y) = −1

#calculating covariance of U and V
U=c(1,-1)   #defining coefficient vector for U
V=c(2,3)    #defining coefficient vector for V
for (i in 1:10000){   #the loop recomputes the same deterministic value
  out=cov(U,V)
}
out   #printing the result
## [1] -1
#calculating correlation of U and V
U=c(1,-1)   #defining coefficient vector for U
V=c(2,3)    #defining coefficient vector for V
for (i in 1:10000){
  out=cor(U,V)
}
out   #printing the result
## [1] -1
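For comparison, a Monte Carlo sketch (not in the original) that simulates the standard normals X and Y directly and estimates the covariance and correlation of U = X − Y and V = 2X + 3Y; note that Var(U) = 2 and Var(V) = 13, so under this sampling the correlation is −1/√26 ≈ −0.196 even though the covariance is −1:
#simulating U and V from independent standard normals
set.seed(1)
n<-100000
X<-rnorm(n); Y<-rnorm(n)
U<-X-Y; V<-2*X+3*Y
cov(U, V)   #analytic value: 2*Var(X) - 3*Var(Y) = -1
cor(U, V)   #analytic value: -1/sqrt(2*13) = -0.196 approx.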
Question 4.
M.L.E (maximum likelihood estimator)
Determining the maximum likelihood estimator given [4,3,2,4,6,3,4,0,5,6,4,4,4,5,3,3,4,5,4,5]
from a Poisson distribution.
Analytical
Given:
p_X(x_j) = { e^(−λ₀) · λ₀^(x_j) / x_j!,  if x_j ∈ R_X
           { 0,                          elsewhere
Likelihood function:
L(λ; x₁, …, x_n) = Π_{j=1..n} e^(−λ) · λ^(x_j) / x_j!
Hence the log-likelihood function is:
ℓ(λ; x₁, …, x_n) = −Σ_{j=1..n} ln(x_j!) + ln(λ) · Σ_{j=1..n} x_j − nλ
Hence, setting the derivative of the log-likelihood with respect to λ to zero gives:
λ̂ = ( Σ_{i=1..n} x_i ) / n = x̄
This implies the MLE of λ for a Poisson distribution is the sample mean.
Thus MLE = (4+3+2+4+6+3+4+0+5+6+4+4+4+5+3+3+4+5+4+5) / 20 = 3.9
# Finding MLE for Poisson
dat<-c(4,3,2,4,6,3,4,0,5,6,4,4,4,5,3,3,4,5,4,5)   #dataset
lama<-mean(dat)
poissloglik=function(lama, dat){   #Poisson log-likelihood (up to an additive constant)
  n=length(dat)   #total number of observations
  logglik=sum(dat)*log(lama)-n*lama   #log-likelihood equation (n*lama outside the sum)
}
library(stats4)   #stats4 provides mle() for maximum likelihood estimation
ll<-function(ma, sig){   #normal negative log-likelihood for the data
  R=dnorm(dat, ma, sig)
  -sum(log(R))
}
mle(ll, start=list(ma=1, sig=1))   #fitting mean and sd by maximum likelihood
##
## Call:
## mle(minuslogl = ll, start = list(ma = 1, sig = 1))
##
## Coefficients:
##       ma      sig
## 3.900000 1.337909
Optimizing parameters
#using the optim() function to find the optimal parameter
dat<-c(4,3,2,4,6,3,4,0,5,6,4,4,4,5,3,3,4,5,4,5)   #data
lama<-mean(dat)
poisslik=function(lama, dat){   #likelihood function
  n=length(dat)   #total number of observations
  #note: n*lama sits inside sum(), so the objective is sum(dat)*log(lama) - n^2*lama,
  #whose maximizer is mean(dat)/n = 0.195 rather than the MLE 3.9
  lolik=sum(dat*log(lama)-n*lama)
}
mle=optim(par=1, fn=poisslik,
          dat=c(4,3,2,4,6,3,4,0,5,6,4,4,4,5,3,3,4,5,4,5),
          control=list(fnscale=-1))   #fnscale=-1 turns minimization into maximization
## Warning in optim(par = 1, fn = poisslik, dat = c(4, 3, 2, 4, 6, 3, 4, 0, :
## one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize() directly
## Warning in log(lama): NaNs produced
## Warning in log(lama): NaNs produced
## Warning in log(lama): NaNs produced
mle
## $par
## [1] 0.1950195
##
## $value
## [1] -205.5109
##
## $counts
## function gradient
## 32 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Hence from $par the optimizing parameter is 0.1950195. Note that 0.1950195 ≈ 3.9/20 = mean(dat)/n: because n*lama sits inside sum(), the function maximized is Σxᵢ·log(λ) − n²λ rather than the true log-likelihood, whose maximizer is the sample mean 3.9.
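With the n*lama term moved outside sum() (see the comment in the code above), the optimum recovers the sample mean; a minimal sketch using optimize(), as the warning recommends:
#corrected log-likelihood, optimized with optimize()
poisloglik2<-function(lama, dat) sum(dat)*log(lama)-length(dat)*lama
optimize(poisloglik2, interval=c(0.01, 10), dat=dat, maximum=TRUE)$maximum
#approximately 3.9, agreeing with the analytical MLE (the sample mean)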
Question 5
Central limit theorem
Given repeated samples of 10 i.i.d. draws (the question specifies a population with λ = 10), illustrate the central limit theorem using 100; 1,000; 10,000; and 100,000 sample means.
Definition:
The central limit theorem states that for i.i.d. random variables, the distribution of the sample mean converges to a Gaussian as the sample size increases.

Sample size 100
#R code illustrating the central limit theorem (CLT)
#note: the population is drawn with runif(), i.e. uniform(0,1), although the
#question specifies a Poisson population with lambda = 10
cltp=runif(100)   #population of 100 uniform draws
par(mfrow=c(3,1))
out=vector()
for (i in 1:100){
  out_i=mean(sample(cltp, size=10))
  out=c(out, out_i)
}
x<-out
v<-hist(x, breaks=10, col="skyblue", xlab="mean",
        main="Histogram with Gauss curve of sample means")
fttingX<-seq(min(x),max(x),length=40)
fttingY<-dnorm(fttingX,mean=mean(x),sd=sd(x))
fttingY <- fttingY*diff(v$mids[1:2])*length(x)
lines(fttingX, fttingY, col="red", lwd=3)
summary(cltp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04116 0.27143 0.61110 0.54385 0.77833 0.99621
sd(cltp)
## [1] 0.2942439
Sample size 1000
#R code illustrating the central limit theorem (CLT)
cltp=runif(1000)   #population of 1000 uniform draws
par(mfrow=c(3,1))
out=vector()
for (i in 1:1000){
  out_i=mean(sample(cltp, size=10))
  out=c(out, out_i)
}
x<-out
v<-hist(x, breaks=10, col="skyblue", xlab="mean",
        main="Histogram with Gauss curve of sample means")
fttingX<-seq(min(x),max(x),length=40)
fttingY<-dnorm(fttingX,mean=mean(x),sd=sd(x))
fttingY <- fttingY*diff(v$mids[1:2])*length(x)
lines(fttingX, fttingY, col="red", lwd=3)
summary(cltp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.004686 0.255798 0.479432 0.496862 0.744805 0.999099
sd(cltp)
## [1] 0.2854455
Sample size 10000
#R code illustrating the central limit theorem (CLT)
cltp=runif(10000)   #population of 10000 uniform draws
par(mfrow=c(3,1))
out=vector()
for (i in 1:10000){
  out_i=mean(sample(cltp, size=10))
  out=c(out, out_i)
}
x<-out
v<-hist(x, breaks=10, col="skyblue", xlab="mean",
        main="Histogram with Gauss curve of sample means")
fttingX<-seq(min(x),max(x),length=40)
fttingY<-dnorm(fttingX,mean=mean(x),sd=sd(x))
fttingY <- fttingY*diff(v$mids[1:2])*length(x)
lines(fttingX, fttingY, col="red", lwd=3)
Sample size 100000
#R code illustrating the central limit theorem (CLT)

cltp=runif(100000)   #population of 100000 uniform draws
par(mfrow=c(3,1))
out=vector()
for (i in 1:100000){
  out_i=mean(sample(cltp, size=10))
  out=c(out, out_i)
}
x<-out
v<-hist(x, breaks=10, col="skyblue", xlab="mean",
        main="Histogram with Gauss curve of sample means")
fttingX<-seq(min(x),max(x),length=40)
fttingY<-dnorm(fttingX,mean=mean(x),sd=sd(x))
fttingY <- fttingY*diff(v$mids[1:2])*length(x)
lines(fttingX, fttingY, col="red", lwd=3)
These results illustrate the central limit theorem: as the sample of means grows larger, their distribution approaches a normal distribution, and the fitted Gaussian curve assumes a clear bell shape.
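Finally, a sketch (not in the original) repeating the experiment with the Poisson(λ = 10) population the question actually specifies, in place of runif():
#CLT demonstration with a Poisson(10) population
set.seed(1)
cltp<-rpois(100000, lambda=10)
out<-replicate(10000, mean(sample(cltp, size=10)))
hist(out, breaks=30, col="skyblue", xlab="mean",
     main="Means of size-10 samples from a Poisson(10) population")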