Note: for the first three exercises you only need pencil and paper.

1. Joint, marginal and conditional distributions I ★

The joint discrete probability table of \(P_{A,B}(a,b)\) is given below:

         B=1   B=2   B=3
   A=1   0.2   0.1   0.3
   A=2   0.1   0.1   0.2

Derive the following probabilities:

  • \(P_{A,B}(1,2)\)
  • \(P_{B}(2)\)
  • \(P_{A|B}(1|2)\)
  • Are \(A\) and \(B\) independent?

Solution

  • The joint probability \(P_{A,B}(1,2) = 0.1\) can be read directly from the table.
  • The marginal \(P_{B}(2)\) is computed by summing over all joint probabilities with \(B=2\), i.e. \(P_{A,B}(1,2) + P_{A,B}(2,2) = 0.2\).
  • Using the results from above we can compute the conditional probability \(P_{A|B}(1|2)=\frac{P_{A,B}(1,2)}{P_{B}(2)}=0.5\).
  • \(A\) and \(B\) are not independent. To see this, first calculate the marginal probabilities. In this example the joint probability is not equal to the product of the marginal probabilities, \(P_{A,B}(a,b) \neq P_{A}(a) \,P_{B}(b)\) (for instance \(P_{A,B}(1,1)=0.2\) while \(P_{A}(1)\,P_{B}(1)=0.6 \cdot 0.3=0.18\)), hence \(A\) and \(B\) cannot be independent.
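
As a quick numerical cross-check (not part of the pencil-and-paper exercise), the following Python sketch reproduces the four results directly from the table; the variable names are just illustrative:

import numpy as np

# Joint probability table P_{A,B}(a,b): rows are a = 1,2; columns are b = 1,2,3
P_AB = np.array([[0.2, 0.1, 0.3],
                 [0.1, 0.1, 0.2]])

P_A = P_AB.sum(axis=1)            # marginals P_A(a) = [0.6, 0.4]
P_B = P_AB.sum(axis=0)            # marginals P_B(b) = [0.3, 0.2, 0.5]

print(P_AB[0, 1])                 # P_{A,B}(1,2) = 0.1
print(P_B[1])                     # P_B(2) = 0.2
print(P_AB[0, 1] / P_B[1])        # P_{A|B}(1|2) = 0.5

# Independence would require P_{A,B}(a,b) = P_A(a) * P_B(b) for every cell
print(np.allclose(P_AB, np.outer(P_A, P_B)))   # False, so A and B are not independent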

2. Joint, marginal and conditional distributions II ★

Assume the probability densities \(p(E \mid B)\), \(p(B)\), \(p(A, D \mid E)\), and \(p(C \mid B, E)\) are known.

  • Draw the corresponding directed acyclic graph of the conditional probabilities to visualize the dependence structure.
  • Derive \(p(B, C, E)\)
  • Derive the joint distribution of \(A\), \(B\), \(C\), \(D\), and \(E\).
  • Derive \(p(A, B \mid C, D, E)\)
  • Derive \(p(A \mid D)\)
  • Derive \(p(A \mid B, E)\)

Solution

  • The given factors correspond to the following directed acyclic graph: \(B\) has no parents, \(B \to E\) (from \(p(E \mid B)\)), \(B \to C\) and \(E \to C\) (from \(p(C \mid B, E)\)), and \(E \to A\), \(E \to D\) (from \(p(A, D \mid E)\), where \(A\) and \(D\) are modeled jointly given \(E\)).
  • \(p(B, C, E) = p(C \mid B, E) \, p(E \mid B) \, p(B)\)
  • The joint distribution is \[ p(A, B, C, D, E) = p(C \mid B, E) \, p(E \mid B) \, p(B) \, p(A, D \mid E) \]
  • \[ p(A, B \mid C, D, E) = \frac{p(A, B, C, D, E)}{p(C, D, E)} = \frac{p(A, B, C, D, E)}{\int p(A, B, C, D, E)\, dA\,dB} \]
  • \[ p(A \mid D) = \frac{p(A, D)}{p(D)} = \frac{\int p(A, B, C, D, E)\, dB\,dC\,dE}{\int p(A, B, C, D, E)\, dA\,dB\,dC\,dE} = \frac{\int p(A, D \mid E)\,p(E \mid B)\,p(B) dB\,dE}{\int p(A, D \mid E)\,p(E \mid B)\,p(B) dA\,dB\,dE} \]
  • \[ p(A \mid B, E) = \frac{\int p(A, D\mid E)\,p(E \mid B) \,p(B) \, dD}{p(E \mid B) \,p(B)} = \int p(A, D\mid E) \, dD = p(A \mid E) \]
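
As an optional numerical sanity check (not part of the exercise), the identity \(p(A \mid B, E) = p(A \mid E)\) can be verified with arbitrary conditional tables for binary variables. The Python sketch below builds the joint distribution from the four given factors and compares the two conditionals; all names (e.g. random_conditional) are hypothetical helpers:

import numpy as np

rng = np.random.default_rng(0)

def random_conditional(shape, axis=0):
    # Random non-negative table normalized to sum to 1 along `axis`
    t = rng.random(shape)
    return t / t.sum(axis=axis, keepdims=True)

# All variables are binary; array indices are ordered as indicated in the comments
pB    = random_conditional((2,))               # p(b), index [b]
pE_B  = random_conditional((2, 2))             # p(e|b), index [e, b]
pC_BE = random_conditional((2, 2, 2))          # p(c|b,e), index [c, b, e]
pAD_E = rng.random((2, 2, 2))                  # p(a,d|e), index [a, d, e]
pAD_E /= pAD_E.sum(axis=(0, 1), keepdims=True)

# Joint p(a,b,c,d,e) = p(c|b,e) p(e|b) p(b) p(a,d|e)
joint = np.einsum('cbe,eb,b,ade->abcde', pC_BE, pE_B, pB, pAD_E)
assert np.isclose(joint.sum(), 1.0)

# p(a|b,e): marginalize c and d, then normalize over a
p_abe = joint.sum(axis=(2, 3))                 # index [a, b, e]
pA_given_BE = p_abe / p_abe.sum(axis=0, keepdims=True)

# p(a|e) = sum_d p(a,d|e)
pA_given_E = pAD_E.sum(axis=1)                 # index [a, e]

# The derived identity p(A|B,E) = p(A|E): the same value for every b
print(np.allclose(pA_given_BE, pA_given_E[:, None, :]))   # True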

3. Compound distribution ♛

Assume that: \[ \mu \sim f_{\mu}(m) = \begin{cases} 0.1\exp(-0.1m) & m \geq 0 \\ 0 & \text{else} \end{cases} \] and \[ X \sim f_{X|\mu=m}(x \mid m) = \frac{1}{\sqrt{2\pi}} \exp\left(- \frac{(x-m)^2}{2}\right) \]

This means \(X\) is normally distributed with mean \(\mu\), and \(\mu\) itself is exponentially distributed.

Derive and interpret:

  • \(f_{X, \mu}(x, m)\)
  • \(f_{X}(x)\), a so-called compound distribution.
  • \(P(\mu>5)\)
  • \(f_{X,\mu|\mu>5}(x, m)\)
  • \(f_{X|\mu>5}(x)\)

It is not the aim to find closed forms for the integrals.

Solution

  • joint distribution: \[ f_{X, \mu}(x, m) = f_{X|\mu}(x\mid m) \, f_{\mu}(m) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x-m)^2}{2}} \cdot 0.1\, e^{-0.1m} \quad \text{for } m \geq 0 \text{ (and } 0 \text{ otherwise)} \]

  • compound distribution \[ f_{X}(x) = \int_0^\infty f_{X, \mu}(x, m) \,dm \]

  • probability that \(\mu\) is larger than 5: \[ P(\mu>5) = \int_5^\infty f_{\mu}(m) \, dm = \exp(-0.1 \cdot 5) = e^{-0.5} \approx 0.61 \]

  • The joint distribution under the constraint that \(\mu>5\): \[ f_{X,\mu|\mu>5}(x, m)= \begin{cases} f_{X, \mu}(x, m) / P(\mu>5)& \text{if}\ m>5\\ 0 & \text{otherwise} \end{cases} \]

  • marginal of \(X\) under the constraint that \(\mu>5\): \[ f_{X|\mu>5}(x) = \int_5^\infty f_{X,\mu|\mu>5}(x, m) \, dm \neq f_{X}(x) \]
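
These quantities can also be approximated by simulation. The following Monte Carlo sketch in Python (not required by the exercise; variable names are illustrative) samples from the hierarchical model and checks \(P(\mu>5)\) against the analytical value \(e^{-0.5}\):

import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# mu ~ Exponential with rate 0.1 (numpy uses the scale = 1/rate parameterization)
mu = rng.exponential(scale=10.0, size=n)
# X | mu ~ Normal(mu, 1)
x = rng.normal(loc=mu, scale=1.0, size=n)

# P(mu > 5): Monte Carlo estimate vs. exact value exp(-0.5)
print(np.mean(mu > 5), np.exp(-0.5))

# Samples from f_{X|mu>5}: keep only the pairs with mu > 5
x_cond = x[mu > 5]
print(x.mean(), x_cond.mean())   # the conditional marginal of X is shifted to larger values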

4. Sampling vs. evaluating random variables ★

Take two random variables \(X\) and \(Y\) with the following distributions: \[ X \sim \text{Uniform}(0, 1)\\ Y \sim \text{Normal}(2, 10) \] For \(Y\), the parameters 2 and 10 are the mean and the standard deviation, respectively.

  1. Evaluate the probability densities \(f_X(0.8)\) and \(f_Y(0.8)\).
  2. Generate 10000 samples from both random variables. Visualize the distributions as histograms.

Another random variable is defined as a function of \(X\) as follows: \[ Z = \sin(2\pi X) \sqrt{X} \] While it is difficult to derive the probability density \(f_Z(z)\) analytically, sampling from it is easy.

  3. Visualize the density \(f_Z(z)\): first generate 10000 samples from \(X\), then transform the samples and visualize them as a histogram.

Hints

R

The most important univariate probability distributions are already implemented in R. Type ?Distributions to get an overview. For every distribution __ (the abbreviated distribution name), four functions are defined with the following naming scheme:

d__(x, ...)   # evaluate pdf at x
p__(x, ...)   # evaluate cdf at x
q__(p, ...)   # evaluate the p-th quantile
r__(n, ...)   # sample n random numbers

For example, for the normal distribution the functions are called dnorm(), pnorm(), qnorm(), and rnorm().

Histograms are generated with the function hist. You can adjust the number of bins with the argument breaks, e.g. hist(rnorm(10000), breaks=100).

Julia

The package Distributions.jl provides access to many probability distributions, see the documentation.

Here are some useful functions you can apply to all distributions (not only Normal):

π = Normal(0, 10) # define distribution

pdf(π, x)       # evaluate pdf at x
logpdf(π, x)    # evaluate log pdf at x
cdf(π, x)       # evaluate cdf at x
quantile(π, p)  # evaluate the p-th quantile
rand(π, n)      # sample n random numbers

Histograms are generated with the function histogram from the package Plots.jl. You can adjust the number of bins with the argument nbins, e.g. histogram(rand(Uniform(0,5), 10000), nbins = 100).

Python

The package scipy provides access to many probability distributions. The packages matplotlib and seaborn offer convenient plotting functions.

Here are some examples of how to call some functions:

from scipy.stats import uniform, norm

uniform.pdf(x, loc, scale)  # Evaluate uniform pdf at x
norm.pdf(x, loc, scale)     # Evaluate normal pdf at x

Solution

R

  1. evaluate pdf
dunif(0.8, 0, 1)
## [1] 1
dnorm(0.8, 2, 10)
## [1] 0.03960802
  2. sample from distributions and plot them
xs <- runif(10000, 0, 1)
ys <- rnorm(10000, 2, 10)
par(mfrow=c(1,2))                       # optional, arranges both plots together
hist(xs)
hist(ys)

  3. sample from \(Z\)
xs <- runif(10000, 0, 1)                # sample from X
zs <- sin(2*pi*xs) * sqrt(xs)           # transform samples
hist(zs, breaks=75)

Julia

  1. Evaluate pdf
using Distributions

X = Uniform(0,1)
Y = Normal(2, 10)

Use the function pdf() (probability density function):

pdf_X = pdf(X, 0.8)
## 1.0
pdf_Y = pdf(Y, 0.8)
## 0.03960802117936561
  2. Sample from distributions and plot them
X_s = rand(X, 10000)
Y_s = rand(Y, 10000)
using Plots
p = [
  histogram(X_s, nbins = 20,
    labels=false,
    xlabel="X sample",
    ylabel="Frequency",
    title="Histogram of X_s"),
  histogram(Y_s, nbins = 20,
    labels=false,
    xlabel="Y sample",
    ylabel="Frequency",
    title="Histogram of Y_s")
];
plot(p...)

  3. Sample from \(Z\)
Z_s = sin.(2*π*X_s) .* sqrt.(X_s)  # the dot syntax applies each operation element-wise
histogram(Z_s, nbins = 75,
    labels=false,
    xlabel="Z sample",
    ylabel="Frequency",
    title="Histogram of Z_s")

Python

The following libraries are required:

import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import uniform, norm
import math
import numpy as np
  1. evaluate pdf
# Uniform distribution (X ∼ Uniform(0,1))
pdf_uniform = uniform.pdf(0.8, loc=0, scale=1)  # loc and scale for the range [0,1]
# Normal distribution (Y ∼ Normal(2,10))
pdf_normal = norm.pdf(0.8, loc=2, scale=10)  # loc=2 is the mean, scale=10 is the standard deviation
# Print the probability densities
print(f"fX(0.8) = {pdf_uniform}")
## fX(0.8) = 1.0
print(f"fY(0.8) = {pdf_normal}")
## fY(0.8) = 0.03960802117936561
  2. sample from distributions and plot them
# Generate 10,000 samples from X ∼ Uniform(0,1)
xs = uniform.rvs(loc=0, scale=1, size=10000)
# Generate 10,000 samples from Y ∼ Normal(2,10)
ys = norm.rvs(loc=2, scale=10, size=10000)
# Visualize the histograms of the samples
plt.figure(figsize=(14, 6))

# Histogram for Uniform distribution
plt.subplot(1, 2, 1)
sns.histplot(xs, kde=False, bins=30, color='blue', alpha=0.6)
plt.title('Histogram of xs')
plt.xlabel('xs')
plt.ylabel('Frequency')

# Histogram for Normal distribution
plt.subplot(1, 2, 2)
sns.histplot(ys, kde=False, bins=30, color='red', alpha=0.6)
plt.title('Histogram of ys')
plt.xlabel('ys')
plt.ylabel('Frequency')

# Display the histograms
plt.tight_layout()
plt.show()

  3. sample from \(Z\)
# Generate 10,000 samples from X ∼ Uniform(0,1)
xs = uniform.rvs(loc=0, scale=1, size=10000)
# Define the transformation function for Z
def transform_X_to_Z(X):
    return math.sin(2 * math.pi * X) * math.sqrt(X)

# Apply the transformation function to the samples from X to generate samples for Z
zs = [transform_X_to_Z(x) for x in xs]
# Visualize the histograms of the samples
plt.figure(figsize=(14, 6))

sns.histplot(zs, kde=False, bins=75, color='green', alpha=0.6)
plt.title('Histogram of zs')
plt.xlabel('zs')
plt.ylabel('Frequency')

# Display the histograms
plt.tight_layout()
plt.show()

5. Generating data ★

Generate 1000 samples of two different 2-dimensional normal distributions. The mean of both distributions is \(\boldsymbol{\mu}=(3,8)\).

  • For the first distribution, \(\boldsymbol{Y}_\mathrm{obs,indep}\), the two dimensions are independent and have standard deviations of \(\boldsymbol{\sigma}_\mathrm{obs,indep}=(2,5)\).

  • For the second distribution, \(\boldsymbol{Y}_\mathrm{obs,dep}\), the two dimensions are correlated. This is defined by the covariance matrix \[ \boldsymbol{\Sigma}_\mathrm{obs,dep}= \begin{pmatrix} 4 & 8\\ 8 & 25 \end{pmatrix}. \]

The aim is to obtain two matrices (with two columns and 1000 rows) containing the random samples.
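
Side note on the two covariance matrices: for independent dimensions the covariance matrix is diagonal with the variances on the diagonal, \[ \boldsymbol{\Sigma}_\mathrm{obs,indep}= \begin{pmatrix} 2^2 & 0\\ 0 & 5^2 \end{pmatrix} = \begin{pmatrix} 4 & 0\\ 0 & 25 \end{pmatrix}, \] and the off-diagonal entry 8 of \(\boldsymbol{\Sigma}_\mathrm{obs,dep}\) corresponds to a correlation of \(8/(2 \cdot 5)=0.8\) between the two dimensions.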

Hints

R

You can use rnorm() to sample from a one-dimensional normal distribution. With cbind() you can combine vectors into a matrix. For the dependent normal distribution use rmvnorm() from the package mvtnorm.

Julia

Given a mean vector \(\mu\) and a covariance matrix \(\Sigma\), you can construct a multivariate normal distribution with MvNormal(μ, Σ) from the package Distributions.jl. What does the covariance matrix for independent variables look like?

Python

You can use numpy.random.normal() and numpy.random.multivariate_normal() to sample from the required distributions. Additionally, you can use np.column_stack to combine vectors into a matrix.

Solution

R

library(mvtnorm)
y.obs.indep <- cbind(rnorm(1000, mean=3, sd=2),
                     rnorm(1000, mean=8, sd=5))
Sigma.dep <- matrix(c(4,8,8,25), nrow=2, ncol=2)
y.obs.dep <- rmvnorm(1000, mean=c(3,8), sigma=Sigma.dep)

Julia

using Distributions

μs = [3, 8]
Σ_indep = [2^2 0; 0 5^2]
Σ_dep = [4 8; 8 25]

Y_1 = MvNormal(μs, Σ_indep)
y_1_sample = rand(Y_1, 1000)

Y_2 = MvNormal(μs, Σ_dep)
y_2_sample = rand(Y_2, 1000)

Python

from numpy.random import normal, multivariate_normal

# Generate 1000 samples from the independent 2-dimensional normal distribution
y_obs_indep = np.column_stack([
    normal(loc=3, scale=2, size=1000),
    normal(loc=8, scale=5, size=1000)
])

# Define the covariance matrix for the dependent 2-dimensional normal distribution
sigma_dep = np.array([
    [4, 8],
    [8, 25]
])

# Generate 1000 samples from the dependent 2-dimensional normal distribution
y_obs_dep = multivariate_normal(mean=[3, 8], cov=sigma_dep, size=1000)

# Print the samples
print(y_obs_dep)
## [[ 2.62139691  5.55823272]
##  [ 2.58461922  9.8608951 ]
##  [ 6.25447771 16.01117172]
##  ...
##  [ 4.66336834  7.92106748]
##  [-0.30834595  0.3235412 ]
##  [ 5.85587315 14.55102434]]
print(y_obs_indep)
## [[ 5.0067899   4.80257306]
##  [ 1.1739046  10.51268942]
##  [ 5.78557828 10.35711131]
##  ...
##  [ 0.49380557  8.51252522]
##  [ 1.71069324 17.19001855]
##  [ 2.03179415  2.48584711]]

6. Analyzing and visualizing data ★

Perform some preliminary analysis of the data generated in Task 5:

  1. What are the interquartile and the 90%-interquantile ranges of your samples?
  2. Plot and compare the histograms and the densities of all the marginals
  3. Compare the scatterplots of \(\boldsymbol{Y}_\mathrm{obs,indep}\) and \(\boldsymbol{Y}_\mathrm{obs,dep}\)
  4. Compute the covariance and the correlation matrix of \(\boldsymbol{Y}_\mathrm{obs,indep}\) and \(\boldsymbol{Y}_\mathrm{obs,dep}\)

Which of the above steps reveal a potential correlation structure in your data?

Hints

R

Try to arrange multiple plots in the same window by setting par(mfrow=c(<nrow>,<ncol>)). Use quantile() to calculate the interquantile ranges; use hist() and plot(density()) to visualize the data; and use cov() and cor() for the covariance and the correlation, respectively.

Julia

You can use the function quantile from the Statistics standard library; it also provides functions to compute the covariance and correlation matrices.

The package Plots.jl provides basic plotting functions such as scatter and histogram. For density plots and other more specialized plots look at StatsPlots.jl.

Python

You can arrange multiple plots in the same window by setting fig, axs = plt.subplots(nrows, ncols) and then calling plt.subplot(nrows, ncols, i). Use np.percentile() to calculate the percentiles; use seaborn and matplotlib to visualize the data; and use np.cov() and np.corrcoef() for the covariance and the correlation, respectively.

Solution

R

  1. Quantiles
quartiles.indep <- cbind(quantile(y.obs.indep[,1], probs = c(0.25,0.75)),
                         quantile(y.obs.indep[,2], probs = c(0.25,0.75)))
                                        # Or with `apply`:
quartiles.indep <- apply(y.obs.indep, MARGIN=2, FUN=quantile, probs=c(0.25,0.75))

                                        # for the remaining ranges we also use apply
quartiles.dep <- apply(y.obs.dep, MARGIN=2, FUN=quantile, probs=c(0.25,0.75))
quant5_95.indep <- apply(y.obs.indep, MARGIN=2, FUN=quantile, probs=c(0.05,0.95))
quant5_95.dep <- apply(y.obs.dep, MARGIN=2, FUN=quantile, probs=c(0.05,0.95))
cat("Interquartile range\nIndep: ", diff(quartiles.indep),
    "\nDep: ", diff(quartiles.dep), "\n")
## Interquartile range
## Indep:  2.637886 6.767833 
## Dep:  2.678544 6.626577
  2. Histograms and densities
par(mfrow=c(2,2))
hist(y.obs.indep[,1], main="y1 of y.indep")
hist(y.obs.indep[,2], main="y2 of y.indep")
hist(y.obs.dep[,1],   main="y1 of y.dep")
hist(y.obs.dep[,2],   main="y2 of y.dep")

par(mfrow=c(2,2))
plot(density(y.obs.indep[,1]), main="y1 of y.indep")
plot(density(y.obs.indep[,2]), main="y2 of y.indep")
plot(density(y.obs.dep[,1]),   main="y1 of y.dep")
plot(density(y.obs.dep[,2]),   main="y2 of y.dep")

  3. Scatterplots
plot(y.obs.indep, main="y.obs.indep")

plot(y.obs.dep, main="y.obs.dep")

  4. Covariance and correlation
covariance.indep <- cov(y.obs.indep)
covariance.dep <- cov(y.obs.dep)
correlation.indep <- cor(y.obs.indep)
correlation.dep <- cor(y.obs.dep)

Julia

  1. Quantiles
using Statistics
using StatsBase

interquartiles_indep = [iqr(y_1_sample[1, :]),
                        iqr(y_1_sample[2, :])]
## 2-element Vector{Float64}:
##  2.571162277898227
##  6.956472629475935

interquartiles_dep = [iqr(y_2_sample[1, :]),
                      iqr(y_2_sample[2, :])]
## 2-element Vector{Float64}:
##  2.7495032457564967
##  6.6715088843748145

inter_5_95_quantile_indep = [diff(quantile(y_1_sample[1, :], [0.05, 0.95])),
                             diff(quantile(y_1_sample[2, :], [0.05, 0.95]))]
## 2-element Vector{Vector{Float64}}:
##  [6.504044085960378]
##  [16.221351987285335]

inter_5_95_quantile_dep = [diff(quantile(y_2_sample[1, :], [0.05, 0.95])),
                           diff(quantile(y_2_sample[2, :], [0.05, 0.95]))]
## 2-element Vector{Vector{Float64}}:
##  [6.470824530683799]
##  [16.21167790353646]
  2. Histograms and densities
hist = [
    histogram(y_1_sample[1, :], nbins = 15,
              xlabel="y_1_sample[1,:]"),
    histogram(y_1_sample[2, :], nbins = 15,
              xlabel="y_1_sample[2,:]"),
    histogram(y_2_sample[1, :], nbins = 15,
              xlabel="y_2_sample[1,:]"),
    histogram(y_2_sample[2, :], nbins = 15,
              xlabel="y_2_sample[2,:]")
]
plot(hist..., legend = false, ylabel = "frequency")

using StatsPlots
dens = [
    density(y_1_sample[1,:],
            xlabel="y_1_sample[1,:]"),
    density(y_1_sample[2,:],
            xlabel="y_1_sample[2,:]"),
    density(y_2_sample[1,:],
            xlabel="y_2_sample[1,:]"),
    density(y_2_sample[2,:],
            xlabel="y_2_sample[2,:]")
]
plot(dens..., legend = false, ylabel = "density")

Alternatively, we could also have written a loop for the four plots.

  3. Scatterplots
scatter(y_1_sample[1, :], y_1_sample[2, :],
        labels=false,
        xlabel="y_1_sample 1",
        ylabel="y_1_sample 2",
        title="y_1_sample")

scatter(y_2_sample[1, :], y_2_sample[2, :],
        labels=false,
        xlabel="y_2_sample 1",
        ylabel="y_2_sample 2",
        title="y_2_sample")

  4. Covariance and correlation
using Statistics
Statistics.cov(y_1_sample, dims=2)
## 2×2 Matrix{Float64}:
##  4.10456    0.285337
##  0.285337  26.1503
Statistics.cov(y_2_sample, dims=2)
## 2×2 Matrix{Float64}:
##  3.96535   7.82915
##  7.82915  24.3871
Statistics.cor(y_1_sample, dims=2)
## 2×2 Matrix{Float64}:
##  1.0        0.0275415
##  0.0275415  1.0
Statistics.cor(y_2_sample, dims=2)
## 2×2 Matrix{Float64}:
##  1.0       0.796149
##  0.796149  1.0

Python

  1. Quantiles
# Calculate the interquartile ranges and 90%-interquantile ranges of the samples
quartiles_indep = np.percentile(y_obs_indep, [25, 75], axis=0)
quartiles_dep = np.percentile(y_obs_dep, [25, 75], axis=0)

quant5_95_indep = np.percentile(y_obs_indep, [5, 95], axis=0)
quant5_95_dep = np.percentile(y_obs_dep, [5, 95], axis=0)

print(f"Interquartile range\nIndep: {np.diff(quartiles_indep, axis=0)[0]}")
## Interquartile range
## Indep: [2.62401887 6.29310802]
print(f"Dep: {np.diff(quartiles_dep, axis=0)[0]}")
## Dep: [2.51924248 6.84747748]
  2. Histograms and densities
# Plot and compare the histograms of all marginals
fig, axs = plt.subplots(2, 2, figsize=(14, 6))

# Subplot 1
plt.subplot(2, 2, 1)
sns.histplot(y_obs_indep[:, 0], kde=False, bins=30, color='blue', alpha=0.6)
plt.title('y1 of y.indep')
plt.xlabel('y_obs_indep[:, 0]')
plt.ylabel('Frequency')

# Subplot 2
plt.subplot(2, 2, 2)
sns.histplot(y_obs_indep[:, 1], kde=False, bins=30, color='blue', alpha=0.6)
plt.title('y2 of y.indep')
plt.xlabel('y_obs_indep[:, 1]')
plt.ylabel('Frequency')

# Subplot 3
plt.subplot(2, 2, 3)
sns.histplot(y_obs_dep[:, 0], kde=False, bins=30, color='blue', alpha=0.6)
plt.title('y1 of y.dep')
plt.xlabel('y_obs_dep[:, 0]')
plt.ylabel('Frequency')

# Subplot 4
plt.subplot(2, 2, 4)
sns.histplot(y_obs_dep[:, 1], kde=False, bins=30, color='blue', alpha=0.6)
plt.title('y2 of y.dep')
plt.xlabel('y_obs_dep[:, 1]')
plt.ylabel('Frequency')

# Display the histograms
plt.tight_layout()
plt.show()

# Plot and compare the densities of all marginals
fig, axs = plt.subplots(2, 2, figsize=(14, 6))

# Density plot for y1 of y.indep
plt.subplot(2, 2, 1)
sns.kdeplot(y_obs_indep[:, 0], color='blue', alpha=0.6)
plt.title('Density of y1 of y.indep')
plt.xlabel('y_obs_indep[:, 0]')
plt.ylabel('Density')

# Density plot for y2 of y.indep
plt.subplot(2, 2, 2)
sns.kdeplot(y_obs_indep[:, 1], color='blue', alpha=0.6)
plt.title('Density of y2 of y.indep')
plt.xlabel('y_obs_indep[:, 1]')
plt.ylabel('Density')

# Density plot for y1 of y.dep
plt.subplot(2, 2, 3)
sns.kdeplot(y_obs_dep[:, 0], color='blue', alpha=0.6)
plt.title('Density of y1 of y.dep')
plt.xlabel('y_obs_dep[:, 0]')
plt.ylabel('Density')

# Density plot for y2 of y.dep
plt.subplot(2, 2, 4)
sns.kdeplot(y_obs_dep[:, 1], color='blue', alpha=0.6)
plt.title('Density of y2 of y.dep')
plt.xlabel('y_obs_dep[:, 1]')
plt.ylabel('Density')

# Adjust layout and display the plots
plt.tight_layout()
plt.show()

  3. Scatterplots
# Compare the scatterplots of y_obs_indep and y_obs_dep
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# Scatterplot for y_obs_indep
sns.scatterplot(x=y_obs_indep[:, 0], y=y_obs_indep[:, 1], alpha=0.7, ax=axs[0])
axs[0].set_title("y.obs.indep")
axs[0].set_xlabel('y_obs_indep[:, 0]')
axs[0].set_ylabel('y_obs_indep[:, 1]')

# Scatterplot for y_obs_dep
sns.scatterplot(x=y_obs_dep[:, 0], y=y_obs_dep[:, 1], alpha=0.7, ax=axs[1])
axs[1].set_title("y.obs.dep")
axs[1].set_xlabel('y_obs_dep[:, 0]')
axs[1].set_ylabel('y_obs_dep[:, 1]')

# Display the scatterplots
plt.tight_layout()
plt.show()

  4. Covariance and correlation
# Compute the covariance and correlation matrices of y.obs.indep and y.obs.dep
covariance_indep = np.cov(y_obs_indep, rowvar=False)
covariance_dep = np.cov(y_obs_dep, rowvar=False)

correlation_indep = np.corrcoef(y_obs_indep, rowvar=False)
correlation_dep = np.corrcoef(y_obs_dep, rowvar=False)

print("Covariance matrix of y.obs.indep:", covariance_indep)
## Covariance matrix of y.obs.indep: [[ 3.97555151 -0.22370194]
##  [-0.22370194 24.86845039]]
print("\nCovariance matrix of y.obs.dep:", covariance_dep)
## 
## Covariance matrix of y.obs.dep: [[ 3.80530384  7.68472437]
##  [ 7.68472437 24.58583812]]
print("\nCorrelation matrix of y.obs.indep:", correlation_indep)
## 
## Correlation matrix of y.obs.indep: [[ 1.         -0.02249814]
##  [-0.02249814  1.        ]]
print("\nCorrelation matrix of y.obs.dep:", correlation_dep)
## 
## Correlation matrix of y.obs.dep: [[1.         0.79449491]
##  [0.79449491 1.        ]]

7. Working with dataframes ★

Real data often contain columns of different data types (e.g. numbers and strings). Dataframes are designed to work with this kind of data conveniently.

  1. Import the file ./data/model_growth.csv as a dataframe. Perform some analyses similar to the ones in Task 6.

  2. Add a new column (say C_new) to the growth data frame that contains some random values.

Hints

R

Read the data using read.table("</path/to/somefile.txt>", header=TRUE); header=TRUE indicates that the first row contains the column names (use the file ../data/model_growth.csv). To select the column C_M from a dataframe, say data, you can use data$C_M or data[,"C_M"].

To add columns, you can use cbind to column-bind the new data to the available matrix and convert everything to a data.frame. Then, you can rename the columns by assigning the desired names with function colnames() applied to the newly created data.frame.

Julia

Use the package DataFrames.jl to read and manipulate data. To import csv (comma separated values) files, we also use CSV.jl. You can use pwd() to find your current working directory.

using DataFrames
import CSV

path = joinpath("data", "myfile.csv")
data = DataFrame(CSV.File(path; delim=";"))

# select column "x"
data.x

Python

Use the pandas library to work with data frames. Read the data using pd.read_csv(file_path, sep=" "). To select a specific column, you can simply type dataframe['columnname'].

Solution

R

## Loading data
## ----------------------------------------------
growth.dat <- read.table("../../data/model_growth.csv", header=TRUE)
C_M <- growth.dat$C_M
C_S <- growth.dat$C_S
## Plotting data
## ----------------------------------------------
plot(growth.dat$t, growth.dat$C_S, type="b", xlab="Time", ylab="Concentration")
lines(growth.dat$t, growth.dat$C_M, type="b", pch=17, col=2)
legend("topright", legend=c("Substrate", "Microorganisms"), col=c(1,2), pch=c(1,17), lty=1)

## Combining Data
## ----------------------------------------------
## Data for the new column
y <- rnorm(nrow(growth.dat))

## Add the column to the dataframe
growth.dat.2 <- cbind(growth.dat, C_new = y)
head(growth.dat.2)
##     t      C_M      C_S      C_new
## 1 0.0 13.27717 48.10293 -0.4242892
## 2 0.1 13.19659 45.11676 -0.7412177
## 3 0.2 15.62860 36.28523 -0.7845806
## 4 0.3 19.44555 32.06841  0.4977962
## 5 0.4 20.05754 19.68459 -0.6675241
## 6 0.5 23.17631 10.07305  1.6641594

Julia

using CSV
using DataFrames
pwd()

Given your working directory, point to the desired file with a relative path. Remember that .. goes one directory up.

growth_dat = CSV.read(joinpath("..", "..", "data", "model_growth.csv"),
                      DataFrame)
# show(growth_dat, allrows=true) if you want to see all your data

Extraction of a column can be done with the following syntax.

C_M = growth_dat[:, "C_M"]
C_S = growth_dat[:, "C_S"]
time = growth_dat[:, "t"]
## or alternatively
time = growth_dat.t

Plotting

plot(time, C_M,
    labels="Microorganisms",
    xlabel="Time",
    ylabel="Concentration");
plot!(time, C_S,
    labels="Substrat")

Add a new column with random values:

# generate some data
C_new = rand(nrow(growth_dat))

# add new column
growth_dat.C_new = C_new

Python

import pandas as pd

## Loading data
# Specify the path to the CSV file
file_path = r"../../data/model_growth.csv"

# Load the CSV file into a pandas DataFrame
growth_dat = pd.read_csv(file_path, sep=" ")

# Access the 'C_M' and 'C_S' columns from the DataFrame
t = growth_dat['t']
C_M = growth_dat['C_M']
C_S = growth_dat['C_S']
## Plotting data
# Set the figure size (width, height) in inches
plt.figure(figsize=(14, 6))

# Create a line plot for 'C_S' against 't'
sns.lineplot(x=t, y=C_S, marker='o', label='Substrate')

# Create a line plot for 'C_M' against 't'
sns.lineplot(x=t, y=C_M, marker='^', color='red', label='Microorganisms')

# Add x-axis and y-axis labels
plt.xlabel('Time')
plt.ylabel('Concentration')

# Add a legend to the plot
plt.legend(loc='upper right')

# Optimize the plot space
plt.tight_layout()

# Display the plot
plt.show()

# Number of rows in the growth_dat dataframe
n = len(growth_dat)

# Generate an array of normally distributed random numbers with the same number of rows as growth_dat
y = np.random.normal(size=n)

# Add the new values to the dataframe as a new column named 'C_new'
growth_dat['C_new'] = y

# Display the first few rows of the updated dataframe
print(growth_dat.head())
##      t        C_M        C_S     C_new
## 0  0.0  13.277165  48.102929 -0.083110
## 1  0.1  13.196591  45.116760  1.228254
## 2  0.2  15.628600  36.285227 -0.627723
## 3  0.3  19.445545  32.068411  0.307233
## 4  0.4  20.057542  19.684587  0.583476

8. Error propagation and functions

If \(f()\) is non-linear, generally, \[ f(E[X]) \neq E[f(X)] \] where \(E\) is the expected value and \(X\) is a random variable. Implement the non-linear function \(f(x) = \sin\sqrt{x}\) and assume that \(X\) is log-normally distributed. Use samples to estimate \(f(E[X])\), \(E[f(X)]\), \(\mathrm{Var}[X]\), \(\mathrm{Var}[f(X)]\) and compare them.
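
For intuition (not required for the exercise), the inequality is easy to see for the simple nonlinear function \(f(x)=x^2\): \[ E[f(X)] = E[X^2] = \mathrm{Var}[X] + (E[X])^2 \geq (E[X])^2 = f(E[X]), \] with equality only if \(\mathrm{Var}[X]=0\).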

Hints

R

Use rlnorm to sample from a log-normal distribution, which accepts the mean and the standard deviation on the log-scale, not on the original scale. Additionally, keep in mind that in R a general function can be defined as:

function.name <- function(arg1,arg2){
  result <- arg1 + arg2 # or any other operation
  return(result)
}

Most basic functions are already available, including both sin and sqrt. Try ?sin in the R console to access the manual of the trigonometric functions. These can be used anywhere in the code, including inside a custom function.

Julia

Use LogNormal() from the package Distributions.jl for a log-normal distribution. You can create your own Julia functions following the syntax

function choose_your_name(arg_1, arg_2)
    res = arg_1 + arg_2
    return res
end

Very short functions can be defined as

foo(a, b) = a + b

Python

In python, one can define a function as follows:

def fun_square(number):
    """
    This part is optional and is preferably used to describe what the function
    does to the user.
    This function returns the square of the given number.
    
    Arguments:
    ----------
    - number: number to be processed.
    
    Returns:
    -------
    - result: the squared given number.
    """
    
    result = number * number
    
    return result

# Example usage
num = 4
result = fun_square(num)
print(result) # The output will be 16

Solution

R

fun <- function(x){
  return(sin(sqrt(x)))
}

X <- rlnorm(1000, meanlog=0, sdlog=1)
cat("mean of X: ", mean(X), "\n")
## mean of X:  1.532428
cat("mean of f(X): ", mean(fun(X)), "\n")
## mean of f(X):  0.7507829
cat("f(mean of X): ", fun(mean(X)), "\n")
## f(mean of X):  0.945104
cat("variance of X:", var(X), "\n")
## variance of X: 3.879901
cat("variance of f(X):", var(fun(X)), "\n")
## variance of f(X): 0.05379731
plot(X, fun(X))

hist(X)

hist(fun(X))

Julia

using Distributions
using Statistics

f(x) = sin(sqrt(x))
x = rand(LogNormal(), 1000)

# mean of X
mean(x)
## 1.6425246789152488
# mean of f(X)
mean(f.(x))
## 0.75761771528133
# f(mean of X)
f(mean(x))
## 0.9584762841019947
# variance of x
var(x)
## 3.8578568723913795
# variance of f(X)
var(f.(x))
## 0.05356226759233008

Plotting

using Plots

scatter(x, f.(x),
        labels = false,
        xlabel = "X",
        ylabel = "f(X)")

histogram(x, nbins=20,
          labels = false,
          xlabel = "X",
          ylabel = "Frequency",
          title = "Histogram of X")

histogram(f.(x), nbins=20,
          labels = false,
          xlabel = "f(X)",
          ylabel = "Frequency",
          title = "Histogram of f(X)")

Python

  1. Define the function
# Define the function in Python
def fun(x):
    return np.sin(np.sqrt(x))
  2. Generate the samples and compute the values
# Generate 1000 samples from a log-normal distribution with meanlog=0 and sdlog=1
X = np.random.lognormal(mean=0, sigma=1, size=1000)

# Calculate the mean of X
mean_X = np.mean(X)

# Calculate the mean of f(X)
mean_fX = np.mean(fun(X))

# Calculate the function value of the mean of X
f_mean_X = fun(mean_X)

# Calculate the variance of X
variance_X = np.var(X)

# Calculate the variance of f(X)
variance_fX = np.var(fun(X))
  3. Print the results
# Print the results
print("Mean of X:", mean_X)
## Mean of X: 1.6110755522977038
print("Mean of f(X):", mean_fX)
## Mean of f(X): 0.7533362151460489
print("f(mean of X):", f_mean_X)
## f(mean of X): 0.9548877343121427
print("Variance of X:", variance_X)
## Variance of X: 4.759319081961184
print("Variance of f(X):", variance_fX)
## Variance of f(X): 0.05099708082022862

  4. Plot the results

# Plot X against fun(X) using seaborn
plt.figure(figsize=(14, 6))
sns.scatterplot(x=X, y=fun(X), alpha=0.5, color='blue')
plt.title('Scatter Plot of X vs. fun(X)')
plt.xlabel('X')
plt.ylabel('fun(X)')
plt.grid(True)
plt.show()

# Create a figure with two subplots
plt.figure(figsize=(14, 6))

# Plot the histogram of X
plt.subplot(1, 2, 1)
sns.histplot(X, bins=30, color='blue', alpha=0.6)
plt.title('Histogram of X')
plt.xlabel('X')
plt.ylabel('Frequency')

# Plot the histogram of fun(X)
plt.subplot(1, 2, 2)
sns.histplot(fun(X), bins=30, color='red', alpha=0.6)
plt.title('Histogram of fun(X)')
plt.xlabel('fun(X)')
plt.ylabel('Frequency')

# Optimize the plot space
plt.tight_layout()

# Display the plots
plt.show()