Note: for the first three exercises you only need pencil and paper.

1. Joint, marginal and conditional distributions I ★

The joint discrete probability table of \(P_{A,B}(a,b)\) is given below:

        B=1   B=2   B=3
  A=1   0.2   0.1   0.3
  A=2   0.1   0.1   0.2

Derive the following probabilities:

  • \(P_{A,B}(1,2)\)
  • \(P_{B}(2)\)
  • \(P_{A|B}(1|2)\)
  • Are \(A\) and \(B\) independent?
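As a reminder (these are only the general definitions, not the solution), the marginal, the conditional, and the independence criterion for discrete random variables are

\[ P_B(b) = \sum_a P_{A,B}(a,b), \qquad P_{A \mid B}(a \mid b) = \frac{P_{A,B}(a,b)}{P_B(b)}, \qquad A \perp B \;\Leftrightarrow\; P_{A,B}(a,b) = P_A(a)\,P_B(b) \;\text{for all } a, b. \]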

2. Joint, marginal and conditional distributions II ★

Assume the probability densities \(p(E \mid B)\), \(p(B)\), \(p(A, D \mid E)\), and \(p(C \mid B, E)\) are known.

  • Draw the corresponding directed acyclic graph of the conditional probabilities to visualize the dependence structure.
  • Derive \(p(B, C, E)\)
  • Derive the joint distribution of \(A\), \(B\), \(C\), \(D\), and \(E\).
  • Derive \(p(A, B \mid C, D, E)\)
  • Derive \(p(A \mid D)\)
  • Derive \(p(A \mid B, E)\)
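Only two generic rules are needed here: the product rule, which combines the known conditionals into joint densities, and marginalization, which removes variables you are not interested in,

\[ p(x, y) = p(x \mid y)\, p(y), \qquad p(x) = \int p(x, y)\, \mathrm{d}y. \]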

3. Compound distribution ♛

Assume that: \[ \mu \sim f_{\mu}(m) = \begin{cases} 0.1\exp(-0.1m) & m \geq 0 \\ 0 & \text{else} \end{cases} \] and \[ X \sim f_{X|\mu=m}(x \mid m) = \frac{1}{\sqrt{2\pi}} \exp\left(- \frac{(x-m)^2}{2}\right) \]

This means \(X\) is normally distributed with mean \(\mu\), and \(\mu\) itself is exponentially distributed.

Derive and interpret:

  • \(f_{X, \mu}(x, m)\)
  • \(f_{X}(x)\), a so-called compound distribution.
  • \(P(\mu>5)\)
  • \(f_{X,\mu|\mu>5}(x, m)\)
  • \(f_{X|\mu>5}(x)\)

It is not the aim to find closed forms for the integrals.
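For reference, a compound (or marginal) distribution is in general obtained by integrating the conditional density over the distribution of its parameter,

\[ f_X(x) = \int f_{X \mid \mu}(x \mid m)\, f_{\mu}(m)\, \mathrm{d}m. \]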

4. Sampling vs. evaluating random variables ★

Take two random variables \(X\) and \(Y\) with the following distributions: \[ X \sim \text{Uniform}(0, 1), \qquad Y \sim \text{Normal}(2, 10) \]

  1. Evaluate the probability densities \(f_X(0.8)\) and \(f_Y(0.8)\).
  2. Generate 10000 samples from both random variables. Visualize the distributions as histograms.

Another random variable is defined as a function of \(X\) as follows: \[ Z = \sin(2\pi X) \sqrt{X} \] While it is difficult to derive the probability density \(f_Z(z)\) analytically, sampling from it is easy.

  1. Visualize the density \(f_Z(z)\): first generate 10000 samples from \(X\), then transform the samples, and visualize the result as a histogram.

Hints

R

The most important univariate probability distributions are already implemented in R. Type ?Distributions to get an overview. For every distribution __, four functions are defined with the following naming scheme:

d__(x, ...)   # evaluate pdf at x
p__(x, ...)   # evaluate cdf at x
q__(p, ...)   # evaluate the p-th quantile
r__(n, ...)   # sample n random numbers

For example, for the normal distribution the functions are called dnorm(), pnorm(), qnorm(), and rnorm().

Histograms are generated with the function hist. You can adjust the number of bins with the argument breaks, e.g. hist(rnorm(10000), breaks=100).
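For example, a minimal sketch for Task 4 in R could look as follows (assuming the second parameter of Normal(2, 10) is the standard deviation):

dunif(0.8, min = 0, max = 1)          # f_X(0.8)
dnorm(0.8, mean = 2, sd = 10)         # f_Y(0.8)

x <- runif(10000, min = 0, max = 1)   # samples from X
y <- rnorm(10000, mean = 2, sd = 10)  # samples from Y
hist(x, breaks = 100)
hist(y, breaks = 100)

z <- sin(2*pi*x) * sqrt(x)            # transform the samples of X into samples of Z
hist(z, breaks = 100)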

Julia

The package Distributions.jl provides access to many probability distributions, see the documentation.

Here are some useful functions you can apply to all distributions (not only Normal):

π = Normal(0, 10) # define distribution

pdf(π, x)       # evaluate pdf at x
logpdf(π, x)    # evaluate log pdf at x
cdf(π, x)       # evaluate cdf at x
quantile(π, p)  # evaluate the p-th quantile
rand(π, n)      # sample n random numbers

Histograms are generated with the function histogram from the package Plots.jl. You can adjust the number of bins with the argument nbins, e.g. histogram(rand(Uniform(0,5), 10000), nbins = 100).

Python

The package scipy provides access to many probability distributions. The packages matplotlib and seaborn offer good ways to produce the plots.

Here are some examples of how to call some functions:

from scipy.stats import uniform, norm

uniform.pdf(x, loc, scale)  # evaluate the uniform pdf at x
norm.pdf(x, loc, scale)     # evaluate the normal pdf at x

5. Generating data ★

Generate 1000 samples of two different 2-dimensional normal distributions. The mean of both distributions is \(\boldsymbol{\mu}=(3,8)\).

  • For the first distribution, \(\boldsymbol{Y}_\mathrm{obs,indep}\), both dimensions are independent and have standard deviations of \(\boldsymbol{\sigma}_\mathrm{obs,indep}=(2,5)\).

  • For the second distribution, \(\boldsymbol{Y}_\mathrm{obs,dep}\), the two dimensions are correlated. This is defined by the covariance matrix \[ \boldsymbol{\Sigma}_\mathrm{obs,dep}= \begin{pmatrix} 4 & 8\\ 8 & 25 \end{pmatrix}. \]

The aim is to obtain two matrices (with two columns and 1000 rows) containing the random samples.

Hints

R

You can use rnorm() to sample from a one-dimensional normal distribution. With cbind() you can combine vectors into a matrix. For the dependent (correlated) normal distribution use rmvnorm() from the package mvtnorm.
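A minimal sketch along these lines (the object names Y.obs.indep and Y.obs.dep are just examples):

library(mvtnorm)

## independent case: sample each dimension separately and combine the vectors
Y.obs.indep <- cbind(rnorm(1000, mean = 3, sd = 2),
                     rnorm(1000, mean = 8, sd = 5))

## dependent case: sample directly from the multivariate normal with the given covariance
Sigma <- matrix(c(4, 8, 8, 25), nrow = 2)
Y.obs.dep <- rmvnorm(1000, mean = c(3, 8), sigma = Sigma)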

Julia

Given a mean vector \(\mu\) and a covariance matrix \(\Sigma\), you can construct a multivariate normal distribution with MvNormal(μ, Σ) from the package Distributions.jl. What does the covariance matrix for independent variables look like?

Python

You can use numpy.random.normal() and numpy.random.multivariate_normal() to sample the required distributions. Additionally, you can use np.column_stack to combine vectors into a matrix.

6. Analyzing and visualizing data ★

Perform some preliminary analysis of the data generated in Task 5:

  1. What are the interquartile and the 90%-interquantile ranges of your samples?
  2. Plot and compare the histograms and the densities of all the marginals
  3. Compare the scatterplots of \(\boldsymbol{Y}_\mathrm{obs,indep}\) and \(\boldsymbol{Y}_\mathrm{obs,dep}\)
  4. Compute the covariance and the correlation matrix of \(\boldsymbol{Y}_\mathrm{obs,indep}\) and \(\boldsymbol{Y}_\mathrm{obs,dep}\)

Which of the above steps reveal a potential correlation structure in your data?

Hints

R

Try to arrange multiple plots in the same window by setting par(mfrow=c(<nrow>,<ncol>)). Use quantile() to calculate the interquantile ranges; use hist() and plot(density()) to visualize the data; use cov() and cor() for the covariance and the correlation, respectively.
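A minimal sketch, assuming the samples from Task 5 are stored in the matrices Y.obs.indep and Y.obs.dep:

quantile(Y.obs.indep[,1], probs = c(0.25, 0.75))  # interquartile range of the first marginal
quantile(Y.obs.indep[,1], probs = c(0.05, 0.95))  # 90%-interquantile range

par(mfrow = c(1, 2))                              # histogram and density of one marginal
hist(Y.obs.dep[,1], breaks = 50)
plot(density(Y.obs.dep[,1]))

par(mfrow = c(1, 2))                              # compare the scatterplots
plot(Y.obs.indep)
plot(Y.obs.dep)

cov(Y.obs.dep)                                    # covariance matrix
cor(Y.obs.dep)                                    # correlation matrix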

Julia

You can use the function quantile from the Statistics.jl module. It also provides functions to compute the covariance and correlation matrices.

The package Plots.jl provides basic plotting functions such as scatter and histogram. For density plots and other more specialized plots look at StatsPlots.jl.

Python

You can arrange multiple plots in the same window with fig, axs = plt.subplots(nrows, ncols) and then plot on the individual axes axs[i]. Use np.percentile() to calculate the percentiles; use seaborn and matplotlib to visualize the data; use np.cov() and np.corrcoef() for the covariance and the correlation, respectively.

7. Working with dataframes ★

Real data often contain columns of different data types (e.g. numbers and strings). Dataframes are designed to work with this kind of data conveniently.

  1. Import the file ./data/model_growth.csv as a dataframe. Perform some analyses similar to the ones in Task 6.

  2. Add a new column (say C_new) to the growth data frame that contains some random values.

Hints

R

Read the data using read.table("</path/to/somefile.txt>", header=TRUE) to indicate that the first row contains the column names (use the file ../data/model_growth.csv). To select the column C_M from a dataframe, say data, you can use data$C_M or data[,"C_M"].

To add columns, you can use cbind to column-bind the new data to the existing data and convert everything to a data.frame. Then, you can rename the columns by assigning the desired names with the function colnames() applied to the newly created data.frame.
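A possible sketch (C_new is just an example name; here the new column is added directly with $, which is a simpler alternative to the cbind approach described above; add sep="," to read.table if the file is comma-separated):

data <- read.table("../data/model_growth.csv", header = TRUE)
str(data)                        # overview of the columns and their types
summary(data)                    # basic summary statistics

data$C_new <- rnorm(nrow(data))  # add a new column with random values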

Julia

Use the package DataFrames.jl to read and manipulate data. To import csv (comma separated values) files, we also use CSV.jl. You can use pwd() to find your current working directory.

using DataFrames
import CSV

path = joinpath("data", "myfile.csv")
data = DataFrame(CSV.File(path; delim=";"))

# select column "x"
data.x

Python

Use the pandas library to work with data frames. Read the data using pd.read_csv(file_path, sep=" "). To select a specific column, you can simply type dataframe['columnname'].

8. Error propagation and functions

If \(f()\) is non-linear, generally, \[ f(E[X]) \neq E[f(X)] \] where \(E\) is the expected value and \(X\) is a random variable. Implement the non-linear function \(f(x) = \sin\sqrt{x}\) and assume that \(X\) is log-normally distributed. Use samples to estimate \(f(E[X])\), \(E[f(X)]\), \(\mathrm{Var}[X]\), \(\mathrm{Var}[f(X)]\) and compare them.

Hints

R

Use rlnorm() to sample from a log-normal distribution; note that it takes the mean and the standard deviation on the log scale, not on the original scale. Additionally, keep in mind that in R a general function can be defined as:

function.name <- function(arg1,arg2){
  result <- arg1 + arg2 # or any other operation
  return(result)
}

Most basic functions are already available, and these include both sin and sqrt. Try ?sin in the R console to get access to the manual of the trigonometric functions. These can be used anywhere in the code, including inside a custom function.
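A minimal sketch, assuming (arbitrarily) meanlog = 0 and sdlog = 1 for the log-normal distribution:

f <- function(x) sin(sqrt(x))               # the non-linear function

x <- rlnorm(100000, meanlog = 0, sdlog = 1) # samples of X

f(mean(x))   # f(E[X])
mean(f(x))   # E[f(X)]
var(x)       # Var[X]
var(f(x))    # Var[f(X)]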

Julia

Use LogNormal() from the package Distributions.jl for a log-normal distribution. You can create your own Julia functions following the syntax

function choose_your_name(arg_1, arg_2)
    res = arg_1 + arg_2
    return res
end

Very short functions can be defined as

foo(a, b) = a + b

Python

In Python, a function can be defined as follows:

def fun_square(number):
    """
    This part is optional and is preferably used to describe to users what
    the function does.
    This function returns the square of the given number.
    
    Arguments:
    ----------
    - number: number to be processed.
    
    Returns:
    -------
    - result: the squared given number.
    """
    
    result = number * number
    
    return result

# Example usage
num = 4
result = fun_square(num)
print(result) # The output will be 16