Summary of "Lab 3: Simulation and Bootstrap"
Overview
This lecture (Lab 3) covers two computational methods for studying sampling distributions when analytic formulas are difficult or unavailable: simulation (Monte Carlo) and the bootstrap. It contrasts these computational approaches with classical analytic results, explains when each applies, demonstrates implementation in R, and lists practical conditions that determine how well they work.
Key concepts and lessons
- A statistic (e.g., sample mean x̄, sample variance S², median) computed from a random sample is itself a random variable with a sampling distribution.
- Knowing a statistic’s sampling distribution is necessary for answering probability and decision questions (for example, computing P(2100 ≤ x̄ ≤ 2300)).
Classical analytic results
- Central Limit Theorem (CLT): x̄ ≈ Normal(μ, σ²/n) if the population is normal or n is large (commonly n ≥ 30).
- Sample variance: (n − 1)S² / σ² ~ χ²(n − 1) only when the population is normal.
- Analytic derivations can be complex or impossible for some statistics (median, complicated functions, non-standard populations).
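The CLT claim above can be checked numerically. A minimal R sketch, assuming an Exponential(rate = 1) population (so μ = 1 and σ = 1; these illustrative parameters are not from the lecture):

```r
# Monte Carlo check of the CLT: means of exponential samples should be
# approximately Normal(mu, sigma^2 / n) when n is reasonably large.
set.seed(42)
n <- 50               # sample size per simulated sample
B <- 5000             # number of simulated samples
mu <- 1; sigma <- 1   # Exponential(rate = 1) has mean 1 and sd 1

xbar <- replicate(B, mean(rexp(n, rate = 1)))

# Empirical moments should be close to the CLT prediction (mu, sigma/sqrt(n))
c(mean = mean(xbar), sd = sd(xbar))
hist(xbar, freq = FALSE, main = "Sampling distribution of the mean")
curve(dnorm(x, mean = mu, sd = sigma / sqrt(n)), add = TRUE, col = "red")
```

Increasing B tightens the agreement between the histogram and the overlaid normal curve, which is exactly the B-dependence discussed below.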
Simulation (Monte Carlo) approach
When to use
- Use simulation when the population distribution is known (its PDF/CDF or parameters are available).
Basic idea
- Draw many independent samples from the known population, compute the statistic for each sample, and use the empirical distribution of those computed statistics as the sampling distribution.
Advantages
- Conceptually simple, straightforward to implement on a computer, and able to reproduce analytic results when conditions are met.
Main factors affecting accuracy
- Number of repetitions B: the most important factor — larger B gives closer agreement with the true sampling distribution (examples: 2,500; 5,000; 10,000).
- Sample size n: affects how well CLT or other approximations hold; larger n typically reduces discrepancy.
- Population characteristics: normal vs. non-normal populations change expected analytic behavior.
Use-case example
- Flashlight factory inspection: draw many samples, compute x̄ for each, and estimate the probability that x̄ falls in an acceptable range.
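The flashlight inspection can be sketched in a few lines of R. The population parameters and sample size below (Normal(2200, 300²), n = 25) are illustrative assumptions, not values from the lecture:

```r
# Monte Carlo estimate of P(2100 <= xbar <= 2300) for a hypothetical
# population of luminous intensities, assumed Normal(2200, 300^2).
set.seed(1)
mu <- 2200; sigma <- 300   # assumed population parameters (illustrative)
n <- 25                    # assumed inspection sample size (illustrative)
B <- 10000                 # number of simulated inspections

xbar <- replicate(B, mean(rnorm(n, mean = mu, sd = sigma)))
p_hat <- mean(xbar >= 2100 & xbar <= 2300)
p_hat
# Benchmark against the exact CLT answer:
pnorm(2300, mu, sigma / sqrt(n)) - pnorm(2100, mu, sigma / sqrt(n))
```

Because the population here is normal, the analytic answer is available and serves as the benchmark the lecture recommends; with a non-standard population the simulation estimate would be the only option.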
Bootstrap (nonparametric resampling) approach
When to use
- Use the bootstrap when the population distribution is unknown and you have only a single observed sample.
Basic idea
- From the observed sample of size n, repeatedly draw bootstrap samples of size n with replacement. Compute the statistic of interest on each bootstrap sample to build the bootstrap distribution of that statistic.
Key implementation detail
- Sample with replacement (replace = TRUE) so that bootstrap samples can include repeated observations from the original sample.
Conditions and limitations
- Bootstrap accuracy depends on how well the original sample represents the population (e.g., whether x̄ ≈ μ and S² ≈ σ²).
- Larger original sample sizes better capture population features and improve bootstrap performance.
- Number of bootstrap replicates B should be large (commonly hundreds to thousands; examples: 2,500–5,000).
- For some statistics and population shapes, bootstrap can perform poorly; theoretical justification exists but can be advanced.
Alternatives
- Jackknife and other resampling techniques (briefly mentioned).
How to judge whether a sample is “good” for bootstrap
- Compare sample moments (x̄, S²) to known or expected population values if available.
- Use visual and diagnostic checks.
- If the sample is biased, too small, or lacks variability, bootstrap results may be unreliable.
Detailed step-by-step methodologies
Simulation (when the population is known)
- Define the population distribution and parameters (e.g., Normal(μ, σ²), Exponential(λ)).
- Choose sample size n for each simulated sample.
- Choose the number of repetitions B (recommended large: 1,000–10,000+ depending on desired precision).
- Repeat B times:
- Draw a sample of size n from the known population.
- Compute the statistic of interest for that sample.
- Store the computed statistic.
- Use the B stored statistics to:
- Plot a histogram or empirical CDF.
- Overlay theoretical PDF/CDF if available.
- Estimate probabilities, confidence intervals, bias, standard error, etc.
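The steps above, applied to the sample variance, can be sketched as follows; this checks the classical result that (n − 1)S²/σ² ~ χ²(n − 1) under a normal population (the parameter values are illustrative):

```r
# Simulation recipe for the sample variance: under a normal population,
# (n - 1) * S^2 / sigma^2 should follow a chi-square(n - 1) distribution.
set.seed(7)
mu <- 0; sigma <- 2   # known population parameters (illustrative)
n <- 10               # sample size per simulated sample
B <- 5000             # number of repetitions

stat <- replicate(B, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (n - 1) * var(x) / sigma^2    # pivotal quantity
})

# chi-square(n - 1) has mean n - 1 and variance 2(n - 1)
c(mean = mean(stat), var = var(stat))
hist(stat, freq = FALSE, breaks = 40)
curve(dchisq(x, df = n - 1), add = TRUE, col = "red")
```

Rerunning with larger B (e.g., 10,000) and varying n shows the convergence behavior the lecture demonstrates.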
Bootstrap (nonparametric, when population unknown)
- Start with the observed sample of size n.
- Decide on the statistic(s) of interest (e.g., mean, median, standard deviation).
- Choose number of bootstrap replicates B (commonly 1,000–10,000).
- Repeat B times:
- Draw a bootstrap sample of size n by sampling with replacement from the observed sample.
- Compute and store the statistic on this bootstrap sample.
- Use the empirical bootstrap distribution to:
- Plot histogram or empirical CDF.
- Estimate standard error (bootstrap SE = sample SD of bootstrap statistics).
- Compute bootstrap confidence intervals (percentile, basic, or other methods).
- Perform diagnostics: check sensitivity to B and whether the original sample is representative.
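The bootstrap recipe above, sketched end to end in R. The observed sample here is synthetic (drawn once, purely for illustration); in practice it would be your real data vector:

```r
# Nonparametric bootstrap of the median: resample WITH replacement from
# the observed sample, recompute the statistic B times.
set.seed(3)
obs <- rexp(40, rate = 0.5)   # stand-in "observed" sample of size n = 40
n <- length(obs)
B <- 5000                     # number of bootstrap replicates

boot_stats <- replicate(B, median(sample(obs, size = n, replace = TRUE)))

boot_se <- sd(boot_stats)                         # bootstrap standard error
boot_ci <- quantile(boot_stats, c(0.025, 0.975))  # percentile 95% CI
list(estimate = median(obs), se = boot_se, ci = boot_ci)
```

Swapping `median` for `mean`, `sd`, or a custom function changes the target statistic without touching the resampling logic; rerunning with a different B is the sensitivity check recommended in the diagnostics step.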
R-specific workflow (conceptual)
- Prepare R environment:
- Install and load needed packages (e.g., readxl).
- Read data from Excel:
- data <- read_excel("path/to/file.xlsx")
- Clean data:
- Remove rows with NA: data_clean <- na.omit(data)
- Check sample size and variable names.
- Extract the column of interest:
- sample_vector <- data_clean$body_mass_g
- Simulation or bootstrap:
- For simulation from a known distribution: use rnorm(), rexp(), etc.
- For bootstrap: use sample(sample_vector, size = n, replace = TRUE) inside a loop or replicate().
- Example structure:
- boot_stats <- replicate(B, statistic_function(sample(sample_vector, n, replace = TRUE)))
- statistic_function can be mean, median, sd, or a custom function.
- Visualization and comparison:
- Plot histogram of simulated/bootstrap statistics and overlay theoretical curves if available.
- Increase B and/or n to assess convergence and sensitivity.
- Interpret results: compare bootstrap/simulation distributions to theoretical expectations and adjust B or data cleaning as necessary.
Practical recommendations and observations
- Use simulation when you know the generating distribution; use bootstrap when you only have observed data and the population is unknown.
- Do not confuse simulation (draw from the theoretical population) with bootstrap (resample from the observed sample); bootstrap requires sampling with replacement.
- The three main factors affecting accuracy:
- Quality/representativeness of the original sample (for bootstrap).
- Original sample size n (affects CLT and variability).
- Number of repetitions B (affects Monte Carlo error); choose large enough B (often thousands).
- Overlay empirical distributions (histogram or CDF) with analytic curves to visually judge agreement.
- If an analytic distribution is known (e.g., CLT or χ² under normality), use it as a benchmark.
- Tools mentioned: R (primary), readxl package, Excel (data source); MATLAB and Java were noted as alternate environments.
Examples from the lecture
- Flashlight factory inspection: use sampling to decide whether production meets luminous intensity specifications via the sampling distribution of x̄.
- Simulated normal examples: show sampling distribution of x̄ and compare with Normal(μ, σ²/n) using B = 2,500, 5,000, 10,000.
- Sample variance example: demonstrate (n − 1)S² / σ² ~ χ²(n − 1) when the population is normal; use simulation to confirm and compare with theory for varying B and n.
- Real data example: penguin dataset (body mass) read from Excel into R; clean NAs; apply bootstrap to estimate sampling distribution of standard deviation, median, or mean.
Limitations and cautions
- Bootstrap depends on the observed sample capturing essential population features; if the sample is biased or too small, bootstrap results can be misleading.
- Some statistics or distributional situations are mathematically delicate; analytic theory and bootstrap proofs can be advanced and situation-dependent.
- Always check sensitivity to B and sample size; visualize results and compare with theoretical expectations where possible.
Speakers / sources and tools referenced
- Single instructor presented the material (unnamed).
- Software/tools referenced: R (primary), readxl package, Excel; MATLAB and Java mentioned as alternatives.
- Methods referenced: Simulation (Monte Carlo), Bootstrap (nonparametric), Jackknife (briefly), and classical analytic results (CLT, χ² for variance).