False discovery rate: $E[V / \max(R, 1)]$, the expected fraction of false positives among all discoveries.

Family-wise error rate: Bonferroni correction

Family-wise error rate: $P(V > 0)$, the probability of one or more false positives. For large $m_0$, this is difficult to keep small.

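Bonferroni adjusted p-values:

· Suppose we conduct $m$ hypothesis tests, one for each $g = 1, \ldots, m$, each producing a p-value $p_g$. The Bonferroni adjusted p-values are

$$\tilde{p}_g = \min\{m \, p_g, \ 1\}$$

Selecting all tests with $\tilde{p}_g \le \alpha$ controls the FWER at level $\alpha$, i.e., $P(V > 0) \le \alpha$.

Proof: Boole's inequality implies:

$$\mathrm{FWER} = P\left\{ \bigcup_{i=1}^{m_0} \left( p_i \le \frac{\alpha}{m} \right) \right\} \le \sum_{i=1}^{m_0} P\left( p_i \le \frac{\alpha}{m} \right) = m_0 \frac{\alpha}{m} \le m \frac{\alpha}{m} = \alpha.$$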

This control does not require any assumptions about dependence among the p-values or about how

many of the null hypotheses are true.

In R: p.adjust(p_values, method = "bonferroni")
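
As a minimal sketch of the adjustment described above, the following computes the Bonferroni adjusted p-values by hand and compares them with p.adjust. The p-values and the level alpha = 0.05 are made up for illustration and are not taken from any data set in these notes.

```r
# Hypothetical p-values for m = 5 tests (invented for illustration).
p_values <- c(0.001, 0.012, 0.03, 0.2, 0.8)
m <- length(p_values)

# Manual Bonferroni adjustment: min(m * p_g, 1) for each test g.
p_manual <- pmin(m * p_values, 1)

# Built-in equivalent from the stats package.
p_bonf <- p.adjust(p_values, method = "bonferroni")

# Indices of the tests selected at FWER level alpha (assumed value).
alpha <- 0.05
selected <- which(p_bonf <= alpha)
```

Here only the first test is selected, because its adjusted p-value (5 × 0.001 = 0.005) is the only one at or below alpha.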

Layers:

made up of geometric objects that represent data (geom_)

points, lines, boxplots, …

Aesthetics:

describe visual characteristics that represent data (aes)

· position (x,y), color, size, shape, transparency
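
The split between layers and aesthetics can be illustrated with a short sketch, assuming the ggplot2 package is installed. It uses the built-in mtcars data set, which is chosen here only as a convenient example and does not come from these notes.

```r
library(ggplot2)

# Aesthetics (aes) map variables to visual characteristics:
# position (x, y) and color.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  # First layer: points; size and transparency are set to fixed values.
  geom_point(size = 2, alpha = 0.7) +
  # Second layer: one fitted line per color group, drawn on top.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Printing p draws the plot. Each geom_ call adds one layer, and both layers inherit the aesthetics given in ggplot().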

Tidy data conclusions

For data to be considered tidy, it has to pass certain criteria:

Each column must represent one and only one variable, indicated in its name.

Column names mustn't be values (e.g. 1999, $10-20k).

Data must be spread over the right number of tables – avoid lists.

Separating and uniting (1 <-> more variables)

1 variable -> multiple variables

– tidyr::separate()

multiple variables -> 1 variable

– tidyr::unite()

other useful functions:

– data.table::tstrsplit, strsplit, paste, substr
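
A minimal base R sketch of both directions, using strsplit and paste from the list above. The data frame and its column names (sample, organism, temperature) are hypothetical and invented for illustration.

```r
# Hypothetical column that stores two variables in one string.
df <- data.frame(sample = c("yeast_30C", "yeast_37C", "ecoli_37C"),
                 stringsAsFactors = FALSE)

# 1 variable -> multiple variables: split on the underscore.
parts <- strsplit(df$sample, "_", fixed = TRUE)
df$organism <- vapply(parts, `[`, character(1), 1)
df$temperature <- vapply(parts, `[`, character(1), 2)

# multiple variables -> 1 variable: paste the pieces back together.
df$sample_rebuilt <- paste(df$organism, df$temperature, sep = "_")
```

tidyr::separate() and tidyr::unite() do the same two steps on a data frame column in one call each.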

Transform data from long to wide?

data.table::dcast()

tidyr::spread()

Transform data from wide to long?

data.table::melt()

tidyr::gather()
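
The two transformations can be sketched with data.table as follows, assuming the package is installed. The table, the column names (country, year, cases), and the values are hypothetical.

```r
library(data.table)

# Hypothetical wide table: the year values 1999 and 2000 are used as column
# names, which is not tidy.
wide <- data.table(country = c("A", "B"),
                   `1999` = c(10, 20),
                   `2000` = c(15, 25))

# Wide -> long: one row per (country, year) pair.
long <- melt(wide, id.vars = "country",
             variable.name = "year", value.name = "cases")

# Long -> wide: spread the year values back into columns.
wide_again <- dcast(long, country ~ year, value.var = "cases")
```

In the long table, year and cases are proper variables, so each column represents exactly one variable.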

Perform the conditional analysis by comparing a full model that depends on the genotype (marker 5091) with a reduced model that does not depend on the genotype.

full <- lm(growth_rate ~ genotype + condition_on_mk_geno, data=dt)
reduced <- lm(growth_rate ~ condition_on_mk_geno, data=dt)
anova(reduced, full)
## Analysis of Variance Table
##
## Model 1: growth_rate ~ condition_on_mk_geno
## Model 2: growth_rate ~ genotype + condition_on_mk_geno
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    152 208.28
## 2    151 207.23  1    1.0414 0.7588 0.3851

## Note: under the null hypothesis, the probability of an F statistic at least
## as extreme as the one observed here is rather high (0.3851 > 0.05), so we
## do not reject the null hypothesis.
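
To make the F statistic in the table concrete, here is a hedged sketch that recomputes it by hand. The data are simulated, not the data set used above. Only the variable names are borrowed from the text, and the simulation assumes growth_rate depends on condition_on_mk_geno but not on genotype.

```r
set.seed(1)
n <- 100

# Simulated stand-in data (hypothetical values; names mirror the text).
dt <- data.frame(
  genotype = factor(sample(c("A", "B"), n, replace = TRUE)),
  condition_on_mk_geno = factor(sample(c("A", "B"), n, replace = TRUE))
)
dt$growth_rate <- 1 + 0.5 * (dt$condition_on_mk_geno == "B") + rnorm(n)

full <- lm(growth_rate ~ genotype + condition_on_mk_geno, data = dt)
reduced <- lm(growth_rate ~ condition_on_mk_geno, data = dt)
cmp <- anova(reduced, full)

# F = ((RSS_reduced - RSS_full) / df_diff) / (RSS_full / df_full)
rss_reduced <- sum(residuals(reduced)^2)
rss_full <- sum(residuals(full)^2)
df_diff <- df.residual(reduced) - df.residual(full)
f_stat <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df.residual(full))

# Upper tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_diff, df.residual(full), lower.tail = FALSE)
```

Applying the same formula to the table above gives (1.0414 / 1) / (207.23 / 151) ≈ 0.7588, which matches the reported F value.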

Grammar of Graphics

The Grammar of Graphics is a visualization theory developed by Leland Wilkinson in 1999.

Separation of data from aesthetics (e.g. x and y axis, color-coding)

Definition of common plot/chart elements (e.g. dot plots, boxplots, etc.)

Composition of these common elements (one can combine elements as layers)

m <- lm(y ~ x1 * x2)

lm(response ~ terms):

where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
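
As a small illustration of how terms are specified, the sketch below fits the model from above on simulated data. The data and the coefficients used to generate them are made up. In R's formula syntax, x1 * x2 expands to both main effects plus their interaction, x1 + x2 + x1:x2.

```r
set.seed(42)
n <- 50

# Simulated predictors and response (hypothetical values for illustration).
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(n, sd = 0.1)

# Formula with the * operator ...
m <- lm(y ~ x1 * x2)

# ... and the same model with the expansion written out.
m_explicit <- lm(y ~ x1 + x2 + x1:x2)

coef_names <- names(coef(m))
```

Both fits have the same four coefficients: the intercept, x1, x2, and the interaction x1:x2.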

