Note: This script is intended only as an aid for
those who need a refresher in R. We will not talk about it explicitly
during QHELP.
If you have completed the exercises and have questions about them or any
other questions about R, I would be happy to answer your questions
during QHELP. Just talk to me :)
If you notice any errors, please send me an email at: julian.mollenhauer@uni-tuebingen.de
Simple mathematical operations may be entered directly at the command line.
2 + 2 # plus
3 - 7 # minus
2 * 2 # times
5 / 4 # divide
log(3) # natural logarithm
exp(1) # exponential function
Brackets can be used in longer expressions.
sqrt(2) * ( (5 - 1 / 6)^2 - pi / 2^(1/3) ) # pi = 3.14
(R ignores any input after the # symbol. Comments are therefore written preceded by #.)
Results of operations and function calls may be stored in a variable.
The assignment operator is <-
(less than + minus).
In order to print the value of a variable, the variable name needs to be entered.
x <- sqrt(2)
y <- x^2
x
## [1] 1.414214
y
## [1] 2
Valid variable names may consist of letters,
numbers, underscores, and dots (e. g., errors.item6
), but
they may only start with letters or dots. R is case sensitive (both for
functions and variable names).
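A small illustration of these rules (the variable names here are made up):

```r
errors.item6 <- 3   # valid: letters, numbers, dots
.tmp <- 5           # valid: starts with a dot
# 6errors <- 3      # invalid: must not start with a number
weight <- 60
Weight <- 70        # a different variable: R is case sensitive
weight == Weight
## [1] FALSE
```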
Vectors are created using the function c()
(combine).
weight <- c(60, 72, 57, 90, 95, 72)
weight
## [1] 60 72 57 90 95 72
Mathematical operations may be applied to each element.
(weight - mean(weight))^2 # squared deviation
## [1] 205.444444 5.444444 300.444444 245.444444
## [5] 427.111111 5.444444
Many R functions accept vectors as arguments.
log(weight)
## [1] 4.094345 4.276666 4.043051 4.499810 4.553877
## [6] 4.276666
R offers a number of offline sources for getting help. The text-based documentation is called by ?. For example, the documentation for the rnorm() function is accessed by
?rnorm # or help(rnorm)
Words with syntactic meaning and special characters need to be enclosed in quotes
?"if"
At the bottom of the help page, there usually are examples that can be copied and pasted into the R console.
Typing apropos("norm")
displays functions in the
currently loaded packages having the expression ‘norm’ in their
name.
The help system of all installed packages is searched for the expression ‘norm’ via
help.search("norm") # or ??"norm"
R ships with several manuals. An HTML based (local) help system is available via
help.start()
The manuals may also be downloaded in pdf format from https://cran.r-project.org/manuals.html. Highly readable are:
Further sources of information
These are available via help.start()
, too.
A wealth of information may be found on the R website http://www.r-project.org.
Under R Project/Search: RSiteSearch("topic") from inside R
Under Documentation: R code examples for common data analysis tasks
At the Stack Exchange network, there are popular Q&A websites for R https://stackoverflow.com/tags/r and for applied statistics with R https://stats.stackexchange.com/tags/r
All objects that are created during a session are kept in the working memory. Objects that are not needed any more should be deleted.
ls() # list objects in working memory
rm(A) # remove object A
rm(list = ls()) # remove all objects
When quitting a session (q()), the working memory can be written to a file, so the objects are available upon the next start of R.
Note that this is usually unnecessary. Rather write the R code that created the objects into a file (using a text editor). In this way, you will keep your entire analysis documented.
When R is started, its working directory is set to a default location.
getwd() # get working directory
dir() # list files in working directory
It is recommended to change the working directory to the current project directory. This keeps file paths short, and prevents you from accidentally overwriting data.
setwd("D:/projects/experiment3/analysis") # set working directory
Note that this command is likely to cause an error on any computer but yours, because the directory is specific to you. Delete or comment out this command before sharing an R script.
R comes with a standard library of packages.
A large part of the functionality of R is available in additional add-on packages.
library() # which packages are installed?
library(foreign) # load package foreign
search() # which packages are loaded?
detach(package:foreign) # 'unload' package foreign
In order to install additional packages, the computer must be connected to the Internet. Packages are installed via
install.packages("packagename")
The R command line allows for <Tab> completion and is well suited for short/simple code, ‘trial & error’, and data input.
Use your editor (e.g., RStudio editor / Vim etc.) for more complex code: loops (for, while), if/else, replicate, etc.
It is good practice to regularly test whether an R script can run in
its entirety, especially before sharing it. To do so, clear your working
memory rm(list = ls())
(or restart R) and execute the
entire R code in a text file:
source("filename.R") # source file located in R's working directory
Hints on ‘good’ R style are given in the following documents:
These documents are opinionated, but readable. There is no official style guide by the R Core Team, although there are strong opinions about what constitutes good R style. Consistency is usually considered an important criterion for good style. Style errors in your submitted homeworks will be pointed out.
“RStudio is an Integrated Development Environment [IDE; ‘integrierte
Entwicklungsumgebung’] It includes a console, syntax-highlighting editor
that supports direct code execution, as well as tools for plotting,
history, debugging and workspace management.”
- RStudio.com
Open Source Edition:
Some useful shortcuts
Save active document: Ctrl + S
or
Command + S
Run current line/selection: Ctrl + Enter
or
Command + Enter
Source the current document: Ctrl + Shift + S
or
Shift + Command + S
Comment/uncomment current line/selection:
Ctrl + Shift + C
or
Shift + Command + C
Reindent lines: Ctrl + I
or
Command + I
To access all shortcuts, type
Alt + Shift + K
on Linux and Windows or
Option + Shift + K
on a Mac.
Click the checkbox below for exercises
Create two vectors \(\mathbf{x} = (5~7~8~1~2~4~5)'\) and \(\mathbf{y} = (2~8~1~3~2~4~1)'\)
For both vectors, calculate the mean, variance, standard deviation and median. When calculating, first try to use the formulas given below. Calculate these parameters again using the implemented functions in R and compare whether the results are the same.
Generate a scatter plot of \(\mathbf{x}\) and \(\mathbf{y}\).
Use the R help:
With ?...
you can call the documentation of certain
functions and with ??...
you can search for keywords in the
documentation. The latter is helpful if you do not yet know the exact
name of the function.
When searching for keywords, it can happen that a lot of results are
displayed. Since in this task only functions from the packages
base
, stats
or graphics
are used,
look for the prefix base::
, stats::
or
graphics::
. For this task, the following R help calls may
be useful:
- ?c
- ?sum
- ?length
- ??"arithmetic mean"
- ??"standard deviation"
- ??median
- ??variance
- ?plot
Use ls()
to check whether the results of all calculations
are assigned to one variable each and are available in memory.
Many things in R are done using function calls. Functions consist of the function name followed by brackets enclosing the arguments.
Example:
weight <- c(60, 72, 57, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
plot(height, weight)
Arguments may be set to default values; what they are is documented
in ?plot.default
(which plot()
calls if no
specific plot method exists for the first argument).
Arguments are passed by position or by name:
plot(height, weight, pch = 16, col = "blue")
Even if no arguments are passed, the brackets need to be written, e.g., ls(), dir(), getwd().
Entering only the name of a function without brackets will display the R code of that function.
Vectors are of a defined mode and each element in that vector has that same mode, e.g.:
mode(c(60, 72, 57, 90, 95, 72))
## [1] "numeric"
Other than numeric vectors there are
character vectors with their elements in quotation marks
logical vectors with values TRUE
and
FALSE
c("subj01", "subj02", "subj03") # character vector
c(TRUE, TRUE, FALSE, TRUE) # logical vector
Logical vectors often result from a comparison
weight > 60
## [1] FALSE TRUE FALSE TRUE TRUE TRUE
Vectors are created by
concatenating elements using c()
sequences
0:10 # 11 numbers from 0 to 10
seq(1.65, 1.90, .05) # sequence in steps of 0.05 from 1.65 to 1.90
repeating elements
rep(1:5, 3) # repeat vector 1:5 3 times
rep(c("high", "low"), each = 3) # repeat each element 3 times
rep(c("f", "m"), c(5, 2)) # repeat "f" 5 times and "m" 2 times
paste("stim", 1:5, sep = "") # append 1:5 to "stim"
Generating a 3 × 4 matrix
A <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
Labeling and transposing the matrix
rownames(A) <- c("a1", "a2", "a3") # respectively: colnames()
t(A) # transpose
## a1 a2 a3
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
R implements linear algebra and matrix operations, such as matrix
multiplication (using %*%
) and inversion (using
solve()
).
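A minimal sketch of both operations, using a small invertible matrix (M here is made up for illustration):

```r
M <- matrix(c(2, 1, 1, 3), nrow = 2) # 2 x 2 matrix
M %*% M            # matrix product (not the element-wise product M * M)
M_inv <- solve(M)  # matrix inverse
round(M %*% M_inv) # identity matrix (up to rounding error)
solve(M, c(1, 2))  # solve the linear system M x = (1 2)'
```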
A matrix may be created by concatenating columns or rows.
cbind(a1 = 1:4, a2 = 5:8, a3 = 9:12) # "column bind"
rbind(a1 = 1:4, a2 = 5:8, a3 = 9:12) # "row bind"
Arrays are data structures having more than two dimensions.
array(c(A, 2 * A), c(3, 4, 2)) # 3 x 4 x 2 array
Factors are data structures for categorical variables, such as diagnosis, socio-economic status, or sex.
ses <- factor(c("low", "inter", "high", "low"))
ses
## [1] low inter high low
## Levels: high inter low
The ordering of factor levels is changed via the levels argument.
ses2 <- factor(ses, levels = c("low", "inter", "high"))
ses2
## [1] low inter high low
## Levels: low inter high
The labels argument is used to change the labels of factor levels.
ses3 <- factor(ses2, labels = c("normal", "normal", "high"))
ses3
## [1] normal normal high normal
## Levels: normal high
Experimental factors may easily be generated by combining
factor()
and rep()
, and providing the desired
labels.
factor(rep(1:2, each = 10), labels = c("on", "off"))
## [1] on on on on on on on on on on off off
## [13] off off off off off off off off
## Levels: on off
Click the checkbox below for exercises
NA and mean()
Use the function mean() to find the mean value of the vector \(\mathbf{a} = (1~3~7~1~2~6~7~8~9~\mathrm{NA})'\). Why is the result NA (see ?mean)? What does the trim argument do?
Using the function mean(), understand what is meant in the script by the terms ‘positional matching’ and ‘keyword matching’: First use all arguments without explicitly specifying their names (positional matching). Then change the order of the arguments (keyword matching).
Create a vector \(\mathbf{x} =
(1~5~8~3~7~2~6)'\).
Create a second vector \(\mathbf{y}\)
of the same length containing the odd numbers \(1, 3, 5, \ldots\). Use the function
seq()
for this.
Calculate a vector \(\mathbf{z}\) as follows: \(\mathbf{z} = 4\mathbf{x} + 2\mathbf{y}\) (\(\mathbf{z}\) is a linear combination of \(\mathbf{x}\) and \(\mathbf{y}\)).
Create as efficiently as possible a matrix \(\mathbf{B}\) consisting of the column
vectors \(\mathbf{x}\), \(\mathbf{y}\), and \(\mathbf{z}\).
Then create a matrix \(\mathbf{C}\) in
which the three vectors form the row vectors in two ways:
First by combining the vectors \(\mathbf{x}\), \(\mathbf{y}\), and \(\mathbf{z}\) into a new matrix, and then by
transposing the already created matrix \(\mathbf{B}\).
Read out the vector \(\mathbf{y}\)
from \(\mathbf{B}\) and \(\mathbf{C}\) again by indexing (see
?Extract
).
Use the function seq()
to create the following sequences
of numbers:
Each sequence should contain ten elements. Try to create the vectors
also with the colon operator :
.
Create the vector \((a~a~b~c~d~e~e)'\) with the
rep()
function.
Why does the input rep(letters[1:5], 2)
not give the
desired result?
Create the following vectors (without ‘writing them out’):
What is the length of each vector? Answer this question by calling a function.
Generate the matrix \[ \mathbf{A} = \begin{pmatrix}11 & 12 & 13 & 14\\ 15 & 16 & 17 & 18\end{pmatrix}. \] Form \(\mathbf{A}'\) (the transposed matrix of \(\mathbf{A}\)) once by using a function and once by indexing.
Create a factor G with G <- factor(rep(c(0, 2, 5), 10)). Execute the following R commands:
mean(G)
mean(as.numeric(G))
mean(as.character(G))
mean(as.numeric(as.character(G)))
Compare the results and explain the differences.
Use the functions table()
or xtabs()
to
check whether each level of the factor is present equally often.
It is often necessary to store an ordered collection of R objects with more than one mode in a single object: a list.
list1 <- list(w = weight, h = height, s = ses2, A = A)
The components of a list are extracted using $ and the name of the component, or using [[]] and either the name or the number (index) of the component.
list1$h # or list1[["h"]] or list1[[2]]
## [1] 1.75 1.80 1.65 1.90 1.74 1.91
list1$A # or list1[["A"]] or list1[[4]]
## [,1] [,2] [,3] [,4]
## a1 1 2 3 4
## a2 5 6 7 8
## a3 9 10 11 12
Data frames are lists that consist of vectors and factors of equal length. The rows in a data frame refer to one unit (observation or subject).
id <- factor(paste("s", 1:6, sep = ""))
dat <- data.frame(id, weight, height)
dat
## id weight height
## 1 s1 60 1.75
## 2 s2 72 1.80
## 3 s3 57 1.65
## 4 s4 90 1.90
## 5 s5 95 1.74
## 6 s6 72 1.91
Variables in a data frame are extracted by $ or [] (see also: Indexing data frames).
dat$id
## [1] s1 s2 s3 s4 s5 s6
## Levels: s1 s2 s3 s4 s5 s6
dat[, 1]
## [1] s1 s2 s3 s4 s5 s6
## Levels: s1 s2 s3 s4 s5 s6
dat[, "id"]
## [1] s1 s2 s3 s4 s5 s6
## Levels: s1 s2 s3 s4 s5 s6
Frequently used functions (not only) for data frames
dim(dat) # show number of rows and columns
## [1] 6 3
names(dat) # variable names
## [1] "id" "weight" "height"
str(dat) # show variables of dat
## 'data.frame': 6 obs. of 3 variables:
## $ id : Factor w/ 6 levels "s1","s2","s3",..: 1 2 3 4 5 6
## $ weight: num 60 72 57 90 95 72
## $ height: num 1.75 1.8 1.65 1.9 1.74 1.91
summary(dat) # descriptive statistics
## id weight height
## s1:1 Min. :57.00 Min. :1.650
## s2:1 1st Qu.:63.00 1st Qu.:1.742
## s3:1 Median :72.00 Median :1.775
## s4:1 Mean :74.33 Mean :1.792
## s5:1 3rd Qu.:85.50 3rd Qu.:1.875
## s6:1 Max. :95.00 Max. :1.910
plot(dat) # pairwise plots
Elements of a variable can be accessed using []
(see
?Extract
).
weight <- c(60, 72, 57, 90, 95, 72)
weight[4] # 4th element
weight[4] <- 92 # change 4th element
weight[c(1, 2, 6)] # elements 1, 2, 6
weight[1:5] # elements 1 to 5
weight[-3] # without element 3
Indices may be logical.
weight[weight > 60] # values greater than 60
weight[weight > 60 & weight < 80] # between 60 and 80
Logical expressions may contain &, |, ==, !=, <, <=, >, >=, %in%.
For more information on logical expressions, see ?Comparison (or ?">"), ?Logic (or ?"&"), and ?match (or ?"%in%").
Data frames have a row and a column index. Indices can be numeric or character vectors. Omitting one index selects all rows or columns, respectively.
dat[3, 2] # 3rd row, 2nd column
dat[1:4, ] # rows 1 to 4, all columns
dat[, c(2, 4)] # all rows, columns 2 and 4
dat[, "id"] # all rows, column "id"
dat[, c("id", "weight")] # all rows, columns "id" and "weight"
Using logical indices, specific rows may be selected for which certain conditions hold.
dat[dat$id == "s2", ] # all obs. of s2
dat[dat$id %in% c("s2", "s5"), ] # all obs. of s2 and s5
dat[dat$id == "s2" | dat$id == "s5", ] # same as above
dat[dat$weight > 60, ] # all obs. above 60 kg
Matrices and arrays have two or more indices.
arr <- array(c(A, 2 * A), c(3, 4, 2))
arr[, , 2] # 2nd "slice"
Data frames are most conveniently sorted using order()
which returns the indices of the ordered variable.
order(dat$weight) # weight: 60 72 57 90 95 72
dat[order(dat$weight), ] # sort by weight
order()
accepts multiple arguments to allow for sorting
by more than a single variable.
data[order(data$A, data$B), ] # first sort by A, then by B
Contingency tables contain the counts of one or more variables, usually factors:
table(mtcars$gear)
##
## 3 4 5
## 15 12 5
table(mtcars$gear, mtcars$cyl)
##
## 4 6 8
## 3 1 2 12
## 4 8 4 0
## 5 2 1 2
The xtabs()
function takes a formula as its first
argument:
xtabs(~ gear + cyl, mtcars)
## cyl
## gear 4 6 8
## 3 1 2 12
## 4 8 4 0
## 5 2 1 2
Frequently, one wants to aggregate a data frame, for example in order
to average over multiple observations within factor levels. This may be
achieved using the aggregate()
function.
# Single response variable, single grouping variable
aggregate(y ~ x, data, FUN, ...)
# Multiple response and grouping variables
aggregate(cbind(y1, y2) ~ x1 + x2, data, FUN, ...)
The y
variables are numeric data to be split into groups
according to the grouping variables x
(usually factors).
FUN
is a function to compute the statistics.
Average horse power (hp) and fuel efficiency (mpg; miles per gallon) for each combination of number of cylinders (cyl) and gears (gear).
aggregate(hp ~ cyl + gear, mtcars, mean)
## cyl gear hp
## 1 4 3 97.0000
## 2 6 3 107.5000
## 3 8 3 194.1667
## 4 4 4 76.0000
## 5 6 4 116.5000
## 6 4 5 102.0000
## 7 6 5 175.0000
## 8 8 5 299.5000
aggregate(mpg ~ cyl + gear, mtcars, mean)
## cyl gear mpg
## 1 4 3 21.500
## 2 6 3 19.750
## 3 8 3 15.050
## 4 4 4 26.925
## 5 6 4 19.750
## 6 4 5 28.200
## 7 6 5 19.700
## 8 8 5 15.400
# combined in one function call
aggregate(cbind(hp, mpg) ~ cyl + gear, mtcars, mean)
## cyl gear hp mpg
## 1 4 3 97.0000 21.500
## 2 6 3 107.5000 19.750
## 3 8 3 194.1667 15.050
## 4 4 4 76.0000 26.925
## 5 6 4 116.5000 19.750
## 6 4 5 102.0000 28.200
## 7 6 5 175.0000 19.700
## 8 8 5 299.5000 15.400
Frequently, data are in either long (one line per observation) or
wide (one line per case) format. reshape()
can be used to
switch between both formats.
head(Indometh) # long data format
# From long to wide
df_w <- reshape(Indometh, v.names = "conc", timevar = "time",
                idvar = "Subject", direction = "wide")
Reshaping: wide to long
# From wide to long
df_l <- reshape(df_w, varying = list(2:12), v.names = "conc",
                idvar = "Subject", direction = "long",
                times = c(.25, .5, .75, 1, 1.25, 2, 3, 4, 5, 6, 8))
# Then, optionally, re-order data frame by subject
df_s <- df_l[order(df_l$Subject), ]
Click the checkbox below for exercises
Create the following objects in R:
1:250;
Store \(\mathbf{x}\), \(\mathbf{y}\) and \(\mathbf{A}\) in a list and name the elements of this list. Extract the three elements of this list in three different ways.
Create a dataframe dat
for 60 subjects and three
variables:
id subject number with levels 1 to 60
order with the levels ‘same’ and ‘different’ (pay attention to the order of the levels!)
condition with levels A, B and C.
Order and condition are crossed factors in a balanced design,
i.e. there are ten subjects in each of the six experimental conditions.
Apply str()
, summary()
, table()
or xtabs()
to the dataframe to check the structure.
Simulate the result of a psychological test for each subject and
write this into a new variable Score
in the previously
created dataframe.
To do this, first create the variable Score using dat$Score <- NA and thus assign the value NA to each subject. Now assume that the following results (points achieved in the test) are possible depending on the condition:
For the subjects in each condition, randomly draw a score from the
corresponding possible scores and write it into the variable
Score
of the dataframe.
Then use aggregate()
to calculate the mean values of the
test results in all six experimental conditions. Find the minimum and
maximum of the test results in each experimental condition using the
min()
and max()
functions in combination with
aggregate()
.
What types of indexing do you know to extract individual columns from
the dataframe you have just created? Read the variable
Score
with all these possibilities.
ASCII text files in tabular or spread sheet form (one line per
observation, one column per variable) are read via
read.table()
.
dat <- read.table("D:/experiment/data.txt", header = TRUE)
This creates a data frame; numerical variables are converted to numerical vectors, character variables to character vectors. It is recommended to first open any file which (possibly) is a text file in your text editor (e.g., Vim / RStudio editor) to find out which options to set when reading it into R.
Frequently used options:
header: variable names in the first line?
stringsAsFactors: convert character vectors to factors?
sep: which delimiter between variables (white space, tabulator, comma, semicolon etc.)?
dec: decimal point or comma?
skip: number of lines to skip before reading data
ASCII text files in arbitrary form are read using readLines().
raw <- readLines("rawdata.txt")
This creates a character vector with as many elements as lines of
text. The character vector is further processed using string functions
or regular expressions (?regex
).
Example:
readLines(file.path(Sys.getenv("R_HOME"), "doc", "THANKS"))
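A sketch of such post-processing (the file content here is made up; in practice, raw would come from readLines()):

```r
raw <- c("subj01: 523 ms", "subj02: 611 ms",
         "# comment line", "subj03: 498 ms")
data_lines <- raw[!grepl("^#", raw)]  # drop comment lines
id <- sub(":.*", "", data_lines)      # part before the colon
rt <- as.numeric(sub(".*: (\\d+) ms", "\\1", data_lines)) # extract the numbers
data.frame(id, rt)
```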
Several possible strategies
Save data as text file in other software. Then employ one of the
functions from the read.table()
family.
Reading SPSS data files
library(foreign) # load add-on package
dat <- read.spss("file.sav", to.data.frame = TRUE)
Additional information (e. g., about Excel files, data bases) in R Data Import/Export manual
Data frames having the same column names can be combined using
rbind()
.
dat <- rbind(dat1, dat2, dat3)
More flexibility is provided by merge()
.
Example: merge by ID
dat <- merge(dat1, dat2, by = "id", all = TRUE)
This creates a data frame with as many rows as different IDs. The new
data frame includes all columns in dat1
and
dat2
, possibly with missing values if IDs are in either
dat1
or dat2
, but not in both.
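A minimal sketch with two made-up data frames:

```r
dat1 <- data.frame(id = c("s1", "s2", "s3"), age = c(21, 25, 30))
dat2 <- data.frame(id = c("s2", "s3", "s4"), score = c(17, 12, 19))
merge(dat1, dat2, by = "id")              # only IDs present in both data frames
merge(dat1, dat2, by = "id", all = TRUE)  # all IDs; unmatched cells become NA
```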
Small data vectors can be copied into the R console using
scan()
:
x <- scan(text = "<data>") # for numeric input
y <- scan(text = "<data>",
          what = "character") # for character input
A simple spread-sheet editor is invoked by
dat <- edit(data.frame()) # start with empty data frame
Usually, it is more convenient to use external software (text editor, spread-sheet application) for data entry. These (text) files with the raw data are then read into R.
For reproducibility: Data preparation, processing, and validation should be entirely done in R. Only as little manual editing as necessary, but as much script-based data processing as possible!
Writing text files
write.table(dat, "exp1_data.txt", ...) # many optional arguments
Text files are maximally portable (also across software packages), human readable, but may lose some features of the data structures (e. g., ordering of factor levels).
R binary files
save(dat, file = "exp1_data.rda") # write binary file
load("exp1_data.rda") # recreate dat
R binary files are exact representations of arbitrary data structures (e.g., data frames, lists), portable across platforms (within R only), but not human readable.
Conditional execution of code is available in R via
if (expr_1) {
  expr_2
} else {
  expr_3
}
where expr_1
must evaluate to a single logical value.
Additional if conditions can be implemented via else if():
if (expr_1) {
  expr_2
} else if (expr_3) {
  expr_4
} else {
  expr_5
}
where expr_3
must also evaluate to a single logical
value.
Example
x <- 3
if (!is.na(x)) {
  y <- x^2
} else {
  stop("x is missing")
}
y
x <- NA
if (!is.na(x)) {
  y <- x^2
} else {
  stop("x is missing")
}
There is also a vectorized version of the if/else construct, the
ifelse()
function:
x <- 1:3
y <- ifelse(x > 1, x * 2, x / 2)
There are for() and while() loops for repeated execution:
A for()-loop is used to iterate over the items of an object.
A while()-loop allows code to be executed repeatedly while a given Boolean condition is true.
for()-loop example: Subtract the mean value \(\bar{x}\) from each \(x_i\) with \(i = 1, 2, \ldots, n\).
x <- rnorm(10)
y <- numeric(10) # allocate space for results
for (i in seq_along(x)) { # for i = 1, 2, ..., 10
  y[i] <- x[i] - mean(x) # replace element with result
}
y

# In R, however, no loop is required for this
# since a vectorized calculation is possible.
y1 <- x - mean(x)
y1
while()-loop example: Babylonian method (a.k.a. Heron’s method) to approximate \(\sqrt{S}\). Iterative formula:
\[ x_{n+1}={\frac {1}{2}}\left(x_{n}+{\frac {S}{x_{n}}}\right) \]
S <- 2
x <- 1 # starting value
last_x <- x / 2
while (abs(x - last_x) > 1e-10) { # Boolean condition checked before each iteration
  last_x <- x
  print(paste0("Current approximation: ", x))
  x <- (1 / 2) * (last_x + (S / last_x))
}
Warning: loops are used in R code much less often than in compiled languages. Code that takes a ‘whole object’ (vectorized) view is likely to be both clearer and faster in R.
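This difference can be checked with system.time(); the helper functions below are made up for illustration:

```r
x <- rnorm(1e6)

center_loop <- function(x) { # element-wise loop
  m <- mean(x)
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    y[i] <- x[i] - m
  }
  y
}

center_vec <- function(x) x - mean(x) # 'whole object' view

system.time(y1 <- center_loop(x)) # noticeably slower
system.time(y2 <- center_vec(x))
all.equal(y1, y2) # same result
## [1] TRUE
```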
The apply()
family of functions may be used in many
places where in traditional languages loops are employed.
Matrices and arrays: apply()
apply()
is used to work vector-wise on matrices or
arrays. Appropriate functions can be applied to the columns or rows of a
matrix or array without explicitly writing code for a loop.
X <- matrix(rnorm(20), nrow = 5, ncol = 4)
apply(X, 2, max) # maximum for each column
## [1] 0.7687309 1.0348034 1.0014192 1.2547062
apply(X, 1, mean) # mean of each row; cf. rowMeans(X)
## [1] 0.4091624 0.3549513 0.4588916 0.2324149
## [5] -0.2341359
apply(X, 1, function(x) {max(x) - min(x)}) # custom function
## [1] 0.5796948 1.4062476 1.7534325 1.3125742 2.5166080
Data frames, lists and vectors: lapply()
and
sapply()
Using the function lapply()
(l because the value
returned is a list), a function can be applied element-wise to other
objects, for example, data frames, lists, or simply vectors. The
resulting list has as many elements as the original object to which the
function is applied.
X_df <- as.data.frame(X)
# apply(X_df, max) does NOT work, a data frame is not an array
lapply(X_df, max) # same result as apply(X, 2, max)
## $V1
## [1] 0.7687309
##
## $V2
## [1] 1.034803
##
## $V3
## [1] 1.001419
##
## $V4
## [1] 1.254706
The function sapply()
(s for simplify) works like
lapply()
with the exception that it tries to simplify the
value it returns. It might become a vector or a matrix.
sapply(iris, class) # works column-wise on a data frame; returns a character vector
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## "numeric" "numeric" "numeric" "numeric"
## Species
## "factor"
Group-wise calculations: tapply()
tapply()
(t for table) may be used (instead of
aggregate()
) to do group-wise calculations on vectors.
Frequently it is employed to calculate group-wise means.
data(Oats, package = "nlme") # load df Oats from nlme packages
# One factor
tapply(Oats$yield, Oats$Block, mean)
## VI V III IV II
## 96.25000 90.91667 95.91667 98.16667 107.25000
## I
## 135.33333
# Two factors
tapply(Oats$yield, list(Oats$Block, Oats$Variety), mean)
## Golden Rain Marvellous Victory
## VI 90.25 109.00 89.50
## V 95.50 85.25 92.00
## III 86.75 118.50 82.50
## IV 108.00 95.00 91.50
## II 113.25 121.25 87.25
## I 133.25 129.75 143.00
# compared to aggregate()
aggregate(yield ~ Block + Variety, Oats, mean)
## Block Variety yield
## 1 VI Golden Rain 90.25
## 2 V Golden Rain 95.50
## 3 III Golden Rain 86.75
## 4 IV Golden Rain 108.00
## 5 II Golden Rain 113.25
## 6 I Golden Rain 133.25
## 7 VI Marvellous 109.00
## 8 V Marvellous 85.25
## 9 III Marvellous 118.50
## 10 IV Marvellous 95.00
## 11 II Marvellous 121.25
## 12 I Marvellous 129.75
## 13 VI Victory 89.50
## 14 V Victory 92.00
## 15 III Victory 82.50
## 16 IV Victory 91.50
## 17 II Victory 87.25
## 18 I Victory 143.00
New functions may be defined interactively in a running R session
using function()
.
Example: a handmade t-test function (there is one in R already!)
twosam <- function(y1, y2) { # definition
  n1 <- length(y1) # sample sizes
  n2 <- length(y2)
  yb1 <- mean(y1) # means
  yb2 <- mean(y2)
  s1 <- var(y1) # variances
  s2 <- var(y2)
  s_p <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2) # pooled variance
  tst <- (yb1 - yb2) / sqrt(s_p * (1 / n1 + 1 / n2)) # t statistic
  return(tst) # return value; exactly one object; is often a list
}
Calling the function
tstat <- twosam(y1 = chickwts[chickwts$feed == "linseed", "weight"],
                y2 = chickwts[chickwts$feed == "soybean", "weight"])
tstat

# check t statistic
t.test(chickwts[chickwts$feed == "linseed", "weight"],
       chickwts[chickwts$feed == "soybean", "weight"], var.equal = TRUE)
If there is a function fun1 defined by
fun1 <- function(data, data.frame, graph, limit) {
  ...
}
then the function may be invoked in several ways, for example
fun1(d, df, TRUE, 20)
fun1(d, df, graph = TRUE, limit = 20)
fun1(data = d, limit = 20, graph = TRUE, data.frame = df)
All of them are equivalent (cf. positional matching and keyword matching).
In many cases, arguments can be given commonly appropriate default values, in which case they may be omitted from the call.
fun1 <- function(data, data.frame, graph = TRUE, limit = 20) {
  ...
}
It could be called as
ans <- fun1(d, df)
which is now equivalent to the three cases above, or as
ans <- fun1(d, df, limit = 10)
which changes one of the defaults.
The return value of a function may have a specified class. The class
of an object determines how it will be treated by so-called generic
functions. To find out which generic functions have a method for a given
class, use: methods(class = "nameOfClass")
Example:
methods(class = "factor")
Conversely, to find out all the class-based methods defined for a
generic function, use: methods(genericFunction)
Example:
methods(summary)
Access the documentation (if available) for a method via, e.g., ?summary.factor
A print method for the twosam()
function could be
print.twsm <- function(obj) { # functionName.className
  cat("\nMy two-sample t test\n\n")
  cat("Value of test statistic: ", obj$tst, "\n")
  invisible(obj) # return the object invisibly
}
(See ?cat
and ?Quotes
for message
formatting.)
For this to work, the original function needs to return an object of class ‘twsm’.
twosam <- function(y1, y2) {
  ...
  retval <- list(tst = tst) # define return value
  class(retval) <- "twsm" # assign class to return value
  return(retval)
}
When trying to find errors in the code, it may help to look inside a
function while running it, and to execute it step by step. This can be
done using debug()
.
debug(twosam)
twosam(rnorm(10), rnorm(10) + 2)
At the prompt, any R code may be entered. Some useful commands:
ls() displays what the function ‘sees’ at each step
n (or the return key) evaluates the next command of the function
Q quits the function browser
See ?browser for further commands.
In order to disable the debugging mode for the function, type
undebug(twosam)
Learning to program in a new language means making a lot of errors. Some errors are unexpected both for beginners and for those who have programmed before in a different language.
Reading the FAQs, R mailing lists (and Stackoverflow), the built-in documentation, and the manuals, will help!
‘The R Inferno’ by Patrick Burns (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) is an entertaining and informative guide to the traps of the R language.
Click the checkbox below for exercises
Create the vector \(\mathbf{x} = (3~5~7~9~11~13~15~17)'\) using a for loop. Note: Use the formula \(i \cdot2 + 1\). Implement two methods:
allocate memory: Start with a zero vector of the right length
(tip: numeric()
) and iteratively replace its
elements.
growing: Start with a NULL
object
(x <- NULL
) and iteratively append new results.
Note: The first method is more efficient especially for longer vectors.
For those interested: Test the efficiency of both methods using
system.time()
and a sufficiently long vector.
Implement the following function in R: \[ f(x) = \begin{cases} -1 & \text{if } x < 0,\\ 0 & \text{if } x = 0,\\ 1 & \text{if } x > 0. \end{cases} \]
Write a for loop for \(x_{i + 1} = 0.5 \cdot (x_i + \frac{b}{x_i})\), with \(i = 1, \dots, 15\), \(x_1 = 10\) and \(b = 5\). How does the output change for different values of \(x_1\) and \(b\)? Plot \(x\) with plot(x, type = "l", main = b).
The plot for b <- 5 should look something like this.
Create the \(10 \times 10\) triangular matrix \[ \begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \vdots&\ddots&\ddots&\ddots&\ddots&\ddots&\ddots&\ddots&\ddots& \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \end{matrix} \] using two methods:
iterate with two nested for-loops through rows (index \(i\)) and columns (index \(j\)) of a zero matrix; replace an element by \(1\) if \(i \geq j\).
search for a suitable R-function with ??.
Hint: If you get a logical matrix with the R-function, you have to translate it into a numerical matrix. To do this, either think of mode() or test what happens when you apply simple arithmetic operations to the logical matrix.
Write R code that cumulatively adds up the values of the vector \(\mathbf{x} = (1~2~3~4 \dots 10)'\). The result should look like this: \(\mathbf{y} = (1~3~6~10 \dots 55)'\). Solve the problem in three ways:
with a loop,
by matrix multiplication (vectorised solution; hint: use the triangular matrix above),
with an R-function (search with ??).
Two graphics systems come with R: the traditional (base) graphics system and the grid graphics system.
The former is easier to control, the latter provides more flexibility. The lattice package is based on the grid graphics system; it implements Trellis plots (a group of smaller plots arranged in a grid) in R.
Both systems provide high-level functions that produce complete plots and low-level functions that add further output to an existing plot.
Another widely used R graphics system is the ggplot2 package.
We will only deal with the traditional (base) graphics here and take a quick look at the lattice package.
Manual, documentation, and examples for ggplot2 can be found at https://ggplot2.tidyverse.org/
High-level functions
plot() # scatter plot, specialized plot methods
boxplot() # boxplot (median, 1st & 3rd quartiles, whiskers)
hist() # histogram
qqnorm() # normal quantile-quantile plot
barplot() # barplot
curve() # function plot
pie() # pie chart
pairs() # scatter plot matrix
persp() # 3d plot
contour() # contour plot
coplot() # conditional plot
interaction.plot() # plot of cell means for a two-factor design
Low-level functions
points() # add points
lines() # add lines
rect() # add rectangle
polygon() # add polygon
abline() # add line with intercept a, slope b (or horizontal h, vertical v)
arrows() # add arrows; useful for e.g. confidence intervals
text() # add text in plotting region
mtext() # add text in margins region
axis() # customize axes
box() # draw box around plot
legend() # add a figure legend
Use ?functionName to find more detailed information about the functions and the arguments to use.
A figure consists of two regions: the plotting region and the surrounding margin region.
Example: scatter plot
x <- runif(50, 0, 2) # 50 uniform random numbers
y <- runif(50, 0, 2)
plot(x, y, main = "Title", sub = "Subtitle", xlab = "x-label",
     ylab = "y-label") # produce plotting window
## Writing text into plotting region
text(0.6, 0.6, "Text at (0.6, 0.6)")
## horizont. and vertic. lines
abline(h = 1.1, v = 1.6, lty = 2)
## Writing text into margin region
for(side in 1:4) {
  mtext(-1:4, side = side, at = .7, line = -1:4)
}
mtext(paste("Side", 1:4), side = 1:4, line = -1, font = 2)
In order to have full control over the details of a plot, it is recommended to build it step by step. Consider as an example the mean reaction times in a two-factorial experiment; the interaction is to be graphically displayed.
dat <- read.table(header = TRUE, stringsAsFactors = TRUE,
                  text = "A B rt
a1 b1 825
a1 b2 792
a1 b3 840
a2 b1 997
a2 b2 902
a2 b3 786")
## First produce only the plotting region and the coordinate system.
plot(rt ~ as.numeric(B), dat, type = "n", axes = FALSE,
xlim = c(.8, 3.2), ylim = c(750, 1000),
xlab = "Difficulty", ylab = "Mean reaction time (ms)")
## Plot the data points separately for each level of factor A.
points(rt ~ as.numeric(B), dat[dat$A == "a1", ], type = "b", pch = 16)
points(rt ~ as.numeric(B), dat[dat$A == "a2", ], type = "b", pch = 4)
## Add axes.
axis(side = 1, at = 1:3, expression(B[1], B[2], B[3]))
axis(side = 2)
## Add a legend and box.
legend(2.5, 975, expression(A[1], A[2]), pch = c(16, 4),
bty = "n", title = "Task")
box()
Proper formatting of mathematical expressions is possible with expression(), which accepts syntax documented in ?plotmath.
Additionally:
Arrows (e.g., for confidence intervals) may be drawn with the arrows() function.
With par() many graphical parameters may be set (see ?par); for example, par(mgp = c(2, .7, 0)) reduces the distance between labels and axes.
Some graphical parameters may only be set by a call to par().
mai   # size of figure margins (in inches)
mar   # size of figure margins (in lines of text)
mfrow # number of sub-figures in the plotting device
oma   # size of outer margins (in lines of text)
omi   # size of outer margins (in inches)
Parameters changed via par() are automatically reset when a new device is opened.
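Because the settings persist for the open device, a common idiom is to store the previous values (which par() returns) and restore them after plotting. A minimal sketch; the name op ("old par") is just a convention:

```r
## Save the current settings while changing them, then restore them.
op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))  # par() returns the old values
hist(rnorm(100))   # both plots go into the 1 x 2 layout
hist(runif(100))
par(op)            # restore the previous settings
```

This way a function that changes graphical parameters leaves the device as it found it.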
Some graphical parameters are also accepted as arguments by high- and low-level plotting functions.
adj  # justification of text
bty  # box type for legend
cex  # size of text or data symbols (multiplier)
col  # color, see colors() for the available colors
las  # rotation of text in margins
lty  # line type (solid, dashed, dotted, ...)
lwd  # line width
mgp  # placement of axis ticks and tick labels
pch  # data symbol type
tck  # length of axis ticks
type # type of plot (points, lines, both, none)
Graphic devices control the format in which a plot is produced. When using R interactively, the standard output of graphic functions is on the screen. Additional windows can be opened:
x11() # open new plotting window (X or Windows)
quartz() # Mac OS Quartz window
There are several file devices available:
postscript() # vector graphics
pdf()
win.metafile() # Windows only; native format for Word etc.
png() # bitmap graphics
tiff()
jpeg()
bmp()
Plots may be exported to these file formats. With vector graphics (such as eps or wmf) plots can be re-scaled without losing quality. The procedure is: First open a file device, then execute some plotting commands (their results are plotted into the device), finally close the device.
A pdf graphic is produced by
pdf("myplot.pdf", height = 3.4, width = 3.4, pointsize = 10)
# -- some plotting commands -- #
dev.off() # close device
A Windows compatible wmf graphic is obtained by
win.metafile("myplot.wmf", height = 3.4, width = 3.4, pointsize = 10)
# -- some plotting commands -- #
dev.off() # close device
The plot should fit on an A4 sheet of paper. A4 paper measures 8.3 x 11.7 inches. Taking the page margins into account, a plot should be at most about 7.5 x 9 (or 9 x 7.5) inches in size.
Exercises:
Load the dataset cars (see ?cars for information on it). Plot distance as a function of speed: Set type = "n" and axes = FALSE. Now build up the figure piece by piece with points(), axis(), etc.
Use the R dataset PlantGrowth. Create a figure with two columns and two rows (graphic parameter mfrow):
histograms of the weights in each of the three groups,
a boxplot (boxplot()) of the three groups; additionally write the respective group mean in each box with text().
Make sure that the x-axes of the histograms are identical and choose a uniform width/number of 'bins'.
Hint: First create a data frame containing the mean values using an aggregate() command. This could look like this, for example:
## group weight
## 1 ctrl 5.032
## 2 trt1 4.661
## 3 trt2 5.526
The finished plot could look like this:
Given the following function: \(f(x) = \frac{1}{3}x^3 + 2x^2 + 9\)
Plot the curve of this function in the interval [-10, 10] together with its first two derivatives. To do this, create a vector in this interval with step size 0.01.
x <- seq(-10, 10, 0.01)
Then apply the respective functions to the vector and plot the two vectors with plot(). Set type = 'l'. How does the plot change if you do not do this? Plot all functions in the same plot (you can do this with lines()) and use different colours for the functions (argument col). At the end you can add a legend.
The figure could look like this:
Find out about the function curve(). Can you use it to create the same figure?
The stats package, which is automatically loaded when R starts up, contains functions for statistical calculations and random number generation. For a complete list of functions, see library(help=stats).
Examples of functions to perform classical hypothesis testing in R:
Nominal data
binom.test() performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment
chisq.test() performs chi-squared contingency table tests and goodness-of-fit tests
Metric response variables
cor.test() tests for association between paired samples
t.test() performs one- and two-sample t tests
var.test() performs an F-test to compare the variances of two samples from normal populations
Ordinal-continuous response variables
wilcox.test() performs one- and two-sample Wilcoxon tests
The p-value is the probability of obtaining a test statistic at least as extreme as the one that was observed, given that the null hypothesis is true. If the p-value is smaller than the probability \(\alpha\) of a Type I decision error, then the null hypothesis is rejected.
T.statistic <- 2.3 # calculated T statistic
df <- 18           # degrees of freedom

t <- seq(-4, 4, .01)
den_t <- dt(t, df)
plot(t, den_t, type = "l", ylab = "Density")
polygon(c(t[t >= T.statistic ], max(t), T.statistic),
c(den_t[t >= T.statistic ], 0, 0), col = "blue",
border = 1)
polygon(c(t[t <= -T.statistic ], -T.statistic, min(t)),
c(den_t[t <= -T.statistic ], 0, 0), col = "blue",
border = 1)
abline(v = c(-T.statistic, T.statistic), lty = 2, col = "blue") # draw T.stat
abline(h = 0, col = "grey")
p <- pt(-T.statistic, df) * 2 # p.value = left tail * 2 (symmetric distribution)
crit.q <- qt(.975, df)        # crit. quantile; alpha = .05; 1 - alpha/2 = .975
abline(v = c(-crit.q, crit.q), col = "red", lty = 2) # draw crit. quantile
text(0, .1, paste0("p.value = ", round(p, 3)))
Assumptions:
Hypotheses:
two-sided (alternative = "two.sided")
left-tailed (alternative = "less")
right-tailed (alternative = "greater")
)Test statistic: \[ T = \sum^{n}_{i = 1} X_i, \qquad T \sim B(n, p_0) \]
R function:
binom.test(x = 8, n = 20, p = 0.25)
Assumptions:
Hypotheses:
Test statistic: \[ X^2 = \sum^r_{k = 1}\frac{(Y_k - np_k)^2}{np_k}, \qquad X^2 \sim \chi^2(r - 1) \]
R function:
tab <- xtabs(count ~ spray, InsectSprays)
chisq.test(tab)
Assumptions:
Hypotheses:
Test statistic: \[ X^2 = \sum^m_{j = 1}\sum^r_{k = 1}\frac{(x_{jk} - \hat\mu_{jk})^2}{\hat\mu_{jk}}, \qquad X^2 \sim \chi^2((m-1)(r-1)) \]
R function:
tab <- xtabs(~ education + induced, infert)
chisq.test(tab)
Assumptions:
Hypotheses:
Test statistic (cf. test of homogeneity): \[ X^2 = \sum^m_{j = 1}\sum^r_{k = 1}\frac{(x_{jk} - \hat\mu_{jk})^2}{\hat\mu_{jk}}, \qquad X^2 \sim \chi^2((m-1)(r-1)) \]
R function:
chisq.test(HairEyeColor[, , "Female"])
Assumptions (Student’s \(t\) test):
Assumptions (approximate \(t\) test):
Hypotheses:
Test statistic: \[ T = \frac{\bar{X} - \mu_0}{S} \, \sqrt{n}, \qquad T \sim t(n-1) \] (approx. distribution of \(T\) in case of approximate \(t\) test)
Rejection region:
with the \(p\)-quantile \(t_p (n-1)\) of the \(t(n-1)\) distribution
R function:
ht <- t.test(sleep[sleep$group == 1, "extra"], mu = 0)
(Equivalent) Test decision:
Critical value
alpha <- 0.05
dgf <- ht$parameter
qt(p = 1 - alpha / 2, df = dgf) # critical value
## [1] 2.262157
ht$statistic # observed test statistic
## t
## 1.32571
Confidence interval
The confidence level of the confidence interval is chosen via the conf.level argument of t.test().
ht$conf.int
## [1] -0.5297804 2.0297804
## attr(,"conf.level")
## [1] 0.95
\(p\)-value
ht$p.value
## [1] 0.2175978
Assumptions:
Hypotheses:
Test statistic: \[ T = \frac{\bar x - \bar y - \delta_0}{\hat \sigma_{\bar x - \bar y}}, \qquad T \sim t(n+m-2) \]
R function:
t.test(weight ~ group, PlantGrowth[PlantGrowth$group != "ctrl", ],
var.equal = TRUE)
Assumptions:
Hypotheses:
Test statistic: \[ T = \frac{\bar{X} - \bar{Y} - \delta_0} {\sqrt{\frac{S_x^2}{n} + \frac{S_y^2}{m}}}, \qquad T \sim t(k) \] \[ k = (S_x^2 / n + S_y^2 / m)^2 ~ / \left(\frac{1}{n-1}\, \left(\frac{S_x^2}{n}\right)^2 + \frac{1}{m-1}\, \left(\frac{S_y^2}{m}\right)^2\right) \]
R function:
t.test(weight ~ group, PlantGrowth[PlantGrowth$group != "ctrl", ],
var.equal = FALSE)
Assumptions:
Hypotheses:
Test statistic: \[ T = \frac{\bar d - \delta_0}{\hat\sigma_{\bar d}}, \qquad T \sim t(n - 1) \]
R function:
t.test(sleep[sleep$group == 1, "extra"],
       sleep[sleep$group == 2, "extra"], paired = TRUE)
Assumptions:
Hypotheses:
Test statistic: \[ T = \frac{r_{XY}}{\sqrt{1 - r^2_{XY}}}\sqrt{n - 2}, \qquad T \sim t(n - 2) \]
R function:
cor.test(~ speed + dist, cars)
Assumptions:
Hypotheses:
Test statistic: \[ T = \frac{S^2_x/\sigma^2_x}{S^2_y/\sigma^2_y}, \qquad T \sim F(n - 1, m - 1) \] for \(\sigma^2_x = \sigma^2_y\)
R function:
var.test(weight ~ group, PlantGrowth[PlantGrowth$group != "ctrl", ])
Assumptions:
Hypotheses:
Test statistic: \[
U = \sum^n_{i = 1}{rk(X_i)} - \frac{n(n + 1)}{2}
\] (for its distribution, see ?pwilcox
)
R function:
wilcox.test(weight ~ group, PlantGrowth[PlantGrowth$group != "ctrl", ])
(More information is in the ‘Statistical models in R’ chapter in the An Introduction to R manual.)
Linear models of the form \[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k +
\varepsilon,
\] where y is assumed to be independent and normally distributed
as \(N(\mu, \sigma^2)\), are fitted in
R using lm()
:
lm(y ~ x1 + x2 + ... + xk, data) # lm(formula, data)
ANOVA models, which constitute a subset of linear models, can be fit
using either lm()
or aov()
.
Formally, linear models are concisely defined using vectors and matrices:
\[ \mathbf{y} = \mathbf{X} \, \mathbf{\beta} + \mathbf{e} \] Fully written out, it is: \[ \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{pmatrix} = \begin{pmatrix} x_{10} & x_{11} & x_{12} & \ldots & x_{1p}\\ x_{20} & x_{21} & x_{22} & \ldots & x_{2p}\\ \vdots & \vdots & \vdots & & \vdots\\ x_{n0} & x_{n1} & x_{n2} & \ldots & x_{np} \end{pmatrix} \cdot \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \vdots\\ \beta_p \end{pmatrix} + \begin{pmatrix} e_1\\ e_2\\ \vdots\\ e_n \end{pmatrix} \] The model matrix \(\mathbf{X}\) represents the observed values of the explanatory variables for each subject (row). The columns represent the observed variables. For categorical variables, i.e. factors, the creation of these variables is called the (contrast) coding of factors.
Coding of factors: dummy coding example Let’s look at a one-way ANOVA with three factor levels.
For three subjects, belonging to the groups 1, 2, and 3, respectively, the result is a set of equations like this: \[\begin{array}{ccccccc} y_{1} & = & \mu & +~~~\alpha_1 & & & +~~~e_1\\ & \vdots & \\ y_{i} & = & \mu & & +~~~\alpha_2 & & +~~~e_i\\ & \vdots & \\ y_{n} & = & \mu & & & +~~~\alpha_3 & +~~~e_n \end{array}\]
We define the first group as the reference group with mean \(\mu\): \[ \alpha_1 := 0 \]
Writing out the model equations, we now obtain: \[\begin{array}{cccccc} y_{1} & = & x_{10} \, \mu & +~~~ x_{12} \, \alpha_2 & +~~~ x_{13} \, \alpha_3 & +~~~e_1\\ & \vdots & \\ y_{i} & = & x_{i0} \, \mu & +~~~ x_{i2} \, \alpha_2 & +~~~ x_{i3} \, \alpha_3 & +~~~e_i\\ & \vdots & \\ y_{n} & = & x_{n0} \, \mu & +~~~ x_{n2} \, \alpha_2 & +~~~ x_{n3} \, \alpha_3 & +~~~e_n \end{array}\]
All other groups ‘start’ from the baseline, defined by the reference group, so we set \(x_{i0} = 1\) for all \(i\).
Finally, we set up dummy variables which indicate, for each subject, whether they fall into one of the other groups: \[ x_{ij} = \begin{cases} 1 & \text{if $i$ is in group $j$}\\ 0 & \text{otherwise} \end{cases} \]
In standard terminology of linear models, we have \[\mu = \beta_0;\quad \alpha_2 = \beta_1;\quad \alpha_3 = \beta_2\] So, for our three subjects from three different groups, we get: \[\begin{array}{ccccccc} y_{1} & = & \beta_0 & +~~~ & & & +~~~e_1\\ & \vdots & \\ y_{i} & = & \beta_0 & & +~~~\beta_1 & & +~~~e_i\\ & \vdots & \\ y_{n} & = & \beta_0 & & & +~~~\beta_2 & +~~~e_n \end{array}\]
These \(\beta\) coefficients are provided by lm() and aov(). \(\beta_0\) is the expected value for the reference level, and \(\beta_1, \beta_2\) are the respective differences in the other two levels, compared to the reference level.
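The model matrix that lm() builds from a factor can be inspected directly with model.matrix(). A small sketch with a made-up three-level factor g (the factor and its level names are illustrative, not from the course data):

```r
## Hypothetical 3-level factor with two observations per level
g <- factor(rep(c("g1", "g2", "g3"), each = 2))
## Intercept column plus two dummy columns; level g1 is the reference,
## so rows from g1 have zeros in both dummy columns
model.matrix(~ g)
```

The column of ones corresponds to \(x_{i0} = 1\), and the dummy columns are exactly the indicator variables \(x_{ij}\) defined above.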
Coding of factors in R
Factors with \(k\) levels are represented by \(k - 1\) indicator variables.
Dummy coding (default): Indicators only take the values \(0\) and \(1\). If all dummies are zero, the reference level is indicated. Dummy coded effects are interpreted with respect to the reference level. Dummy coding is set in R by:
options(contrasts = c("contr.treatment", "contr.poly"))
Effect coding: Indicators take the values \(-1\), \(0\), and \(1\). Effects are interpreted as a deviation from the mean and sum to zero. Effect coding is switched on in R by
options(contrasts = c("contr.sum", "contr.poly"))
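The indicator columns used by the two schemes can also be displayed directly, here for a factor with three levels:

```r
## Dummy coding: 0/1 indicators, first level is the reference
contr.treatment(3)
## Effect coding: indicators take -1, 0, 1; each column sums to zero
contr.sum(3)
```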
Statistical models are represented by R formulae. R formulae usually correspond closely to the referring statistical models.
Formulae used in lm() and aov(); let A, B be fixed factors.

| R formula | model |
|---|---|
| y ~ 1 + x | \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\) |
| y ~ x | (short form) |
| y ~ 0 + x | \(y_i = \beta_1 x_i + \varepsilon_i\) |
| y ~ x + I(x^2) | \(y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i\) |
| y ~ A + B | \(y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}\) |
| y ~ A*B | \(y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\) |
| y ~ 1 + A + B + A:B | (long form) |
There are many extractor functions that work on lm objects.
coef() # Extract the regression coefficients
summary() # Print a comprehensive summary of the results of
# the regression analysis. This also returns a
# list from which further useful information
# can be extracted.
anova() # Compare nested models and produce an analysis
# of variance table (for incremental F tests)
residuals() # Extract the residuals
plot() # Produce four plots, showing residuals, fitted
# values and some diagnostics
fitted() # Extract the fitted values
predict() # Either predicted values for the observed data,
# i.e. the fitted values
#
# Or a new data frame can be supplied having the
# same variables specified with the same labels
# as the original. The value is a vector or
# matrix of predicted values corresponding to the
# determining variable values in the data frame
vcov() # Return the variance-covariance matrix of the
# main parameters of a fitted model object
model.matrix() # Return the design matrix
A method for analyzing the effect of a discrete variable, factor \(A\) with \(m\) levels, on a continuous variable \(Y\). For the factor levels \(i = 1, \ldots, m\), we collect independent samples of size \(n_i\).
The resulting sample is: \[ Y_{ij} ~~\text{with}~~ i = 1, \ldots, m,~~ j = 1, \ldots, n_i \]
Model with dummy coding
\[ Y_{ij} = \mu_i ~+~ \varepsilon_{ij} ~~\text{with}~~ i = 1, \ldots, m,~~ j = 1, \ldots, n_i ,~~\text{and}~~ \varepsilon_{ij} \sim N (0, \sigma^2) \]
Therefore, \[ E(Y_{ij}) = E(\mu_i + \varepsilon_{ij}) = \mu_i ~~\text{and}~~ Y_{ij} \sim N (\mu_i, \sigma^2) \]
So we assume stochastic independence and equal normal distributions for all \(\varepsilon_{ij}\) and for all \(Y_{ij}\)
Hypotheses: \[ H_0\colon~ \mu_1 = \mu_2 = \ldots = \mu_m \] \[ H_1\colon~ \mu_i \neq \mu_{i'} ~~\text{for at least one pair}~~ i, i' \in \{1, \ldots, m\} \]
Model with effect coding \[ Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} ~~\text{with}~~ i = 1, \ldots, m,~~ j = 1, \ldots, n_i \] and \(\varepsilon_{ij} \sim N (0, \sigma^2)\)
Hypotheses: \[ H_0\colon~ \alpha_1 = \alpha_2 = \ldots = \alpha_m = 0 \] \[ H_1\colon~ \alpha_i \neq 0 ~~\text{for at least one}~~ i \in \{1, \ldots, m\} \]
\(\mu\) is the grand mean. We obtain it using \(N = n_1 + \ldots + n_m\): \[ \mu = \frac{1}{N} \, \sum_{i=1}^m n_i \mu_i \] \(\alpha_i\) is the effect of factor level \(i\): \(~~\alpha_i = \mu_i - \mu\)
Test statistic \[ F = \frac{MS_B}{MS_W} = \frac{~~~\frac{1}{m-1} \, \sum_{i=1}^m n_i \, (\bar{Y}_i - \bar{Y})^2~~~} {\frac{1}{N-m} \, \sum_{i=1}^m \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2} \] Distribution of test statistic, given \(H_0\): \[ F \sim F (m - 1, N - m) \] Rejection region \[ F > F_{1-\alpha} (m - 1, N - m) \]
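The decomposition into \(MS_B\) and \(MS_W\) can be checked 'by hand'. A sketch using the built-in PlantGrowth data (chosen here only as a convenient example), compared against R's ANOVA table:

```r
## One-way ANOVA F statistic computed manually for PlantGrowth
y <- PlantGrowth$weight
g <- PlantGrowth$group
m <- nlevels(g)                    # number of groups
N <- length(y)                     # total sample size
n_i    <- tapply(y, g, length)     # group sizes
ybar_i <- tapply(y, g, mean)       # group means
MS_B <- sum(n_i * (ybar_i - mean(y))^2) / (m - 1)  # between-groups MS
MS_W <- sum((y - ybar_i[g])^2) / (N - m)           # within-groups MS
MS_B / MS_W                                        # F statistic
## Compare with the built-in ANOVA table
anova(lm(weight ~ group, PlantGrowth))[1, "F value"]
```

Both lines print the same value, which confirms that aov()/lm() implement exactly this variance decomposition.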
Data scheme:
b\(_1\) | \(\ldots\) | b\(_j\) | \(\ldots\) | b\(_J\) | |
---|---|---|---|---|---|
a\(_1\) | \(y_{111},\; \ldots,\;y_{11k},\; \ldots,\; y_{11K}\) | \(\ldots\) | \(y_{1j1},\; \ldots, \;y_{1jk}, \; \ldots,\; y_{1jK}\) | \(\ldots\) | \(y_{1J1},\;\ldots, \;y_{1Jk}, \; \ldots,\; y_{1JK}\) |
\(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
a\(_i\) | \(y_{i11},\; \ldots, \;y_{i1k}, \; \ldots,\; y_{i1K}\) | \(\ldots\) | \(y_{ij1},\; \ldots, \;y_{ijk}, \; \ldots,\; y_{ijK}\) | \(\ldots\) | \(y_{iJ1},\; \ldots, \;y_{iJk}, \; \ldots,\; y_{iJK}\) |
\(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) |
a\(_I\) | \(y_{I11},\; \ldots, \;y_{I1k}, \; \ldots,\; y_{I1K}\) | \(\ldots\) | \(y_{Ij1},\; \ldots, \;y_{Ijk}, \; \ldots,\; y_{IjK}\) | \(\ldots\) | \(y_{IJ1},\; \ldots, \;y_{IJk}, \; \ldots,\; y_{IJK}\) |
Note that the number of observations, \(K\), is assumed to be equal for all cells. Perfect variance decomposition between both main effects and their interaction is only possible in this specific case.
Model: Dummy coding \[ Y_{ijk} = \mu_{ij} + \varepsilon_{ijk} \]
Model: Effect coding
\[ Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \]
For both coding types: \[ i = 1, \ldots, I, j = 1, \ldots, J, k = 1, \ldots, K \] \[ \varepsilon_{ijk} \sim N (0, \sigma^2) \] \[ Y_{ijk} \sim N (\mu_{ij}, \sigma^2) \]
Hypotheses concerning interaction: \[ H_0^{A \times B}\colon~ (\alpha\beta)_{ij} = 0 ~\text{for all}~ i = 1, \ldots, I,~ j = 1, \ldots, J \] \[ H_1^{A \times B}\colon~ (\alpha\beta)_{ij} \neq 0 ~\text{for at least one pair}~ (i, j) \]
Hypotheses concerning main effect of factor \(A\) \[ H_0^{A}\colon~ \alpha_i = 0 ~\text{for all}~ i = 1, \ldots, I \] \[ H_1^{A}\colon~ \alpha_i \neq 0 ~\text{for at least one}~ i \]
Hypotheses concerning main effect of factor \(B\) \[ H_0^{B}\colon~ \beta_j = 0 ~\text{for all}~ j = 1, \ldots, J \] \[ H_1^{B}\colon~ \beta_j \neq 0 ~\text{for at least one}~ j \]
See Fahrmeir et al. (2013) for calculation of the mean squares \(MS_A, MS_B, MS_{A\times B}, MS_W\)
Test statistic for \(H_0^{A \times B}\) vs. \(H_1^{A \times B}\) \[ F_{A \times B} = \frac{MS_{A \times B}}{MS_W}, \qquad F_{A \times B} \sim F((I - 1)(J - 1),~ IJ(K - 1)) \]
Test statistic for \(H_0^A\) vs. \(H_1^A\) \[ F_A = \frac{MS_A}{MS_W}, \qquad F_A \sim F(I - 1,~ IJ(K - 1)) \]
Test statistic for \(H_0^B\) vs. \(H_1^B\) \[ F_B = \frac{MS_B}{MS_W}, \qquad F_B \sim F(J - 1,~ IJ(K - 1)) \]
Example
Two influencing factors on satisfaction with the study programme (Fahrmeir et al., 2013)
Satisfaction with the study programme was measured with a score ranging from 0 to 100, which was constructed so that it is approximately normally distributed.
For each factor level combination, five students were randomly selected and their satisfaction score was measured.
Data entry
dat <- read.table(header = TRUE, stringsAsFactors = TRUE,
                  text = "
motiv fam score
high partner 85
high partner 89
high partner 91
high partner 95
high partner 80
high single 50
high single 52
high single 65
high single 71
high single 72
low partner 34
low partner 30
low partner 28
low partner 23
low partner 40
low single 30
low single 28
low single 33
low single 16
low single 23
")
Fitting the model with either lm() or aov()
aov.out <- aov(score ~ motiv * fam, dat)
lm.out  <- lm(score ~ motiv * fam, dat)
Result
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## motiv 1 10811 10811 190.507 2.64e-10 ***
## fam 1 1201 1201 21.167 0.000295 ***
## motiv:fam 1 551 551 9.714 0.006644 **
## Residuals 16 908 57
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm.out)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## motiv 1 10811.2 10811.2 190.5066 2.643e-10 ***
## fam 1 1201.2 1201.2 21.1674 0.0002953 ***
## motiv:fam 1 551.2 551.2 9.7137 0.0066437 **
## Residuals 16 908.0 56.8
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Simple interaction plot with means using
interaction.plot()
with(dat, interaction.plot(motiv, fam, score, type = "b",
pch = 16, xlab = "Motivation", ylab = "Score"))
Stepwise plot with data points, means, and standard errors
## plot raw data points
# fam == single
plot(score ~ as.numeric(motiv), dat[dat$fam == "single", ],
xlab = "Motivation", ylab = "Score", col = "grey55",
axes = FALSE, ylim = c(20, 100), xlim = c(.8, 2.2),
pch = 16)
# fam == partner
points(score ~ as.numeric(motiv), dat[dat$fam == "partner", ],
col = "grey55")
axis(1, at = 1:2, labels = levels(dat$motiv)) # draw x-axis
axis(2) # draw y-axis
ag <- aggregate(score ~ motiv + fam, dat, mean) # aggregate means
ag$se <- aggregate(score ~ motiv + fam, dat, \(x) sd(x)/sqrt(length(x)))[, 3] # SEM
## add mean
# fam == single
points(score ~ as.numeric(motiv), ag[ag$fam == "single", ],
type = "b", pch = 16)
# fam == partner
points(score ~ as.numeric(motiv), ag[ag$fam == "partner", ],
type = "b", lty = 2)
## add SEM arrows
with(ag, arrows(as.numeric(motiv), score - se,
as.numeric(motiv), score + se, code = 3, angle = 90,
length = .05))
## add legend and box
legend("topright", legend = c("Partner", "Single"), lty = c(2, 1), pch = c(1, 16))
box()
Estimated parameters
coef(aov.out) # or: summary(lm.out)
## (Intercept) motivlow
## 88 -57
## famsingle motivlow:famsingle
## -26 21
E.g., the predicted value for a subject with predictors ‘low’ and ‘single’ is \(88 - 57 - 26 + 21 = 26\), as was the observed mean for these characteristics:
aggregate(score ~ motiv + fam, dat, mean)
## motiv fam score
## 1 high partner 88
## 2 low partner 31
## 3 high single 62
## 4 low single 26
Stochastic model: \[ y_i = \alpha + \beta \cdot x_i + \varepsilon_i ~\text{with}~ E(\varepsilon_i) = 0, ~\text{for all}~ i = 1, \ldots, n \]
\[ \varepsilon_i \sim N(0, \sigma^2) \] \[ Y_i \sim N(\alpha + \beta \cdot x_i, \sigma^2) \]
Wald test for \(\alpha\) or \(\beta\)
Hypotheses
Test statistic \[ T_{\alpha_0} = \frac{\hat{\alpha} - \alpha_0}{\hat{\sigma}_{\hat{\alpha}}} \qquad T_{\alpha_0} \sim t (n - 2) \]
Rejection region
with the \(p\)-quantile \(t_p (n-2)\) of the \(t\) distribution with \(df = n - 2\) degrees of freedom
The test of the \(\beta\) coefficient (and other coefficients in multiple linear regression) follows the same scheme.
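The Wald statistic can be reproduced from a fitted model object. A sketch using the built-in cars data as a stand-in example, testing \(H_0\colon \beta = 0\) for the slope:

```r
## Wald t test for the slope, computed from the fitted object
fit <- lm(dist ~ speed, cars)
b  <- coef(fit)["speed"]              # estimated slope
se <- sqrt(diag(vcov(fit)))["speed"]  # its standard error
t_stat <- (b - 0) / se                # H0: beta = 0
p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)
cbind(t_stat, p_val)
## Compare with the corresponding row of the summary table
summary(fit)$coefficients["speed", ]
```

The manually computed t value and p-value agree with the "t value" and "Pr(>|t|)" entries reported by summary().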
Example
Math ability (amount of math problems solved in 2 minutes) as a linear function of working memory capacity (forwards letter span).
dat <- read.table("dataInvers.txt", header = TRUE)
dat <- na.omit(dat)
plot(jitter(Rechnen, 1) ~ jitter(LetSp_v, 1), dat)
Fit the regression model
lm1 <- lm(Rechnen ~ LetSp_v, dat)
summary(lm1)
##
## Call:
## lm(formula = Rechnen ~ LetSp_v, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.2196 -5.8799 -0.8799 5.3402 24.5603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7602 1.4474 3.289 0.00104 **
## LetSp_v 1.7799 0.2474 7.195 1.22e-12 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.934 on 996 degrees of freedom
## Multiple R-squared: 0.04941, Adjusted R-squared: 0.04846
## F-statistic: 51.77 on 1 and 996 DF, p-value: 1.223e-12
Parameter interpretation and hypothesis tests
Useful functions for model objects:
coef(lm1) # Coefficients
fitted(lm1) # Predicted values
resid(lm1) # Residuals
confint(lm1) # Confidence intervals for parameters
abline(lm1) # add regression line to scatter plot
With only one prediction variable, the proportion of variance (of the dependent variable) explained by the regression is equal to the squared correlation coefficient between the two variables: \[R^2 = \frac{\sum_{i=1}^n(\hat{Y_i} - \bar{Y})^2}{\sum_{i=1}^n(Y_i - \bar{Y})^2} = r_{XY}^2\]
# Variance explained
# See ?summary.lm and str(summary(lm1))
summary(lm1)$r.squared
# Squared correlation coefficient
# See ?cor.test
cor.t <- cor.test(~ Rechnen + LetSp_v, dat)
cor.t$estimate^2
Using predict()
, the predicted values for the dependent
variable can be calculated based on
the observed values of the predictor(s), obtaining the fitted values for the sample
predict(lm1) # or fitted(lm1)
any supplied values for the predictor(s)
pred <- predict(lm1, newdata = data.frame(LetSp_v = c(3, 10)))
In the latter case, a new data frame with predictor variable(s) needs to be supplied. The general syntax is:
predict(mod, newdata = data.frame(x1 = new_xvals1, x2 = new_xvals2))
# See ?predict.lm
Predicted values fall on the fitted regression line.
When using linear models, we should check whether the model assumptions (approximately) hold.
This is often done graphically, with graphics such as:
Scatter plot
Plot of residuals as a function of the predictors or the predicted values
Normal-quantile plots of the (standardized) residuals
Most of this is conveniently performed by the plot method of
lm
objects:
par(mfrow = c(2, 2), # plot layout: 2 rows, 2 columns -> 4 plots in one window
oma = c(0,0,0,0), # outer margin
mar = c(3,3,4,1), # margin
mgp = c(2, .7, 0)) # position axis title/label/line
plot(lm1)
Illustration of a 'Residuals vs Fitted' plot for different (non-)violations of the model assumptions (left side: scatter plot with regression line, right side: 'Residuals vs Fitted' plot)
Generalizing the simple linear regression, we get:
\[ Y_i = \beta_0 + \beta_1 \cdot X_{i1} + \ldots + \beta_p \cdot X_{ip} + \varepsilon_i \] with \(E(\varepsilon_i) = 0\), for all \(i = 1, \ldots, n\)
\[Y_i \sim N (\mu_i, \sigma^2)\] with \(\mu_i = \beta_0 + \beta_1 \cdot x_{i1} + \ldots + \beta_p \cdot x_{ip}\)
More predictors is not always better!
The main issue is multicollinearity, which arises due to correlated predictors. Multicollinearity causes instability of the regression equation, may bias the estimated test statistics, and impedes parameter interpretation (when the size or even sign of a parameter estimate depends on the inclusion of other predictors).
A Wald test for a single parameter \(\beta_j \quad(j = 1, \dots, p)\) can be performed as in simple linear regression, with \(T_{\beta_j} \sim t (n - p)\)
Overall \(F\) test \[ \begin{align} H_0&\colon~ \beta_1 = \ldots = \beta_p = 0\\ H_1&\colon~ \beta_j \neq 0 ~\text{ for at least one}~ j \in \{1, \ldots, p\} \end{align} \] Test statistic \[ F = \frac{R^2}{1 - R^2} \cdot \frac{n - p - 1}{p}, \qquad F \sim F(p, n - p - 1) \] Rejection region \[ F > F_{1-\alpha} (p, n - p - 1) \]
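The relation between the overall \(F\) statistic and \(R^2\) can be verified numerically. A sketch for a model with a single predictor (\(p = 1\)), using the built-in cars data as a stand-in:

```r
## Recover the overall F statistic from R^2
fit <- lm(dist ~ speed, cars)
R2 <- summary(fit)$r.squared
n  <- nrow(cars)
p  <- 1                                   # one predictor
F_stat <- R2 / (1 - R2) * (n - p - 1) / p
F_stat
unname(summary(fit)$fstatistic["value"])  # F statistic reported by summary()
```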
Hypothesis tests involve model comparison.
Wald- and F-tests are appropriate for nested (‘hierarchical’) models, which are created by parameter restriction. Some restriction types are:
Implementation of hierarchical models by parameter restriction in R.
Parameter elimination (parameter is set to \(0\))
lm(y ~ x1)      # (restricted model)
lm(y ~ x1 + x2) # (unrestricted model)
Parameter fixation (parameter is set to \(k\))
lm(y ~ offset(k * x1)) # (restricted model)
lm(y ~ x1) # (unrestricted model)
Parameter equation (parameters are set to the same value) (for two continuous predictors \(x_1\) and \(x_2\))
lm(y ~ I(x1 + x2)) # (restricted model)
lm(y ~ x1 + x2) # (unrestricted model)
Incremental F test
With this test, we can compare two hierarchical (a.k.a. nested) models:
The restricted model \(M_0\) may have parameter restrictions on more than one parameter.
Test statistic and its distribution under the null hypothesis: \[ \begin{align} F &= \frac{(SS_{\hat{Y}1} - SS_{\hat{Y}0})/ (q_1 - q_0)}{SS_{\hat{\varepsilon}1}/(N - q_1)} = \frac{(SS_{\hat{\varepsilon}0} - SS_{\hat{\varepsilon}1})/ (q_1 - q_0)}{SS_{\hat{\varepsilon}1}/(N - q_1)}\\ &= \frac{N - q_1}{q_1 - q_0} \cdot \frac{R^2_1 - R^2_0}{1 - R^2_1} \end{align} \] \[F \sim F(q_1 - q_0, N - q_1)\] The null hypothesis \(H_0\) is stated by the parameter restriction. For the examples further above, they are:
By testing the \(H_0\), we hence
decide whether the restricted model has to be rejected in favor of the
more general model. The incremental F test for linear models is a
special case of the more general Likelihood-ratio test, which can be
applied to (almost) any hierarchical models, not only linear models. In R
we can use the function anova()
to compare nested
models.
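Before turning to real data, the incremental \(F\) formula can be verified against anova() on simulated data (a sketch; variable names are arbitrary):

```r
## Incremental F by hand vs. anova()
set.seed(7)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + 0.4 * x2 + rnorm(n)

m0 <- lm(y ~ x1)       # restricted model,   q0 = 2 parameters
m1 <- lm(y ~ x1 + x2)  # unrestricted model, q1 = 3 parameters
q0 <- 2; q1 <- 3

R2_0 <- summary(m0)$r.squared
R2_1 <- summary(m1)$r.squared
Finc <- (n - q1) / (q1 - q0) * (R2_1 - R2_0) / (1 - R2_1)

c(byHand = Finc, fromAnova = anova(m0, m1)$F[2])  # the two values agree
```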
Math ability and working memory data
dat <- na.omit(read.table("dataInvers.txt", header = TRUE))
lm1 <- lm(Rechnen ~ LetSp_v, dat)           # restricted model
lm2 <- lm(Rechnen ~ LetSp_v + LetSp_r, dat) # unrestricted model
lm2 adds another working memory predictor.
lm2 has exactly one more parameter than lm1, i.e. \(q_1 - q_0 = 1\).
Overall \(F\)-test and Wald tests
summary(lm2)
##
## Call:
## lm(formula = Rechnen ~ LetSp_v + LetSp_r, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0125 -5.8373 -0.9911 5.1627 25.1627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5207 1.4809 2.378 0.017619 *
## LetSp_v 1.3077 0.2796 4.676 3.33e-06 ***
## LetSp_r 0.8676 0.2445 3.548 0.000406 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.888 on 995 degrees of freedom
## Multiple R-squared: 0.06129, Adjusted R-squared: 0.0594
## F-statistic: 32.48 on 2 and 995 DF, p-value: 2.16e-14
Interpretation: Overall \(F\)-test and Wald tests (for \(\alpha = .05\))
The model lm2
is: \[
\begin{align}
Rechnen_i = \beta_0 & + \beta_1 \cdot X_{i,LetSp_v}
+ \beta_2 \cdot X_{i,LetSp_r} + \varepsilon_i
\end{align}
\]
What does a one unit increase in \(X_{i,LetSp_v}\) mean?
Let’s look at the difference in prediction for persons \(i\) and \(j\) with:
\[ \begin{align} \widehat{Rechnen_i} &= \beta_0 + \beta_1 \cdot X_{i,LetSp_v} &+ \beta_2 \cdot c\\ \widehat{Rechnen_j} &= \beta_0 + \beta_1 \cdot (X_{i,LetSp_v} + 1) &+ \beta_2 \cdot c\\ \widehat{Rechnen_j} - \widehat{Rechnen_i} &= \beta_1 \end{align} \] So \(\beta_1\) is the difference in math ability for two persons with equal backward letter span but a one unit difference in forward letter span. It is the effect of forward letter span, all else being equal (ceteris paribus).
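This ceteris paribus interpretation can be checked numerically with predict(): raising one predictor by one unit while holding the other fixed changes the prediction by exactly that coefficient. A sketch with simulated data (names are hypothetical):

```r
## Prediction difference for a one-unit increase equals the coefficient
set.seed(3)
d   <- data.frame(v = rnorm(50), r = rnorm(50))
d$y <- 2 + 1.5 * d$v + 0.5 * d$r + rnorm(50)
fit <- lm(y ~ v + r, d)

p1 <- predict(fit, newdata = data.frame(v = 4, r = 6))
p2 <- predict(fit, newdata = data.frame(v = 5, r = 6))  # v + 1, r held fixed
c(difference = unname(p2 - p1), coefficient = unname(coef(fit)["v"]))
```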
Incremental \(F\)-test (model comparison)
anova(lm1, lm2)
## Analysis of Variance Table
##
## Model 1: Rechnen ~ LetSp_v
## Model 2: Rechnen ~ LetSp_v + LetSp_r
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 996 62696
## 2 995 61912 1 783.25 12.588 0.0004064 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s look at a situation where \(q_1 - q_0 > 1\).
First, create a categorical variable for age with four categories
quantile(dat$Alter)
## 0% 25% 50% 75% 100%
## 18 21 24 45 87
dat$AlterKat <- ifelse(dat$Alter < 21, "18to20",
                ifelse(dat$Alter < 24, "21to23",
                ifelse(dat$Alter < 45, "24to44", "45to87")))
dat$AlterKat <- factor(dat$AlterKat)
summary(dat$AlterKat)
## 18to20 21to23 24to44 45to87
## 247 240 250 261
table(dat$AlterKat)
##
## 18to20 21to23 24to44 45to87
## 247 240 250 261
Add effect of age category to effect of forward letter span
lm3 <- lm(Rechnen ~ LetSp_v + AlterKat, dat)
In dummy coding, factors with \(k\) levels are represented by \(k - 1\) dummy variables:
| AlterKat | AlterKat21to23 | AlterKat24to44 | AlterKat45to87 |
|----------|----------------|----------------|----------------|
| 18to20   | 0              | 0              | 0              |
| 21to23   | 1              | 0              | 0              |
| 24to44   | 0              | 1              | 0              |
| 45to87   | 0              | 0              | 1              |
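The dummy variables R constructs for a factor can be inspected directly with model.matrix() or contrasts(); a minimal sketch:

```r
## Inspect the dummy coding R uses for a factor
f <- factor(c("18to20", "21to23", "24to44", "45to87"))
contrasts(f)       # the k - 1 dummy columns; first level is the reference
model.matrix(~ f)  # design matrix: intercept column plus the dummies
```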
So the model lm3
is: \[
\begin{align}
Rechnen_i = \beta_0 & + \beta_1 \cdot X_{i,LetSp_v}
+ \beta_2 \cdot X_{i,AlterKat21to23} \\
& + \beta_3 \cdot X_{i,AlterKat24to44}
+ \beta_4 \cdot X_{i,AlterKat45to87}
+ \varepsilon_i
\end{align}
\]
Wald tests for lm3
summary(lm3)
##
## Call:
## lm(formula = Rechnen ~ LetSp_v + AlterKat, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.426 -5.662 -1.189 5.338 25.338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2741 1.5219 2.151 0.031690
## LetSp_v 1.8979 0.2470 7.683 3.73e-14
## AlterKat21to23 0.6298 0.7132 0.883 0.377360
## AlterKat24to44 -0.1332 0.7058 -0.189 0.850381
## AlterKat45to87 2.6310 0.7012 3.752 0.000185
##
## (Intercept) *
## LetSp_v ***
## AlterKat21to23
## AlterKat24to44
## AlterKat45to87 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.867 on 993 degrees of freedom
## Multiple R-squared: 0.06827, Adjusted R-squared: 0.06451
## F-statistic: 18.19 on 4 and 993 DF, p-value: 1.977e-14
Model selection
anova(lm1, lm3) # Model lm3 is chosen
## Analysis of Variance Table
##
## Model 1: Rechnen ~ LetSp_v
## Model 2: Rechnen ~ LetSp_v + AlterKat
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 996 62696
## 2 993 61452 3 1243.4 6.6973 0.0001779 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This decision cannot be based on summary(lm3), because the Wald tests are each concerned with a single \(\beta\) parameter.
For drawing random numbers from a statistical distribution, the distribution name is prefixed by r.
Examples (see ?Distributions
for a list of
distributions):
rnorm(10) # draw from standard normal distribution
runif(10) # draw from uniform distribution
Sampling with or without replacement from a vector is performed by
the sample()
function.
sample(1:5, size = 10, replace = TRUE)
The random number generator in R is seeded: In each run, new random numbers are generated. To obtain the same results as in a previous simulation, the seed (‘starting value’) can be set explicitly:
set.seed(1223) # set seed, so on each run
runif(3) # random numbers will be identical
## [1] 0.6289619 0.1267469 0.3285822
set.seed(1223) # set seed, so on each run
runif(3) # random numbers will be identical
## [1] 0.6289619 0.1267469 0.3285822
replicate()
replicate()
is a wrapper function for
sapply()
designed for repeated execution of the same
expression. It makes tasks like random number generation very simple and
avoids writing loops:
# Two alternative calls using sapply()
sapply(1:3, FUN = rnorm, n = 5)
## [,1] [,2] [,3]
## [1,] 0.4336592 1.8009529 2.994275
## [2,] 1.9911233 0.6681418 3.093450
## [3,] 1.6196675 2.2842026 2.342798
## [4,] -0.2708700 2.2736581 3.587997
## [5,] -0.2273296 2.0647675 1.838956
sapply(1:3, function(x) rnorm(5))
## [,1] [,2] [,3]
## [1,] 0.04962477 -0.4377098 -0.5655823
## [2,] 0.30150230 -0.2300038 -0.3912522
## [3,] -0.98600835 0.8156111 -1.3839161
## [4,] 0.56849428 -0.6653574 1.8774840
## [5,] -0.10353400 0.4355629 0.1575512
# Same number generation, but simpler call using replicate()
replicate(3, rnorm(5))
## [,1] [,2] [,3]
## [1,] -1.7661080 0.73675570 1.0209950
## [2,] -2.2177442 0.23925922 1.1338884
## [3,] -0.6701462 0.01314772 2.4327815
## [4,] -0.1332934 -0.23338271 0.1711988
## [5,] -1.0773198 0.95471266 1.4126050
A set of multiple expressions may be repeated.
Example: Unbiased estimation of the population variance \(\sigma^2\)
# Using replicate()
v <- replicate(100000, {
  x <- rnorm(n = 10, mean = 0, sd = 5)
  sum((x - mean(x))^2) / (10 - 1)
})
mean(v)
## [1] 25.0417
The power is the probability of a test to become significant given H\(_0\) is false, i.e., to (correctly) detect an effect.
Power can be estimated by simulation: It is the proportion of significant tests for a specified effect size, variability, sample size, and \(\alpha\).
Example: Power of one-sample \(t\)-test to detect an effect of 3 Hz at \(\sigma = 8\) Hz, \(n = 30\), and \(\alpha = 5\)%
pval <- replicate(1000, {
  x <- rnorm(30, mean = 443, sd = 8) # draw from effect distrib. 443 - 440 = 3 Hz
  t.test(x, mu = 440)$p.value        # test against H0
})
mean(pval < 0.05) # simulated power (proportion of significant tests)
## [1] 0.504
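For the one-sample \(t\)-test, the simulated power can be compared with the analytic result from power.t.test() in the stats package; its value should be close to the simulated proportion above:

```r
## Analytic power for the same scenario
power.t.test(n = 30, delta = 3, sd = 8, sig.level = 0.05,
             type = "one.sample")$power
```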
Bootstrap: simulation methods for frequentist inference
The name goes back to the famous story of Baron Münchhausen, who was stuck with his horse in a swamp and pulled himself out by his own bootstraps (in the German version: am eigenen Schopf, by his own hair). We’ll see that something similar can actually be done in statistics …
Useful when (Davison, 2006):
Algorithm (Efron & Tibshirani, 1993; Hesterberg et al., 2005, fig. 18.5):
Select \(B\) bootstrap samples \(\mathbf{x}^{*1}, \mathbf{x}^{*2}, \ldots, \mathbf{x}^{*B}\), each consisting of \(n\) data values drawn with replacement from the original data \(\mathbf{x}\).
Compute the bootstrap replicate \(\mathbf{\hat\vartheta}^* = s(\mathbf{x}^*)\) of the statistic of interest \(\mathbf{\hat\vartheta} = s(\mathbf{x})\) for each sample.
Assess the variability of the statistic via the distribution of bootstrap replicates.
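The three steps above can be sketched in a few lines, here for the standard error of the median (the data are simulated for illustration):

```r
## Bootstrap SE of the median
set.seed(11)
x <- rexp(25, rate = 1/10)  # 'original' sample of n = 25 values
B <- 2000

## Steps 1 and 2: resample with replacement, compute the replicate
theta.star <- replicate(B, median(sample(x, replace = TRUE)))

## Step 3: assess the variability of the statistic
sd(theta.star)              # bootstrap estimate of SE(median)
```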
Bootstrap algorithm
Bootstrap confidence intervals
For a given significance level \(\alpha\), the corresponding limits of the percentile interval of the bootstrap distribution are equal to its \(\alpha/2\) and \(1-\alpha/2\) quantiles.
This percentile interval is a \((1 - \alpha)\) bootstrap confidence interval for \(\vartheta\).
## Mouse data (Efron & Tibshirani, 1993, p. 11)
mouse <- data.frame(
  grp  = rep(c("trt", "ctl"), c(7, 9)),
  surv = c(94, 197, 16, 38, 99, 141, 23,          # trt
           52, 104, 146, 10, 50, 31, 40, 27, 46)) # ctl
## Observed difference of means
t.test(surv ~ grp, mouse, var.equal = TRUE)
##
## Two Sample t-test
##
## data: surv by grp
## t = -1.1214, df = 14, p-value = 0.281
## alternative hypothesis: true difference in means between group ctl and group trt is not equal to 0
## 95 percent confidence interval:
## -89.22770 27.95786
## sample estimates:
## mean in group ctl mean in group trt
## 56.22222 86.85714
meandiff <- with(mouse, diff(tapply(surv, grp, mean)))
## Resampling the observations and calculating the difference of means
sam1 <- numeric(1000)
for(i in seq_along(sam1)) {
  trt <- sample(mouse[mouse$grp == "trt", "surv"], 7, replace = TRUE)
  ctl <- sample(mouse[mouse$grp == "ctl", "surv"], 9, replace = TRUE)
  sam1[i] <- mean(trt) - mean(ctl)
}
## 95% bootstrap confidence interval for difference of means
quantile(sam1, c(.025, .975))
## 2.5% 97.5%
## -19.30992 79.88492
hist(sam1, breaks=20, freq = F)
abline(v = meandiff, lwd = 2, col = "darkblue") # add observed diff.
abline(v = quantile(sam1, c(.025, .975)), lty = 2) # add bootstrap CI