Statistical Models

Appendix B:
More on R

Outline of Appendix B

  1. Lists
  2. Data Frames
  3. Data Entry
  4. R Style Guide

Part 1:
Lists

Lists

  • Vectors can contain only one data type (number, character, boolean)

  • Lists are data structures that can contain any R object

  • Lists can be created similarly to vectors, with the command list()

# List containing a number, a vector, and a string
my_list <- list(2, c(T,F,T,T), "hello")

# Print the list
print(my_list)
[[1]]
[1] 2

[[2]]
[1]  TRUE FALSE  TRUE  TRUE

[[3]]
[1] "hello"

Retrieving elements

Elements of a list can be retrieved by indexing

  • my_list[[k]] returns the k-th element of my_list


# Consider again the same list
my_list <- list(2, c(T,F,T,T), "hello")

# Access 2nd element of my_list and store it in variable
second_element <- my_list[[2]]

# In this case the variable second_element is a vector
print(second_element)
[1]  TRUE FALSE  TRUE  TRUE

List slicing

You can return multiple items of a list via slicing

  • my_list[c(k1, ..., kn)] returns elements in positions k1, ..., kn
  • my_list[k1:k2] returns elements k1 to k2
my_list <- list(2, c(T,F), "Cat", "Dog", pi, 42)

# We store 1st, 3rd, 5th entries of my_list in slice
slice <- my_list[c(1, 3, 5)]

print(slice)
[[1]]
[1] 2

[[2]]
[1] "Cat"

[[3]]
[1] 3.141593

List slicing

my_list <- list(2, c(T,F), "Cat", "Dog", pi, 42)

# We store 2nd to 4th entries of my_list in slice
slice <- my_list[2:4]

print(slice)
[[1]]
[1]  TRUE FALSE

[[2]]
[1] "Cat"

[[3]]
[1] "Dog"
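
Slicing with single brackets always returns a list, even when only one position is requested, whereas double brackets return the stored element itself. A minimal sketch of the difference, reusing the list above (the names as_list and as_element are introduced here for illustration only):

```r
my_list <- list(2, c(TRUE, FALSE), "Cat", "Dog", pi, 42)

# Single brackets: the result is a list of length 1
as_list <- my_list[3]

# Double brackets: the result is the stored object itself
as_element <- my_list[[3]]

print(class(as_list))         # "list"
print(class(as_element))      # "character"
print(length(my_list[2:4]))   # 3
```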

Naming

  • Components of a list can be named. Names can be assigned with
    • names(my_list) <- c("name_1", ..., "name_k")
# Create list with 3 elements
my_list <- list(2, c(T,F,T,T), "hello")

# Name each of the 3 elements
names(my_list) <- c("number", "TF_vector", "string")

# Print the named list: the list is printed along with element names 
print(my_list)
$number
[1] 2

$TF_vector
[1]  TRUE FALSE  TRUE  TRUE

$string
[1] "hello"

Accessing a name

  • A component of my_list named my_name can be accessed with dollar operator
    • my_list$my_name
# Create list with 3 elements and name them
my_list <- list(2, c(T,F,T,T), "hello")
names(my_list) <- c("number", "TF_vector", "string")

# Access 2nd element using dollar operator and store it in variable
second_component <- my_list$TF_vector

# Print 2nd element
print(second_component)
[1]  TRUE FALSE  TRUE  TRUE
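
Names can also be supplied when the list is created, and a named component can be retrieved with double brackets as well as with the dollar operator. A short sketch, assuming the same three components as above (by_dollar and by_bracket are illustrative names):

```r
# Names supplied directly inside list()
my_list <- list(
  number = 2,
  TF_vector = c(TRUE, FALSE, TRUE, TRUE),
  string = "hello"
)

# Same names as those assigned with names() above
print(names(my_list))

# Double brackets with a name behave like the dollar operator
by_dollar <- my_list$TF_vector
by_bracket <- my_list[["TF_vector"]]

print(identical(by_dollar, by_bracket))   # TRUE
```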

Part 2:
Data Frames

Data Frames

  • Data Frames are the best way of presenting a data set in R:

    • Each variable is assigned a collection of recorded observations
  • Data frames can contain any R object

  • Data Frames are similar to Lists, with the difference that:

    • Members of a Data Frame must all be vectors of equal length

Constructing a Data Frame

  • Data frames are constructed similarly to lists, using data.frame()

  • Important: Elements of data frame must be vectors of the same length

  • Example: We construct the Family Guy data frame. Variables are

    • person – Name of character
    • age – Age of character
    • sex – Sex of character
family <- data.frame(
  person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 17, 14, 1),
  sex = c("M", "F" , "F", "M", "M")
)
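
The equal-length requirement can be checked directly. The sketch below uses two small made-up data frames (ok and result are illustrative names) and catches the error with tryCatch() so that the script keeps running:

```r
# Columns of equal length: this works
ok <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
print(dim(ok))

# Columns of length 3 and 2: data.frame() stops with an error
result <- tryCatch(
  data.frame(x = c(1, 2, 3), y = c("a", "b")),
  error = function(e) conditionMessage(e)
)

# Prints the error message instead of a data frame
print(result)
```

Note that data.frame() recycles a vector whose length divides the number of rows exactly (for example a single value), so not every length mismatch produces an error.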

Printing a Data Frame

  • R prints data frames like matrices
  • First row contains vector names
  • First column contains row names
  • Data are paired: e.g. Peter is 42 and Male
family <- data.frame(
  person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 17, 14, 1),
  sex = c("M", "F" , "F", "M", "M")
)

print(family)
  person age sex
1  Peter  42   M
2   Lois  40   F
3    Meg  17   F
4  Chris  14   M
5 Stewie   1   M

Extracting data

  • Think of a data frame as a matrix

  • You can extract element in position (m,n) by using

    • my_data[m, n]
  • Example: Peter is in 1st row. We can extract Peter’s name as follows

extracted <- family[1, 1]

print(extracted)
[1] "Peter"

Extracting data

To extract multiple elements on the same row or column, type

  • my_data[c(k1,...,kn), m] or my_data[k1:k2, m]
  • my_data[n, c(k1,...,km)] or my_data[n, k1:k2]

Example: Meg is listed in 3rd row. We extract her age and sex

meg_data <- family[3, 2:3]

print(meg_data)
  age sex
3  17   F

Extracting data

To extract entire rows or columns, type

  • my_data[c(k1,...,kn), ] or my_data[k1:k2, ]
  • my_data[, c(k1,...,km)] or my_data[, k1:k2]
peter_data <- family[1, ]      # Extracts first row - Peter
sex_age <- family[, c(3,2)]    # Extracts third and second columns:
                               # sex and age

print(peter_data)
print(sex_age)
  person age sex
1  Peter  42   M
  sex age
1   M  42
2   F  40
3   F  17
4   M  14
5   M   1
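
Columns can also be selected by name, which is often easier to read than positions. The sketch below also shows that a single extracted column is returned as a vector unless drop = FALSE is given (age_vector and age_frame are illustrative names):

```r
family <- data.frame(
  person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 17, 14, 1),
  sex = c("M", "F", "F", "M", "M")
)

# Select columns by name instead of by position
sex_age <- family[, c("sex", "age")]

# A single column is returned as a vector by default
age_vector <- family[, 2]

# drop = FALSE keeps the result as a one-column data frame
age_frame <- family[, 2, drop = FALSE]

print(names(sex_age))
print(is.data.frame(age_vector))   # FALSE
print(is.data.frame(age_frame))    # TRUE
```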

Extracting data

Use dollar operator to access data frame columns

  • Suppose data set my_data contains a variable called my_variable
  • my_data$my_variable accesses column my_variable
  • my_data$my_variable is a vector

Example: To access age in the family data frame type

ages <- family$age        # Stores ages in a vector

cat("Ages of the Family Guy characters are", ages)
cat("Meg's age is", ages[3])
Ages of the Family Guy characters are 42 40 17 14 1
Meg's age is 17
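
Since my_data$my_variable is an ordinary vector, it can be passed to any function that accepts vectors. A small sketch on the family data frame (mean_age and oldest are illustrative names):

```r
family <- data.frame(
  person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 17, 14, 1),
  sex = c("M", "F", "F", "M", "M")
)

# Average of the age column
mean_age <- mean(family$age)

# Name of the character with the largest age
oldest <- family$person[which.max(family$age)]

cat("Mean age is", mean_age, "\n")
cat("The oldest character is", as.character(oldest), "\n")
```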

Size of a data frame

The size of a data frame can be discovered using:

  • nrow(my_data): number of rows
  • ncol(my_data): number of columns
  • dim(my_data): vector containing number of rows and columns
family_dim <- dim(family)    # Stores dimensions of family in a vector

cat("The Family Guy data frame has", family_dim[1], 
    "rows and", family_dim[2], "columns")
The Family Guy data frame has 5 rows and 3 columns

Adding Data

Adding data to an existing data frame my_data

  • Add more records (adding to rows)
    • Create single row data frame new_record
    • new_record must match the structure of my_data
    • Add to my_data with my_data <- rbind(my_data, new_record)
  • Add a set of observations for a new variable (adding to columns)
    • Create a vector new_variable
    • new_variable must have as many components as rows in my_data
    • Add to my_data with my_data <- cbind(my_data, new_variable)

Example: Add new record

  • Consider the usual Family Guy data frame family
  • Suppose we want to add data for Brian
  • Create a new record: a single row data frame with columns
    • person, age, sex
new_record <- data.frame(
  person = "Brian",
  age = 7,
  sex = "M"
)

print(new_record)
  person age sex
1  Brian   7   M

Example: Add new record

  • Now we add new_record to family
family <- rbind(family, new_record)

print(family)
  person age sex
1  Peter  42   M
2   Lois  40   F
3    Meg  17   F
4  Chris  14   M
5 Stewie   1   M
6  Brian   7   M
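
To see why new_record must match the structure of my_data, the sketch below tries to add a record whose first column is called name instead of person. The names bad_record, good_record and outcome are illustrative:

```r
family <- data.frame(
  person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 17, 14, 1),
  sex = c("M", "F", "F", "M", "M")
)

# Column names differ from those of family: "name" instead of "person"
bad_record <- data.frame(name = "Brian", age = 7, sex = "M")

# rbind() refuses to combine data frames with different column names
outcome <- tryCatch(
  rbind(family, bad_record),
  error = function(e) "rbind failed"
)
print(outcome)

# A record with matching column names is accepted
good_record <- data.frame(person = "Brian", age = 7, sex = "M")
family <- rbind(family, good_record)
print(nrow(family))   # 6
```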

Example: Add new variable

  • We want to add a new variable to the Family Guy data frame family
  • This variable is called funny
  • It records how funny each character is, with levels
    • Low, Med, High
  • Create a vector funny with entries matching each character (including Brian)
funny <- c("High", "High", "Low", "Med", "High", "Med")

print(funny)
[1] "High" "High" "Low"  "Med"  "High" "Med" 

Example: Add new variable

  • Add funny to the Family Guy data frame family
family <- cbind(family, funny)

print(family)
  person age sex funny
1  Peter  42   M  High
2   Lois  40   F  High
3    Meg  17   F   Low
4  Chris  14   M   Med
5 Stewie   1   M  High
6  Brian   7   M   Med

Adding a new variable: alternative way

Instead of using cbind we can add a new variable using the dollar operator:

  • We want to add a variable called new_variable
  • Create a vector v containing values for the new variable
  • v must have as many components as rows in my_data
  • Add to my_data with my_data$new_variable <- v

Adding a new variable: alternative way

Example:

  • We add age expressed in months to the Family Guy data frame family
  • Age in months can be computed by multiplying vector family$age by 12
v <- family$age * 12       # Computes vector of ages in months

family$age.months <- v     # Stores vector as new column in family

print(family)
  person age sex funny age.months
1  Peter  42   M  High        504
2   Lois  40   F  High        480
3    Meg  17   F   Low        204
4  Chris  14   M   Med        168
5 Stewie   1   M  High         12
6  Brian   7   M   Med         84

Logical Record Subsets

  • We saw how to use logical flag vectors to subset vectors

  • We can use logical flag vectors to subset data frames as well

  • Suppose we have a data frame my_data containing a variable my_variable

  • We want to subset the records in my_data for which my_variable satisfies a condition

  • Use commands

    • flag <- condition(my_data$my_variable)
    • my_data[flag, ]

Logical Record Subsets

Example:

  • Consider again the Family Guy data frame family
  • We subset Male characters using flag family$sex == "M"
# Create flag vector for male Family Guy characters
flag <- (family$sex == "M")

# Subset data frame "family" and store in data frame "males"
# (the name "subset" is avoided: it is already a built-in R function)
males <- family[flag, ]

# Print the subset
print(males)
  person age sex funny age.months
1  Peter  42   M  High        504
4  Chris  14   M   Med        168
5 Stewie   1   M  High         12
6  Brian   7   M   Med         84
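
Flags can be combined with the logical operators & (and) and | (or). A minimal sketch, assuming the family data frame with Brian included and only the first three columns (older_males is an illustrative name):

```r
family <- data.frame(
  person = c("Peter", "Lois", "Meg", "Chris", "Stewie", "Brian"),
  age = c(42, 40, 17, 14, 1, 7),
  sex = c("M", "F", "F", "M", "M", "M")
)

# Flag male characters older than 10
flag <- (family$sex == "M") & (family$age > 10)

# Keep only the flagged records
older_males <- family[flag, ]
print(older_males)

# sum() of a logical vector counts the TRUE entries
print(sum(flag))   # 2
```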

Part 3:
Data Entry

Reading data from files

  • R has many functions for reading data from stored files

  • We will see how to read Table-Format files

  • Table-Formats are just tables stored in plain-text files

  • Typical file extensions are:

    • .txt for plain-text files
    • .csv for comma-separated values
  • Table-Formats can be read into R with the command

    • read.table()

Table-Formats

4 key features

  1. Header:
    • If present, header should be the first line of the file
    • Header is used to provide names for each column of data
    • If a header is present, you need to tell this to R when importing
    • If not, R cannot tell if first line is a header or observed data values

Table-Formats

4 key features

  2. Delimiter:
    • A character used to separate the entries in each line
    • Delimiter character cannot be used for anything else in the file
    • Delimiter tells R when a specific entry begins and ends
    • Default delimiter is whitespace

Table-Formats

4 key features

  3. Missing value:
    • Character string used exclusively to denote a missing value
    • When reading the file, R will turn these entries into NA

Table-Formats

4 key features

  4. Comments:
    • Table files can include comments
    • Comment lines start with #
    • R ignores such comments
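
The four features can be seen together in a small sketch. The table below is made up for illustration and is passed to read.table() through its text argument rather than stored in a file. The comma delimiter and the ? missing-value marker are choices made for this example, set through the sep and na.strings options:

```r
# Hypothetical table, given as lines of text instead of a file
raw_lines <- c(
  "# Heights of three made-up people",
  "name,height",
  "Ann,170",
  "Bob,?",
  "Cleo,182"
)

heights <- read.table(
  text = raw_lines,
  header = TRUE,      # First non-comment line holds the column names
  sep = ",",          # Delimiter is a comma instead of whitespace
  na.strings = "?"    # "?" marks a missing value
)

# The comment line is ignored and "?" has become NA
print(heights)
```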

Table-Formats

Example

  • Table-Format for Family Guy characters can be downloaded here: family_guy.txt
  • The text file looks like this

  • Remarks:
    • Header is present
    • Delimiter is whitespace
    • Missing values denoted by *

read.table command

  • Table-Formats can be read via read.table()
    • This reads a .txt or .csv file and outputs a data frame
  • Options of read.table()
    • header = T/F – Tells R if a header is present
    • na.strings = "string" – Tells R that "string" means NA

Reading our first Table-Format file

To read family_guy.txt into R proceed as follows:

  1. Download family_guy.txt and move file to Desktop

  2. Open the R Console and change working directory to Desktop

# In MacOS type
setwd("~/Desktop")

# In Windows type
setwd("C:/Users/YourUsername/Desktop")

Reading our first Table-Format file

  3. Read family_guy.txt into R and store it in data frame family with code
family <- read.table(file = "family_guy.txt",
                     header = TRUE,
                     na.strings = "*"
                     )
  4. Note that we are telling read.table() that
    • family_guy.txt has a header
    • Missing values are denoted by *

Reading our first Table-Format file

  5. Print data frame family to screen
print(family)
  person age sex funny age.mon
1  Peter  NA   M  High     504
2   Lois  40   F  <NA>     480
3    Meg  17   F   Low     204
4  Chris  14   M   Med     168
5 Stewie   1   M  High      NA
6  Brian  NA   M   Med      NA
  • For comparison this is the .txt file
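
Once missing values are stored as NA, they need some care in computations. The sketch below rebuilds the table from the printed data frame above. The lines in file_lines are an assumption about what family_guy.txt contains, and the actual file may differ, for example in spacing:

```r
# Assumed file contents, rebuilt from the printed data frame
file_lines <- c(
  "person age sex funny age.mon",
  "Peter * M High 504",
  "Lois 40 F * 480",
  "Meg 17 F Low 204",
  "Chris 14 M Med 168",
  "Stewie 1 M High *",
  "Brian * M Med *"
)

family <- read.table(text = file_lines, header = TRUE, na.strings = "*")

# Count missing values in each column
print(colSums(is.na(family)))

# mean() returns NA unless missing values are removed first
print(mean(family$age))                 # NA
print(mean(family$age, na.rm = TRUE))   # 18
```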

Application: t-test

Example: Analysis of Consumer Confidence Index for 2008 crisis from Lecture 4

  • We imported data into R using c()
  • This is ok for small datasets
  • Suppose the CCI data is stored in a .txt file instead

Goal: Perform a t-test on the CCI differences, with null hypothesis that the mean difference is \mu = 0

  • By reading CCI data into R using read.table()
  • By manipulating CCI data using data frames

Application: t-test

  • The CCI dataset can be downloaded here: 2008_crisis.txt

  • The text file looks like this

Application: t-test

To perform the t-test on data 2008_crisis.txt we proceed as follows:

  1. Download dataset 2008_crisis.txt and move file to Desktop

  2. Open the R Console and change working directory to Desktop

# In MacOS type
setwd("~/Desktop")

# In Windows type
setwd("C:/Users/YourUsername/Desktop")
  3. Read 2008_crisis.txt into R and store it in data frame scores with code
scores <- read.table(file = "2008_crisis.txt",
                     header = TRUE
                     )

Application: t-test

  4. Store 2nd and 3rd columns of scores into 2 vectors
# CCI from 2007 is stored in 2nd column
score_2007 <- scores[, 2]

# CCI from 2009 is stored in 3rd column
score_2009 <- scores[, 3]
  5. Now the t-test can be performed as done in Lecture 4
# Compute vector of differences
difference <- score_2007 - score_2009

# Perform t-test on difference with null hypothesis mu = 0
t.test(difference, mu = 0)

Application: t-test

  6. We obtain the same result as in Lecture 4
    • The p-value satisfies p < 0.05
    • Reject H_0: The mean difference is not 0
    • In detail, the output of t.test is below

    One Sample t-test

data:  difference
t = 38.144, df = 11, p-value = 4.861e-13
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 68.15960 76.50706
sample estimates:
mean of x 
 72.33333 
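
The object returned by t.test() is a list, so individual quantities can be extracted with the dollar operator. The sketch below uses made-up scores, not the CCI data, and also checks that a one-sample t-test on the differences agrees with t.test() called with paired = TRUE. The names score_before and score_after are hypothetical:

```r
# Hypothetical scores, NOT the CCI data from 2008_crisis.txt
score_before <- c(10.2, 11.5, 9.8, 12.1, 10.9, 11.3)
score_after <- c(8.1, 9.9, 8.5, 10.0, 9.2, 9.6)

# One-sample t-test on the vector of differences
result_diff <- t.test(score_before - score_after, mu = 0)

# Paired t-test on the two vectors
result_paired <- t.test(score_before, score_after, paired = TRUE)

# Extract the t statistic and the p-value from the results
print(result_diff$statistic)
print(result_paired$statistic)
print(result_diff$p.value)
```

Both calls report the same t statistic, degrees of freedom and p-value, so either form can be used when the two samples are paired.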

Part 4:
R Style Guide

R Style Guide

  • Styling your code is optional
  • However it is considered good manners to do so
  • Good coding style makes code more readable
  • Highly recommended, especially for assignments
  • The next few slides on Style are based on these two posts:
    • Style Guide by Hadley Wickham
    • Google’s R Style Guide

File names

They should be meaningful and end in .R

# Good
football-models.R  
utility-functions.R
homework_1.R
homework1.R

# Bad
footballmodels.r # Hard to read
stuff.r          # What is inside this file?
code.r           # Same as above

Object names

  • Object names should be lowercase
  • Use an underscore (_) to separate words within a name
  • Variable names should be nouns, not verbs
  • Come up with names that are concise and meaningful
# Good
day_one  # This will clearly store the value of first day
day_1    # Still clear


# Bad
first_day_of_the_month  # Too long
dayone                  # Hard to read
DayOne                  # Mix of upper and lower case
fdm                     # Hard to guess what this means

Function names

  • Name functions with BigCamelCase
  • This is to clearly distinguish functions from other objects
  • Function names should be verbs
  • Come up with names that are concise and meaningful
# Good
DoNothing <- function() {
  return(invisible(NULL))
}

# Bad
donothing <- function() {
  return(invisible(NULL))
}

Object and function names

If possible avoid using names of existing functions and variables

# Bad
T <- FALSE                  # T is the built-in shorthand for TRUE
c <- 10                     # c is the built-in function for combining values
mean <- function(x) sum(x)  # mean already denotes a built in function
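
The sketch below shows what goes wrong when a built-in name is reused, and how to recover. The vector values is made up for illustration:

```r
values <- c(1, 2, 3)

# Masking the built-in mean() with a function that sums instead
mean <- function(x) sum(x)
masked_result <- mean(values)        # Calls the new definition

# The original is still reachable through its namespace
base_result <- base::mean(values)

# Removing the user-defined object makes mean() refer to base again
rm(mean)
restored_result <- mean(values)

print(c(masked_result, base_result, restored_result))   # 6 2 2
```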

Assignment

Use <- and not = for assignment

# Good
x <- 5

# Bad
x = 5

Spacing

  • Spacing is really something you should be careful about
  • Place spaces around all infix operators (=, +, -, <-, etc.)
  • Place spaces around = when calling a function
  • Always put a space after a comma, never before (like in regular English)
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)

Spacing with Brackets

  • Do not place spaces around code in parentheses or square brackets
  • Unless there is a comma
# Good
if (condition) do(x)
diamonds[5, ]

# Bad
if ( condition ) do(x)  # No spaces around condition
x[1,]                   # Needs a space after the comma
x[1 ,]                  # Space goes after comma not before

Spacing - Exceptions

  • Symbols :, :: and ::: do not need spacing
# Good
x <- 1:10

# Bad
x <- 1 : 10
  • Place a space before left parentheses, except in a function call
# Good
if (condition) do(x)
plot(x, y)

# Bad
if(condition)do(x)    # (condition) needs spacing
plot (x, y)           # This does not need spacing

Extra Spacing

Extra spacing is ok if it improves alignment of = or <-

list(
  total = a + b + c, 
  mean  = (a + b + c) / n
)

Curly braces

  • An opening curly brace should never go on its own line
  • An opening curly brace should always be followed by a new line
  • Always indent the code inside curly braces
# Good

if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} 
# Bad

if (y < 0 && debug)
message("Y is negative")


if (y == 0) 
{
  log(x)} 

Line length

  • Limit code to 80 characters per line
  • This fits comfortably on a printed page
  • If you run out of room, encapsulate some of the work in a separate function

Indentation

  • When indenting your code, use two spaces
  • Never use tabs or mix tabs and spaces
  • Indentation should be used for functions, if, for, etc.
SumTwoNumbers <- function(x, y) {
  s <- x + y
  return(s)
}

Indentation - Exception

If a function definition runs over multiple lines, indent the second line to where the definition starts

long_function_name <- function(a = "a long argument", 
                               b = "another argument",
                               c = "another long argument") {
  # As usual code is indented by two spaces.
}

Use explicit returns

  • Functions can return objects
  • R has an implicit return feature
  • Do not rely on this feature, but explicitly mention return(object)
# Good
AddValues <- function(x, y) {
  return(x + y)                     # Function returns x+y
}

# Bad
AddValues <- function(x, y) {
  x + y                             # Function still returns x+y
}                                   # but it is not immediate to see it

Named arguments

  • Often you can call a function without explicitly naming arguments:

    • plot(height, weight)
    • mean(weight)
  • This might be fine for plot() or mean()

  • However for less common functions:

    • One might struggle to remember the meaning of each argument position
    • It is therefore good practice to name arguments
# Good
seq(from = 1, to = 11, by = 1)

# Bad
seq(1, 11, 1)
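
A further benefit of naming arguments is that their order no longer matters. A short sketch with seq(), using a step of 2 for illustration (by_position and by_name are illustrative names):

```r
# Positional call: the reader must remember the argument order
by_position <- seq(1, 11, 2)

# Named call: arguments can be given in any order
by_name <- seq(by = 2, to = 11, from = 1)

print(by_position)                       # 1 3 5 7 9 11
print(identical(by_position, by_name))   # TRUE
```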

Comments

  • Most importantly: Comment your code
  • Each line of a comment should begin with comment symbol # and a single space
# Here we sum two numbers  
x + y
  • Use commented lines of - and = to break up code into easily readable chunks
# Load data ---------------------------

# Plot data ---------------------------