plot(rnorm(1000))
Lecture 3
Two hypotheses are complementary if exactly one of them can be true
Complementary hypotheses are called:
Goal: Find a way to decide which of H_0 and H_1 is true
We denote by:
For \Theta_0 \subset \Theta we define the associated null and alternative hypotheses as \begin{align*} H_0 \colon & \theta \in \Theta_0 & \qquad \text{ null hypothesis} \\ H_1 \colon & \theta \in \Theta_0^c & \qquad \text{ alternative hypothesis} \end{align*}
A hypothesis test is a rule to decide:
The sample space is partitioned into two regions:
In most cases: Critical region is defined in terms of a test statistic W(\mathbf{x})
Example: We could decide to reject H_0 if W(\mathbf{x}) \in R with R \subset \mathbb{R} some rejection region
Let \theta be a one-dimensional parameter. A hypothesis test is:
One-sided: if the null and alternative hypotheses are of the form H_0 \colon \theta \leq \theta_0 \,, \qquad H_1 \colon \theta > \theta_0 or also H_0 \colon \theta \geq \theta_0 \,, \qquad H_1 \colon \theta < \theta_0
Two-sided: if the null and alternative hypotheses are of the form H_0 \colon \theta = \theta_0 \,, \qquad H_1 \colon \theta \neq \theta_0
We want to assess whether a coin is fair
To test fairness, toss the coin many times and record the outcomes
\theta = proportion of Heads
The decision is between:
Hypothesis test: \qquad \quad H_0 \colon \theta = \frac12 \,, \qquad H_1 \colon \theta \neq \frac12
A University wants to advertise its MBA Program: \text{ MBA } = \text{ higher salary }
Is this a true or false statement?
The University has only access to incomplete data (could not ask all former students). Need hypothesis testing
\theta = average change in salary after completing the MBA program
Hypothesis test: \qquad \quad H_0 \colon \theta \leq 0 \,, \qquad H_1 \colon \theta > 0
Goal: estimate the mean \mu of a normal population N(\mu,\sigma^2). If \mu_0 is a guess for \mu, we test H_0 \colon \mu = \mu_0 \qquad H_1 \colon \mu \neq \mu_0
Compute the estimated standard error \mathop{\mathrm{e.s.e.}}= \frac{s}{\sqrt{n}}
Compute the sample t-statistic t = \frac{\text{estimate } - \text{ hypothesised value}}{\mathop{\mathrm{e.s.e.}}} = \frac{\overline x - \mu_0}{s/\sqrt{n}}
\mu_0 is the value of the null hypothesis H_0
After computing t-statistic, we need to compute the p-value
p-value is a measure of how likely we are to observe the data if we assume the null hypothesis is true
We have 2 options:
In this module we reject H_0 for p-values p<0.05
For the two-sided t-test, the p-value is defined as p := 2P(t_{n-1} > |t| \, | \, H_0) where t_{n-1} follows the t-distribution with n-1 degrees of freedom
In other words, the p-value is p = 2P(\text{Observing values more extreme than |t| }| \, \mu=\mu_0)
p<0.05 means that the test statistic t is extreme: \,\, P(t_{n-1} > |t|)<0.025
t falls in the grey areas in the t_{n-1} plot below: Each grey area measures 0.025
Find Table 13.1 in this file
The critical value t^* = t_{n-1}(0.025) found in the table satisfies P(t_{n-1}>t^*) = 0.025
By definition of p-value for two-sided t-test we have p := 2P(t_{n-1}>|t|)
Therefore, for |t|>t^* \begin{align*} p & := 2P(t_{n-1}>|t|) \\ & < 2P(t_{n-1}>t^*) = 2 \cdot (0.025) = 0.05 \end{align*}
Conclusion: \quad |t|>t^* \iff p<0.05 \qquad (Extreme t \iff low p-value)
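The equivalence above can be checked numerically. The following is a minimal sketch, assuming a sample size n = 12 and two made-up t-statistics; the names n, t_star, p_value, t_small and t_large are illustrative and not part of the lecture.

```r
# Critical value t* = t_{n-1}(0.025): the upper-tail probability is 0.025
n <- 12                      # hypothetical sample size
t_star <- qt(0.025, df = n - 1, lower.tail = FALSE)

# Two-sided p-value p = 2 P(t_{n-1} > |t|)
p_value <- function(t, n) 2 * pt(abs(t), df = n - 1, lower.tail = FALSE)

# Hypothetical t-statistics, one below and one above t*
t_small <- 1.5
t_large <- 3.0

cat("Critical value t* is", t_star, "\n")
cat("t =", t_small, "gives p =", p_value(t_small, n), "\n")
cat("t =", t_large, "gives p =", p_value(t_large, n), "\n")
```

The statistic below t^* should give p above 0.05, and the one above t^* should give p below 0.05.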
Recall that p = 2P ( \text{Observing values more extreme than t } | \mu = \mu_0)
We have two possibilities:
Month | J | F | M | A | M | J | J | A | S | O | N | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CCI 2007 | 86 | 86 | 88 | 90 | 99 | 97 | 97 | 96 | 99 | 97 | 90 | 90 |
CCI 2009 | 24 | 22 | 21 | 21 | 19 | 18 | 17 | 18 | 21 | 23 | 22 | 21 |
Difference | 62 | 64 | 67 | 69 | 80 | 79 | 80 | 78 | 78 | 74 | 68 | 69 |
Using the available data, we need to compute:
Sample mean and standard deviation \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i \qquad s = \sqrt{\frac{\sum_{i=1}^n x_i^2 - n \overline{x}^2}{n-1}}
Test statistic t = \frac{\overline x - \mu_0}{s/\sqrt{n}}
CCI | J | F | M | A | M | J | J | A | S | O | N | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Difference | 62 | 64 | 67 | 69 | 80 | 79 | 80 | 78 | 78 | 74 | 68 | 69 |
\begin{align*} \overline{x} & =\frac{1}{n} \sum_{i=1}^{n} x_i=\frac{1}{12} \left(62+64+67+{\ldots}+68+69\right)=\frac{868}{12}=72.33 \\ \sum_{i=1}^{n} x_i^2 & = 62^2+64^2+67^2+{\ldots}+68^2+69^2 = 63260 \\ s & = \sqrt{ \frac{\sum_{i=1}^n x_i^2 - n \overline{x}^2}{n-1} } = \sqrt{\frac{63260-12\left(\frac{868}{12}\right)^2}{11}} = \sqrt{\frac{474.666}{11}} = 6.5689 \end{align*}
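The arithmetic above can be reproduced in R. This is a minimal sketch, assuming the twelve differences are typed in by hand; the variable names x, n, sum_sq, xbar, s, mu0 and t are illustrative.

```r
# Differences in CCI (2007 minus 2009), copied from the table
x <- c(62, 64, 67, 69, 80, 79, 80, 78, 78, 74, 68, 69)
n <- length(x)

xbar <- sum(x) / n                            # sample mean
sum_sq <- sum(x^2)                            # sum of squares
s <- sqrt((sum_sq - n * xbar^2) / (n - 1))    # sample standard deviation

# t-statistic for the null hypothesis mu_0 = 0
mu0 <- 0
t <- (xbar - mu0) / (s / sqrt(n))

cat("Sample mean:", xbar, "\n")
cat("Sum of squares:", sum_sq, "\n")
cat("Sample standard deviation:", s, "\n")
cat("t-statistic:", t, "\n")
```

Small differences in the last digits compared with the hand computation come from rounding intermediate values.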
Find Table 13.1 in this file
We have computed:
Therefore |t| = 38.145 > 2.201 = t^*
This implies rejecting the null hypothesis H_0 \colon \mu = 0
t-test implies that mean difference in CCI is \mu \neq 0
The sample mean difference is positive (\bar{x}=72.33)
Conclusions:
Concise Statistics with R
Comprehensive R manual
We have installed R. What now?
Launch the R Console. There are two ways:
Don’t have a laptop in class: Run R code in browser
Type the code after the prompt > and press Enter to execute
Example: Plotting 1000 values randomly generated from N(0,1) distribution
Below you can see R code and the corresponding answer
The interactive R Console is OK for short codes
For longer code use R scripts
R scripts are text files with .txt or .R extension
To run a script use source("file_name.R")
Examples of text editors
R session has a working directory associated with it
Unless specified, R will use a default working directory
To check the location of the working directory, use the getwd function
On my MacOS system I get \qquad
File paths are always enclosed in double quotation marks
Note that R uses forward slashes (not backslashes) for paths
You can change the default working directory using the function setwd
In RStudio you can set the working directory from the menu bar: Session -> Set Working Directory -> Choose Directory
help(object_name)
The output of help() can be cryptic
Let us go back to the example of the command plot(rnorm(1000))
The function rnorm(n) outputs n randomly generated numbers from N(0,1)
The above values can be plotted by concatenating the plot command
Note:
The values plotted (next slide) are, for sure, different from the ones listed above
This is because every time you call rnorm(5), new values are generated
We need to store the generated values if we want to re-use them (more later)
Values are assigned to variables with the operator <-
Example:
To assign the value 2 to the variable x, enter x <- 2
To display the value of x, just type x
Continuation of Example:
x has the value 2
x can be used in subsequent operations
If you save the following code in a .R file and run it, you will obtain no output
This is because you need to tell R to print x to screen
This is done with print()
[1] 2
To print a sentence to screen we can use cat()
cat can be used to combine strings and variables in a single output
Save to a plain text file named either
my_first_code.R
my_first_code.txt
Move this file to Desktop
Open the R Console and change working directory to Desktop
Code run successfully!
The sum of 1 and 2 is 3
ls()
You can remove variables from workspace by using
rm()
To completely clear the workspace use
rm(list = ls())
save.image("file_name.RData")
file_name.RData
load("file_name.RData")
Recommended: keep all the files related to a project in a single folder
Such folder will have to be set as working directory in R Console
Saving the workspace could be dangerous
Always store your code in R Scripts
To quit the R Console type q()
.RData file in the working directory
.txt or .R text files
.txt files
c()
# Construct two vectors of radius and height of 6 cylinders
radius <- c(6, 7, 5, 9, 9, 7)
height <- c(1.7, 1.8, 1.6, 2, 1, 1.9)
# Compute the volume of each cylinder and store it in "volume"
volume <- pi * radius^2 * height
# Print volume
print(volume)
[1] 192.2655 277.0885 125.6637 508.9380 254.4690 292.4823
a has 7 components while b has 2 components
a + b is executed as follows:
b is copied 4 times to match the length of a (the extra entry is dropped)
a + b is then obtained by a + \tilde{b} = (1, 2, 3, 4, 5, 6, 7) + (0, 1, 0, 1, 0, 1, 0) = (1, 3, 3, 5, 5, 7, 7)
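The sum above can be reproduced in R. This is a minimal sketch, assuming a = (1, ..., 7) and b = (0, 1) as read off from the worked sum; the name result is illustrative.

```r
# Vectors read off from the worked sum above
a <- c(1, 2, 3, 4, 5, 6, 7)
b <- c(0, 1)

# R recycles b to the length of a. Since 7 is not a multiple of 2,
# R also issues a warning, which is silenced here to keep the output clean
result <- suppressWarnings(a + b)

cat("a + b is: (", result, ")\n")
```

When the longer length is a multiple of the shorter one, no warning is issued.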
Useful applications of broadcasting are:
Two very useful vector operators are:
sum(x)
which returns the sum of the components of x
length(x)
which returns the length of x
x <- c(1, 2, 3, 4, 5)
# Avoid naming variables sum or length: those names belong to R functions
total <- sum(x)
n <- length(x)
cat("Here is the vector x: (", x, ")")
cat("The components of vector x sum to", total)
cat("The length of vector x is", n)
Here is the vector x: ( 1 2 3 4 5 )
The components of vector x sum to 15
The length of vector x is 5
Given a vector \mathbf{x}= (x_1,\ldots,x_n) we want to compute sample mean and variance \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i \,, \qquad s^2 = \frac{\sum_{i=1}^n (x_i - \overline{x})^2 }{n-1}
mean(x) computes the sample mean of x
sd(x) computes the sample standard deviation of x
var(x) computes the sample variance of x
Mathematician | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | x_9 | x_{10} |
---|---|---|---|---|---|---|---|---|---|---|
Wage | 36 | 40 | 46 | 54 | 57 | 58 | 59 | 60 | 62 | 63 |
Question:
# First store the wage data into a vector
x <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
# Compute the sample mean using formula
xbar = sum(x) / length(x)
# Compute the sample mean using built in R function
xbar_check = mean(x)
# We now print both results to screen
cat("Sample mean computed with formula is", xbar)
cat("Sample mean computed with R function is", xbar_check)
cat("They coincide!")
Sample mean computed with formula is 53.5
Sample mean computed with R function is 53.5
They coincide!
# Compute the sample variance using formula
xbar = mean(x)
n = length(x)
s2 = sum( (x - xbar)^2 ) / (n - 1)
# Compute the sample variance using built in R function
s2_check = var(x)
# We now print both results to screen
cat("Sample variance computed with formula is", s2)
cat("Sample variance computed with R function is", s2_check)
cat("They coincide!")
Sample variance computed with formula is 90.27778
Sample variance computed with R function is 90.27778
They coincide!
rivers
Question: Compute the average distance from the center for the rivers data set
Hint: The absolute value of y \in \mathbb{R} is computed with abs(y)
To compute the average distance from the center for the rivers data set, we use the following R functions: mean, sum, abs, length
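One possible solution is sketched below. It assumes that "center" means the sample mean, which the question does not state explicitly; the names x, center and avg_dist are illustrative.

```r
# rivers is a data set that ships with R (a numeric vector of river lengths)
x <- rivers

# Assumption: the "center" is taken to be the sample mean
center <- mean(x)

# Average distance from the center: mean of |x_i - center|
avg_dist <- sum(abs(x - center)) / length(x)

cat("Center (sample mean) is", center, "\n")
cat("Average distance from the center is", avg_dist, "\n")
```

If the center were instead taken to be the median, only the line defining center would change.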
Goal of t-test: Estimate the mean \mu of normal population N(\mu,\sigma^2)
Hypotheses: If \mu_0 is a guess for \mu, we test H_0 \colon \mu = \mu_0 \qquad H_1 \colon \mu \neq \mu_0
Method: Given the sample X_1 ,\ldots,X_n, we consider the statistic T = \frac{\overline{X}-\mu_0}{S/\sqrt{n}} \sim t_{n-1}
Given the data x_1,\ldots,x_n, compute the t-statistic t = \frac{\text{estimate } - \text{ hypothesised value}}{\mathop{\mathrm{e.s.e.}}} = \frac{\overline x - \mu_0}{s/\sqrt{n}} with sample mean and sample standard deviation \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i \,, \qquad s = \sqrt{\frac{\sum_{i=1}^n x_i^2 - n \overline{x}^2}{n-1}}
Find the critical value t^* = t_{n-1}(0.025) in Statistical Table 13.1
Given the sample x_1,\ldots,x_n, R can compute the t-statistic t = \frac{\text{estimate } - \text{ hypothesised value}}{\mathop{\mathrm{e.s.e.}}} = \frac{\overline x - \mu_0}{s/\sqrt{n}}
R can compute the precise p-value (no need for Statistical Tables) p = 2P(t_{n-1}>|t|)
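The two steps can be carried out by hand in R. This is a minimal sketch on the CCI differences, using the distribution function pt of the t-distribution; the names x, n, mu0, t and p are illustrative.

```r
# Differences in CCI from the 2008 Crisis example
x <- c(62, 64, 67, 69, 80, 79, 80, 78, 78, 74, 68, 69)
n <- length(x)
mu0 <- 0                                   # value of the null hypothesis

# t-statistic: (estimate - hypothesised value) / e.s.e.
t <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value p = 2 P(t_{n-1} > |t|)
p <- 2 * pt(abs(t), df = n - 1, lower.tail = FALSE)

cat("t-statistic is", t, "\n")
cat("p-value is", p, "\n")
```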
Note: The above steps can be done simultaneously by using the command t.test
Store the data in a vector: data_vector <- c(x1, ..., xn)
Perform a t-test on data_vector with null hypothesis mu0 using t.test(data_vector, mu = mu0)
mu = mu0 tells R to test the hypothesis H_0 \colon \mu = \mu_0 \,, \qquad H_1 \colon \mu \neq \mu_0
If mu = mu0 is not specified, R assumes \mu_0 = 0
alternative = "greater" tells R to perform one-sided t-test H_0 \colon \mu \leq \mu_0 \,, \qquad H_1 \colon \mu > \mu_0
alternative = "less" tells R to perform one-sided t-test H_0 \colon \mu \geq \mu_0 \,, \qquad H_1 \colon \mu < \mu_0
conf.level = n changes the confidence interval level to n (default is 0.95)
Let us go back to the 2008 Crisis example
Month | J | F | M | A | M | J | J | A | S | O | N | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CCI 2007 | 86 | 86 | 88 | 90 | 99 | 97 | 97 | 96 | 99 | 97 | 90 | 90 |
CCI 2009 | 24 | 22 | 21 | 21 | 19 | 18 | 17 | 18 | 21 | 23 | 22 | 21 |
Difference | 62 | 64 | 67 | 69 | 80 | 79 | 80 | 78 | 78 | 74 | 68 | 69 |
We want to test if there was a change in CCI from 2007 to 2009
We are interested in the difference in CCI
The null hypothesis is that there was (on average) no change in CCI H_0 \colon \mu = 0
The alternative hypothesis is that there was some change: H_1 \colon \mu \neq 0
# Enter CCI data in 2 vectors using function c()
score_2007 <- c(86, 86, 88, 90, 99, 97, 97, 96, 99, 97, 90, 90)
score_2009 <- c(24, 22, 21, 21, 19, 18, 17, 18, 21, 23, 22, 21)
# Compute vector of differences in CCI
difference <- score_2007 - score_2009
# Perform t-test on difference with null hypothesis mu = 0
# Store answer in "answer"
answer <- t.test(difference, mu = 0)
# Print the answer
print(answer)
One Sample t-test
data: difference
t = 38.144, df = 11, p-value = 4.861e-13
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
68.15960 76.50706
sample estimates:
mean of x
72.33333
One Sample t-test
data: difference
t.test has automatically assumed that a one-sample test is desired
difference is the data vector passed to t.test
t = 38.144, df = 11, p-value = 4.861e-13
This is the best part:
Note:
alternative hypothesis: true mean is not equal to 0
R tells us the alternative hypothesis is \qquad H_1 \colon \mu \neq 0
Hence the Null hypothesis tested is \qquad \quad H_0 \colon \mu = 0
Warning:
95 percent confidence interval:
68.15960 76.50706
95 \% confidence interval for the true mean \mu – an interval [a,b] s.t. P(\mu \in [a,b]) \geq 1 - \alpha = 0.95
Interpretation: If you repeat the experiment (on new data) over and over, the interval [a,b] will contain \mu about 95\% of the times
Constructing the confidence interval for t-test:
Recall the t-statistic has t-distribution t = \frac{\overline{x}-\mu}{\mathop{\mathrm{e.s.e.}}} \, \sim \, t_{n-1}
We impose that t is observed with probability 1-\alpha P(- t^* \leq t \leq t^*) = 1-\alpha \,, \qquad t^* = t_{n-1}(\alpha/2)
The 1-\alpha confidence interval is obtained by solving for \mu P(\mu \in [a,b] ) = 1 - \alpha \,, \qquad a = \overline{x} - t^* \times \mathop{\mathrm{e.s.e.}}, \qquad b = \overline{x} + t^* \times \mathop{\mathrm{e.s.e.}}
To obtain 95\% confidence, we need \alpha = 0.05, so that 1-\alpha = 0.95
In this case the confidence interval is \left[ \overline{x} - t^* \times \mathop{\mathrm{e.s.e.}}, \overline{x} + t^* \times \mathop{\mathrm{e.s.e.}}\right] \,, \qquad t^* = t_{n-1}(0.025)
R calculated the above for us, giving the confidence interval \mu \in [68.15960, 76.50706]
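The interval can also be rebuilt from the formula. This is a minimal sketch on the CCI differences; the names x, n, alpha, xbar, ese, t_star, a and b are illustrative.

```r
# Differences in CCI from the 2008 Crisis example
x <- c(62, 64, 67, 69, 80, 79, 80, 78, 78, 74, 68, 69)
n <- length(x)
alpha <- 0.05                              # 1 - alpha = 0.95 confidence

xbar <- mean(x)                            # sample mean
ese <- sd(x) / sqrt(n)                     # estimated standard error

# Critical value t* = t_{n-1}(alpha / 2)
t_star <- qt(alpha / 2, df = n - 1, lower.tail = FALSE)

# End points of the confidence interval
a <- xbar - t_star * ese
b <- xbar + t_star * ese

cat("95% confidence interval: [", a, ",", b, "]\n")
```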
Interpretation: If you repeat the experiment (on new data) over and over, the interval [a,b] will contain \mu about 95\% of the times
sample estimates:
mean of x
72.33333
mean(difference)
The key information is:
A consumer group wishes to see whether the actual mileage of a new SUV matches the advertised 17 miles per gallon
The group suspects it is lower
To test the claim, the group fills the SUV’s tank and records the mileage
This is repeated ten times. The results are below
11.4 | 13.1 | 14.7 | 14.7 | 15.0 | 15.5 | 15.6 | 15.9 | 16.0 | 16.8 |
---|---|---|---|---|---|---|---|---|---|
Question: The data is assumed to be normal. Use R to test the claim H_0 \colon \mu = 17 \,, \qquad H_1 \colon \mu < 17
This is a one-sided t-test. The p-value is computed as p = P( t_{n-1} < t \, | \, \mu = 17 )
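A sketch of how this test might be run in R. The vector name mpg is taken from the printed output; the name answer is illustrative.

```r
# Mileage data from the table above
mpg <- c(11.4, 13.1, 14.7, 14.7, 15.0, 15.5, 15.6, 15.9, 16.0, 16.8)

# One-sided t-test of H_0: mu = 17 against H_1: mu < 17
answer <- t.test(mpg, mu = 17, alternative = "less")

# Print the answer
print(answer)
```

The option alternative = "less" selects the alternative hypothesis \mu < 17, so the p-value is the lower-tail probability.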
One Sample t-test
data: mpg
t = -4.2847, df = 9, p-value = 0.001018
alternative hypothesis: true mean is less than 17
95 percent confidence interval:
-Inf 15.78127
sample estimates:
mean of x
14.87
Conclusion: The p-value is very small \quad p < 0.05 \quad \implies \quad reject H_0
R has extensive built in graphing functions:
Fancier graphing functions are contained in the library ggplot2
(see link)
However we will be using the basic built in R graphing functions
Scatter plot: given two vectors x and y of same length, use plot(x, y)
Example: Suppose we have data on the weights and heights of 6 people
When calling plot() in R Console the plot will appear in a pop-up window
The option pch = 2 changes the shape of the points
pch stands for plotting character
After calling plot(x, y) use lines to linearly interpolate the scatter plot:
lines(x, y)
Let us plot the parabola y = x^2 \,, \qquad x \in [-1,1]
The previous plot was quite rough
This is because we only computed y=x^2 on the grid x = (-1, -0.5, 0, 0.5, 1)
We could refine the grid by hand, but this is not practical
To generate a finer grid we can use the built in R function seq()
seq(from, to, by, length.out) generates a vector containing a sequence:
from – The beginning number of the sequence
to – The ending number of the sequence
by – The step-size of the sequence (the increment)
length.out – The total length of the sequence
Example: Generate the vector of even numbers from 2 to 20
Note: The following commands are equivalent:
seq(from = x1, to = x2, by = s)
seq(x1, x2, s)
Example: Generate the vector of odd numbers from 1 to 11
x <- seq(from = 1, to = 11, by = 2)
y <- seq(1, 11, 2)
cat("Vector x is: (", x, ")")
cat("Vector y is: (", y, ")")
cat("They are the same!")
Vector x is: ( 1 3 5 7 9 11 )
Vector y is: ( 1 3 5 7 9 11 )
They are the same!
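With seq() the parabola y = x^2 on [-1, 1] can be plotted on a finer grid. This is a minimal sketch; the choice of 101 grid points is arbitrary and not taken from the lecture.

```r
# Fine grid on [-1, 1]; the number of points (101) is an arbitrary choice
x <- seq(from = -1, to = 1, length.out = 101)

# Evaluate the parabola on the grid
y <- x^2

# Scatter plot of the points, then join them with lines
plot(x, y)
lines(x, y)
```

Increasing length.out makes the curve smoother at the cost of more points to compute.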
Let us go back to the example of plotting random normal values
Generate the vector x with 1000 random normal values
Plot x via plot(x)
plot(x) implicitly assumes that:
x is the second argument: Values to plot on y-axis
The first argument is the vector seq(1, 1000)
seq(1, 1000) is the vector of component numbers of x
Functions are a class of objects
Format of a function is name followed by parentheses containing arguments
Functions take arguments and return a result
We already encountered several built in functions:
plot(x, y)
lines(x, y)
seq(x)
print("Stats is great!")
cat("R is great!")
mean(x)
sin(x)
plot(x, y) has formal arguments two vectors x and y
plot(height, weight) has actual arguments height and weight
In the call plot(height, weight) the arguments are matched:
height corresponds to x-variable
weight corresponds to y-variable
If a function has a lot of arguments, positional matching is tedious
For example plot() accepts the following (and more!) arguments
Argument | Description |
---|---|
x | x coordinate of points in the plot |
y | y coordinate of points in the plot |
type | Type of plot to be drawn |
main | Title of the plot |
xlab | Label of x axis |
ylab | Label of y axis |
pch | Shape of points |
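Issue with having too many arguments is the following:
We might want to specify only pch = 2
With positional matching we would also need to supply the arguments preceding pch: x, y, type, xlab, ylab
With named arguments we can specify pch = 2 by the call plot(weight, height, pch = 2)
weight is implicitly matched to x
height is implicitly matched to y
pch is explicitly matched to 2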
plot(x = weight, y = height, pch = 2)
plot(height, weight)
plot(x = height, y = weight)
plot(y = weight, x = height)
We have already seen another example of named actual arguments
seq(from = 1, to = 11, by = 2)
seq(1, 11, 2)
If however we want to divide the interval [1, 11] in 5 equal parts, we need 6 grid points:
seq(1, 11, length.out = 6)
We cannot use seq(1, 11, 6)
The third argument of seq() is by
Hence seq(1, 11, 6) assumes that by = 6
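The difference between the two calls can be seen directly. This is a minimal sketch; the names grid_named and grid_positional are illustrative.

```r
# Named argument: 6 equally spaced points from 1 to 11
grid_named <- seq(1, 11, length.out = 6)

# Positional argument: the third argument is matched to by, so the step is 6
grid_positional <- seq(1, 11, 6)

cat("seq(1, 11, length.out = 6) gives: (", grid_named, ")\n")
cat("seq(1, 11, 6) gives: (", grid_positional, ")\n")
```

The first call returns the grid 1, 3, 5, 7, 9, 11, while the second returns only 1 and 7.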
()
getwd() – which outputs current working directory
ls() – which outputs names of objects currently in memory
my_function is below
my_function(arguments)
The R function mean(x) computes the sample mean of vector x
We want to define our own function to compute the mean
Example: The mean of x could be computed via sum(x) / length(x)
We want to implement this code into the function my_mean(x)
my_mean takes vector x as argument
my_mean returns a scalar – the mean of x
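A possible definition of my_mean is sketched below. It assumes the function simply wraps the formula sum(x) / length(x) given above; the small test vector is made up for illustration.

```r
# Assumed definition: my_mean implements sum(x) / length(x)
my_mean <- function(x) {
  sum(x) / length(x)
}

# Quick check on a small hypothetical vector
cat("my_mean of (1, 2, 3, 4) is", my_mean(c(1, 2, 3, 4)), "\n")
```

The value of the last expression in the body is what the function returns, so no explicit return() is needed here.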
We test my_mean on an example
# Generate a random vector of 1000 entries from N(0,1)
x <- rnorm(1000)
# Compute mean of x with my_mean
xbar <- my_mean(x)
# Compute mean of x with built in function mean
xbar_check <- mean(x)
cat("Mean of x computed with my_mean is:", xbar)
cat("Mean of x computed with R mean is:", xbar_check)
cat("They coincide!")
Mean of x computed with my_mean is: -0.01736574
Mean of x computed with R mean is: -0.01736574
They coincide!
print and cat produce different output on character vectors:
print(x) prints all the strings in x separately
cat(x) concatenates strings. There is no way to tell how many were there
Logical vectors have entries TRUE, FALSE or NA
TRUE and FALSE can be abbreviated with T and F
NA stands for not available
Logical vectors are extremely useful to evaluate conditions
Example:
x
t
# Generate a vector containing sequence 1 to 8
x <- seq(from = 1 , to = 8, by = 1)
# Generate vector of flags for entries strictly above 5
y <- ( x > 5 )
cat("Vector x is: (", x, ")")
cat("Entries above 5 are: (", y, ")")
Vector x is: ( 1 2 3 4 5 6 7 8 )
Entries above 5 are: ( FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE )
Question: How to do this?
Hint: T/F are interpreted as 1/0 in arithmetic operations
sum(x) sums the entries of a vector x
We can use sum(x) to count the number of T entries in a logical vector x
x <- rnorm(1000) # Generates vector with 1000 normal entries
y <- (x > 0) # Generates logical vector of entries above 0
above_zero <- sum(y) # Counts entries above zero
cat("Number of entries which are above the average 0 is", above_zero)
cat("This is pretty close to 500!")
Number of entries which are above the average 0 is 502
This is pretty close to 500!
NA value - Not Available
NA is carried through in computations: operations on NA yield NA as the result
Components of a vector can be retrieved by indexing
vector[k] returns k-th component of vector
To modify an element of a vector use the following:
vector[k] <- value stores value in k-th component of vector
Returning multiple items of a vector is known as slicing
vector[c(k1, ..., kn)] returns components k1, ..., kn
vector[k1:k2] returns components k1 to k2
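Indexing and slicing can be tried on a small vector. This is a minimal sketch; the vector v and all the names below are made up for illustration.

```r
# Hypothetical vector used only for illustration
v <- c(10, 20, 30, 40, 50)

third <- v[3]              # indexing: 3rd component
v[3] <- 35                 # modify the 3rd component
picked <- v[c(1, 5)]       # slicing: components 1 and 5
middle <- v[2:4]           # slicing: components 2 to 4

cat("Original 3rd component:", third, "\n")
cat("Vector after modification:", v, "\n")
cat("Components 1 and 5:", picked, "\n")
cat("Components 2 to 4:", middle, "\n")
```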
Elements of a vector x can be deleted by using x[ -c(k1, ..., kn) ] which deletes entries k1, ..., kn
# Create a vector x
x <- c(11, 22, 33, 44, 55, 66, 77, 88, 99, 100)
# Print vector x
cat("Vector x is:", x)
# Delete 2nd, 3rd and 7th entries of x
x <- x[ -c(2, 3, 7) ]
# Print x again
cat("Vector x with 2nd, 3rd and 7th entries removed:", x)
Vector x is: 11 22 33 44 55 66 77 88 99 100
Vector x with 2nd, 3rd and 7th entries removed: 11 44 55 66 88 99 100
Code: Suppose we are given a vector x
Create a flag vector by using flag <- condition(x)
condition() is any function which returns T/F vector of same length as x
Subset x by using x[flag]
x[ x < 0 ]
# Create numeric vector x
x <- c(5, -2.3, 4, 4, 4, 6, 8, 10, 40221, -8)
# Get negative components from x and store them in neg_x
neg_x <- x[ x < 0 ]
cat("Vector x is:", x)
cat("Negative components of x are:", neg_x)
Vector x is: 5 -2.3 4 4 4 6 8 10 40221 -8
Negative components of x are: -2.3 -8
Components of x strictly between a and b can be extracted by combining two conditions with &
x[ (x > a) & (x < b) ]
# Create numeric vector
x <- c(5, -2.3, 4, 4, 4, 6, 8, 10, 40221, -8)
# Get components between 0 and 100
range_x <- x[ (x > 0) & (x < 100) ]
cat("Vector x is:", x)
cat("Components of x between 0 and 100 are:", range_x)
Vector x is: 5 -2.3 4 4 4 6 8 10 40221 -8
Components of x between 0 and 100 are: 5 4 4 4 6 8 10
which() converts a logical vector flag into a numeric index vector
which(flag) is the vector of indices of flag which correspond to TRUE
# Create a logical flag vector
flag <- c(T, F, F, T, F)
# Indices for which flag is TRUE
true_flag <- which(flag)
cat("Flag vector is:", flag)
cat("Positions for which Flag is TRUE are:", true_flag)
Flag vector is: TRUE FALSE FALSE TRUE FALSE
Positions for which Flag is TRUE are: 1 4
which() can be used to delete certain entries from a vector x
Create a flag vector by using flag <- condition(x)
condition() is any function which returns T/F vector of same length as x
Delete entries flagged by condition using the code x[ -which(flag) ]
# Create numeric vector x
x <- c(5, -2.3, 4, 4, 4, 6, 8, 10, 40221, -8)
# Print x
cat("Vector x is:", x)
# Flag positive components of x
flag_pos_x <- (x > 0)
# Remove positive components from x
x <- x[ -which(flag_pos_x) ]
# Print x again
cat("Vector x with positive components removed:", x)
Vector x is: 5 -2.3 4 4 4 6 8 10 40221 -8
Vector x with positive components removed: -2.3 -8
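One caveat, not covered in the lecture, is that x[ -which(flag) ] returns an empty vector when no entry is flagged. The sketch below illustrates this with a made-up condition and shows the alternative x[ !flag ]; the names flag, via_which and via_negation are illustrative.

```r
# Same numeric vector as above
x <- c(5, -2.3, 4, 4, 4, 6, 8, 10, 40221, -8)

# A hypothetical condition that no entry satisfies
flag <- (x > 1e6)

# which(flag) is empty, so x[ -which(flag) ] returns an empty vector
via_which <- x[ -which(flag) ]

# Negating the flag keeps every entry that is not flagged
via_negation <- x[ !flag ]

cat("Length using -which(flag):", length(via_which), "\n")
cat("Length using !flag:", length(via_negation), "\n")
```

When at least one entry is flagged, the two approaches give the same result.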
The main functions to generate vectors are
c() concatenate
seq() sequence
rep() replicate
We have already met c() and seq() but there are more details to discuss
Recall: c() generates a vector containing the input values
c() can also concatenate vectors
You can assign names to vector elements
This modifies the way the vector is printed
Given a named vector x
names(x) returns the names of x
unname(x) returns the values of x without the names
# Create named vector
x <- c(first = "Red", second = "Green", third = "Blue")
# Access names of x via names(x)
names_x <- names(x)
# Access values of x via unname(x)
values_x <- unname(x)
cat("Names of x are:", names(x))
cat("Values of x are:", unname(x))
Names of x are: first second third
Values of x are: Red Green Blue
The full syntax of seq is seq(from =, to =, by =, length.out =)
The default is by = 1
seq(x1, x2) is equivalent to x1:x2
x1:x2 is preferred to seq(x1, x2)
# Generate two vectors of integers from 1 to 6
x <- seq(1, 6)
y <- 1:6
cat("Vector x is:", x)
cat("Vector y is:", y)
cat("They are the same!")
Vector x is: 1 2 3 4 5 6
Vector y is: 1 2 3 4 5 6
They are the same!
rep generates repeated values from a vector:
For x a vector and n an integer, rep(x, n) repeats n times the vector x
# Create a vector with 3 components
x <- c(2, 1, 3)
# Repeats 4 times the vector x
y <- rep(x, 4)
cat("Original vector is:", x)
cat("Original vector repeated 4 times:", y)
Original vector is: 2 1 3
Original vector repeated 4 times: 2 1 3 2 1 3 2 1 3 2 1 3
The second argument of rep() can also be a vector:
For x and y vectors, rep(x, y) repeats entries of x as many times as corresponding entries of y
x <- c(2, 1, 3) # Vector to replicate
y <- c(1, 2, 3) # Vector saying how to replicate
z <- rep(x, y) # 1st entry of x is replicated 1 time
# 2nd entry of x is replicated 2 times
# 3rd entry of x is replicated 3 times
cat("Original vector is:", x)
cat("Original vector repeated is:", z)
Original vector is: 2 1 3
Original vector repeated is: 2 1 1 3 3 3
rep() can be useful to create vectors of labels
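A minimal sketch of this use of rep(), assuming two made-up groups of sizes 3 and 2; the names groups, sizes and group_labels are illustrative.

```r
# Hypothetical labels for two groups of sizes 3 and 2
groups <- c("control", "treatment")
sizes <- c(3, 2)

# Each label is repeated as many times as the size of its group
group_labels <- rep(groups, sizes)

cat("Labels:", group_labels, "\n")
```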
Comments
#