```{r, include = FALSE}
knitr::opts_chunk$set(fig.width=4, fig.height=4, fig.path = "fig/matrices-")
```
# Working with data in a matrix
## Loading data
Our example data is quality measurements (particle size) on PVC plastic production, using eight different resin batches, and three different machine operators.
The data set is stored in comma-separated value (CSV) format. Each row is a resin batch, and each column is an operator. In RStudio, open `pvc.csv` and have a look at what it contains.
```{r, results="hide"}
read.csv("r-intro-files/pvc.csv", row.names=1)
```
> ### Tip {.callout}
>
> The location of the file is given relative to your "working directory". You can see the location of your working directory in the title of the console pane in RStudio. It is most likely "~", indicating your personal home directory. You can change working directory with `setwd`.
>
> The filename "r-intro-files/pvc.csv" means from the current working directory, in the sub-directory "r-intro-files", the file "pvc.csv".
>
> You can check that the file is actually in this location using the "Files" pane in the bottom right corner of RStudio.
>
> If you are working on your own machine rather than our training server, and downloaded and unarchived the r-intro-files.zip file, the file may be in a different location.
>
We have called `read.csv` with two arguments: the name of the file we want to read, and which column contains the row names.
The filename needs to be a character string, so we put it in quotes. Assigning the second argument, `row.names`, to be `1` indicates that the data file has row names, and which column number they are stored in. If we don't specify `row.names` the result will not have row names.
> ### Tip {.callout}
>
> `read.csv` actually has many more arguments that you may find useful when
> importing your own data in the future.
```{r, results="hide"}
dat <- read.csv("r-intro-files/pvc.csv", row.names=1)
```
```{r}
dat
class(dat)
str(dat)
```
`read.csv` has loaded the data as a data frame. A data frame contains a collection of "things" (rows) each with a set of properties (columns) of different types.
Actually this data is better thought of as a matrix[^matrixnote]. In a data frame the columns contain different types of data, but in a matrix all the elements are the same type of data. A matrix in R is like a mathematical matrix, containing all the same type of thing (usually numbers).
[^matrixnote]: We use matrix here in the mathematical sense, not the biological sense.
R often but not always lets these be used interchangably. It's also helpful when thinking about data to distinguish between a data frame and a matrix. Different operations make sense for data frames and matrices. Data frames are very central to R, and mastering R is very much about thinking in data frames. However anything statistical will often involve using matrices. For example when we work with RNA-Seq data we use a matrix of read counts. So it will be worth our time to learn to use matrices as well.
Let us insist to R that what we have is a matrix. `as.matrix` "casts" our data to have matrix type.
```{r}
mat <- as.matrix(dat)
class(mat)
str(mat)
```
Much better.
> ### Tip {.callout}
>
> Matrices can be created in various ways.
>
> `matrix` converts a vector into a matrix with a specified number of rows and columns.
>
> `rbind` stacks several vectors as rows one on top of another to form a matrix, or it can stack smaller matrices on top of each other to form a larger matrix.
>
> `cbind` similarly stacks several vectors as columns next to each other to form a matrix, or it can stack smaller matrices next to each other to form a larger matrix.
>
## Indexing matrices
We can check the size of the matrix with the functions `nrow` and `ncol`:
```{r}
nrow(mat)
ncol(mat)
```
This tells us that our matrix, `mat`, has `r nrow(mat)` rows and `r ncol(mat)` columns.
If we want to get a single value from the matrix, we can provide a row and column index in square brackets:
```{r}
# first value in mat
mat[1, 1]
# a middle value in mat
mat[4, 2]
```
If our matrix has row names and column names, we can also refer to rows and columns by name.
```{r}
mat["Resin4","Bob"]
```
An index like `[4, 2]` selects a single element of a matrix, but we can select whole sections as well. For example, we can select the first two operators (columns) of values for the first four resins (rows) like this:
```{r}
mat[1:4, 1:2]
```
The slice `1:4` means the numbers from 1 to 4. It's the same as `c(1,2,3,4)`.
The slice does not need to start at 1, e.g. the line below selects rows 5 through 8:
```{r}
mat[5:8, 1:2]
```
We can use vectors created with `c` to select non-contiguous values:
```{r}
mat[c(1,3,5), c(1,3)]
```
We also don't have to provide an index for either the rows or the columns.
If we don't include an index for the rows, R returns all the rows; if we don't include an index for the columns, R returns all the columns.
If we don't provide an index for either rows or columns, e.g. `mat[, ]`, R returns the full matrix.
```{r}
# All columns from row 5
mat[5, ]
# All rows from column 2
mat[, 2]
```
## Summary functions
Now let's perform some common mathematical operations to learn about our data. When analyzing data we often want to look at partial statistics, such as the maximum value per resin or the average value per operator. One way to do this is to select the data we want as a new temporary variable, and then perform the calculation on this subset:
```{r}
# first row, all of the columns
resin_1 <- mat[1, ]
# max particle size for resin 1
max(resin_1)
```
We don't actually need to store the row in a variable of its own.
Instead, we can combine the selection and the function call:
```{r}
# max particle size for resin 2
max(mat[2, ])
```
R has functions for other common calculations, e.g. finding the minimum, mean, median, and standard deviation of the data:
```{r}
# minimum particle size for operator 3
min(mat[, 3])
# mean for operator 3
mean(mat[, 3])
# median for operator 3
median(mat[, 3])
# standard deviation for operator 3
sd(mat[, 3])
```
> ### Challenge - Subsetting data in a matrix {.challenge}
>
> Suppose you want to determine the maximum particle size for resin 5 across operators 2 and 3.
> To do this you would extract the relevant slice from the matrix and calculate the maximum value.
> Which of the following lines of R code gives the correct answer?
>
> (a) `max(mat[5, ])`
> (b) `max(mat[2:3, 5])`
> (c) `max(mat[5, 2:3])`
> (d) `max(mat[5, 2, 3])`
## Summarizing matrices
What if we need the maximum particle size for all resins, or the average for each operator? As the diagram below shows, we want to perform the operation across a margin of the matrix:
![Operations Across Axes](fig/r-operations-across-axes.png)
To support this, we can use the `apply` function.
> ### Tip {.callout}
>
> To learn about a function in R, e.g. `apply`, we can read its help
> documention by running `help(apply)` or `?apply`.
`apply` allows us to repeat a function on all of the rows (`MARGIN = 1`) or columns (`MARGIN = 2`) of a matrix. We can think of `apply` as collapsing the matrix down to just the dimension specified by `MARGIN`, with rows being dimension 1 and columns dimension 2 (recall that when indexing the matrix we give the row first and the column second).
Thus, to obtain the average particle size of each resin we will need to calculate the mean of all of the rows (`MARGIN = 1`) of the matrix.
```{r}
avg_resin <- apply(mat, 1, mean)
```
And to obtain the average particle size for each operator we will need to calculate the mean of all of the columns (`MARGIN = 2`) of the matrix.
```{r}
avg_operator <- apply(mat, 2, mean)
```
Since the second argument to `apply` is `MARGIN`, the above command is equivalent to `apply(dat, MARGIN = 2, mean)`.
> ### Tip {.callout}
>
> Some common operations have more concise alternatives. For example, you
> can calculate the row-wise or column-wise means with `rowMeans` and
> `colMeans`, respectively.
> ### Challenge - summarizing the matrix {.challenge}
>
> How would you calculate the standard deviation for each resin?
>
> Advanced: How would you calculate the values two standard deviations above and below the mean for each resin?
>
## t test
R has many statistical tests built in. One of the most commonly used tests is the t test. Do the means of two vectors differ significantly?
```{r}
mat[1,]
mat[2,]
t.test(mat[1,], mat[2,])
```
Actually, this can be considered a paired sample t-test, since the values can be paired up by operator. By default `t.test` performs an unpaired t test. We see in the documentation (`?t.test`) that we can give `paired=TRUE` as an argument in order to perform a paired t-test.
```{r}
t.test(mat[1,], mat[2,], paired=TRUE)
```
> ### Challenge - using t.test {.challenge}
>
> Can you find a significant difference between any two resins?
>
When we call `t.test` it returns an object that behaves like a `list`. Recall that in R a `list` is a miscellaneous collection of values.
```{r}
result <- t.test(mat[1,], mat[2,], paired=TRUE)
names(result)
result$p.value
```
This means we can write software that uses the various results from `t.test`, for example performing a whole series of t tests and reporting the significant results.
> ### Tip - Types in R under the hood {.callout}
>
> How could we have known that `t.test` gave us a result that behaved like a list?
>
> We used `class` earlier to see what type various values were. Here this tells us it is an `"htest"`, but this is actually just the "public face" of the value. Sometimes we need to Scooby Doo a value and see how R is thinking of it under the hood. `typeof` reveals how R is thinking of a value internally. If we uncover that the `typeof` of an object is `"list"`, we then know we can use `$` or `[[ ]]` to access its elements.
>
> ```{r}
> class(result)
> typeof(result)
> ```
>
> Try this with `dat` and `mat` as well.
## Plotting
The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," and the best way to develop insight is often to visualize data.
Visualization deserves an entire lecture (or course) of its own, but we can explore a few of R's plotting features.
Let's take a look at the average particle size per resin.
Recall that we already calculated these values above using `apply(mat, 1, mean)` and saved them in the variable `avg_resin`.
Plotting the values is done with the function `plot`.
```{r}
plot(avg_resin)
```
Above, we gave the function `plot` a vector of numbers corresponding to the average per resin across all operators.
`plot` created a scatter plot where the y-axis is the average particle size and the x-axis is the order, or index, of the values in the vector, which in this case correspond to the 8 resins.
`plot` can take many different arguments to modify the appearance of the output. Here is a plot with some extra arguments:
```{r}
plot(avg_resin,
xlab="Resin",
ylab="Particle size",
main="Average particle size per resin",
type="b")
```
> ### Challenge - plotting data {.challenge}
>
> Create a plot showing the standard deviation for each resin.
>
## Saving plots
It's possible to save a plot as a .PNG or .PDF from the RStudio interface with the "Export" button. However if we want to keep a complete record of exactly how we create each plot, we prefer to do this with R code.
Plotting in R is sent to a "device". By default, this device is RStudio. However we can temporarily send plots to a different device, such as a .PNG file (`png("filename.png")`) or .PDF file (`pdf("filename.pdf")`).
```{r eval=FALSE}
pdf("test.pdf")
plot(avg_resin)
dev.off()
```
**`dev.off()` is very important.** It tells R to stop outputting to the pdf device and return to using the default device. If you forget, your interactive plots will stop appearing as expected!
The file you created should appear in the file manager pane of RStudio, you can view it by clicking on it.