4 Proteomics Data Viz
4.1 Visualizing Proteomics data with ggplot2
Last session we worked on Cross-linking Mass Spectrometry Data. The data consisted of interaction between 300 yeast nuclear proteins.
We also learned about concepts behind Grammer of Graphics
and plotting using ggplot2
.
In this tutorial, we will be working on the data from Yeast Nuclear Protein interaction study using Cross-linking Mass Spectrometry.
We continue with the Cross-linking Proteomics dataset from Cytoscape tutorial.
The data was in excel, therefore we first converted it into Comma separated file (csv)
format.
4.1.1 Optional: Reading excel file into R
If you want to load the excel files files directly to R
then you can use another library readxl
.
library(readxl)
nuclear_xl_ms_excel <- readxl::read_excel("r-intro-files/Nuclear_XL_MS.xlsx")
head(nuclear_xl_ms_excel)
## # A tibble: 6 x 7
## Protein1 Protein2 NameProtein1 NameProtein2 PPINovelty PPIEvidenceInfo…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 P02293 P04911 H2B1 H2A1 Known Structure
## 2 P02293 P02309 H2B1 H4 Known Structure
## 3 P02994 P32471 EF1A EF1B Known Structure
## 4 P0CX51 P38011 RS16A GBLP Known Structure
## 5 P02406 P0CX49 RL28 RL18A Novel STRING
## 6 P33297 P53549 PRS6A PRS10 Known Structure
## # … with 1 more variable: NumberUniqueLysLysContacts <dbl>
4.2 Exploring the data
## # A tibble: 6 x 7
## Protein1 Protein2 NameProtein1 NameProtein2 PPINovelty PPIEvidenceInfo…
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 P02293 P04911 H2B1 H2A1 Known Structure
## 2 P02293 P02309 H2B1 H4 Known Structure
## 3 P02994 P32471 EF1A EF1B Known Structure
## 4 P0CX51 P38011 RS16A GBLP Known Structure
## 5 P02406 P0CX49 RL28 RL18A Novel STRING
## 6 P33297 P53549 PRS6A PRS10 Known Structure
## # … with 1 more variable: NumberUniqueLysLysContacts <dbl>
Now lets examine the dataset with two base R functions str
and summary
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 228 obs. of 7 variables:
## $ Protein1 : chr "P02293" "P02293" "P02994" "P0CX51" ...
## $ Protein2 : chr "P04911" "P02309" "P32471" "P38011" ...
## $ NameProtein1 : chr "H2B1" "H2B1" "EF1A" "RS16A" ...
## $ NameProtein2 : chr "H2A1" "H4" "EF1B" "GBLP" ...
## $ PPINovelty : chr "Known" "Known" "Known" "Known" ...
## $ PPIEvidenceInfoGroup : chr "Structure" "Structure" "Structure" "Structure" ...
## $ NumberUniqueLysLysContacts: num 12 6 5 1 2 3 2 3 3 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Protein1 = col_character(),
## .. Protein2 = col_character(),
## .. NameProtein1 = col_character(),
## .. NameProtein2 = col_character(),
## .. PPINovelty = col_character(),
## .. PPIEvidenceInfoGroup = col_character(),
## .. NumberUniqueLysLysContacts = col_double()
## .. )
## Protein1 Protein2 NameProtein1 NameProtein2
## Length:228 Length:228 Length:228 Length:228
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## PPINovelty PPIEvidenceInfoGroup NumberUniqueLysLysContacts
## Length:228 Length:228 Min. : 1.000
## Class :character Class :character 1st Qu.: 1.000
## Mode :character Mode :character Median : 1.000
## Mean : 1.342
## 3rd Qu.: 1.000
## Max. :12.000
While the data in PPINovelty
and PPIEvidenceInfoGroup
are characters/strings, they can also be thought of as categorical.
We will now look into the unique values for the PPINovelty and PPIEvidenceInfoGroup columns
## [1] "Known" "Novel"
## [1] "Structure" "STRING" "APID" "Unexplained" "Genetic"
R has a class for categorical data known as factors. We can convert these columns to factors and provide an order to those categories (levels). By default, R will order the levels of a factor alphabetically but we can override this behaviour by defining the level order. Here we will order the Evidence based on Strength of evidence for the interaction
nuclear_xl_ms$PPINovelty<-factor(nuclear_xl_ms$PPINovelty)
nuclear_xl_ms$PPIEvidenceInfoGroup <- factor(nuclear_xl_ms$PPIEvidenceInfoGroup,
levels = c("Structure","APID", "STRING", "Genetic", "Unexplained"))
str(nuclear_xl_ms)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 228 obs. of 7 variables:
## $ Protein1 : chr "P02293" "P02293" "P02994" "P0CX51" ...
## $ Protein2 : chr "P04911" "P02309" "P32471" "P38011" ...
## $ NameProtein1 : chr "H2B1" "H2B1" "EF1A" "RS16A" ...
## $ NameProtein2 : chr "H2A1" "H4" "EF1B" "GBLP" ...
## $ PPINovelty : Factor w/ 2 levels "Known","Novel": 1 1 1 1 2 1 2 1 1 1 ...
## $ PPIEvidenceInfoGroup : Factor w/ 5 levels "Structure","APID",..: 1 1 1 1 3 1 3 1 1 2 ...
## $ NumberUniqueLysLysContacts: num 12 6 5 1 2 3 2 3 3 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Protein1 = col_character(),
## .. Protein2 = col_character(),
## .. NameProtein1 = col_character(),
## .. NameProtein2 = col_character(),
## .. PPINovelty = col_character(),
## .. PPIEvidenceInfoGroup = col_character(),
## .. NumberUniqueLysLysContacts = col_double()
## .. )
4.3 Plotting interactions types
Firstly we will plot the number of Known and novel interactions using with geom_bar
Now, we will rotate the bars to Y-Axis using coord_flip()
ggplot(nuclear_xl_ms, aes(x = PPINovelty)) +
geom_bar(position = "dodge") +
xlab("Novelty") +
coord_flip() +
theme_bw()
Next, step would be to create a stacked bar chart by adding PPIEvidenceInfoGroup data on top of each bar
ggplot(nuclear_xl_ms, aes(PPINovelty)) +
geom_bar(aes(fill=PPIEvidenceInfoGroup)) +
xlab("Novelty") +
coord_flip() +
theme_bw()
4.3.0.1 Add Color blind safe color scheme
4.4 Individual Proteins
First we will calculate how many times a protein appeared in NameProtein1 column using table
function and then sorting by descending order.
Next we will use head()
function to print first five proteins with most observation.
## # A tibble: 161 x 2
## NameProtein1 n
## <chr> <int>
## 1 H2B1 10
## 2 EF1A 7
## 3 RL27A 5
## 4 H3 4
## 5 EF3A 3
## 6 ODP2 3
## 7 OSTB 3
## 8 RL11A 3
## 9 RL32 3
## 10 RL7A 3
## # … with 151 more rows
We could do same thing for NameProtein2 column as well.
## # A tibble: 185 x 2
## NameProtein2 n
## <chr> <int>
## 1 H3 6
## 2 RS15 5
## 3 NOP56 4
## 4 BFR1 3
## 5 PRS4 3
## 6 RL14A 3
## 7 RL6A 3
## 8 ATPG 2
## 9 EF1A 2
## 10 ERP1 2
## # … with 175 more rows
Now we store the rows containing H2B1 and EF1A proteins in NameProtein1 column in a datafame.
4.5 Volcano Plot [Optional]
In this section, we will see how to plot a volcano plot for a quantitative proteomics dataset. This dataset is derived from label-free quantitative proteomics experiment investigating differences in protein profiles between Benign and Malignant Prostate cancers.
The details can be found on LFQ-Analyst under the Demo tab.
There are 20 samples in total with n=10
in each group.
A moderated t-test was performed to find differentially expressed proteins in the dataset. Each row represents a protein along with log fold change and p-values.
In this tutorial we will visualise the results in the form of Volcano Plot.
Firstly, we will load the data.
## [1] 2389
## [1] 10
The data has quantitative information about 2389 proteins and has 10 columns.
Now lets see the column names.
## [1] "Gene Name"
## [2] "Protein IDs"
## [3] "Benign_vs_Malignant_log2 fold change"
## [4] "Benign_vs_Malignant_p.val"
## [5] "Benign_vs_Malignant_p.adj"
## [6] "significant"
## [7] "Benign_vs_Malignant_significant"
## [8] "imputed"
## [9] "num_NAs"
## [10] "Protein.names"
For plotting the volcano plot, we need to focus on th three columns
- Benign_vs_Malignant_log2 fold change
- Benign_vs_Malignant_p.adj
- significant
Next we will on the fly convert FDR values to -log10
and plot it against log2 fold change on the X-axis