R hands on - handling supplementary material of a paper of interest in R

R tutorial
1. Download the supplementary material from a paper of interest concerning gene expression data
The paper that we will use here is “Chromatin state dynamics during blood formation” by Lara-Astiago et al. 2014, recently published in Science
You can download the file (the TableS2 of the supplementary material) from this link: https://ki.box.com/shared/static/av9p9sbsr7tt9xrdrxu5.txt
If you are curious about the paper, here is the link: http://www.sciencemag.org/content/345/6199/943.abstract

Read the file into R, but first:

install the packages that we will use: ggplot2 and reshape2
set the working directory to the directory where you have downloaded the file using the setwd command

#load libraries
#if you don't have them installed run:
#install.packages("gglot2")
#install.packages("reshape2")

library(ggplot2)
library(reshape2)

setwd("~/Desktop")

data <- read.table("1256271tableS2.txt", header=TRUE, sep="\t")

3. Suppose you want to check the expressions of your favorite genes in the cell populations studied in the paper and you want to make a plot
- Have a look at how many rows and column your table has (using the function dim)
- Have a look at the column names to identify the column with the gene name information
- The see the first 6 lines of the column containing the gene names
- Then define a vector with your genes of interest

dim(data)

## [1] 31645    18

colnames(data)

##  [1] "UNIQUD"      "NAME"        "LT.HSC"      "HSC"         "MPP"        
##  [6] "CLP"         "CMP"         "GMP"         "MF"          "Granulocyte"
## [11] "Mono"        "B"           "CD4"         "CD8"         "NK"         
## [16] "MEP"         "EryA"        "EryB"

head(data$NAME)

## [1] Fam227b Mir7227 Eif4a1  Efhd1   Rhbdd1  Fam175a
## 23953 Levels: 0 0610005C13Rik 0610007N19Rik 0610007P14Rik ... Zzz3

mygenes <- c("Vamp1", "Vamp2","Vamp3","Vamp4","Vamp5","Vamp6","Vamp7","Vamp8")
mygenes

## [1] "Vamp1" "Vamp2" "Vamp3" "Vamp4" "Vamp5" "Vamp6" "Vamp7" "Vamp8"

Now we will only keep the expression data for the genes in your vector “mygenes”: - using the function which we can match the rows in the dataframe referring to genes in our vector - now you can melt the created dataset (function within the reshape2 package) - have a look at it - as you can see the first 9 rows are not revelant for looking at gene expression, so we cna remove them

data_mygenes <- data[which(data$NAME %in% mygenes),]

data_mygenesM <- melt(data_mygenes, id="NAME")

## Warning: attributes are not identical across measure variables; they will
## be dropped

#ignore the warning message

head(data_mygenesM, n=20)

##     NAME variable        value
## 1  Vamp3   UNIQUD    NM_009498
## 2  Vamp7   UNIQUD    NM_011515
## 3  Vamp1   UNIQUD NM_001080557
## 4  Vamp5   UNIQUD    NM_016872
## 5  Vamp1   UNIQUD    NM_009496
## 6  Vamp8   UNIQUD    NM_016794
## 7  Vamp4   UNIQUD    NM_016796
## 8  Vamp5   UNIQUD NM_001080742
## 9  Vamp2   UNIQUD    NM_009497
## 10 Vamp3   LT.HSC      376.237
## 11 Vamp7   LT.HSC        67.53
## 12 Vamp1   LT.HSC       77.177
## 13 Vamp5   LT.HSC     2247.776
## 14 Vamp1   LT.HSC       77.177
## 15 Vamp8   LT.HSC     2624.014
## 16 Vamp4   LT.HSC      385.884
## 17 Vamp5   LT.HSC     2247.776
## 18 Vamp2   LT.HSC     1196.241
## 19 Vamp3      HSC     149.6353
## 20 Vamp7      HSC       38.478

data_mygenesM <- unique(data_mygenesM[10:nrow(data_mygenesM),])

Now we are ready to plot. We will do it using ggplot2, an R package that makes nicer plot than the default R package for printing.
The information you will give to the function ggplot are:
- the name of the dataset
- the columns that will represent the x and y in your plot
- the kind of plot (a bar plot)
- How the different bars should be colored (in this case by cell type)

ggplot(data_mygenesM, aes(data_mygenesM$NAME, as.numeric(data_mygenesM$value))) +    geom_bar(aes(fill = data_mygenesM$variable), position = "dodge", stat="identity")

You can also make the plot prettier by adding proper axis labels and a title to your plot:

ggplot(data_mygenesM, aes(data_mygenesM$NAME, as.numeric(data_mygenesM$value))) +    geom_bar(aes(fill = data_mygenesM$variable), position = "dodge", stat="identity") + xlab("") + ylab("Expression values") + ggtitle("Expression of Vamp proteins")

R hands on - handling supplementary material of a paper of interest in R

Bianca Tesi

October 1, 2014