R tutorial
1. Download the supplementary material from a paper of interest concerning gene expression data
The paper that we will use here is “Chromatin state dynamics during blood formation” by Lara-Astiaso et al. (2014), recently published in Science.
You can download the file (the TableS2 of the supplementary material) from this link: https://ki.box.com/shared/static/av9p9sbsr7tt9xrdrxu5.txt
If you are curious about the paper, here is the link: http://www.sciencemag.org/content/345/6199/943.abstract
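If you prefer to fetch the file from within R rather than through the browser, the sketch below shows one way to do it. The helper name fetch_if_missing is made up for this tutorial, and the destination file name is an assumption chosen to match the name used when the table is read in later. The link may also stop working at some point.

```r
# Hypothetical helper (not part of any package):
# download a file only if it is not already on disk
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the file byte-for-byte identical on all platforms
    download.file(url, destfile, mode = "wb")
  }
  destfile
}

# Usage (needs an internet connection, so it is left commented out):
# fetch_if_missing("https://ki.box.com/shared/static/av9p9sbsr7tt9xrdrxu5.txt",
#                  "~/Desktop/1256271tableS2.txt")
```

Checking file.exists first means that re-running the script does not download the table again.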
2. Load the libraries and read the table into R
#load libraries
#if you don't have them installed run:
#install.packages("ggplot2")
#install.packages("reshape2")
library(ggplot2)
library(reshape2)
setwd("~/Desktop")
data <- read.table("1256271tableS2.txt", header=TRUE, sep="\t")
3. Suppose you want to check the expression of your favorite genes in the cell populations studied in the paper, and you want to make a plot
- Have a look at how many rows and columns your table has (using the function dim)
- Have a look at the column names to identify the column with the gene name information
- Then see the first 6 entries of the column containing the gene names
- Then define a vector with your genes of interest
dim(data)
## [1] 31645 18
colnames(data)
## [1] "UNIQUD" "NAME" "LT.HSC" "HSC" "MPP"
## [6] "CLP" "CMP" "GMP" "MF" "Granulocyte"
## [11] "Mono" "B" "CD4" "CD8" "NK"
## [16] "MEP" "EryA" "EryB"
head(data$NAME)
## [1] Fam227b Mir7227 Eif4a1 Efhd1 Rhbdd1 Fam175a
## 23953 Levels: 0 0610005C13Rik 0610007N19Rik 0610007P14Rik ... Zzz3
mygenes <- c("Vamp1", "Vamp2","Vamp3","Vamp4","Vamp5","Vamp6","Vamp7","Vamp8")
mygenes
## [1] "Vamp1" "Vamp2" "Vamp3" "Vamp4" "Vamp5" "Vamp6" "Vamp7" "Vamp8"
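Before filtering, it can be useful to check which of your genes are actually present in the table, since a name that is absent is silently skipped. The sketch below uses made-up vectors (table_names stands in for data$NAME) to show the idea.

```r
# Hypothetical stand-in for the NAME column of the real table
table_names <- c("Vamp1", "Vamp2", "Vamp3", "Eif4a1")
# Genes we would like to look up
genes <- c("Vamp1", "Vamp2", "Vamp6")

# Genes that occur in the table
found <- genes[genes %in% table_names]
# Genes that do not occur in the table
absent <- setdiff(genes, table_names)

found
absent
```

With the real data, the same check would be setdiff(mygenes, data$NAME).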
Now we will keep only the expression data for the genes in your vector “mygenes”:
- using the function which together with %in%, we can select the rows of the data frame whose gene name is in our vector
- now you can melt the resulting dataset (melt is a function of the reshape2 package)
- have a look at it
- as you can see, the first 9 rows hold the UNIQUD identifiers rather than expression values, so we can remove them
data_mygenes <- data[which(data$NAME %in% mygenes),]
data_mygenesM <- melt(data_mygenes, id="NAME")
## Warning: attributes are not identical across measure variables; they will
## be dropped
#the warning appears because the text column UNIQUD is melted together with the expression columns; it can be ignored here
head(data_mygenesM, n=20)
## NAME variable value
## 1 Vamp3 UNIQUD NM_009498
## 2 Vamp7 UNIQUD NM_011515
## 3 Vamp1 UNIQUD NM_001080557
## 4 Vamp5 UNIQUD NM_016872
## 5 Vamp1 UNIQUD NM_009496
## 6 Vamp8 UNIQUD NM_016794
## 7 Vamp4 UNIQUD NM_016796
## 8 Vamp5 UNIQUD NM_001080742
## 9 Vamp2 UNIQUD NM_009497
## 10 Vamp3 LT.HSC 376.237
## 11 Vamp7 LT.HSC 67.53
## 12 Vamp1 LT.HSC 77.177
## 13 Vamp5 LT.HSC 2247.776
## 14 Vamp1 LT.HSC 77.177
## 15 Vamp8 LT.HSC 2624.014
## 16 Vamp4 LT.HSC 385.884
## 17 Vamp5 LT.HSC 2247.776
## 18 Vamp2 LT.HSC 1196.241
## 19 Vamp3 HSC 149.6353
## 20 Vamp7 HSC 38.478
data_mygenesM <- unique(data_mygenesM[10:nrow(data_mygenesM),])
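The filter-and-melt steps above can also be tried on a small toy table. This is only a sketch: the table, its values and the names toy, wanted, toy_sel and toy_long are invented for illustration. It also shows an alternative to removing rows after the melt, namely dropping the identifier column before melting, so that only numeric columns are stacked.

```r
library(reshape2)

# Toy stand-in for the real table (values are made up)
toy <- data.frame(
  UNIQUD = c("ID_a", "ID_b", "ID_c"),
  NAME   = c("GeneA", "GeneB", "GeneC"),
  HSC    = c(10.5, 20.0, 30.25),
  MPP    = c(1.0, 2.5, 4.0),
  stringsAsFactors = FALSE
)
wanted <- c("GeneA", "GeneC")

# Keep only the rows whose NAME is in the vector of interest
toy_sel <- toy[toy$NAME %in% wanted, ]

# Drop the identifier column, then stack the remaining columns:
# one row per gene and cell type
toy_long <- melt(toy_sel[, names(toy_sel) != "UNIQUD"], id.vars = "NAME")

toy_long
```

Because no text column is mixed in, the value column stays numeric here and there are no identifier rows to remove afterwards.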
Now we are ready to plot. We will do it using ggplot2, an R package that makes nicer plots than the base R graphics functions.
The information you give to the function ggplot is:
- the name of the dataset
- the columns that will represent the x and y in your plot
- the kind of plot (a bar plot)
- how the different bars should be colored (in this case by cell type)
#inside aes() refer to the columns by name; ggplot looks them up in data_mygenesM
ggplot(data_mygenesM, aes(NAME, as.numeric(value))) +
  geom_bar(aes(fill = variable), position = "dodge", stat = "identity")
You can also make the plot prettier by adding proper axis labels and a title to your plot:
ggplot(data_mygenesM, aes(NAME, as.numeric(value))) +
  geom_bar(aes(fill = variable), position = "dodge", stat = "identity") +
  xlab("") +
  ylab("Expression values") +
  ggtitle("Expression of Vamp proteins")
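If you want to experiment with the plotting call without the full dataset, here is a minimal sketch on a toy long-format table. The values and the names toy_long and p are made up. The value column is stored as text on purpose, as it is after the melt above, and is converted to numbers once before plotting instead of inside aes().

```r
library(ggplot2)

# Toy long-format table (made-up values); value is text, as after the melt
toy_long <- data.frame(
  NAME     = rep(c("GeneA", "GeneB"), times = 2),
  variable = rep(c("HSC", "MPP"), each = 2),
  value    = c("10.5", "20", "1", "2.5"),
  stringsAsFactors = FALSE
)

# Convert the text to numbers once, so the plotting call stays short
toy_long$value <- as.numeric(toy_long$value)

# One bar per gene and cell type, placed side by side and colored by cell type
p <- ggplot(toy_long, aes(x = NAME, y = value, fill = variable)) +
  geom_bar(position = "dodge", stat = "identity") +
  xlab("") +
  ylab("Expression values")

p
```

Storing the plot in an object (here p) lets you add further layers or labels later without retyping the whole call.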