Commit b3f33b7

Ryan Williams authored and committed
updated the visualization info and added a dataset to play with
1 parent 4a29c91 commit b3f33b7

2 files changed

+114
-0
lines changed

R_visualization.md

+113
@@ -77,3 +77,116 @@ ggplot(dataset)+geom_boxplot(aes(x=categorical_variable, y= variable))
With this plot we can see the distribution of the data (quantiles and median) within each level of our categorical variable. This visualization shows how your data are distributed (are they skewed?), and it also lets you compare categories (is my variable greater under one category than another?).

Applying Visualization to a Dataset
-----------------------------------

First we will install and load the following packages:
```
# Install once per machine
install.packages("ggplot2")
install.packages("reshape2")
install.packages("plyr")
install.packages("vegan")

# Load in every new R session
library(reshape2)
library(plyr)
library(ggplot2)
library(vegan)
```
These packages include many visualization, statistical, and data management tools that can be used to summarize your data and produce publication-ready plots.

Next we will read in the data (the file is also in the GitHub repo) and remove rows with nonsensical (empty) labels for ease of visualization. We will also add a variable that allows us to count up SNPs in the dataset.
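```
# Read the SNP table (adjust the path to wherever you saved the file)
data<-read.csv("~/Desktop/journal.pone.0081760.s004.csv")
# Drop rows with an empty genome ID or locus tag
data<-subset(data, RefGenomeID != "" & locus_tag != "")
# Add a column of 1s that can be summed to count SNPs
data$count<-1
```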
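
The later steps use two objects, `dataset` and `dataset_melt`, that are not created anywhere in this walkthrough. One plausible way to build them is sketched below. This is an assumption rather than the original author's code: it supposes that `dataset` is a wide table with one row per genome and one column per locus, and that `dataset_melt` is its long-format version. The `toy_snps` data frame is a made-up stand-in for `data`.

```r
library(reshape2)

# Hypothetical stand-in for the real SNP table: one row per SNP record.
toy_snps <- data.frame(
  RefGenomeID = c("g1", "g1", "g1", "g2", "g2", "g3"),
  locus_tag   = c("a",  "a",  "b",  "a",  "c",  "b"),
  stringsAsFactors = FALSE
)
toy_snps$count <- 1

# Wide table: one row per genome, one column per locus, cells hold SNP counts.
# Genome/locus pairs with no records are filled with 0 (the sum of nothing).
dataset <- dcast(toy_snps, RefGenomeID ~ locus_tag,
                 value.var = "count", fun.aggregate = sum)

# Long table with columns RefGenomeID, variable (the locus) and value (the count).
dataset_melt <- melt(dataset, id.vars = "RefGenomeID")
```

With the real file, `toy_snps` would be replaced by `data` from the block above.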

This dataset describes SNPs among multiple E. coli strains. We are interested in how many SNPs there are per E. coli genome, so we will summarise the SNP counts for each genome (average and standard deviation) using ddply.
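
```
# dataset_melt is expected to be in long format, with columns RefGenomeID and value
av_snps<-ddply(dataset_melt, .(RefGenomeID), summarise,
               average=mean(value),  # mean of value within each genome
               SD=sd(value)          # standard deviation within each genome
)
```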
Here we could calculate multiple descriptive statistics that can be useful when visualizing relationships.
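
As a rough illustration of what "multiple descriptive statistics" could look like, the sketch below adds a total, a sample size and a standard error to the same kind of `ddply` call. The toy `dataset_melt` values are invented, and the extra column names (`total`, `n`, `SE`) are only suggestions.

```r
library(plyr)

# Hypothetical long-format counts for two genomes.
dataset_melt <- data.frame(
  RefGenomeID = rep(c("g1", "g2"), each = 3),
  value       = c(2, 1, 0, 4, 4, 1)
)

# One row per genome, one column per statistic.
snp_summary <- ddply(dataset_melt, .(RefGenomeID), summarise,
  total   = sum(value),                       # total SNPs
  average = mean(value),                      # mean SNPs
  SD      = sd(value),                        # standard deviation
  n       = length(value),                    # number of observations
  SE      = sd(value) / sqrt(length(value))   # standard error of the mean
)
```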
Now we will use ggplot to produce visualizations of these data. ggplot builds a plot step by step: we start from the data and add visual layers to it.

```
ggplot(av_snps)
```
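Older versions of ggplot2 stop here with an error saying that the plot has no layers, while newer versions draw an empty panel. Either way nothing is shown yet, so let's add bars to show the average number of SNPs per genome.

```
# av_snps has the columns RefGenomeID, average and SD
av_snps<-arrange(av_snps, -average)
ggplot(av_snps)+geom_bar(aes(x=RefGenomeID, y=average),stat="identity")
```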
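
To make the idea of layers concrete, the following sketch builds a plot in steps and counts the layers at each stage. The `toy` data frame and its column names are invented for illustration, and inspecting `$layers` is just one convenient way to look inside a ggplot object.

```r
library(ggplot2)

# Hypothetical averages for three genomes.
toy <- data.frame(genome = c("g1", "g2", "g3"), average = c(5, 12, 8))

base_plot <- ggplot(toy)                                  # data only, no layers
bars      <- base_plot +
  geom_bar(aes(x = genome, y = average), stat = "identity")  # adds one layer
flipped   <- bars + coord_flip()                          # changes the coordinates, not the layers

layer_counts <- c(length(base_plot$layers),
                  length(bars$layers),
                  length(flipped$layers))
```

Each `+` returns a new plot object, so intermediate steps such as `bars` can be stored and reused, which is the pattern the later examples follow with `p`.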

We can also order the bars within the plot itself:

```
# reorder() sorts the genomes by their average before plotting
ggplot(av_snps)+geom_bar(aes(x=reorder(RefGenomeID,average), y=average),stat="identity")
```
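
The effect of `reorder()` is easier to see outside a plot. The small base R sketch below, using invented genome names and averages, shows how it rearranges factor levels, which is what determines the order of the bars.

```r
# Hypothetical averages for three genomes.
toy <- data.frame(genome  = factor(c("g1", "g2", "g3")),
                  average = c(5, 12, 8))

default_levels <- levels(toy$genome)                 # alphabetical order

# Levels sorted by increasing average.
by_average <- reorder(toy$genome, toy$average)

# Negating the values sorts by decreasing average instead.
by_average_desc <- reorder(toy$genome, -toy$average)
```

Only the level order changes; the values in the column stay attached to the same rows.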

These are hard to see. We will sort the genomes, keep the top 10, and flip the axes so we can read the labels.

```
# Sort by decreasing average, then plot only the first 10 rows
av_snps<-arrange(av_snps,-average)
ggplot(av_snps[1:10,])+geom_bar(aes(x=reorder(RefGenomeID,average), y=average),stat="identity")+coord_flip()
```

Let's add a layer that shows the variability around these averages (standard deviation).

```
p<-ggplot(av_snps[1:10,])+geom_bar(aes(x=reorder(RefGenomeID,average), y=average),stat="identity")+coord_flip()
# Error bars span one standard deviation either side of the average
p+geom_errorbar(aes(x=reorder(RefGenomeID,average),y=average,ymax=average+SD,ymin=average-SD))
```

We can also add informative labels easily.

```
p<-ggplot(av_snps[1:10,])+geom_bar(aes(x=reorder(RefGenomeID,average), y=average),stat="identity")+coord_flip()
p.error.bars<-p+geom_errorbar(aes(x=reorder(RefGenomeID,average),y=average,ymax=average+SD,ymin=average-SD))
# labs() names aesthetics, so x is still the genome axis after coord_flip()
p.error.bars+labs(x="E. coli Genomes",y="Average Number of SNPs")
```

Colors can be used as well.

```
# fill colours the bars and colour colours the error bars, one colour per genome
p<-ggplot(av_snps[1:10,])+geom_bar(aes(x=reorder(RefGenomeID,average), y=average,fill=RefGenomeID),stat="identity")+coord_flip()
p.error.bars<-p+geom_errorbar(aes(x=reorder(RefGenomeID,average),y=average,ymax=average+SD,ymin=average-SD,colour=RefGenomeID))
p.error.bars+labs(x="E. coli Genomes",y="Average Number of SNPs")
```

We may also be interested in a multivariate analysis of these data. We can ask the question, "Which genomes are most similar based on SNPs?" This is analogous to asking, "Which of my samples are the most similar?" We will do this using nonmetric multidimensional scaling (NMDS).

First we will transform the data to presence-absence.

```
library(vegan)
# dataset is expected to hold the genome IDs in column 1 and counts in the remaining columns
transformed_data<-decostand(dataset[,-1],"pa")
```
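
For intuition, the presence-absence transformation can be reproduced by hand in base R: every count greater than zero becomes 1 and everything else becomes 0. The small `counts` matrix below is invented for illustration.

```r
# Hypothetical SNP counts: rows are genomes, columns are loci.
counts <- matrix(c(2, 1, 0,
                   1, 0, 1,
                   0, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("g1", "g2", "g3"), c("a", "b", "c")))

# TRUE/FALSE for "any SNPs here?", converted to 1/0 by multiplying by 1.
presence_absence <- (counts > 0) * 1
```

After this step a locus with two SNPs and a locus with one SNP are treated the same, so the analysis compares which loci carry SNPs rather than how many.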

Then we will run the NMDS and look at one of the outputs.

```
mds<-metaMDS(transformed_data, distance="jaccard",k=2,autotransform=FALSE)
# Keep the site (genome) scores and put the genome IDs alongside them
scores<-data.frame(dataset[,1],scores(mds, display="sites"))
names(scores)[1]<-"RefGenomeID"
head(scores)
```
We renamed the first column above for ease of understanding.
Now let's look at the first two axes from our NMDS analysis and colour the points by genome identity.

```
ggplot(scores)+geom_point(aes(x=NMDS1,y=NMDS2,colour=RefGenomeID))
```

This visualization shows where the points sit relative to one another in multidimensional space, compressed down into two dimensions. Points that are close together are similar, while points that are far apart are dissimilar.
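
"Similar" here refers to the Jaccard distance requested in the `metaMDS` call. For presence-absence data this is one minus the share of loci two genomes have in common out of all loci present in either. The helper below is a hypothetical hand-rolled version for intuition only, with invented genomes; it is not how vegan computes the distances internally.

```r
# Jaccard dissimilarity between two presence-absence (0/1) vectors.
jaccard <- function(a, b) {
  shared <- sum(a == 1 & b == 1)   # loci with SNPs in both genomes
  either <- sum(a == 1 | b == 1)   # loci with SNPs in at least one genome
  1 - shared / either
}

# Hypothetical presence-absence profiles across three loci.
g1 <- c(1, 1, 0)
g2 <- c(1, 0, 1)
g3 <- c(1, 1, 0)

d_12 <- jaccard(g1, g2)   # one shared locus out of three present
d_13 <- jaccard(g1, g3)   # identical profiles
```

Genomes with identical profiles have distance 0 and would sit on top of each other in the NMDS plot, while genomes sharing no loci have distance 1.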

journal.pone.0081760.s004.csv

+1
Large diffs are not rendered by default.
