---
title: 'Bioinformatics for Big Omics Data: Advanced graphics in R using ggplot2'
author: "Raphael Gottardo"
date: "December 26, 2014"
output:
  ioslides_presentation:
    fig_caption: yes
    fig_retina: 1
    fig_width: 6.5
    keep_md: yes
    smaller: yes
---
```{r, echo=FALSE}
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=80),cache=TRUE)
```
## Exploratory data analysis (EDA)
What is EDA?
- A statistical practice concerned with, among other things, uncovering underlying structure, extracting important variables, detecting outliers and anomalies, testing underlying assumptions, and developing models
- Named by John Tukey
- *Extremely Important*
R provides a powerful environment for EDA and visualization
## EDA techniques
- Mostly graphical
- Plotting the raw data (histograms, scatterplots, etc.)
- Plotting simple statistics such as means, standard deviations, medians, box plots, etc.
- Positioning such plots so as to maximize our natural pattern-recognition abilities
- A clear picture is worth a thousand words!
## A few tips
- Avoid 3-D graphics
- Don't show too much information on the same graph (color, patterns, etc)
- Stay away from Excel; it is not a statistics package!
- R provides a great environment for EDA with good graphics capabilities
## Graphics in R
The generic function for plotting objects in R is `plot`. Its output depends on the type of object it is given (e.g. a scatter plot, a pairs plot, etc.).
```{r plot, fig.height=4}
# Load the Iris data
data(iris)
plot(iris)
```
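As a rough illustration of how `plot` adapts to its input, the sketch below calls it on three kinds of object taken from `iris`. The exact appearance depends on your graphics device.

```{r plot-dispatch-sketch}
# Two numeric vectors: scatter plot
plot(iris$Sepal.Length, iris$Sepal.Width)
# A factor: bar plot of the counts per level
plot(iris$Species)
# A data frame with several columns: matrix of pairwise scatter plots
plot(iris[, 1:4])
```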
## Graphics in R
R provides many other graphics capabilities, such as histograms, boxplots, etc.
```{r histogram}
hist(iris$Sepal.Length)
```
## Graphics in R
```{r boxplots}
boxplot(iris)
```
Anything wrong with the Species boxplot?
## Graphics in R (end)
While R's base graphics are powerful and fully customizable, getting to the desired plot can be difficult and painful.
For example, there are many different parameters, and they are not necessarily consistent across functions.
**Friends don't let friends use R base graphics!**
[ggplot2](http://ggplot2.org/) provides a much richer and more consistent environment for graphics in R.
## ggplot2
Description from Hadley Wickham:
> ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
It provides many different geometric objects (geoms) for data visualization.
See http://docs.ggplot2.org/current/ for details.
ggplot2 provides two main APIs: `qplot` and `ggplot`.
## ggplot2 - qplot
`qplot` is basically a replacement for `plot`, but it provides cleaner output with a consistent interface.
```{r qplot-on-iris, fig.height=3}
# Load library
library(ggplot2)
# Using the variables directly
qplot(iris$Sepal.Length,iris$Sepal.Width)
```
## ggplot2 - qplot
```{r qplot-on-iris-df}
# Can also use a dataframe
qplot(Sepal.Length, Sepal.Width, data=iris)
```
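Note that `qplot` is deprecated as of ggplot2 3.4.0, so on a recent installation the calls above may emit a deprecation warning. As a minimal sketch, the same scatter plot can be written with the `ggplot` API that is covered later in these slides (the object name `p_scatter` is just for illustration):

```{r qplot-ggplot-equivalent}
library(ggplot2)
# Same scatter plot as qplot(Sepal.Length, Sepal.Width, data=iris)
p_scatter <- ggplot(data=iris, aes(x=Sepal.Length, y=Sepal.Width))+
  geom_point()
p_scatter
```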
## ggplot2 - plot types
Plot types can be specified with the `geom` option
```{r qplot-on-iris-point}
# Points
qplot(Sepal.Length, Sepal.Width, data=iris, geom="point")
```
## ggplot2 - plot types
```{r qplot-on-iris-line}
# line
qplot(Sepal.Length, Sepal.Width, data=iris, geom="line")
```
## ggplot2 - combining plot types
Plot types can be specified with the `geom` option, and multiple types can be combined by passing a vector.
```{r qplot-on-iris-point-line, fig.height=4}
# Points and line
qplot(Sepal.Length, Sepal.Width, data=iris, geom=c("line","point"))
```
## ggplot2 - combining plot types
```{r qplot-on-iris-boxplot, fig.height=4}
# Boxplot and points
qplot(x=Species, y=Sepal.Length, data=iris, geom=c("boxplot","point"))
```
Notice anything wrong with the plot above? Try replacing the `point` geom with `jitter`.
## ggplot2 - mapping aesthetics variables
In ggplot2, additional variables can be mapped to plot aesthetics including `color`, `fill`, `shape`, `size`, `alpha`, `linetype`.
```{r qplot-on-iris-boxplot-aes, fig.height=4}
# Boxplot and jittered points, colored by sepal width
qplot(x=Species, y=Sepal.Length, data=iris, geom=c("boxplot","jitter"), color=Sepal.Width)
```
## ggplot2 - mapping aesthetics variables
```{r qplot-on-iris-scatter-aes}
# Points and a linear fit, colored by species and sized by petal width
qplot(x=Sepal.Width, y=Sepal.Length, data=iris, geom=c("point","smooth"), color=Species, size=Petal.Width, method="lm")
```
## ggplot2 - faceting
Sometimes it is convenient to visualize a characteristic of a dataset conditional on the levels of another variable.
This feature is readily available in ggplot2 through the `facets` argument.
```{r qplot-on-iris-scatter-facet, fig.height=3}
# Points and a linear fit, one panel per species
qplot(x=Sepal.Width, y=Sepal.Length, data=iris, geom=c("point","smooth"), color=Species, size=Petal.Width, method="lm", facets=~Species)
```
## ggplot2 - faceting
Or, if you prefer to facet by rows:
```{r qplot-on-iris-scatter-facet2, fig.height=3.5}
# Same plot, with one row per species
qplot(x=Sepal.Width, y=Sepal.Length, data=iris, geom=c("point","smooth"), color=Species, size=Petal.Width, method="lm", facets= Species~.)
```
## Reshaping your data with reshape2
It is often necessary to reshape (e.g. pivot) your data before analysis. This can easily be done in R using the `reshape2` package.
The package provides two main functions, `melt` and `*cast`. `melt` "melts" a data frame in wide format into long format; `*cast` goes in the other direction.
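
To make the wide/long distinction concrete, here is a minimal sketch of a round trip on a small toy data frame. The data and the names `toy_wide`, `toy_long` and `toy_wide_again` are made up for illustration.

```{r reshape2-toy-sketch}
library(reshape2)
# Hypothetical toy data in wide format: one row per subject
toy_wide <- data.frame(id=c("a","b"), height=c(1.5, 1.8), weight=c(50, 70))
# melt: wide -> long, one row per (id, variable) pair
toy_long <- melt(toy_wide, id.vars="id")
toy_long
# dcast: long -> wide, one column per level of `variable`
toy_wide_again <- dcast(toy_long, id~variable, value.var="value")
toy_wide_again
```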
## Reshaping your data with reshape2
Let's revisit our `iris` dataset.
```{r reshape2}
# We first load the library
library(reshape2)
# Only display the first few lines
head(iris)
```
We can see in the data above that both width and length are measured on two different flower parts: sepal and petal. So we could store the same information with a single length column, a single width column, and an additional variable for the flower part (Sepal/Petal).
## reshape2 - melt
The `melt` function provides some good default options that will try to best guess how to "melt" the data.
```{r reshape2-melt}
# We first need to add a column to keep track of the flower
iris$flower_id <- rownames(iris)
# Default options
iris_melted <- melt(iris)
head(iris_melted)
```
## reshape2 - melt
```{r reshape2-melt-suite}
# We first split that variable to get the columns we need
split_variable <- strsplit(as.character(iris_melted$variable),split="\\.")
# Create two new variables
iris_melted$flower_part <- sapply(split_variable, "[", 1)
iris_melted$measurement_type <- sapply(split_variable, "[", 2)
# Remove the one we don't need anymore
iris_melted$variable <- NULL
head(iris_melted)
```
This is close, but not quite what we want. Let's see if cast can help us do what we need.
## reshape2 - cast
Use `acast` or `dcast` depending on whether you want vector/matrix/array output or data frame output. Data frames can have at most two dimensions.
```{r reshape2-cast}
iris_cast <- dcast(iris_melted, formula=flower_id+Species+flower_part~measurement_type)
head(iris_cast)
```
**Q:** Why are the elements of `flower_id` not properly ordered?
`melt` and `*cast` are very powerful. They can also be used on `data.table`s. More on this later.
**Exercise:** Try to reorder the variable names in the formula. What happens?
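
To illustrate the `acast`/`dcast` distinction mentioned above, here is a minimal sketch on hypothetical toy data (the `toy_measurements` data frame and its values are made up for illustration).

```{r reshape2-acast-sketch}
library(reshape2)
# Hypothetical long-format data: two measurements on two subjects
toy_measurements <- data.frame(id=c("a","a","b","b"),
                               variable=c("height","weight","height","weight"),
                               value=c(1.5, 50, 1.8, 70))
# dcast returns a data frame, with `id` kept as a regular column
as_df <- dcast(toy_measurements, id~variable, value.var="value")
class(as_df)
# acast returns a matrix, with `id` moved to the row names
as_mat <- acast(toy_measurements, id~variable, value.var="value")
dim(as_mat)
rownames(as_mat)
```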
## Back to ggplot2
Using our reshaped data frame, we can further explore the iris dataset.
```{r multi-facet, fig.height=4}
# We can now facet by Species and Petal/Sepal
qplot(x=Width, y=Length, data=iris_cast, geom=c("point","smooth"), color=Species, method="lm", facets= flower_part~Species)
```
It would be nice to have free scales for the panels, and to customize the look and feel. Before we explore this, let's talk about the `ggplot` API as an alternative to `qplot`.
## ggplot2 and the grammar of graphics
`ggplot2` provides another API via the `ggplot` function, which directly implements the idea of a "grammar of graphics".
The grammar defines the components of a plot as:
- a default dataset and set of mappings from variables to aesthetics,
- one or more layers, with each layer having one geometric object, one statistical transformation, one position adjustment, and optionally, one dataset and set of aesthetic mappings,
- one scale for each aesthetic mapping used,
- a coordinate system,
- the facet specification.
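
As a rough sketch of how these components map onto code, the plot below spells out each one explicitly. Most of the arguments shown are defaults that ggplot2 would normally fill in for you; the object name `p_grammar` is only for illustration.

```{r grammar-components-sketch}
library(ggplot2)
# Default dataset and mappings from variables to aesthetics
p_grammar <- ggplot(data=iris, mapping=aes(x=Sepal.Width, y=Sepal.Length, color=Species))+
  # One layer: geometric object, statistical transformation, position adjustment
  geom_point(stat="identity", position="identity")+
  # One scale for each aesthetic mapping used
  scale_x_continuous()+
  scale_y_continuous()+
  scale_color_discrete()+
  # Coordinate system
  coord_cartesian()+
  # Facet specification
  facet_wrap(~Species)
p_grammar
```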
## ggplot2 and the grammar of graphics
```{r ggplot-iris, fig.height=4}
ggplot(data=iris_cast, aes(x=Width, y=Length))+
  # Add points and use free scales in the facets
  geom_point()+facet_grid(Species~flower_part, scales="free")+
  # Add a regression line
  geom_smooth(method="lm")+
  # Use the black/white theme and increase the font size
  theme_bw(base_size=24)
```
## ggplot2 and the grammar of graphics
Let's try to map one of the variables to the shape.
```{r ggplot-iris-suite, fig.height=4}
my_plot <- ggplot(data=iris_cast, aes(x=Width, y=Length, shape=flower_part, color=flower_part))+
  # Add points, facet by species, and add a regression line
  geom_point()+facet_grid(~Species)+geom_smooth(method="lm")
my_plot
```
Note: the `ggplot` API requires the use of a dataframe. Many different layers can be added to obtain the desired results. Different data can be used in the different layers.
**Exercise:** Your turn to try! Try to facet by `flower_part` and use Species as an aesthetic variable. Try to use `facet_wrap` instead of `facet_grid`.
## Having some fun with ggplot2
Excel theme
```{r ggthemes, fig.height=4}
library(ggthemes)
my_plot+theme_excel(base_size=24)
```
## Having some fun with ggplot2
Wall Street Journal theme
```{r ggthemes2, fig.height=4}
my_plot+theme_wsj(base_size=18)
```
## Having some fun with ggplot2
With ggplot2 you can create your own theme, and save it if you want to reuse it later.
```{r ggplot-themes}
my_plot+
  # Polar coordinates!
  coord_polar()+
  theme(legend.background=element_rect(fill="pink"), text=element_text(size=24))
```
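One possible way to reuse a theme is to store it in a variable and add it to any plot. The sketch below assumes a hypothetical theme called `my_theme` and is self-contained, so it does not depend on `my_plot`.

```{r theme-reuse-sketch}
library(ggplot2)
# Hypothetical reusable theme: black/white base with the legend at the bottom
my_theme <- theme_bw(base_size=18)+
  theme(legend.position="bottom")
# Add the stored theme to a plot like any other component
p_themed <- ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, color=Species))+
  geom_point()+
  my_theme
p_themed
```

Calling `theme_set(my_theme)` would instead make it the default for all subsequent plots in the session.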
Please look at the [documentation](http://docs.ggplot2.org/current/theme.html) for more details.
You have no more excuses to create boring graphs!
## References
Here are some good references for mastering ggplot2:
- [R Graphics Cookbook](http://www.amazon.com/dp/1449316956/ref=cm_sw_su_dp?tag=ggplot2-20) by Winston Chang
- [ggplot2 recipes](http://www.cookbook-r.com/Graphs/) by Winston Chang (Free online)
- [ggplot2: Elegant Graphics for Data Analysis (Use R!)](http://www.amazon.com/dp/0387981403/ref=cm_sw_su_dp?tag=ggplot2-20) by Hadley Wickham
Please pay attention to the fonts, font sizes and colors you use. You may want to look at:
- [Colorbrewer](http://colorbrewer2.org/), palettes available through the `RColorBrewer` package. `ggplot2` provides options to use Colorbrewer palettes.
- [Using your favorite fonts in R](http://blog.revolutionanalytics.com/2012/09/how-to-use-your-favorite-fonts-in-r-charts.html)
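
As a minimal sketch of the Colorbrewer option mentioned above, `scale_color_brewer` applies a named palette to a discrete color aesthetic. The choice of `"Set2"` and the object name `p_brewer` are just for illustration.

```{r colorbrewer-sketch}
library(ggplot2)
# Color the three species with the qualitative "Set2" Colorbrewer palette
p_brewer <- ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, color=Species))+
  geom_point()+
  scale_color_brewer(palette="Set2")
p_brewer
```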