---
title: 'Module 2: Reproducible Research and Data'
author: "Brian P Steves"
date: "`r Sys.Date()`"
output:
  html_document:
    keep_md: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
Module 2: Reproducible Research and Data
====================
## 1 Course Description and Objectives
This course is designed to explore the concept of reproducible research for environmental sciences. We will cover proper data storage and the use of the R statistical computing environment to manipulate that data for statistical and graphical analysis. A background in a computer language and statistics is a plus but not required.
### 1.1 What is "Reproducible Research"?
In a very general sense, reproducible research is one of the fundamental aims of science. When we write a paper, we include methods and results sections so that other researchers can, at least in principle, reproduce and confirm our results.
Lately, the term "Reproducible Research" has come to mean "Reproducible Data Analysis": every step of the data analysis (often in the form of code) is provided, so that anyone can:
* Import, manipulate and transform raw data and meta-data into processed data.
* Run statistical analysis.
* Create plots and tables.
* Incorporate analysis and plots into a report.
### 1.2 Benefits of "Reproducible Research"
Many journals currently have data sharing and code sharing policies in place, and that number has been increasing in recent years (Stodden et al. 2013).
Even if you aren't planning on publishing in a journal with such policies, there is much to be said for being able to reproduce your own data analysis with as little effort as possible.
#### 1.2.1 Example of non-reproducible data analysis
1. Enter raw data into Excel
2. Hand edit data for QA/QC
3. Copy and paste subsets of the data into a new worksheet as "processed data"
4. Do analysis in Excel or export/import data into other analysis software (R, SAS, Sigmaplot, etc..)
5. Point-n-click your way through your analysis
6. Save formatted plots and table output
7. Insert formatted plots and tables into Word Document
**What happens if you obtain new data or find an error in your raw data?**
You start over from step 1 or 2 and potentially spend just as much time as you did the first time through.
#### 1.2.2 Example of reproducible data analysis
1. Enter raw data into database, Excel, or CSV files
2. Use R to assist QA/QC
3. Use R to process your data for further analysis
4. Use R to perform statistical and graphical analysis
5. Use an R Markdown script to incorporate plots and tables into the final report
6. Use Pandoc to convert the R Markdown into a final report in HTML, Word, PDF, etc.
7. Create a master script that combines all of these steps into one (a sketch follows below).
**What happens now if you get more data or need to change something?**
You edit your data and scripts as needed and rerun the master script. Yes, you spend a bit more time setting things up initially, but rerunning the analysis afterward is trivial.
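As a minimal sketch of what such a master script might look like (the sourced file names here are hypothetical placeholders for your own project):
```{r master_script, eval=FALSE}
# hypothetical master script: the file names below are placeholders
source("01_qaqc.R")       # QA/QC the raw data
source("02_process.R")    # transform the raw data into processed data
source("03_analysis.R")   # run the statistics and build plots and tables
rmarkdown::render("report.Rmd")  # knit the final report (requires the rmarkdown package)
```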
### 1.3 CRAN Task View on Reproducible Research
For an overview of R packages for reproducible research, see the following:
[CRAN Task View: Reproducible Research](http://cran.r-project.org/web/views/ReproducibleResearch.html)
## 2 Data Storage
### 2.1 Spreadsheets
Microsoft Excel and other spreadsheet programs (such as Google Sheets or OpenOffice) are probably the most common tools for data entry and data sharing. However, many people use spreadsheets in a way that makes the data practically unusable by other software.
![Example of a bad spreadsheet design](figure/badspreadsheet.jpg)
In this example the spreadsheet is formatted like a data entry sheet and/or reporting table, which is fine unless you need to share the raw data with other software.
A proper data spreadsheet should look more like a database table, with unique observations as rows and variables as columns, and with the column headers naming those variables.
One other complication of using spreadsheets in this fashion is that there can be incompatibilities between versions of the spreadsheet software.
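If a worksheet is laid out properly (one header row, one observation per row), pulling it into R is straightforward. Here is a minimal sketch, assuming the readxl package is installed; the file name is a hypothetical placeholder:
```{r read_excel_sketch, eval=FALSE}
library(readxl)
# "field_data.xlsx" is a placeholder for your own spreadsheet file
field_data <- read_excel("field_data.xlsx", sheet = 1)
head(field_data)
```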
### 2.2 Relational Databases (Microsoft Access, MySQL, PostgreSQL, etc.)
Sometimes data are too big and too complicated to store in a simple spreadsheet. When this happens, databases are needed. Relational databases are collections of data tables in which the relationships between the tables are formally described. For example, site information might be in one table called "sites" while sample information might be in a table called "samples". Each site might have multiple samples, and each sample belongs to one particular site. This is called a "one-to-many" relationship. Each record in the samples table would carry a "site id" of some sort that defines how that sample record is related to the sites table.
Microsoft Access is great for creating quick desktop-based data entry forms and has a nice visual method of creating data queries.
Other databases like MySQL and PostgreSQL are a bit more powerful and are often used in conjunction with a webserver to serve datasets online.
There are packages in R, such as DBI together with a database-specific driver, that allow you to access data in relational databases.
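As a minimal sketch, assuming the DBI and RSQLite packages are installed (this uses a throwaway in-memory SQLite database and a made-up sites table):
```{r db_sketch, eval=FALSE}
library(DBI)
# connect to a temporary in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# write a small data frame to the database, then query it back with SQL
dbWriteTable(con, "sites", data.frame(siteID = 1:2, site = c("Pond A", "Pond B")))
dbGetQuery(con, "SELECT * FROM sites WHERE siteID = 1")
dbDisconnect(con)
```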
### 2.3 CSV (comma-separated values)
Almost every spreadsheet and database can export properly formatted data to CSV files. Likewise, just about every program used for data analysis can import CSV files. Because CSV files are simple text files, they are the *lingua franca* of data formats. However, large amounts of data in text format can take a while to load.
The following code shows how long it takes to load a 59.5 MB CSV file with 500,000 rows of data and 13 variables.
```{r import_csv}
csv_speed <- system.time(data <- read.csv("H:/Downloads/500000-Sales-Records/500000 Sales Records.csv"))
print(csv_speed)
```
Here I've wrapped the `system.time()` function around the `read.csv()` call that loads the data file so that I can time it. This can be useful if you ever want to know how long a particular part of your analysis takes.
### 2.4 Binary Files (e.g. `.Rdata`)
One option to speed this up in R is to save the data in a binary `.Rdata` file. This format compresses the data in such a way that it can be reopened many times faster than the original CSV.
```{r open_binary}
# save the data to a binary .Rdata file, then time how long it takes to load it back
save(data, file="data/500k_Sales_Records.Rdata")
binary_speed <- system.time(load(file="data/500k_Sales_Records.Rdata"))
print(binary_speed)
```
The loading of the binary data file was over **`r round(as.numeric(csv_speed[3]/binary_speed[3]),0)`** times faster than loading the csv file!
### 2.5 R Objects
``` {r fundamental_structures_table, results='asis', echo=FALSE}
fund_strs <- data.frame(Dimensions = c("1d", "2d", "3d"), Homogeneous = c("Atomic Vector", "Matrix", "Array"), Heterogeneous = c("List", "Data Frame", ""))
knitr::kable(fund_strs, row.names=FALSE)
```
#### Vectors
The most basic data structure is the vector (1d, homogeneous). Vectors are generally created using `c()` to combine values of the same type.
A vector of numbers
```{r number_vector, echo=FALSE}
n <- 1:12
print(n)
```
A vector of characters
```{r letter_vector, echo=FALSE}
l<-letters[1:12]
print(l)
```
A vector of logical values (booleans)
```{r boolean_vector, echo=FALSE}
b <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
print(b)
```
To access values in a vector we use brackets (`[]`).
To access the third element of each of our previous vectors:
```{r}
n[3]
l[3]
b[3]
```
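Brackets accept more than a single position. A few other common indexing patterns, using the vectors defined above:
```{r vector_indexing}
# a range of positions
n[1:3]
# everything except the first element (negative index)
l[-1]
# logical indexing: the values of n where b is TRUE
n[b]
```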
#### Matrices
Matrices are two-dimensional structures of a single data type.
A matrix of numbers
```{r number_matrix, echo=FALSE}
mn<-matrix(1:24, nrow=4)
print(mn)
```
A matrix of characters
```{r letter_matrix, echo=FALSE}
ml<-matrix(letters[1:24], nrow=4)
print(ml)
```
To access values in a matrix we use `[]` with two indices inside, separated by a comma. The first index is for the first dimension (rows) and the second is for the second dimension (columns).
```{r}
# first element row = 1 , column = 1
ml[1,1]
# entire first row
ml[1,]
# entire first column
ml[,1]
```
#### Arrays
An array is like a matrix but with more dimensions.
```{r number_array, echo=FALSE}
a<-array(1:24, c(3,4,2))
print(a)
```
Accessing values in an array is similar to a matrix, but you add one more index inside the brackets for each additional dimension: `[rows, columns, extra dimension 1, extra dimension 2, etc.]`
```{r}
# first element (row = 1, column = 1, extra dimension element = 1)
a[1,1,1]
# second column of second extra dimension element
a[,2,2]
# third row of first extra dimension element
a[3,,1]
```
#### Data Frames
Most data tables are represented as data frames in R. They are two-dimensional like a matrix, but allow for multiple data types within a single structure. However, each column of a data frame has to be a consistent type, and all columns must be the same length.
``` {r dataframe, echo=FALSE}
df<-data.frame(l,n,b)
print(df)
```
Accessing elements of a data frame can be done the same way as a matrix, using `[row, column]` notation. Data frame columns can also be referred to by name, using `$` to separate the data frame name and the column name.
```{r}
# first element of the data frame
df[1,1]
# first row of the data frame
df[1,]
# first column of the data frame
df[,1]
# or access the first column by name "l"
df$l
# first value of the "l" column
df$l[1]
```
#### Lists
You can think of lists as multi-dimensional data frames, or as an ordered collection of arbitrary objects. They take a bit of effort to wrap your head around, but they can be powerful once you do. Like data frames, lists can contain a heterogeneous set of data types. However, unlike data frames, the elements of a list don't have to be the same size.
``` {r list, echo=FALSE}
lst<-list(l,n,b, df)
print(lst)
```
To access values in a list, you first access the object in the list using `[[]]` and then, depending on that object's type, you access its values accordingly:
```{r}
# the first object in lst
lst[[1]]
# the third item of the first object
lst[[1]][3]
# the 4th object
lst[[4]]
# the fourth value in the first column of the 4th object
lst[[4]][4,1]
# or
lst[[4]]$l[4]
```
#### Numeric
Numeric variables are simply numbers with a decimal point (stored as double-precision floating-point values).
#### Integer
Numbers without a decimal point. Note that in R a plain literal like `1` is numeric (double) by default; use the `L` suffix (`1L`) or `as.integer()` to get an integer.
#### Character
Strings.
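A quick way to check these three types:
```{r type_check}
typeof(1)    # "double" (numeric)
typeof(1L)   # "integer"
typeof("a")  # "character"
```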
#### Date
Dates can be tricky, and there are a variety of R functions and packages designed to help deal with them. Often dates are imported as character strings and need to be converted to the Date class if you want to treat them as dates. This matters for timelines and for grouping by year, month, day, etc.
```{r, error=TRUE}
today <- Sys.Date()
format(today, "%d %B %Y")
months(today)
weekdays(today)
mydates <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
months(mydates)
mydates2<-as.Date(mydates, "%d%b%Y")
months(mydates2)
```
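Once values are proper Date objects, date arithmetic and sorting behave sensibly. A small example using the converted vector from above:
```{r date_math}
# difference in days between the last and first dates
mydates2[4] - mydates2[1]
# dates sort chronologically, not alphabetically
sort(mydates2)
```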
More on dates later..
#### Logical
Simply TRUE or FALSE.
#### Factors
Factors as a data type in R serve two main purposes:
1. Factors reduce memory usage by coding large repeating portions of the data and replacing them with integers.
2. We often need factors in our statistics anyway. For example, in an ANOVA we might want to compare a dependent variable like growth among treatments. These treatments are "factors". By identifying this variable as a factor, R will treat it as such in the analysis.
``` {r factors, echo=TRUE}
#list of treatments by sample
treatments<-c(1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,4,4,4,4)
print(treatments)
summary(treatments)
# This is Min, Median, Max, etc.. of the numeric values of the treatment numbers. Not what we want.
#Turn treatments into a factor
treatments<-factor(treatments)
print(treatments)
summary(treatments)
# now the summary of the treatments is the count by treatment. Makes much more sense.
```
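You can also control the set of levels yourself, which matters when a treatment is possible but unobserved, or when the levels should appear in a particular order:
```{r factor_levels}
# declare all possible treatments, including one (5) with no observations
treatments2 <- factor(treatments, levels = 1:5)
summary(treatments2)
```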
### 2.6 NULL and NA
There is often some confusion between NA and NULL values in R.
NULL represents a completely null object (i.e. an object with no value at all). For example, asking a list for an element that doesn't exist returns NULL.
NA represents a missing value within an object. NA values appear within an object as `NA`.
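A short example of the difference:
```{r na_vs_null}
v <- c(1, NA, 3)
is.na(v)         # which values are missing
lst2 <- list(a = 1)
lst2$b           # asking for a list element that doesn't exist returns NULL
is.null(lst2$b)
```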
### 2.7 Data Format (Wide vs Narrow)
#### 2.7.1 Wide Data Format
Most people are familiar with the "wide" data format where each row represents an observation and each column represents all the various attributes for that observation. In the following example, each species has its own column of count data.
```{r wide_table, results='asis', echo=FALSE}
# read the example species-count data; knitr::kable() renders it as a formatted table
spdat<-read.csv("data/wk1spdat.csv")
knitr::kable(spdat, row.names=FALSE)
```
#### 2.7.2 Narrow Data Format
Another representation of this data is called the "narrow" data format. In this next example, species name becomes a single data field called "Species" and a new column called "Count" is created. Also in this example, counts of zero for species not present are dropped.
```{r narrow_table, results='asis', echo=FALSE}
# loading a package for reshaping data between wide and narrow formats
library(reshape2)
# converting wide format to narrow format
spdat2<-melt(spdat, id=c("Site", "Date"), variable.name="Species", value.name="Count")
# removing data rows where species counts are 0
spdat2<-spdat2[spdat2$Count>0,]
spdat2<-spdat2[with(spdat2, order(Date, Site, Species)),]
knitr::kable(spdat2, row.names=FALSE)
```
#### 2.7.3 Which format is best?
Initially, the wide format is probably the easiest to read. However, the narrow format has many advantages, particularly in this example with species counts:
1. It is easier to transform from narrow to wide than from wide to narrow (see the sketch below)
2. Adding a new species doesn't require a new data column
3. No need to record zeros all the time
4. Species becomes a factor which makes subsetting the data easier (more on this later)
The choice between wide and narrow table structure will depend on the nature of the data being represented. In general, if you have multiple columns that could be represented easily as a single factor and a related value field (like species counts), then narrow is the way to go.
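For example, casting the narrow table back to the wide format is a one-liner with reshape2's `dcast()` (the package is already loaded above); `fill = 0` restores the zero counts that were dropped:
```{r narrow_to_wide}
# one column per species, with the dropped zero counts filled back in
spdat_wide <- dcast(spdat2, Site + Date ~ Species, value.var = "Count", fill = 0)
knitr::kable(spdat_wide, row.names = FALSE)
```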
### 2.8 Normalizing Data Tables
I mentioned earlier that if your data set is big and complex, a database is often needed. The process of deciding how to break up your data into related tables is called "normalization". We do this primarily to organize the data and to reduce redundancy.
Imagine you have sites and sample information. Each site has a name, a latitude and longitude, and several other descriptive fields, and each sample within a site has information on how and when it was sampled, as well as what was found.
The following table includes everything but is unnormalized.
``` {r site_samples_table, results='asis', echo=FALSE, message=FALSE, warning=FALSE}
library(dplyr)
sites<-read.csv("data/wk1sites.csv")
samples<-read.csv("data/wk1samples.csv")
site_samples<-left_join(sites, samples, by="siteID") %>% select(sampleID, site, lake, lat, lon, date, sampler, temp_C, fish)
knitr::kable(site_samples)
```
Notice that all of the site-related information gets repeated for every sample from that site.
Here is that same information normalized into a "sites" table
``` {r sites_table, results='asis', echo=FALSE}
knitr::kable(sites)
```
and a "samples" table
``` {r samples_table, results='asis', echo=FALSE}
knitr::kable(samples)
```
Now the site-related information is only recorded once, and we use a "siteID" to relate the samples table to the sites table.
We can also join these two tables back up based on the "siteID" if, for example, we need the latitude and longitude where a sample was collected.
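Using dplyr (already loaded above), that join looks like this:
```{r rejoin_samples}
# attach the site coordinates to each sample record via the shared siteID key
samples_with_coords <- left_join(samples, sites, by = "siteID") %>%
  select(sampleID, date, site, lat, lon)
knitr::kable(samples_with_coords, row.names = FALSE)
```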
### 2.9 Metadata
Metadata is "data about data". Good metadata answers questions such as:
* Who collected the data?
* What was the purpose of the data?
* How was the data created?
* When was the data created?
* What does each table and data field contain?
The more information you can record, the better.
### Citations
[1] Stodden V, Guo P, Ma Z (2013) Toward Reproducible Computational Research: An Empirical Analysis of Data and Code Policy Adoption by Journals. PLoS ONE 8(6): e67111. doi:10.1371/journal.pone.0067111 [link](http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067111)
## Homework
1. Find at least one dataset you might be interested in working with over the next couple of classes. If you're stumped, try the following data repository:
http://datadryad.org
2. Find or create the metadata required to describe this dataset. Include the who, what, where, when, and how of the dataset, as well as a description of the various data fields themselves.