# A language made for data analysis
::: {.goals}
- The most important data types and data structures
- Assigning data to variables
- Writing comments
- Working with vectors
- Creating data frames
:::
R is a programming language that supports many different programming paradigms, but to get
started with data analysis, it's enough to understand R's basic data types (numeric,
logical, and character) and two data structures, vectors and data frames. These are
made for data manipulation and analysis. Together with the many convenience functions
that R provides, they allow you to do data analysis without worrying much about things
that are fundamental in other programming languages, like loops and conditionals. As you
become more familiar with R, you can add those to your repertoire, too, and utilize R's
full power and flexibility.
::: {.more}
If you are a seasoned programmer, I recommend skipping this chapter and reading
*Advanced R*, chapters 2-4, instead (<https://adv-r.hadley.nz/vectors-chap.html>). These provide a
much more thorough overview of R's data types and data structures.
:::
## Basic Data Types
The [**data type**](https://en.wikipedia.org/wiki/Data_type) of a piece of data determines
what operations you can perform with it. The bread and butter of data types in R are
*numeric*, *logical*, and *character*.
#### Numeric {.unnumbered}
Numeric data can be integer, decimal[^made-for-data-1], or complex, but usually the
difference does not matter. Sometimes, R might display a number in scientific notation,
e.g., `1e+23`. To the uninitiated, this can look cryptic and a bit scary. But don't
worry! It's just a very large or very small number written in a more compact form,
and if you want, you can [change this
behavior](https://statisticsglobe.com/disable-exponential-scientific-notation-in-r).
[^made-for-data-1]: Decimal numbers are also called floating point numbers or "floats" for
short. Note that R uses periods `.` to designate the decimal places, not commas `,` as
you might be used to, depending on where you are from.
```{r, eval=FALSE}
1e3 + 230 + 4L + 0.5 + .06 + 7i
```
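If you are unsure which kind of number you are dealing with, or how it will be printed, a
quick check along these lines may help. This is only a sketch using the base R functions
`class()` and `format()`; the numbers are arbitrary examples.

```{r}
class(1e3)    # "numeric": a decimal (floating point) number
class(4L)     # "integer": the L suffix marks a whole number
class(7i)     # "complex"

format(1e+05, scientific = FALSE)  # "100000": written out in full
format(1e+05, scientific = TRUE)   # "1e+05": compact scientific notation
```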
#### Logical {.unnumbered}
Logical data (also known as Boolean data) can take only two values, `TRUE` or `FALSE` (or
their shorthands, `T` and `F`). These are often used to represent binary categorical
variables, as function arguments, or as intermediate steps when selecting and filtering
data.
```{r, eval=FALSE}
(TRUE || FALSE) != (F == T)
```
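Logical values most often arise as the result of a comparison. The lines below are a small
illustration with arbitrary numbers, not an exhaustive list of R's logical operators.

```{r}
5 > 3                  # TRUE: comparisons produce logical values
"a" == "b"             # FALSE
!TRUE                  # FALSE: ! negates a logical value
(5 > 3) && (2 > 3)     # FALSE: both sides would have to be TRUE
(5 > 3) || (2 > 3)     # TRUE: one TRUE side is enough
```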
#### Character {.unnumbered}
Also known as *strings*. Characters are the most flexible data type. Almost anything you
type becomes a character string if you surround it with `""` or `''`.
```{r, eval=FALSE}
"Hello World!"
```
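To get a feel for what you can do with character strings, you could try a few base R
functions such as `nchar()`, `toupper()`, and `paste()`. The strings below are just
examples.

```{r}
class("Hello World!")    # "character"
nchar("Hello World!")    # 12: the space and the "!" count, too
toupper("Hello World!")  # "HELLO WORLD!"
paste("Hello", "R")      # "Hello R": paste() joins strings with a space
class("42")              # "character": a quoted number is text, not a number
```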
## Assignment
If you want to store data for later use you can assign it to a
**variable**[^made-for-data-2] with the **assignment operator** `<-` (use the keyboard
shortcut <kbd>Alt/Option</kbd>+<kbd>-</kbd> to save yourself some hand gymnastics). Once
assigned, you can use the variable's name to represent its content in other operations.
Typing a variable name by itself will print the variable's content.
[^made-for-data-2]: In official R terminology, you are assigning values to **objects**,
but people who have experience with other programming languages sometimes find the
term object confusing. Therefore, I will use the word *variable* in this book.
```{r}
a <- 1
b <- 9
c <- a + b
c
```
Note how newly created variables appear in the **Environment panel** in RStudio. When you
are working with R, you are working in a **session**. The session provides an
**environment** in which data is stored, and the variable name allows you to retrieve the
data again. Until you explicitly delete the variable or end the session, the data will be
around for you to use.
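The life cycle of a variable can be sketched with the base R functions `exists()`, `ls()`,
and `rm()`. The variable name `answer` is made up for this illustration.

```{r}
answer <- 42          # `answer` is a made-up variable name
exists("answer")      # TRUE: the variable is now stored in the environment
ls()                  # lists the names of all variables in the environment
rm(answer)            # explicitly delete the variable
exists("answer")      # FALSE: the data is gone
```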
### Variable Names
There are some rules for variable names: They must start with a letter and contain only
the letters of the English alphabet, numbers, underscores `_`, and dots
`.`.[^made-for-data-3] Hyphens `-` are not allowed, because R reads them as a minus sign.
For long variable names, [it is
recommended](https://style.tidyverse.org/syntax.html#object-names) to use **snake case**,
which looks like this: `my_firssst_variable` 🐍. Also note that R is case-sensitive.
[^made-for-data-3]: Unlike in other programming languages, dots have no special function
or meaning in R and can be used in variable names.
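If you are not sure whether a name is valid, one way to check is the base R function
`make.names()`, which turns arbitrary strings into valid names. A string that comes back
unchanged was already valid. The candidate names below are made up for this sketch.

```{r}
candidates <- c("my_var", "my.var", "myVar2", "my-var", "2fast")
candidates == make.names(candidates)  # TRUE TRUE TRUE FALSE FALSE
make.names(candidates)                # invalid names are repaired: "my.var", "X2fast"
```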
::: {.exercise}
What would be the result of running the code below? Think about it first and then run it
to check your answer.
```{r eval=FALSE}
a <- 10
a + b # What will be the result of a + b?
c # What has happened to c?
C # Why is this not working?
```
:::
::: {.more}
**What's with the arrow, though?**
Many other programming languages use `=` as the assignment operator. You can, for the most
part, also use `=` for assignment in R --- just be aware that it will let everyone know
that [you are not a true believer](https://en.wikipedia.org/wiki/Shibboleth).
You can also use `->` to assign from left to right. This can come in handy when you are
messing around in the console, but it may not be a good idea to use it too much in your
scripts, because it can make your code harder to read:
```{r eval=FALSE}
"Never look back!" -> x
```
Finally, there's `<<-`, which allows assignment to variables in an enclosing environment.
This can be useful if you need to update *global variables* from within a function.
However, this is far beyond the scope of this intro, and chances are you will never
encounter a `<<-` in the wild.
You can read more about R's assignment operators
[here](https://colinfay.me/r-assignment/).
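For the curious, here is a minimal sketch of what `<<-` does. The names `counter` and
`increment` are made up for this illustration.

```{r}
counter <- 0                 # a global variable

increment <- function() {
  counter <<- counter + 1    # `<<-` updates `counter` outside the function
}

increment()
increment()
counter                      # 2
```

With a regular `<-` inside the function, R would create a new local variable instead and
the global `counter` would stay at 0.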
:::
## Comments
As you can see in the example above, you can use `#` to leave **comments** in your script.
Everything on the right side of the `#` will be ignored by R. You can quickly "comment
out" multiple lines at the same time by going to `Code ▸ Comment/Uncomment Lines` or
pressing <kbd>Ctrl</kbd>/<kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd>. Learn to write
good comments and write them generously -- your collaborators and future self will
thank you for it! Sarah Drasner's presentation [*The Art of Code
Comments*](https://www.youtube.com/watch?v=yhF7OmuIILc) is a very entertaining introduction
to good comment writing.
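As a rough guideline, a helpful comment explains *why* the code does something rather than
repeating *what* it does. The variable names and values below are hypothetical.

```{r}
# The timing data was recorded in milliseconds, but we report seconds,
# so we convert once here instead of in every later step.
reaction_ms <- c(512, 430, 688)   # hypothetical example values
reaction_s <- reaction_ms / 1000
reaction_s
```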
## Vectors
In computer programming, [**data
structures**](https://en.wikipedia.org/wiki/Data_structure) are a way to organize and
process particular types of data efficiently. When working with data in R, you will
encounter two important data structures: *vectors* and *data frames*. Vectors are made for
storing values *of the same data type* (numerical, character, logical, ...) in a
particular order.
### Creating vectors
To create a vector, you can use the "combine" function, `c()`. There are also the `:`
operator and the `seq()` function, which can help you create a vector that contains a
sequence of numbers:
```{r, results="hold"}
c(1,2,3,4) # create a vector manually
0:4 # a vector of all integer numbers from 0 to 4
4:0 # it also works backwards
seq(0, 2, 0.5) # numbers 0.0 to 2.0 in 0.5 increments
```
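A few more variations that may come in handy, shown here as a sketch: `seq()` also accepts
the named arguments `by` and `length.out`, and `rep()` repeats values.

```{r}
seq(0, 1, length.out = 5)    # 5 evenly spaced numbers: 0.00 0.25 0.50 0.75 1.00
seq(10, 0, by = -5)          # count down in steps of 5: 10 5 0
rep(c("a", "b"), times = 2)  # repeat a vector: "a" "b" "a" "b"
length(0:4)                  # 5: the number of elements in a vector
```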
### Working with vectors
R is really good at working with vectors! See how easy it is to multiply all numbers by a
scalar, get the mean, or sum all elements --- all without using a loop or a temporary
variable:
```{r, results = "hold"}
x <- c(1,2,3,4) # Create vector and assign to variable `x`
x * 10
mean(x)
sum(x)
```
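The same idea extends to operations between two vectors of the same length, which R
carries out element by element. In this sketch, `y` is a second, made-up vector.

```{r}
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)   # `y` is a made-up second vector

x + y      # element-wise addition: 11 22 33 44
x * y      # element-wise multiplication: 10 40 90 160
x > 2      # element-wise comparison: FALSE FALSE TRUE TRUE
max(x)     # 4
rev(x)     # 4 3 2 1
```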
### Accessing elements of a vector
You can access specific positions of a vector using square brackets. In R, indices start
at 1, that is, index 1 will return the first element of the vector.
```{r}
x[1]
```
You can also use negative indices to exclude elements. For example, index `-1` will return
everything but the first element of a vector.
```{r}
x[-1]
```
Use `:` within brackets to access a range of indices.
```{r}
x[2:3]
```
You can also use a vector of indices to index into another vector in order to select
specific elements:
```{r}
x[c(1,3)]
```
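Indexing also works with a logical vector, which is how logical values end up being used
for filtering data. The following sketch assumes `x` still holds the numbers 1 to 4 and
recreates it so that the example runs on its own.

```{r}
x <- c(1, 2, 3, 4)
x > 2            # FALSE FALSE TRUE TRUE
x[x > 2]         # 3 4: keep only the elements where the condition is TRUE
x[x %% 2 == 0]   # 2 4: keep only the even numbers
sum(x > 2)       # 2: TRUE counts as 1, so this counts the matches
which(x > 2)     # 3 4: the positions of the matching elements
```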
You can also use the bracket syntax to replace individual elements:
```{r}
x[2:3] <- 0 # replace 2nd and 3rd element with 0
x[2:3] <- c(7,8) # replace 2nd and 3rd element with 7 and 8, respectively
```
### Named vectors
Values in a vector can be named. This comes in handy, for example, if you need a simple
[**key-value store**](https://en.wikipedia.org/wiki/Key%E2%80%93value_database):
```{r}
fav_food <- c(
  Garfield = "Lasagna",
  Popeye = "Spinach",
  Obelix = "Roast Wild Boar"
)
fav_food["Popeye"]
```
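Other things you might want to do with such a key-value store are listing the keys, looking
up several keys at once, or adding a new entry. The sketch below recreates `fav_food` so
that it runs on its own; the `Snoopy` entry is a made-up example.

```{r}
fav_food <- c(Garfield = "Lasagna", Popeye = "Spinach", Obelix = "Roast Wild Boar")

names(fav_food)                     # "Garfield" "Popeye" "Obelix"
fav_food[c("Obelix", "Garfield")]   # look up several keys at once
fav_food["Snoopy"] <- "Pizza"       # add a new key (made-up example entry)
unname(fav_food["Snoopy"])          # "Pizza": the value without its name
```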
::: {.more}
**Implicit Coercion**
Vectors can only contain a *single* data type. If you are trying to mix data types, R will
quietly (without throwing an error) *coerce* (that is, convert) all data types to a data
type that can represent all elements. This coercion follows a certain hierarchy:
- logical becomes numerical (`FALSE = 0` and `TRUE = 1`)
- numerical becomes character
The coercion only goes in one direction. For example, character strings that happen to
contain numbers will nevertheless be treated as characters. Consider these examples:
```{r}
c(FALSE, TRUE, 2, 3) # logical becomes numerical
c(1, 2, "3") # numerical becomes character
```
This also happens if you update a vector after it has already been created.
```{r}
x <- 1:3
x[1] <- TRUE # this logical will become numerical
x
```
Vectors cannot contain other vectors, so nested vectors automatically get unnested, and as
they get unnested, the elements will be coerced to a common data type:
```{r, results='hold'}
c(1, c(2,3,4))
c(c(TRUE, 0), c(TRUE, "0")) # logical and numerical become character
```
This behavior can sometimes be used for good, but in general it's best to avoid it. Make
coercion explicit. For example, if you want to convert a character representation of a
number to numeric, do so with the `as.numeric()` function. For more complex applications
consider static typing with <https://github.com/moodymudskipper/typed>. If you really need
to store a sequence of mixed data types, you will have to use another data structure
called a *list*.
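Explicit coercion could look like the following sketch. It uses the base R functions
`as.numeric()`, `as.character()`, and `as.logical()`; the variable name `mixed` is made up.

```{r}
as.numeric("3") + 1                    # 4: the string is converted explicitly
as.character(c(1, 2, 3))               # "1" "2" "3"
as.logical(c(0, 1))                    # FALSE TRUE
suppressWarnings(as.numeric("three"))  # NA: this text cannot be read as a number

mixed <- list(1, "two", TRUE)          # a list keeps each element's data type
class(mixed[[2]])                      # "character"
class(mixed[[3]])                      # "logical"
```

Without `suppressWarnings()`, R would additionally warn you that the conversion produced
an `NA`.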
:::
## Data Frames & Tibbles
Usually, your research data will be more complex than just a single sequence of values.
For this, R has data frames. Data frames are two-dimensional like a spreadsheet -- with
*rows* usually containing different observations and *columns* usually containing
different variables.
You may see the word [**Tibble**](https://tibble.tidyverse.org/) pop up occasionally in
the examples in this book. A Tibble is a modern implementation of data frames. Tibbles are
backwards-compatible with data frames and for the most part interchangeable. Throughout
this book, we will use the word *data frame* to mean both of them.
### Creating data frames
You will often get a data frame by reading data from a tabular data file like a
spreadsheet, but you can also make one yourself by combining vectors with the
`data.frame()` function:
```{r}
cool_numbers <- data.frame(
  "number" = c(1,13,23,42),
  "isPrime" = c(FALSE, TRUE, TRUE, FALSE),
  "oneTwo" = c(1, 2)
)
cool_numbers
```
You don't have to name the columns in your data frame at all, but it is good practice to
give them names that are easy to remember and easy to type. Rows of a data frame can be
named, too, but row names can be complicated to work with so it's better to include them
as an additional column.
Note how the vector `c(1,2)` in the `oneTwo` column was **recycled** to match the
length of the other two columns. This works as long as the length of the longest column is
a multiple of the shorter vector's length. Otherwise, R will produce an error.
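To see both cases side by side, you could try something like this sketch. The column
names `id` and `group` are made up, and `tryCatch()` is only used so that the error
message is captured instead of stopping the script.

```{r}
# Works: 4 is a multiple of 2, so c("a", "b") is recycled twice
data.frame(id = 1:4, group = c("a", "b"))

# Fails: 4 is not a multiple of 3, so R refuses to recycle
result <- tryCatch(
  data.frame(id = 1:4, group = c("a", "b", "c")),
  error = function(e) conditionMessage(e)
)
result   # the captured error message
```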
### Data frame operations
A quick way to access a single column of a data frame is with the `$` operator.
```{r}
cool_numbers$number
```
Because columns in a data frame are vectors, everything that can be done with a vector can
be done with a data frame column!
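For instance, the vector operations from earlier in this chapter carry over directly. The
sketch below recreates `cool_numbers` so that it runs on its own; the column name `doubled`
is made up.

```{r}
cool_numbers <- data.frame(
  number = c(1, 13, 23, 42),
  isPrime = c(FALSE, TRUE, TRUE, FALSE),
  oneTwo = c(1, 2)
)

mean(cool_numbers$number)                        # 19.75
cool_numbers$number[2]                           # 13: index a column like any vector
cool_numbers$number[cool_numbers$isPrime]        # 13 23: filter with a logical column
cool_numbers$doubled <- cool_numbers$number * 2  # add a new column (made-up name)
ncol(cool_numbers)                               # 4 columns, including `doubled`
```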
::: {.more}
**Everything is a vector**
In fact, all "atomic" data types in R (numeric, character, logical) are *always* stored
as vectors -- even a single number is actually a numerical vector of length one! This is a
really powerful idea and the source of much of R's strength!
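You can convince yourself of this with a few quick checks:

```{r}
length(42)        # 1: a single number is a vector of length one
is.vector(42)     # TRUE
42[1]             # 42: you can even index into it
length("Hello")   # 1: one string, not five letters
```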
:::
::: {.exercise}
Make a new data frame `my_cool_numbers` with a column `a` that contains all numbers from 1
to 100 and column `b` that contains your favorite number 100 times.
::: {.answer}
``` {.r}
my_cool_numbers <- data.frame(
  "a" = 1:100,
  "b" = 42
)
```
:::
Since data frame columns are vectors, we can use the same operations on them. Calculate
the standard deviation of all numbers in column `a` in our `my_cool_numbers` data frame!
Use the RStudio help to find the right function for it.
::: {.answer}
``` {.r}
sd(my_cool_numbers$a)
```
:::
:::
As with vectors, R comes out of the box with many more commands to create and modify data
frames, select columns, and filter rows. You will eventually need to get familiar with
these. However, in this book we will skip them in favor of the more streamlined and
beginner-friendly functions from the *Tidyverse* -- more about that later.
## Other Data Structures
R has a couple of other data structures that you will eventually encounter. I don't want
to dwell on them for too long, because they will not play a role in the rest of this book. If
you are interested, you can read more [here](http://adv-r.had.co.nz/). Additional R data
structures include ...
- **Factor** -- Factors are numerical vectors with a fixed set of possible values
(levels), whereby each level can be assigned a label. The levels may be ordered or
unordered. They are most useful as a data structure for ordinal and categorical
variables in statistical models. Unfortunately, they are often also a point of
confusion for R novices.
- **List** -- Lists are similar to vectors, but they can contain a mix of data types and
data structures, including other lists.
- **Matrix** -- Matrices are multidimensional vectors. Like data frames, they have rows
and columns. Unlike data frames, all elements in a matrix must be of the same data
type. Unless you are implementing statistical methods yourself and want to use R's
matrix algebra functions you probably won't have to work with matrices directly.
- **Object** -- Objects can be used to organize functionality. For instance, a
statistical model may be an object that has various attributes and associated
functions. R even has several object systems, among them S3 and S4, which are built into
R, and R6, which is provided by an add-on package.
- [**Environment**](https://adv-r.hadley.nz/environments.html) -- Do you remember when I
said that your R session provides an environment in which data is stored? It turns out
that environments are yet another data structure in R. An environment is a data
structure that contains other named R data structures. Understanding environments is
central to a deep understanding of R, but as a beginner it's nothing you need to worry
about.
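To give you a first impression, here is a rough sketch of how factors, lists, and matrices
can be created with base R. All variable names and values are made up for illustration.

```{r}
# Factor: a categorical variable with a fixed, ordered set of levels
size <- factor(c("small", "large", "small"),
               levels = c("small", "large"), ordered = TRUE)
levels(size)          # "small" "large"
as.integer(size)      # 1 2 1: the underlying integer codes

# List: elements can have different data types and structures
stuff <- list(name = "Obelix", weights = c(1, 2, 3), nested = list(TRUE))
stuff$weights[2]      # 2

# Matrix: rows and columns, but a single data type
m <- matrix(1:6, nrow = 2)
dim(m)                # 2 3: two rows, three columns
m[2, 3]               # 6: row 2, column 3
```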