[
{
"objectID": "index.html",
"href": "index.html",
"title": "R for Data Science (2e)",
"section": "",
"text": "1 тавтай морил\nЭнэ бол “R for Data Science” сэтгүүлийн 2 дахь хэвлэлд зориулагдсан вэбсайт юм. Энэхүү ном нь танд R-ээр өгөгдлийн шинжлэх ухаан хэрхэн хийхийг заах болно: Та өөрийн өгөгдлийг R-д хэрхэн оруулах, түүнийг хамгийн хэрэгцээтэй бүтцэд оруулах, хувиргах, дүрслэн харуулах аргад суралцах болно.\nЭнэ номноос та мэдээллийн шинжлэх ухааны ур чадварын практикийг олох болно. Химийн эмч туршилтын хоолойг хэрхэн цэвэрлэж, лаборатори нөөцлөх талаар сурдаг шиг та өгөгдлийг хэрхэн цэвэрлэж, график зурах талаар сурах болно. Эдгээр нь өгөгдлийн шинжлэх ухааныг бий болгох боломжийг олгодог ур чадварууд бөгөөд эндээс та R-тэй эдгээр зүйл бүрийг хийх шилдэг туршлагуудыг олох болно. Та цаг хэмнэхийн тулд график, бичиг үсэгт тайлагдсан програмчлал, хуулбарлах судалгааны дүрмийг хэрхэн ашиглах талаар суралцах болно. Мөн та мэдээлэл солилцох, дүрслэх, судлах явцад нээлтийг хөнгөвчлөх танин мэдэхүйн нөөцийг хэрхэн удирдах талаар суралцах болно.\nЭнэ вэб сайт нь CC BY-NC-ND 3.0 лицензийн дагуу лицензтэй бөгөөд үргэлж үнэ төлбөргүй байх болно. Хэрэв та номын биет хуулбарыг авахыг хүсвэл [Amazon] (https://www.amazon.com/dp/1492097403?&tag=hadlwick-20) дээр захиалж болно. Хэрэв та энэ номыг үнэ төлбөргүй уншиж байгаад талархаж байгаа бөгөөд буцааж өгөхийг хүсвэл Kākāpō Recovery: kākāpō-д хандив өргөөрэй. /www.youtube.com/watch?v =9T1vfsHYiKY) (R4DS-ийн нүүрэн дээр гардаг) нь шүүмжлэлтэй ханддаг. Шинэ Зеландаас гаралтай ховордсон тоть; ердөө 248 үлдсэн.\nХэрэв та өөр хэлээр ярьдаг бол 1-р хэвлэлд үнэгүй орчуулагдсан орчуулгыг сонирхож магадгүй юм.\nТа https://mine-cetinkaya-rundel.github.io/r4ds-solutions дээрх номноос дасгалын санал болгож буй хариултуудыг олох боломжтой.\nR4DS нь Contributor-ийн ёс зүйн дүрмийг ашигладаг болохыг анхаарна уу. Энэ номонд хувь нэмрээ оруулснаар та түүний нөхцлийг дагаж мөрдөхийг зөвшөөрч байна.",
"crumbs": [
"<span class='chapter-number'>1</span> <span class='chapter-title'>тавтай морил</span>"
]
},
{
"objectID": "index.html#талархал",
"href": "index.html#талархал",
"title": "R for Data Science (2e)",
"section": "1.1 Талархал",
"text": "1.1 Талархал\nR4DS-ийг https://www.netlify.com нээлттэй эхийн программ хангамж болон нийгэмлэгүүдийг дэмжих нэг хэсэг болгон зохион байгуулдаг. sss",
"crumbs": [
"<span class='chapter-number'>1</span> <span class='chapter-title'>тавтай морил</span>"
]
},
{
"objectID": "preface-2e.html",
"href": "preface-2e.html",
"title": "Preface to the second edition",
"section": "",
"text": "Welcome to the second edition of “R for Data Science”! This is a major reworking of the first edition, removing material we no longer think is useful, adding material we wish we included in the first edition, and generally updating the text and code to reflect changes in best practices. We’re also very excited to welcome a new co-author: Mine Çetinkaya-Rundel, a noted data science educator and one of our colleagues at Posit (the company formerly known as RStudio).\nA brief summary of the biggest changes follows:\n\nThe first part of the book has been renamed to “Whole game”. The goal of this section is to give you the rough details of the “whole game” of data science before we dive into the details.\nThe second part of the book is “Visualize”. This part gives data visualization tools and best practices a more thorough coverage compared to the first edition. The best place to get all the details is still the ggplot2 book, but now R4DS covers more of the most important techniques.\nThe third part of the book is now called “Transform” and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room to cover all the details.\nThe fourth part of the book is called “Import”. It’s a new set of chapters that goes beyond reading flat text files to working with spreadsheets, getting data out of databases, working with big data, rectangling hierarchical data, and scraping data from web sites.\nThe “Program” part remains, but has been rewritten from top-to-bottom to focus on the most important parts of function writing and iteration. Function writing now includes details on how to wrap tidyverse functions (dealing with the challenges of tidy evaluation), since this has become much easier and more important over the last few years. We’ve added a new chapter on important base R functions that you’re likely to see in wild-caught R code.\nThe modeling part has been removed. We never had enough room to fully do modelling justice, and there are now much better resources available. We generally recommend using the tidymodels packages and reading Tidy Modeling with R by Max Kuhn and Julia Silge.\nThe “Communicate” part remains, but has been thoroughly updated to feature Quarto instead of R Markdown. This edition of the book has been written in Quarto, and it’s clearly the tool of the future.",
"crumbs": [
"Preface to the second edition"
]
},
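The preface above notes that the "Program" part now covers wrapping tidyverse functions and the challenges of tidy evaluation. As a hedged illustration of what that looks like in practice, here is a minimal sketch using dplyr's embracing operator; the helper name `grouped_mean` and its arguments are hypothetical, invented for this example only.

```r
# A minimal sketch of wrapping tidyverse functions, assuming dplyr >= 1.0.
# `grouped_mean` is a hypothetical helper; embracing ({{ }}) forwards the
# user-supplied columns through tidy evaluation.
library(dplyr)

grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean = mean({{ mean_var }}, na.rm = TRUE), .groups = "drop")
}

# Usage: average mpg by cylinder count, using the built-in mtcars data.
mtcars |> grouped_mean(cyl, mpg)
```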
{
"objectID": "intro.html",
"href": "intro.html",
"title": "Introduction",
"section": "",
"text": "What you will learn\nData science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge. The goal of “R for Data Science” is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly, and to have some fun along the way 😃. After reading this book, you’ll have the tools to tackle a wide variety of data science challenges using the best parts of R.\nData science is a vast field, and there’s no way you can master it all by reading a single book. This book aims to give you a solid foundation in the most important tools and enough knowledge to find the resources to learn more when necessary. Our model of the steps of a typical data science project looks something like Figure 1.\nFigure 1: In our model of the data science process, you start with data import and tidying. Next, you understand your data with an iterative cycle of transforming, visualizing, and modeling. You finish the process by communicating your results to other humans.\nFirst, you must import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API) and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!\nOnce you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with how it is stored. In brief, when your data is tidy, each column is a variable and each row is an observation. Tidy data is important because the consistent structure lets you focus your efforts on answering questions about the data, not fighting to get the data into the right form for different functions.\nOnce you have tidy data, a common next step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling because getting your data in a form that’s natural to work with often feels like a fight!\nOnce you have tidy data with the variables you need, there are two main engines of knowledge generation: visualization and modeling. These have complementary strengths and weaknesses, so any real data analysis will iterate between them many times.\nVisualization is a fundamentally human activity. A good visualization will show you things you did not expect or raise new questions about the data. A good visualization might also hint that you’re asking the wrong question or that you need to collect different data. Visualizations can surprise you, but they don’t scale particularly well because they require a human to interpret them.\nModels are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are fundamentally mathematical or computational tools, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature, a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.\nThe last step of data science is communication, an absolutely critical part of any data analysis project. 
It doesn’t matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.\nSurrounding all these tools is programming. Programming is a cross-cutting tool that you use in nearly every part of a data science project. You don’t need to be an expert programmer to be a successful data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks and solve new problems with greater ease.\nYou’ll use these tools in every data science project, but they’re not enough for most projects. There’s a rough 80/20 rule at play: you can tackle about 80% of every project using the tools you’ll learn in this book, but you’ll need other tools to tackle the remaining 20%. Throughout this book, we’ll point you to resources where you can learn more.",
"crumbs": [
"Introduction"
]
},
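The introduction above describes the transform step only in prose: narrowing in on observations of interest, creating new variables from existing ones (like computing speed from distance and time), and calculating summary statistics. A minimal sketch of those three operations with dplyr follows; the `trips` data frame and its columns are hypothetical, made up for illustration.

```r
# A sketch of the transform step described above; all names are hypothetical.
library(dplyr)

trips <- data.frame(
  city     = c("Auckland", "Brisbane", "Sydney"),
  distance = c(110, 730, 480),  # kilometres
  time     = c(1.5, 8.0, 5.5)   # hours
)

trips |>
  filter(city != "Auckland") |>         # narrow in on observations of interest
  mutate(speed = distance / time) |>    # create a new variable from existing ones
  summarize(mean_speed = mean(speed))   # compute a summary statistic
```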
{
"objectID": "intro.html#how-this-book-is-organized",
"href": "intro.html#how-this-book-is-organized",
"title": "Introduction",
"section": "How this book is organized",
"text": "How this book is organized\nThe previous description of the tools of data science is organized roughly according to the order in which you use them in an analysis (although, of course, you’ll iterate through them multiple times). In our experience, however, learning data importing and tidying first is suboptimal because, 80% of the time, it’s routine and boring, and the other 20% of the time, it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualization and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth the effort.\nWithin each chapter, we try to adhere to a consistent pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you’ve learned. Although it can be tempting to skip the exercises, there’s no better way to learn than by practicing on real problems.",
"crumbs": [
"Introduction"
]
},
{
"objectID": "intro.html#what-you-wont-learn",
"href": "intro.html#what-you-wont-learn",
"title": "Introduction",
"section": "What you won’t learn",
"text": "What you won’t learn\nThere are several important topics that this book doesn’t cover. We believe it’s important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can’t cover every important topic.\n\nModeling\nModeling is super important for data science, but it’s a big topic, and unfortunately, we just don’t have the space to give it the coverage it deserves here. To learn more about modeling, we highly recommend Tidy Modeling with R by our colleagues Max Kuhn and Julia Silge. This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.\n\n\nBig data\nThis book proudly and primarily focuses on small, in-memory datasets. This is the right place to start because you can’t tackle big data unless you have experience with small data. The tools you’ll learn throughout the majority of this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with a few gigabytes of data. We’ll also show you how to get data out of databases and parquet files, both of which are often used to store big data. You won’t necessarily be able to work with the entire dataset, but that’s not a problem because you only need a subset or subsample to answer the question that you’re interested in.\nIf you’re routinely working with larger data (10–100 GB, say), we recommend learning more about data.table. We don’t teach it here because it uses a different interface than the tidyverse and requires you to learn some different conventions. However, it is incredibly faster, and the performance payoff is worth investing some time in learning it if you’re working with large data.\n\n\nPython, Julia, and friends\nIn this book, you won’t learn anything about Python, Julia, or any other programming language useful for data science. This isn’t because we think these tools are bad. They’re not! And in practice, most data science teams use a mix of languages, often at least R and Python. But we strongly believe that it’s best to master one tool at a time, and R is a great place to start.",
"crumbs": [
"Introduction"
]
},
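The "Big data" passage above contrasts data.table's interface with the tidyverse without showing the difference. As an illustrative, non-authoritative comparison, the same grouped summary is written both ways below, assuming the palmerpenguins dataset that this book uses later.

```r
# An illustrative sketch of the interface difference, not a benchmark.
library(dplyr)
library(data.table)
library(palmerpenguins)

# tidyverse (dplyr) style: verbs chained with the pipe.
penguins |>
  group_by(species) |>
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))

# data.table style: a single bracket call, DT[i, j, by].
as.data.table(penguins)[, .(mean_mass = mean(body_mass_g, na.rm = TRUE)), by = species]
```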
{
"objectID": "intro.html#prerequisites",
"href": "intro.html#prerequisites",
"title": "Introduction",
"section": "Prerequisites",
"text": "Prerequisites\nWe’ve made a few assumptions about what you already know to get the most out of this book. You should be generally numerically literate, and it’s helpful if you have some basic programming experience already. If you’ve never programmed before, you might find Hands on Programming with R by Garrett to be a valuable adjunct to this book.\nYou need four things to run the code in this book: R, RStudio, a collection of R packages called the tidyverse, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, documentation that describes how to use them, and sample data.\n\nR\nTo download R, go to CRAN, the comprehensive R archive network, https://cloud.r-project.org. A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly. Upgrading can be a bit of a hassle, especially for major versions that require you to re-install all your packages, but putting it off only makes it worse. We recommend R 4.2.0 or later for this book.\n\n\nRStudio\nRStudio is an integrated development environment, or IDE, for R programming, which you can download from https://posit.co/download/rstudio-desktop/. RStudio is updated a couple of times a year, and it will automatically let you know when a new version is out, so there’s no need to check back. It’s a good idea to upgrade regularly to take advantage of the latest and greatest features. For this book, make sure you have at least RStudio 2022.02.0.\nWhen you start RStudio, Figure 2, you’ll see two key regions in the interface: the console pane and the output pane. For now, all you need to know is that you type the R code in the console pane and press enter to run it. You’ll learn more as we go along!1\n\n\n\n\n\n\n\n\nFigure 2: The RStudio IDE has two key regions: type R code in the console pane on the left, and look for plots in the output pane on the right.\n\n\n\n\n\n\n\nThe tidyverse\nYou’ll also need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming and are designed to work together.\nYou can install the complete tidyverse with a single line of code:\n\ninstall.packages(\"tidyverse\")\n\nOn your computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on your computer.\nYou will not be able to use the functions, objects, or help files in a package until you load it with library(). Once you have installed a package, you can load it using the library() function:\n\nlibrary(tidyverse)\n#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──\n#> ✔ dplyr 1.1.4 ✔ readr 2.1.5\n#> ✔ forcats 1.0.0 ✔ stringr 1.5.1\n#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n#> ✔ purrr 1.0.2 \n#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──\n#> ✖ dplyr::filter() masks stats::filter()\n#> ✖ dplyr::lag() masks stats::lag()\n#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nThis tells you that tidyverse loads nine packages: dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, tidyr. 
These are considered the core of the tidyverse because you’ll use them in almost every analysis.\nPackages in the tidyverse change fairly frequently. You can see if updates are available by running tidyverse_update().\n\n\nOther packages\nThere are many other excellent packages that are not part of the tidyverse because they solve problems in a different domain or are designed with a different set of underlying principles. This doesn’t make them better or worse; it just makes them different. In other words, the complement to the tidyverse is not the messyverse but many other universes of interrelated packages. As you tackle more data science projects with R, you’ll learn new packages and new ways of thinking about data.\nWe’ll use many packages from outside the tidyverse in this book. For example, we’ll use the following packages because they provide interesting datasets for us to work with in the process of learning R:\n\ninstall.packages(\n c(\"arrow\", \"babynames\", \"curl\", \"duckdb\", \"gapminder\", \n \"ggrepel\", \"ggridges\", \"ggthemes\", \"hexbin\", \"janitor\", \"Lahman\", \n \"leaflet\", \"maps\", \"nycflights13\", \"openxlsx\", \"palmerpenguins\", \n \"repurrrsive\", \"tidymodels\", \"writexl\")\n )\n\nWe’ll also use a selection of other packages for one off examples. You don’t need to install them now, just remember that whenever you see an error like this:\n\nlibrary(ggrepel)\n#> Error in library(ggrepel) : there is no package called ‘ggrepel’\n\nYou need to run install.packages(\"ggrepel\") to install the package.",
"crumbs": [
"Introduction"
]
},
{
"objectID": "intro.html#running-r-code",
"href": "intro.html#running-r-code",
"title": "Introduction",
"section": "Running R code",
"text": "Running R code\nThe previous section showed you several examples of running R code. The code in the book looks like this:\n\n1 + 2\n#> [1] 3\n\nIf you run the same code in your local console, it will look like this:\n> 1 + 2\n[1] 3\nThere are two main differences. In your console, you type after the >, called the prompt; we don’t show the prompt in the book. In the book, the output is commented out with #>; in your console, it appears directly after your code. These two differences mean that if you’re working with an electronic version of the book, you can easily copy code out of the book and paste it into the console.\nThroughout the book, we use a consistent set of conventions to refer to code:\n\nFunctions are displayed in a code font and followed by parentheses, like sum() or mean().\nOther R objects (such as data or function arguments) are in a code font, without parentheses, like flights or x.\nSometimes, to make it clear which package an object comes from, we’ll use the package name followed by two colons, like dplyr::mutate() or nycflights13::flights. This is also valid R code.",
"crumbs": [
"Introduction"
]
},
{
"objectID": "intro.html#acknowledgments",
"href": "intro.html#acknowledgments",
"title": "Introduction",
"section": "Acknowledgments",
"text": "Acknowledgments\nThis book isn’t just the product of Hadley, Mine, and Garrett but is the result of many conversations (in person and online) that we’ve had with many people in the R community. We’re incredibly grateful for all the conversations we’ve had with y’all; thank you so much!\nThis book was written in the open, and many people contributed via pull requests. A special thanks to all 259 of you who contributed improvements via GitHub pull requests (in alphabetical order by username): @a-rosenberg, Tim Becker (@a2800276), Abinash Satapathy (@Abinashbunty), Adam Gruer (@adam-gruer), adi pradhan (@adidoit), A. s. (@Adrianzo), Aep Hidyatuloh (@aephidayatuloh), Andrea Gilardi (@agila5), Ajay Deonarine (@ajay-d), @AlanFeder, Daihe Sui (@alansuidaihe), @alberto-agudo, @AlbertRapp, @aleloi, pete (@alonzi), Alex (@ALShum), Andrew M. (@amacfarland), Andrew Landgraf (@andland), @andyhuynh92, Angela Li (@angela-li), Antti Rask (@AnttiRask), LOU Xun (@aquarhead), @ariespirgel, @august-18, Michael Henry (@aviast), Azza Ahmed (@azzaea), Steven Moran (@bambooforest), Brian G. Barkley (@BarkleyBG), Mara Averick (@batpigandme), Oluwafemi OYEDELE (@BB1464), Brent Brewington (@bbrewington), Bill Behrman (@behrman), Ben Herbertson (@benherbertson), Ben Marwick (@benmarwick), Ben Steinberg (@bensteinberg), Benjamin Yeh (@bentyeh), Betul Turkoglu (@betulturkoglu), Brandon Greenwell (@bgreenwell), Bianca Peterson (@BinxiePeterson), Birger Niklas (@BirgerNi), Brett Klamer (@bklamer), @boardtc, Christian (@c-hoh), Caddy (@caddycarine), Camille V Leonard (@camillevleonard), @canovasjm, Cedric Batailler (@cedricbatailler), Christina Wei (@christina-wei), Christian Mongeau (@chrMongeau), Cooper Morris (@coopermor), Colin Gillespie (@csgillespie), Rademeyer Vermaak (@csrvermaak), Chloe Thierstein (@cthierst), Chris Saunders (@ctsa), Abhinav Singh (@curious-abhinav), Curtis Alexander (@curtisalexander), Christian G. Warden (@cwarden), Charlotte Wickham (@cwickham), Kenny Darrell (@darrkj), David Kane (@davidkane9), David (@davidrsch), David Rubinger (@davidrubinger), David Clark (@DDClark), Derwin McGeary (@derwinmcgeary), Daniel Gromer (@dgromer), @Divider85, @djbirke, Danielle Navarro (@djnavarro), Russell Shean (@DOH-RPS1303), Zhuoer Dong (@dongzhuoer), Devin Pastoor (@dpastoor), @DSGeoff, Devarshi Thakkar (@dthakkar09), Julian During (@duju211), Dylan Cashman (@dylancashman), Dirk Eddelbuettel (@eddelbuettel), Edwin Thoen (@EdwinTh), Ahmed El-Gabbas (@elgabbas), Henry Webel (@enryH), Ercan Karadas (@ercan7), Eric Kitaif (@EricKit), Eric Watt (@ericwatt), Erik Erhardt (@erikerhardt), Etienne B. Racine (@etiennebr), Everett Robinson (@evjrob), @fellennert, Flemming Miguel (@flemmingmiguel), Floris Vanderhaeghe (@florisvdh), @funkybluehen, @gabrivera, Garrick Aden-Buie (@gadenbuie), Peter Ganong (@ganong123), Gerome Meyer (@GeroVanMi), Gleb Ebert (@gl-eb), Josh Goldberg (@GoldbergData), bahadir cankardes (@gridgrad), Gustav W Delius (@gustavdelius), Hao Chen (@hao-trivago), Harris McGehee (@harrismcgehee), @hendrikweisser, Hengni Cai (@hengnicai), Iain (@Iain-S), Ian Sealy (@iansealy), Ian Lyttle (@ijlyttle), Ivan Krukov (@ivan-krukov), Jacob Kaplan (@jacobkap), Jazz Weisman (@jazzlw), John Blischak (@jdblischak), John D. 
Storey (@jdstorey), Gregory Jefferis (@jefferis), Jeffrey Stevens (@JeffreyRStevens), 蒋雨蒙 (@JeldorPKU), Jennifer (Jenny) Bryan (@jennybc), Jen Ren (@jenren), Jeroen Janssens (@jeroenjanssens), @jeromecholewa, Janet Wesner (@jilmun), Jim Hester (@jimhester), JJ Chen (@jjchern), Jacek Kolacz (@jkolacz), Joanne Jang (@joannejang), @johannes4998, John Sears (@johnsears), @jonathanflint, Jon Calder (@jonmcalder), Jonathan Page (@jonpage), Jon Harmon (@jonthegeek), JooYoung Seo (@jooyoungseo), Justinas Petuchovas (@jpetuchovas), Jordan (@jrdnbradford), Jeffrey Arnold (@jrnold), Jose Roberto Ayala Solares (@jroberayalas), Joyce Robbins (@jtr13), @juandering, Julia Stewart Lowndes (@jules32), Sonja (@kaetschap), Kara Woo (@karawoo), Katrin Leinweber (@katrinleinweber), Karandeep Singh (@kdpsingh), Kevin Perese (@kevinxperese), Kevin Ferris (@kferris10), Kirill Sevastyanenko (@kirillseva), Jonathan Kitt (@KittJonathan), @koalabearski, Kirill Müller (@krlmlr), Rafał Kucharski (@kucharsky), Kevin Wright (@kwstat), Noah Landesberg (@landesbergn), Lawrence Wu (@lawwu), @lindbrook, Luke W Johnston (@lwjohnst86), Kara de la Marck (@MarckK), Kunal Marwaha (@marwahaha), Matan Hakim (@matanhakim), Matthias Liew (@MatthiasLiew), Matt Wittbrodt (@MattWittbrodt), Mauro Lepore (@maurolepore), Mark Beveridge (@mbeveridge), @mcewenkhundi, mcsnowface, PhD (@mcsnowface), Matt Herman (@mfherman), Michael Boerman (@michaelboerman), Mitsuo Shiota (@mitsuoxv), Matthew Hendrickson (@mjhendrickson), @MJMarshall, Misty Knight-Finley (@mkfin7), Mohammed Hamdy (@mmhamdy), Maxim Nazarov (@mnazarov), Maria Paula Caldas (@mpaulacaldas), Mustafa Ascha (@mustafaascha), Nelson Areal (@nareal), Nate Olson (@nate-d-olson), Nathanael (@nateaff), @nattalides, Ned Western (@NedJWestern), Nick Clark (@nickclark1000), @nickelas, Nirmal Patel (@nirmalpatel), Nischal Shrestha (@nischalshrestha), Nicholas Tierney (@njtierney), Jakub Nowosad (@Nowosad), Nick Pullen (@nstjhp), @olivier6088, Olivier Cailloux (@oliviercailloux), Robin Penfold (@p0bs), Pablo E. Garcia (@pabloedug), Paul Adamson (@padamson), Penelope Y (@penelopeysm), Peter Hurford (@peterhurford), Peter Baumgartner (@petzi53), Patrick Kennedy (@pkq), Pooya Taherkhani (@pooyataher), Y. Yu (@PursuitOfDataScience), Radu Grosu (@radugrosu), Ranae Dietzel (@Ranae), Ralph Straumann (@rastrau), Rayna M Harris (@raynamharris), @ReeceGoding, Robin Gertenbach (@rgertenbach), Jajo (@RIngyao), Riva Quiroga (@rivaquiroga), Richard Knight (@RJHKnight), Richard Zijdeman (@rlzijdeman), @robertchu03, Robin Kohrs (@RobinKohrs), Robin (@Robinlovelace), Emily Robinson (@robinsones), Rob Tenorio (@robtenorio), Rod Mazloomi (@RodAli), Rohan Alexander (@RohanAlexander), Romero Morais (@RomeroBarata), Albert Y. Kim (@rudeboybert), Saghir (@saghirb), Hojjat Salmasian (@salmasian), Jonas (@sauercrowd), Vebash Naidoo (@sciencificity), Seamus McKinsey (@seamus-mckinsey), @seanpwilliams, Luke Smith (@seasmith), Matthew Sedaghatfar (@sedaghatfar), Sebastian Kraus (@sekR4), Sam Firke (@sfirke), Shannon Ellis (@ShanEllis), @shoili, Christian Heinrich (@Shurakai), S’busiso Mkhondwane (@sibusiso16), SM Raiyyan (@sm-raiyyan), Jakob Krigovsky (@sonicdoe), Stephan Koenig (@stephan-koenig), Stephen Balogun (@stephenbalogun), Steven M. 
Mortimer (@StevenMMortimer), Stéphane Guillou (@stragu), Sulgi Kim (@sulgik), Sergiusz Bleja (@svenski), Tal Galili (@talgalili), Alec Fisher (@Taurenamo), Todd Gerarden (@tgerarden), Tom Godfrey (@thomasggodfrey), Tim Broderick (@timbroderick), Tim Waterhouse (@timwaterhouse), TJ Mahr (@tjmahr), Thomas Klebel (@tklebel), Tom Prior (@tomjamesprior), Terence Teo (@tteo), @twgardner2, Ulrik Lyngs (@ulyngs), Shinya Uryu (@uribo), Martin Van der Linden (@vanderlindenma), Walter Somerville (@waltersom), @werkstattcodes, Will Beasley (@wibeasley), Yihui Xie (@yihui), Yiming (Paul) Li (@yimingli), @yingxingwu, Hiroaki Yutani (@yutannihilation), Yu Yu Aung (@yuyu-aung), Zach Bogart (@zachbogart), @zeal626, Zeki Akyol (@zekiakyol).",
"crumbs": [
"Introduction"
]
},
{
"objectID": "intro.html#colophon",
"href": "intro.html#colophon",
"title": "Introduction",
"section": "Colophon",
"text": "Colophon\nAn online version of this book is available at https://r4ds.hadley.nz. It will continue to evolve in between reprints of the physical book. The source of the book is available at https://github.com/hadley/r4ds. The book is powered by Quarto, which makes it easy to write books that combine text and executable code.",
"crumbs": [
"Introduction"
]
},
{
"objectID": "intro.html#footnotes",
"href": "intro.html#footnotes",
"title": "Introduction",
"section": "",
"text": "If you’d like a comprehensive overview of all of RStudio’s features, see the RStudio User Guide at https://docs.posit.co/ide/user.↩︎",
"crumbs": [
"Introduction"
]
},
{
"objectID": "whole-game.html",
"href": "whole-game.html",
"title": "Whole game",
"section": "",
"text": "Our goal in this part of the book is to give you a rapid overview of the main tools of data science: importing, tidying, transforming, and visualizing data, as shown in Figure 1. We want to show you the “whole game” of data science giving you just enough of all the major pieces so that you can tackle real, if simple, datasets. The later parts of the book will hit each of these topics in more depth, increasing the range of data science challenges that you can tackle.\n\n\n\n\n\n\n\n\nFigure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.\n\n\n\n\n\nFour chapters focus on the tools of data science:\n\nVisualization is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data. In 2 Data visualization you’ll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.\nVisualization alone is typically not enough, so in 4 Data transformation, you’ll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.\nIn 6 Data tidying, you’ll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier. You’ll learn the underlying principles, and how to get your data into a tidy form.\nBefore you can transform and visualize your data, you need to first get your data into R. In 8 Data import you’ll learn the basics of getting .csv files into R.\n\nNestled among these chapters are four other chapters that focus on your R workflow. In 3 Workflow: basics, 5 Workflow: code style, and 7 Workflow: scripts and projects you’ll learn good workflow practices for writing and organizing your R code. These will set you up for success in the long run, as they’ll give you the tools to stay organized when you tackle real projects. Finally, 9 Workflow: getting help will teach you how to get help and keep learning.",
"crumbs": [
"Whole game"
]
},
{
"objectID": "data-visualize.html",
"href": "data-visualize.html",
"title": "2 Data visualization",
"section": "",
"text": "2.1 Introduction\nR has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.\nThis chapter will teach you how to visualize your data using ggplot2. We will start by creating a simple scatterplot and use that to introduce aesthetic mappings and geometric objects – the fundamental building blocks of ggplot2. We will then walk you through visualizing distributions of single variables as well as visualizing relationships between two or more variables. We’ll finish off with saving your plots and troubleshooting tips.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#introduction",
"href": "data-visualize.html#introduction",
"title": "2 Data visualization",
"section": "",
"text": "“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey\n\n\n\n\n2.1.1 Prerequisites\nThis chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running:\n\nlibrary(tidyverse)\n#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──\n#> ✔ dplyr 1.1.4 ✔ readr 2.1.5\n#> ✔ forcats 1.0.0 ✔ stringr 1.5.1\n#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n#> ✔ purrr 1.0.2 \n#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──\n#> ✖ dplyr::filter() masks stats::filter()\n#> ✖ dplyr::lag() masks stats::lag()\n#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nThat one line of code loads the core tidyverse; the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded)1.\nIf you run this code and get the error message there is no package called 'tidyverse', you’ll need to first install it, then run library() once again.\n\ninstall.packages(\"tidyverse\")\nlibrary(tidyverse)\n\nYou only need to install a package once, but you need to load it every time you start a new session.\nIn addition to tidyverse, we will also use the palmerpenguins package, which includes the penguins dataset containing body measurements for penguins on three islands in the Palmer Archipelago, and the ggthemes package, which offers a colorblind safe color palette.\n\nlibrary(palmerpenguins)\nlibrary(ggthemes)",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#first-steps",
"href": "data-visualize.html#first-steps",
"title": "2 Data visualization",
"section": "2.2 First steps",
"text": "2.2 First steps\nDo penguins with longer flippers weigh more or less than penguins with shorter flippers? You probably already have an answer, but try to make your answer precise. What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguin? How about by the island where the penguin lives? Let’s create visualizations that we can use to answer these questions.\n\n2.2.1 The penguins data frame\nYou can test your answers to those questions with the penguins data frame found in palmerpenguins (a.k.a. palmerpenguins::penguins). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). penguins contains 344 observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER2.\nTo make the discussion easier, let’s define some terms:\n\nA variable is a quantity, quality, or property that you can measure.\nA value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.\nAn observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We’ll sometimes refer to an observation as a data point.\nTabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.\n\nIn this context, a variable refers to an attribute of all the penguins, and an observation refers to all the attributes of a single penguin.\nType the name of the data frame in the console and R will print a preview of its contents. Note that it says tibble on top of this preview. In the tidyverse, we use special data frames called tibbles that you will learn more about soon.\n\npenguins\n#> # A tibble: 344 × 8\n#> species island bill_length_mm bill_depth_mm flipper_length_mm\n#> <fct> <fct> <dbl> <dbl> <int>\n#> 1 Adelie Torgersen 39.1 18.7 181\n#> 2 Adelie Torgersen 39.5 17.4 186\n#> 3 Adelie Torgersen 40.3 18 195\n#> 4 Adelie Torgersen NA NA NA\n#> 5 Adelie Torgersen 36.7 19.3 193\n#> 6 Adelie Torgersen 39.3 20.6 190\n#> # ℹ 338 more rows\n#> # ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>\n\nThis data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use glimpse(). 
Or, if you’re in RStudio, run View(penguins) to open an interactive data viewer.\n\nglimpse(penguins)\n#> Rows: 344\n#> Columns: 8\n#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A…\n#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge…\n#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.…\n#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.…\n#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, …\n#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347…\n#> $ sex <fct> male, female, female, NA, female, male, female, m…\n#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…\n\nAmong the variables in penguins are:\n\nspecies: a penguin’s species (Adelie, Chinstrap, or Gentoo).\nflipper_length_mm: length of a penguin’s flipper, in millimeters.\nbody_mass_g: body mass of a penguin, in grams.\n\nTo learn more about penguins, open its help page by running ?penguins.\n\n\n2.2.2 Ultimate goal\nOur ultimate goal in this chapter is to recreate the following visualization displaying the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.\n\n\n\n\n\n\n\n\n\n\n\n2.2.3 Creating a ggplot\nLet’s recreate this plot step-by-step.\nWith ggplot2, you begin a plot with the function ggplot(), defining a plot object that you then add layers to. The first argument of ggplot() is the dataset to use in the graph and so ggplot(data = penguins) creates an empty graph that is primed to display the penguins data, but since we haven’t told it how to visualize it yet, for now it’s empty. This is not a very exciting plot, but you can think of it like an empty canvas you’ll paint the remaining layers of your plot onto.\n\nggplot(data = penguins)\n\n\n\n\n\n\n\n\nNext, we need to tell ggplot() how the information from our data will be visually represented. The mapping argument of the ggplot() function defines how variables in your dataset are mapped to visual properties (aesthetics) of your plot. The mapping argument is always defined in the aes() function, and the x and y arguments of aes() specify which variables to map to the x and y axes. For now, we will only map flipper length to the x aesthetic and body mass to the y aesthetic. ggplot2 looks for the mapped variables in the data argument, in this case, penguins.\nThe following plot shows the result of adding these mappings.\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n)\n\n\n\n\n\n\n\n\nOur empty canvas now has more structure – it’s clear where flipper lengths will be displayed (on the x-axis) and where body masses will be displayed (on the y-axis). But the penguins themselves are not yet on the plot. This is because we have not yet articulated, in our code, how to represent the observations from our data frame on our plot.\nTo do so, we need to define a geom: the geometrical object that a plot uses to represent data. These geometric objects are made available in ggplot2 with functions that start with geom_. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms (geom_bar()), line charts use line geoms (geom_line()), boxplots use boxplot geoms (geom_boxplot()), scatterplots use point geoms (geom_point()), and so on.\nThe function geom_point() adds a layer of points to your plot, which creates a scatterplot. 
ggplot2 comes with many geom functions that each adds a different type of layer to a plot. You’ll learn a whole bunch of geoms throughout the book, particularly in Chapter 10.\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n geom_point()\n#> Warning: Removed 2 rows containing missing values or values outside the scale range\n#> (`geom_point()`).\n\n\n\n\n\n\n\n\nNow we have something that looks like what we might think of as a “scatterplot”. It doesn’t yet match our “ultimate goal” plot, but using this plot we can start answering the question that motivated our exploration: “What does the relationship between flipper length and body mass look like?” The relationship appears to be positive (as flipper length increases, so does body mass), fairly linear (the points are clustered around a line instead of a curve), and moderately strong (there isn’t too much scatter around such a line). Penguins with longer flippers are generally larger in terms of their body mass.\nBefore we add more layers to this plot, let’s pause for a moment and review the warning message we got:\n\nRemoved 2 rows containing missing values (geom_point()).\n\nWe’re seeing this message because there are two penguins in our dataset with missing body mass and/or flipper length values and ggplot2 has no way of representing them on the plot without both of these values. Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. This type of warning is probably one of the most common types of warnings you will see when working with real data – missing values are a very common issue and you’ll learn more about them throughout the book, particularly in Chapter 19. For the remaining plots in this chapter we will suppress this warning so it’s not printed alongside every single plot we make.\n\n\n2.2.4 Adding aesthetics and layers\nScatterplots are useful for displaying the relationship between two numerical variables, but it’s always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship. For example, does the relationship between flipper length and body mass differ by species? Let’s incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between these variables. We will do this by representing species with different colored points.\nTo achieve this, will we need to modify the aesthetic or the geom? If you guessed “in the aesthetic mapping, inside of aes()”, you’re already getting the hang of creating data visualizations with ggplot2! And if not, don’t worry. Throughout the book you will make many more ggplots and have many more opportunities to check your intuition as you make them.\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)\n) +\n geom_point()\n\n\n\n\n\n\n\n\nWhen a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable (each of the three species), a process known as scaling. ggplot2 will also add a legend that explains which values correspond to which levels.\nNow let’s add one more layer: a smooth curve displaying the relationship between body mass and flipper length. 
Before you proceed, refer back to the code above, and think about how we can add this to our existing plot.\nSince this is a new geometric object representing our data, we will add a new geom as a layer on top of our point geom: geom_smooth(). And we will specify that we want to draw the line of best fit based on a linear model with method = \"lm\".\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)\n) +\n geom_point() +\n geom_smooth(method = \"lm\")\n\n\n\n\n\n\n\n\nWe have successfully added lines, but this plot doesn’t look like the plot from Section 2.2.2, which only has one line for the entire dataset as opposed to separate lines for each of the penguin species.\nWhen aesthetic mappings are defined in ggplot(), at the global level, they’re passed down to each of the subsequent geom layers of the plot. However, each geom function in ggplot2 can also take a mapping argument, which allows for aesthetic mappings at the local level that are added to those inherited from the global level. Since we want points to be colored based on species but don’t want the lines to be separated out for them, we should specify color = species for geom_point() only.\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n geom_point(mapping = aes(color = species)) +\n geom_smooth(method = \"lm\")\n\n\n\n\n\n\n\n\nVoila! We have something that looks very much like our ultimate goal, though it’s not yet perfect. We still need to use different shapes for each species of penguins and improve labels.\nIt’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. Therefore, in addition to color, we can also map species to the shape aesthetic.\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n geom_point(mapping = aes(color = species, shape = species)) +\n geom_smooth(method = \"lm\")\n\n\n\n\n\n\n\n\nNote that the legend is automatically updated to reflect the different shapes of the points as well.\nAnd finally, we can improve the labels of our plot using the labs() function in a new layer. Some of the arguments to labs() might be self explanatory: title adds a title and subtitle adds a subtitle to the plot. Other arguments match the aesthetic mappings, x is the x-axis label, y is the y-axis label, and color and shape define the label for the legend. In addition, we can improve the color palette to be colorblind safe with the scale_color_colorblind() function from the ggthemes package.\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n geom_point(aes(color = species, shape = species)) +\n geom_smooth(method = \"lm\") +\n labs(\n title = \"Body mass and flipper length\",\n subtitle = \"Dimensions for Adelie, Chinstrap, and Gentoo Penguins\",\n x = \"Flipper length (mm)\", y = \"Body mass (g)\",\n color = \"Species\", shape = \"Species\"\n ) +\n scale_color_colorblind()\n\n\n\n\n\n\n\n\nWe finally have a plot that perfectly matches our “ultimate goal”!\n\n\n2.2.5 Exercises\n\nHow many rows are in penguins? How many columns?\nWhat does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.\nMake a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. 
Describe the relationship between these two variables.\nWhat happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?\nWhy does the following give an error and how would you fix it?\n\nggplot(data = penguins) + \n geom_point()\n\nWhat does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.\nAdd the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().\nRecreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?\n\n\n\n\n\n\n\n\n\nRun this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)\n) +\n geom_point() +\n geom_smooth(se = FALSE)\n\nWill these two graphs look different? Why/why not?\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n geom_point() +\n geom_smooth()\n\nggplot() +\n geom_point(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n ) +\n geom_smooth(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n )",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#sec-ggplot2-calls",
"href": "data-visualize.html#sec-ggplot2-calls",
"title": "2 Data visualization",
"section": "2.3 ggplot2 calls",
"text": "2.3 ggplot2 calls\nAs we move on from these introductory sections, we’ll transition to a more concise expression of ggplot2 code. So far we’ve been very explicit, which is helpful when you are learning:\n\nggplot(\n data = penguins,\n mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n geom_point()\n\nTypically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to ggplot() are data and mapping, in the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of extra text, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in Chapter 26.\nRewriting the previous plot more concisely yields:\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + \n geom_point()\n\nIn the future, you’ll also learn about the pipe, |>, which will allow you to create that plot with:\n\npenguins |> \n ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + \n geom_point()",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#visualizing-distributions",
"href": "data-visualize.html#visualizing-distributions",
"title": "2 Data visualization",
"section": "2.4 Visualizing distributions",
"text": "2.4 Visualizing distributions\nHow you visualize the distribution of a variable depends on the type of variable: categorical or numerical.\n\n2.4.1 A categorical variable\nA variable is categorical if it can only take one of a small set of values. To examine the distribution of a categorical variable, you can use a bar chart. The height of the bars displays how many observations occurred with each x value.\n\nggplot(penguins, aes(x = species)) +\n geom_bar()\n\n\n\n\n\n\n\n\nIn bar plots of categorical variables with non-ordered levels, like the penguin species above, it’s often preferable to reorder the bars based on their frequencies. Doing so requires transforming the variable to a factor (how R handles categorical data) and then reordering the levels of that factor.\n\nggplot(penguins, aes(x = fct_infreq(species))) +\n geom_bar()\n\n\n\n\n\n\n\n\nYou will learn more about factors and functions for dealing with factors (like fct_infreq() shown above) in Chapter 17.\n\n\n2.4.2 A numerical variable\nA variable is numerical (or quantitative) if it can take on a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. Numerical variables can be continuous or discrete.\nOne commonly used visualization for distributions of continuous variables is a histogram.\n\nggplot(penguins, aes(x = body_mass_g)) +\n geom_histogram(binwidth = 200)\n\n\n\n\n\n\n\n\nA histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that 39 observations have a body_mass_g value between 3,500 and 3,700 grams, which are the left and right edges of the bar.\nYou can set the width of the intervals in a histogram with the binwidth argument, which is measured in the units of the x variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. In the plots below a binwidth of 20 is too narrow, resulting in too many bars, making it difficult to determine the shape of the distribution. Similarly, a binwidth of 2,000 is too high, resulting in all data being binned into only three bars, and also making it difficult to determine the shape of the distribution. A binwidth of 200 provides a sensible balance.\nggplot(penguins, aes(x = body_mass_g)) +\n geom_histogram(binwidth = 20)\nggplot(penguins, aes(x = body_mass_g)) +\n geom_histogram(binwidth = 2000)\n\n\n\n\n\n\n\n\n\n\nAn alternative visualization for distributions of numerical variables is a density plot. A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution. We won’t go into how geom_density() estimates the density (you can read more about that in the function documentation), but let’s explain how the density curve is drawn with an analogy. Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. 
The density curve shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.\n\nggplot(penguins, aes(x = body_mass_g)) +\n geom_density()\n#> Warning: Removed 2 rows containing non-finite outside the scale range\n#> (`stat_density()`).\n\n\n\n\n\n\n\n\n\n\n2.4.3 Exercises\n\nMake a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?\nHow are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?\n\nggplot(penguins, aes(x = species)) +\n geom_bar(color = \"red\")\n\nggplot(penguins, aes(x = species)) +\n geom_bar(fill = \"red\")\n\nWhat does the bins argument in geom_histogram() do?\nMake a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#visualizing-relationships",
"href": "data-visualize.html#visualizing-relationships",
"title": "2 Data visualization",
"section": "2.5 Visualizing relationships",
"text": "2.5 Visualizing relationships\nTo visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. In the following sections you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them.\n\n2.5.1 A numerical and a categorical variable\nTo visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. A boxplot is a type of visual shorthand for measures of position (percentiles) that describe a distribution. It is also useful for identifying potential outliers. As shown in Figure 2.1, each boxplot consists of:\n\nA box that indicates the range of the middle half of the data, a distance known as the interquartile range (IQR), stretching from the 25th percentile of the distribution to the 75th percentile. In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.\nVisual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.\nA line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.\n\n\n\n\n\n\n\n\n\nFigure 2.1: Diagram depicting how a boxplot is created.\n\n\n\n\n\nLet’s take a look at the distribution of body mass by species using geom_boxplot():\n\nggplot(penguins, aes(x = species, y = body_mass_g)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\nAlternatively, we can make density plots with geom_density().\n\nggplot(penguins, aes(x = body_mass_g, color = species)) +\n geom_density(linewidth = 0.75)\n\n\n\n\n\n\n\n\nWe’ve also customized the thickness of the lines using the linewidth argument in order to make them stand out a bit more against the background.\nAdditionally, we can map species to both color and fill aesthetics and use the alpha aesthetic to add transparency to the filled density curves. This aesthetic takes values between 0 (completely transparent) and 1 (completely opaque). In the following plot it’s set to 0.5.\n\nggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +\n geom_density(alpha = 0.5)\n\n\n\n\n\n\n\n\nNote the terminology we have used here:\n\nWe map variables to aesthetics if we want the visual attribute represented by that aesthetic to vary based on the values of that variable.\nOtherwise, we set the value of an aesthetic.\n\n\n\n2.5.2 Two categorical variables\nWe can use stacked bar plots to visualize the relationship between two categorical variables. For example, the following two stacked bar plots both display the relationship between island and species, or specifically, visualizing the distribution of species within each island.\nThe first plot shows the frequencies of each species of penguins on each island. The plot of frequencies shows that there are equal numbers of Adelies on each island. But we don’t have a good sense of the percentage balance within each island.\n\nggplot(penguins, aes(x = island, fill = species)) +\n geom_bar()\n\n\n\n\n\n\n\n\nThe second plot, a relative frequency plot created by setting position = \"fill\" in the geom, is more useful for comparing species distributions across islands since it’s not affected by the unequal numbers of penguins across the islands. 
Using this plot we can see that Gentoo penguins all live on Biscoe island and make up roughly 75% of the penguins on that island, Chinstrap all live on Dream island and make up roughly 50% of the penguins on that island, and Adelie live on all three islands and make up all of the penguins on Torgersen.\n\nggplot(penguins, aes(x = island, fill = species)) +\n geom_bar(position = \"fill\")\n\n\n\n\n\n\n\n\nIn creating these bar charts, we map the variable that will be separated into bars to the x aesthetic, and the variable that will change the colors inside the bars to the fill aesthetic.\n\n\n2.5.3 Two numerical variables\nSo far you’ve learned about scatterplots (created with geom_point()) and smooth curves (created with geom_smooth()) for visualizing the relationship between two numerical variables. A scatterplot is probably the most commonly used plot for visualizing the relationship between two numerical variables.\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +\n geom_point()\n\n\n\n\n\n\n\n\n\n\n2.5.4 Three or more variables\nAs we saw in Section 2.2.4, we can incorporate more variables into a plot by mapping them to additional aesthetics. For example, in the following scatterplot the colors of points represent species and the shapes of points represent islands.\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +\n geom_point(aes(color = species, shape = island))\n\n\n\n\n\n\n\n\nHowever adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way, which is particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.\nTo facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() is a formula3, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be categorical.\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +\n geom_point(aes(color = species, shape = species)) +\n facet_wrap(~island)\n\n\n\n\n\n\n\n\nYou will learn about many other geoms for visualizing distributions of variables and relationships between them in Chapter 10.\n\n\n2.5.5 Exercises\n\nThe mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?\nMake a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?\nIn the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?\nWhat happens if you map the same variable to multiple aesthetics?\nMake a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?\nWhy does the following yield two separate legends? How would you fix it to combine the two legends?\n\nggplot(\n data = penguins,\n mapping = aes(\n x = bill_length_mm, y = bill_depth_mm, \n color = species, shape = species\n )\n) +\n geom_point() +\n labs(color = \"Species\")\n\nCreate the two following stacked bar plots. 
Which question can you answer with the first one? Which question can you answer with the second one?\n\nggplot(penguins, aes(x = island, fill = species)) +\n geom_bar(position = \"fill\")\nggplot(penguins, aes(x = species, fill = island)) +\n geom_bar(position = \"fill\")",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#sec-ggsave",
"href": "data-visualize.html#sec-ggsave",
"title": "2 Data visualization",
"section": "2.6 Saving your plots",
"text": "2.6 Saving your plots\nOnce you’ve made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere. That’s the job of ggsave(), which will save the plot most recently created to disk:\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +\n geom_point()\nggsave(filename = \"penguin-plot.png\")\n\nThis will save your plot to your working directory, a concept you’ll learn more about in Chapter 7.\nIf you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about ggsave() in the documentation.\nGenerally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in Chapter 29.\n\n2.6.1 Exercises\n\nRun the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?\n\nggplot(mpg, aes(x = class)) +\n geom_bar()\nggplot(mpg, aes(x = cty, y = hwy)) +\n geom_point()\nggsave(\"mpg-plot.png\")\n\nWhat do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#common-problems",
"href": "data-visualize.html#common-problems",
"title": "2 Data visualization",
"section": "2.7 Common problems",
"text": "2.7 Common problems\nAs you start to run R code, you’re likely to run into problems. Don’t worry — it happens to everyone. We have all been writing R code for years, but every day we still write code that doesn’t work on the first try!\nStart by carefully comparing the code that you’re running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every \" is paired with another \". Sometimes you’ll run the code and nothing happens. Check the left-hand of your console: if it’s a +, it means that R doesn’t think you’ve typed a complete expression and it’s waiting for you to finish it. In this case, it’s usually easy to start from scratch again by pressing ESCAPE to abort processing the current command.\nOne common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven’t accidentally written code like this:\n\nggplot(data = mpg) \n+ geom_point(mapping = aes(x = displ, y = hwy))\n\nIf you’re still stuck, try the help. You can get help about any R function by running ?function_name in the console, or highlighting the function name and pressing F1 in RStudio. Don’t worry if the help doesn’t seem that helpful - instead skip down to the examples and look for code that matches what you’re trying to do.\nIf that doesn’t help, carefully read the error message. Sometimes the answer will be buried there! But when you’re new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#summary",
"href": "data-visualize.html#summary",
"title": "2 Data visualization",
"section": "2.8 Summary",
"text": "2.8 Summary\nIn this chapter, you’ve learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, color, size and shape. You then learned about increasing the complexity and improving the presentation of your plots layer-by-layer. You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by leveraging additional aesthetic mappings and/or splitting your plot into small multiples using faceting.\nWe’ll use visualizations again and again throughout this book, introducing new techniques as we need them as well as do a deeper dive into creating visualizations with ggplot2 in Chapter 10 through Chapter 12.\nWith the basics of visualization under your belt, in the next chapter we’re going to switch gears a little and give you some practical workflow advice. We intersperse workflow advice with data science tools throughout this part of the book because it’ll help you stay organized as you write increasing amounts of R code.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "data-visualize.html#footnotes",
"href": "data-visualize.html#footnotes",
"title": "2 Data visualization",
"section": "",
"text": "You can eliminate that message and force conflict resolution to happen on demand by using the conflicted package, which becomes more important as you load more packages. You can learn more about conflicted at https://conflicted.r-lib.org.↩︎\nHorst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.↩︎\nHere “formula” is the name of the thing created by ~, not a synonym for “equation”.↩︎",
"crumbs": [
"Whole game",
"<span class='chapter-number'>2</span> <span class='chapter-title'>Data visualization</span>"
]
},
{
"objectID": "workflow-basics.html",
"href": "workflow-basics.html",
"title": "3 Workflow: basics",
"section": "",
"text": "3.1 Coding basics\nYou now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R because it is such a stickler for punctuation, and even one character out of place can cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.\nBefore we go any further, let’s ensure you’ve got a solid foundation in running R code and that you know some of the most helpful RStudio features.\nLet’s review some basics we’ve omitted so far in the interest of getting you plotting as quickly as possible. You can use R to do basic math calculations:\n1 / 200 * 30\n#> [1] 0.15\n(59 + 73 + 2) / 3\n#> [1] 44.66667\nsin(pi / 2)\n#> [1] 1\nYou can create new objects with the assignment operator <-:\nx <- 3 * 4\nNote that the value of x is not printed, it’s just stored. If you want to view the value, type x in the console.\nYou can combine multiple elements into a vector with c():\nprimes <- c(2, 3, 5, 7, 11, 13)\nAnd basic arithmetic on vectors is applied to every element of of the vector:\nprimes * 2\n#> [1] 4 6 10 14 22 26\nprimes - 1\n#> [1] 1 2 4 6 10 12\nAll R statements where you create objects, assignment statements, have the same form:\nobject_name <- value\nWhen reading that code, say “object name gets value” in your head.\nYou will make lots of assignments, and <- is a pain to type. You can save time with RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automatically surrounds <- with spaces, which is a good code formatting practice. Code can be miserable to read on a good day, so giveyoureyesabreak and use spaces.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Workflow: basics</span>"
]
},
{
"objectID": "workflow-basics.html#comments",
"href": "workflow-basics.html#comments",
"title": "3 Workflow: basics",
"section": "3.2 Comments",
"text": "3.2 Comments\nR will ignore any text after # for that line. This allows you to write comments, text that is ignored by R but read by other humans. We’ll sometimes include comments in examples explaining what’s happening with the code.\nComments can be helpful for briefly describing what the following code does.\n\n# create vector of primes\nprimes <- c(2, 3, 5, 7, 11, 13)\n\n# multiply primes by 2\nprimes * 2\n#> [1] 4 6 10 14 22 26\n\nWith short pieces of code like this, leaving a comment for every single line of code might not be necessary. But as the code you’re writing gets more complex, comments can save you (and your collaborators) a lot of time figuring out what was done in the code.\nUse comments to explain the why of your code, not the how or the what. The what and how of your code are always possible to figure out, even if it might be tedious, by carefully reading it. If you describe every step in the comments, and then change the code, you will have to remember to update the comments as well or it will be confusing when you return to your code in the future.\nFiguring out why something was done is much more difficult, if not impossible. For example, geom_smooth() has an argument called span, which controls the smoothness of the curve, with larger values yielding a smoother curve. Suppose you decide to change the value of span from its default of 0.75 to 0.9: it’s easy for a future reader to understand what is happening, but unless you note your thinking in a comment, no one will understand why you changed the default.\nFor data analysis code, use comments to explain your overall plan of attack and record important insights as you encounter them. There’s no way to re-capture this knowledge from the code itself.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Workflow: basics</span>"
]
},
{
"objectID": "workflow-basics.html#sec-whats-in-a-name",
"href": "workflow-basics.html#sec-whats-in-a-name",
"title": "3 Workflow: basics",
"section": "3.3 What’s in a name?",
"text": "3.3 What’s in a name?\nObject names must start with a letter and can only contain letters, numbers, _, and .. You want your object names to be descriptive, so you’ll need to adopt a convention for multiple words. We recommend snake_case, where you separate lowercase words with _.\n\ni_use_snake_case\notherPeopleUseCamelCase\nsome.people.use.periods\nAnd_aFew.People_RENOUNCEconvention\n\nWe’ll return to names again when we discuss code style in Chapter 5.\nYou can inspect an object by typing its name:\n\nx\n#> [1] 12\n\nMake another assignment:\n\nthis_is_a_really_long_name <- 2.5\n\nTo inspect this object, try out RStudio’s completion facility: type “this”, press TAB, add characters until you have a unique prefix, then press return.\nLet’s assume you made a mistake, and that the value of this_is_a_really_long_name should be 3.5, not 2.5. You can use another keyboard shortcut to help you fix it. For example, you can press ↑ to bring the last command you typed and edit it. Or, type “this” then press Cmd/Ctrl + ↑ to list all the commands you’ve typed that start with those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.\nMake yet another assignment:\n\nr_rocks <- 2^3\n\nLet’s try to inspect it:\n\nr_rock\n#> Error: object 'r_rock' not found\nR_rocks\n#> Error: object 'R_rocks' not found\n\nThis illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions. If not, you’re likely to get an error that says the object you’re looking for was not found. Typos matter; R can’t read your mind and say, “oh, they probably meant r_rocks when they typed r_rock”. Case matters; similarly, R can’t read your mind and say, “oh, they probably meant r_rocks when they typed R_rocks”.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Workflow: basics</span>"
]
},
{
"objectID": "workflow-basics.html#calling-functions",
"href": "workflow-basics.html#calling-functions",
"title": "3 Workflow: basics",
"section": "3.4 Calling functions",
"text": "3.4 Calling functions\nR has a large collection of built-in functions that are called like this:\n\nfunction_name(argument1 = value1, argument2 = value2, ...)\n\nLet’s try using seq(), which makes regular sequences of numbers, and while we’re at it, learn more helpful features of RStudio. Type se and hit TAB. A popup shows you possible completions. Specify seq() by typing more (a q) to disambiguate or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.\nWhen you’ve selected the function you want, press TAB again. RStudio will add matching opening (() and closing ()) parentheses for you. Type the name of the first argument, from, and set it equal to 1. Then, type the name of the second argument, to, and set it equal to 10. Finally, hit return.\n\nseq(from = 1, to = 10)\n#> [1] 1 2 3 4 5 6 7 8 9 10\n\nWe often omit the names of the first several arguments in function calls, so we can rewrite this as follows:\n\nseq(1, 10)\n#> [1] 1 2 3 4 5 6 7 8 9 10\n\nType the following code and notice that RStudio provides similar assistance with the paired quotation marks:\n\nx <- \"hello world\"\n\nQuotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it’s still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character “+”:\n> x <- \"hello\n+\nThe + tells you that R is waiting for more input; it doesn’t think you’re done yet. Usually, this means you’ve forgotten either a \" or a ). Either add the missing pair, or press ESCAPE to abort the expression and try again.\nNote that the environment tab in the upper right pane displays all of the objects that you’ve created:",
"crumbs": [
"Whole game",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Workflow: basics</span>"
]
},
{
"objectID": "workflow-basics.html#exercises",
"href": "workflow-basics.html#exercises",
"title": "3 Workflow: basics",
"section": "3.5 Exercises",
"text": "3.5 Exercises\n\nWhy does this code not work?\n\nmy_variable <- 10\nmy_varıable\n#> Error: object 'my_varıable' not found\n\nLook carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)\nTweak each of the following R commands so that they run correctly:\n\nlibary(todyverse)\n\nggplot(dTA = mpg) + \n geom_point(maping = aes(x = displ y = hwy)) +\n geom_smooth(method = \"lm)\n\nPress Option + Shift + K / Alt + Shift + K. What happens? How can you get to the same place using the menus?\nLet’s revisit an exercise from the Section 2.6. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?\n\nmy_bar_plot <- ggplot(mpg, aes(x = class)) +\n geom_bar()\nmy_scatter_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +\n geom_point()\nggsave(filename = \"mpg-plot.png\", plot = my_bar_plot)",
"crumbs": [
"Whole game",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Workflow: basics</span>"
]
},
{
"objectID": "workflow-basics.html#summary",
"href": "workflow-basics.html#summary",
"title": "3 Workflow: basics",
"section": "3.6 Summary",
"text": "3.6 Summary\nNow that you’ve learned a little more about how R code works, and some tips to help you understand your code when you come back to it in the future. In the next chapter, we’ll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it’s selecting important variables, filtering down to rows of interest, or computing summary statistics.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>3</span> <span class='chapter-title'>Workflow: basics</span>"
]
},
{
"objectID": "data-transform.html",
"href": "data-transform.html",
"title": "4 Data transformation",
"section": "",
"text": "4.1 Introduction\nVisualization is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need to make the graph you want. Often you’ll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights that departed from New York City in 2013.\nThe goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action. In later chapters, we’ll return to the functions in more detail as we start to dig into specific types of data (e.g., numbers, strings, dates).",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "data-transform.html#introduction",
"href": "data-transform.html#introduction",
"title": "4 Data transformation",
"section": "",
"text": "4.1.1 Prerequisites\nIn this chapter, we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package and use ggplot2 to help us understand the data.\n\nlibrary(nycflights13)\nlibrary(tidyverse)\n#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──\n#> ✔ dplyr 1.1.4 ✔ readr 2.1.5\n#> ✔ forcats 1.0.0 ✔ stringr 1.5.1\n#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1\n#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n#> ✔ purrr 1.0.2 \n#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──\n#> ✖ dplyr::filter() masks stats::filter()\n#> ✖ dplyr::lag() masks stats::lag()\n#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors\n\nTake careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag(). So far, we’ve mostly ignored which package a function comes from because it doesn’t usually matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: packagename::functionname().\n\n\n4.1.2 nycflights13\nTo explore the basic dplyr verbs, we will use nycflights13::flights. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics and is documented in ?flights.\n\nflights\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nflights is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably View(flights), which opens an interactive, scrollable, and filterable view. 
Otherwise you can use print(flights, width = Inf) to show all columns, or use glimpse():\n\nglimpse(flights)\n#> Rows: 336,776\n#> Columns: 19\n#> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…\n#> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…\n#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…\n#> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…\n#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…\n#> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…\n#> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…\n#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…\n#> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…\n#> $ carrier <chr> \"UA\", \"UA\", \"AA\", \"B6\", \"DL\", \"UA\", \"B6\", \"EV\", \"B6\"…\n#> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…\n#> $ tailnum <chr> \"N14228\", \"N24211\", \"N619AA\", \"N804JB\", \"N668DN\", \"N…\n#> $ origin <chr> \"EWR\", \"LGA\", \"JFK\", \"JFK\", \"LGA\", \"EWR\", \"EWR\", \"LG…\n#> $ dest <chr> \"IAH\", \"IAH\", \"MIA\", \"BQN\", \"ATL\", \"ORD\", \"FLL\", \"IA…\n#> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…\n#> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…\n#> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…\n#> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…\n#> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…\n\nIn both views, the variable names are followed by abbreviations that tell you the type of each variable: <int> is short for integer, <dbl> is short for double (aka real numbers), <chr> for character (aka strings), and <dttm> for date-time. These are important because the operations you can perform on a column depend heavily on its “type.”\n\n\n4.1.3 dplyr basics\nYou’re about to learn the primary dplyr verbs (functions), which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:\n\nThe first argument is always a data frame.\nThe subsequent arguments typically describe which columns to operate on using the variable names (without quotes).\nThe output is always a new data frame.\n\nBecause each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we’ll do so with the pipe, |>. We’ll discuss the pipe more in Section 4.4, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that x |> f(y) is equivalent to f(x, y), and x |> f(y) |> g(z) is equivalent to g(f(x, y), z). The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:\n\nflights |>\n filter(dest == \"IAH\") |> \n group_by(year, month, day) |> \n summarize(\n arr_delay = mean(arr_delay, na.rm = TRUE)\n )\n\ndplyr’s verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables. In the following sections, you’ll learn the most important verbs for rows, columns, and groups. Then, we’ll return to the join verbs that work on tables in Chapter 20. Let’s dive in!",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "data-transform.html#rows",
"href": "data-transform.html#rows",
"title": "4 Data transformation",
"section": "4.2 Rows",
"text": "4.2 Rows\nThe most important verbs that operate on rows of a dataset are filter(), which changes which rows are present without changing their order, and arrange(), which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. We’ll also discuss distinct() which finds rows with unique values. Unlike arrange() and filter() it can also optionally modify the columns.\n\n4.2.1 filter()\nfilter() allows you to keep rows based on the values of the columns1. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that departed more than 120 minutes (two hours) late:\n\nflights |> \n filter(dep_delay > 120)\n#> # A tibble: 9,723 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 848 1835 853 1001 1950\n#> 2 2013 1 1 957 733 144 1056 853\n#> 3 2013 1 1 1114 900 134 1447 1222\n#> 4 2013 1 1 1540 1338 122 2020 1825\n#> 5 2013 1 1 1815 1325 290 2120 1542\n#> 6 2013 1 1 1842 1422 260 1958 1535\n#> # ℹ 9,717 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nAs well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also combine conditions with & or , to indicate “and” (check for both conditions) or with | to indicate “or” (check for either condition):\n\n# Flights that departed on January 1\nflights |> \n filter(month == 1 & day == 1)\n#> # A tibble: 842 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 836 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\n# Flights that departed in January or February\nflights |> \n filter(month == 1 | month == 2)\n#> # A tibble: 51,955 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 51,949 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nThere’s a useful shortcut when you’re combining | and ==: %in%. 
Inside filter(), it keeps rows where the variable equals one of the values on the right:\n\n# A shorter way to select flights that departed in January or February\nflights |> \n filter(month %in% c(1, 2))\n#> # A tibble: 51,955 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 51,949 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nWe’ll come back to these comparisons and logical operators in more detail in Chapter 13.\nWhen you run filter(), dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-:\n\njan1 <- flights |> \n filter(month == 1 & day == 1)\n\n\n\n4.2.2 Common mistakes\nWhen you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. filter() will let you know when this happens:\n\nflights |> \n filter(month = 1)\n#> Error in `filter()`:\n#> ! We detected a named input.\n#> ℹ This usually means that you've used `=` instead of `==`.\n#> ℹ Did you mean `month == 1`?\n\nAnother mistake is writing “or” statements like you would in English:\n\nflights |> \n filter(month == 1 | 2)\n\nThis “works”, in the sense that it doesn’t throw an error, but it doesn’t do what you want because | first checks the condition month == 1 and then checks the condition 2, which is not a sensible condition to check. We’ll learn more about what’s happening here and why in Section 13.3.2.\n\n\n4.2.3 arrange()\narrange() changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns. For example, the following code sorts by the departure time, which is spread over four columns. We get the earliest years first, then within a year, the earliest months, etc.\n\nflights |> \n arrange(year, month, day, dep_time)\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nYou can use desc() on a column inside arrange() to re-order the data frame based on that column in descending (big-to-small) order. 
For example, this code orders flights from most to least delayed:\n\nflights |> \n arrange(desc(dep_delay))\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 9 641 900 1301 1242 1530\n#> 2 2013 6 15 1432 1935 1137 1607 2120\n#> 3 2013 1 10 1121 1635 1126 1239 1810\n#> 4 2013 9 20 1139 1845 1014 1457 2210\n#> 5 2013 7 22 845 1600 1005 1044 1815\n#> 6 2013 4 10 1100 1900 960 1342 2211\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nNote that the number of rows has not changed – we’re only arranging the data, we’re not filtering it.\n\n\n4.2.4 distinct()\ndistinct() finds all the unique rows in a dataset, so technically, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:\n\n# Remove duplicate rows, if any\nflights |> \n distinct()\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\n# Find all unique origin and destination pairs\nflights |> \n distinct(origin, dest)\n#> # A tibble: 224 × 2\n#> origin dest \n#> <chr> <chr>\n#> 1 EWR IAH \n#> 2 LGA IAH \n#> 3 JFK MIA \n#> 4 JFK BQN \n#> 5 LGA ATL \n#> 6 EWR ORD \n#> # ℹ 218 more rows\n\nAlternatively, if you want to keep other columns when filtering for unique rows, you can use the .keep_all = TRUE option.\n\nflights |> \n distinct(origin, dest, .keep_all = TRUE)\n#> # A tibble: 224 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 218 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nIt’s not a coincidence that all of these distinct flights are on January 1: distinct() will find the first occurrence of a unique row in the dataset and discard the rest.\nIf you want to find the number of occurrences instead, you’re better off swapping distinct() for count(). With the sort = TRUE argument, you can arrange them in descending order of the number of occurrences. You’ll learn more about count in Section 14.3.\n\nflights |>\n count(origin, dest, sort = TRUE)\n#> # A tibble: 224 × 3\n#> origin dest n\n#> <chr> <chr> <int>\n#> 1 JFK LAX 11262\n#> 2 LGA ATL 10263\n#> 3 LGA ORD 8857\n#> 4 JFK SFO 8204\n#> 5 LGA CLT 6168\n#> 6 EWR ORD 6100\n#> # ℹ 218 more rows\n\n\n\n4.2.5 Exercises\n\nIn a single pipeline for each condition, find all flights that meet the condition:\n\nHad an arrival delay of two or more hours\nFlew to Houston (IAH or HOU)\nWere operated by United, American, or Delta\nDeparted in summer (July, August, and September)\nArrived more than two hours late but didn’t leave late\nWere delayed by at least an hour, but made up over 30 minutes in flight\n\nSort flights to find the flights with the longest departure delays. 
Find the flights that left earliest in the morning.\nSort flights to find the fastest flights. (Hint: Try including a math calculation inside your function.)\nWas there a flight on every day of 2013?\nWhich flights traveled the farthest distance? Which traveled the least distance?\nDoes it matter what order you use filter() and arrange() in if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "data-transform.html#columns",
"href": "data-transform.html#columns",
"title": "4 Data transformation",
"section": "4.3 Columns",
"text": "4.3 Columns\nThere are four important verbs that affect the columns without changing the rows: mutate() creates new columns that are derived from the existing columns, select() changes which columns are present, rename() changes the names of the columns, and relocate() changes the positions of the columns.\n\n4.3.1 mutate()\nThe job of mutate() is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:\n\nflights |> \n mutate(\n gain = dep_delay - arr_delay,\n speed = distance / air_time * 60\n )\n#> # A tibble: 336,776 × 21\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nBy default, mutate() adds new columns on the right-hand side of your dataset, which makes it difficult to see what’s happening here. We can use the .before argument to instead add the variables to the left-hand side2:\n\nflights |> \n mutate(\n gain = dep_delay - arr_delay,\n speed = distance / air_time * 60,\n .before = 1\n )\n#> # A tibble: 336,776 × 21\n#> gain speed year month day dep_time sched_dep_time dep_delay arr_time\n#> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int>\n#> 1 -9 370. 2013 1 1 517 515 2 830\n#> 2 -16 374. 2013 1 1 533 529 4 850\n#> 3 -31 408. 2013 1 1 542 540 2 923\n#> 4 17 517. 2013 1 1 544 545 -1 1004\n#> 5 19 394. 2013 1 1 554 600 -6 812\n#> 6 -16 288. 2013 1 1 554 558 -4 740\n#> # ℹ 336,770 more rows\n#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, …\n\nThe . indicates that .before is an argument to the function, not the name of a third new variable we are creating. You can also use .after to add after a variable, and in both .before and .after you can use the variable name instead of a position. For example, we could add the new variables after day:\n\nflights |> \n mutate(\n gain = dep_delay - arr_delay,\n speed = distance / air_time * 60,\n .after = day\n )\n\nAlternatively, you can control which variables are kept with the .keep argument. A particularly useful argument is \"used\" which specifies that we only keep the columns that were involved or created in the mutate() step. For example, the following output will contain only the variables dep_delay, arr_delay, air_time, gain, hours, and gain_per_hour.\n\nflights |> \n mutate(\n gain = dep_delay - arr_delay,\n hours = air_time / 60,\n gain_per_hour = gain / hours,\n .keep = \"used\"\n )\n\nNote that since we haven’t assigned the result of the above computation back to flights, the new variables gain, hours, and gain_per_hour will only be printed but will not be stored in a data frame. And if we want them to be available in a data frame for future use, we should think carefully about whether we want the result to be assigned back to flights, overwriting the original data frame with many more variables, or to a new object. 
Often, the right answer is a new object that is named informatively to indicate its contents, e.g., delay_gain, but you might also have good reasons for overwriting flights.\n\n\n4.3.2 select()\nIt’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables:\n\nSelect columns by name:\n\nflights |> \n select(year, month, day)\n\nSelect all columns between year and day (inclusive):\n\nflights |> \n select(year:day)\n\nSelect all columns except those from year to day (inclusive):\n\nflights |> \n select(!year:day)\n\nHistorically this operation was done with - instead of !, so you’re likely to see that in the wild. These two operators serve the same purpose but with subtle differences in behavior. We recommend using ! because it reads as “not” and combines well with & and |.\nSelect all columns that are characters:\n\nflights |> \n select(where(is.character))\n\n\nThere are a number of helper functions you can use within select():\n\nstarts_with(\"abc\"): matches names that begin with “abc”.\nends_with(\"xyz\"): matches names that end with “xyz”.\ncontains(\"ijk\"): matches names that contain “ijk”.\nnum_range(\"x\", 1:3): matches x1, x2 and x3.\n\nSee ?select for more details. Once you know regular expressions (the topic of Chapter 16) you’ll also be able to use matches() to select variables that match a pattern.\nYou can rename variables as you select() them by using =. The new name appears on the left-hand side of the =, and the old variable appears on the right-hand side:\n\nflights |> \n select(tail_num = tailnum)\n#> # A tibble: 336,776 × 1\n#> tail_num\n#> <chr> \n#> 1 N14228 \n#> 2 N24211 \n#> 3 N619AA \n#> 4 N804JB \n#> 5 N668DN \n#> 6 N39463 \n#> # ℹ 336,770 more rows\n\n\n\n4.3.3 rename()\nIf you want to keep all the existing variables and just want to rename a few, you can use rename() instead of select():\n\nflights |> \n rename(tail_num = tailnum)\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nIf you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names() which provides some useful automated cleaning.\n\n\n4.3.4 relocate()\nUse relocate() to move variables around. You might want to collect related variables together or move important variables to the front. 
By default relocate() moves variables to the front:\n\nflights |> \n relocate(time_hour, air_time)\n#> # A tibble: 336,776 × 19\n#> time_hour air_time year month day dep_time sched_dep_time\n#> <dttm> <dbl> <int> <int> <int> <int> <int>\n#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515\n#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529\n#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540\n#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545\n#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600\n#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558\n#> # ℹ 336,770 more rows\n#> # ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, …\n\nYou can also specify where to put them using the .before and .after arguments, just like in mutate():\n\nflights |> \n relocate(year:dep_time, .after = time_hour)\nflights |> \n relocate(starts_with(\"arr\"), .before = dep_time)\n\n\n\n4.3.5 Exercises\n\nCompare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?\nBrainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.\nWhat happens if you specify the name of the same variable multiple times in a select() call?\nWhat does the any_of() function do? Why might it be helpful in conjunction with this vector?\n\nvariables <- c(\"year\", \"month\", \"day\", \"dep_delay\", \"arr_delay\")\n\nDoes the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?\n\nflights |> select(contains(\"TIME\"))\n\nRename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.\nWhy doesn’t the following work, and what does the error mean?\n\nflights |> \n select(tailnum) |> \n arrange(arr_delay)\n#> Error in `arrange()`:\n#> ℹ In argument: `..1 = arr_delay`.\n#> Caused by error:\n#> ! object 'arr_delay' not found",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "data-transform.html#sec-the-pipe",
"href": "data-transform.html#sec-the-pipe",
"title": "4 Data transformation",
"section": "4.4 The pipe",
"text": "4.4 The pipe\nWe’ve shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs. For example, imagine that you wanted to find the fastest flights to Houston’s IAH airport: you need to combine filter(), mutate(), select(), and arrange():\n\nflights |> \n filter(dest == \"IAH\") |> \n mutate(speed = distance / air_time * 60) |> \n select(year:day, dep_time, carrier, flight, speed) |> \n arrange(desc(speed))\n#> # A tibble: 7,198 × 7\n#> year month day dep_time carrier flight speed\n#> <int> <int> <int> <int> <chr> <int> <dbl>\n#> 1 2013 7 9 707 UA 226 522.\n#> 2 2013 8 27 1850 UA 1128 521.\n#> 3 2013 8 28 902 UA 1711 519.\n#> 4 2013 8 28 2122 UA 1022 519.\n#> 5 2013 6 11 1628 UA 1178 515.\n#> 6 2013 8 27 1017 UA 333 515.\n#> # ℹ 7,192 more rows\n\nEven though this pipeline has four steps, it’s easy to skim because the verbs come at the start of each line: start with the flights data, then filter, then mutate, then select, then arrange.\nWhat would happen if we didn’t have the pipe? We could nest each function call inside the previous call:\n\narrange(\n select(\n mutate(\n filter(\n flights, \n dest == \"IAH\"\n ),\n speed = distance / air_time * 60\n ),\n year:day, dep_time, carrier, flight, speed\n ),\n desc(speed)\n)\n\nOr we could use a bunch of intermediate objects:\n\nflights1 <- filter(flights, dest == \"IAH\")\nflights2 <- mutate(flights1, speed = distance / air_time * 60)\nflights3 <- select(flights2, year:day, dep_time, carrier, flight, speed)\narrange(flights3, desc(speed))\n\nWhile both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.\nTo add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use |> instead of %>% as shown in Figure 4.1; more on %>% shortly.\n\n\n\n\n\n\n\n\nFigure 4.1: To insert |>, make sure the “Use native pipe operator” option is checked.\n\n\n\n\n\n\n\n\n\n\n\nmagrittr\n\n\n\nIf you’ve been using the tidyverse for a while, you might be familiar with the %>% pipe provided by the magrittr package. The magrittr package is included in the core tidyverse, so you can use %>% whenever you load the tidyverse:\n\nlibrary(tidyverse)\n\nmtcars %>% \n group_by(cyl) %>%\n summarize(n = n())\n\nFor simple cases, |> and %>% behave identically. So why do we recommend the base pipe? Firstly, because it’s part of base R, it’s always available for you to use, even when you’re not using the tidyverse. Secondly, |> is quite a bit simpler than %>%: in the time between the invention of %>% in 2014 and the inclusion of |> in R 4.1.0 in 2021, we gained a better understanding of the pipe. This allowed the base implementation to jettison infrequently used and less important features.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "data-transform.html#groups",
"href": "data-transform.html#groups",
"title": "4 Data transformation",
"section": "4.5 Groups",
"text": "4.5 Groups\nSo far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: group_by(), summarize(), and the slice family of functions.\n\n4.5.1 group_by()\nUse group_by() to divide your dataset into groups meaningful for your analysis:\n\nflights |> \n group_by(month)\n#> # A tibble: 336,776 × 19\n#> # Groups: month [12]\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\ngroup_by() doesn’t change the data but, if you look closely at the output, you’ll notice that the output indicates that it is “grouped by” month (Groups: month [12]). This means subsequent operations will now work “by month”. group_by() adds this grouped feature (referred to as class) to the data frame, which changes the behavior of the subsequent verbs applied to the data.\n\n\n4.5.2 summarize()\nThe most important grouped operation is a summary, which, if being used to calculate a single summary statistic, reduces the data frame to have a single row for each group. In dplyr, this operation is performed by summarize()3, as shown by the following example, which computes the average departure delay by month:\n\nflights |> \n group_by(month) |> \n summarize(\n avg_delay = mean(dep_delay)\n )\n#> # A tibble: 12 × 2\n#> month avg_delay\n#> <int> <dbl>\n#> 1 1 NA\n#> 2 2 NA\n#> 3 3 NA\n#> 4 4 NA\n#> 5 5 NA\n#> 6 6 NA\n#> # ℹ 6 more rows\n\nUh-oh! Something has gone wrong, and all of our results are NAs (pronounced “N-A”), R’s symbol for missing value. This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an NA result. We’ll come back to discuss missing values in detail in Chapter 19, but for now, we’ll tell the mean() function to ignore all missing values by setting the argument na.rm to TRUE:\n\nflights |> \n group_by(month) |> \n summarize(\n avg_delay = mean(dep_delay, na.rm = TRUE)\n )\n#> # A tibble: 12 × 2\n#> month avg_delay\n#> <int> <dbl>\n#> 1 1 10.0\n#> 2 2 10.8\n#> 3 3 13.2\n#> 4 4 13.9\n#> 5 5 13.0\n#> 6 6 20.8\n#> # ℹ 6 more rows\n\nYou can create any number of summaries in a single call to summarize(). 
You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is n(), which returns the number of rows in each group:\n\nflights |> \n group_by(month) |> \n summarize(\n avg_delay = mean(dep_delay, na.rm = TRUE), \n n = n()\n )\n#> # A tibble: 12 × 3\n#> month avg_delay n\n#> <int> <dbl> <int>\n#> 1 1 10.0 27004\n#> 2 2 10.8 24951\n#> 3 3 13.2 28834\n#> 4 4 13.9 28330\n#> 5 5 13.0 28796\n#> 6 6 20.8 28243\n#> # ℹ 6 more rows\n\nMeans and counts can get you a surprisingly long way in data science!\n\n\n4.5.3 The slice_ functions\nThere are five handy functions that allow you to extract specific rows within each group:\n\ndf |> slice_head(n = 1) takes the first row from each group.\ndf |> slice_tail(n = 1) takes the last row in each group.\ndf |> slice_min(x, n = 1) takes the row with the smallest value of column x.\ndf |> slice_max(x, n = 1) takes the row with the largest value of column x.\ndf |> slice_sample(n = 1) takes one random row.\n\nYou can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group. For example, the following code finds the flights that are most delayed upon arrival at each destination:\n\nflights |> \n group_by(dest) |> \n slice_max(arr_delay, n = 1) |>\n relocate(dest)\n#> # A tibble: 108 × 19\n#> # Groups: dest [105]\n#> dest year month day dep_time sched_dep_time dep_delay arr_time\n#> <chr> <int> <int> <int> <int> <int> <dbl> <int>\n#> 1 ABQ 2013 7 22 2145 2007 98 132\n#> 2 ACK 2013 7 23 1139 800 219 1250\n#> 3 ALB 2013 1 25 123 2000 323 229\n#> 4 ANC 2013 8 17 1740 1625 75 2042\n#> 5 ATL 2013 7 22 2257 759 898 121\n#> 6 AUS 2013 7 10 2056 1505 351 2347\n#> # ℹ 102 more rows\n#> # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, …\n\nNote that there are 105 destinations but we get 108 rows here. What’s up? slice_min() and slice_max() keep tied values so n = 1 means give us all rows with the highest value. If you want exactly one row per group you can set with_ties = FALSE.\nThis is similar to computing the max delay with summarize(), but you get the whole corresponding row (or rows if there’s a tie) instead of the single summary statistic.\n\n\n4.5.4 Grouping by multiple variables\nYou can create groups using more than one variable. For example, we could make a group for each date.\n\ndaily <- flights |> \n group_by(year, month, day)\ndaily\n#> # A tibble: 336,776 × 19\n#> # Groups: year, month, day [365]\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nWhen you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:\n\ndaily_flights <- daily |> \n summarize(n = n())\n#> `summarise()` has grouped output by 'year', 'month'. 
You can override using\n#> the `.groups` argument.\n\nIf you’re happy with this behavior, you can explicitly request it in order to suppress the message:\n\ndaily_flights <- daily |> \n summarize(\n n = n(), \n .groups = \"drop_last\"\n )\n\nAlternatively, change the default behavior by setting a different value, e.g., \"drop\" to drop all grouping or \"keep\" to preserve the same groups.\n\n\n4.5.5 Ungrouping\nYou might also want to remove grouping from a data frame without using summarize(). You can do this with ungroup().\n\ndaily |> \n ungroup()\n#> # A tibble: 336,776 × 19\n#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#> <int> <int> <int> <int> <int> <dbl> <int> <int>\n#> 1 2013 1 1 517 515 2 830 819\n#> 2 2013 1 1 533 529 4 850 830\n#> 3 2013 1 1 542 540 2 923 850\n#> 4 2013 1 1 544 545 -1 1004 1022\n#> 5 2013 1 1 554 600 -6 812 837\n#> 6 2013 1 1 554 558 -4 740 728\n#> # ℹ 336,770 more rows\n#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, …\n\nNow let’s see what happens when you summarize an ungrouped data frame.\n\ndaily |> \n ungroup() |>\n summarize(\n avg_delay = mean(dep_delay, na.rm = TRUE), \n flights = n()\n )\n#> # A tibble: 1 × 2\n#> avg_delay flights\n#> <dbl> <int>\n#> 1 12.6 336776\n\nYou get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.\n\n\n4.5.6 .by\ndplyr 1.1.0 includes a new, experimental syntax for per-operation grouping: the .by argument. group_by() and ungroup() aren’t going away, but you can now also use the .by argument to group within a single operation:\n\nflights |> \n summarize(\n delay = mean(dep_delay, na.rm = TRUE), \n n = n(),\n .by = month\n )\n\nOr if you want to group by multiple variables:\n\nflights |> \n summarize(\n delay = mean(dep_delay, na.rm = TRUE), \n n = n(),\n .by = c(origin, dest)\n )\n\n.by works with all verbs and has the advantage that you don’t need to use the .groups argument to suppress the grouping message or ungroup() when you’re done.\nWe didn’t focus on this syntax in this chapter because it was very new when we wrote the book. We did want to mention it because we think it has a lot of promise and it’s likely to be quite popular. You can learn more about it in the dplyr 1.1.0 blog post.\n\n\n4.5.7 Exercises\n\nWhich carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))\nFind the flights that are most delayed upon departure from each destination.\nHow do delays vary over the course of the day? Illustrate your answer with a plot.\nWhat happens if you supply a negative n to slice_min() and friends?\nExplain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?\nSuppose we have the following tiny data frame:\n\ndf <- tibble(\n x = 1:5,\n y = c(\"a\", \"b\", \"a\", \"a\", \"b\"),\n z = c(\"K\", \"K\", \"L\", \"L\", \"K\")\n)\n\n\nWrite down what you think the output will look like, then check if you were correct, and describe what group_by() does.\n\ndf |>\n group_by(y)\n\nWrite down what you think the output will look like, then check if you were correct, and describe what arrange() does. 
Also, comment on how it’s different from the group_by() in part (a).\n\ndf |>\n arrange(y)\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does.\n\ndf |>\n group_by(y) |>\n summarize(mean_x = mean(x))\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x))\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x), .groups = \"drop\")\n\nWrite down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x))\n\ndf |>\n group_by(y, z) |>\n mutate(mean_x = mean(x))",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "data-transform.html#sec-sample-size",
"href": "data-transform.html#sec-sample-size",
"title": "4 Data transformation",
"section": "4.6 Case study: aggregates and sample size",
"text": "4.6 Case study: aggregates and sample size\nWhenever you do any aggregation, it’s always a good idea to include a count (n()). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. We’ll demonstrate this with some baseball data from the Lahman package. Specifically, we will compare what proportion of times a player gets a hit (H) vs. the number of times they try to put the ball in play (AB):\n\nbatters <- Lahman::Batting |> \n group_by(playerID) |> \n summarize(\n performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),\n n = sum(AB, na.rm = TRUE)\n )\nbatters\n#> # A tibble: 20,730 × 3\n#> playerID performance n\n#> <chr> <dbl> <int>\n#> 1 aardsda01 0 4\n#> 2 aaronha01 0.305 12364\n#> 3 aaronto01 0.229 944\n#> 4 aasedo01 0 5\n#> 5 abadan01 0.0952 21\n#> 6 abadfe01 0.111 9\n#> # ℹ 20,724 more rows\n\nWhen we plot the skill of the batter (measured by the batting average, performance) against the number of opportunities to hit the ball (measured by times at bat, n), you see two patterns:\n\nThe variation in performance is larger among players with fewer at-bats. The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you’ll see that the variation decreases as the sample size increases4.\nThere’s a positive correlation between skill (performance) and opportunities to hit the ball (n) because teams want to give their best batters the most opportunities to hit the ball.\n\n\nbatters |> \n filter(n > 100) |> \n ggplot(aes(x = n, y = performance)) +\n geom_point(alpha = 1 / 10) + \n geom_smooth(se = FALSE)\n\n\n\n\n\n\n\n\nNote the handy pattern for combining ggplot2 and dplyr. You just have to remember to switch from |>, for dataset processing, to + for adding layers to your plot.\nThis also has important implications for ranking. If you naively sort on desc(performance), the people with the best batting averages are clearly the ones who tried to put the ball in play very few times and happened to get a hit, they’re not necessarily the most skilled players:\n\nbatters |> \n arrange(desc(performance))\n#> # A tibble: 20,730 × 3\n#> playerID performance n\n#> <chr> <dbl> <int>\n#> 1 abramge01 1 1\n#> 2 alberan01 1 1\n#> 3 banisje01 1 1\n#> 4 bartocl01 1 1\n#> 5 bassdo01 1 1\n#> 6 birasst01 1 2\n#> # ℹ 20,724 more rows\n\nYou can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "data-transform.html#summary",
"href": "data-transform.html#summary",
"title": "4 Data transformation",
"section": "4.7 Summary",
"text": "4.7 Summary\nIn this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like filter() and arrange()), those that manipulate the columns (like select() and mutate()) and those that manipulate groups (like group_by() and summarize()). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with the individual variable. We’ll return to that in the Transform part of the book, where each chapter provides tools for a specific type of variable.\nIn the next chapter, we’ll pivot back to workflow to discuss the importance of code style and keeping your code well organized to make it easy for you and others to read and understand.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "data-transform.html#footnotes",
"href": "data-transform.html#footnotes",
"title": "4 Data transformation",
"section": "",
"text": "Later, you’ll learn about the slice_*() family, which allows you to choose rows based on their positions.↩︎\nRemember that in RStudio, the easiest way to see a dataset with many columns is View().↩︎\nOr summarise(), if you prefer British English.↩︎\n*cough* the law of large numbers *cough*.↩︎",
"crumbs": [
"Whole game",
"<span class='chapter-number'>4</span> <span class='chapter-title'>Data transformation</span>"
]
},
{
"objectID": "workflow-style.html",
"href": "workflow-style.html",
"title": "5 Workflow: code style",
"section": "",
"text": "5.1 Names\nGood coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer, it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work and is particularly important if you need to get help from someone else. This chapter will introduce the most important points of the tidyverse style guide, which is used throughout this book.\nStyling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the styler package by Lorenz Walthert. Once you’ve installed it with install.packages(\"styler\"), an easy way to use it is via RStudio’s command palette. The command palette lets you use any built-in RStudio command and many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts offered by styler. Figure 5.1 shows the results.\nWe’ll use the tidyverse and nycflights13 packages for code examples in this chapter.\nWe talked briefly about names in Section 3.3. Remember that variable names (those created by <- and those created by mutate()) should use only lowercase letters, numbers, and _. Use _ to separate words within a name.\n# Strive for:\nshort_flights <- flights |> filter(air_time < 60)\n\n# Avoid:\nSHORTFLIGHTS <- flights |> filter(air_time < 60)\nAs a general rule of thumb, it’s better to prefer long, descriptive names that are easy to understand rather than concise names that are fast to type. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but it can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.\nIf you have a bunch of names for related things, do your best to be consistent. It’s easy for inconsistencies to arise when you forget a previous convention, so don’t feel bad if you have to go back and rename things. In general, if you have a bunch of variables that are a variation on a theme, you’re better off giving them a common prefix rather than a common suffix because autocomplete works best on the start of a variable.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Workflow: code style</span>"
]
},
{
"objectID": "workflow-style.html#spaces",
"href": "workflow-style.html#spaces",
"title": "5 Workflow: code style",
"section": "5.2 Spaces",
"text": "5.2 Spaces\nPut spaces on either side of mathematical operators apart from ^ (i.e. +, -, ==, <, …), and around the assignment operator (<-).\n\n# Strive for\nz <- (a + b)^2 / d\n\n# Avoid\nz<-( a + b ) ^ 2/d\n\nDon’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in standard English.\n\n# Strive for\nmean(x, na.rm = TRUE)\n\n# Avoid\nmean (x ,na.rm=TRUE)\n\nIt’s OK to add extra spaces if it improves alignment. For example, if you’re creating multiple variables in mutate(), you might want to add spaces so that all the = line up.1 This makes it easier to skim the code.\n\nflights |> \n mutate(\n speed = distance / air_time,\n dep_hour = dep_time %/% 100,\n dep_minute = dep_time %% 100\n )",
"crumbs": [
"Whole game",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Workflow: code style</span>"
]
},
{
"objectID": "workflow-style.html#sec-pipes",
"href": "workflow-style.html#sec-pipes",
"title": "5 Workflow: code style",
"section": "5.3 Pipes",
"text": "5.3 Pipes\n|> should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 10,000 ft view by skimming the verbs on the left-hand side.\n\n# Strive for \nflights |> \n filter(!is.na(arr_delay), !is.na(tailnum)) |> \n count(dest)\n\n# Avoid\nflights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)\n\nIf the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line. If the function doesn’t have named arguments (like select() or filter()), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.\n\n# Strive for\nflights |> \n group_by(tailnum) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE),\n n = n()\n )\n\n# Avoid\nflights |>\n group_by(\n tailnum\n ) |> \n summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())\n\nAfter the first step of the pipeline, indent each line by two spaces. RStudio will automatically put the spaces in for you after a line break following a |> . If you’re putting each argument on its own line, indent by an extra two spaces. Make sure ) is on its own line, and un-indented to match the horizontal position of the function name.\n\n# Strive for \nflights |> \n group_by(tailnum) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE),\n n = n()\n )\n\n# Avoid\nflights|>\n group_by(tailnum) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE), \n n = n()\n )\n\n# Avoid\nflights|>\n group_by(tailnum) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE), \n n = n()\n )\n\nIt’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.\n\n# This fits compactly on one line\ndf |> mutate(y = x + 1)\n\n# While this takes up 4x as many lines, it's easily extended to \n# more variables and more steps in the future\ndf |> \n mutate(\n y = x + 1\n )\n\nFinally, be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name. The names will help cue the reader into what’s happening and makes it easier to check that intermediate results are as expected. Whenever you can give something an informative name, you should give it an informative name, for example when you fundamentally change the structure of the data, e.g., after pivoting or summarizing. Don’t expect to get it right the first time! This means breaking up long pipelines if there are intermediate states that can get good names.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Workflow: code style</span>"
]
},
{
"objectID": "workflow-style.html#ggplot2",
"href": "workflow-style.html#ggplot2",
"title": "5 Workflow: code style",
"section": "5.4 ggplot2",
"text": "5.4 ggplot2\nThe same basic rules that apply to the pipe also apply to ggplot2; just treat + the same way as |>.\n\nflights |> \n group_by(month) |> \n summarize(\n delay = mean(arr_delay, na.rm = TRUE)\n ) |> \n ggplot(aes(x = month, y = delay)) +\n geom_point() + \n geom_line()\n\nAgain, if you can’t fit all of the arguments to a function on to a single line, put each argument on its own line:\n\nflights |> \n group_by(dest) |> \n summarize(\n distance = mean(distance),\n speed = mean(distance / air_time, na.rm = TRUE)\n ) |> \n ggplot(aes(x = distance, y = speed)) +\n geom_smooth(\n method = \"loess\",\n span = 0.5,\n se = FALSE, \n color = \"white\", \n linewidth = 4\n ) +\n geom_point()\n\nWatch for the transition from |> to +. We wish this transition wasn’t necessary, but unfortunately, ggplot2 was written before the pipe was discovered.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Workflow: code style</span>"
]
},
{
"objectID": "workflow-style.html#sectioning-comments",
"href": "workflow-style.html#sectioning-comments",
"title": "5 Workflow: code style",
"section": "5.5 Sectioning comments",
"text": "5.5 Sectioning comments\nAs your scripts get longer, you can use sectioning comments to break up your file into manageable pieces:\n\n# Load data --------------------------------------\n\n# Plot data --------------------------------------\n\nRStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in Figure 5.2.\n\n\n\n\n\n\n\n\nFigure 5.2: After adding sectioning comments to your script, you can easily navigate to them using the code navigation tool in the bottom-left of the script editor.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Workflow: code style</span>"
]
},
{
"objectID": "workflow-style.html#exercises",
"href": "workflow-style.html#exercises",
"title": "5 Workflow: code style",
"section": "5.6 Exercises",
"text": "5.6 Exercises\n\nRestyle the following pipelines following the guidelines above.\n\nflights|>filter(dest==\"IAH\")|>group_by(year,month,day)|>summarize(n=n(),\ndelay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)\n\nflights|>filter(carrier==\"UA\",dest%in%c(\"IAH\",\"HOU\"),sched_dep_time>\n0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(\narr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)",
"crumbs": [
"Whole game",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Workflow: code style</span>"
]
},
{
"objectID": "workflow-style.html#summary",
"href": "workflow-style.html#summary",
"title": "5 Workflow: code style",
"section": "5.7 Summary",
"text": "5.7 Summary\nIn this chapter, you’ve learned the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you’ll see how important a consistent style is. And don’t forget about the styler package: it’s a great way to quickly improve the quality of poorly styled code.\nIn the next chapter, we switch back to data science tools, learning about tidy data. Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. Of course, life is never easy, and most datasets you encounter in the wild will not already be tidy. So we’ll also teach you how to use the tidyr package to tidy your untidy data.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Workflow: code style</span>"
]
},
{
"objectID": "workflow-style.html#footnotes",
"href": "workflow-style.html#footnotes",
"title": "5 Workflow: code style",
"section": "",
"text": "Since dep_time is in HMM or HHMM format, we use integer division (%/%) to get hour and remainder (also known as modulo, %%) to get minute.↩︎",
"crumbs": [
"Whole game",
"<span class='chapter-number'>5</span> <span class='chapter-title'>Workflow: code style</span>"
]
},
{
"objectID": "data-tidy.html",
"href": "data-tidy.html",
"title": "6 Data tidying",
"section": "",
"text": "6.1 Introduction\nIn this chapter, you will learn a consistent way to organize your data in R using a system called tidy data. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.\nIn this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Data tidying</span>"
]
},
{
"objectID": "data-tidy.html#introduction",
"href": "data-tidy.html#introduction",
"title": "6 Data tidying",
"section": "",
"text": "“Happy families are all alike; every unhappy family is unhappy in its own way.”\n— Leo Tolstoy\n\n\n“Tidy datasets are all alike, but every messy dataset is messy in its own way.”\n— Hadley Wickham\n\n\n\n\n6.1.1 Prerequisites\nIn this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.\n\nlibrary(tidyverse)\n\nFrom this chapter on, we’ll suppress the loading message from library(tidyverse).",
"crumbs": [
"Whole game",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Data tidying</span>"
]
},
{
"objectID": "data-tidy.html#sec-tidy-data",
"href": "data-tidy.html#sec-tidy-data",
"title": "6 Data tidying",
"section": "6.2 Tidy data",
"text": "6.2 Tidy data\nYou can represent the same underlying data in multiple ways. The example below shows the same data organized in three different ways. Each dataset shows the same values of four variables: country, year, population, and number of documented cases of TB (tuberculosis), but each dataset organizes the values in a different way.\n\ntable1\n#> # A tibble: 6 × 4\n#> country year cases population\n#> <chr> <dbl> <dbl> <dbl>\n#> 1 Afghanistan 1999 745 19987071\n#> 2 Afghanistan 2000 2666 20595360\n#> 3 Brazil 1999 37737 172006362\n#> 4 Brazil 2000 80488 174504898\n#> 5 China 1999 212258 1272915272\n#> 6 China 2000 213766 1280428583\n\ntable2\n#> # A tibble: 12 × 4\n#> country year type count\n#> <chr> <dbl> <chr> <dbl>\n#> 1 Afghanistan 1999 cases 745\n#> 2 Afghanistan 1999 population 19987071\n#> 3 Afghanistan 2000 cases 2666\n#> 4 Afghanistan 2000 population 20595360\n#> 5 Brazil 1999 cases 37737\n#> 6 Brazil 1999 population 172006362\n#> # ℹ 6 more rows\n\ntable3\n#> # A tibble: 6 × 3\n#> country year rate \n#> <chr> <dbl> <chr> \n#> 1 Afghanistan 1999 745/19987071 \n#> 2 Afghanistan 2000 2666/20595360 \n#> 3 Brazil 1999 37737/172006362 \n#> 4 Brazil 2000 80488/174504898 \n#> 5 China 1999 212258/1272915272\n#> 6 China 2000 213766/1280428583\n\nThese are all representations of the same underlying data, but they are not equally easy to use. One of them, table1, will be much easier to work with inside the tidyverse because it’s tidy.\nThere are three interrelated rules that make a dataset tidy:\n\nEach variable is a column; each column is a variable.\nEach observation is a row; each row is an observation.\nEach value is a cell; each cell is a single value.\n\nFigure 6.1 shows the rules visually.\n\n\n\n\n\n\n\n\nFigure 6.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.\n\n\n\n\n\nWhy ensure that your data is tidy? There are two main advantages:\n\nThere’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.\nThere’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in Section 4.3.1 and Section 4.5.2, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.\n\ndplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. 
Here are a few small examples showing how you might work with table1.\n\n# Compute rate per 10,000\ntable1 |>\n mutate(rate = cases / population * 10000)\n#> # A tibble: 6 × 5\n#> country year cases population rate\n#> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 Afghanistan 1999 745 19987071 0.373\n#> 2 Afghanistan 2000 2666 20595360 1.29 \n#> 3 Brazil 1999 37737 172006362 2.19 \n#> 4 Brazil 2000 80488 174504898 4.61 \n#> 5 China 1999 212258 1272915272 1.67 \n#> 6 China 2000 213766 1280428583 1.67\n\n# Compute total cases per year\ntable1 |> \n group_by(year) |> \n summarize(total_cases = sum(cases))\n#> # A tibble: 2 × 2\n#> year total_cases\n#> <dbl> <dbl>\n#> 1 1999 250740\n#> 2 2000 296920\n\n# Visualize changes over time\nggplot(table1, aes(x = year, y = cases)) +\n geom_line(aes(group = country), color = \"grey50\") +\n geom_point(aes(color = country, shape = country)) +\n scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000\n\n\n\n\n\n\n\n\n\n6.2.1 Exercises\n\nFor each of the sample tables, describe what each observation and each column represents.\nSketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:\n\nExtract the number of TB cases per country per year.\nExtract the matching population per country per year.\nDivide cases by population, and multiply by 10000.\nStore back in the appropriate place.\n\nYou haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Data tidying</span>"
]
},
{
"objectID": "data-tidy.html#sec-pivoting",
"href": "data-tidy.html#sec-pivoting",
"title": "6 Data tidying",
"section": "6.3 Lengthening data",
"text": "6.3 Lengthening data\nThe principles of tidy data might seem so obvious that you wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most real data is untidy. There are two main reasons:\n\nData is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.\nMost people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.\n\nThis means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll pivot your data into a tidy form, with variables in the columns and observations in the rows.\ntidyr provides two functions for pivoting data: pivot_longer() and pivot_wider(). We’ll first start with pivot_longer() because it’s the most common case. Let’s dive into some examples.\n\n6.3.1 Data in column names\nThe billboard dataset records the billboard rank of songs in the year 2000:\n\nbillboard\n#> # A tibble: 317 × 79\n#> artist track date.entered wk1 wk2 wk3 wk4 wk5\n#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 2 Pac Baby Don't Cry (Ke… 2000-02-26 87 82 72 77 87\n#> 2 2Ge+her The Hardest Part O… 2000-09-02 91 87 92 NA NA\n#> 3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66\n#> 4 3 Doors Down Loser 2000-10-21 76 76 72 69 67\n#> 5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17\n#> 6 98^0 Give Me Just One N… 2000-08-19 51 39 34 26 26\n#> # ℹ 311 more rows\n#> # ℹ 71 more variables: wk6 <dbl>, wk7 <dbl>, wk8 <dbl>, wk9 <dbl>, …\n\nIn this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week1. Here, the column names are one variable (the week) and the cell values are another (the rank).\nTo tidy this data, we’ll use pivot_longer():\n\nbillboard |> \n pivot_longer(\n cols = starts_with(\"wk\"), \n names_to = \"week\", \n values_to = \"rank\"\n )\n#> # A tibble: 24,092 × 5\n#> artist track date.entered week rank\n#> <chr> <chr> <date> <chr> <dbl>\n#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87\n#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82\n#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72\n#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77\n#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87\n#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94\n#> 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99\n#> 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA\n#> 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA\n#> 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA\n#> # ℹ 24,082 more rows\n\nAfter the data, there are three key arguments:\n\ncols specifies which columns need to be pivoted, i.e. which columns aren’t variables. 
This argument uses the same syntax as select() so here we could use !c(artist, track, date.entered) or starts_with(\"wk\").\nnames_to names the variable stored in the column names, we named that variable week.\nvalues_to names the variable stored in the cell values, we named that variable rank.\n\nNote that in the code \"week\" and \"rank\" are quoted because those are new variables we’re creating, they don’t yet exist in the data when we run the pivot_longer() call.\nNow let’s turn our attention to the resulting, longer data frame. What happens if a song is in the top 100 for less than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These NAs don’t really represent unknown observations; they were forced to exist by the structure of the dataset2, so we can ask pivot_longer() to get rid of them by setting values_drop_na = TRUE:\n\nbillboard |> \n pivot_longer(\n cols = starts_with(\"wk\"), \n names_to = \"week\", \n values_to = \"rank\",\n values_drop_na = TRUE\n )\n#> # A tibble: 5,307 × 5\n#> artist track date.entered week rank\n#> <chr> <chr> <date> <chr> <dbl>\n#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87\n#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82\n#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72\n#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77\n#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87\n#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94\n#> # ℹ 5,301 more rows\n\nThe number of rows is now much lower, indicating that many rows with NAs were dropped.\nYou might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns wk77, wk78, … would be added to the dataset.\nThis data is now tidy, but we could make future computation a bit easier by converting values of week from character strings to numbers using mutate() and readr::parse_number(). parse_number() is a handy function that will extract the first number from a string, ignoring all other text.\n\nbillboard_longer <- billboard |> \n pivot_longer(\n cols = starts_with(\"wk\"), \n names_to = \"week\", \n values_to = \"rank\",\n values_drop_na = TRUE\n ) |> \n mutate(\n week = parse_number(week)\n )\nbillboard_longer\n#> # A tibble: 5,307 × 5\n#> artist track date.entered week rank\n#> <chr> <chr> <date> <dbl> <dbl>\n#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87\n#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82\n#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72\n#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77\n#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87\n#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94\n#> # ℹ 5,301 more rows\n\nNow that we have all the week numbers in one variable and all the rank values in another, we’re in a good position to visualize how song ranks vary over time. The code is shown below and the result is in Figure 6.2. We can see that very few songs stay in the top 100 for more than 20 weeks.\n\nbillboard_longer |> \n ggplot(aes(x = week, y = rank, group = track)) + \n geom_line(alpha = 0.25) + \n scale_y_reverse()\n\n\n\n\n\n\n\nFigure 6.2: A line plot showing how the rank of a song changes over time.\n\n\n\n\n\n\n\n6.3.2 How does pivoting work?\nNow that you’ve seen how we can use pivoting to reshape our data, let’s take a little time to gain some intuition about what pivoting does to the data. 
Let’s start with a very simple dataset to make it easier to see what’s happening. Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient. We’ll create the data with tribble(), a handy function for constructing small tibbles by hand:\n\ndf <- tribble(\n ~id, ~bp1, ~bp2,\n \"A\", 100, 120,\n \"B\", 140, 115,\n \"C\", 120, 125\n)\n\nWe want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we need to pivot df longer:\n\ndf |> \n pivot_longer(\n cols = bp1:bp2,\n names_to = \"measurement\",\n values_to = \"value\"\n )\n#> # A tibble: 6 × 3\n#> id measurement value\n#> <chr> <chr> <dbl>\n#> 1 A bp1 100\n#> 2 A bp2 120\n#> 3 B bp1 140\n#> 4 B bp2 115\n#> 5 C bp1 120\n#> 6 C bp2 125\n\nHow does the reshaping work? It’s easier to see if we think about it column by column. As shown in Figure 6.3, the values in a column that was already a variable in the original dataset (id) need to be repeated, once for each column that is pivoted.\n\n\n\n\n\n\n\n\nFigure 6.3: Columns that are already variables need to be repeated, once for each column that is pivoted.\n\n\n\n\n\nThe column names become values in a new variable, whose name is defined by names_to, as shown in Figure 6.4. They need to be repeated once for each row in the original dataset.\n\n\n\n\n\n\n\n\nFigure 6.4: The column names of pivoted columns become values in a new column. The values need to be repeated once for each row of the original dataset.\n\n\n\n\n\nThe cell values also become values in a new variable, with a name defined by values_to. They are unwound row by row. Figure 6.5 illustrates the process.\n\n\n\n\n\n\n\n\nFigure 6.5: The number of values is preserved (not repeated), but unwound row-by-row.\n\n\n\n\n\n\n\n6.3.3 Many variables in column names\nA more challenging situation occurs when you have multiple pieces of information crammed into the column names, and you would like to store these in separate new variables. For example, take the who2 dataset, the source of table1 and friends that you saw above:\n\nwho2\n#> # A tibble: 7,240 × 58\n#> country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554\n#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 Afghanistan 1980 NA NA NA NA NA\n#> 2 Afghanistan 1981 NA NA NA NA NA\n#> 3 Afghanistan 1982 NA NA NA NA NA\n#> 4 Afghanistan 1983 NA NA NA NA NA\n#> 5 Afghanistan 1984 NA NA NA NA NA\n#> 6 Afghanistan 1985 NA NA NA NA NA\n#> # ℹ 7,234 more rows\n#> # ℹ 51 more variables: sp_m_5564 <dbl>, sp_m_65 <dbl>, sp_f_014 <dbl>, …\n\nThis dataset, collected by the World Health Organisation, records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: country and year. They are followed by 56 columns like sp_m_014, ep_m_4554, and rel_m_3544. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by _. 
The first piece, sp/rel/ep, describes the method used for the diagnosis, the second piece, m/f, is the gender (coded as a binary variable in this dataset), and the third piece, 014/1524/2534/3544/4554/5564/65, is the age range (014 represents 0-14, for example).\nSo in this case we have six pieces of information recorded in who2: the country and the year (already columns); the method of diagnosis, the gender category, and the age range category (contained in the other column names); and the count of patients in that category (cell values). To organize these six pieces of information in six separate columns, we use pivot_longer() with a vector of column names for names_to and instructions for splitting the original variable names into pieces for names_sep as well as a column name for values_to:\n\nwho2 |> \n pivot_longer(\n cols = !(country:year),\n names_to = c(\"diagnosis\", \"gender\", \"age\"), \n names_sep = \"_\",\n values_to = \"count\"\n )\n#> # A tibble: 405,440 × 6\n#> country year diagnosis gender age count\n#> <chr> <dbl> <chr> <chr> <chr> <dbl>\n#> 1 Afghanistan 1980 sp m 014 NA\n#> 2 Afghanistan 1980 sp m 1524 NA\n#> 3 Afghanistan 1980 sp m 2534 NA\n#> 4 Afghanistan 1980 sp m 3544 NA\n#> 5 Afghanistan 1980 sp m 4554 NA\n#> 6 Afghanistan 1980 sp m 5564 NA\n#> # ℹ 405,434 more rows\n\nAn alternative to names_sep is names_pattern, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in Chapter 16.\nConceptually, this is only a minor variation on the simpler case you’ve already seen. Figure 6.6 shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that’s faster.\n\nFigure 6.6: Pivoting columns with multiple pieces of information in the names means that each column name now fills in values in multiple output columns.\n\n\n6.3.4 Data and variable names in the column headers\nThe next step up in complexity is when the column names include a mix of variable values and variable names. For example, take the household dataset:\n\nhousehold\n#> # A tibble: 5 × 5\n#> family dob_child1 dob_child2 name_child1 name_child2\n#> <int> <date> <date> <chr> <chr> \n#> 1 1 1998-11-26 2000-01-29 Susan Jose \n#> 2 2 1996-06-22 NA Mark <NA> \n#> 3 3 2002-07-11 2004-04-05 Sam Seth \n#> 4 4 2004-10-10 2009-08-27 Craig Khai \n#> 5 5 2000-12-05 2005-02-28 Parker Gracie\n\nThis dataset contains data about five families, with the names and dates of birth of up to two children. The new challenge in this dataset is that the column names contain the names of two variables (dob, name) and the values of another (child, with values 1 or 2). To solve this problem we again need to supply a vector to names_to but this time we use the special \".value\" sentinel; this isn’t the name of a variable but a unique value that tells pivot_longer() to do something different. 
This overrides the usual values_to argument to use the first component of the pivoted column name as a variable name in the output.\n\nhousehold |> \n pivot_longer(\n cols = !family, \n names_to = c(\".value\", \"child\"), \n names_sep = \"_\", \n values_drop_na = TRUE\n )\n#> # A tibble: 9 × 4\n#> family child dob name \n#> <int> <chr> <date> <chr>\n#> 1 1 child1 1998-11-26 Susan\n#> 2 1 child2 2000-01-29 Jose \n#> 3 2 child1 1996-06-22 Mark \n#> 4 3 child1 2002-07-11 Sam \n#> 5 3 child2 2004-04-05 Seth \n#> 6 4 child1 2004-10-10 Craig\n#> # ℹ 3 more rows\n\nWe again use values_drop_na = TRUE, since the shape of the input forces the creation of explicit missing values (e.g., for families with only one child).\nFigure 6.7 illustrates the basic idea with a simpler example. When you use \".value\" in names_to, the column names in the input contribute to both values and variable names in the output.\n\nFigure 6.7: Pivoting with names_to = c(\".value\", \"num\") splits the column names into two components: the first part determines the output column name (x or y), and the second part determines the value of the num column.
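\nHere’s a minimal sketch of the idea in Figure 6.7, using toy data of our own:\n\ndf <- tribble(\n ~id, ~x_1, ~x_2, ~y_1, ~y_2,\n \"A\", 1, 2, 3, 4,\n \"B\", 5, 6, 7, 8\n)\n\ndf |> \n pivot_longer(\n cols = !id, \n names_to = c(\".value\", \"num\"), \n names_sep = \"_\"\n )",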
"crumbs": [
"Whole game",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Data tidying</span>"
]
},
{
"objectID": "data-tidy.html#widening-data",
"href": "data-tidy.html#widening-data",
"title": "6 Data tidying",
"section": "6.4 Widening data",
"text": "6.4 Widening data\nSo far we’ve used pivot_longer() to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to pivot_wider(), which makes datasets wider by increasing columns and reducing rows and helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.\nWe’ll start by looking at cms_patient_experience, a dataset from the Centers of Medicare and Medicaid services that collects data about patient experiences:\n\ncms_patient_experience\n#> # A tibble: 500 × 5\n#> org_pac_id org_nm measure_cd measure_title prf_rate\n#> <chr> <chr> <chr> <chr> <dbl>\n#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS… 63\n#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS… 87\n#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS… 86\n#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS… 57\n#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS… 85\n#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS… 24\n#> # ℹ 494 more rows\n\nThe core unit being studied is an organization, but each organization is spread across six rows, with one row for each measurement taken in the survey organization. We can see the complete set of values for measure_cd and measure_title by using distinct():\n\ncms_patient_experience |> \n distinct(measure_cd, measure_title)\n#> # A tibble: 6 × 2\n#> measure_cd measure_title \n#> <chr> <chr> \n#> 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…\n#> 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate \n#> 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider \n#> 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education \n#> 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff \n#> 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources\n\nNeither of these columns will make particularly great variable names: measure_cd doesn’t hint at the meaning of the variable and measure_title is a long sentence containing spaces. We’ll use measure_cd as the source for our new column names for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.\npivot_wider() has the opposite interface to pivot_longer(): instead of choosing new column names, we need to provide the existing columns that define the values (values_from) and the column name (names_from):\n\ncms_patient_experience |> \n pivot_wider(\n names_from = measure_cd,\n values_from = prf_rate\n )\n#> # A tibble: 500 × 9\n#> org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2\n#> <chr> <chr> <chr> <dbl> <dbl>\n#> 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… 63 NA\n#> 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA 87\n#> 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA\n#> 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA\n#> 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA\n#> 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA\n#> # ℹ 494 more rows\n#> # ℹ 4 more variables: CAHPS_GRP_3 <dbl>, CAHPS_GRP_5 <dbl>, …\n\nThe output doesn’t look quite right; we still seem to have multiple rows for each organization. 
That’s because we also need to tell pivot_wider() which column or columns have values that uniquely identify each row; in this case those are the variables starting with \"org\":\n\ncms_patient_experience |> \n pivot_wider(\n id_cols = starts_with(\"org\"),\n names_from = measure_cd,\n values_from = prf_rate\n )\n#> # A tibble: 95 × 8\n#> org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5\n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 0446157747 USC CARE MEDICA… 63 87 86 57\n#> 2 0446162697 ASSOCIATION OF … 59 85 83 63\n#> 3 0547164295 BEAVER MEDICAL … 49 NA 75 44\n#> 4 0749333730 CAPE PHYSICIANS… 67 84 85 65\n#> 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64\n#> 6 0840109864 REX HOSPITAL INC 73 87 84 67\n#> # ℹ 89 more rows\n#> # ℹ 2 more variables: CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl>\n\nThis gives us the output that we’re looking for.\n\n6.4.1 How does pivot_wider() work?\nTo understand how pivot_wider() works, let’s again start with a very simple dataset. This time we have two patients with ids A and B, with three blood pressure measurements on patient A and two on patient B:\n\ndf <- tribble(\n ~id, ~measurement, ~value,\n \"A\", \"bp1\", 100,\n \"B\", \"bp1\", 140,\n \"B\", \"bp2\", 115, \n \"A\", \"bp2\", 120,\n \"A\", \"bp3\", 105\n)\n\nWe’ll take the values from the value column and the names from the measurement column:\n\ndf |> \n pivot_wider(\n names_from = measurement,\n values_from = value\n )\n#> # A tibble: 2 × 4\n#> id bp1 bp2 bp3\n#> <chr> <dbl> <dbl> <dbl>\n#> 1 A 100 120 105\n#> 2 B 140 115 NA\n\nTo begin the process pivot_wider() needs to first figure out what will go in the rows and columns. The new column names will be the unique values of measurement.\n\ndf |> \n distinct(measurement) |> \n pull()\n#> [1] \"bp1\" \"bp2\" \"bp3\"\n\nBy default, the rows in the output are determined by all the variables that aren’t going into the new names or values. These are called the id_cols. Here there is only one column, but in general there can be any number.\n\ndf |> \n select(-measurement, -value) |> \n distinct()\n#> # A tibble: 2 × 1\n#> id \n#> <chr>\n#> 1 A \n#> 2 B\n\npivot_wider() then combines these results to generate an empty data frame:\n\ndf |> \n select(-measurement, -value) |> \n distinct() |> \n mutate(bp1 = NA, bp2 = NA, bp3 = NA)\n#> # A tibble: 2 × 4\n#> id bp1 bp2 bp3 \n#> <chr> <lgl> <lgl> <lgl>\n#> 1 A NA NA NA \n#> 2 B NA NA NA\n\nIt then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no third blood pressure measurement for patient B, so that cell remains missing. We’ll come back to this idea that pivot_wider() can “make” missing values in Chapter 19.\nYou might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. 
The example below has two rows that correspond to id “A” and measurement “bp1”:\n\ndf <- tribble(\n ~id, ~measurement, ~value,\n \"A\", \"bp1\", 100,\n \"A\", \"bp1\", 102,\n \"A\", \"bp2\", 120,\n \"B\", \"bp1\", 140, \n \"B\", \"bp2\", 115\n)\n\nIf we attempt to pivot this, we get an output that contains list-columns, which you’ll learn more about in Chapter 24:\n\ndf |>\n pivot_wider(\n names_from = measurement,\n values_from = value\n )\n#> Warning: Values from `value` are not uniquely identified; output will contain\n#> list-cols.\n#> • Use `values_fn = list` to suppress this warning.\n#> • Use `values_fn = {summary_fun}` to summarise duplicates.\n#> • Use the following dplyr code to identify duplicates.\n#> {data} |>\n#> dplyr::summarise(n = dplyr::n(), .by = c(id, measurement)) |>\n#> dplyr::filter(n > 1L)\n#> # A tibble: 2 × 3\n#> id bp1 bp2 \n#> <chr> <list> <list> \n#> 1 A <dbl [2]> <dbl [1]>\n#> 2 B <dbl [1]> <dbl [1]>\n\nSince you don’t know how to work with this sort of data yet, you’ll want to follow the hint in the warning to figure out where the problem is:\n\ndf |> \n group_by(id, measurement) |> \n summarize(n = n(), .groups = \"drop\") |> \n filter(n > 1)\n#> # A tibble: 1 × 3\n#> id measurement n\n#> <chr> <chr> <int>\n#> 1 A bp1 2\n\nIt’s then up to you to figure out what’s gone wrong with your data and either repair the underlying damage or use your grouping and summarizing skills to ensure that each combination of row and column values only has a single row.
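\nFor example, one possible repair (a sketch, assuming that averaging duplicate measurements is acceptable for your data):\n\ndf |> \n group_by(id, measurement) |> \n summarize(value = mean(value), .groups = \"drop\") |> \n pivot_wider(\n names_from = measurement,\n values_from = value\n )",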
"crumbs": [
"Whole game",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Data tidying</span>"
]
},
{
"objectID": "data-tidy.html#summary",
"href": "data-tidy.html#summary",
"title": "6 Data tidying",
"section": "6.5 Summary",
"text": "6.5 Summary\nIn this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions, the main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about pivot_longer() and pivot_wider() which allow you to tidy up many untidy datasets. The examples we presented here are a selection of those from vignette(\"pivot\", package = \"tidyr\"), so if you encounter a problem that this chapter doesn’t help you with, that vignette is a good place to try next.\nAnother challenge is that, for a given dataset, it can be impossible to label the longer or the wider version as the “tidy” one. This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn’t actually define what a variable is (and it’s surprisingly hard to do so). It’s totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest. So if you’re stuck figuring out how to do some computation, consider switching up the organisation of your data; don’t be afraid to untidy, transform, and re-tidy as needed!\nIf you enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the Tidy Data paper published in the Journal of Statistical Software.\nNow that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Data tidying</span>"
]
},
{
"objectID": "data-tidy.html#footnotes",
"href": "data-tidy.html#footnotes",
"title": "6 Data tidying",
"section": "",
"text": "The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.↩︎\nWe’ll come back to this idea in Chapter 19.↩︎",
"crumbs": [
"Whole game",
"<span class='chapter-number'>6</span> <span class='chapter-title'>Data tidying</span>"
]
},
{
"objectID": "workflow-scripts.html",
"href": "workflow-scripts.html",
"title": "7 Workflow: scripts and projects",
"section": "",
"text": "7.1 Scripts\nThis chapter will introduce you to two essential tools for organizing your code: scripts and projects.\nSo far, you have used the console to run code. That’s a great place to start, but you’ll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines. To give yourself more room to work, use the script editor. Open it up by clicking the File menu, selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you’ll see four panes, as in Figure 7.1. The script editor is a great place to experiment with your code. When you want to change something, you don’t have to re-type the whole thing, you can just edit the script and re-run it. And once you have written code that works and does what you want, you can save it as a script file to easily return to later.\nFigure 7.1: Opening the script editor adds a new pane at the top-left of the IDE.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Workflow: scripts and projects</span>"
]
},
{
"objectID": "workflow-scripts.html#scripts",
"href": "workflow-scripts.html#scripts",
"title": "7 Workflow: scripts and projects",
"section": "",
"text": "7.1.1 Running code\nThe script editor is an excellent place for building complex ggplot2 plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below.\n\nlibrary(dplyr)\nlibrary(nycflights13)\n\nnot_cancelled <- flights |> \n filter(!is.na(dep_delay)█, !is.na(arr_delay))\n\nnot_cancelled |> \n group_by(year, month, day) |> \n summarize(mean = mean(dep_delay))\n\nIf your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates not_cancelled. It will also move the cursor to the following statement (beginning with not_cancelled |>). That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.\nInstead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you’ve captured all the important parts of your code in the script.\nWe recommend you always start your script with the packages you need. That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include install.packages() in a script you share. It’s inconsiderate to hand off a script that will change something on their computer if they’re not being careful!\nWhen working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won’t even think about it.\n\n\n7.1.2 RStudio diagnostics\nIn the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:\n\n\n\n\n\n\n\n\n\nHover over the cross to see what the problem is:\n\n\n\n\n\n\n\n\n\nRStudio will also let you know about potential problems:\n\n\n\n\n\n\n\n\n\n\n\n7.1.3 Saving and naming\nRStudio automatically saves the contents of the script editor when you quit, and automatically reloads it when you re-open. Nevertheless, it’s a good idea to avoid Untitled1, Untitled2, Untitled3, and so on and instead save your scripts and to give them informative names.\nIt might be tempting to name your files code.R or myscript.R, but you should think a bit harder before choosing a name for your file. Three important principles for file naming are as follows:\n\nFile names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.\nFile names should be human readable: use file names to describe what’s in the file.\nFile names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.\n\nFor example, suppose you have the following files in a project folder.\nalternative model.R\ncode for exploratory analysis.r\nfinalreport.qmd\nFinalReport.qmd\nfig 1.png\nFigure_02.png\nmodel_first_try.R\nrun-first.r\ntemp.txt\nThere are a variety of problems here: it’s hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (finalreport vs. 
\nFor example, suppose you have the following files in a project folder.\nalternative model.R\ncode for exploratory analysis.r\nfinalreport.qmd\nFinalReport.qmd\nfig 1.png\nFigure_02.png\nmodel_first_try.R\nrun-first.r\ntemp.txt\nThere are a variety of problems here: it’s hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (finalreport vs. FinalReport), and some names don’t describe their contents (run-first and temp).\nHere’s a better way of naming and organizing the same set of files:\n01-load-data.R\n02-exploratory-analysis.R\n03-model-approach-1.R\n04-model-approach-2.R\nfig-01.png\nfig-02.png\nreport-2022-03-20.qmd\nreport-2022-04-02.qmd\nreport-draft-notes.txt\nNumbering the key scripts makes it obvious in which order to run them, and a consistent naming scheme makes it easier to see what varies. Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and temp is renamed to report-draft-notes to better describe its contents. If you have a lot of files in a directory, taking organization one step further and placing different types of files (scripts, figures, etc.) in different directories is recommended.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Workflow: scripts and projects</span>"
]
},
{
"objectID": "workflow-scripts.html#projects",
"href": "workflow-scripts.html#projects",
"title": "7 Workflow: scripts and projects",
"section": "7.2 Projects",
"text": "7.2 Projects\nOne day, you will need to quit R, go do something else, and return to your analysis later. One day, you will be working on multiple analyses simultaneously and you want to keep them separate. One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.\nTo handle these real life situations, you need to make two decisions:\n\nWhat is the source of truth? What will you save as your lasting record of what happened?\nWhere does your analysis live?\n\n\n7.2.1 What is the source of truth?\nAs a beginner, it’s okay to rely on your current Environment to contain all the objects you have created throughout your analysis. However, to make it easier to work on larger projects or collaborate with others, your source of truth should be the R scripts. With your R scripts (and your data files), you can recreate the environment. With only your environment, it’s much harder to recreate your R scripts: you’ll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you’ll have to carefully mine your R history.\nTo help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running usethis::use_blank_slate()2 or by mimicking the options shown in Figure 7.2. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time nor will the objects you created or the datasets you read be available to use. But this short-term pain saves you long-term agony because it forces you to capture all important procedures in your code. There’s nothing worse than discovering three months after the fact that you’ve only stored the results of an important calculation in your environment, not the calculation itself in your code.\n\n\n\n\n\n\n\n\nFigure 7.2: Copy these options in your RStudio options to always start your RStudio session with a clean slate.\n\n\n\n\n\nThere is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:\n\nPress Cmd/Ctrl + Shift + 0/F10 to restart R.\nPress Cmd/Ctrl + Shift + S to re-run the current script.\n\nWe collectively use this pattern hundreds of times a week.\nAlternatively, if you don’t use keyboard shortcuts, you can go to Session > Restart R and then highlight and re-run your current script.\n\n\n\n\n\n\nRStudio server\n\n\n\nIf you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a clean slate.\n\n\n\n\n7.2.2 Where does your analysis live?\nR has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console:\n\n\n\n\n\n\n\n\n\nAnd you can print this out in R code by running getwd():\n\ngetwd()\n#> [1] \"/Users/hadley/Documents/r4ds\"\n\nIn this R session, the current working directory (think of it as “home”) is in hadley’s Documents folder, in a subfolder called r4ds. 
Running getwd() on your computer will return a different result, because your computer has a different directory structure than Hadley’s!\nAs a beginning R user, it’s OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. But you’re more than a handful of chapters into this book, and you’re no longer a beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, setting R’s working directory to the associated directory.\nYou can set the working directory from within R, but we do not recommend it:\n\nsetwd(\"/path/to/my/CoolProject\")\n\nThere’s a better way; a way that also puts you on the path to managing your R work like an expert. That way is the RStudio project.\n\n\n7.2.3 RStudio projects\nKeeping all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory is such a wise and common practice that RStudio has built-in support for this via projects. Let’s make a project for you to use while you’re working through the rest of this book. Click File > New Project, then follow the steps shown in Figure 7.3.\n\n\n\n\n\n\n\n\nFigure 7.3: To create a new project: (top) first click New Directory, then (middle) click New Project, then (bottom) fill in the directory (project) name, choose a good subdirectory for its home and click Create Project.\n\n\n\n\n\nCall your project r4ds and think carefully about which subdirectory you put the project in. If you don’t store it somewhere sensible, it will be hard to find it in the future!\nOnce this process is complete, you’ll get a new RStudio project just for this book. Check that the “home” of your project is the current working directory:\n\ngetwd()\n#> [1] \"/Users/hadley/Documents/r4ds\"\n\nNow enter the following commands in the script editor, and save the file, calling it “diamonds.R”. Then, create a new folder called “data”. You can do this by clicking on the “New Folder” button in the Files pane in RStudio. Finally, run the complete script, which will save a PNG and CSV file into your project directory. Don’t worry about the details; you’ll learn them later in the book.\n\nlibrary(tidyverse)\n\nggplot(diamonds, aes(x = carat, y = price)) + \n geom_hex()\nggsave(\"diamonds.png\")\n\nwrite_csv(diamonds, \"data/diamonds.csv\")\n\nQuit RStudio. Inspect the folder associated with your project — notice the .Rproj file. Double-click that file to re-open the project. Notice you get back to where you left off: it’s the same working directory and command history, and all the files you were working on are still open. Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you’re starting with a clean slate.\nIn your favorite OS-specific way, search your computer for diamonds.png and you will find the PNG (no surprise) but also the script that created it (diamonds.R). This is a huge win! One day, you will want to remake a figure or just understand where it came from. If you rigorously save figures to files with R code and never with the mouse or the clipboard, you will be able to reproduce old work with ease!\n\n\n7.2.4 Relative and absolute paths\nOnce you’re inside a project, you should only ever use relative paths, not absolute paths. What’s the difference? A relative path is relative to the working directory, i.e., the project’s home. 
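As a quick sketch of the difference (using the diamonds.csv file created above; read_csv() stands in for any function that touches a file):\n\n# Relative path: resolved against the project home, works on any machine\nread_csv(\"data/diamonds.csv\")\n\n# Absolute path: pinned to one specific computer\nread_csv(\"/Users/hadley/Documents/r4ds/data/diamonds.csv\")\n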
When Hadley wrote data/diamonds.csv above, it was a shortcut for /Users/hadley/Documents/r4ds/data/diamonds.csv. But importantly, if Mine ran this code on her computer, it would point to /Users/Mine/Documents/r4ds/data/diamonds.csv. This is why relative paths are important: they’ll work regardless of where the R project folder ends up.\nAbsolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g., C:) or two backslashes (e.g., \\\\servername) and on Mac/Linux they start with a slash “/” (e.g., /users/hadley). You should never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.\nThere’s another important difference between operating systems: how you separate the components of the path. Mac and Linux use slashes (e.g., data/diamonds.csv) and Windows uses backslashes (e.g., data\\diamonds.csv). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Workflow: scripts and projects</span>"
]
},
{
"objectID": "workflow-scripts.html#exercises",
"href": "workflow-scripts.html#exercises",
"title": "7 Workflow: scripts and projects",
"section": "7.3 Exercises",
"text": "7.3 Exercises\n\nGo to the RStudio Tips Twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it!\nWhat other common mistakes will RStudio diagnostics report? Read https://support.posit.co/hc/en-us/articles/205753617-Code-Diagnostics to find out.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Workflow: scripts and projects</span>"
]
},
{
"objectID": "workflow-scripts.html#summary",
"href": "workflow-scripts.html#summary",
"title": "7 Workflow: scripts and projects",
"section": "7.4 Summary",
"text": "7.4 Summary\nIn this chapter, you’ve learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you’ll learn to appreciate how a little up front organisation can save you a bunch of time down the road.\nIn summary, scripts and projects give you a solid workflow that will serve you well in the future:\n\nCreate one RStudio project for each data analysis project.\nSave your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you’ve captured everything in your scripts.\nOnly ever use relative paths, not absolute paths.\n\nThen everything you need is in one place and cleanly separated from all the other projects that you are working on.\nSo far, we’ve worked with datasets bundled inside of R packages. This makes it easier to get some practice on pre-prepared data, but obviously your data won’t be available in this way. So in the next chapter, you’re going to learn how load data from disk into your R session using the readr package.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Workflow: scripts and projects</span>"
]
},
{
"objectID": "workflow-scripts.html#footnotes",
"href": "workflow-scripts.html#footnotes",
"title": "7 Workflow: scripts and projects",
"section": "",
"text": "Not to mention that you’re tempting fate by using “final” in the name 😆 The comic Piled Higher and Deeper has a fun strip on this.↩︎\nIf you don’t have usethis installed, you can install it with install.packages(\"usethis\").↩︎",
"crumbs": [
"Whole game",
"<span class='chapter-number'>7</span> <span class='chapter-title'>Workflow: scripts and projects</span>"
]
},
{
"objectID": "data-import.html",
"href": "data-import.html",
"title": "8 Data import",
"section": "",
"text": "8.1 Introduction\nWorking with data provided by R packages is a great way to learn data science tools, but you want to apply what you’ve learned to your own data at some point. In this chapter, you’ll learn the basics of reading data files into R.\nSpecifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "data-import.html#introduction",
"href": "data-import.html#introduction",
"title": "8 Data import",
"section": "",
"text": "8.1.1 Prerequisites\nIn this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.\n\nlibrary(tidyverse)",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "data-import.html#reading-data-from-a-file",
"href": "data-import.html#reading-data-from-a-file",
"title": "8 Data import",
"section": "8.2 Reading data from a file",
"text": "8.2 Reading data from a file\nTo begin, we’ll focus on the most common rectangular data file type: CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows provide the data. The columns are separated, aka delimited, by commas.\n\nStudent ID,Full Name,favourite.food,mealPlan,AGE\n1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4\n2,Barclay Lynn,French fries,Lunch only,5\n3,Jayendra Lyne,N/A,Breakfast and lunch,7\n4,Leon Rossini,Anchovies,Lunch only,\n5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five\n6,Güvenç Attila,Ice cream,Lunch only,6\n\nTable 8.1 shows a representation of the same data as a table.\n\n\n\n\nTable 8.1: Data from the students.csv file as a table.\n\n\n\n\n\n\n\n\n\n\n\n\n\nStudent ID\nFull Name\nfavourite.food\nmealPlan\nAGE\n\n\n\n\n1\nSunil Huffmann\nStrawberry yoghurt\nLunch only\n4\n\n\n2\nBarclay Lynn\nFrench fries\nLunch only\n5\n\n\n3\nJayendra Lyne\nN/A\nBreakfast and lunch\n7\n\n\n4\nLeon Rossini\nAnchovies\nLunch only\nNA\n\n\n5\nChidiegwu Dunkel\nPizza\nBreakfast and lunch\nfive\n\n\n6\nGüvenç Attila\nIce cream\nLunch only\n6\n\n\n\n\n\n\n\n\nWe can read this file into R using read_csv(). The first argument is the most important: the path to the file. You can think about the path as the address of the file: the file is called students.csv and that it lives in the data folder.\n\nstudents <- read_csv(\"data/students.csv\")\n#> Rows: 6 Columns: 5\n#> ── Column specification ─────────────────────────────────────────────────────\n#> Delimiter: \",\"\n#> chr (4): Full Name, favourite.food, mealPlan, AGE\n#> dbl (1): Student ID\n#> \n#> ℹ Use `spec()` to retrieve the full column specification for this data.\n#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n\nThe code above will work if you have the students.csv file in a data folder in your project. You can download the students.csv file from https://pos.it/r4ds-students-csv or you can read it directly from that URL with:\n\nstudents <- read_csv(\"https://pos.it/r4ds-students-csv\")\n\nWhen you run read_csv(), it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message. This message is an integral part of readr, and we’ll return to it in Section 8.3.\n\n8.2.1 Practical advice\nOnce you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the students data with that in mind.\n\nstudents\n#> # A tibble: 6 × 5\n#> `Student ID` `Full Name` favourite.food mealPlan AGE \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne N/A Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nIn the favourite.food column, there are a bunch of food items, and then the character string N/A, which should have been a real NA that R will recognize as “not available”. This is something we can address using the na argument. 
By default, read_csv() only recognizes empty strings (\"\") in this dataset as NAs, and we want it to also recognize the character string \"N/A\".\n\nstudents <- read_csv(\"data/students.csv\", na = c(\"N/A\", \"\"))\n\nstudents\n#> # A tibble: 6 × 5\n#> `Student ID` `Full Name` favourite.food mealPlan AGE \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nYou might also notice that the Student ID and Full Name columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names; they’re non-syntactic names. To refer to these variables, you need to surround them with backticks, `:\n\nstudents |> \n rename(\n student_id = `Student ID`,\n full_name = `Full Name`\n )\n#> # A tibble: 6 × 5\n#> student_id full_name favourite.food mealPlan AGE \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nAn alternative approach is to use janitor::clean_names(), which uses some heuristics to turn them all into snake case at once.\n\nstudents |> janitor::clean_names()\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age \n#> <dbl> <chr> <chr> <chr> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nAnother common task after reading in data is to consider variable types. For example, meal_plan is a categorical variable with a known set of possible values, which in R should be represented as a factor:\n\nstudents |>\n janitor::clean_names() |>\n mutate(meal_plan = factor(meal_plan))\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age \n#> <dbl> <chr> <chr> <fct> <chr>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4 \n#> 2 2 Barclay Lynn French fries Lunch only 5 \n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7 \n#> 4 4 Leon Rossini Anchovies Lunch only <NA> \n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five \n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nNote that the values in the meal_plan variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<chr>) to factor (<fct>). You’ll learn more about factors in Chapter 17.\nBefore you analyze these data, you’ll probably want to fix the age column. Currently, age is a character variable because one of the observations is typed out as five instead of a numeric 5. 
We discuss the details of fixing this issue in Chapter 21.\n\nstudents <- students |>\n janitor::clean_names() |>\n mutate(\n meal_plan = factor(meal_plan),\n age = parse_number(if_else(age == \"five\", \"5\", age))\n )\n\nstudents\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <fct> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nA new function here is if_else(), which has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is FALSE. Here we’re saying if age is the character string \"five\", make it \"5\", and if not, leave it as age. You will learn more about if_else() and logical vectors in Chapter 13.\n\n\n8.2.2 Other arguments\nThere are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: read_csv() can read text strings that you’ve created and formatted like a CSV file:\n\nread_csv(\n \"a,b,c\n 1,2,3\n 4,5,6\"\n)\n#> # A tibble: 2 × 3\n#> a b c\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n#> 2 4 5 6\n\nUsually, read_csv() uses the first line of the data for the column names, which is a very common convention. But it’s not uncommon for a few lines of metadata to be included at the top of the file. You can use skip = n to skip the first n lines or use comment = \"#\" to drop all lines that start with (e.g.) #:\n\nread_csv(\n \"The first line of metadata\n The second line of metadata\n x,y,z\n 1,2,3\",\n skip = 2\n)\n#> # A tibble: 1 × 3\n#> x y z\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n\nread_csv(\n \"# A comment I want to skip\n x,y,z\n 1,2,3\",\n comment = \"#\"\n)\n#> # A tibble: 1 × 3\n#> x y z\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n\nIn other cases, the data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings and instead label them sequentially from X1 to Xn:\n\nread_csv(\n \"1,2,3\n 4,5,6\",\n col_names = FALSE\n)\n#> # A tibble: 2 × 3\n#> X1 X2 X3\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n#> 2 4 5 6\n\nAlternatively, you can pass col_names a character vector which will be used as the column names:\n\nread_csv(\n \"1,2,3\n 4,5,6\",\n col_names = c(\"x\", \"y\", \"z\")\n)\n#> # A tibble: 2 × 3\n#> x y z\n#> <dbl> <dbl> <dbl>\n#> 1 1 2 3\n#> 2 4 5 6\n\nThese arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your .csv file and read the documentation for read_csv()’s many other arguments.)\n\n\n8.2.3 Other file types\nOnce you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:\n\nread_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.\nread_tsv() reads tab-delimited files.\nread_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.\nread_fwf() reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().\nread_table() reads a common variation of fixed-width files where columns are separated by white space.\nread_log() reads Apache-style log files.\n
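For example (a sketch with a hypothetical file name), a European-style file that uses ; to separate fields and , as the decimal mark could be read with:\n\nread_csv2(\"data/students-eu.csv\")\n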
\n\n8.2.4 Exercises\n\nWhat function would you use to read a file where fields were separated with “|”?\nApart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?\nWhat are the most important arguments to read_fwf()?\nSometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like \" or '. By default, read_csv() assumes that the quoting character will be \". To read the following text into a data frame, what argument to read_csv() do you need to specify?\n\n\"x,y\\n1,'a,b'\"\n\nIdentify what is wrong with each of the following inline CSV files. What happens when you run the code?\n\nread_csv(\"a,b\\n1,2,3\\n4,5,6\")\nread_csv(\"a,b,c\\n1,2\\n1,2,3,4\")\nread_csv(\"a,b\\n\\\"1\")\nread_csv(\"a,b\\n1,2\\na,b\")\nread_csv(\"a;b\\n1;3\")\n\nPractice referring to non-syntactic names in the following data frame by:\n\nExtracting the variable called 1.\nPlotting a scatterplot of 1 vs. 2.\nCreating a new column called 3, which is 2 divided by 1.\nRenaming the columns to one, two, and three.\n\n\nannoying <- tibble(\n `1` = 1:10,\n `2` = `1` * 2 + rnorm(length(`1`))\n)",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "data-import.html#sec-col-types",
"href": "data-import.html#sec-col-types",
"title": "8 Data import",
"section": "8.3 Controlling column types",
"text": "8.3 Controlling column types\nA CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.\n\n8.3.1 Guessing types\nreadr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,0002 rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:\n\nDoes it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical.\nDoes it contain only numbers (e.g., 1, -4.5, 5e6, Inf)? If so, it’s a number.\nDoes it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in Section 18.2).\nOtherwise, it must be a string.\n\nYou can see that behavior in action in this simple example:\n\nread_csv(\"\n logical,numeric,date,string\n TRUE,1,2021-01-15,abc\n false,4.5,2021-02-15,def\n T,Inf,2021-02-16,ghi\n\")\n#> # A tibble: 3 × 4\n#> logical numeric date string\n#> <lgl> <dbl> <date> <chr> \n#> 1 TRUE 1 2021-01-15 abc \n#> 2 FALSE 4.5 2021-02-15 def \n#> 3 TRUE Inf 2021-02-16 ghi\n\nThis heuristic works well if you have a clean dataset, but in real life, you’ll encounter a selection of weird and beautiful failures.\n\n\n8.3.2 Missing values, column types, and problems\nThe most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the NA that readr expects.\nTake this simple 1 column CSV file as an example:\n\nsimple_csv <- \"\n x\n 10\n .\n 20\n 30\"\n\nIf we read it without any additional arguments, x becomes a character column:\n\nread_csv(simple_csv)\n#> # A tibble: 4 × 1\n#> x \n#> <chr>\n#> 1 10 \n#> 2 . \n#> 3 20 \n#> 4 30\n\nIn this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s sprinkled among them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list where the names match the column names in the CSV file:\n\ndf <- read_csv(\n simple_csv, \n col_types = list(x = col_double())\n)\n#> Warning: One or more parsing issues, call `problems()` on your data frame for\n#> details, e.g.:\n#> dat <- vroom(...)\n#> problems(dat)\n\nNow read_csv() reports that there was a problem, and tells us we can find out more with problems():\n\nproblems(df)\n#> # A tibble: 1 × 5\n#> row col expected actual file \n#> <int> <int> <chr> <chr> <chr> \n#> 1 3 1 a double . /private/var/folders/9f/nn2jnl8n1lj1sk3y391hyd…\n\nThis tells us that there was a problem in row 3, col 1 where readr expected a double but got a .. That suggests this dataset uses . for missing values. 
So then we set na = \".\", the automatic guessing succeeds, giving us the numeric column that we want:\n\nread_csv(simple_csv, na = \".\")\n#> # A tibble: 4 × 1\n#> x\n#> <dbl>\n#> 1 10\n#> 2 NA\n#> 3 20\n#> 4 30\n\n\n\n8.3.3 Column types\nreadr provides a total of nine column types for you to use:\n\ncol_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.\ncol_integer() reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.\ncol_character() reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., long series of digits that identifies an object but doesn’t make sense to apply mathematical operations to. Examples include phone numbers, social security numbers, credit card numbers, etc.\ncol_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in Chapter 17 and Chapter 18.\ncol_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in Chapter 14.\ncol_skip() skips a column so it’s not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.\n\nIt’s also possible to override the default column by switching from list() to cols() and specifying .default:\n\nanother_csv <- \"\nx,y,z\n1,2,3\"\n\nread_csv(\n another_csv, \n col_types = cols(.default = col_character())\n)\n#> # A tibble: 1 × 3\n#> x y z \n#> <chr> <chr> <chr>\n#> 1 1 2 3\n\nAnother useful helper is cols_only() which will read in only the columns you specify:\n\nread_csv(\n another_csv,\n col_types = cols_only(x = col_character())\n)\n#> # A tibble: 1 × 1\n#> x \n#> <chr>\n#> 1 1",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "data-import.html#sec-readr-directory",
"href": "data-import.html#sec-readr-directory",
"title": "8 Data import",
"section": "8.4 Reading data from multiple files",
"text": "8.4 Reading data from multiple files\nSometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.\n\nsales_files <- c(\"data/01-sales.csv\", \"data/02-sales.csv\", \"data/03-sales.csv\")\nread_csv(sales_files, id = \"file\")\n#> # A tibble: 19 × 6\n#> file month year brand item n\n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 data/01-sales.csv January 2019 1 1234 3\n#> 2 data/01-sales.csv January 2019 1 8721 9\n#> 3 data/01-sales.csv January 2019 1 1822 2\n#> 4 data/01-sales.csv January 2019 2 3333 1\n#> 5 data/01-sales.csv January 2019 2 2156 9\n#> 6 data/01-sales.csv January 2019 2 3987 6\n#> # ℹ 13 more rows\n\nOnce again, the code above will work if you have the CSV files in a data folder in your project. You can download these files from https://pos.it/r4ds-01-sales, https://pos.it/r4ds-02-sales, and https://pos.it/r4ds-03-sales or you can read them directly with:\n\nsales_files <- c(\n \"https://pos.it/r4ds-01-sales\",\n \"https://pos.it/r4ds-02-sales\",\n \"https://pos.it/r4ds-03-sales\"\n)\nread_csv(sales_files, id = \"file\")\n\nThe id argument adds a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.\nIf you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in Chapter 16.\n\nsales_files <- list.files(\"data\", pattern = \"sales\\\\.csv$\", full.names = TRUE)\nsales_files\n#> [1] \"data/01-sales.csv\" \"data/02-sales.csv\" \"data/03-sales.csv\"",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "data-import.html#sec-writing-to-a-file",
"href": "data-import.html#sec-writing-to-a-file",
"title": "8 Data import",
"section": "8.5 Writing to a file",
"text": "8.5 Writing to a file\nreadr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). The most important arguments to these functions are x (the data frame to save) and file (the location to save it). You can also specify how missing values are written with na, and if you want to append to an existing file.\n\nwrite_csv(students, \"students.csv\")\n\nNow let’s read that csv file back in. Note that the variable type information that you just set up is lost when you save to CSV because you’re starting over with reading from a plain text file again:\n\nstudents\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <fct> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\nwrite_csv(students, \"students-2.csv\")\nread_csv(\"students-2.csv\")\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <chr> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nThis makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two main alternatives:\n\nwrite_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS. This means that when you reload the object, you are loading the exact same R object that you stored.\n\nwrite_rds(students, \"students.rds\")\nread_rds(\"students.rds\")\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <fct> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne <NA> Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\nThe arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in Chapter 23.\n\nlibrary(arrow)\nwrite_parquet(students, \"students.parquet\")\nread_parquet(\"students.parquet\")\n#> # A tibble: 6 × 5\n#> student_id full_name favourite_food meal_plan age\n#> <dbl> <chr> <chr> <fct> <dbl>\n#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4\n#> 2 2 Barclay Lynn French fries Lunch only 5\n#> 3 3 Jayendra Lyne NA Breakfast and lunch 7\n#> 4 4 Leon Rossini Anchovies Lunch only NA\n#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5\n#> 6 6 Güvenç Attila Ice cream Lunch only 6\n\n\nParquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "data-import.html#data-entry",
"href": "data-import.html#data-entry",
"title": "8 Data import",
"section": "8.6 Data entry",
"text": "8.6 Data entry\nSometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script. There are two useful functions to help you do this which differ in whether you layout the tibble by columns or by rows. tibble() works by column:\n\ntibble(\n x = c(1, 2, 5), \n y = c(\"h\", \"m\", \"g\"),\n z = c(0.08, 0.83, 0.60)\n)\n#> # A tibble: 3 × 3\n#> x y z\n#> <dbl> <chr> <dbl>\n#> 1 1 h 0.08\n#> 2 2 m 0.83\n#> 3 5 g 0.6\n\nLaying out the data by column can make it hard to see how the rows are related, so an alternative is tribble(), short for transposed tibble, which lets you lay out your data row by row. tribble() is customized for data entry in code: column headings start with ~ and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy to read form:\n\ntribble(\n ~x, ~y, ~z,\n 1, \"h\", 0.08,\n 2, \"m\", 0.83,\n 5, \"g\", 0.60\n)\n#> # A tibble: 3 × 3\n#> x y z\n#> <dbl> <chr> <dbl>\n#> 1 1 h 0.08\n#> 2 2 m 0.83\n#> 3 5 g 0.6",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "data-import.html#summary",
"href": "data-import.html#summary",
"title": "8 Data import",
"section": "8.7 Summary",
"text": "8.7 Summary\nIn this chapter, you’ve learned how to load CSV files with read_csv() and to do your own data entry with tibble() and tribble(). You’ve learned how csv files work, some of the problems you might encounter, and how to overcome them. We’ll come to data import a few times in this book: Chapter 21 from Excel and Google Sheets, Chapter 22 will show you how to load data from databases, Chapter 23 from parquet files, Chapter 24 from JSON, and Chapter 25 from websites.\nWe’re just about at the end of this section of the book, but there’s one important last topic to cover: how to get help. So in the next chapter, you’ll learn some good places to look for help, how to create a reprex to maximize your chances of getting good help, and some general advice on keeping up with the world of R.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "data-import.html#footnotes",
"href": "data-import.html#footnotes",
"title": "8 Data import",
"section": "",
"text": "The janitor package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use |>.↩︎\nYou can override the default of 1000 with the guess_max argument.↩︎",
"crumbs": [
"Whole game",
"<span class='chapter-number'>8</span> <span class='chapter-title'>Data import</span>"
]
},
{
"objectID": "workflow-help.html",
"href": "workflow-help.html",
"title": "9 Workflow: getting help",
"section": "",
"text": "9.1 Google is your friend\nThis book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help and to help you keep learning.\nIf you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Additionally, adding package names like “tidyverse” or “ggplot2” will help narrow down the results to code that will feel more familiar to you as well, e.g., “how to make a boxplot in R” vs. “how to make a boxplot in R with ggplot2”. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn’t in English, run Sys.setenv(LANGUAGE = \"en\") and re-run the code; you’re more likely to find help for English error messages.)\nIf Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer, including [R], to restrict your search to questions and answers that use R.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>9</span> <span class='chapter-title'>Workflow: getting help</span>"
]
},
{
"objectID": "workflow-help.html#making-a-reprex",
"href": "workflow-help.html#making-a-reprex",
"title": "9 Workflow: getting help",
"section": "9.2 Making a reprex",
"text": "9.2 Making a reprex\nIf your googling doesn’t find anything useful, it’s a really good idea to prepare a reprex, short for minimal reproducible example. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:\n\nFirst, you need to make your code reproducible. This means that you need to capture everything, i.e. include any library() calls and create all necessary objects. The easiest way to make sure you’ve done this is using the reprex package.\nSecond, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you’re facing in real life or even using built-in data.\n\nThat sounds like a lot of work! And it can be, but it has a great payoff:\n\n80% of the time, creating an excellent reprex reveals the source of your problem. It’s amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.\nThe other 20% of the time, you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help!\n\nWhen creating a reprex by hand, it’s easy to accidentally miss something, meaning your code can’t be run on someone else’s computer. Avoid this problem by using the reprex package, which is installed as part of the tidyverse. Let’s say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it):\n\ny <- 1:4\nmean(y)\n\nThen call reprex(), where the default output is formatted for GitHub:\nreprex::reprex()\nA nicely rendered HTML preview will display in RStudio’s Viewer (if you’re in RStudio) or your default browser otherwise. The reprex is automatically copied to your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):\n``` r\ny <- 1:4\nmean(y)\n#> [1] 2.5\n```\nThis text is formatted in a special way, called Markdown, which can be pasted to sites like StackOverflow or Github and they will automatically render it to look like code. Here’s what that Markdown would look like rendered on GitHub:\n\ny <- 1:4\nmean(y)\n#> [1] 2.5\n\nAnyone else can copy, paste, and run this immediately.\nThere are three things you need to include to make your example reproducible: required packages, data, and code.\n\nPackages should be loaded at the top of the script so it’s easy to see which ones the example needs. This is a good time to check that you’re using the latest version of each package; you may have discovered a bug that’s been fixed since you installed or last updated the package. For packages in the tidyverse, the easiest way to check is to run tidyverse_update().\nThe easiest way to include data is to use dput() to generate the R code needed to recreate it. 
\nFinish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script.\nCreating reprexes is not trivial, and it will take some practice to learn to create good, truly minimal reprexes. However, learning to ask questions that include the code, and investing the time to make it reproducible will continue to pay off as you learn and master R.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>9</span> <span class='chapter-title'>Workflow: getting help</span>"
]
},
{
"objectID": "workflow-help.html#investing-in-yourself",
"href": "workflow-help.html#investing-in-yourself",
"title": "9 Workflow: getting help",
"section": "9.3 Investing in yourself",
"text": "9.3 Investing in yourself\nYou should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what the tidyverse team is doing on the tidyverse blog. To keep up with the R community more broadly, we recommend reading R Weekly: it’s a community effort to aggregate the most interesting news in the R community each week.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>9</span> <span class='chapter-title'>Workflow: getting help</span>"
]
},
{
"objectID": "workflow-help.html#summary",
"href": "workflow-help.html#summary",
"title": "9 Workflow: getting help",
"section": "9.4 Summary",
"text": "9.4 Summary\nThis chapter concludes the Whole Game part of the book. You’ve now seen the most important parts of the data science process: visualization, transformation, tidying and importing. Now you’ve got a holistic view of the whole process, and we start to get into the details of small pieces.\nThe next part of the book, Visualize, does a deeper dive into the grammar of graphics and creating data visualizations with ggplot2, showcases how to use the tools you’ve learned so far to conduct exploratory data analysis, and introduces good practices for creating plots for communication.",
"crumbs": [
"Whole game",
"<span class='chapter-number'>9</span> <span class='chapter-title'>Workflow: getting help</span>"
]
},
{
"objectID": "visualize.html",
"href": "visualize.html",
"title": "Visualize",
"section": "",
"text": "After reading the first part of the book, you understand (at least superficially) the most important tools for doing data science. Now it’s time to start diving into the details. In this part of the book, you’ll learn about visualizing data in further depth.\n\n\n\n\n\n\n\n\nFigure 1: Data visualization is often the first step in data exploration.\n\n\n\n\n\nEach chapter addresses one to a few aspects of creating a data visualization.\n\nIn 10 Layers you will learn about the layered grammar of graphics.\nIn 11 Exploratory data analysis, you’ll combine visualization with your curiosity and skepticism to ask and answer interesting questions about data.\nFinally, in 12 Communication you will learn how to take your exploratory graphics, elevate them, and turn them into expository graphics, graphics that help the newcomer to your analysis understand what’s going on as quickly and easily as possible.\n\nThese three chapters get you started in the world of visualization, but there is much more to learn. The absolute best place to learn more is the ggplot2 book: ggplot2: Elegant graphics for data analysis. It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems. Another great resource is the ggplot2 extensions gallery https://exts.ggplot2.tidyverse.org/gallery/. This site lists many of the packages that extend ggplot2 with new geoms and scales. It’s a great place to start if you’re trying to do something that seems hard with ggplot2.",
"crumbs": [
"Visualize"
]
},
{
"objectID": "layers.html",
"href": "layers.html",
"title": "10 Layers",
"section": "",
"text": "10.1 Introduction\nIn Chapter 2, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make any type of plot with ggplot2.\nIn this chapter, you’ll expand on that foundation as you learn about the layered grammar of graphics. We’ll start with a deeper dive into aesthetic mappings, geometric objects, and facets. Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we’ll briefly introduce coordinate systems.\nWe will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#introduction",
"href": "layers.html#introduction",
"title": "10 Layers",
"section": "",
"text": "10.1.1 Prerequisites\nThis chapter focuses on ggplot2. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:\n\nlibrary(tidyverse)",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#aesthetic-mappings",
"href": "layers.html#aesthetic-mappings",
"title": "10 Layers",
"section": "10.2 Aesthetic mappings",
"text": "10.2 Aesthetic mappings\n\n“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey\n\nRemember that the mpg data frame bundled with the ggplot2 package contains 234 observations on 38 car models.\n\nmpg\n#> # A tibble: 234 × 11\n#> manufacturer model displ year cyl trans drv cty hwy fl \n#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>\n#> 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p \n#> 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p \n#> 3 audi a4 2 2008 4 manual(m6) f 20 31 p \n#> 4 audi a4 2 2008 4 auto(av) f 21 30 p \n#> 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p \n#> 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p \n#> # ℹ 228 more rows\n#> # ℹ 1 more variable: class <chr>\n\nAmong the variables in mpg are:\n\ndispl: A car’s engine size, in liters. A numerical variable.\nhwy: A car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. A numerical variable.\nclass: Type of car. A categorical variable.\n\nLet’s start by visualizing the relationship between displ and hwy for various classes of cars. We can do this with a scatterplot where the numerical variables are mapped to the x and y aesthetics and the categorical variable is mapped to an aesthetic like color or shape.\n# Left\nggplot(mpg, aes(x = displ, y = hwy, color = class)) +\n geom_point()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, shape = class)) +\n geom_point()\n#> Warning: The shape palette can deal with a maximum of 6 discrete values because more\n#> than 6 becomes difficult to discriminate\n#> ℹ you have requested 7 values. Consider specifying shapes manually if you\n#> need that many have them.\n#> Warning: Removed 62 rows containing missing values or values outside the scale range\n#> (`geom_point()`).\n\n\n\n\n\n\n\n\n\n\nWhen class is mapped to shape, we get two warnings:\n\n1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.\n2: Removed 62 rows containing missing values (geom_point()).\n\nSince ggplot2 will only use six shapes at a time, by default, additional groups will go unplotted when you use the shape aesthetic. The second warning is related – there are 62 SUVs in the dataset and they’re not plotted.\nSimilarly, we can map class to size or alpha aesthetics as well, which control the size and the transparency of the points, respectively.\n# Left\nggplot(mpg, aes(x = displ, y = hwy, size = class)) +\n geom_point()\n#> Warning: Using size for a discrete variable is not advised.\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +\n geom_point()\n#> Warning: Using alpha for a discrete variable is not advised.\n\n\n\n\n\n\n\n\n\n\nBoth of these produce warnings as well:\n\nUsing alpha for a discrete variable is not advised.\n\nMapping an unordered discrete (categorical) variable (class) to an ordered aesthetic (size or alpha) is generally not a good idea because it implies a ranking that does not in fact exist.\nOnce you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. 
The axis line provides the same information as a legend; it explains the mapping between locations and values.\nYou can also set the visual properties of your geom manually as an argument of your geom function (outside of aes()) instead of relying on a variable mapping to determine the appearance. For example, we can make all of the points in our plot blue:\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point(color = \"blue\")\n\n\n\n\n\n\n\n\nHere, the color doesn’t convey information about a variable, but only changes the appearance of the plot. You’ll need to pick a value that makes sense for that aesthetic:\n\nThe name of a color as a character string, e.g., color = \"blue\"\nThe size of a point in mm, e.g., size = 1\nThe shape of a point as a number, e.g., shape = 1, as shown in Figure 10.1.\n\n\n\n\n\n\n\n\n\nFigure 10.1: R has 26 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–20) are filled with color; the filled shapes (21–25) have a border of color and are filled with fill. Shapes are arranged to keep similar shapes next to each other.\n\n\n\n\n\nSo far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at https://ggplot2.tidyverse.org/articles/ggplot2-specs.html.\nThe specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.\n\n10.2.1 Exercises\n\nCreate a scatterplot of hwy vs. displ where the points are pink, filled-in triangles.\nWhy did the following code not result in a plot with blue points?\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy, color = \"blue\"))\n\nWhat does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)\nWhat happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? Note, you’ll also need to specify x and y.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#sec-geometric-objects",
"href": "layers.html#sec-geometric-objects",
"title": "10 Layers",
"section": "10.3 Geometric objects",
"text": "10.3 Geometric objects\nHow are these two plots similar?\n\n\n\n\n\n\n\n\n\n\n\n\n\nBoth plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different geometric object, geom, to represent the data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.\nTo change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use the following code:\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_smooth()\n#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\nEvery geom function in ggplot2 takes a mapping argument, either defined locally in the geom layer or globally in the ggplot() layer. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. If you try, ggplot2 will silently ignore that aesthetic mapping. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.\n# Left\nggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + \n geom_smooth()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + \n geom_smooth()\n\n\n\n\n\n\n\n\n\n\nHere, geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drive train. One line describes all of the points that have a 4 value, one line describes all of the points that have an f value, and one line describes all of the points that have an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.\nIf this sounds strange, we can make it clearer by overlaying the lines on top of the raw data and then coloring everything according to drv.\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) + \n geom_point() +\n geom_smooth(aes(linetype = drv))\n\n\n\n\n\n\n\n\nNotice that this plot contains two geoms in the same graph.\nMany geoms, like geom_smooth(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_smooth()\n\n# Middle\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_smooth(aes(group = drv))\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_smooth(aes(color = drv), show.legend = FALSE)\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point(aes(color = class)) + \n geom_smooth()\n\n\n\n\n\n\n\n\nYou can use the same idea to specify different data for each layer. 
Here, we use red points as well as open circles to highlight two-seater cars. The local data argument in geom_point() overrides the global data argument in ggplot() for that layer only.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n geom_point(\n data = mpg |> filter(class == \"2seater\"), \n color = \"red\"\n ) +\n geom_point(\n data = mpg |> filter(class == \"2seater\"), \n shape = \"circle open\", size = 3, color = \"red\"\n )\n\nGeoms are the fundamental building blocks of ggplot2. You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data. For example, the histogram and density plot below reveal that the distribution of highway mileage is bimodal and right skewed, while the boxplot reveals two potential outliers.\n# Left\nggplot(mpg, aes(x = hwy)) +\n geom_histogram(binwidth = 2)\n\n# Middle\nggplot(mpg, aes(x = hwy)) +\n geom_density()\n\n# Right\nggplot(mpg, aes(x = hwy)) +\n geom_boxplot()\n\nggplot2 provides more than 40 geoms, but these don’t cover all possible plots one could make. If you need a different geom, we recommend looking into extension packages first to see if someone else has already implemented it (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). For example, the ggridges package (https://wilkelab.org/ggridges) is useful for making ridgeline plots, which can be useful for visualizing the density of a numerical variable for different levels of a categorical variable. In the following plot not only did we use a new geom (geom_density_ridges()), but we also mapped the same variable to multiple aesthetics (drv to y, fill, and color) as well as set an aesthetic (alpha = 0.5) to make the density curves transparent.\n\nlibrary(ggridges)\n\nggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +\n geom_density_ridges(alpha = 0.5, show.legend = FALSE)\n#> Picking joint bandwidth of 1.28\n\nThe best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: https://ggplot2.tidyverse.org/reference. To learn more about any single geom, use the help (e.g., ?geom_smooth).\n\n10.3.1 Exercises\n\nWhat geom would you use to draw a line chart? A boxplot? A histogram? An area chart?\nEarlier in this chapter we used show.legend without explaining it:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_smooth(aes(color = drv), show.legend = FALSE)\n\nWhat does show.legend = FALSE do here? What happens if you remove it? Why do you think we used it earlier?\nWhat does the se argument to geom_smooth() do?\nRecreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it’s drv.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#facets",
"href": "layers.html#facets",
"title": "10 Layers",
"section": "10.4 Facets",
"text": "10.4 Facets\nIn Chapter 2 you learned about faceting with facet_wrap(), which splits a plot into subplots that each display one subset of the data based on a categorical variable.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n facet_wrap(~cyl)\n\n\n\n\n\n\n\n\nTo facet your plot with the combination of two variables, switch from facet_wrap() to facet_grid(). The first argument of facet_grid() is also a formula, but now it’s a double sided formula: rows ~ cols.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n facet_grid(drv ~ cyl)\n\n\n\n\n\n\n\n\nBy default each of the facets share the same scale and range for x and y axes. This is useful when you want to compare data across facets but it can be limiting when you want to visualize the relationship within each facet better. Setting the scales argument in a faceting function to \"free_x\" will allow for different scales of x-axis across columns, \"free_y\" will allow for different scales on y-axis across rows, and \"free\" will allow both.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point() + \n facet_grid(drv ~ cyl, scales = \"free\")\n\n\n\n\n\n\n\n\n\n10.4.1 Exercises\n\nWhat happens if you facet on a continuous variable?\nWhat do the empty cells in the plot above with facet_grid(drv ~ cyl) mean? Run the following code. How do they relate to the resulting plot?\n\nggplot(mpg) + \n geom_point(aes(x = drv, y = cyl))\n\nWhat plots does the following code make? What does . do?\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy)) +\n facet_grid(drv ~ .)\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy)) +\n facet_grid(. ~ cyl)\n\nTake the first faceted plot in this section:\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy)) + \n facet_wrap(~ cyl, nrow = 2)\n\nWhat are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?\nRead ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?\nWhich of the following plots makes it easier to compare engine size (displ) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?\n\nggplot(mpg, aes(x = displ)) + \n geom_histogram() + \n facet_grid(drv ~ .)\n\nggplot(mpg, aes(x = displ)) + \n geom_histogram() +\n facet_grid(. ~ drv)\n\nRecreate the following plot using facet_wrap() instead of facet_grid(). How do the positions of the facet labels change?\n\nggplot(mpg) + \n geom_point(aes(x = displ, y = hwy)) +\n facet_grid(drv ~ .)",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#statistical-transformations",
"href": "layers.html#statistical-transformations",
"title": "10 Layers",
"section": "10.5 Statistical transformations",
"text": "10.5 Statistical transformations\nConsider a basic bar chart, drawn with geom_bar() or geom_col(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.\n\nggplot(diamonds, aes(x = cut)) + \n geom_bar()\n\n\n\n\n\n\n\n\nOn the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:\n\nBar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.\nSmoothers fit a model to your data and then plot predictions from the model.\nBoxplots compute the five-number summary of the distribution and then display that summary as a specially formatted box.\n\nThe algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. Figure 10.2 shows how this process works with geom_bar().\n\n\n\n\n\n\n\n\nFigure 10.2: When creating a bar chart we first start with the raw data, then aggregate it to count the number of observations in each bar, and finally map those computed variables to plot aesthetics.\n\n\n\n\n\nYou can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(). If you scroll down, the section called “Computed variables” explains that it computes two new variables: count and prop.\nEvery geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:\n\nYou might want to override the default stat. In the code below, we change the stat of geom_bar() from count (the default) to identity. This lets us map the height of the bars to the raw values of a y variable.\n\ndiamonds |>\n count(cut) |>\n ggplot(aes(x = cut, y = n)) +\n geom_bar(stat = \"identity\")\n\n\n\n\n\n\n\n\nYou might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:\n\nggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) + \n geom_bar()\n\n\n\n\n\n\n\n\nTo find the possible variables that can be computed by the stat, look for the section titled “computed variables” in the help for geom_bar().\nYou might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:\n\nggplot(diamonds) + \n stat_summary(\n aes(x = cut, y = depth),\n fun.min = min,\n fun.max = max,\n fun = median\n )\n\n\n\n\n\n\n\n\n\nggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g., ?stat_bin.\n\n10.5.1 Exercises\n\nWhat is the default geom associated with stat_summary()? 
How could you rewrite the previous plot to use that geom function instead of the stat function?\nWhat does geom_col() do? How is it different from geom_bar()?\nMost geoms and stats come in pairs that are almost always used in concert. Make a list of all the pairs. What do they have in common? (Hint: Read through the documentation.)\nWhat variables does stat_smooth() compute? What arguments control its behavior?\nIn our proportion bar chart, we needed to set group = 1. Why? In other words, what is the problem with these two graphs?\n\nggplot(diamonds, aes(x = cut, y = after_stat(prop))) + \n geom_bar()\nggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) + \n geom_bar()",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#position-adjustments",
"href": "layers.html#position-adjustments",
"title": "10 Layers",
"section": "10.6 Position adjustments",
"text": "10.6 Position adjustments\nThere’s one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic, or, more usefully, the fill aesthetic:\n# Left\nggplot(mpg, aes(x = drv, color = drv)) + \n geom_bar()\n\n# Right\nggplot(mpg, aes(x = drv, fill = drv)) + \n geom_bar()\n\n\n\n\n\n\n\n\n\n\nNote what happens if you map the fill aesthetic to another variable, like class: the bars are automatically stacked. Each colored rectangle represents a combination of drv and class.\n\nggplot(mpg, aes(x = drv, fill = class)) + \n geom_bar()\n\n\n\n\n\n\n\n\nThe stacking is performed automatically using the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: \"identity\", \"dodge\" or \"fill\".\n\nposition = \"identity\" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.\n# Left\nggplot(mpg, aes(x = drv, fill = class)) + \n geom_bar(alpha = 1/5, position = \"identity\")\n\n# Right\nggplot(mpg, aes(x = drv, color = class)) + \n geom_bar(fill = NA, position = \"identity\")\n\n\n\n\n\n\n\n\n\n\nThe identity position adjustment is more useful for 2d geoms, like points, where it is the default.\nposition = \"fill\" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.\nposition = \"dodge\" places overlapping objects directly beside one another. This makes it easier to compare individual values.\n# Left\nggplot(mpg, aes(x = drv, fill = class)) + \n geom_bar(position = \"fill\")\n\n# Right\nggplot(mpg, aes(x = drv, fill = class)) + \n geom_bar(position = \"dodge\")\n\n\n\n\n\n\n\n\n\n\n\nThere’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?\n\n\n\n\n\n\n\n\n\nThe underlying values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?\nYou can avoid this gridding by setting the position adjustment to “jitter”. position = \"jitter\" adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n geom_point(position = \"jitter\")\n\n\n\n\n\n\n\n\nAdding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = \"jitter\"): geom_jitter().\nTo learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.\n\n10.6.1 Exercises\n\nWhat is the problem with the following plot? 
How could you improve it?\n\nggplot(mpg, aes(x = cty, y = hwy)) + \n geom_point()\n\nWhat, if anything, is the difference between the two plots? Why?\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point()\nggplot(mpg, aes(x = displ, y = hwy)) +\n geom_point(position = \"identity\")\n\nWhat parameters to geom_jitter() control the amount of jittering?\nCompare and contrast geom_jitter() with geom_count().\nWhat’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#coordinate-systems",
"href": "layers.html#coordinate-systems",
"title": "10 Layers",
"section": "10.7 Coordinate systems",
"text": "10.7 Coordinate systems\nCoordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are two other coordinate systems that are occasionally helpful.\n\ncoord_quickmap() sets the aspect ratio correctly for geographic maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the Maps chapter of ggplot2: Elegant graphics for data analysis.\nnz <- map_data(\"nz\")\n\nggplot(nz, aes(x = long, y = lat, group = group)) +\n geom_polygon(fill = \"white\", color = \"black\")\n\nggplot(nz, aes(x = long, y = lat, group = group)) +\n geom_polygon(fill = \"white\", color = \"black\") +\n coord_quickmap()\n\n\n\n\n\n\n\n\n\n\ncoord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.\nbar <- ggplot(data = diamonds) + \n geom_bar(\n mapping = aes(x = clarity, fill = clarity), \n show.legend = FALSE,\n width = 1\n ) + \n theme(aspect.ratio = 1)\n\nbar + coord_flip()\nbar + coord_polar()\n\n\n\n\n\n\n\n\n\n\n\n\n10.7.1 Exercises\n\nTurn a stacked bar chart into a pie chart using coord_polar().\nWhat’s the difference between coord_quickmap() and coord_map()?\nWhat does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?\n\nggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +\n geom_point() + \n geom_abline() +\n coord_fixed()",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#the-layered-grammar-of-graphics",
"href": "layers.html#the-layered-grammar-of-graphics",
"title": "10 Layers",
"section": "10.8 The layered grammar of graphics",
"text": "10.8 The layered grammar of graphics\nWe can expand on the graphing template you learned in Section 2.3 by adding position adjustments, stats, coordinate systems, and faceting:\nggplot(data = <DATA>) + \n <GEOM_FUNCTION>(\n mapping = aes(<MAPPINGS>),\n stat = <STAT>, \n position = <POSITION>\n ) +\n <COORDINATE_FUNCTION> +\n <FACET_FUNCTION>\nOur new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.\nThe seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, a faceting scheme, and a theme.\nTo see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic. These steps are illustrated in Figure 10.3. You’d then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables.\n\n\n\n\n\n\n\n\nFigure 10.3: Steps for going from raw data to a table of frequencies to a bar plot where the heights of the bar represent the frequencies.\n\n\n\n\n\nAt this point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.\nYou could use this method to build any plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.\nIf you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “The Layered Grammar of Graphics”, the scientific paper that describes the theory of ggplot2 in detail.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "layers.html#summary",
"href": "layers.html#summary",
"title": "10 Layers",
"section": "10.9 Summary",
"text": "10.9 Summary\nIn this chapter you learned about the layered grammar of graphics starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems which allow you to fundamentally change what x and y mean. One layer we have not yet touched on is theme, which we will introduce in Section 12.5.\nTwo very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at https://posit.co/resources/cheatsheets) and the ggplot2 package website (https://ggplot2.tidyverse.org).\nAn important lesson you should take from this chapter is that when you feel the need for a geom that is not provided by ggplot2, it’s always a good idea to look into whether someone else has already solved your problem by creating a ggplot2 extension package that offers that geom.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>10</span> <span class='chapter-title'>Layers</span>"
]
},
{
"objectID": "EDA.html",
"href": "EDA.html",
"title": "11 Exploratory data analysis",
"section": "",
"text": "11.1 Introduction\nThis chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:\nEDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive insights that you’ll eventually write up and communicate to others.\nEDA is an important part of any data analysis, even if the primary research questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "EDA.html#introduction",
"href": "EDA.html#introduction",
"title": "11 Exploratory data analysis",
"section": "",
"text": "Generate questions about your data.\nSearch for answers by visualizing, transforming, and modelling your data.\nUse what you learn to refine your questions and/or generate new questions.\n\n\n\n\n11.1.1 Prerequisites\nIn this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.\n\nlibrary(tidyverse)",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "EDA.html#questions",
"href": "EDA.html#questions",
"title": "11 Exploratory data analysis",
"section": "11.2 Questions",
"text": "11.2 Questions\n\n“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox\n\n\n“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey\n\nYour goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.\nEDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.\nThere is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:\n\nWhat type of variation occurs within my variables?\nWhat type of covariation occurs between my variables?\n\nThe rest of this chapter will look at these two questions. We’ll explain what variation and covariation are, and we’ll show you several ways to answer each question.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "EDA.html#variation",
"href": "EDA.html#variation",
"title": "11 Exploratory data analysis",
"section": "11.3 Variation",
"text": "11.3 Variation\nVariation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g., the eye colors of different people) or at different times (e.g., the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how that it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values, which you’ve learned about in Chapter 2.\nWe’ll start our exploration by visualizing the distribution of weights (carat) of ~54,000 diamonds from the diamonds dataset. Since carat is a numerical variable, we can use a histogram:\n\nggplot(diamonds, aes(x = carat)) +\n geom_histogram(binwidth = 0.5)\n\n\n\n\n\n\n\n\nNow that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We’ve put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).\n\n11.3.1 Typical values\nIn both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:\n\nWhich values are the most common? Why?\nWhich values are rare? Why? Does that match your expectations?\nCan you see any unusual patterns? What might explain them?\n\nLet’s take a look at the distribution of carat for smaller diamonds.\n\nsmaller <- diamonds |> \n filter(carat < 3)\n\nggplot(smaller, aes(x = carat)) +\n geom_histogram(binwidth = 0.01)\n\n\n\n\n\n\n\n\nThis histogram suggests several interesting questions:\n\nWhy are there more diamonds at whole carats and common fractions of carats?\nWhy are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?\n\nVisualizations can also reveal clusters, which suggest that subgroups exist in your data. To understand the subgroups, ask:\n\nHow are the observations within each subgroup similar to each other?\nHow are the observations in separate clusters different from each other?\nHow can you explain or describe the clusters?\nWhy might the appearance of clusters be misleading?\n\nSome of these questions can be answered with the data while some will require domain expertise about the data. Many of them will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable. We’ll get to that shortly.\n\n\n11.3.2 Unusual values\nOutliers are observations that are unusual; data points that don’t seem to fit the pattern. 
Sometimes outliers are data entry errors, sometimes they are simply values at the extremes that happened to be observed in this data collection, and other times they suggest important new discoveries. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.\n\nggplot(diamonds, aes(x = y)) + \n geom_histogram(binwidth = 0.5)\n\n\n\n\n\n\n\n\nThere are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with coord_cartesian():\n\nggplot(diamonds, aes(x = y)) + \n geom_histogram(binwidth = 0.5) +\n coord_cartesian(ylim = c(0, 50))\n\n\n\n\n\n\n\n\ncoord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.\nThis allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:\n\nunusual <- diamonds |> \n filter(y < 3 | y > 20) |> \n select(price, x, y, z) |>\n arrange(y)\nunusual\n#> # A tibble: 9 × 4\n#> price x y z\n#> <int> <dbl> <dbl> <dbl>\n#> 1 5139 0 0 0 \n#> 2 6381 0 0 0 \n#> 3 12800 0 0 0 \n#> 4 15686 0 0 0 \n#> 5 18034 0 0 0 \n#> 6 2130 0 0 0 \n#> 7 2130 0 0 0 \n#> 8 2075 5.15 31.8 5.12\n#> 9 12210 8.09 58.9 8.06\n\nThe y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. By doing EDA, we have discovered missing data that was coded as 0, which we never would have found by simply searching for NAs. Going forward we might choose to re-code these values as NAs in order to prevent misleading calculations. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!\nIt’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g., a data entry error) and disclose that you removed them in your write-up.\n\n\n11.3.3 Exercises\n\nExplore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.\nExplore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)\nHow many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?\nCompare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "EDA.html#sec-unusual-values-eda",
"href": "EDA.html#sec-unusual-values-eda",
"title": "11 Exploratory data analysis",
"section": "11.4 Unusual values",
"text": "11.4 Unusual values\nIf you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.\n\nDrop the entire row with the strange values:\n\ndiamonds2 <- diamonds |> \n filter(between(y, 3, 20))\n\nWe don’t recommend this option because one invalid value doesn’t imply that all the other values for that observation are also invalid. Additionally, if you have low quality data, by the time that you’ve applied this approach to every variable you might find that you don’t have any data left!\nInstead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the if_else() function to replace unusual values with NA:\n\ndiamonds2 <- diamonds |> \n mutate(y = if_else(y < 3 | y > 20, NA, y))\n\n\nIt’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:\n\nggplot(diamonds2, aes(x = x, y = y)) + \n geom_point()\n#> Warning: Removed 9 rows containing missing values or values outside the scale range\n#> (`geom_point()`).\n\n\n\n\n\n\n\n\nTo suppress that warning, set na.rm = TRUE:\n\nggplot(diamonds2, aes(x = x, y = y)) + \n geom_point(na.rm = TRUE)\n\nOther times you want to understand what makes observations with missing values different to observations with recorded values. For example, in nycflights13::flights1, missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable, using is.na() to check if dep_time is missing.\n\nnycflights13::flights |> \n mutate(\n cancelled = is.na(dep_time),\n sched_hour = sched_dep_time %/% 100,\n sched_min = sched_dep_time %% 100,\n sched_dep_time = sched_hour + (sched_min / 60)\n ) |> \n ggplot(aes(x = sched_dep_time)) + \n geom_freqpoly(aes(color = cancelled), binwidth = 1/4)\n\n\n\n\n\n\n\n\nHowever this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.\n\n11.4.1 Exercises\n\nWhat happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?\nWhat does na.rm = TRUE do in mean() and sum()?\nRecreate the frequency plot of scheduled_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "EDA.html#covariation",
"href": "EDA.html#covariation",
"title": "11 Exploratory data analysis",
"section": "11.5 Covariation",
"text": "11.5 Covariation\nIf variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.\n\n11.5.1 A categorical and a numerical variable\nFor example, let’s explore how the price of a diamond varies with its quality (measured by cut) using geom_freqpoly():\n\nggplot(diamonds, aes(x = price)) + \n geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)\n\n\n\n\n\n\n\n\nNote that ggplot2 uses an ordered color scale for cut because it’s defined as an ordered factor variable in the data. You’ll learn more about these in Section 17.6.\nThe default appearance of geom_freqpoly() is not that useful here because the height, determined by the overall count, differs so much across cuts, making it hard to see the differences in the shapes of their distributions.\nTo make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the density, which is the count standardized so that the area under each frequency polygon is one.\n\nggplot(diamonds, aes(x = price, y = after_stat(density))) + \n geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)\n\n\n\n\n\n\n\n\nNote that we’re mapping the density to y, but since density is not a variable in the diamonds dataset, we need to first calculate it. We use the after_stat() function to do so.\nThere’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot.\nA visually simpler plot for exploring this relationship is using side-by-side boxplots.\n\nggplot(diamonds, aes(x = cut, y = price)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\nWe see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are typically cheaper! In the exercises, you’ll be challenged to figure out why.\ncut is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with fct_reorder(). You’ll learn more about that function in Section 17.4, but we want to give you a quick preview here because it’s so useful. For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes:\n\nggplot(mpg, aes(x = class, y = hwy)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\nTo make the trend easier to see, we can reorder class based on the median value of hwy:\n\nggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +\n geom_boxplot()\n\n\n\n\n\n\n\n\nIf you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.\n\nggplot(mpg, aes(x = hwy, y = fct_reorder(class, hwy, median))) +\n geom_boxplot()\n\n\n\n\n\n\n\n\n\n11.5.1.1 Exercises\n\nUse what you’ve learned to improve the visualization of the departure times of cancelled vs. 
non-cancelled flights.\nBased on EDA, what variable in the diamonds dataset appears to be most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?\nInstead of exchanging the x and y variables, add coord_flip() as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?\nOne problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?\nCreate a visualization of diamond prices vs. a categorical variable from the diamonds dataset using geom_violin(), then a faceted geom_histogram(), then a colored geom_freqpoly(), and then a colored geom_density(). Compare and contrast the four plots. What are the pros and cons of each method of visualizing the distribution of a numerical variable based on the levels of a categorical variable?\nIf you have a small dataset, it’s sometimes useful to use geom_jitter() to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.\n\n\n\n\n11.5.2 Two categorical variables\nTo visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in geom_count():\n\nggplot(diamonds, aes(x = cut, y = color)) +\n geom_count()\n\n\n\n\n\n\n\n\nThe size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.\nAnother approach for exploring the relationship between these variables is computing the counts with dplyr:\n\ndiamonds |> \n count(color, cut)\n#> # A tibble: 35 × 3\n#> color cut n\n#> <ord> <ord> <int>\n#> 1 D Fair 163\n#> 2 D Good 662\n#> 3 D Very Good 1513\n#> 4 D Premium 1603\n#> 5 D Ideal 2834\n#> 6 E Fair 224\n#> # ℹ 29 more rows\n\nThen visualize with geom_tile() and the fill aesthetic:\n\ndiamonds |> \n count(color, cut) |> \n ggplot(aes(x = color, y = cut)) +\n geom_tile(aes(fill = n))\n\n\n\n\n\n\n\n\nIf the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.\n\n11.5.2.1 Exercises\n\nHow could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?\nWhat different data insights do you get with a segmented bar chart if color is mapped to the x aesthetic and cut is mapped to the fill aesthetic? Calculate the counts that fall into each of the segments.\nUse geom_tile() together with dplyr to explore how average flight departure delays vary by destination and month of year. What makes the plot difficult to read? 
How could you improve it?\n\n11.5.3 Two numerical variables\nYou’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with geom_point(). You can see covariation as a pattern in the points. For example, you can see a positive relationship between the carat size and price of a diamond: diamonds with more carats have a higher price. The relationship is exponential.\n\nggplot(smaller, aes(x = carat, y = price)) +\n geom_point()\n\n(In this section we’ll use the smaller dataset to stay focused on the bulk of the diamonds that are smaller than 3 carats.)\nScatterplots become less useful as the size of your dataset grows, because points begin to overplot and pile up into areas of uniform black, making it hard to judge differences in the density of the data across the 2-dimensional space as well as making it hard to spot the trend. You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency.\n\nggplot(smaller, aes(x = carat, y = price)) + \n geom_point(alpha = 1 / 100)\n\nBut using transparency can be challenging for very large datasets. Another solution is to bin. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.\ngeom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You will need to install the hexbin package to use geom_hex().\nggplot(smaller, aes(x = carat, y = price)) +\n geom_bin2d()\n\n# install.packages(\"hexbin\")\nggplot(smaller, aes(x = carat, y = price)) +\n geom_hex()\n#> Warning: Computation failed in `stat_binhex()`.\n#> Caused by error in `compute_group()`:\n#> ! The package \"hexbin\" is required for `stat_bin_hex()`.\n\nAnother option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then, for each group, display a boxplot:\n\nggplot(smaller, aes(x = carat, y = price)) + \n geom_boxplot(aes(group = cut_width(carat, 0.1)))\n\ncut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.\n\n11.5.3.1 Exercises\n\nInstead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs. cut_number()? How does that impact a visualization of the 2d distribution of carat and price?\nVisualize the distribution of carat, partitioned by price.\nHow does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?\nCombine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.\nTwo dimensional plots reveal outliers that are not visible in one dimensional plots. 
For example, some points in the following plot have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately. Why is a scatterplot a better display than a binned plot for this case?\n\ndiamonds |> \n filter(x >= 4) |> \n ggplot(aes(x = x, y = y)) +\n geom_point() +\n coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))\n\nInstead of creating boxes of equal width with cut_width(), we could create boxes that contain a roughly equal number of points with cut_number(). What are the advantages and disadvantages of this approach?\n\nggplot(smaller, aes(x = carat, y = price)) + \n geom_boxplot(aes(group = cut_number(carat, 20)))",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "EDA.html#patterns-and-models",
"href": "EDA.html#patterns-and-models",
"title": "11 Exploratory data analysis",
"section": "11.6 Patterns and models",
"text": "11.6 Patterns and models\nIf a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:\n\nCould this pattern be due to coincidence (i.e. random chance)?\nHow can you describe the relationship implied by the pattern?\nHow strong is the relationship implied by the pattern?\nWhat other variables might affect the relationship?\nDoes the relationship change if you look at individual subgroups of the data?\n\nPatterns in your data provide clues about relationships, i.e., they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.\nModels are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of price and carat, we log transform them first, and fit a model to the log-transformed values. Then, we exponentiate the residuals to put them back in the scale of raw prices.\n\nlibrary(tidymodels)\n\ndiamonds <- diamonds |>\n mutate(\n log_price = log(price),\n log_carat = log(carat)\n )\n\ndiamonds_fit <- linear_reg() |>\n fit(log_price ~ log_carat, data = diamonds)\n\ndiamonds_aug <- augment(diamonds_fit, new_data = diamonds) |>\n mutate(.resid = exp(.resid))\n\nggplot(diamonds_aug, aes(x = carat, y = .resid)) + \n geom_point()\n\n\n\n\n\n\n\n\nOnce you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.\n\nggplot(diamonds_aug, aes(x = cut, y = .resid)) + \n geom_boxplot()\n\n\n\n\n\n\n\n\nWe’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "EDA.html#summary",
"href": "EDA.html#summary",
"title": "11 Exploratory data analysis",
"section": "11.7 Summary",
"text": "11.7 Summary\nIn this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.\nIn the next chapter, we’ll focus on the tools we can use to communicate our results.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "EDA.html#footnotes",
"href": "EDA.html#footnotes",
"title": "11 Exploratory data analysis",
"section": "",
"text": "Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function() or package::dataset.↩︎",
"crumbs": [
"Visualize",
"<span class='chapter-number'>11</span> <span class='chapter-title'>Exploratory data analysis</span>"
]
},
{
"objectID": "communication.html",
"href": "communication.html",
"title": "12 Communication",
"section": "",
"text": "12.1 Introduction\nIn Chapter 11, you learned how to use plots as tools for exploration. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then move on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.\nNow that you understand your data, you need to communicate your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.\nThis chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like The Truthful Art, by Albert Cairo. It doesn’t teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.",
"crumbs": [
"Visualize",
"<span class='chapter-number'>12</span> <span class='chapter-title'>Communication</span>"
]
},