search.json

[
  {
    "objectID": "index.html",
    "href": "index.html",
    "title": "R for Data Science (2e)",
    "section": "",
    "text": "1 тавтай морил\nЭнэ бол “R for Data Science” сэтгүүлийн 2 дахь хэвлэлд зориулагдсан вэбсайт юм. Энэхүү ном нь танд R-ээр өгөгдлийн шинжлэх ухаан хэрхэн хийхийг заах болно: Та өөрийн өгөгдлийг R-д хэрхэн оруулах, түүнийг хамгийн хэрэгцээтэй бүтцэд оруулах, хувиргах, дүрслэн харуулах аргад суралцах болно.\nЭнэ номноос та мэдээллийн шинжлэх ухааны ур чадварын практикийг олох болно. Химийн эмч туршилтын хоолойг хэрхэн цэвэрлэж, лаборатори нөөцлөх талаар сурдаг шиг та өгөгдлийг хэрхэн цэвэрлэж, график зурах талаар сурах болно. Эдгээр нь өгөгдлийн шинжлэх ухааныг бий болгох боломжийг олгодог ур чадварууд бөгөөд эндээс та R-тэй эдгээр зүйл бүрийг хийх шилдэг туршлагуудыг олох болно. Та цаг хэмнэхийн тулд график, бичиг үсэгт тайлагдсан програмчлал, хуулбарлах судалгааны дүрмийг хэрхэн ашиглах талаар суралцах болно. Мөн та мэдээлэл солилцох, дүрслэх, судлах явцад нээлтийг хөнгөвчлөх танин мэдэхүйн нөөцийг хэрхэн удирдах талаар суралцах болно.\nЭнэ вэб сайт нь CC BY-NC-ND 3.0 лицензийн дагуу лицензтэй бөгөөд үргэлж үнэ төлбөргүй байх болно. Хэрэв та номын биет хуулбарыг авахыг хүсвэл [Amazon] (https://www.amazon.com/dp/1492097403?&tag=hadlwick-20) дээр захиалж болно. Хэрэв та энэ номыг үнэ төлбөргүй уншиж байгаад талархаж байгаа бөгөөд буцааж өгөхийг хүсвэл Kākāpō Recovery: kākāpō-д хандив өргөөрэй. /www.youtube.com/watch?v =9T1vfsHYiKY) (R4DS-ийн нүүрэн дээр гардаг) нь шүүмжлэлтэй ханддаг. Шинэ Зеландаас гаралтай ховордсон тоть; ердөө 248 үлдсэн.\nХэрэв та өөр хэлээр ярьдаг бол 1-р хэвлэлд үнэгүй орчуулагдсан орчуулгыг сонирхож магадгүй юм.\nТа https://mine-cetinkaya-rundel.github.io/r4ds-solutions дээрх номноос дасгалын санал болгож буй хариултуудыг олох боломжтой.\nR4DS нь Contributor-ийн ёс зүйн дүрмийг ашигладаг болохыг анхаарна уу. Энэ номонд хувь нэмрээ оруулснаар та түүний нөхцлийг дагаж мөрдөхийг зөвшөөрч байна.",
    "crumbs": [
      "<span class='chapter-number'>1</span>  <span class='chapter-title'>тавтай морил</span>"
    ]
  },
  {
    "objectID": "index.html#талархал",
    "href": "index.html#талархал",
    "title": "R for Data Science (2e)",
    "section": "1.1 Талархал",
    "text": "1.1 Талархал\nR4DS-ийг https://www.netlify.com нээлттэй эхийн программ хангамж болон нийгэмлэгүүдийг дэмжих нэг хэсэг болгон зохион байгуулдаг. sss",
    "crumbs": [
      "<span class='chapter-number'>1</span>  <span class='chapter-title'>тавтай морил</span>"
    ]
  },
  {
    "objectID": "preface-2e.html",
    "href": "preface-2e.html",
    "title": "Preface to the second edition",
    "section": "",
    "text": "Welcome to the second edition of “R for Data Science”! This is a major reworking of the first edition, removing material we no longer think is useful, adding material we wish we included in the first edition, and generally updating the text and code to reflect changes in best practices. We’re also very excited to welcome a new co-author: Mine Çetinkaya-Rundel, a noted data science educator and one of our colleagues at Posit (the company formerly known as RStudio).\nA brief summary of the biggest changes follows:\n\nThe first part of the book has been renamed to “Whole game”. The goal of this section is to give you the rough details of the “whole game” of data science before we dive into the details.\nThe second part of the book is “Visualize”. This part gives data visualization tools and best practices a more thorough coverage compared to the first edition. The best place to get all the details is still the ggplot2 book, but now R4DS covers more of the most important techniques.\nThe third part of the book is now called “Transform” and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room to cover all the details.\nThe fourth part of the book is called “Import”. It’s a new set of chapters that goes beyond reading flat text files to working with spreadsheets, getting data out of databases, working with big data, rectangling hierarchical data, and scraping data from web sites.\nThe “Program” part remains, but has been rewritten from top-to-bottom to focus on the most important parts of function writing and iteration. Function writing now includes details on how to wrap tidyverse functions (dealing with the challenges of tidy evaluation), since this has become much easier and more important over the last few years. We’ve added a new chapter on important base R functions that you’re likely to see in wild-caught R code.\nThe modeling part has been removed. We never had enough room to fully do modelling justice, and there are now much better resources available. We generally recommend using the tidymodels packages and reading Tidy Modeling with R by Max Kuhn and Julia Silge.\nThe “Communicate” part remains, but has been thoroughly updated to feature Quarto instead of R Markdown. This edition of the book has been written in Quarto, and it’s clearly the tool of the future.",
    "crumbs": [
      "Preface to the second edition"
    ]
  },
  {
    "objectID": "intro.html",
    "href": "intro.html",
    "title": "Introduction",
    "section": "",
    "text": "What you will learn\nData science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge. The goal of “R for Data Science” is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly, and to have some fun along the way 😃. After reading this book, you’ll have the tools to tackle a wide variety of data science challenges using the best parts of R.\nData science is a vast field, and there’s no way you can master it all by reading a single book. This book aims to give you a solid foundation in the most important tools and enough knowledge to find the resources to learn more when necessary. Our model of the steps of a typical data science project looks something like Figure 1.\nFigure 1: In our model of the data science process, you start with data import and tidying. Next, you understand your data with an iterative cycle of transforming, visualizing, and modeling. You finish the process by communicating your results to other humans.\nFirst, you must import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API) and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!\nOnce you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with how it is stored. In brief, when your data is tidy, each column is a variable and each row is an observation. Tidy data is important because the consistent structure lets you focus your efforts on answering questions about the data, not fighting to get the data into the right form for different functions.\nOnce you have tidy data, a common next step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling because getting your data in a form that’s natural to work with often feels like a fight!\nOnce you have tidy data with the variables you need, there are two main engines of knowledge generation: visualization and modeling. These have complementary strengths and weaknesses, so any real data analysis will iterate between them many times.\nVisualization is a fundamentally human activity. A good visualization will show you things you did not expect or raise new questions about the data. A good visualization might also hint that you’re asking the wrong question or that you need to collect different data. Visualizations can surprise you, but they don’t scale particularly well because they require a human to interpret them.\nModels are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are fundamentally mathematical or computational tools, so they generally scale well. Even when they don’t, it’s usually cheaper to buy more computers than it is to buy more brains! But every model makes assumptions, and by its very nature, a model cannot question its own assumptions. That means a model cannot fundamentally surprise you.\nThe last step of data science is communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualization have led you to understand the data unless you can also communicate your results to others.\nSurrounding all these tools is programming. Programming is a cross-cutting tool that you use in nearly every part of a data science project. You don’t need to be an expert programmer to be a successful data scientist, but learning more about programming pays off because becoming a better programmer allows you to automate common tasks and solve new problems with greater ease.\nYou’ll use these tools in every data science project, but they’re not enough for most projects. There’s a rough 80/20 rule at play: you can tackle about 80% of every project using the tools you’ll learn in this book, but you’ll need other tools to tackle the remaining 20%. Throughout this book, we’ll point you to resources where you can learn more.",
    "crumbs": [
      "Introduction"
    ]
  },
  {
    "objectID": "intro.html#how-this-book-is-organized",
    "href": "intro.html#how-this-book-is-organized",
    "title": "Introduction",
    "section": "How this book is organized",
    "text": "How this book is organized\nThe previous description of the tools of data science is organized roughly according to the order in which you use them in an analysis (although, of course, you’ll iterate through them multiple times). In our experience, however, learning data importing and tidying first is suboptimal because, 80% of the time, it’s routine and boring, and the other 20% of the time, it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualization and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth the effort.\nWithin each chapter, we try to adhere to a consistent pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you’ve learned. Although it can be tempting to skip the exercises, there’s no better way to learn than by practicing on real problems.",
    "crumbs": [
      "Introduction"
    ]
  },
  {
    "objectID": "intro.html#what-you-wont-learn",
    "href": "intro.html#what-you-wont-learn",
    "title": "Introduction",
    "section": "What you won’t learn",
    "text": "What you won’t learn\nThere are several important topics that this book doesn’t cover. We believe it’s important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can’t cover every important topic.\n\nModeling\nModeling is super important for data science, but it’s a big topic, and unfortunately, we just don’t have the space to give it the coverage it deserves here. To learn more about modeling, we highly recommend Tidy Modeling with R by our colleagues Max Kuhn and Julia Silge. This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.\n\n\nBig data\nThis book proudly and primarily focuses on small, in-memory datasets. This is the right place to start because you can’t tackle big data unless you have experience with small data. The tools you’ll learn throughout the majority of this book will easily handle hundreds of megabytes of data, and with a bit of care, you can typically use them to work with a few gigabytes of data. We’ll also show you how to get data out of databases and parquet files, both of which are often used to store big data. You won’t necessarily be able to work with the entire dataset, but that’s not a problem because you only need a subset or subsample to answer the question that you’re interested in.\nIf you’re routinely working with larger data (10–100 GB, say), we recommend learning more about data.table. We don’t teach it here because it uses a different interface than the tidyverse and requires you to learn some different conventions. However, it is incredibly faster, and the performance payoff is worth investing some time in learning it if you’re working with large data.\n\n\nPython, Julia, and friends\nIn this book, you won’t learn anything about Python, Julia, or any other programming language useful for data science. This isn’t because we think these tools are bad. They’re not! And in practice, most data science teams use a mix of languages, often at least R and Python. But we strongly believe that it’s best to master one tool at a time, and R is a great place to start.",
    "crumbs": [
      "Introduction"
    ]
  },
  {
    "objectID": "intro.html#prerequisites",
    "href": "intro.html#prerequisites",
    "title": "Introduction",
    "section": "Prerequisites",
    "text": "Prerequisites\nWe’ve made a few assumptions about what you already know to get the most out of this book. You should be generally numerically literate, and it’s helpful if you have some basic programming experience already. If you’ve never programmed before, you might find Hands on Programming with R by Garrett to be a valuable adjunct to this book.\nYou need four things to run the code in this book: R, RStudio, a collection of R packages called the tidyverse, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, documentation that describes how to use them, and sample data.\n\nR\nTo download R, go to CRAN, the comprehensive R archive network, https://cloud.r-project.org. A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly. Upgrading can be a bit of a hassle, especially for major versions that require you to re-install all your packages, but putting it off only makes it worse. We recommend R 4.2.0 or later for this book.\n\n\nRStudio\nRStudio is an integrated development environment, or IDE, for R programming, which you can download from https://posit.co/download/rstudio-desktop/. RStudio is updated a couple of times a year, and it will automatically let you know when a new version is out, so there’s no need to check back. It’s a good idea to upgrade regularly to take advantage of the latest and greatest features. For this book, make sure you have at least RStudio 2022.02.0.\nWhen you start RStudio, Figure 2, you’ll see two key regions in the interface: the console pane and the output pane. For now, all you need to know is that you type the R code in the console pane and press enter to run it. You’ll learn more as we go along!1\n\n\n\n\n\n\n\n\nFigure 2: The RStudio IDE has two key regions: type R code in the console pane on the left, and look for plots in the output pane on the right.\n\n\n\n\n\n\n\nThe tidyverse\nYou’ll also need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming and are designed to work together.\nYou can install the complete tidyverse with a single line of code:\n\ninstall.packages(\"tidyverse\")\n\nOn your computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on your computer.\nYou will not be able to use the functions, objects, or help files in a package until you load it with library(). Once you have installed a package, you can load it using the library() function:\n\nlibrary(tidyverse)\n#&gt; ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──\n#&gt; ✔ dplyr     1.1.4     ✔ readr     2.1.5\n#&gt; ✔ forcats   1.0.0     ✔ stringr   1.5.1\n#&gt; ✔ ggplot2   3.5.1     ✔ tibble    3.2.1\n#&gt; ✔ lubridate 1.9.3     ✔ tidyr     1.3.1\n#&gt; ✔ purrr     1.0.2     \n#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──\n#&gt; ✖ dplyr::filter() masks stats::filter()\n#&gt; ✖ dplyr::lag()    masks stats::lag()\n#&gt; ℹ Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors\n\nThis tells you that tidyverse loads nine packages: dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble, tidyr. These are considered the core of the tidyverse because you’ll use them in almost every analysis.\nPackages in the tidyverse change fairly frequently. You can see if updates are available by running tidyverse_update().\n\n\nOther packages\nThere are many other excellent packages that are not part of the tidyverse because they solve problems in a different domain or are designed with a different set of underlying principles. This doesn’t make them better or worse; it just makes them different. In other words, the complement to the tidyverse is not the messyverse but many other universes of interrelated packages. As you tackle more data science projects with R, you’ll learn new packages and new ways of thinking about data.\nWe’ll use many packages from outside the tidyverse in this book. For example, we’ll use the following packages because they provide interesting datasets for us to work with in the process of learning R:\n\ninstall.packages(\n  c(\"arrow\", \"babynames\", \"curl\", \"duckdb\", \"gapminder\", \n    \"ggrepel\", \"ggridges\", \"ggthemes\", \"hexbin\", \"janitor\", \"Lahman\", \n    \"leaflet\", \"maps\", \"nycflights13\", \"openxlsx\", \"palmerpenguins\", \n    \"repurrrsive\", \"tidymodels\", \"writexl\")\n  )\n\nWe’ll also use a selection of other packages for one off examples. You don’t need to install them now, just remember that whenever you see an error like this:\n\nlibrary(ggrepel)\n#&gt; Error in library(ggrepel) : there is no package called ‘ggrepel’\n\nYou need to run install.packages(\"ggrepel\") to install the package.",
    "crumbs": [
      "Introduction"
    ]
  },
  {
    "objectID": "intro.html#running-r-code",
    "href": "intro.html#running-r-code",
    "title": "Introduction",
    "section": "Running R code",
    "text": "Running R code\nThe previous section showed you several examples of running R code. The code in the book looks like this:\n\n1 + 2\n#&gt; [1] 3\n\nIf you run the same code in your local console, it will look like this:\n&gt; 1 + 2\n[1] 3\nThere are two main differences. In your console, you type after the &gt;, called the prompt; we don’t show the prompt in the book. In the book, the output is commented out with #&gt;; in your console, it appears directly after your code. These two differences mean that if you’re working with an electronic version of the book, you can easily copy code out of the book and paste it into the console.\nThroughout the book, we use a consistent set of conventions to refer to code:\n\nFunctions are displayed in a code font and followed by parentheses, like sum() or mean().\nOther R objects (such as data or function arguments) are in a code font, without parentheses, like flights or x.\nSometimes, to make it clear which package an object comes from, we’ll use the package name followed by two colons, like dplyr::mutate() or nycflights13::flights. This is also valid R code.",
    "crumbs": [
      "Introduction"
    ]
  },
  {
    "objectID": "intro.html#acknowledgments",
    "href": "intro.html#acknowledgments",
    "title": "Introduction",
    "section": "Acknowledgments",
    "text": "Acknowledgments\nThis book isn’t just the product of Hadley, Mine, and Garrett but is the result of many conversations (in person and online) that we’ve had with many people in the R community. We’re incredibly grateful for all the conversations we’ve had with y’all; thank you so much!\nThis book was written in the open, and many people contributed via pull requests. A special thanks to all 259 of you who contributed improvements via GitHub pull requests (in alphabetical order by username): @a-rosenberg, Tim Becker (@a2800276), Abinash Satapathy (@Abinashbunty), Adam Gruer (@adam-gruer), adi pradhan (@adidoit), A. s. (@Adrianzo), Aep Hidyatuloh (@aephidayatuloh), Andrea Gilardi (@agila5), Ajay Deonarine (@ajay-d), @AlanFeder, Daihe Sui (@alansuidaihe), @alberto-agudo, @AlbertRapp, @aleloi, pete (@alonzi), Alex (@ALShum), Andrew M. (@amacfarland), Andrew Landgraf (@andland), @andyhuynh92, Angela Li (@angela-li), Antti Rask (@AnttiRask), LOU Xun (@aquarhead), @ariespirgel, @august-18, Michael Henry (@aviast), Azza Ahmed (@azzaea), Steven Moran (@bambooforest), Brian G. Barkley (@BarkleyBG), Mara Averick (@batpigandme), Oluwafemi OYEDELE (@BB1464), Brent Brewington (@bbrewington), Bill Behrman (@behrman), Ben Herbertson (@benherbertson), Ben Marwick (@benmarwick), Ben Steinberg (@bensteinberg), Benjamin Yeh (@bentyeh), Betul Turkoglu (@betulturkoglu), Brandon Greenwell (@bgreenwell), Bianca Peterson (@BinxiePeterson), Birger Niklas (@BirgerNi), Brett Klamer (@bklamer), @boardtc, Christian (@c-hoh), Caddy (@caddycarine), Camille V Leonard (@camillevleonard), @canovasjm, Cedric Batailler (@cedricbatailler), Christina Wei (@christina-wei), Christian Mongeau (@chrMongeau), Cooper Morris (@coopermor), Colin Gillespie (@csgillespie), Rademeyer Vermaak (@csrvermaak), Chloe Thierstein (@cthierst), Chris Saunders (@ctsa), Abhinav Singh (@curious-abhinav), Curtis Alexander (@curtisalexander), Christian G. Warden (@cwarden), Charlotte Wickham (@cwickham), Kenny Darrell (@darrkj), David Kane (@davidkane9), David (@davidrsch), David Rubinger (@davidrubinger), David Clark (@DDClark), Derwin McGeary (@derwinmcgeary), Daniel Gromer (@dgromer), @Divider85, @djbirke, Danielle Navarro (@djnavarro), Russell Shean (@DOH-RPS1303), Zhuoer Dong (@dongzhuoer), Devin Pastoor (@dpastoor), @DSGeoff, Devarshi Thakkar (@dthakkar09), Julian During (@duju211), Dylan Cashman (@dylancashman), Dirk Eddelbuettel (@eddelbuettel), Edwin Thoen (@EdwinTh), Ahmed El-Gabbas (@elgabbas), Henry Webel (@enryH), Ercan Karadas (@ercan7), Eric Kitaif (@EricKit), Eric Watt (@ericwatt), Erik Erhardt (@erikerhardt), Etienne B. Racine (@etiennebr), Everett Robinson (@evjrob), @fellennert, Flemming Miguel (@flemmingmiguel), Floris Vanderhaeghe (@florisvdh), @funkybluehen, @gabrivera, Garrick Aden-Buie (@gadenbuie), Peter Ganong (@ganong123), Gerome Meyer (@GeroVanMi), Gleb Ebert (@gl-eb), Josh Goldberg (@GoldbergData), bahadir cankardes (@gridgrad), Gustav W Delius (@gustavdelius), Hao Chen (@hao-trivago), Harris McGehee (@harrismcgehee), @hendrikweisser, Hengni Cai (@hengnicai), Iain (@Iain-S), Ian Sealy (@iansealy), Ian Lyttle (@ijlyttle), Ivan Krukov (@ivan-krukov), Jacob Kaplan (@jacobkap), Jazz Weisman (@jazzlw), John Blischak (@jdblischak), John D. Storey (@jdstorey), Gregory Jefferis (@jefferis), Jeffrey Stevens (@JeffreyRStevens), 蒋雨蒙 (@JeldorPKU), Jennifer (Jenny) Bryan (@jennybc), Jen Ren (@jenren), Jeroen Janssens (@jeroenjanssens), @jeromecholewa, Janet Wesner (@jilmun), Jim Hester (@jimhester), JJ Chen (@jjchern), Jacek Kolacz (@jkolacz), Joanne Jang (@joannejang), @johannes4998, John Sears (@johnsears), @jonathanflint, Jon Calder (@jonmcalder), Jonathan Page (@jonpage), Jon Harmon (@jonthegeek), JooYoung Seo (@jooyoungseo), Justinas Petuchovas (@jpetuchovas), Jordan (@jrdnbradford), Jeffrey Arnold (@jrnold), Jose Roberto Ayala Solares (@jroberayalas), Joyce Robbins (@jtr13), @juandering, Julia Stewart Lowndes (@jules32), Sonja (@kaetschap), Kara Woo (@karawoo), Katrin Leinweber (@katrinleinweber), Karandeep Singh (@kdpsingh), Kevin Perese (@kevinxperese), Kevin Ferris (@kferris10), Kirill Sevastyanenko (@kirillseva), Jonathan Kitt (@KittJonathan), @koalabearski, Kirill Müller (@krlmlr), Rafał Kucharski (@kucharsky), Kevin Wright (@kwstat), Noah Landesberg (@landesbergn), Lawrence Wu (@lawwu), @lindbrook, Luke W Johnston (@lwjohnst86), Kara de la Marck (@MarckK), Kunal Marwaha (@marwahaha), Matan Hakim (@matanhakim), Matthias Liew (@MatthiasLiew), Matt Wittbrodt (@MattWittbrodt), Mauro Lepore (@maurolepore), Mark Beveridge (@mbeveridge), @mcewenkhundi, mcsnowface, PhD (@mcsnowface), Matt Herman (@mfherman), Michael Boerman (@michaelboerman), Mitsuo Shiota (@mitsuoxv), Matthew Hendrickson (@mjhendrickson), @MJMarshall, Misty Knight-Finley (@mkfin7), Mohammed Hamdy (@mmhamdy), Maxim Nazarov (@mnazarov), Maria Paula Caldas (@mpaulacaldas), Mustafa Ascha (@mustafaascha), Nelson Areal (@nareal), Nate Olson (@nate-d-olson), Nathanael (@nateaff), @nattalides, Ned Western (@NedJWestern), Nick Clark (@nickclark1000), @nickelas, Nirmal Patel (@nirmalpatel), Nischal Shrestha (@nischalshrestha), Nicholas Tierney (@njtierney), Jakub Nowosad (@Nowosad), Nick Pullen (@nstjhp), @olivier6088, Olivier Cailloux (@oliviercailloux), Robin Penfold (@p0bs), Pablo E. Garcia (@pabloedug), Paul Adamson (@padamson), Penelope Y (@penelopeysm), Peter Hurford (@peterhurford), Peter Baumgartner (@petzi53), Patrick Kennedy (@pkq), Pooya Taherkhani (@pooyataher), Y. Yu (@PursuitOfDataScience), Radu Grosu (@radugrosu), Ranae Dietzel (@Ranae), Ralph Straumann (@rastrau), Rayna M Harris (@raynamharris), @ReeceGoding, Robin Gertenbach (@rgertenbach), Jajo (@RIngyao), Riva Quiroga (@rivaquiroga), Richard Knight (@RJHKnight), Richard Zijdeman (@rlzijdeman), @robertchu03, Robin Kohrs (@RobinKohrs), Robin (@Robinlovelace), Emily Robinson (@robinsones), Rob Tenorio (@robtenorio), Rod Mazloomi (@RodAli), Rohan Alexander (@RohanAlexander), Romero Morais (@RomeroBarata), Albert Y. Kim (@rudeboybert), Saghir (@saghirb), Hojjat Salmasian (@salmasian), Jonas (@sauercrowd), Vebash Naidoo (@sciencificity), Seamus McKinsey (@seamus-mckinsey), @seanpwilliams, Luke Smith (@seasmith), Matthew Sedaghatfar (@sedaghatfar), Sebastian Kraus (@sekR4), Sam Firke (@sfirke), Shannon Ellis (@ShanEllis), @shoili, Christian Heinrich (@Shurakai), S’busiso Mkhondwane (@sibusiso16), SM Raiyyan (@sm-raiyyan), Jakob Krigovsky (@sonicdoe), Stephan Koenig (@stephan-koenig), Stephen Balogun (@stephenbalogun), Steven M. Mortimer (@StevenMMortimer), Stéphane Guillou (@stragu), Sulgi Kim (@sulgik), Sergiusz Bleja (@svenski), Tal Galili (@talgalili), Alec Fisher (@Taurenamo), Todd Gerarden (@tgerarden), Tom Godfrey (@thomasggodfrey), Tim Broderick (@timbroderick), Tim Waterhouse (@timwaterhouse), TJ Mahr (@tjmahr), Thomas Klebel (@tklebel), Tom Prior (@tomjamesprior), Terence Teo (@tteo), @twgardner2, Ulrik Lyngs (@ulyngs), Shinya Uryu (@uribo), Martin Van der Linden (@vanderlindenma), Walter Somerville (@waltersom), @werkstattcodes, Will Beasley (@wibeasley), Yihui Xie (@yihui), Yiming (Paul) Li (@yimingli), @yingxingwu, Hiroaki Yutani (@yutannihilation), Yu Yu Aung (@yuyu-aung), Zach Bogart (@zachbogart), @zeal626, Zeki Akyol (@zekiakyol).",
    "crumbs": [
      "Introduction"
    ]
  },
  {
    "objectID": "intro.html#colophon",
    "href": "intro.html#colophon",
    "title": "Introduction",
    "section": "Colophon",
    "text": "Colophon\nAn online version of this book is available at https://r4ds.hadley.nz. It will continue to evolve in between reprints of the physical book. The source of the book is available at https://github.com/hadley/r4ds. The book is powered by Quarto, which makes it easy to write books that combine text and executable code.",
    "crumbs": [
      "Introduction"
    ]
  },
  {
    "objectID": "intro.html#footnotes",
    "href": "intro.html#footnotes",
    "title": "Introduction",
    "section": "",
    "text": "If you’d like a comprehensive overview of all of RStudio’s features, see the RStudio User Guide at https://docs.posit.co/ide/user.↩︎",
    "crumbs": [
      "Introduction"
    ]
  },
  {
    "objectID": "whole-game.html",
    "href": "whole-game.html",
    "title": "Whole game",
    "section": "",
    "text": "Our goal in this part of the book is to give you a rapid overview of the main tools of data science: importing, tidying, transforming, and visualizing data, as shown in Figure 1. We want to show you the “whole game” of data science giving you just enough of all the major pieces so that you can tackle real, if simple, datasets. The later parts of the book will hit each of these topics in more depth, increasing the range of data science challenges that you can tackle.\n\n\n\n\n\n\n\n\nFigure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.\n\n\n\n\n\nFour chapters focus on the tools of data science:\n\nVisualization is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data. In 2  Data visualization you’ll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.\nVisualization alone is typically not enough, so in 4  Data transformation, you’ll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.\nIn 6  Data tidying, you’ll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier. You’ll learn the underlying principles, and how to get your data into a tidy form.\nBefore you can transform and visualize your data, you need to first get your data into R. In 8  Data import you’ll learn the basics of getting .csv files into R.\n\nNestled among these chapters are four other chapters that focus on your R workflow. In 3  Workflow: basics, 5  Workflow: code style, and 7  Workflow: scripts and projects you’ll learn good workflow practices for writing and organizing your R code. These will set you up for success in the long run, as they’ll give you the tools to stay organized when you tackle real projects. Finally, 9  Workflow: getting help will teach you how to get help and keep learning.",
    "crumbs": [
      "Whole game"
    ]
  },
  {
    "objectID": "data-visualize.html",
    "href": "data-visualize.html",
    "title": "2  Data visualization",
    "section": "",
    "text": "2.1 Introduction\nR has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.\nThis chapter will teach you how to visualize your data using ggplot2. We will start by creating a simple scatterplot and use that to introduce aesthetic mappings and geometric objects – the fundamental building blocks of ggplot2. We will then walk you through visualizing distributions of single variables as well as visualizing relationships between two or more variables. We’ll finish off with saving your plots and troubleshooting tips.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#introduction",
    "href": "data-visualize.html#introduction",
    "title": "2  Data visualization",
    "section": "",
    "text": "“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey\n\n\n\n\n2.1.1 Prerequisites\nThis chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running:\n\nlibrary(tidyverse)\n#&gt; ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──\n#&gt; ✔ dplyr     1.1.4     ✔ readr     2.1.5\n#&gt; ✔ forcats   1.0.0     ✔ stringr   1.5.1\n#&gt; ✔ ggplot2   3.5.1     ✔ tibble    3.2.1\n#&gt; ✔ lubridate 1.9.3     ✔ tidyr     1.3.1\n#&gt; ✔ purrr     1.0.2     \n#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──\n#&gt; ✖ dplyr::filter() masks stats::filter()\n#&gt; ✖ dplyr::lag()    masks stats::lag()\n#&gt; ℹ Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors\n\nThat one line of code loads the core tidyverse; the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded)1.\nIf you run this code and get the error message there is no package called 'tidyverse', you’ll need to first install it, then run library() once again.\n\ninstall.packages(\"tidyverse\")\nlibrary(tidyverse)\n\nYou only need to install a package once, but you need to load it every time you start a new session.\nIn addition to tidyverse, we will also use the palmerpenguins package, which includes the penguins dataset containing body measurements for penguins on three islands in the Palmer Archipelago, and the ggthemes package, which offers a colorblind safe color palette.\n\nlibrary(palmerpenguins)\nlibrary(ggthemes)",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#first-steps",
    "href": "data-visualize.html#first-steps",
    "title": "2  Data visualization",
    "section": "2.2 First steps",
    "text": "2.2 First steps\nDo penguins with longer flippers weigh more or less than penguins with shorter flippers? You probably already have an answer, but try to make your answer precise. What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguin? How about by the island where the penguin lives? Let’s create visualizations that we can use to answer these questions.\n\n2.2.1 The penguins data frame\nYou can test your answers to those questions with the penguins data frame found in palmerpenguins (a.k.a. palmerpenguins::penguins). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). penguins contains 344 observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER2.\nTo make the discussion easier, let’s define some terms:\n\nA variable is a quantity, quality, or property that you can measure.\nA value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.\nAn observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We’ll sometimes refer to an observation as a data point.\nTabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.\n\nIn this context, a variable refers to an attribute of all the penguins, and an observation refers to all the attributes of a single penguin.\nType the name of the data frame in the console and R will print a preview of its contents. Note that it says tibble on top of this preview. In the tidyverse, we use special data frames called tibbles that you will learn more about soon.\n\npenguins\n#&gt; # A tibble: 344 × 8\n#&gt;   species island    bill_length_mm bill_depth_mm flipper_length_mm\n#&gt;   &lt;fct&gt;   &lt;fct&gt;              &lt;dbl&gt;         &lt;dbl&gt;             &lt;int&gt;\n#&gt; 1 Adelie  Torgersen           39.1          18.7               181\n#&gt; 2 Adelie  Torgersen           39.5          17.4               186\n#&gt; 3 Adelie  Torgersen           40.3          18                 195\n#&gt; 4 Adelie  Torgersen           NA            NA                  NA\n#&gt; 5 Adelie  Torgersen           36.7          19.3               193\n#&gt; 6 Adelie  Torgersen           39.3          20.6               190\n#&gt; # ℹ 338 more rows\n#&gt; # ℹ 3 more variables: body_mass_g &lt;int&gt;, sex &lt;fct&gt;, year &lt;int&gt;\n\nThis data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use glimpse(). Or, if you’re in RStudio, run View(penguins) to open an interactive data viewer.\n\nglimpse(penguins)\n#&gt; Rows: 344\n#&gt; Columns: 8\n#&gt; $ species           &lt;fct&gt; Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A…\n#&gt; $ island            &lt;fct&gt; Torgersen, Torgersen, Torgersen, Torgersen, Torge…\n#&gt; $ bill_length_mm    &lt;dbl&gt; 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.…\n#&gt; $ bill_depth_mm     &lt;dbl&gt; 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.…\n#&gt; $ flipper_length_mm &lt;int&gt; 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, …\n#&gt; $ body_mass_g       &lt;int&gt; 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347…\n#&gt; $ sex               &lt;fct&gt; male, female, female, NA, female, male, female, m…\n#&gt; $ year              &lt;int&gt; 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…\n\nAmong the variables in penguins are:\n\nspecies: a penguin’s species (Adelie, Chinstrap, or Gentoo).\nflipper_length_mm: length of a penguin’s flipper, in millimeters.\nbody_mass_g: body mass of a penguin, in grams.\n\nTo learn more about penguins, open its help page by running ?penguins.\n\n\n2.2.2 Ultimate goal\nOur ultimate goal in this chapter is to recreate the following visualization displaying the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.\n\n\n\n\n\n\n\n\n\n\n\n2.2.3 Creating a ggplot\nLet’s recreate this plot step-by-step.\nWith ggplot2, you begin a plot with the function ggplot(), defining a plot object that you then add layers to. The first argument of ggplot() is the dataset to use in the graph and so ggplot(data = penguins) creates an empty graph that is primed to display the penguins data, but since we haven’t told it how to visualize it yet, for now it’s empty. This is not a very exciting plot, but you can think of it like an empty canvas you’ll paint the remaining layers of your plot onto.\n\nggplot(data = penguins)\n\n\n\n\n\n\n\n\nNext, we need to tell ggplot() how the information from our data will be visually represented. The mapping argument of the ggplot() function defines how variables in your dataset are mapped to visual properties (aesthetics) of your plot. The mapping argument is always defined in the aes() function, and the x and y arguments of aes() specify which variables to map to the x and y axes. For now, we will only map flipper length to the x aesthetic and body mass to the y aesthetic. ggplot2 looks for the mapped variables in the data argument, in this case, penguins.\nThe following plot shows the result of adding these mappings.\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g)\n)\n\n\n\n\n\n\n\n\nOur empty canvas now has more structure – it’s clear where flipper lengths will be displayed (on the x-axis) and where body masses will be displayed (on the y-axis). But the penguins themselves are not yet on the plot. This is because we have not yet articulated, in our code, how to represent the observations from our data frame on our plot.\nTo do so, we need to define a geom: the geometrical object that a plot uses to represent data. These geometric objects are made available in ggplot2 with functions that start with geom_. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms (geom_bar()), line charts use line geoms (geom_line()), boxplots use boxplot geoms (geom_boxplot()), scatterplots use point geoms (geom_point()), and so on.\nThe function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each adds a different type of layer to a plot. You’ll learn a whole bunch of geoms throughout the book, particularly in Chapter 10.\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n  geom_point()\n#&gt; Warning: Removed 2 rows containing missing values or values outside the scale range\n#&gt; (`geom_point()`).\n\n\n\n\n\n\n\n\nNow we have something that looks like what we might think of as a “scatterplot”. It doesn’t yet match our “ultimate goal” plot, but using this plot we can start answering the question that motivated our exploration: “What does the relationship between flipper length and body mass look like?” The relationship appears to be positive (as flipper length increases, so does body mass), fairly linear (the points are clustered around a line instead of a curve), and moderately strong (there isn’t too much scatter around such a line). Penguins with longer flippers are generally larger in terms of their body mass.\nBefore we add more layers to this plot, let’s pause for a moment and review the warning message we got:\n\nRemoved 2 rows containing missing values (geom_point()).\n\nWe’re seeing this message because there are two penguins in our dataset with missing body mass and/or flipper length values and ggplot2 has no way of representing them on the plot without both of these values. Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. This type of warning is probably one of the most common types of warnings you will see when working with real data – missing values are a very common issue and you’ll learn more about them throughout the book, particularly in Chapter 19. For the remaining plots in this chapter we will suppress this warning so it’s not printed alongside every single plot we make.\n\n\n2.2.4 Adding aesthetics and layers\nScatterplots are useful for displaying the relationship between two numerical variables, but it’s always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship. For example, does the relationship between flipper length and body mass differ by species? Let’s incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between these variables. We will do this by representing species with different colored points.\nTo achieve this, will we need to modify the aesthetic or the geom? If you guessed “in the aesthetic mapping, inside of aes()”, you’re already getting the hang of creating data visualizations with ggplot2! And if not, don’t worry. Throughout the book you will make many more ggplots and have many more opportunities to check your intuition as you make them.\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)\n) +\n  geom_point()\n\n\n\n\n\n\n\n\nWhen a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable (each of the three species), a process known as scaling. ggplot2 will also add a legend that explains which values correspond to which levels.\nNow let’s add one more layer: a smooth curve displaying the relationship between body mass and flipper length. Before you proceed, refer back to the code above, and think about how we can add this to our existing plot.\nSince this is a new geometric object representing our data, we will add a new geom as a layer on top of our point geom: geom_smooth(). And we will specify that we want to draw the line of best fit based on a linear model with method = \"lm\".\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)\n) +\n  geom_point() +\n  geom_smooth(method = \"lm\")\n\n\n\n\n\n\n\n\nWe have successfully added lines, but this plot doesn’t look like the plot from Section 2.2.2, which only has one line for the entire dataset as opposed to separate lines for each of the penguin species.\nWhen aesthetic mappings are defined in ggplot(), at the global level, they’re passed down to each of the subsequent geom layers of the plot. However, each geom function in ggplot2 can also take a mapping argument, which allows for aesthetic mappings at the local level that are added to those inherited from the global level. Since we want points to be colored based on species but don’t want the lines to be separated out for them, we should specify color = species for geom_point() only.\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n  geom_point(mapping = aes(color = species)) +\n  geom_smooth(method = \"lm\")\n\n\n\n\n\n\n\n\nVoila! We have something that looks very much like our ultimate goal, though it’s not yet perfect. We still need to use different shapes for each species of penguins and improve labels.\nIt’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. Therefore, in addition to color, we can also map species to the shape aesthetic.\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n  geom_point(mapping = aes(color = species, shape = species)) +\n  geom_smooth(method = \"lm\")\n\n\n\n\n\n\n\n\nNote that the legend is automatically updated to reflect the different shapes of the points as well.\nAnd finally, we can improve the labels of our plot using the labs() function in a new layer. Some of the arguments to labs() might be self explanatory: title adds a title and subtitle adds a subtitle to the plot. Other arguments match the aesthetic mappings, x is the x-axis label, y is the y-axis label, and color and shape define the label for the legend. In addition, we can improve the color palette to be colorblind safe with the scale_color_colorblind() function from the ggthemes package.\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n  geom_point(aes(color = species, shape = species)) +\n  geom_smooth(method = \"lm\") +\n  labs(\n    title = \"Body mass and flipper length\",\n    subtitle = \"Dimensions for Adelie, Chinstrap, and Gentoo Penguins\",\n    x = \"Flipper length (mm)\", y = \"Body mass (g)\",\n    color = \"Species\", shape = \"Species\"\n  ) +\n  scale_color_colorblind()\n\n\n\n\n\n\n\n\nWe finally have a plot that perfectly matches our “ultimate goal”!\n\n\n2.2.5 Exercises\n\nHow many rows are in penguins? How many columns?\nWhat does the bill_depth_mm variable in the penguins data frame describe? Read the help for ?penguins to find out.\nMake a scatterplot of bill_depth_mm vs. bill_length_mm. That is, make a scatterplot with bill_depth_mm on the y-axis and bill_length_mm on the x-axis. Describe the relationship between these two variables.\nWhat happens if you make a scatterplot of species vs. bill_depth_mm? What might be a better choice of geom?\nWhy does the following give an error and how would you fix it?\n\nggplot(data = penguins) + \n  geom_point()\n\nWhat does the na.rm argument do in geom_point()? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to TRUE.\nAdd the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for labs().\nRecreate the following visualization. What aesthetic should bill_depth_mm be mapped to? And should it be mapped at the global level or at the geom level?\n\n\n\n\n\n\n\n\n\nRun this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)\n) +\n  geom_point() +\n  geom_smooth(se = FALSE)\n\nWill these two graphs look different? Why/why not?\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n  geom_point() +\n  geom_smooth()\n\nggplot() +\n  geom_point(\n    data = penguins,\n    mapping = aes(x = flipper_length_mm, y = body_mass_g)\n  ) +\n  geom_smooth(\n    data = penguins,\n    mapping = aes(x = flipper_length_mm, y = body_mass_g)\n  )",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#sec-ggplot2-calls",
    "href": "data-visualize.html#sec-ggplot2-calls",
    "title": "2  Data visualization",
    "section": "2.3 ggplot2 calls",
    "text": "2.3 ggplot2 calls\nAs we move on from these introductory sections, we’ll transition to a more concise expression of ggplot2 code. So far we’ve been very explicit, which is helpful when you are learning:\n\nggplot(\n  data = penguins,\n  mapping = aes(x = flipper_length_mm, y = body_mass_g)\n) +\n  geom_point()\n\nTypically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to ggplot() are data and mapping, in the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of extra text, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in Chapter 26.\nRewriting the previous plot more concisely yields:\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + \n  geom_point()\n\nIn the future, you’ll also learn about the pipe, |&gt;, which will allow you to create that plot with:\n\npenguins |&gt; \n  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) + \n  geom_point()",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#visualizing-distributions",
    "href": "data-visualize.html#visualizing-distributions",
    "title": "2  Data visualization",
    "section": "2.4 Visualizing distributions",
    "text": "2.4 Visualizing distributions\nHow you visualize the distribution of a variable depends on the type of variable: categorical or numerical.\n\n2.4.1 A categorical variable\nA variable is categorical if it can only take one of a small set of values. To examine the distribution of a categorical variable, you can use a bar chart. The height of the bars displays how many observations occurred with each x value.\n\nggplot(penguins, aes(x = species)) +\n  geom_bar()\n\n\n\n\n\n\n\n\nIn bar plots of categorical variables with non-ordered levels, like the penguin species above, it’s often preferable to reorder the bars based on their frequencies. Doing so requires transforming the variable to a factor (how R handles categorical data) and then reordering the levels of that factor.\n\nggplot(penguins, aes(x = fct_infreq(species))) +\n  geom_bar()\n\n\n\n\n\n\n\n\nYou will learn more about factors and functions for dealing with factors (like fct_infreq() shown above) in Chapter 17.\n\n\n2.4.2 A numerical variable\nA variable is numerical (or quantitative) if it can take on a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values. Numerical variables can be continuous or discrete.\nOne commonly used visualization for distributions of continuous variables is a histogram.\n\nggplot(penguins, aes(x = body_mass_g)) +\n  geom_histogram(binwidth = 200)\n\n\n\n\n\n\n\n\nA histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that 39 observations have a body_mass_g value between 3,500 and 3,700 grams, which are the left and right edges of the bar.\nYou can set the width of the intervals in a histogram with the binwidth argument, which is measured in the units of the x variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. In the plots below a binwidth of 20 is too narrow, resulting in too many bars, making it difficult to determine the shape of the distribution. Similarly, a binwidth of 2,000 is too high, resulting in all data being binned into only three bars, and also making it difficult to determine the shape of the distribution. A binwidth of 200 provides a sensible balance.\nggplot(penguins, aes(x = body_mass_g)) +\n  geom_histogram(binwidth = 20)\nggplot(penguins, aes(x = body_mass_g)) +\n  geom_histogram(binwidth = 2000)\n\n\n\n\n\n\n\n\n\n\nAn alternative visualization for distributions of numerical variables is a density plot. A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution. We won’t go into how geom_density() estimates the density (you can read more about that in the function documentation), but let’s explain how the density curve is drawn with an analogy. Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. It shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.\n\nggplot(penguins, aes(x = body_mass_g)) +\n  geom_density()\n#&gt; Warning: Removed 2 rows containing non-finite outside the scale range\n#&gt; (`stat_density()`).\n\n\n\n\n\n\n\n\n\n\n2.4.3 Exercises\n\nMake a bar plot of species of penguins, where you assign species to the y aesthetic. How is this plot different?\nHow are the following two plots different? Which aesthetic, color or fill, is more useful for changing the color of bars?\n\nggplot(penguins, aes(x = species)) +\n  geom_bar(color = \"red\")\n\nggplot(penguins, aes(x = species)) +\n  geom_bar(fill = \"red\")\n\nWhat does the bins argument in geom_histogram() do?\nMake a histogram of the carat variable in the diamonds dataset that is available when you load the tidyverse package. Experiment with different binwidths. What binwidth reveals the most interesting patterns?",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#visualizing-relationships",
    "href": "data-visualize.html#visualizing-relationships",
    "title": "2  Data visualization",
    "section": "2.5 Visualizing relationships",
    "text": "2.5 Visualizing relationships\nTo visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. In the following sections you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them.\n\n2.5.1 A numerical and a categorical variable\nTo visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. A boxplot is a type of visual shorthand for measures of position (percentiles) that describe a distribution. It is also useful for identifying potential outliers. As shown in Figure 2.1, each boxplot consists of:\n\nA box that indicates the range of the middle half of the data, a distance known as the interquartile range (IQR), stretching from the 25th percentile of the distribution to the 75th percentile. In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.\nVisual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.\nA line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.\n\n\n\n\n\n\n\n\n\nFigure 2.1: Diagram depicting how a boxplot is created.\n\n\n\n\n\nLet’s take a look at the distribution of body mass by species using geom_boxplot():\n\nggplot(penguins, aes(x = species, y = body_mass_g)) +\n  geom_boxplot()\n\n\n\n\n\n\n\n\nAlternatively, we can make density plots with geom_density().\n\nggplot(penguins, aes(x = body_mass_g, color = species)) +\n  geom_density(linewidth = 0.75)\n\n\n\n\n\n\n\n\nWe’ve also customized the thickness of the lines using the linewidth argument in order to make them stand out a bit more against the background.\nAdditionally, we can map species to both color and fill aesthetics and use the alpha aesthetic to add transparency to the filled density curves. This aesthetic takes values between 0 (completely transparent) and 1 (completely opaque). In the following plot it’s set to 0.5.\n\nggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +\n  geom_density(alpha = 0.5)\n\n\n\n\n\n\n\n\nNote the terminology we have used here:\n\nWe map variables to aesthetics if we want the visual attribute represented by that aesthetic to vary based on the values of that variable.\nOtherwise, we set the value of an aesthetic.\n\n\n\n2.5.2 Two categorical variables\nWe can use stacked bar plots to visualize the relationship between two categorical variables. For example, the following two stacked bar plots both display the relationship between island and species, or specifically, visualizing the distribution of species within each island.\nThe first plot shows the frequencies of each species of penguins on each island. The plot of frequencies shows that there are equal numbers of Adelies on each island. But we don’t have a good sense of the percentage balance within each island.\n\nggplot(penguins, aes(x = island, fill = species)) +\n  geom_bar()\n\n\n\n\n\n\n\n\nThe second plot, a relative frequency plot created by setting position = \"fill\" in the geom, is more useful for comparing species distributions across islands since it’s not affected by the unequal numbers of penguins across the islands. Using this plot we can see that Gentoo penguins all live on Biscoe island and make up roughly 75% of the penguins on that island, Chinstrap all live on Dream island and make up roughly 50% of the penguins on that island, and Adelie live on all three islands and make up all of the penguins on Torgersen.\n\nggplot(penguins, aes(x = island, fill = species)) +\n  geom_bar(position = \"fill\")\n\n\n\n\n\n\n\n\nIn creating these bar charts, we map the variable that will be separated into bars to the x aesthetic, and the variable that will change the colors inside the bars to the fill aesthetic.\n\n\n2.5.3 Two numerical variables\nSo far you’ve learned about scatterplots (created with geom_point()) and smooth curves (created with geom_smooth()) for visualizing the relationship between two numerical variables. A scatterplot is probably the most commonly used plot for visualizing the relationship between two numerical variables.\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +\n  geom_point()\n\n\n\n\n\n\n\n\n\n\n2.5.4 Three or more variables\nAs we saw in Section 2.2.4, we can incorporate more variables into a plot by mapping them to additional aesthetics. For example, in the following scatterplot the colors of points represent species and the shapes of points represent islands.\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +\n  geom_point(aes(color = species, shape = island))\n\n\n\n\n\n\n\n\nHowever adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way, which is particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.\nTo facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() is a formula3, which you create with ~ followed by a variable name. The variable that you pass to facet_wrap() should be categorical.\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +\n  geom_point(aes(color = species, shape = species)) +\n  facet_wrap(~island)\n\n\n\n\n\n\n\n\nYou will learn about many other geoms for visualizing distributions of variables and relationships between them in Chapter 10.\n\n\n2.5.5 Exercises\n\nThe mpg data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models. Which variables in mpg are categorical? Which variables are numerical? (Hint: Type ?mpg to read the documentation for the dataset.) How can you see this information when you run mpg?\nMake a scatterplot of hwy vs. displ using the mpg data frame. Next, map a third, numerical variable to color, then size, then both color and size, then shape. How do these aesthetics behave differently for categorical vs. numerical variables?\nIn the scatterplot of hwy vs. displ, what happens if you map a third variable to linewidth?\nWhat happens if you map the same variable to multiple aesthetics?\nMake a scatterplot of bill_depth_mm vs. bill_length_mm and color the points by species. What does adding coloring by species reveal about the relationship between these two variables? What about faceting by species?\nWhy does the following yield two separate legends? How would you fix it to combine the two legends?\n\nggplot(\n  data = penguins,\n  mapping = aes(\n    x = bill_length_mm, y = bill_depth_mm, \n    color = species, shape = species\n  )\n) +\n  geom_point() +\n  labs(color = \"Species\")\n\nCreate the two following stacked bar plots. Which question can you answer with the first one? Which question can you answer with the second one?\n\nggplot(penguins, aes(x = island, fill = species)) +\n  geom_bar(position = \"fill\")\nggplot(penguins, aes(x = species, fill = island)) +\n  geom_bar(position = \"fill\")",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#sec-ggsave",
    "href": "data-visualize.html#sec-ggsave",
    "title": "2  Data visualization",
    "section": "2.6 Saving your plots",
    "text": "2.6 Saving your plots\nOnce you’ve made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere. That’s the job of ggsave(), which will save the plot most recently created to disk:\n\nggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +\n  geom_point()\nggsave(filename = \"penguin-plot.png\")\n\nThis will save your plot to your working directory, a concept you’ll learn more about in Chapter 7.\nIf you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about ggsave() in the documentation.\nGenerally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in Chapter 29.\n\n2.6.1 Exercises\n\nRun the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?\n\nggplot(mpg, aes(x = class)) +\n  geom_bar()\nggplot(mpg, aes(x = cty, y = hwy)) +\n  geom_point()\nggsave(\"mpg-plot.png\")\n\nWhat do you need to change in the code above to save the plot as a PDF instead of a PNG? How could you find out what types of image files would work in ggsave()?",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#common-problems",
    "href": "data-visualize.html#common-problems",
    "title": "2  Data visualization",
    "section": "2.7 Common problems",
    "text": "2.7 Common problems\nAs you start to run R code, you’re likely to run into problems. Don’t worry — it happens to everyone. We have all been writing R code for years, but every day we still write code that doesn’t work on the first try!\nStart by carefully comparing the code that you’re running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every \" is paired with another \". Sometimes you’ll run the code and nothing happens. Check the left-hand of your console: if it’s a +, it means that R doesn’t think you’ve typed a complete expression and it’s waiting for you to finish it. In this case, it’s usually easy to start from scratch again by pressing ESCAPE to abort processing the current command.\nOne common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven’t accidentally written code like this:\n\nggplot(data = mpg) \n+ geom_point(mapping = aes(x = displ, y = hwy))\n\nIf you’re still stuck, try the help. You can get help about any R function by running ?function_name in the console, or highlighting the function name and pressing F1 in RStudio. Don’t worry if the help doesn’t seem that helpful - instead skip down to the examples and look for code that matches what you’re trying to do.\nIf that doesn’t help, carefully read the error message. Sometimes the answer will be buried there! But when you’re new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#summary",
    "href": "data-visualize.html#summary",
    "title": "2  Data visualization",
    "section": "2.8 Summary",
    "text": "2.8 Summary\nIn this chapter, you’ve learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, color, size and shape. You then learned about increasing the complexity and improving the presentation of your plots layer-by-layer. You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by leveraging additional aesthetic mappings and/or splitting your plot into small multiples using faceting.\nWe’ll use visualizations again and again throughout this book, introducing new techniques as we need them as well as do a deeper dive into creating visualizations with ggplot2 in Chapter 10 through Chapter 12.\nWith the basics of visualization under your belt, in the next chapter we’re going to switch gears a little and give you some practical workflow advice. We intersperse workflow advice with data science tools throughout this part of the book because it’ll help you stay organized as you write increasing amounts of R code.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "data-visualize.html#footnotes",
    "href": "data-visualize.html#footnotes",
    "title": "2  Data visualization",
    "section": "",
    "text": "You can eliminate that message and force conflict resolution to happen on demand by using the conflicted package, which becomes more important as you load more packages. You can learn more about conflicted at https://conflicted.r-lib.org.↩︎\nHorst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.↩︎\nHere “formula” is the name of the thing created by ~, not a synonym for “equation”.↩︎",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>2</span>  <span class='chapter-title'>Data visualization</span>"
    ]
  },
  {
    "objectID": "workflow-basics.html",
    "href": "workflow-basics.html",
    "title": "3  Workflow: basics",
    "section": "",
    "text": "3.1 Coding basics\nYou now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R because it is such a stickler for punctuation, and even one character out of place can cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.\nBefore we go any further, let’s ensure you’ve got a solid foundation in running R code and that you know some of the most helpful RStudio features.\nLet’s review some basics we’ve omitted so far in the interest of getting you plotting as quickly as possible. You can use R to do basic math calculations:\n1 / 200 * 30\n#&gt; [1] 0.15\n(59 + 73 + 2) / 3\n#&gt; [1] 44.66667\nsin(pi / 2)\n#&gt; [1] 1\nYou can create new objects with the assignment operator &lt;-:\nx &lt;- 3 * 4\nNote that the value of x is not printed, it’s just stored. If you want to view the value, type x in the console.\nYou can combine multiple elements into a vector with c():\nprimes &lt;- c(2, 3, 5, 7, 11, 13)\nAnd basic arithmetic on vectors is applied to every element of of the vector:\nprimes * 2\n#&gt; [1]  4  6 10 14 22 26\nprimes - 1\n#&gt; [1]  1  2  4  6 10 12\nAll R statements where you create objects, assignment statements, have the same form:\nobject_name &lt;- value\nWhen reading that code, say “object name gets value” in your head.\nYou will make lots of assignments, and &lt;- is a pain to type. You can save time with RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automatically surrounds &lt;- with spaces, which is a good code formatting practice. Code can be miserable to read on a good day, so giveyoureyesabreak and use spaces.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>3</span>  <span class='chapter-title'>Workflow: basics</span>"
    ]
  },
  {
    "objectID": "workflow-basics.html#comments",
    "href": "workflow-basics.html#comments",
    "title": "3  Workflow: basics",
    "section": "3.2 Comments",
    "text": "3.2 Comments\nR will ignore any text after # for that line. This allows you to write comments, text that is ignored by R but read by other humans. We’ll sometimes include comments in examples explaining what’s happening with the code.\nComments can be helpful for briefly describing what the following code does.\n\n# create vector of primes\nprimes &lt;- c(2, 3, 5, 7, 11, 13)\n\n# multiply primes by 2\nprimes * 2\n#&gt; [1]  4  6 10 14 22 26\n\nWith short pieces of code like this, leaving a comment for every single line of code might not be necessary. But as the code you’re writing gets more complex, comments can save you (and your collaborators) a lot of time figuring out what was done in the code.\nUse comments to explain the why of your code, not the how or the what. The what and how of your code are always possible to figure out, even if it might be tedious, by carefully reading it. If you describe every step in the comments, and then change the code, you will have to remember to update the comments as well or it will be confusing when you return to your code in the future.\nFiguring out why something was done is much more difficult, if not impossible. For example, geom_smooth() has an argument called span, which controls the smoothness of the curve, with larger values yielding a smoother curve. Suppose you decide to change the value of span from its default of 0.75 to 0.9: it’s easy for a future reader to understand what is happening, but unless you note your thinking in a comment, no one will understand why you changed the default.\nFor data analysis code, use comments to explain your overall plan of attack and record important insights as you encounter them. There’s no way to re-capture this knowledge from the code itself.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>3</span>  <span class='chapter-title'>Workflow: basics</span>"
    ]
  },
  {
    "objectID": "workflow-basics.html#sec-whats-in-a-name",
    "href": "workflow-basics.html#sec-whats-in-a-name",
    "title": "3  Workflow: basics",
    "section": "3.3 What’s in a name?",
    "text": "3.3 What’s in a name?\nObject names must start with a letter and can only contain letters, numbers, _, and .. You want your object names to be descriptive, so you’ll need to adopt a convention for multiple words. We recommend snake_case, where you separate lowercase words with _.\n\ni_use_snake_case\notherPeopleUseCamelCase\nsome.people.use.periods\nAnd_aFew.People_RENOUNCEconvention\n\nWe’ll return to names again when we discuss code style in Chapter 5.\nYou can inspect an object by typing its name:\n\nx\n#&gt; [1] 12\n\nMake another assignment:\n\nthis_is_a_really_long_name &lt;- 2.5\n\nTo inspect this object, try out RStudio’s completion facility: type “this”, press TAB, add characters until you have a unique prefix, then press return.\nLet’s assume you made a mistake, and that the value of this_is_a_really_long_name should be 3.5, not 2.5. You can use another keyboard shortcut to help you fix it. For example, you can press ↑ to bring the last command you typed and edit it. Or, type “this” then press Cmd/Ctrl + ↑ to list all the commands you’ve typed that start with those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.\nMake yet another assignment:\n\nr_rocks &lt;- 2^3\n\nLet’s try to inspect it:\n\nr_rock\n#&gt; Error: object 'r_rock' not found\nR_rocks\n#&gt; Error: object 'R_rocks' not found\n\nThis illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions. If not, you’re likely to get an error that says the object you’re looking for was not found. Typos matter; R can’t read your mind and say, “oh, they probably meant r_rocks when they typed r_rock”. Case matters; similarly, R can’t read your mind and say, “oh, they probably meant r_rocks when they typed R_rocks”.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>3</span>  <span class='chapter-title'>Workflow: basics</span>"
    ]
  },
  {
    "objectID": "workflow-basics.html#calling-functions",
    "href": "workflow-basics.html#calling-functions",
    "title": "3  Workflow: basics",
    "section": "3.4 Calling functions",
    "text": "3.4 Calling functions\nR has a large collection of built-in functions that are called like this:\n\nfunction_name(argument1 = value1, argument2 = value2, ...)\n\nLet’s try using seq(), which makes regular sequences of numbers, and while we’re at it, learn more helpful features of RStudio. Type se and hit TAB. A popup shows you possible completions. Specify seq() by typing more (a q) to disambiguate or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.\nWhen you’ve selected the function you want, press TAB again. RStudio will add matching opening (() and closing ()) parentheses for you. Type the name of the first argument, from, and set it equal to 1. Then, type the name of the second argument, to, and set it equal to 10. Finally, hit return.\n\nseq(from = 1, to = 10)\n#&gt;  [1]  1  2  3  4  5  6  7  8  9 10\n\nWe often omit the names of the first several arguments in function calls, so we can rewrite this as follows:\n\nseq(1, 10)\n#&gt;  [1]  1  2  3  4  5  6  7  8  9 10\n\nType the following code and notice that RStudio provides similar assistance with the paired quotation marks:\n\nx &lt;- \"hello world\"\n\nQuotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it’s still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character “+”:\n&gt; x &lt;- \"hello\n+\nThe + tells you that R is waiting for more input; it doesn’t think you’re done yet. Usually, this means you’ve forgotten either a \" or a ). Either add the missing pair, or press ESCAPE to abort the expression and try again.\nNote that the environment tab in the upper right pane displays all of the objects that you’ve created:",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>3</span>  <span class='chapter-title'>Workflow: basics</span>"
    ]
  },
  {
    "objectID": "workflow-basics.html#exercises",
    "href": "workflow-basics.html#exercises",
    "title": "3  Workflow: basics",
    "section": "3.5 Exercises",
    "text": "3.5 Exercises\n\nWhy does this code not work?\n\nmy_variable &lt;- 10\nmy_varıable\n#&gt; Error: object 'my_varıable' not found\n\nLook carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)\nTweak each of the following R commands so that they run correctly:\n\nlibary(todyverse)\n\nggplot(dTA = mpg) + \n  geom_point(maping = aes(x = displ y = hwy)) +\n  geom_smooth(method = \"lm)\n\nPress Option + Shift + K / Alt + Shift + K. What happens? How can you get to the same place using the menus?\nLet’s revisit an exercise from the Section 2.6. Run the following lines of code. Which of the two plots is saved as mpg-plot.png? Why?\n\nmy_bar_plot &lt;- ggplot(mpg, aes(x = class)) +\n  geom_bar()\nmy_scatter_plot &lt;- ggplot(mpg, aes(x = cty, y = hwy)) +\n  geom_point()\nggsave(filename = \"mpg-plot.png\", plot = my_bar_plot)",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>3</span>  <span class='chapter-title'>Workflow: basics</span>"
    ]
  },
  {
    "objectID": "workflow-basics.html#summary",
    "href": "workflow-basics.html#summary",
    "title": "3  Workflow: basics",
    "section": "3.6 Summary",
    "text": "3.6 Summary\nNow that you’ve learned a little more about how R code works, and some tips to help you understand your code when you come back to it in the future. In the next chapter, we’ll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it’s selecting important variables, filtering down to rows of interest, or computing summary statistics.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>3</span>  <span class='chapter-title'>Workflow: basics</span>"
    ]
  },
  {
    "objectID": "data-transform.html",
    "href": "data-transform.html",
    "title": "4  Data transformation",
    "section": "",
    "text": "4.1 Introduction\nVisualization is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need to make the graph you want. Often you’ll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights that departed from New York City in 2013.\nThe goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame, then circle back to talk more about the pipe, an important tool that you use to combine verbs. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action. In later chapters, we’ll return to the functions in more detail as we start to dig into specific types of data (e.g., numbers, strings, dates).",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "data-transform.html#introduction",
    "href": "data-transform.html#introduction",
    "title": "4  Data transformation",
    "section": "",
    "text": "4.1.1 Prerequisites\nIn this chapter, we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package and use ggplot2 to help us understand the data.\n\nlibrary(nycflights13)\nlibrary(tidyverse)\n#&gt; ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──\n#&gt; ✔ dplyr     1.1.4     ✔ readr     2.1.5\n#&gt; ✔ forcats   1.0.0     ✔ stringr   1.5.1\n#&gt; ✔ ggplot2   3.5.1     ✔ tibble    3.2.1\n#&gt; ✔ lubridate 1.9.3     ✔ tidyr     1.3.1\n#&gt; ✔ purrr     1.0.2     \n#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──\n#&gt; ✖ dplyr::filter() masks stats::filter()\n#&gt; ✖ dplyr::lag()    masks stats::lag()\n#&gt; ℹ Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors\n\nTake careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: stats::filter() and stats::lag(). So far, we’ve mostly ignored which package a function comes from because it doesn’t usually matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: packagename::functionname().\n\n\n4.1.2 nycflights13\nTo explore the basic dplyr verbs, we will use nycflights13::flights. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics and is documented in ?flights.\n\nflights\n#&gt; # A tibble: 336,776 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nflights is a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference between tibbles and data frames is the way tibbles print; they are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably View(flights), which opens an interactive, scrollable, and filterable view. Otherwise you can use print(flights, width = Inf) to show all columns, or use glimpse():\n\nglimpse(flights)\n#&gt; Rows: 336,776\n#&gt; Columns: 19\n#&gt; $ year           &lt;int&gt; 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…\n#&gt; $ month          &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…\n#&gt; $ day            &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…\n#&gt; $ dep_time       &lt;int&gt; 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…\n#&gt; $ sched_dep_time &lt;int&gt; 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…\n#&gt; $ dep_delay      &lt;dbl&gt; 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…\n#&gt; $ arr_time       &lt;int&gt; 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…\n#&gt; $ sched_arr_time &lt;int&gt; 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…\n#&gt; $ arr_delay      &lt;dbl&gt; 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…\n#&gt; $ carrier        &lt;chr&gt; \"UA\", \"UA\", \"AA\", \"B6\", \"DL\", \"UA\", \"B6\", \"EV\", \"B6\"…\n#&gt; $ flight         &lt;int&gt; 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…\n#&gt; $ tailnum        &lt;chr&gt; \"N14228\", \"N24211\", \"N619AA\", \"N804JB\", \"N668DN\", \"N…\n#&gt; $ origin         &lt;chr&gt; \"EWR\", \"LGA\", \"JFK\", \"JFK\", \"LGA\", \"EWR\", \"EWR\", \"LG…\n#&gt; $ dest           &lt;chr&gt; \"IAH\", \"IAH\", \"MIA\", \"BQN\", \"ATL\", \"ORD\", \"FLL\", \"IA…\n#&gt; $ air_time       &lt;dbl&gt; 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…\n#&gt; $ distance       &lt;dbl&gt; 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…\n#&gt; $ hour           &lt;dbl&gt; 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…\n#&gt; $ minute         &lt;dbl&gt; 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…\n#&gt; $ time_hour      &lt;dttm&gt; 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…\n\nIn both views, the variable names are followed by abbreviations that tell you the type of each variable: &lt;int&gt; is short for integer, &lt;dbl&gt; is short for double (aka real numbers), &lt;chr&gt; for character (aka strings), and &lt;dttm&gt; for date-time. These are important because the operations you can perform on a column depend heavily on its “type.”\n\n\n4.1.3 dplyr basics\nYou’re about to learn the primary dplyr verbs (functions), which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:\n\nThe first argument is always a data frame.\nThe subsequent arguments typically describe which columns to operate on using the variable names (without quotes).\nThe output is always a new data frame.\n\nBecause each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we’ll do so with the pipe, |&gt;. We’ll discuss the pipe more in Section 4.4, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that x |&gt; f(y) is equivalent to f(x, y), and x |&gt; f(y) |&gt; g(z) is equivalent to g(f(x, y), z). The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:\n\nflights |&gt;\n  filter(dest == \"IAH\") |&gt; \n  group_by(year, month, day) |&gt; \n  summarize(\n    arr_delay = mean(arr_delay, na.rm = TRUE)\n  )\n\ndplyr’s verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables. In the following sections, you’ll learn the most important verbs for rows, columns, and groups. Then, we’ll return to the join verbs that work on tables in Chapter 20. Let’s dive in!",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "data-transform.html#rows",
    "href": "data-transform.html#rows",
    "title": "4  Data transformation",
    "section": "4.2 Rows",
    "text": "4.2 Rows\nThe most important verbs that operate on rows of a dataset are filter(), which changes which rows are present without changing their order, and arrange(), which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. We’ll also discuss distinct() which finds rows with unique values. Unlike arrange() and filter() it can also optionally modify the columns.\n\n4.2.1 filter()\nfilter() allows you to keep rows based on the values of the columns1. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that departed more than 120 minutes (two hours) late:\n\nflights |&gt; \n  filter(dep_delay &gt; 120)\n#&gt; # A tibble: 9,723 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      848           1835       853     1001           1950\n#&gt; 2  2013     1     1      957            733       144     1056            853\n#&gt; 3  2013     1     1     1114            900       134     1447           1222\n#&gt; 4  2013     1     1     1540           1338       122     2020           1825\n#&gt; 5  2013     1     1     1815           1325       290     2120           1542\n#&gt; 6  2013     1     1     1842           1422       260     1958           1535\n#&gt; # ℹ 9,717 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nAs well as &gt; (greater than), you can use &gt;= (greater than or equal to), &lt; (less than), &lt;= (less than or equal to), == (equal to), and != (not equal to). You can also combine conditions with & or , to indicate “and” (check for both conditions) or with | to indicate “or” (check for either condition):\n\n# Flights that departed on January 1\nflights |&gt; \n  filter(month == 1 & day == 1)\n#&gt; # A tibble: 842 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 836 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\n# Flights that departed in January or February\nflights |&gt; \n  filter(month == 1 | month == 2)\n#&gt; # A tibble: 51,955 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 51,949 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nThere’s a useful shortcut when you’re combining | and ==: %in%. It keeps rows where the variable equals one of the values on the right:\n\n# A shorter way to select flights that departed in January or February\nflights |&gt; \n  filter(month %in% c(1, 2))\n#&gt; # A tibble: 51,955 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 51,949 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nWe’ll come back to these comparisons and logical operators in more detail in Chapter 13.\nWhen you run filter() dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, &lt;-:\n\njan1 &lt;- flights |&gt; \n  filter(month == 1 & day == 1)\n\n\n\n4.2.2 Common mistakes\nWhen you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. filter() will let you know when this happens:\n\nflights |&gt; \n  filter(month = 1)\n#&gt; Error in `filter()`:\n#&gt; ! We detected a named input.\n#&gt; ℹ This usually means that you've used `=` instead of `==`.\n#&gt; ℹ Did you mean `month == 1`?\n\nAnother mistakes is you write “or” statements like you would in English:\n\nflights |&gt; \n  filter(month == 1 | 2)\n\nThis “works”, in the sense that it doesn’t throw an error, but it doesn’t do what you want because | first checks the condition month == 1 and then checks the condition 2, which is not a sensible condition to check. We’ll learn more about what’s happening here and why in Section 13.3.2.\n\n\n4.2.3 arrange()\narrange() changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns. For example, the following code sorts by the departure time, which is spread over four columns. We get the earliest years first, then within a year, the earliest months, etc.\n\nflights |&gt; \n  arrange(year, month, day, dep_time)\n#&gt; # A tibble: 336,776 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nYou can use desc() on a column inside of arrange() to re-order the data frame based on that column in descending (big-to-small) order. For example, this code orders flights from most to least delayed:\n\nflights |&gt; \n  arrange(desc(dep_delay))\n#&gt; # A tibble: 336,776 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     9      641            900      1301     1242           1530\n#&gt; 2  2013     6    15     1432           1935      1137     1607           2120\n#&gt; 3  2013     1    10     1121           1635      1126     1239           1810\n#&gt; 4  2013     9    20     1139           1845      1014     1457           2210\n#&gt; 5  2013     7    22      845           1600      1005     1044           1815\n#&gt; 6  2013     4    10     1100           1900       960     1342           2211\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nNote that the number of rows has not changed – we’re only arranging the data, we’re not filtering it.\n\n\n4.2.4 distinct()\ndistinct() finds all the unique rows in a dataset, so technically, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:\n\n# Remove duplicate rows, if any\nflights |&gt; \n  distinct()\n#&gt; # A tibble: 336,776 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\n# Find all unique origin and destination pairs\nflights |&gt; \n  distinct(origin, dest)\n#&gt; # A tibble: 224 × 2\n#&gt;   origin dest \n#&gt;   &lt;chr&gt;  &lt;chr&gt;\n#&gt; 1 EWR    IAH  \n#&gt; 2 LGA    IAH  \n#&gt; 3 JFK    MIA  \n#&gt; 4 JFK    BQN  \n#&gt; 5 LGA    ATL  \n#&gt; 6 EWR    ORD  \n#&gt; # ℹ 218 more rows\n\nAlternatively, if you want to keep other columns when filtering for unique rows, you can use the .keep_all = TRUE option.\n\nflights |&gt; \n  distinct(origin, dest, .keep_all = TRUE)\n#&gt; # A tibble: 224 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 218 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nIt’s not a coincidence that all of these distinct flights are on January 1: distinct() will find the first occurrence of a unique row in the dataset and discard the rest.\nIf you want to find the number of occurrences instead, you’re better off swapping distinct() for count(). With the sort = TRUE argument, you can arrange them in descending order of the number of occurrences. You’ll learn more about count in Section 14.3.\n\nflights |&gt;\n  count(origin, dest, sort = TRUE)\n#&gt; # A tibble: 224 × 3\n#&gt;   origin dest      n\n#&gt;   &lt;chr&gt;  &lt;chr&gt; &lt;int&gt;\n#&gt; 1 JFK    LAX   11262\n#&gt; 2 LGA    ATL   10263\n#&gt; 3 LGA    ORD    8857\n#&gt; 4 JFK    SFO    8204\n#&gt; 5 LGA    CLT    6168\n#&gt; 6 EWR    ORD    6100\n#&gt; # ℹ 218 more rows\n\n\n\n4.2.5 Exercises\n\nIn a single pipeline for each condition, find all flights that meet the condition:\n\nHad an arrival delay of two or more hours\nFlew to Houston (IAH or HOU)\nWere operated by United, American, or Delta\nDeparted in summer (July, August, and September)\nArrived more than two hours late but didn’t leave late\nWere delayed by at least an hour, but made up over 30 minutes in flight\n\nSort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.\nSort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)\nWas there a flight on every day of 2013?\nWhich flights traveled the farthest distance? Which traveled the least distance?\nDoes it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "data-transform.html#columns",
    "href": "data-transform.html#columns",
    "title": "4  Data transformation",
    "section": "4.3 Columns",
    "text": "4.3 Columns\nThere are four important verbs that affect the columns without changing the rows: mutate() creates new columns that are derived from the existing columns, select() changes which columns are present, rename() changes the names of the columns, and relocate() changes the positions of the columns.\n\n4.3.1 mutate()\nThe job of mutate() is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:\n\nflights |&gt; \n  mutate(\n    gain = dep_delay - arr_delay,\n    speed = distance / air_time * 60\n  )\n#&gt; # A tibble: 336,776 × 21\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 13 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nBy default, mutate() adds new columns on the right-hand side of your dataset, which makes it difficult to see what’s happening here. We can use the .before argument to instead add the variables to the left-hand side2:\n\nflights |&gt; \n  mutate(\n    gain = dep_delay - arr_delay,\n    speed = distance / air_time * 60,\n    .before = 1\n  )\n#&gt; # A tibble: 336,776 × 21\n#&gt;    gain speed  year month   day dep_time sched_dep_time dep_delay arr_time\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;\n#&gt; 1    -9  370.  2013     1     1      517            515         2      830\n#&gt; 2   -16  374.  2013     1     1      533            529         4      850\n#&gt; 3   -31  408.  2013     1     1      542            540         2      923\n#&gt; 4    17  517.  2013     1     1      544            545        -1     1004\n#&gt; 5    19  394.  2013     1     1      554            600        -6      812\n#&gt; 6   -16  288.  2013     1     1      554            558        -4      740\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 12 more variables: sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, …\n\nThe . indicates that .before is an argument to the function, not the name of a third new variable we are creating. You can also use .after to add after a variable, and in both .before and .after you can use the variable name instead of a position. For example, we could add the new variables after day:\n\nflights |&gt; \n  mutate(\n    gain = dep_delay - arr_delay,\n    speed = distance / air_time * 60,\n    .after = day\n  )\n\nAlternatively, you can control which variables are kept with the .keep argument. A particularly useful argument is \"used\" which specifies that we only keep the columns that were involved or created in the mutate() step. For example, the following output will contain only the variables dep_delay, arr_delay, air_time, gain, hours, and gain_per_hour.\n\nflights |&gt; \n  mutate(\n    gain = dep_delay - arr_delay,\n    hours = air_time / 60,\n    gain_per_hour = gain / hours,\n    .keep = \"used\"\n  )\n\nNote that since we haven’t assigned the result of the above computation back to flights, the new variables gain, hours, and gain_per_hour will only be printed but will not be stored in a data frame. And if we want them to be available in a data frame for future use, we should think carefully about whether we want the result to be assigned back to flights, overwriting the original data frame with many more variables, or to a new object. Often, the right answer is a new object that is named informatively to indicate its contents, e.g., delay_gain, but you might also have good reasons for overwriting flights.\n\n\n4.3.2 select()\nIt’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables:\n\nSelect columns by name:\n\nflights |&gt; \n  select(year, month, day)\n\nSelect all columns between year and day (inclusive):\n\nflights |&gt; \n  select(year:day)\n\nSelect all columns except those from year to day (inclusive):\n\nflights |&gt; \n  select(!year:day)\n\nHistorically this operation was done with - instead of !, so you’re likely to see that in the wild. These two operators serve the same purpose but with subtle differences in behavior. We recommend using ! because it reads as “not” and combines well with & and |.\nSelect all columns that are characters:\n\nflights |&gt; \n  select(where(is.character))\n\n\nThere are a number of helper functions you can use within select():\n\nstarts_with(\"abc\"): matches names that begin with “abc”.\nends_with(\"xyz\"): matches names that end with “xyz”.\ncontains(\"ijk\"): matches names that contain “ijk”.\nnum_range(\"x\", 1:3): matches x1, x2 and x3.\n\nSee ?select for more details. Once you know regular expressions (the topic of Chapter 16) you’ll also be able to use matches() to select variables that match a pattern.\nYou can rename variables as you select() them by using =. The new name appears on the left-hand side of the =, and the old variable appears on the right-hand side:\n\nflights |&gt; \n  select(tail_num = tailnum)\n#&gt; # A tibble: 336,776 × 1\n#&gt;   tail_num\n#&gt;   &lt;chr&gt;   \n#&gt; 1 N14228  \n#&gt; 2 N24211  \n#&gt; 3 N619AA  \n#&gt; 4 N804JB  \n#&gt; 5 N668DN  \n#&gt; 6 N39463  \n#&gt; # ℹ 336,770 more rows\n\n\n\n4.3.3 rename()\nIf you want to keep all the existing variables and just want to rename a few, you can use rename() instead of select():\n\nflights |&gt; \n  rename(tail_num = tailnum)\n#&gt; # A tibble: 336,776 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nIf you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names() which provides some useful automated cleaning.\n\n\n4.3.4 relocate()\nUse relocate() to move variables around. You might want to collect related variables together or move important variables to the front. By default relocate() moves variables to the front:\n\nflights |&gt; \n  relocate(time_hour, air_time)\n#&gt; # A tibble: 336,776 × 19\n#&gt;   time_hour           air_time  year month   day dep_time sched_dep_time\n#&gt;   &lt;dttm&gt;                 &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1 2013-01-01 05:00:00      227  2013     1     1      517            515\n#&gt; 2 2013-01-01 05:00:00      227  2013     1     1      533            529\n#&gt; 3 2013-01-01 05:00:00      160  2013     1     1      542            540\n#&gt; 4 2013-01-01 05:00:00      183  2013     1     1      544            545\n#&gt; 5 2013-01-01 06:00:00      116  2013     1     1      554            600\n#&gt; 6 2013-01-01 05:00:00      150  2013     1     1      554            558\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 12 more variables: dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;, …\n\nYou can also specify where to put them using the .before and .after arguments, just like in mutate():\n\nflights |&gt; \n  relocate(year:dep_time, .after = time_hour)\nflights |&gt; \n  relocate(starts_with(\"arr\"), .before = dep_time)\n\n\n\n4.3.5 Exercises\n\nCompare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?\nBrainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.\nWhat happens if you specify the name of the same variable multiple times in a select() call?\nWhat does the any_of() function do? Why might it be helpful in conjunction with this vector?\n\nvariables &lt;- c(\"year\", \"month\", \"day\", \"dep_delay\", \"arr_delay\")\n\nDoes the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?\n\nflights |&gt; select(contains(\"TIME\"))\n\nRename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.\nWhy doesn’t the following work, and what does the error mean?\n\nflights |&gt; \n  select(tailnum) |&gt; \n  arrange(arr_delay)\n#&gt; Error in `arrange()`:\n#&gt; ℹ In argument: `..1 = arr_delay`.\n#&gt; Caused by error:\n#&gt; ! object 'arr_delay' not found",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "data-transform.html#sec-the-pipe",
    "href": "data-transform.html#sec-the-pipe",
    "title": "4  Data transformation",
    "section": "4.4 The pipe",
    "text": "4.4 The pipe\nWe’ve shown you simple examples of the pipe above, but its real power arises when you start to combine multiple verbs. For example, imagine that you wanted to find the fastest flights to Houston’s IAH airport: you need to combine filter(), mutate(), select(), and arrange():\n\nflights |&gt; \n  filter(dest == \"IAH\") |&gt; \n  mutate(speed = distance / air_time * 60) |&gt; \n  select(year:day, dep_time, carrier, flight, speed) |&gt; \n  arrange(desc(speed))\n#&gt; # A tibble: 7,198 × 7\n#&gt;    year month   day dep_time carrier flight speed\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt; &lt;chr&gt;    &lt;int&gt; &lt;dbl&gt;\n#&gt; 1  2013     7     9      707 UA         226  522.\n#&gt; 2  2013     8    27     1850 UA        1128  521.\n#&gt; 3  2013     8    28      902 UA        1711  519.\n#&gt; 4  2013     8    28     2122 UA        1022  519.\n#&gt; 5  2013     6    11     1628 UA        1178  515.\n#&gt; 6  2013     8    27     1017 UA         333  515.\n#&gt; # ℹ 7,192 more rows\n\nEven though this pipeline has four steps, it’s easy to skim because the verbs come at the start of each line: start with the flights data, then filter, then mutate, then select, then arrange.\nWhat would happen if we didn’t have the pipe? We could nest each function call inside the previous call:\n\narrange(\n  select(\n    mutate(\n      filter(\n        flights, \n        dest == \"IAH\"\n      ),\n      speed = distance / air_time * 60\n    ),\n    year:day, dep_time, carrier, flight, speed\n  ),\n  desc(speed)\n)\n\nOr we could use a bunch of intermediate objects:\n\nflights1 &lt;- filter(flights, dest == \"IAH\")\nflights2 &lt;- mutate(flights1, speed = distance / air_time * 60)\nflights3 &lt;- select(flights2, year:day, dep_time, carrier, flight, speed)\narrange(flights3, desc(speed))\n\nWhile both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.\nTo add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use |&gt; instead of %&gt;% as shown in Figure 4.1; more on %&gt;% shortly.\n\n\n\n\n\n\n\n\nFigure 4.1: To insert |&gt;, make sure the “Use native pipe operator” option is checked.\n\n\n\n\n\n\n\n\n\n\n\nmagrittr\n\n\n\nIf you’ve been using the tidyverse for a while, you might be familiar with the %&gt;% pipe provided by the magrittr package. The magrittr package is included in the core tidyverse, so you can use %&gt;% whenever you load the tidyverse:\n\nlibrary(tidyverse)\n\nmtcars %&gt;% \n  group_by(cyl) %&gt;%\n  summarize(n = n())\n\nFor simple cases, |&gt; and %&gt;% behave identically. So why do we recommend the base pipe? Firstly, because it’s part of base R, it’s always available for you to use, even when you’re not using the tidyverse. Secondly, |&gt; is quite a bit simpler than %&gt;%: in the time between the invention of %&gt;% in 2014 and the inclusion of |&gt; in R 4.1.0 in 2021, we gained a better understanding of the pipe. This allowed the base implementation to jettison infrequently used and less important features.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "data-transform.html#groups",
    "href": "data-transform.html#groups",
    "title": "4  Data transformation",
    "section": "4.5 Groups",
    "text": "4.5 Groups\nSo far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: group_by(), summarize(), and the slice family of functions.\n\n4.5.1 group_by()\nUse group_by() to divide your dataset into groups meaningful for your analysis:\n\nflights |&gt; \n  group_by(month)\n#&gt; # A tibble: 336,776 × 19\n#&gt; # Groups:   month [12]\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\ngroup_by() doesn’t change the data but, if you look closely at the output, you’ll notice that the output indicates that it is “grouped by” month (Groups: month [12]). This means subsequent operations will now work “by month”. group_by() adds this grouped feature (referred to as class) to the data frame, which changes the behavior of the subsequent verbs applied to the data.\n\n\n4.5.2 summarize()\nThe most important grouped operation is a summary, which, if being used to calculate a single summary statistic, reduces the data frame to have a single row for each group. In dplyr, this operation is performed by summarize()3, as shown by the following example, which computes the average departure delay by month:\n\nflights |&gt; \n  group_by(month) |&gt; \n  summarize(\n    avg_delay = mean(dep_delay)\n  )\n#&gt; # A tibble: 12 × 2\n#&gt;   month avg_delay\n#&gt;   &lt;int&gt;     &lt;dbl&gt;\n#&gt; 1     1        NA\n#&gt; 2     2        NA\n#&gt; 3     3        NA\n#&gt; 4     4        NA\n#&gt; 5     5        NA\n#&gt; 6     6        NA\n#&gt; # ℹ 6 more rows\n\nUh-oh! Something has gone wrong, and all of our results are NAs (pronounced “N-A”), R’s symbol for missing value. This happened because some of the observed flights had missing data in the delay column, and so when we calculated the mean including those values, we got an NA result. We’ll come back to discuss missing values in detail in Chapter 19, but for now, we’ll tell the mean() function to ignore all missing values by setting the argument na.rm to TRUE:\n\nflights |&gt; \n  group_by(month) |&gt; \n  summarize(\n    avg_delay = mean(dep_delay, na.rm = TRUE)\n  )\n#&gt; # A tibble: 12 × 2\n#&gt;   month avg_delay\n#&gt;   &lt;int&gt;     &lt;dbl&gt;\n#&gt; 1     1      10.0\n#&gt; 2     2      10.8\n#&gt; 3     3      13.2\n#&gt; 4     4      13.9\n#&gt; 5     5      13.0\n#&gt; 6     6      20.8\n#&gt; # ℹ 6 more rows\n\nYou can create any number of summaries in a single call to summarize(). You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is n(), which returns the number of rows in each group:\n\nflights |&gt; \n  group_by(month) |&gt; \n  summarize(\n    avg_delay = mean(dep_delay, na.rm = TRUE), \n    n = n()\n  )\n#&gt; # A tibble: 12 × 3\n#&gt;   month avg_delay     n\n#&gt;   &lt;int&gt;     &lt;dbl&gt; &lt;int&gt;\n#&gt; 1     1      10.0 27004\n#&gt; 2     2      10.8 24951\n#&gt; 3     3      13.2 28834\n#&gt; 4     4      13.9 28330\n#&gt; 5     5      13.0 28796\n#&gt; 6     6      20.8 28243\n#&gt; # ℹ 6 more rows\n\nMeans and counts can get you a surprisingly long way in data science!\n\n\n4.5.3 The slice_ functions\nThere are five handy functions that allow you to extract specific rows within each group:\n\ndf |&gt; slice_head(n = 1) takes the first row from each group.\ndf |&gt; slice_tail(n = 1) takes the last row in each group.\ndf |&gt; slice_min(x, n = 1) takes the row with the smallest value of column x.\ndf |&gt; slice_max(x, n = 1) takes the row with the largest value of column x.\ndf |&gt; slice_sample(n = 1) takes one random row.\n\nYou can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group. For example, the following code finds the flights that are most delayed upon arrival at each destination:\n\nflights |&gt; \n  group_by(dest) |&gt; \n  slice_max(arr_delay, n = 1) |&gt;\n  relocate(dest)\n#&gt; # A tibble: 108 × 19\n#&gt; # Groups:   dest [105]\n#&gt;   dest   year month   day dep_time sched_dep_time dep_delay arr_time\n#&gt;   &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;\n#&gt; 1 ABQ    2013     7    22     2145           2007        98      132\n#&gt; 2 ACK    2013     7    23     1139            800       219     1250\n#&gt; 3 ALB    2013     1    25      123           2000       323      229\n#&gt; 4 ANC    2013     8    17     1740           1625        75     2042\n#&gt; 5 ATL    2013     7    22     2257            759       898      121\n#&gt; 6 AUS    2013     7    10     2056           1505       351     2347\n#&gt; # ℹ 102 more rows\n#&gt; # ℹ 11 more variables: sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, …\n\nNote that there are 105 destinations but we get 108 rows here. What’s up? slice_min() and slice_max() keep tied values so n = 1 means give us all rows with the highest value. If you want exactly one row per group you can set with_ties = FALSE.\nThis is similar to computing the max delay with summarize(), but you get the whole corresponding row (or rows if there’s a tie) instead of the single summary statistic.\n\n\n4.5.4 Grouping by multiple variables\nYou can create groups using more than one variable. For example, we could make a group for each date.\n\ndaily &lt;- flights |&gt;  \n  group_by(year, month, day)\ndaily\n#&gt; # A tibble: 336,776 × 19\n#&gt; # Groups:   year, month, day [365]\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nWhen you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:\n\ndaily_flights &lt;- daily |&gt; \n  summarize(n = n())\n#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using\n#&gt; the `.groups` argument.\n\nIf you’re happy with this behavior, you can explicitly request it in order to suppress the message:\n\ndaily_flights &lt;- daily |&gt; \n  summarize(\n    n = n(), \n    .groups = \"drop_last\"\n  )\n\nAlternatively, change the default behavior by setting a different value, e.g., \"drop\" to drop all grouping or \"keep\" to preserve the same groups.\n\n\n4.5.5 Ungrouping\nYou might also want to remove grouping from a data frame without using summarize(). You can do this with ungroup().\n\ndaily |&gt; \n  ungroup()\n#&gt; # A tibble: 336,776 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nNow let’s see what happens when you summarize an ungrouped data frame.\n\ndaily |&gt; \n  ungroup() |&gt;\n  summarize(\n    avg_delay = mean(dep_delay, na.rm = TRUE), \n    flights = n()\n  )\n#&gt; # A tibble: 1 × 2\n#&gt;   avg_delay flights\n#&gt;       &lt;dbl&gt;   &lt;int&gt;\n#&gt; 1      12.6  336776\n\nYou get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.\n\n\n4.5.6 .by\ndplyr 1.1.0 includes a new, experimental, syntax for per-operation grouping, the .by argument. group_by() and ungroup() aren’t going away, but you can now also use the .by argument to group within a single operation:\n\nflights |&gt; \n  summarize(\n    delay = mean(dep_delay, na.rm = TRUE), \n    n = n(),\n    .by = month\n  )\n\nOr if you want to group by multiple variables:\n\nflights |&gt; \n  summarize(\n    delay = mean(dep_delay, na.rm = TRUE), \n    n = n(),\n    .by = c(origin, dest)\n  )\n\n.by works with all verbs and has the advantage that you don’t need to use the .groups argument to suppress the grouping message or ungroup() when you’re done.\nWe didn’t focus on this syntax in this chapter because it was very new when we wrote the book. We did want to mention it because we think it has a lot of promise and it’s likely to be quite popular. You can learn more about it in the dplyr 1.1.0 blog post.\n\n\n4.5.7 Exercises\n\nWhich carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |&gt; group_by(carrier, dest) |&gt; summarize(n()))\nFind the flights that are most delayed upon departure from each destination.\nHow do delays vary over the course of the day? Illustrate your answer with a plot.\nWhat happens if you supply a negative n to slice_min() and friends?\nExplain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?\nSuppose we have the following tiny data frame:\n\ndf &lt;- tibble(\n  x = 1:5,\n  y = c(\"a\", \"b\", \"a\", \"a\", \"b\"),\n  z = c(\"K\", \"K\", \"L\", \"L\", \"K\")\n)\n\n\nWrite down what you think the output will look like, then check if you were correct, and describe what group_by() does.\n\ndf |&gt;\n  group_by(y)\n\nWrite down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also, comment on how it’s different from the group_by() in part (a).\n\ndf |&gt;\n  arrange(y)\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does.\n\ndf |&gt;\n  group_by(y) |&gt;\n  summarize(mean_x = mean(x))\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.\n\ndf |&gt;\n  group_by(y, z) |&gt;\n  summarize(mean_x = mean(x))\n\nWrite down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?\n\ndf |&gt;\n  group_by(y, z) |&gt;\n  summarize(mean_x = mean(x), .groups = \"drop\")\n\nWrite down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?\n\ndf |&gt;\n  group_by(y, z) |&gt;\n  summarize(mean_x = mean(x))\n\ndf |&gt;\n  group_by(y, z) |&gt;\n  mutate(mean_x = mean(x))",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "data-transform.html#sec-sample-size",
    "href": "data-transform.html#sec-sample-size",
    "title": "4  Data transformation",
    "section": "4.6 Case study: aggregates and sample size",
    "text": "4.6 Case study: aggregates and sample size\nWhenever you do any aggregation, it’s always a good idea to include a count (n()). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. We’ll demonstrate this with some baseball data from the Lahman package. Specifically, we will compare what proportion of times a player gets a hit (H) vs. the number of times they try to put the ball in play (AB):\n\nbatters &lt;- Lahman::Batting |&gt; \n  group_by(playerID) |&gt; \n  summarize(\n    performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),\n    n = sum(AB, na.rm = TRUE)\n  )\nbatters\n#&gt; # A tibble: 20,730 × 3\n#&gt;   playerID  performance     n\n#&gt;   &lt;chr&gt;           &lt;dbl&gt; &lt;int&gt;\n#&gt; 1 aardsda01      0          4\n#&gt; 2 aaronha01      0.305  12364\n#&gt; 3 aaronto01      0.229    944\n#&gt; 4 aasedo01       0          5\n#&gt; 5 abadan01       0.0952    21\n#&gt; 6 abadfe01       0.111      9\n#&gt; # ℹ 20,724 more rows\n\nWhen we plot the skill of the batter (measured by the batting average, performance) against the number of opportunities to hit the ball (measured by times at bat, n), you see two patterns:\n\nThe variation in performance is larger among players with fewer at-bats. The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you’ll see that the variation decreases as the sample size increases4.\nThere’s a positive correlation between skill (performance) and opportunities to hit the ball (n) because teams want to give their best batters the most opportunities to hit the ball.\n\n\nbatters |&gt; \n  filter(n &gt; 100) |&gt; \n  ggplot(aes(x = n, y = performance)) +\n  geom_point(alpha = 1 / 10) + \n  geom_smooth(se = FALSE)\n\n\n\n\n\n\n\n\nNote the handy pattern for combining ggplot2 and dplyr. You just have to remember to switch from |&gt;, for dataset processing, to + for adding layers to your plot.\nThis also has important implications for ranking. If you naively sort on desc(performance), the people with the best batting averages are clearly the ones who tried to put the ball in play very few times and happened to get a hit, they’re not necessarily the most skilled players:\n\nbatters |&gt; \n  arrange(desc(performance))\n#&gt; # A tibble: 20,730 × 3\n#&gt;   playerID  performance     n\n#&gt;   &lt;chr&gt;           &lt;dbl&gt; &lt;int&gt;\n#&gt; 1 abramge01           1     1\n#&gt; 2 alberan01           1     1\n#&gt; 3 banisje01           1     1\n#&gt; 4 bartocl01           1     1\n#&gt; 5 bassdo01            1     1\n#&gt; 6 birasst01           1     2\n#&gt; # ℹ 20,724 more rows\n\nYou can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "data-transform.html#summary",
    "href": "data-transform.html#summary",
    "title": "4  Data transformation",
    "section": "4.7 Summary",
    "text": "4.7 Summary\nIn this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like filter() and arrange()), those that manipulate the columns (like select() and mutate()) and those that manipulate groups (like group_by() and summarize()). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with the individual variable. We’ll return to that in the Transform part of the book, where each chapter provides tools for a specific type of variable.\nIn the next chapter, we’ll pivot back to workflow to discuss the importance of code style and keeping your code well organized to make it easy for you and others to read and understand.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "data-transform.html#footnotes",
    "href": "data-transform.html#footnotes",
    "title": "4  Data transformation",
    "section": "",
    "text": "Later, you’ll learn about the slice_*() family, which allows you to choose rows based on their positions.↩︎\nRemember that in RStudio, the easiest way to see a dataset with many columns is View().↩︎\nOr summarise(), if you prefer British English.↩︎\n*cough* the law of large numbers *cough*.↩︎",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>4</span>  <span class='chapter-title'>Data transformation</span>"
    ]
  },
  {
    "objectID": "workflow-style.html",
    "href": "workflow-style.html",
    "title": "5  Workflow: code style",
    "section": "",
    "text": "5.1 Names\nGood coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer, it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work and is particularly important if you need to get help from someone else. This chapter will introduce the most important points of the tidyverse style guide, which is used throughout this book.\nStyling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the styler package by Lorenz Walthert. Once you’ve installed it with install.packages(\"styler\"), an easy way to use it is via RStudio’s command palette. The command palette lets you use any built-in RStudio command and many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts offered by styler. Figure 5.1 shows the results.\nWe’ll use the tidyverse and nycflights13 packages for code examples in this chapter.\nWe talked briefly about names in Section 3.3. Remember that variable names (those created by &lt;- and those created by mutate()) should use only lowercase letters, numbers, and _. Use _ to separate words within a name.\n# Strive for:\nshort_flights &lt;- flights |&gt; filter(air_time &lt; 60)\n\n# Avoid:\nSHORTFLIGHTS &lt;- flights |&gt; filter(air_time &lt; 60)\nAs a general rule of thumb, it’s better to prefer long, descriptive names that are easy to understand rather than concise names that are fast to type. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but it can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.\nIf you have a bunch of names for related things, do your best to be consistent. It’s easy for inconsistencies to arise when you forget a previous convention, so don’t feel bad if you have to go back and rename things. In general, if you have a bunch of variables that are a variation on a theme, you’re better off giving them a common prefix rather than a common suffix because autocomplete works best on the start of a variable.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>5</span>  <span class='chapter-title'>Workflow: code style</span>"
    ]
  },
  {
    "objectID": "workflow-style.html#spaces",
    "href": "workflow-style.html#spaces",
    "title": "5  Workflow: code style",
    "section": "5.2 Spaces",
    "text": "5.2 Spaces\nPut spaces on either side of mathematical operators apart from ^ (i.e. +, -, ==, &lt;, …), and around the assignment operator (&lt;-).\n\n# Strive for\nz &lt;- (a + b)^2 / d\n\n# Avoid\nz&lt;-( a + b ) ^ 2/d\n\nDon’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in standard English.\n\n# Strive for\nmean(x, na.rm = TRUE)\n\n# Avoid\nmean (x ,na.rm=TRUE)\n\nIt’s OK to add extra spaces if it improves alignment. For example, if you’re creating multiple variables in mutate(), you might want to add spaces so that all the = line up.1 This makes it easier to skim the code.\n\nflights |&gt; \n  mutate(\n    speed      = distance / air_time,\n    dep_hour   = dep_time %/% 100,\n    dep_minute = dep_time %%  100\n  )",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>5</span>  <span class='chapter-title'>Workflow: code style</span>"
    ]
  },
  {
    "objectID": "workflow-style.html#sec-pipes",
    "href": "workflow-style.html#sec-pipes",
    "title": "5  Workflow: code style",
    "section": "5.3 Pipes",
    "text": "5.3 Pipes\n|&gt; should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 10,000 ft view by skimming the verbs on the left-hand side.\n\n# Strive for \nflights |&gt;  \n  filter(!is.na(arr_delay), !is.na(tailnum)) |&gt; \n  count(dest)\n\n# Avoid\nflights|&gt;filter(!is.na(arr_delay), !is.na(tailnum))|&gt;count(dest)\n\nIf the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line. If the function doesn’t have named arguments (like select() or filter()), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.\n\n# Strive for\nflights |&gt;  \n  group_by(tailnum) |&gt; \n  summarize(\n    delay = mean(arr_delay, na.rm = TRUE),\n    n = n()\n  )\n\n# Avoid\nflights |&gt;\n  group_by(\n    tailnum\n  ) |&gt; \n  summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())\n\nAfter the first step of the pipeline, indent each line by two spaces. RStudio will automatically put the spaces in for you after a line break following a |&gt; . If you’re putting each argument on its own line, indent by an extra two spaces. Make sure ) is on its own line, and un-indented to match the horizontal position of the function name.\n\n# Strive for \nflights |&gt;  \n  group_by(tailnum) |&gt; \n  summarize(\n    delay = mean(arr_delay, na.rm = TRUE),\n    n = n()\n  )\n\n# Avoid\nflights|&gt;\n  group_by(tailnum) |&gt; \n  summarize(\n             delay = mean(arr_delay, na.rm = TRUE), \n             n = n()\n           )\n\n# Avoid\nflights|&gt;\n  group_by(tailnum) |&gt; \n  summarize(\n  delay = mean(arr_delay, na.rm = TRUE), \n  n = n()\n  )\n\nIt’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.\n\n# This fits compactly on one line\ndf |&gt; mutate(y = x + 1)\n\n# While this takes up 4x as many lines, it's easily extended to \n# more variables and more steps in the future\ndf |&gt; \n  mutate(\n    y = x + 1\n  )\n\nFinally, be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name. The names will help cue the reader into what’s happening and makes it easier to check that intermediate results are as expected. Whenever you can give something an informative name, you should give it an informative name, for example when you fundamentally change the structure of the data, e.g., after pivoting or summarizing. Don’t expect to get it right the first time! This means breaking up long pipelines if there are intermediate states that can get good names.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>5</span>  <span class='chapter-title'>Workflow: code style</span>"
    ]
  },
  {
    "objectID": "workflow-style.html#ggplot2",
    "href": "workflow-style.html#ggplot2",
    "title": "5  Workflow: code style",
    "section": "5.4 ggplot2",
    "text": "5.4 ggplot2\nThe same basic rules that apply to the pipe also apply to ggplot2; just treat + the same way as |&gt;.\n\nflights |&gt; \n  group_by(month) |&gt; \n  summarize(\n    delay = mean(arr_delay, na.rm = TRUE)\n  ) |&gt; \n  ggplot(aes(x = month, y = delay)) +\n  geom_point() + \n  geom_line()\n\nAgain, if you can’t fit all of the arguments to a function on to a single line, put each argument on its own line:\n\nflights |&gt; \n  group_by(dest) |&gt; \n  summarize(\n    distance = mean(distance),\n    speed = mean(distance / air_time, na.rm = TRUE)\n  ) |&gt; \n  ggplot(aes(x = distance, y = speed)) +\n  geom_smooth(\n    method = \"loess\",\n    span = 0.5,\n    se = FALSE, \n    color = \"white\", \n    linewidth = 4\n  ) +\n  geom_point()\n\nWatch for the transition from |&gt; to +. We wish this transition wasn’t necessary, but unfortunately, ggplot2 was written before the pipe was discovered.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>5</span>  <span class='chapter-title'>Workflow: code style</span>"
    ]
  },
  {
    "objectID": "workflow-style.html#sectioning-comments",
    "href": "workflow-style.html#sectioning-comments",
    "title": "5  Workflow: code style",
    "section": "5.5 Sectioning comments",
    "text": "5.5 Sectioning comments\nAs your scripts get longer, you can use sectioning comments to break up your file into manageable pieces:\n\n# Load data --------------------------------------\n\n# Plot data --------------------------------------\n\nRStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in Figure 5.2.\n\n\n\n\n\n\n\n\nFigure 5.2: After adding sectioning comments to your script, you can easily navigate to them using the code navigation tool in the bottom-left of the script editor.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>5</span>  <span class='chapter-title'>Workflow: code style</span>"
    ]
  },
  {
    "objectID": "workflow-style.html#exercises",
    "href": "workflow-style.html#exercises",
    "title": "5  Workflow: code style",
    "section": "5.6 Exercises",
    "text": "5.6 Exercises\n\nRestyle the following pipelines following the guidelines above.\n\nflights|&gt;filter(dest==\"IAH\")|&gt;group_by(year,month,day)|&gt;summarize(n=n(),\ndelay=mean(arr_delay,na.rm=TRUE))|&gt;filter(n&gt;10)\n\nflights|&gt;filter(carrier==\"UA\",dest%in%c(\"IAH\",\"HOU\"),sched_dep_time&gt;\n0900,sched_arr_time&lt;2000)|&gt;group_by(flight)|&gt;summarize(delay=mean(\narr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|&gt;filter(n&gt;10)",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>5</span>  <span class='chapter-title'>Workflow: code style</span>"
    ]
  },
  {
    "objectID": "workflow-style.html#summary",
    "href": "workflow-style.html#summary",
    "title": "5  Workflow: code style",
    "section": "5.7 Summary",
    "text": "5.7 Summary\nIn this chapter, you’ve learned the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you’ll see how important a consistent style is. And don’t forget about the styler package: it’s a great way to quickly improve the quality of poorly styled code.\nIn the next chapter, we switch back to data science tools, learning about tidy data. Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. Of course, life is never easy, and most datasets you encounter in the wild will not already be tidy. So we’ll also teach you how to use the tidyr package to tidy your untidy data.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>5</span>  <span class='chapter-title'>Workflow: code style</span>"
    ]
  },
  {
    "objectID": "workflow-style.html#footnotes",
    "href": "workflow-style.html#footnotes",
    "title": "5  Workflow: code style",
    "section": "",
    "text": "Since dep_time is in HMM or HHMM format, we use integer division (%/%) to get hour and remainder (also known as modulo, %%) to get minute.↩︎",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>5</span>  <span class='chapter-title'>Workflow: code style</span>"
    ]
  },
  {
    "objectID": "data-tidy.html",
    "href": "data-tidy.html",
    "title": "6  Data tidying",
    "section": "",
    "text": "6.1 Introduction\nIn this chapter, you will learn a consistent way to organize your data in R using a system called tidy data. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.\nIn this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>6</span>  <span class='chapter-title'>Data tidying</span>"
    ]
  },
  {
    "objectID": "data-tidy.html#introduction",
    "href": "data-tidy.html#introduction",
    "title": "6  Data tidying",
    "section": "",
    "text": "“Happy families are all alike; every unhappy family is unhappy in its own way.”\n— Leo Tolstoy\n\n\n“Tidy datasets are all alike, but every messy dataset is messy in its own way.”\n— Hadley Wickham\n\n\n\n\n6.1.1 Prerequisites\nIn this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.\n\nlibrary(tidyverse)\n\nFrom this chapter on, we’ll suppress the loading message from library(tidyverse).",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>6</span>  <span class='chapter-title'>Data tidying</span>"
    ]
  },
  {
    "objectID": "data-tidy.html#sec-tidy-data",
    "href": "data-tidy.html#sec-tidy-data",
    "title": "6  Data tidying",
    "section": "6.2 Tidy data",
    "text": "6.2 Tidy data\nYou can represent the same underlying data in multiple ways. The example below shows the same data organized in three different ways. Each dataset shows the same values of four variables: country, year, population, and number of documented cases of TB (tuberculosis), but each dataset organizes the values in a different way.\n\ntable1\n#&gt; # A tibble: 6 × 4\n#&gt;   country      year  cases population\n#&gt;   &lt;chr&gt;       &lt;dbl&gt;  &lt;dbl&gt;      &lt;dbl&gt;\n#&gt; 1 Afghanistan  1999    745   19987071\n#&gt; 2 Afghanistan  2000   2666   20595360\n#&gt; 3 Brazil       1999  37737  172006362\n#&gt; 4 Brazil       2000  80488  174504898\n#&gt; 5 China        1999 212258 1272915272\n#&gt; 6 China        2000 213766 1280428583\n\ntable2\n#&gt; # A tibble: 12 × 4\n#&gt;   country      year type           count\n#&gt;   &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;          &lt;dbl&gt;\n#&gt; 1 Afghanistan  1999 cases            745\n#&gt; 2 Afghanistan  1999 population  19987071\n#&gt; 3 Afghanistan  2000 cases           2666\n#&gt; 4 Afghanistan  2000 population  20595360\n#&gt; 5 Brazil       1999 cases          37737\n#&gt; 6 Brazil       1999 population 172006362\n#&gt; # ℹ 6 more rows\n\ntable3\n#&gt; # A tibble: 6 × 3\n#&gt;   country      year rate             \n#&gt;   &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;            \n#&gt; 1 Afghanistan  1999 745/19987071     \n#&gt; 2 Afghanistan  2000 2666/20595360    \n#&gt; 3 Brazil       1999 37737/172006362  \n#&gt; 4 Brazil       2000 80488/174504898  \n#&gt; 5 China        1999 212258/1272915272\n#&gt; 6 China        2000 213766/1280428583\n\nThese are all representations of the same underlying data, but they are not equally easy to use. One of them, table1, will be much easier to work with inside the tidyverse because it’s tidy.\nThere are three interrelated rules that make a dataset tidy:\n\nEach variable is a column; each column is a variable.\nEach observation is a row; each row is an observation.\nEach value is a cell; each cell is a single value.\n\nFigure 6.1 shows the rules visually.\n\n\n\n\n\n\n\n\nFigure 6.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.\n\n\n\n\n\nWhy ensure that your data is tidy? There are two main advantages:\n\nThere’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.\nThere’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in Section 4.3.1 and Section 4.5.2, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.\n\ndplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with table1.\n\n# Compute rate per 10,000\ntable1 |&gt;\n  mutate(rate = cases / population * 10000)\n#&gt; # A tibble: 6 × 5\n#&gt;   country      year  cases population  rate\n#&gt;   &lt;chr&gt;       &lt;dbl&gt;  &lt;dbl&gt;      &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 Afghanistan  1999    745   19987071 0.373\n#&gt; 2 Afghanistan  2000   2666   20595360 1.29 \n#&gt; 3 Brazil       1999  37737  172006362 2.19 \n#&gt; 4 Brazil       2000  80488  174504898 4.61 \n#&gt; 5 China        1999 212258 1272915272 1.67 \n#&gt; 6 China        2000 213766 1280428583 1.67\n\n# Compute total cases per year\ntable1 |&gt; \n  group_by(year) |&gt; \n  summarize(total_cases = sum(cases))\n#&gt; # A tibble: 2 × 2\n#&gt;    year total_cases\n#&gt;   &lt;dbl&gt;       &lt;dbl&gt;\n#&gt; 1  1999      250740\n#&gt; 2  2000      296920\n\n# Visualize changes over time\nggplot(table1, aes(x = year, y = cases)) +\n  geom_line(aes(group = country), color = \"grey50\") +\n  geom_point(aes(color = country, shape = country)) +\n  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000\n\n\n\n\n\n\n\n\n\n6.2.1 Exercises\n\nFor each of the sample tables, describe what each observation and each column represents.\nSketch out the process you’d use to calculate the rate for table2 and table3. You will need to perform four operations:\n\nExtract the number of TB cases per country per year.\nExtract the matching population per country per year.\nDivide cases by population, and multiply by 10000.\nStore back in the appropriate place.\n\nYou haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>6</span>  <span class='chapter-title'>Data tidying</span>"
    ]
  },
  {
    "objectID": "data-tidy.html#sec-pivoting",
    "href": "data-tidy.html#sec-pivoting",
    "title": "6  Data tidying",
    "section": "6.3 Lengthening data",
    "text": "6.3 Lengthening data\nThe principles of tidy data might seem so obvious that you wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most real data is untidy. There are two main reasons:\n\nData is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.\nMost people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.\n\nThis means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll pivot your data into a tidy form, with variables in the columns and observations in the rows.\ntidyr provides two functions for pivoting data: pivot_longer() and pivot_wider(). We’ll first start with pivot_longer() because it’s the most common case. Let’s dive into some examples.\n\n6.3.1 Data in column names\nThe billboard dataset records the billboard rank of songs in the year 2000:\n\nbillboard\n#&gt; # A tibble: 317 × 79\n#&gt;   artist       track               date.entered   wk1   wk2   wk3   wk4   wk5\n#&gt;   &lt;chr&gt;        &lt;chr&gt;               &lt;date&gt;       &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 2 Pac        Baby Don't Cry (Ke… 2000-02-26      87    82    72    77    87\n#&gt; 2 2Ge+her      The Hardest Part O… 2000-09-02      91    87    92    NA    NA\n#&gt; 3 3 Doors Down Kryptonite          2000-04-08      81    70    68    67    66\n#&gt; 4 3 Doors Down Loser               2000-10-21      76    76    72    69    67\n#&gt; 5 504 Boyz     Wobble Wobble       2000-04-15      57    34    25    17    17\n#&gt; 6 98^0         Give Me Just One N… 2000-08-19      51    39    34    26    26\n#&gt; # ℹ 311 more rows\n#&gt; # ℹ 71 more variables: wk6 &lt;dbl&gt;, wk7 &lt;dbl&gt;, wk8 &lt;dbl&gt;, wk9 &lt;dbl&gt;, …\n\nIn this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week1. Here, the column names are one variable (the week) and the cell values are another (the rank).\nTo tidy this data, we’ll use pivot_longer():\n\nbillboard |&gt; \n  pivot_longer(\n    cols = starts_with(\"wk\"), \n    names_to = \"week\", \n    values_to = \"rank\"\n  )\n#&gt; # A tibble: 24,092 × 5\n#&gt;    artist track                   date.entered week   rank\n#&gt;    &lt;chr&gt;  &lt;chr&gt;                   &lt;date&gt;       &lt;chr&gt; &lt;dbl&gt;\n#&gt;  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87\n#&gt;  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82\n#&gt;  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72\n#&gt;  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77\n#&gt;  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87\n#&gt;  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94\n#&gt;  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99\n#&gt;  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA\n#&gt;  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA\n#&gt; 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA\n#&gt; # ℹ 24,082 more rows\n\nAfter the data, there are three key arguments:\n\ncols specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as select() so here we could use !c(artist, track, date.entered) or starts_with(\"wk\").\nnames_to names the variable stored in the column names, we named that variable week.\nvalues_to names the variable stored in the cell values, we named that variable rank.\n\nNote that in the code \"week\" and \"rank\" are quoted because those are new variables we’re creating, they don’t yet exist in the data when we run the pivot_longer() call.\nNow let’s turn our attention to the resulting, longer data frame. What happens if a song is in the top 100 for less than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These NAs don’t really represent unknown observations; they were forced to exist by the structure of the dataset2, so we can ask pivot_longer() to get rid of them by setting values_drop_na = TRUE:\n\nbillboard |&gt; \n  pivot_longer(\n    cols = starts_with(\"wk\"), \n    names_to = \"week\", \n    values_to = \"rank\",\n    values_drop_na = TRUE\n  )\n#&gt; # A tibble: 5,307 × 5\n#&gt;   artist track                   date.entered week   rank\n#&gt;   &lt;chr&gt;  &lt;chr&gt;                   &lt;date&gt;       &lt;chr&gt; &lt;dbl&gt;\n#&gt; 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87\n#&gt; 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82\n#&gt; 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72\n#&gt; 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77\n#&gt; 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87\n#&gt; 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94\n#&gt; # ℹ 5,301 more rows\n\nThe number of rows is now much lower, indicating that many rows with NAs were dropped.\nYou might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns wk77, wk78, … would be added to the dataset.\nThis data is now tidy, but we could make future computation a bit easier by converting values of week from character strings to numbers using mutate() and readr::parse_number(). parse_number() is a handy function that will extract the first number from a string, ignoring all other text.\n\nbillboard_longer &lt;- billboard |&gt; \n  pivot_longer(\n    cols = starts_with(\"wk\"), \n    names_to = \"week\", \n    values_to = \"rank\",\n    values_drop_na = TRUE\n  ) |&gt; \n  mutate(\n    week = parse_number(week)\n  )\nbillboard_longer\n#&gt; # A tibble: 5,307 × 5\n#&gt;   artist track                   date.entered  week  rank\n#&gt;   &lt;chr&gt;  &lt;chr&gt;                   &lt;date&gt;       &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26       1    87\n#&gt; 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26       2    82\n#&gt; 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26       3    72\n#&gt; 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26       4    77\n#&gt; 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26       5    87\n#&gt; 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26       6    94\n#&gt; # ℹ 5,301 more rows\n\nNow that we have all the week numbers in one variable and all the rank values in another, we’re in a good position to visualize how song ranks vary over time. The code is shown below and the result is in Figure 6.2. We can see that very few songs stay in the top 100 for more than 20 weeks.\n\nbillboard_longer |&gt; \n  ggplot(aes(x = week, y = rank, group = track)) + \n  geom_line(alpha = 0.25) + \n  scale_y_reverse()\n\n\n\n\n\n\n\nFigure 6.2: A line plot showing how the rank of a song changes over time.\n\n\n\n\n\n\n\n6.3.2 How does pivoting work?\nNow that you’ve seen how we can use pivoting to reshape our data, let’s take a little time to gain some intuition about what pivoting does to the data. Let’s start with a very simple dataset to make it easier to see what’s happening. Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient. We’ll create the data with tribble(), a handy function for constructing small tibbles by hand:\n\ndf &lt;- tribble(\n  ~id,  ~bp1, ~bp2,\n   \"A\",  100,  120,\n   \"B\",  140,  115,\n   \"C\",  120,  125\n)\n\nWe want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we need to pivot df longer:\n\ndf |&gt; \n  pivot_longer(\n    cols = bp1:bp2,\n    names_to = \"measurement\",\n    values_to = \"value\"\n  )\n#&gt; # A tibble: 6 × 3\n#&gt;   id    measurement value\n#&gt;   &lt;chr&gt; &lt;chr&gt;       &lt;dbl&gt;\n#&gt; 1 A     bp1           100\n#&gt; 2 A     bp2           120\n#&gt; 3 B     bp1           140\n#&gt; 4 B     bp2           115\n#&gt; 5 C     bp1           120\n#&gt; 6 C     bp2           125\n\nHow does the reshaping work? It’s easier to see if we think about it column by column. As shown in Figure 6.3, the values in a column that was already a variable in the original dataset (id) need to be repeated, once for each column that is pivoted.\n\n\n\n\n\n\n\n\nFigure 6.3: Columns that are already variables need to be repeated, once for each column that is pivoted.\n\n\n\n\n\nThe column names become values in a new variable, whose name is defined by names_to, as shown in Figure 6.4. They need to be repeated once for each row in the original dataset.\n\n\n\n\n\n\n\n\nFigure 6.4: The column names of pivoted columns become values in a new column. The values need to be repeated once for each row of the original dataset.\n\n\n\n\n\nThe cell values also become values in a new variable, with a name defined by values_to. They are unwound row by row. Figure 6.5 illustrates the process.\n\n\n\n\n\n\n\n\nFigure 6.5: The number of values is preserved (not repeated), but unwound row-by-row.\n\n\n\n\n\n\n\n6.3.3 Many variables in column names\nA more challenging situation occurs when you have multiple pieces of information crammed into the column names, and you would like to store these in separate new variables. For example, take the who2 dataset, the source of table1 and friends that you saw above:\n\nwho2\n#&gt; # A tibble: 7,240 × 58\n#&gt;   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554\n#&gt;   &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1 Afghanistan  1980       NA        NA        NA        NA        NA\n#&gt; 2 Afghanistan  1981       NA        NA        NA        NA        NA\n#&gt; 3 Afghanistan  1982       NA        NA        NA        NA        NA\n#&gt; 4 Afghanistan  1983       NA        NA        NA        NA        NA\n#&gt; 5 Afghanistan  1984       NA        NA        NA        NA        NA\n#&gt; 6 Afghanistan  1985       NA        NA        NA        NA        NA\n#&gt; # ℹ 7,234 more rows\n#&gt; # ℹ 51 more variables: sp_m_5564 &lt;dbl&gt;, sp_m_65 &lt;dbl&gt;, sp_f_014 &lt;dbl&gt;, …\n\nThis dataset, collected by the World Health Organisation, records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: country and year. They are followed by 56 columns like sp_m_014, ep_m_4554, and rel_m_3544. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by _. The first piece, sp/rel/ep, describes the method used for the diagnosis, the second piece, m/f is the gender (coded as a binary variable in this dataset), and the third piece, 014/1524/2534/3544/4554/5564/65 is the age range (014 represents 0-14, for example).\nSo in this case we have six pieces of information recorded in who2: the country and the year (already columns); the method of diagnosis, the gender category, and the age range category (contained in the other column names); and the count of patients in that category (cell values). To organize these six pieces of information in six separate columns, we use pivot_longer() with a vector of column names for names_to and instructors for splitting the original variable names into pieces for names_sep as well as a column name for values_to:\n\nwho2 |&gt; \n  pivot_longer(\n    cols = !(country:year),\n    names_to = c(\"diagnosis\", \"gender\", \"age\"), \n    names_sep = \"_\",\n    values_to = \"count\"\n  )\n#&gt; # A tibble: 405,440 × 6\n#&gt;   country      year diagnosis gender age   count\n#&gt;   &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;     &lt;chr&gt;  &lt;chr&gt; &lt;dbl&gt;\n#&gt; 1 Afghanistan  1980 sp        m      014      NA\n#&gt; 2 Afghanistan  1980 sp        m      1524     NA\n#&gt; 3 Afghanistan  1980 sp        m      2534     NA\n#&gt; 4 Afghanistan  1980 sp        m      3544     NA\n#&gt; 5 Afghanistan  1980 sp        m      4554     NA\n#&gt; 6 Afghanistan  1980 sp        m      5564     NA\n#&gt; # ℹ 405,434 more rows\n\nAn alternative to names_sep is names_pattern, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in Chapter 16.\nConceptually, this is only a minor variation on the simpler case you’ve already seen. Figure 6.6 shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that’s faster.\n\n\n\n\n\n\n\n\nFigure 6.6: Pivoting columns with multiple pieces of information in the names means that each column name now fills in values in multiple output columns.\n\n\n\n\n\n\n\n6.3.4 Data and variable names in the column headers\nThe next step up in complexity is when the column names include a mix of variable values and variable names. For example, take the household dataset:\n\nhousehold\n#&gt; # A tibble: 5 × 5\n#&gt;   family dob_child1 dob_child2 name_child1 name_child2\n#&gt;    &lt;int&gt; &lt;date&gt;     &lt;date&gt;     &lt;chr&gt;       &lt;chr&gt;      \n#&gt; 1      1 1998-11-26 2000-01-29 Susan       Jose       \n#&gt; 2      2 1996-06-22 NA         Mark        &lt;NA&gt;       \n#&gt; 3      3 2002-07-11 2004-04-05 Sam         Seth       \n#&gt; 4      4 2004-10-10 2009-08-27 Craig       Khai       \n#&gt; 5      5 2000-12-05 2005-02-28 Parker      Gracie\n\nThis dataset contains data about five families, with the names and dates of birth of up to two children. The new challenge in this dataset is that the column names contain the names of two variables (dob, name) and the values of another (child, with values 1 or 2). To solve this problem we again need to supply a vector to names_to but this time we use the special \".value\" sentinel; this isn’t the name of a variable but a unique value that tells pivot_longer() to do something different. This overrides the usual values_to argument to use the first component of the pivoted column name as a variable name in the output.\n\nhousehold |&gt; \n  pivot_longer(\n    cols = !family, \n    names_to = c(\".value\", \"child\"), \n    names_sep = \"_\", \n    values_drop_na = TRUE\n  )\n#&gt; # A tibble: 9 × 4\n#&gt;   family child  dob        name \n#&gt;    &lt;int&gt; &lt;chr&gt;  &lt;date&gt;     &lt;chr&gt;\n#&gt; 1      1 child1 1998-11-26 Susan\n#&gt; 2      1 child2 2000-01-29 Jose \n#&gt; 3      2 child1 1996-06-22 Mark \n#&gt; 4      3 child1 2002-07-11 Sam  \n#&gt; 5      3 child2 2004-04-05 Seth \n#&gt; 6      4 child1 2004-10-10 Craig\n#&gt; # ℹ 3 more rows\n\nWe again use values_drop_na = TRUE, since the shape of the input forces the creation of explicit missing variables (e.g., for families with only one child).\nFigure 6.7 illustrates the basic idea with a simpler example. When you use \".value\" in names_to, the column names in the input contribute to both values and variable names in the output.\n\n\n\n\n\n\n\n\nFigure 6.7: Pivoting with names_to = c(\".value\", \"num\") splits the column names into two components: the first part determines the output column name (x or y), and the second part determines the value of the num column.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>6</span>  <span class='chapter-title'>Data tidying</span>"
    ]
  },
  {
    "objectID": "data-tidy.html#widening-data",
    "href": "data-tidy.html#widening-data",
    "title": "6  Data tidying",
    "section": "6.4 Widening data",
    "text": "6.4 Widening data\nSo far we’ve used pivot_longer() to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to pivot_wider(), which makes datasets wider by increasing columns and reducing rows and helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.\nWe’ll start by looking at cms_patient_experience, a dataset from the Centers of Medicare and Medicaid services that collects data about patient experiences:\n\ncms_patient_experience\n#&gt; # A tibble: 500 × 5\n#&gt;   org_pac_id org_nm                     measure_cd   measure_title   prf_rate\n#&gt;   &lt;chr&gt;      &lt;chr&gt;                      &lt;chr&gt;        &lt;chr&gt;              &lt;dbl&gt;\n#&gt; 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1  CAHPS for MIPS…       63\n#&gt; 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2  CAHPS for MIPS…       87\n#&gt; 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3  CAHPS for MIPS…       86\n#&gt; 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5  CAHPS for MIPS…       57\n#&gt; 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8  CAHPS for MIPS…       85\n#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS…       24\n#&gt; # ℹ 494 more rows\n\nThe core unit being studied is an organization, but each organization is spread across six rows, with one row for each measurement taken in the survey organization. We can see the complete set of values for measure_cd and measure_title by using distinct():\n\ncms_patient_experience |&gt; \n  distinct(measure_cd, measure_title)\n#&gt; # A tibble: 6 × 2\n#&gt;   measure_cd   measure_title                                                 \n#&gt;   &lt;chr&gt;        &lt;chr&gt;                                                         \n#&gt; 1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…\n#&gt; 2 CAHPS_GRP_2  CAHPS for MIPS SSM: How Well Providers Communicate            \n#&gt; 3 CAHPS_GRP_3  CAHPS for MIPS SSM: Patient's Rating of Provider              \n#&gt; 4 CAHPS_GRP_5  CAHPS for MIPS SSM: Health Promotion and Education            \n#&gt; 5 CAHPS_GRP_8  CAHPS for MIPS SSM: Courteous and Helpful Office Staff        \n#&gt; 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources\n\nNeither of these columns will make particularly great variable names: measure_cd doesn’t hint at the meaning of the variable and measure_title is a long sentence containing spaces. We’ll use measure_cd as the source for our new column names for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.\npivot_wider() has the opposite interface to pivot_longer(): instead of choosing new column names, we need to provide the existing columns that define the values (values_from) and the column name (names_from):\n\ncms_patient_experience |&gt; \n  pivot_wider(\n    names_from = measure_cd,\n    values_from = prf_rate\n  )\n#&gt; # A tibble: 500 × 9\n#&gt;   org_pac_id org_nm                   measure_title   CAHPS_GRP_1 CAHPS_GRP_2\n#&gt;   &lt;chr&gt;      &lt;chr&gt;                    &lt;chr&gt;                 &lt;dbl&gt;       &lt;dbl&gt;\n#&gt; 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          63          NA\n#&gt; 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          87\n#&gt; 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA\n#&gt; 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA\n#&gt; 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA\n#&gt; 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA\n#&gt; # ℹ 494 more rows\n#&gt; # ℹ 4 more variables: CAHPS_GRP_3 &lt;dbl&gt;, CAHPS_GRP_5 &lt;dbl&gt;, …\n\nThe output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, we also need to tell pivot_wider() which column or columns have values that uniquely identify each row; in this case those are the variables starting with \"org\":\n\ncms_patient_experience |&gt; \n  pivot_wider(\n    id_cols = starts_with(\"org\"),\n    names_from = measure_cd,\n    values_from = prf_rate\n  )\n#&gt; # A tibble: 95 × 8\n#&gt;   org_pac_id org_nm           CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5\n#&gt;   &lt;chr&gt;      &lt;chr&gt;                  &lt;dbl&gt;       &lt;dbl&gt;       &lt;dbl&gt;       &lt;dbl&gt;\n#&gt; 1 0446157747 USC CARE MEDICA…          63          87          86          57\n#&gt; 2 0446162697 ASSOCIATION OF …          59          85          83          63\n#&gt; 3 0547164295 BEAVER MEDICAL …          49          NA          75          44\n#&gt; 4 0749333730 CAPE PHYSICIANS…          67          84          85          65\n#&gt; 5 0840104360 ALLIANCE PHYSIC…          66          87          87          64\n#&gt; 6 0840109864 REX HOSPITAL INC          73          87          84          67\n#&gt; # ℹ 89 more rows\n#&gt; # ℹ 2 more variables: CAHPS_GRP_8 &lt;dbl&gt;, CAHPS_GRP_12 &lt;dbl&gt;\n\nThis gives us the output that we’re looking for.\n\n6.4.1 How does pivot_wider() work?\nTo understand how pivot_wider() works, let’s again start with a very simple dataset. This time we have two patients with ids A and B, we have three blood pressure measurements on patient A and two on patient B:\n\ndf &lt;- tribble(\n  ~id, ~measurement, ~value,\n  \"A\",        \"bp1\",    100,\n  \"B\",        \"bp1\",    140,\n  \"B\",        \"bp2\",    115, \n  \"A\",        \"bp2\",    120,\n  \"A\",        \"bp3\",    105\n)\n\nWe’ll take the values from the value column and the names from the measurement column:\n\ndf |&gt; \n  pivot_wider(\n    names_from = measurement,\n    values_from = value\n  )\n#&gt; # A tibble: 2 × 4\n#&gt;   id      bp1   bp2   bp3\n#&gt;   &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 A       100   120   105\n#&gt; 2 B       140   115    NA\n\nTo begin the process pivot_wider() needs to first figure out what will go in the rows and columns. The new column names will be the unique values of measurement.\n\ndf |&gt; \n  distinct(measurement) |&gt; \n  pull()\n#&gt; [1] \"bp1\" \"bp2\" \"bp3\"\n\nBy default, the rows in the output are determined by all the variables that aren’t going into the new names or values. These are called the id_cols. Here there is only one column, but in general there can be any number.\n\ndf |&gt; \n  select(-measurement, -value) |&gt; \n  distinct()\n#&gt; # A tibble: 2 × 1\n#&gt;   id   \n#&gt;   &lt;chr&gt;\n#&gt; 1 A    \n#&gt; 2 B\n\npivot_wider() then combines these results to generate an empty data frame:\n\ndf |&gt; \n  select(-measurement, -value) |&gt; \n  distinct() |&gt; \n  mutate(x = NA, y = NA, z = NA)\n#&gt; # A tibble: 2 × 4\n#&gt;   id    x     y     z    \n#&gt;   &lt;chr&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;lgl&gt;\n#&gt; 1 A     NA    NA    NA   \n#&gt; 2 B     NA    NA    NA\n\nIt then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no third blood pressure measurement for patient B, so that cell remains missing. We’ll come back to this idea that pivot_wider() can “make” missing values in Chapter 19.\nYou might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and measurement “bp1”:\n\ndf &lt;- tribble(\n  ~id, ~measurement, ~value,\n  \"A\",        \"bp1\",    100,\n  \"A\",        \"bp1\",    102,\n  \"A\",        \"bp2\",    120,\n  \"B\",        \"bp1\",    140, \n  \"B\",        \"bp2\",    115\n)\n\nIf we attempt to pivot this we get an output that contains list-columns, which you’ll learn more about in Chapter 24:\n\ndf |&gt;\n  pivot_wider(\n    names_from = measurement,\n    values_from = value\n  )\n#&gt; Warning: Values from `value` are not uniquely identified; output will contain\n#&gt; list-cols.\n#&gt; • Use `values_fn = list` to suppress this warning.\n#&gt; • Use `values_fn = {summary_fun}` to summarise duplicates.\n#&gt; • Use the following dplyr code to identify duplicates.\n#&gt;   {data} |&gt;\n#&gt;   dplyr::summarise(n = dplyr::n(), .by = c(id, measurement)) |&gt;\n#&gt;   dplyr::filter(n &gt; 1L)\n#&gt; # A tibble: 2 × 3\n#&gt;   id    bp1       bp2      \n#&gt;   &lt;chr&gt; &lt;list&gt;    &lt;list&gt;   \n#&gt; 1 A     &lt;dbl [2]&gt; &lt;dbl [1]&gt;\n#&gt; 2 B     &lt;dbl [1]&gt; &lt;dbl [1]&gt;\n\nSince you don’t know how to work with this sort of data yet, you’ll want to follow the hint in the warning to figure out where the problem is:\n\ndf |&gt; \n  group_by(id, measurement) |&gt; \n  summarize(n = n(), .groups = \"drop\") |&gt; \n  filter(n &gt; 1)\n#&gt; # A tibble: 1 × 3\n#&gt;   id    measurement     n\n#&gt;   &lt;chr&gt; &lt;chr&gt;       &lt;int&gt;\n#&gt; 1 A     bp1             2\n\nIt’s then up to you to figure out what’s gone wrong with your data and either repair the underlying damage or use your grouping and summarizing skills to ensure that each combination of row and column values only has a single row.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>6</span>  <span class='chapter-title'>Data tidying</span>"
    ]
  },
  {
    "objectID": "data-tidy.html#summary",
    "href": "data-tidy.html#summary",
    "title": "6  Data tidying",
    "section": "6.5 Summary",
    "text": "6.5 Summary\nIn this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions, the main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about pivot_longer() and pivot_wider() which allow you to tidy up many untidy datasets. The examples we presented here are a selection of those from vignette(\"pivot\", package = \"tidyr\"), so if you encounter a problem that this chapter doesn’t help you with, that vignette is a good place to try next.\nAnother challenge is that, for a given dataset, it can be impossible to label the longer or the wider version as the “tidy” one. This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn’t actually define what a variable is (and it’s surprisingly hard to do so). It’s totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest. So if you’re stuck figuring out how to do some computation, consider switching up the organisation of your data; don’t be afraid to untidy, transform, and re-tidy as needed!\nIf you enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the Tidy Data paper published in the Journal of Statistical Software.\nNow that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>6</span>  <span class='chapter-title'>Data tidying</span>"
    ]
  },
  {
    "objectID": "data-tidy.html#footnotes",
    "href": "data-tidy.html#footnotes",
    "title": "6  Data tidying",
    "section": "",
    "text": "The song will be included as long as it was in the top 100 at some point in 2000, and is tracked for up to 72 weeks after it appears.↩︎\nWe’ll come back to this idea in Chapter 19.↩︎",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>6</span>  <span class='chapter-title'>Data tidying</span>"
    ]
  },
  {
    "objectID": "workflow-scripts.html",
    "href": "workflow-scripts.html",
    "title": "7  Workflow: scripts and projects",
    "section": "",
    "text": "7.1 Scripts\nThis chapter will introduce you to two essential tools for organizing your code: scripts and projects.\nSo far, you have used the console to run code. That’s a great place to start, but you’ll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines. To give yourself more room to work, use the script editor. Open it up by clicking the File menu, selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you’ll see four panes, as in Figure 7.1. The script editor is a great place to experiment with your code. When you want to change something, you don’t have to re-type the whole thing, you can just edit the script and re-run it. And once you have written code that works and does what you want, you can save it as a script file to easily return to later.\nFigure 7.1: Opening the script editor adds a new pane at the top-left of the IDE.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>7</span>  <span class='chapter-title'>Workflow: scripts and projects</span>"
    ]
  },
  {
    "objectID": "workflow-scripts.html#scripts",
    "href": "workflow-scripts.html#scripts",
    "title": "7  Workflow: scripts and projects",
    "section": "",
    "text": "7.1.1 Running code\nThe script editor is an excellent place for building complex ggplot2 plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below.\n\nlibrary(dplyr)\nlibrary(nycflights13)\n\nnot_cancelled &lt;- flights |&gt; \n  filter(!is.na(dep_delay)█, !is.na(arr_delay))\n\nnot_cancelled |&gt; \n  group_by(year, month, day) |&gt; \n  summarize(mean = mean(dep_delay))\n\nIf your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates not_cancelled. It will also move the cursor to the following statement (beginning with not_cancelled |&gt;). That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.\nInstead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you’ve captured all the important parts of your code in the script.\nWe recommend you always start your script with the packages you need. That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include install.packages() in a script you share. It’s inconsiderate to hand off a script that will change something on their computer if they’re not being careful!\nWhen working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won’t even think about it.\n\n\n7.1.2 RStudio diagnostics\nIn the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:\n\n\n\n\n\n\n\n\n\nHover over the cross to see what the problem is:\n\n\n\n\n\n\n\n\n\nRStudio will also let you know about potential problems:\n\n\n\n\n\n\n\n\n\n\n\n7.1.3 Saving and naming\nRStudio automatically saves the contents of the script editor when you quit, and automatically reloads it when you re-open. Nevertheless, it’s a good idea to avoid Untitled1, Untitled2, Untitled3, and so on and instead save your scripts and to give them informative names.\nIt might be tempting to name your files code.R or myscript.R, but you should think a bit harder before choosing a name for your file. Three important principles for file naming are as follows:\n\nFile names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.\nFile names should be human readable: use file names to describe what’s in the file.\nFile names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.\n\nFor example, suppose you have the following files in a project folder.\nalternative model.R\ncode for exploratory analysis.r\nfinalreport.qmd\nFinalReport.qmd\nfig 1.png\nFigure_02.png\nmodel_first_try.R\nrun-first.r\ntemp.txt\nThere are a variety of problems here: it’s hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (finalreport vs. FinalReport1), and some names don’t describe their contents (run-first and temp).\nHere’s a better way of naming and organizing the same set of files:\n01-load-data.R\n02-exploratory-analysis.R\n03-model-approach-1.R\n04-model-approach-2.R\nfig-01.png\nfig-02.png\nreport-2022-03-20.qmd\nreport-2022-04-02.qmd\nreport-draft-notes.txt\nNumbering the key scripts make it obvious in which order to run them and a consistent naming scheme makes it easier to see what varies. Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and temp is renamed to report-draft-notes to better describe its contents. If you have a lot of files in a directory, taking organization one step further and placing different types of files (scripts, figures, etc.) in different directories is recommended.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>7</span>  <span class='chapter-title'>Workflow: scripts and projects</span>"
    ]
  },
  {
    "objectID": "workflow-scripts.html#projects",
    "href": "workflow-scripts.html#projects",
    "title": "7  Workflow: scripts and projects",
    "section": "7.2 Projects",
    "text": "7.2 Projects\nOne day, you will need to quit R, go do something else, and return to your analysis later. One day, you will be working on multiple analyses simultaneously and you want to keep them separate. One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.\nTo handle these real life situations, you need to make two decisions:\n\nWhat is the source of truth? What will you save as your lasting record of what happened?\nWhere does your analysis live?\n\n\n7.2.1 What is the source of truth?\nAs a beginner, it’s okay to rely on your current Environment to contain all the objects you have created throughout your analysis. However, to make it easier to work on larger projects or collaborate with others, your source of truth should be the R scripts. With your R scripts (and your data files), you can recreate the environment. With only your environment, it’s much harder to recreate your R scripts: you’ll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you’ll have to carefully mine your R history.\nTo help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running usethis::use_blank_slate()2 or by mimicking the options shown in Figure 7.2. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time nor will the objects you created or the datasets you read be available to use. But this short-term pain saves you long-term agony because it forces you to capture all important procedures in your code. There’s nothing worse than discovering three months after the fact that you’ve only stored the results of an important calculation in your environment, not the calculation itself in your code.\n\n\n\n\n\n\n\n\nFigure 7.2: Copy these options in your RStudio options to always start your RStudio session with a clean slate.\n\n\n\n\n\nThere is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:\n\nPress Cmd/Ctrl + Shift + 0/F10 to restart R.\nPress Cmd/Ctrl + Shift + S to re-run the current script.\n\nWe collectively use this pattern hundreds of times a week.\nAlternatively, if you don’t use keyboard shortcuts, you can go to Session &gt; Restart R and then highlight and re-run your current script.\n\n\n\n\n\n\nRStudio server\n\n\n\nIf you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a clean slate.\n\n\n\n\n7.2.2 Where does your analysis live?\nR has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console:\n\n\n\n\n\n\n\n\n\nAnd you can print this out in R code by running getwd():\n\ngetwd()\n#&gt; [1] \"/Users/hadley/Documents/r4ds\"\n\nIn this R session, the current working directory (think of it as “home”) is in hadley’s Documents folder, in a subfolder called r4ds. This code will return a different result when you run it, because your computer has a different directory structure than Hadley’s!\nAs a beginning R user, it’s OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. But you’re more than a handful of chapters into this book, and you’re no longer a beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, set R’s working directory to the associated directory.\nYou can set the working directory from within R but we do not recommend it:\n\nsetwd(\"/path/to/my/CoolProject\")\n\nThere’s a better way; a way that also puts you on the path to managing your R work like an expert. That way is the RStudio project.\n\n\n7.2.3 RStudio projects\nKeeping all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory is such a wise and common practice that RStudio has built-in support for this via projects. Let’s make a project for you to use while you’re working through the rest of this book. Click File &gt; New Project, then follow the steps shown in Figure 7.3.\n\n\n\n\n\n\n\n\nFigure 7.3: To create new project: (top) first click New Directory, then (middle) click New Project, then (bottom) fill in the directory (project) name, choose a good subdirectory for its home and click Create Project.\n\n\n\n\n\nCall your project r4ds and think carefully about which subdirectory you put the project in. If you don’t store it somewhere sensible, it will be hard to find it in the future!\nOnce this process is complete, you’ll get a new RStudio project just for this book. Check that the “home” of your project is the current working directory:\n\ngetwd()\n#&gt; [1] /Users/hadley/Documents/r4ds\n\nNow enter the following commands in the script editor, and save the file, calling it “diamonds.R”. Then, create a new folder called “data”. You can do this by clicking on the “New Folder” button in the Files pane in RStudio. Finally, run the complete script which will save a PNG and CSV file into your project directory. Don’t worry about the details, you’ll learn them later in the book.\n\nlibrary(tidyverse)\n\nggplot(diamonds, aes(x = carat, y = price)) + \n  geom_hex()\nggsave(\"diamonds.png\")\n\nwrite_csv(diamonds, \"data/diamonds.csv\")\n\nQuit RStudio. Inspect the folder associated with your project — notice the .Rproj file. Double-click that file to re-open the project. Notice you get back to where you left off: it’s the same working directory and command history, and all the files you were working on are still open. Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you’re starting with a clean slate.\nIn your favorite OS-specific way, search your computer for diamonds.png and you will find the PNG (no surprise) but also the script that created it (diamonds.R). This is a huge win! One day, you will want to remake a figure or just understand where it came from. If you rigorously save figures to files with R code and never with the mouse or the clipboard, you will be able to reproduce old work with ease!\n\n\n7.2.4 Relative and absolute paths\nOnce you’re inside a project, you should only ever use relative paths not absolute paths. What’s the difference? A relative path is relative to the working directory, i.e. the project’s home. When Hadley wrote data/diamonds.csv above it was a shortcut for /Users/hadley/Documents/r4ds/data/diamonds.csv. But importantly, if Mine ran this code on her computer, it would point to /Users/Mine/Documents/r4ds/data/diamonds.csv. This is why relative paths are important: they’ll work regardless of where the R project folder ends up.\nAbsolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g., C:) or two backslashes (e.g., \\\\servername) and on Mac/Linux they start with a slash “/” (e.g., /users/hadley). You should never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.\nThere’s another important difference between operating systems: how you separate the components of the path. Mac and Linux uses slashes (e.g., data/diamonds.csv) and Windows uses backslashes (e.g., data\\diamonds.csv). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>7</span>  <span class='chapter-title'>Workflow: scripts and projects</span>"
    ]
  },
  {
    "objectID": "workflow-scripts.html#exercises",
    "href": "workflow-scripts.html#exercises",
    "title": "7  Workflow: scripts and projects",
    "section": "7.3 Exercises",
    "text": "7.3 Exercises\n\nGo to the RStudio Tips Twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it!\nWhat other common mistakes will RStudio diagnostics report? Read https://support.posit.co/hc/en-us/articles/205753617-Code-Diagnostics to find out.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>7</span>  <span class='chapter-title'>Workflow: scripts and projects</span>"
    ]
  },
  {
    "objectID": "workflow-scripts.html#summary",
    "href": "workflow-scripts.html#summary",
    "title": "7  Workflow: scripts and projects",
    "section": "7.4 Summary",
    "text": "7.4 Summary\nIn this chapter, you’ve learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you’ll learn to appreciate how a little up front organisation can save you a bunch of time down the road.\nIn summary, scripts and projects give you a solid workflow that will serve you well in the future:\n\nCreate one RStudio project for each data analysis project.\nSave your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you’ve captured everything in your scripts.\nOnly ever use relative paths, not absolute paths.\n\nThen everything you need is in one place and cleanly separated from all the other projects that you are working on.\nSo far, we’ve worked with datasets bundled inside of R packages. This makes it easier to get some practice on pre-prepared data, but obviously your data won’t be available in this way. So in the next chapter, you’re going to learn how load data from disk into your R session using the readr package.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>7</span>  <span class='chapter-title'>Workflow: scripts and projects</span>"
    ]
  },
  {
    "objectID": "workflow-scripts.html#footnotes",
    "href": "workflow-scripts.html#footnotes",
    "title": "7  Workflow: scripts and projects",
    "section": "",
    "text": "Not to mention that you’re tempting fate by using “final” in the name 😆 The comic Piled Higher and Deeper has a fun strip on this.↩︎\nIf you don’t have usethis installed, you can install it with install.packages(\"usethis\").↩︎",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>7</span>  <span class='chapter-title'>Workflow: scripts and projects</span>"
    ]
  },
  {
    "objectID": "data-import.html",
    "href": "data-import.html",
    "title": "8  Data import",
    "section": "",
    "text": "8.1 Introduction\nWorking with data provided by R packages is a great way to learn data science tools, but you want to apply what you’ve learned to your own data at some point. In this chapter, you’ll learn the basics of reading data files into R.\nSpecifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "data-import.html#introduction",
    "href": "data-import.html#introduction",
    "title": "8  Data import",
    "section": "",
    "text": "8.1.1 Prerequisites\nIn this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.\n\nlibrary(tidyverse)",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "data-import.html#reading-data-from-a-file",
    "href": "data-import.html#reading-data-from-a-file",
    "title": "8  Data import",
    "section": "8.2 Reading data from a file",
    "text": "8.2 Reading data from a file\nTo begin, we’ll focus on the most common rectangular data file type: CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows provide the data. The columns are separated, aka delimited, by commas.\n\nStudent ID,Full Name,favourite.food,mealPlan,AGE\n1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4\n2,Barclay Lynn,French fries,Lunch only,5\n3,Jayendra Lyne,N/A,Breakfast and lunch,7\n4,Leon Rossini,Anchovies,Lunch only,\n5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five\n6,Güvenç Attila,Ice cream,Lunch only,6\n\nTable 8.1 shows a representation of the same data as a table.\n\n\n\n\nTable 8.1: Data from the students.csv file as a table.\n\n\n\n\n\n\n\n\n\n\n\n\n\nStudent ID\nFull Name\nfavourite.food\nmealPlan\nAGE\n\n\n\n\n1\nSunil Huffmann\nStrawberry yoghurt\nLunch only\n4\n\n\n2\nBarclay Lynn\nFrench fries\nLunch only\n5\n\n\n3\nJayendra Lyne\nN/A\nBreakfast and lunch\n7\n\n\n4\nLeon Rossini\nAnchovies\nLunch only\nNA\n\n\n5\nChidiegwu Dunkel\nPizza\nBreakfast and lunch\nfive\n\n\n6\nGüvenç Attila\nIce cream\nLunch only\n6\n\n\n\n\n\n\n\n\nWe can read this file into R using read_csv(). The first argument is the most important: the path to the file. You can think about the path as the address of the file: the file is called students.csv and that it lives in the data folder.\n\nstudents &lt;- read_csv(\"data/students.csv\")\n#&gt; Rows: 6 Columns: 5\n#&gt; ── Column specification ─────────────────────────────────────────────────────\n#&gt; Delimiter: \",\"\n#&gt; chr (4): Full Name, favourite.food, mealPlan, AGE\n#&gt; dbl (1): Student ID\n#&gt; \n#&gt; ℹ Use `spec()` to retrieve the full column specification for this data.\n#&gt; ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.\n\nThe code above will work if you have the students.csv file in a data folder in your project. You can download the students.csv file from https://pos.it/r4ds-students-csv or you can read it directly from that URL with:\n\nstudents &lt;- read_csv(\"https://pos.it/r4ds-students-csv\")\n\nWhen you run read_csv(), it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message. This message is an integral part of readr, and we’ll return to it in Section 8.3.\n\n8.2.1 Practical advice\nOnce you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the students data with that in mind.\n\nstudents\n#&gt; # A tibble: 6 × 5\n#&gt;   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  \n#&gt;          &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2            2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    \n#&gt; 4            4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6            6 Güvenç Attila    Ice cream          Lunch only          6\n\nIn the favourite.food column, there are a bunch of food items, and then the character string N/A, which should have been a real NA that R will recognize as “not available”. This is something we can address using the na argument. By default, read_csv() only recognizes empty strings (\"\") in this dataset as NAs, and we want it to also recognize the character string \"N/A\".\n\nstudents &lt;- read_csv(\"data/students.csv\", na = c(\"N/A\", \"\"))\n\nstudents\n#&gt; # A tibble: 6 × 5\n#&gt;   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  \n#&gt;          &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2            2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3            3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch 7    \n#&gt; 4            4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6            6 Güvenç Attila    Ice cream          Lunch only          6\n\nYou might also notice that the Student ID and Full Name columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names; they’re non-syntactic names. To refer to these variables, you need to surround them with backticks, `:\n\nstudents |&gt; \n  rename(\n    student_id = `Student ID`,\n    full_name = `Full Name`\n  )\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite.food     mealPlan            AGE  \n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2          2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch 7    \n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only          6\n\nAn alternative approach is to use janitor::clean_names() to use some heuristics to turn them all into snake case at once1.\n\nstudents |&gt; janitor::clean_names()\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan           age  \n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2          2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch 7    \n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only          6\n\nAnother common task after reading in data is to consider variable types. For example, meal_plan is a categorical variable with a known set of possible values, which in R should be represented as a factor:\n\nstudents |&gt;\n  janitor::clean_names() |&gt;\n  mutate(meal_plan = factor(meal_plan))\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan           age  \n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;fct&gt;               &lt;chr&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2          2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch 7    \n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only          6\n\nNote that the values in the meal_plan variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (&lt;chr&gt;) to factor (&lt;fct&gt;). You’ll learn more about factors in Chapter 17.\nBefore you analyze these data, you’ll probably want to fix the age column. Currently, age is a character variable because one of the observations is typed out as five instead of a numeric 5. We discuss the details of fixing this issue in Chapter 21.\n\nstudents &lt;- students |&gt;\n  janitor::clean_names() |&gt;\n  mutate(\n    meal_plan = factor(meal_plan),\n    age = parse_number(if_else(age == \"five\", \"5\", age))\n  )\n\nstudents\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan             age\n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;fct&gt;               &lt;dbl&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4\n#&gt; 2          2 Barclay Lynn     French fries       Lunch only              5\n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch     7\n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only             NA\n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5\n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only              6\n\nA new function here is if_else(), which has three arguments. The first argument test should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is FALSE. Here we’re saying if age is the character string \"five\", make it \"5\", and if not leave it as age. You will learn more about if_else() and logical vectors in Chapter 13.\n\n\n8.2.2 Other arguments\nThere are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: read_csv() can read text strings that you’ve created and formatted like a CSV file:\n\nread_csv(\n  \"a,b,c\n  1,2,3\n  4,5,6\"\n)\n#&gt; # A tibble: 2 × 3\n#&gt;       a     b     c\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1     2     3\n#&gt; 2     4     5     6\n\nUsually, read_csv() uses the first line of the data for the column names, which is a very common convention. But it’s not uncommon for a few lines of metadata to be included at the top of the file. You can use skip = n to skip the first n lines or use comment = \"#\" to drop all lines that start with (e.g.) #:\n\nread_csv(\n  \"The first line of metadata\n  The second line of metadata\n  x,y,z\n  1,2,3\",\n  skip = 2\n)\n#&gt; # A tibble: 1 × 3\n#&gt;       x     y     z\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1     2     3\n\nread_csv(\n  \"# A comment I want to skip\n  x,y,z\n  1,2,3\",\n  comment = \"#\"\n)\n#&gt; # A tibble: 1 × 3\n#&gt;       x     y     z\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1     2     3\n\nIn other cases, the data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings and instead label them sequentially from X1 to Xn:\n\nread_csv(\n  \"1,2,3\n  4,5,6\",\n  col_names = FALSE\n)\n#&gt; # A tibble: 2 × 3\n#&gt;      X1    X2    X3\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1     2     3\n#&gt; 2     4     5     6\n\nAlternatively, you can pass col_names a character vector which will be used as the column names:\n\nread_csv(\n  \"1,2,3\n  4,5,6\",\n  col_names = c(\"x\", \"y\", \"z\")\n)\n#&gt; # A tibble: 2 × 3\n#&gt;       x     y     z\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1     2     3\n#&gt; 2     4     5     6\n\nThese arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your .csv file and read the documentation for read_csv()’s many other arguments.)\n\n\n8.2.3 Other file types\nOnce you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:\n\nread_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.\nread_tsv() reads tab-delimited files.\nread_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.\nread_fwf() reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().\nread_table() reads a common variation of fixed-width files where columns are separated by white space.\nread_log() reads Apache-style log files.\n\n\n\n8.2.4 Exercises\n\nWhat function would you use to read a file where fields were separated with “|”?\nApart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?\nWhat are the most important arguments to read_fwf()?\nSometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like \" or '. By default, read_csv() assumes that the quoting character will be \". To read the following text into a data frame, what argument to read_csv() do you need to specify?\n\n\"x,y\\n1,'a,b'\"\n\nIdentify what is wrong with each of the following inline CSV files. What happens when you run the code?\n\nread_csv(\"a,b\\n1,2,3\\n4,5,6\")\nread_csv(\"a,b,c\\n1,2\\n1,2,3,4\")\nread_csv(\"a,b\\n\\\"1\")\nread_csv(\"a,b\\n1,2\\na,b\")\nread_csv(\"a;b\\n1;3\")\n\nPractice referring to non-syntactic names in the following data frame by:\n\nExtracting the variable called 1.\nPlotting a scatterplot of 1 vs. 2.\nCreating a new column called 3, which is 2 divided by 1.\nRenaming the columns to one, two, and three.\n\n\nannoying &lt;- tibble(\n  `1` = 1:10,\n  `2` = `1` * 2 + rnorm(length(`1`))\n)",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "data-import.html#sec-col-types",
    "href": "data-import.html#sec-col-types",
    "title": "8  Data import",
    "section": "8.3 Controlling column types",
    "text": "8.3 Controlling column types\nA CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.\n\n8.3.1 Guessing types\nreadr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,0002 rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:\n\nDoes it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical.\nDoes it contain only numbers (e.g., 1, -4.5, 5e6, Inf)? If so, it’s a number.\nDoes it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in Section 18.2).\nOtherwise, it must be a string.\n\nYou can see that behavior in action in this simple example:\n\nread_csv(\"\n  logical,numeric,date,string\n  TRUE,1,2021-01-15,abc\n  false,4.5,2021-02-15,def\n  T,Inf,2021-02-16,ghi\n\")\n#&gt; # A tibble: 3 × 4\n#&gt;   logical numeric date       string\n#&gt;   &lt;lgl&gt;     &lt;dbl&gt; &lt;date&gt;     &lt;chr&gt; \n#&gt; 1 TRUE        1   2021-01-15 abc   \n#&gt; 2 FALSE       4.5 2021-02-15 def   \n#&gt; 3 TRUE      Inf   2021-02-16 ghi\n\nThis heuristic works well if you have a clean dataset, but in real life, you’ll encounter a selection of weird and beautiful failures.\n\n\n8.3.2 Missing values, column types, and problems\nThe most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the NA that readr expects.\nTake this simple 1 column CSV file as an example:\n\nsimple_csv &lt;- \"\n  x\n  10\n  .\n  20\n  30\"\n\nIf we read it without any additional arguments, x becomes a character column:\n\nread_csv(simple_csv)\n#&gt; # A tibble: 4 × 1\n#&gt;   x    \n#&gt;   &lt;chr&gt;\n#&gt; 1 10   \n#&gt; 2 .    \n#&gt; 3 20   \n#&gt; 4 30\n\nIn this very small case, you can easily see the missing value .. But what happens if you have thousands of rows with only a few missing values represented by .s sprinkled among them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list where the names match the column names in the CSV file:\n\ndf &lt;- read_csv(\n  simple_csv, \n  col_types = list(x = col_double())\n)\n#&gt; Warning: One or more parsing issues, call `problems()` on your data frame for\n#&gt; details, e.g.:\n#&gt;   dat &lt;- vroom(...)\n#&gt;   problems(dat)\n\nNow read_csv() reports that there was a problem, and tells us we can find out more with problems():\n\nproblems(df)\n#&gt; # A tibble: 1 × 5\n#&gt;     row   col expected actual file                                           \n#&gt;   &lt;int&gt; &lt;int&gt; &lt;chr&gt;    &lt;chr&gt;  &lt;chr&gt;                                          \n#&gt; 1     3     1 a double .      /private/var/folders/9f/nn2jnl8n1lj1sk3y391hyd…\n\nThis tells us that there was a problem in row 3, col 1 where readr expected a double but got a .. That suggests this dataset uses . for missing values. So then we set na = \".\", the automatic guessing succeeds, giving us the numeric column that we want:\n\nread_csv(simple_csv, na = \".\")\n#&gt; # A tibble: 4 × 1\n#&gt;       x\n#&gt;   &lt;dbl&gt;\n#&gt; 1    10\n#&gt; 2    NA\n#&gt; 3    20\n#&gt; 4    30\n\n\n\n8.3.3 Column types\nreadr provides a total of nine column types for you to use:\n\ncol_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.\ncol_integer() reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.\ncol_character() reads strings. This can be useful to specify explicitly when you have a column that is a numeric identifier, i.e., long series of digits that identifies an object but doesn’t make sense to apply mathematical operations to. Examples include phone numbers, social security numbers, credit card numbers, etc.\ncol_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in Chapter 17 and Chapter 18.\ncol_number() is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in Chapter 14.\ncol_skip() skips a column so it’s not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.\n\nIt’s also possible to override the default column by switching from list() to cols() and specifying .default:\n\nanother_csv &lt;- \"\nx,y,z\n1,2,3\"\n\nread_csv(\n  another_csv, \n  col_types = cols(.default = col_character())\n)\n#&gt; # A tibble: 1 × 3\n#&gt;   x     y     z    \n#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;\n#&gt; 1 1     2     3\n\nAnother useful helper is cols_only() which will read in only the columns you specify:\n\nread_csv(\n  another_csv,\n  col_types = cols_only(x = col_character())\n)\n#&gt; # A tibble: 1 × 1\n#&gt;   x    \n#&gt;   &lt;chr&gt;\n#&gt; 1 1",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "data-import.html#sec-readr-directory",
    "href": "data-import.html#sec-readr-directory",
    "title": "8  Data import",
    "section": "8.4 Reading data from multiple files",
    "text": "8.4 Reading data from multiple files\nSometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With read_csv() you can read these data in at once and stack them on top of each other in a single data frame.\n\nsales_files &lt;- c(\"data/01-sales.csv\", \"data/02-sales.csv\", \"data/03-sales.csv\")\nread_csv(sales_files, id = \"file\")\n#&gt; # A tibble: 19 × 6\n#&gt;   file              month    year brand  item     n\n#&gt;   &lt;chr&gt;             &lt;chr&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 data/01-sales.csv January  2019     1  1234     3\n#&gt; 2 data/01-sales.csv January  2019     1  8721     9\n#&gt; 3 data/01-sales.csv January  2019     1  1822     2\n#&gt; 4 data/01-sales.csv January  2019     2  3333     1\n#&gt; 5 data/01-sales.csv January  2019     2  2156     9\n#&gt; 6 data/01-sales.csv January  2019     2  3987     6\n#&gt; # ℹ 13 more rows\n\nOnce again, the code above will work if you have the CSV files in a data folder in your project. You can download these files from https://pos.it/r4ds-01-sales, https://pos.it/r4ds-02-sales, and https://pos.it/r4ds-03-sales or you can read them directly with:\n\nsales_files &lt;- c(\n  \"https://pos.it/r4ds-01-sales\",\n  \"https://pos.it/r4ds-02-sales\",\n  \"https://pos.it/r4ds-03-sales\"\n)\nread_csv(sales_files, id = \"file\")\n\nThe id argument adds a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.\nIf you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base list.files() function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in Chapter 16.\n\nsales_files &lt;- list.files(\"data\", pattern = \"sales\\\\.csv$\", full.names = TRUE)\nsales_files\n#&gt; [1] \"data/01-sales.csv\" \"data/02-sales.csv\" \"data/03-sales.csv\"",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "data-import.html#sec-writing-to-a-file",
    "href": "data-import.html#sec-writing-to-a-file",
    "title": "8  Data import",
    "section": "8.5 Writing to a file",
    "text": "8.5 Writing to a file\nreadr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). The most important arguments to these functions are x (the data frame to save) and file (the location to save it). You can also specify how missing values are written with na, and if you want to append to an existing file.\n\nwrite_csv(students, \"students.csv\")\n\nNow let’s read that csv file back in. Note that the variable type information that you just set up is lost when you save to CSV because you’re starting over with reading from a plain text file again:\n\nstudents\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan             age\n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;fct&gt;               &lt;dbl&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4\n#&gt; 2          2 Barclay Lynn     French fries       Lunch only              5\n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch     7\n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only             NA\n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5\n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only              6\nwrite_csv(students, \"students-2.csv\")\nread_csv(\"students-2.csv\")\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan             age\n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;dbl&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4\n#&gt; 2          2 Barclay Lynn     French fries       Lunch only              5\n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch     7\n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only             NA\n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5\n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only              6\n\nThis makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two main alternatives:\n\nwrite_rds() and read_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS. This means that when you reload the object, you are loading the exact same R object that you stored.\n\nwrite_rds(students, \"students.rds\")\nread_rds(\"students.rds\")\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan             age\n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;fct&gt;               &lt;dbl&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4\n#&gt; 2          2 Barclay Lynn     French fries       Lunch only              5\n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch     7\n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only             NA\n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5\n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only              6\n\nThe arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in Chapter 23.\n\nlibrary(arrow)\nwrite_parquet(students, \"students.parquet\")\nread_parquet(\"students.parquet\")\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan             age\n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;fct&gt;               &lt;dbl&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4\n#&gt; 2          2 Barclay Lynn     French fries       Lunch only              5\n#&gt; 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7\n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only             NA\n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5\n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only              6\n\n\nParquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "data-import.html#data-entry",
    "href": "data-import.html#data-entry",
    "title": "8  Data import",
    "section": "8.6 Data entry",
    "text": "8.6 Data entry\nSometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script. There are two useful functions to help you do this which differ in whether you layout the tibble by columns or by rows. tibble() works by column:\n\ntibble(\n  x = c(1, 2, 5), \n  y = c(\"h\", \"m\", \"g\"),\n  z = c(0.08, 0.83, 0.60)\n)\n#&gt; # A tibble: 3 × 3\n#&gt;       x y         z\n#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;\n#&gt; 1     1 h      0.08\n#&gt; 2     2 m      0.83\n#&gt; 3     5 g      0.6\n\nLaying out the data by column can make it hard to see how the rows are related, so an alternative is tribble(), short for transposed tibble, which lets you lay out your data row by row. tribble() is customized for data entry in code: column headings start with ~ and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy to read form:\n\ntribble(\n  ~x, ~y, ~z,\n  1, \"h\", 0.08,\n  2, \"m\", 0.83,\n  5, \"g\", 0.60\n)\n#&gt; # A tibble: 3 × 3\n#&gt;       x y         z\n#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;\n#&gt; 1     1 h      0.08\n#&gt; 2     2 m      0.83\n#&gt; 3     5 g      0.6",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "data-import.html#summary",
    "href": "data-import.html#summary",
    "title": "8  Data import",
    "section": "8.7 Summary",
    "text": "8.7 Summary\nIn this chapter, you’ve learned how to load CSV files with read_csv() and to do your own data entry with tibble() and tribble(). You’ve learned how csv files work, some of the problems you might encounter, and how to overcome them. We’ll come to data import a few times in this book: Chapter 21 from Excel and Google Sheets, Chapter 22 will show you how to load data from databases, Chapter 23 from parquet files, Chapter 24 from JSON, and Chapter 25 from websites.\nWe’re just about at the end of this section of the book, but there’s one important last topic to cover: how to get help. So in the next chapter, you’ll learn some good places to look for help, how to create a reprex to maximize your chances of getting good help, and some general advice on keeping up with the world of R.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "data-import.html#footnotes",
    "href": "data-import.html#footnotes",
    "title": "8  Data import",
    "section": "",
    "text": "The janitor package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use |&gt;.↩︎\nYou can override the default of 1000 with the guess_max argument.↩︎",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>8</span>  <span class='chapter-title'>Data import</span>"
    ]
  },
  {
    "objectID": "workflow-help.html",
    "href": "workflow-help.html",
    "title": "9  Workflow: getting help",
    "section": "",
    "text": "9.1 Google is your friend\nThis book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help and to help you keep learning.\nIf you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Additionally, adding package names like “tidyverse” or “ggplot2” will help narrow down the results to code that will feel more familiar to you as well, e.g., “how to make a boxplot in R” vs. “how to make a boxplot in R with ggplot2”. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn’t in English, run Sys.setenv(LANGUAGE = \"en\") and re-run the code; you’re more likely to find help for English error messages.)\nIf Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer, including [R], to restrict your search to questions and answers that use R.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>9</span>  <span class='chapter-title'>Workflow: getting help</span>"
    ]
  },
  {
    "objectID": "workflow-help.html#making-a-reprex",
    "href": "workflow-help.html#making-a-reprex",
    "title": "9  Workflow: getting help",
    "section": "9.2 Making a reprex",
    "text": "9.2 Making a reprex\nIf your googling doesn’t find anything useful, it’s a really good idea to prepare a reprex, short for minimal reproducible example. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:\n\nFirst, you need to make your code reproducible. This means that you need to capture everything, i.e. include any library() calls and create all necessary objects. The easiest way to make sure you’ve done this is using the reprex package.\nSecond, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you’re facing in real life or even using built-in data.\n\nThat sounds like a lot of work! And it can be, but it has a great payoff:\n\n80% of the time, creating an excellent reprex reveals the source of your problem. It’s amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.\nThe other 20% of the time, you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help!\n\nWhen creating a reprex by hand, it’s easy to accidentally miss something, meaning your code can’t be run on someone else’s computer. Avoid this problem by using the reprex package, which is installed as part of the tidyverse. Let’s say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it):\n\ny &lt;- 1:4\nmean(y)\n\nThen call reprex(), where the default output is formatted for GitHub:\nreprex::reprex()\nA nicely rendered HTML preview will display in RStudio’s Viewer (if you’re in RStudio) or your default browser otherwise. The reprex is automatically copied to your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):\n``` r\ny &lt;- 1:4\nmean(y)\n#&gt; [1] 2.5\n```\nThis text is formatted in a special way, called Markdown, which can be pasted to sites like StackOverflow or Github and they will automatically render it to look like code. Here’s what that Markdown would look like rendered on GitHub:\n\ny &lt;- 1:4\nmean(y)\n#&gt; [1] 2.5\n\nAnyone else can copy, paste, and run this immediately.\nThere are three things you need to include to make your example reproducible: required packages, data, and code.\n\nPackages should be loaded at the top of the script so it’s easy to see which ones the example needs. This is a good time to check that you’re using the latest version of each package; you may have discovered a bug that’s been fixed since you installed or last updated the package. For packages in the tidyverse, the easiest way to check is to run tidyverse_update().\nThe easiest way to include data is to use dput() to generate the R code needed to recreate it. For example, to recreate the mtcars dataset in R, perform the following steps:\n\nRun dput(mtcars) in R\nCopy the output\nIn reprex, type mtcars &lt;-, then paste.\n\nTry to use the smallest subset of your data that still reveals the problem.\nSpend a little bit of time ensuring that your code is easy for others to read:\n\nMake sure you’ve used spaces and your variable names are concise yet informative.\nUse comments to indicate where your problem lies.\nDo your best to remove everything that is not related to the problem.\n\nThe shorter your code is, the easier it is to understand and the easier it is to fix.\n\nFinish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script.\nCreating reprexes is not trivial, and it will take some practice to learn to create good, truly minimal reprexes. However, learning to ask questions that include the code, and investing the time to make it reproducible will continue to pay off as you learn and master R.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>9</span>  <span class='chapter-title'>Workflow: getting help</span>"
    ]
  },
  {
    "objectID": "workflow-help.html#investing-in-yourself",
    "href": "workflow-help.html#investing-in-yourself",
    "title": "9  Workflow: getting help",
    "section": "9.3 Investing in yourself",
    "text": "9.3 Investing in yourself\nYou should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what the tidyverse team is doing on the tidyverse blog. To keep up with the R community more broadly, we recommend reading R Weekly: it’s a community effort to aggregate the most interesting news in the R community each week.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>9</span>  <span class='chapter-title'>Workflow: getting help</span>"
    ]
  },
  {
    "objectID": "workflow-help.html#summary",
    "href": "workflow-help.html#summary",
    "title": "9  Workflow: getting help",
    "section": "9.4 Summary",
    "text": "9.4 Summary\nThis chapter concludes the Whole Game part of the book. You’ve now seen the most important parts of the data science process: visualization, transformation, tidying and importing. Now you’ve got a holistic view of the whole process, and we start to get into the details of small pieces.\nThe next part of the book, Visualize, does a deeper dive into the grammar of graphics and creating data visualizations with ggplot2, showcases how to use the tools you’ve learned so far to conduct exploratory data analysis, and introduces good practices for creating plots for communication.",
    "crumbs": [
      "Whole game",
      "<span class='chapter-number'>9</span>  <span class='chapter-title'>Workflow: getting help</span>"
    ]
  },
  {
    "objectID": "visualize.html",
    "href": "visualize.html",
    "title": "Visualize",
    "section": "",
    "text": "After reading the first part of the book, you understand (at least superficially) the most important tools for doing data science. Now it’s time to start diving into the details. In this part of the book, you’ll learn about visualizing data in further depth.\n\n\n\n\n\n\n\n\nFigure 1: Data visualization is often the first step in data exploration.\n\n\n\n\n\nEach chapter addresses one to a few aspects of creating a data visualization.\n\nIn 10  Layers you will learn about the layered grammar of graphics.\nIn 11  Exploratory data analysis, you’ll combine visualization with your curiosity and skepticism to ask and answer interesting questions about data.\nFinally, in 12  Communication you will learn how to take your exploratory graphics, elevate them, and turn them into expository graphics, graphics that help the newcomer to your analysis understand what’s going on as quickly and easily as possible.\n\nThese three chapters get you started in the world of visualization, but there is much more to learn. The absolute best place to learn more is the ggplot2 book: ggplot2: Elegant graphics for data analysis. It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems. Another great resource is the ggplot2 extensions gallery https://exts.ggplot2.tidyverse.org/gallery/. This site lists many of the packages that extend ggplot2 with new geoms and scales. It’s a great place to start if you’re trying to do something that seems hard with ggplot2.",
    "crumbs": [
      "Visualize"
    ]
  },
  {
    "objectID": "layers.html",
    "href": "layers.html",
    "title": "10  Layers",
    "section": "",
    "text": "10.1 Introduction\nIn Chapter 2, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make any type of plot with ggplot2.\nIn this chapter, you’ll expand on that foundation as you learn about the layered grammar of graphics. We’ll start with a deeper dive into aesthetic mappings, geometric objects, and facets. Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we’ll briefly introduce coordinate systems.\nWe will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#introduction",
    "href": "layers.html#introduction",
    "title": "10  Layers",
    "section": "",
    "text": "10.1.1 Prerequisites\nThis chapter focuses on ggplot2. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:\n\nlibrary(tidyverse)",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#aesthetic-mappings",
    "href": "layers.html#aesthetic-mappings",
    "title": "10  Layers",
    "section": "10.2 Aesthetic mappings",
    "text": "10.2 Aesthetic mappings\n\n“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey\n\nRemember that the mpg data frame bundled with the ggplot2 package contains 234 observations on 38 car models.\n\nmpg\n#&gt; # A tibble: 234 × 11\n#&gt;   manufacturer model displ  year   cyl trans      drv     cty   hwy fl   \n#&gt;   &lt;chr&gt;        &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;      &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;\n#&gt; 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p    \n#&gt; 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p    \n#&gt; 3 audi         a4      2    2008     4 manual(m6) f        20    31 p    \n#&gt; 4 audi         a4      2    2008     4 auto(av)   f        21    30 p    \n#&gt; 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p    \n#&gt; 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p    \n#&gt; # ℹ 228 more rows\n#&gt; # ℹ 1 more variable: class &lt;chr&gt;\n\nAmong the variables in mpg are:\n\ndispl: A car’s engine size, in liters. A numerical variable.\nhwy: A car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. A numerical variable.\nclass: Type of car. A categorical variable.\n\nLet’s start by visualizing the relationship between displ and hwy for various classes of cars. We can do this with a scatterplot where the numerical variables are mapped to the x and y aesthetics and the categorical variable is mapped to an aesthetic like color or shape.\n# Left\nggplot(mpg, aes(x = displ, y = hwy, color = class)) +\n  geom_point()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, shape = class)) +\n  geom_point()\n#&gt; Warning: The shape palette can deal with a maximum of 6 discrete values because more\n#&gt; than 6 becomes difficult to discriminate\n#&gt; ℹ you have requested 7 values. Consider specifying shapes manually if you\n#&gt;   need that many have them.\n#&gt; Warning: Removed 62 rows containing missing values or values outside the scale range\n#&gt; (`geom_point()`).\n\n\n\n\n\n\n\n\n\n\nWhen class is mapped to shape, we get two warnings:\n\n1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.\n2: Removed 62 rows containing missing values (geom_point()).\n\nSince ggplot2 will only use six shapes at a time, by default, additional groups will go unplotted when you use the shape aesthetic. The second warning is related – there are 62 SUVs in the dataset and they’re not plotted.\nSimilarly, we can map class to size or alpha aesthetics as well, which control the size and the transparency of the points, respectively.\n# Left\nggplot(mpg, aes(x = displ, y = hwy, size = class)) +\n  geom_point()\n#&gt; Warning: Using size for a discrete variable is not advised.\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +\n  geom_point()\n#&gt; Warning: Using alpha for a discrete variable is not advised.\n\n\n\n\n\n\n\n\n\n\nBoth of these produce warnings as well:\n\nUsing alpha for a discrete variable is not advised.\n\nMapping an unordered discrete (categorical) variable (class) to an ordered aesthetic (size or alpha) is generally not a good idea because it implies a ranking that does not in fact exist.\nOnce you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line provides the same information as a legend; it explains the mapping between locations and values.\nYou can also set the visual properties of your geom manually as an argument of your geom function (outside of aes()) instead of relying on a variable mapping to determine the appearance. For example, we can make all of the points in our plot blue:\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point(color = \"blue\")\n\n\n\n\n\n\n\n\nHere, the color doesn’t convey information about a variable, but only changes the appearance of the plot. You’ll need to pick a value that makes sense for that aesthetic:\n\nThe name of a color as a character string, e.g., color = \"blue\"\nThe size of a point in mm, e.g., size = 1\nThe shape of a point as a number, e.g, shape = 1, as shown in Figure 10.1.\n\n\n\n\n\n\n\n\n\nFigure 10.1: R has 26 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–20) are filled with color; the filled shapes (21–25) have a border of color and are filled with fill. Shapes are arranged to keep similar shapes next to each other.\n\n\n\n\n\nSo far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at https://ggplot2.tidyverse.org/articles/ggplot2-specs.html.\nThe specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.\n\n10.2.1 Exercises\n\nCreate a scatterplot of hwy vs. displ where the points are pink filled in triangles.\nWhy did the following code not result in a plot with blue points?\n\nggplot(mpg) + \n  geom_point(aes(x = displ, y = hwy, color = \"blue\"))\n\nWhat does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)\nWhat happens if you map an aesthetic to something other than a variable name, like aes(color = displ &lt; 5)? Note, you’ll also need to specify x and y.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#sec-geometric-objects",
    "href": "layers.html#sec-geometric-objects",
    "title": "10  Layers",
    "section": "10.3 Geometric objects",
    "text": "10.3 Geometric objects\nHow are these two plots similar?\n\n\n\n\n\n\n\n\n\n\n\n\n\nBoth plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different geometric object, geom, to represent the data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.\nTo change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use the following code:\n\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_smooth()\n#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\nEvery geom function in ggplot2 takes a mapping argument, either defined locally in the geom layer or globally in the ggplot() layer. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. If you try, ggplot2 will silently ignore that aesthetic mapping. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.\n# Left\nggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + \n  geom_smooth()\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + \n  geom_smooth()\n\n\n\n\n\n\n\n\n\n\nHere, geom_smooth() separates the cars into three lines based on their drv value, which describes a car’s drive train. One line describes all of the points that have a 4 value, one line describes all of the points that have an f value, and one line describes all of the points that have an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.\nIf this sounds strange, we can make it clearer by overlaying the lines on top of the raw data and then coloring everything according to drv.\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) + \n  geom_point() +\n  geom_smooth(aes(linetype = drv))\n\n\n\n\n\n\n\n\nNotice that this plot contains two geoms in the same graph.\nMany geoms, like geom_smooth(), use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_smooth()\n\n# Middle\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_smooth(aes(group = drv))\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_smooth(aes(color = drv), show.legend = FALSE)\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point(aes(color = class)) + \n  geom_smooth()\n\n\n\n\n\n\n\n\nYou can use the same idea to specify different data for each layer. Here, we use red points as well as open circles to highlight two-seater cars. The local data argument in geom_point() overrides the global data argument in ggplot() for that layer only.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point() + \n  geom_point(\n    data = mpg |&gt; filter(class == \"2seater\"), \n    color = \"red\"\n  ) +\n  geom_point(\n    data = mpg |&gt; filter(class == \"2seater\"), \n    shape = \"circle open\", size = 3, color = \"red\"\n  )\n\n\n\n\n\n\n\n\nGeoms are the fundamental building blocks of ggplot2. You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data. For example, the histogram and density plot below reveal that the distribution of highway mileage is bimodal and right skewed while the boxplot reveals two potential outliers.\n# Left\nggplot(mpg, aes(x = hwy)) +\n  geom_histogram(binwidth = 2)\n\n# Middle\nggplot(mpg, aes(x = hwy)) +\n  geom_density()\n\n# Right\nggplot(mpg, aes(x = hwy)) +\n  geom_boxplot()\n\n\n\n\n\n\n\n\n\n\n\n\n\nggplot2 provides more than 40 geoms but these don’t cover all possible plots one could make. If you need a different geom, we recommend looking into extension packages first to see if someone else has already implemented it (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). For example, the ggridges package (https://wilkelab.org/ggridges) is useful for making ridgeline plots, which can be useful for visualizing the density of a numerical variable for different levels of a categorical variable. In the following plot not only did we use a new geom (geom_density_ridges()), but we have also mapped the same variable to multiple aesthetics (drv to y, fill, and color) as well as set an aesthetic (alpha = 0.5) to make the density curves transparent.\n\nlibrary(ggridges)\n\nggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +\n  geom_density_ridges(alpha = 0.5, show.legend = FALSE)\n#&gt; Picking joint bandwidth of 1.28\n\n\n\n\n\n\n\n\nThe best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: https://ggplot2.tidyverse.org/reference. To learn more about any single geom, use the help (e.g., ?geom_smooth).\n\n10.3.1 Exercises\n\nWhat geom would you use to draw a line chart? A boxplot? A histogram? An area chart?\nEarlier in this chapter we used show.legend without explaining it:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_smooth(aes(color = drv), show.legend = FALSE)\n\nWhat does show.legend = FALSE do here? What happens if you remove it? Why do you think we used it earlier?\nWhat does the se argument to geom_smooth() do?\nRecreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it’s drv.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#facets",
    "href": "layers.html#facets",
    "title": "10  Layers",
    "section": "10.4 Facets",
    "text": "10.4 Facets\nIn Chapter 2 you learned about faceting with facet_wrap(), which splits a plot into subplots that each display one subset of the data based on a categorical variable.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point() + \n  facet_wrap(~cyl)\n\n\n\n\n\n\n\n\nTo facet your plot with the combination of two variables, switch from facet_wrap() to facet_grid(). The first argument of facet_grid() is also a formula, but now it’s a double sided formula: rows ~ cols.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point() + \n  facet_grid(drv ~ cyl)\n\n\n\n\n\n\n\n\nBy default each of the facets share the same scale and range for x and y axes. This is useful when you want to compare data across facets but it can be limiting when you want to visualize the relationship within each facet better. Setting the scales argument in a faceting function to \"free_x\" will allow for different scales of x-axis across columns, \"free_y\" will allow for different scales on y-axis across rows, and \"free\" will allow both.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point() + \n  facet_grid(drv ~ cyl, scales = \"free\")\n\n\n\n\n\n\n\n\n\n10.4.1 Exercises\n\nWhat happens if you facet on a continuous variable?\nWhat do the empty cells in the plot above with facet_grid(drv ~ cyl) mean? Run the following code. How do they relate to the resulting plot?\n\nggplot(mpg) + \n  geom_point(aes(x = drv, y = cyl))\n\nWhat plots does the following code make? What does . do?\n\nggplot(mpg) + \n  geom_point(aes(x = displ, y = hwy)) +\n  facet_grid(drv ~ .)\n\nggplot(mpg) + \n  geom_point(aes(x = displ, y = hwy)) +\n  facet_grid(. ~ cyl)\n\nTake the first faceted plot in this section:\n\nggplot(mpg) + \n  geom_point(aes(x = displ, y = hwy)) + \n  facet_wrap(~ cyl, nrow = 2)\n\nWhat are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?\nRead ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?\nWhich of the following plots makes it easier to compare engine size (displ) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?\n\nggplot(mpg, aes(x = displ)) + \n  geom_histogram() + \n  facet_grid(drv ~ .)\n\nggplot(mpg, aes(x = displ)) + \n  geom_histogram() +\n  facet_grid(. ~ drv)\n\nRecreate the following plot using facet_wrap() instead of facet_grid(). How do the positions of the facet labels change?\n\nggplot(mpg) + \n  geom_point(aes(x = displ, y = hwy)) +\n  facet_grid(drv ~ .)",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#statistical-transformations",
    "href": "layers.html#statistical-transformations",
    "title": "10  Layers",
    "section": "10.5 Statistical transformations",
    "text": "10.5 Statistical transformations\nConsider a basic bar chart, drawn with geom_bar() or geom_col(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.\n\nggplot(diamonds, aes(x = cut)) + \n  geom_bar()\n\n\n\n\n\n\n\n\nOn the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:\n\nBar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.\nSmoothers fit a model to your data and then plot predictions from the model.\nBoxplots compute the five-number summary of the distribution and then display that summary as a specially formatted box.\n\nThe algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. Figure 10.2 shows how this process works with geom_bar().\n\n\n\n\n\n\n\n\nFigure 10.2: When creating a bar chart we first start with the raw data, then aggregate it to count the number of observations in each bar, and finally map those computed variables to plot aesthetics.\n\n\n\n\n\nYou can learn which stat a geom uses by inspecting the default value for the stat argument. For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(). If you scroll down, the section called “Computed variables” explains that it computes two new variables: count and prop.\nEvery geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:\n\nYou might want to override the default stat. In the code below, we change the stat of geom_bar() from count (the default) to identity. This lets us map the height of the bars to the raw values of a y variable.\n\ndiamonds |&gt;\n  count(cut) |&gt;\n  ggplot(aes(x = cut, y = n)) +\n  geom_bar(stat = \"identity\")\n\n\n\n\n\n\n\n\nYou might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:\n\nggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) + \n  geom_bar()\n\n\n\n\n\n\n\n\nTo find the possible variables that can be computed by the stat, look for the section titled “computed variables” in the help for geom_bar().\nYou might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:\n\nggplot(diamonds) + \n  stat_summary(\n    aes(x = cut, y = depth),\n    fun.min = min,\n    fun.max = max,\n    fun = median\n  )\n\n\n\n\n\n\n\n\n\nggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g., ?stat_bin.\n\n10.5.1 Exercises\n\nWhat is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?\nWhat does geom_col() do? How is it different from geom_bar()?\nMost geoms and stats come in pairs that are almost always used in concert. Make a list of all the pairs. What do they have in common? (Hint: Read through the documentation.)\nWhat variables does stat_smooth() compute? What arguments control its behavior?\nIn our proportion bar chart, we needed to set group = 1. Why? In other words, what is the problem with these two graphs?\n\nggplot(diamonds, aes(x = cut, y = after_stat(prop))) + \n  geom_bar()\nggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) + \n  geom_bar()",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#position-adjustments",
    "href": "layers.html#position-adjustments",
    "title": "10  Layers",
    "section": "10.6 Position adjustments",
    "text": "10.6 Position adjustments\nThere’s one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic, or, more usefully, the fill aesthetic:\n# Left\nggplot(mpg, aes(x = drv, color = drv)) + \n  geom_bar()\n\n# Right\nggplot(mpg, aes(x = drv, fill = drv)) + \n  geom_bar()\n\n\n\n\n\n\n\n\n\n\nNote what happens if you map the fill aesthetic to another variable, like class: the bars are automatically stacked. Each colored rectangle represents a combination of drv and class.\n\nggplot(mpg, aes(x = drv, fill = class)) + \n  geom_bar()\n\n\n\n\n\n\n\n\nThe stacking is performed automatically using the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: \"identity\", \"dodge\" or \"fill\".\n\nposition = \"identity\" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.\n# Left\nggplot(mpg, aes(x = drv, fill = class)) + \n  geom_bar(alpha = 1/5, position = \"identity\")\n\n# Right\nggplot(mpg, aes(x = drv, color = class)) + \n  geom_bar(fill = NA, position = \"identity\")\n\n\n\n\n\n\n\n\n\n\nThe identity position adjustment is more useful for 2d geoms, like points, where it is the default.\nposition = \"fill\" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.\nposition = \"dodge\" places overlapping objects directly beside one another. This makes it easier to compare individual values.\n# Left\nggplot(mpg, aes(x = drv, fill = class)) + \n  geom_bar(position = \"fill\")\n\n# Right\nggplot(mpg, aes(x = drv, fill = class)) + \n  geom_bar(position = \"dodge\")\n\n\n\n\n\n\n\n\n\n\n\nThere’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?\n\n\n\n\n\n\n\n\n\nThe underlying values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?\nYou can avoid this gridding by setting the position adjustment to “jitter”. position = \"jitter\" adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.\n\nggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point(position = \"jitter\")\n\n\n\n\n\n\n\n\nAdding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = \"jitter\"): geom_jitter().\nTo learn more about a position adjustment, look up the help page associated with each adjustment: ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack.\n\n10.6.1 Exercises\n\nWhat is the problem with the following plot? How could you improve it?\n\nggplot(mpg, aes(x = cty, y = hwy)) + \n  geom_point()\n\nWhat, if anything, is the difference between the two plots? Why?\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point()\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(position = \"identity\")\n\nWhat parameters to geom_jitter() control the amount of jittering?\nCompare and contrast geom_jitter() with geom_count().\nWhat’s the default position adjustment for geom_boxplot()? Create a visualization of the mpg dataset that demonstrates it.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#coordinate-systems",
    "href": "layers.html#coordinate-systems",
    "title": "10  Layers",
    "section": "10.7 Coordinate systems",
    "text": "10.7 Coordinate systems\nCoordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are two other coordinate systems that are occasionally helpful.\n\ncoord_quickmap() sets the aspect ratio correctly for geographic maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the Maps chapter of ggplot2: Elegant graphics for data analysis.\nnz &lt;- map_data(\"nz\")\n\nggplot(nz, aes(x = long, y = lat, group = group)) +\n  geom_polygon(fill = \"white\", color = \"black\")\n\nggplot(nz, aes(x = long, y = lat, group = group)) +\n  geom_polygon(fill = \"white\", color = \"black\") +\n  coord_quickmap()\n\n\n\n\n\n\n\n\n\n\ncoord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.\nbar &lt;- ggplot(data = diamonds) + \n  geom_bar(\n    mapping = aes(x = clarity, fill = clarity), \n    show.legend = FALSE,\n    width = 1\n  ) + \n  theme(aspect.ratio = 1)\n\nbar + coord_flip()\nbar + coord_polar()\n\n\n\n\n\n\n\n\n\n\n\n\n10.7.1 Exercises\n\nTurn a stacked bar chart into a pie chart using coord_polar().\nWhat’s the difference between coord_quickmap() and coord_map()?\nWhat does the following plot tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?\n\nggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +\n  geom_point() + \n  geom_abline() +\n  coord_fixed()",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#the-layered-grammar-of-graphics",
    "href": "layers.html#the-layered-grammar-of-graphics",
    "title": "10  Layers",
    "section": "10.8 The layered grammar of graphics",
    "text": "10.8 The layered grammar of graphics\nWe can expand on the graphing template you learned in Section 2.3 by adding position adjustments, stats, coordinate systems, and faceting:\nggplot(data = &lt;DATA&gt;) + \n  &lt;GEOM_FUNCTION&gt;(\n     mapping = aes(&lt;MAPPINGS&gt;),\n     stat = &lt;STAT&gt;, \n     position = &lt;POSITION&gt;\n  ) +\n  &lt;COORDINATE_FUNCTION&gt; +\n  &lt;FACET_FUNCTION&gt;\nOur new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.\nThe seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, a faceting scheme, and a theme.\nTo see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic. These steps are illustrated in Figure 10.3. You’d then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables.\n\n\n\n\n\n\n\n\nFigure 10.3: Steps for going from raw data to a table of frequencies to a bar plot where the heights of the bar represent the frequencies.\n\n\n\n\n\nAt this point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.\nYou could use this method to build any plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.\nIf you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “The Layered Grammar of Graphics”, the scientific paper that describes the theory of ggplot2 in detail.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "layers.html#summary",
    "href": "layers.html#summary",
    "title": "10  Layers",
    "section": "10.9 Summary",
    "text": "10.9 Summary\nIn this chapter you learned about the layered grammar of graphics starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems which allow you to fundamentally change what x and y mean. One layer we have not yet touched on is theme, which we will introduce in Section 12.5.\nTwo very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at https://posit.co/resources/cheatsheets) and the ggplot2 package website (https://ggplot2.tidyverse.org).\nAn important lesson you should take from this chapter is that when you feel the need for a geom that is not provided by ggplot2, it’s always a good idea to look into whether someone else has already solved your problem by creating a ggplot2 extension package that offers that geom.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>10</span>  <span class='chapter-title'>Layers</span>"
    ]
  },
  {
    "objectID": "EDA.html",
    "href": "EDA.html",
    "title": "11  Exploratory data analysis",
    "section": "",
    "text": "11.1 Introduction\nThis chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:\nEDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive insights that you’ll eventually write up and communicate to others.\nEDA is an important part of any data analysis, even if the primary research questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "EDA.html#introduction",
    "href": "EDA.html#introduction",
    "title": "11  Exploratory data analysis",
    "section": "",
    "text": "Generate questions about your data.\nSearch for answers by visualizing, transforming, and modelling your data.\nUse what you learn to refine your questions and/or generate new questions.\n\n\n\n\n11.1.1 Prerequisites\nIn this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.\n\nlibrary(tidyverse)",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "EDA.html#questions",
    "href": "EDA.html#questions",
    "title": "11  Exploratory data analysis",
    "section": "11.2 Questions",
    "text": "11.2 Questions\n\n“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox\n\n\n“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey\n\nYour goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.\nEDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.\nThere is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:\n\nWhat type of variation occurs within my variables?\nWhat type of covariation occurs between my variables?\n\nThe rest of this chapter will look at these two questions. We’ll explain what variation and covariation are, and we’ll show you several ways to answer each question.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "EDA.html#variation",
    "href": "EDA.html#variation",
    "title": "11  Exploratory data analysis",
    "section": "11.3 Variation",
    "text": "11.3 Variation\nVariation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g., the eye colors of different people) or at different times (e.g., the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how that it varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values, which you’ve learned about in Chapter 2.\nWe’ll start our exploration by visualizing the distribution of weights (carat) of ~54,000 diamonds from the diamonds dataset. Since carat is a numerical variable, we can use a histogram:\n\nggplot(diamonds, aes(x = carat)) +\n  geom_histogram(binwidth = 0.5)\n\n\n\n\n\n\n\n\nNow that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We’ve put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).\n\n11.3.1 Typical values\nIn both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:\n\nWhich values are the most common? Why?\nWhich values are rare? Why? Does that match your expectations?\nCan you see any unusual patterns? What might explain them?\n\nLet’s take a look at the distribution of carat for smaller diamonds.\n\nsmaller &lt;- diamonds |&gt; \n  filter(carat &lt; 3)\n\nggplot(smaller, aes(x = carat)) +\n  geom_histogram(binwidth = 0.01)\n\n\n\n\n\n\n\n\nThis histogram suggests several interesting questions:\n\nWhy are there more diamonds at whole carats and common fractions of carats?\nWhy are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?\n\nVisualizations can also reveal clusters, which suggest that subgroups exist in your data. To understand the subgroups, ask:\n\nHow are the observations within each subgroup similar to each other?\nHow are the observations in separate clusters different from each other?\nHow can you explain or describe the clusters?\nWhy might the appearance of clusters be misleading?\n\nSome of these questions can be answered with the data while some will require domain expertise about the data. Many of them will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable. We’ll get to that shortly.\n\n\n11.3.2 Unusual values\nOutliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors, sometimes they are simply values at the extremes that happened to be observed in this data collection, and other times they suggest important new discoveries. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.\n\nggplot(diamonds, aes(x = y)) + \n  geom_histogram(binwidth = 0.5)\n\n\n\n\n\n\n\n\nThere are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with coord_cartesian():\n\nggplot(diamonds, aes(x = y)) + \n  geom_histogram(binwidth = 0.5) +\n  coord_cartesian(ylim = c(0, 50))\n\n\n\n\n\n\n\n\ncoord_cartesian() also has an xlim() argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.\nThis allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:\n\nunusual &lt;- diamonds |&gt; \n  filter(y &lt; 3 | y &gt; 20) |&gt; \n  select(price, x, y, z) |&gt;\n  arrange(y)\nunusual\n#&gt; # A tibble: 9 × 4\n#&gt;   price     x     y     z\n#&gt;   &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  5139  0      0    0   \n#&gt; 2  6381  0      0    0   \n#&gt; 3 12800  0      0    0   \n#&gt; 4 15686  0      0    0   \n#&gt; 5 18034  0      0    0   \n#&gt; 6  2130  0      0    0   \n#&gt; 7  2130  0      0    0   \n#&gt; 8  2075  5.15  31.8  5.12\n#&gt; 9 12210  8.09  58.9  8.06\n\nThe y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. By doing EDA, we have discovered missing data that was coded as 0, which we never would have found by simply searching for NAs. Going forward we might choose to re-code these values as NAs in order to prevent misleading calculations. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!\nIt’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g., a data entry error) and disclose that you removed them in your write-up.\n\n\n11.3.3 Exercises\n\nExplore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.\nExplore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)\nHow many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?\nCompare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "EDA.html#sec-unusual-values-eda",
    "href": "EDA.html#sec-unusual-values-eda",
    "title": "11  Exploratory data analysis",
    "section": "11.4 Unusual values",
    "text": "11.4 Unusual values\nIf you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.\n\nDrop the entire row with the strange values:\n\ndiamonds2 &lt;- diamonds |&gt; \n  filter(between(y, 3, 20))\n\nWe don’t recommend this option because one invalid value doesn’t imply that all the other values for that observation are also invalid. Additionally, if you have low quality data, by the time that you’ve applied this approach to every variable you might find that you don’t have any data left!\nInstead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the if_else() function to replace unusual values with NA:\n\ndiamonds2 &lt;- diamonds |&gt; \n  mutate(y = if_else(y &lt; 3 | y &gt; 20, NA, y))\n\n\nIt’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:\n\nggplot(diamonds2, aes(x = x, y = y)) + \n  geom_point()\n#&gt; Warning: Removed 9 rows containing missing values or values outside the scale range\n#&gt; (`geom_point()`).\n\n\n\n\n\n\n\n\nTo suppress that warning, set na.rm = TRUE:\n\nggplot(diamonds2, aes(x = x, y = y)) + \n  geom_point(na.rm = TRUE)\n\nOther times you want to understand what makes observations with missing values different to observations with recorded values. For example, in nycflights13::flights1, missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable, using is.na() to check if dep_time is missing.\n\nnycflights13::flights |&gt; \n  mutate(\n    cancelled = is.na(dep_time),\n    sched_hour = sched_dep_time %/% 100,\n    sched_min = sched_dep_time %% 100,\n    sched_dep_time = sched_hour + (sched_min / 60)\n  ) |&gt; \n  ggplot(aes(x = sched_dep_time)) + \n  geom_freqpoly(aes(color = cancelled), binwidth = 1/4)\n\n\n\n\n\n\n\n\nHowever this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.\n\n11.4.1 Exercises\n\nWhat happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?\nWhat does na.rm = TRUE do in mean() and sum()?\nRecreate the frequency plot of scheduled_dep_time colored by whether the flight was cancelled or not. Also facet by the cancelled variable. Experiment with different values of the scales variable in the faceting function to mitigate the effect of more non-cancelled flights than cancelled flights.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "EDA.html#covariation",
    "href": "EDA.html#covariation",
    "title": "11  Exploratory data analysis",
    "section": "11.5 Covariation",
    "text": "11.5 Covariation\nIf variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.\n\n11.5.1 A categorical and a numerical variable\nFor example, let’s explore how the price of a diamond varies with its quality (measured by cut) using geom_freqpoly():\n\nggplot(diamonds, aes(x = price)) + \n  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)\n\n\n\n\n\n\n\n\nNote that ggplot2 uses an ordered color scale for cut because it’s defined as an ordered factor variable in the data. You’ll learn more about these in Section 17.6.\nThe default appearance of geom_freqpoly() is not that useful here because the height, determined by the overall count, differs so much across cuts, making it hard to see the differences in the shapes of their distributions.\nTo make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the density, which is the count standardized so that the area under each frequency polygon is one.\n\nggplot(diamonds, aes(x = price, y = after_stat(density))) + \n  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)\n\n\n\n\n\n\n\n\nNote that we’re mapping the density to y, but since density is not a variable in the diamonds dataset, we need to first calculate it. We use the after_stat() function to do so.\nThere’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot.\nA visually simpler plot for exploring this relationship is using side-by-side boxplots.\n\nggplot(diamonds, aes(x = cut, y = price)) +\n  geom_boxplot()\n\n\n\n\n\n\n\n\nWe see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are typically cheaper! In the exercises, you’ll be challenged to figure out why.\ncut is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with fct_reorder(). You’ll learn more about that function in Section 17.4, but we want to give you a quick preview here because it’s so useful. For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes:\n\nggplot(mpg, aes(x = class, y = hwy)) +\n  geom_boxplot()\n\n\n\n\n\n\n\n\nTo make the trend easier to see, we can reorder class based on the median value of hwy:\n\nggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) +\n  geom_boxplot()\n\n\n\n\n\n\n\n\nIf you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.\n\nggplot(mpg, aes(x = hwy, y = fct_reorder(class, hwy, median))) +\n  geom_boxplot()\n\n\n\n\n\n\n\n\n\n11.5.1.1 Exercises\n\nUse what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.\nBased on EDA, what variable in the diamonds dataset appears to be most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?\nInstead of exchanging the x and y variables, add coord_flip() as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?\nOne problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?\nCreate a visualization of diamond prices vs. a categorical variable from the diamonds dataset using geom_violin(), then a faceted geom_histogram(), then a colored geom_freqpoly(), and then a colored geom_density(). Compare and contrast the four plots. What are the pros and cons of each method of visualizing the distribution of a numerical variable based on the levels of a categorical variable?\nIf you have a small dataset, it’s sometimes useful to use geom_jitter() to avoid overplotting to more easily see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.\n\n\n\n\n11.5.2 Two categorical variables\nTo visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in geom_count():\n\nggplot(diamonds, aes(x = cut, y = color)) +\n  geom_count()\n\n\n\n\n\n\n\n\nThe size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.\nAnother approach for exploring the relationship between these variables is computing the counts with dplyr:\n\ndiamonds |&gt; \n  count(color, cut)\n#&gt; # A tibble: 35 × 3\n#&gt;   color cut           n\n#&gt;   &lt;ord&gt; &lt;ord&gt;     &lt;int&gt;\n#&gt; 1 D     Fair        163\n#&gt; 2 D     Good        662\n#&gt; 3 D     Very Good  1513\n#&gt; 4 D     Premium    1603\n#&gt; 5 D     Ideal      2834\n#&gt; 6 E     Fair        224\n#&gt; # ℹ 29 more rows\n\nThen visualize with geom_tile() and the fill aesthetic:\n\ndiamonds |&gt; \n  count(color, cut) |&gt;  \n  ggplot(aes(x = color, y = cut)) +\n  geom_tile(aes(fill = n))\n\n\n\n\n\n\n\n\nIf the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.\n\n11.5.2.1 Exercises\n\nHow could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?\nWhat different data insights do you get with a segmented bar chart if color is mapped to the x aesthetic and cut is mapped to the fill aesthetic? Calculate the counts that fall into each of the segments.\nUse geom_tile() together with dplyr to explore how average flight departure delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?\n\n\n\n\n11.5.3 Two numerical variables\nYou’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with geom_point(). You can see covariation as a pattern in the points. For example, you can see a positive relationship between the carat size and price of a diamond: diamonds with more carats have a higher price. The relationship is exponential.\n\nggplot(smaller, aes(x = carat, y = price)) +\n  geom_point()\n\n\n\n\n\n\n\n\n(In this section we’ll use the smaller dataset to stay focused on the bulk of the diamonds that are smaller than 3 carats)\nScatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black, making it hard to judge differences in the density of the data across the 2-dimensional space as well as making it hard to spot the trend. You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency.\n\nggplot(smaller, aes(x = carat, y = price)) + \n  geom_point(alpha = 1 / 100)\n\n\n\n\n\n\n\n\nBut using transparency can be challenging for very large datasets. Another solution is to use bin. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.\ngeom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You will need to install the hexbin package to use geom_hex().\nggplot(smaller, aes(x = carat, y = price)) +\n  geom_bin2d()\n\n# install.packages(\"hexbin\")\nggplot(smaller, aes(x = carat, y = price)) +\n  geom_hex()\n#&gt; Warning: Computation failed in `stat_binhex()`.\n#&gt; Caused by error in `compute_group()`:\n#&gt; ! The package \"hexbin\" is required for `stat_bin_hex()`.\n\n\n\n\n\n\n\n\n\n\nAnother option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then for each group, display a boxplot:\n\nggplot(smaller, aes(x = carat, y = price)) + \n  geom_boxplot(aes(group = cut_width(carat, 0.1)))\n\n\n\n\n\n\n\n\ncut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE.\n\n11.5.3.1 Exercises\n\nInstead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs. cut_number()? How does that impact a visualization of the 2d distribution of carat and price?\nVisualize the distribution of carat, partitioned by price.\nHow does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?\nCombine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.\nTwo dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the following plot have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately. Why is a scatterplot a better display than a binned plot for this case?\n\ndiamonds |&gt; \n  filter(x &gt;= 4) |&gt; \n  ggplot(aes(x = x, y = y)) +\n  geom_point() +\n  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))\n\nInstead of creating boxes of equal width with cut_width(), we could create boxes that contain roughly equal number of points with cut_number(). What are the advantages and disadvantages of this approach?\n\nggplot(smaller, aes(x = carat, y = price)) + \n  geom_boxplot(aes(group = cut_number(carat, 20)))",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "EDA.html#patterns-and-models",
    "href": "EDA.html#patterns-and-models",
    "title": "11  Exploratory data analysis",
    "section": "11.6 Patterns and models",
    "text": "11.6 Patterns and models\nIf a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:\n\nCould this pattern be due to coincidence (i.e. random chance)?\nHow can you describe the relationship implied by the pattern?\nHow strong is the relationship implied by the pattern?\nWhat other variables might affect the relationship?\nDoes the relationship change if you look at individual subgroups of the data?\n\nPatterns in your data provide clues about relationships, i.e., they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.\nModels are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of price and carat, we log transform them first, and fit a model to the log-transformed values. Then, we exponentiate the residuals to put them back in the scale of raw prices.\n\nlibrary(tidymodels)\n\ndiamonds &lt;- diamonds |&gt;\n  mutate(\n    log_price = log(price),\n    log_carat = log(carat)\n  )\n\ndiamonds_fit &lt;- linear_reg() |&gt;\n  fit(log_price ~ log_carat, data = diamonds)\n\ndiamonds_aug &lt;- augment(diamonds_fit, new_data = diamonds) |&gt;\n  mutate(.resid = exp(.resid))\n\nggplot(diamonds_aug, aes(x = carat, y = .resid)) + \n  geom_point()\n\n\n\n\n\n\n\n\nOnce you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.\n\nggplot(diamonds_aug, aes(x = cut, y = .resid)) + \n  geom_boxplot()\n\n\n\n\n\n\n\n\nWe’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "EDA.html#summary",
    "href": "EDA.html#summary",
    "title": "11  Exploratory data analysis",
    "section": "11.7 Summary",
    "text": "11.7 Summary\nIn this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.\nIn the next chapter, we’ll focus on the tools we can use to communicate our results.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "EDA.html#footnotes",
    "href": "EDA.html#footnotes",
    "title": "11  Exploratory data analysis",
    "section": "",
    "text": "Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function() or package::dataset.↩︎",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>11</span>  <span class='chapter-title'>Exploratory data analysis</span>"
    ]
  },
  {
    "objectID": "communication.html",
    "href": "communication.html",
    "title": "12  Communication",
    "section": "",
    "text": "12.1 Introduction\nIn Chapter 11, you learned how to use plots as tools for exploration. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then move on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.\nNow that you understand your data, you need to communicate your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.\nThis chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like The Truthful Art, by Albert Cairo. It doesn’t teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "communication.html#introduction",
    "href": "communication.html#introduction",
    "title": "12  Communication",
    "section": "",
    "text": "12.1.1 Prerequisites\nIn this chapter, we’ll focus once again on ggplot2. We’ll also use a little dplyr for data manipulation, scales to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including ggrepel (https://ggrepel.slowkow.com) by Kamil Slowikowski and patchwork (https://patchwork.data-imaginist.com) by Thomas Lin Pedersen. Don’t forget that you’ll need to install those packages with install.packages() if you don’t already have them.\n\nlibrary(tidyverse)\nlibrary(scales)\nlibrary(ggrepel)\nlibrary(patchwork)",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "communication.html#labels",
    "href": "communication.html#labels",
    "title": "12  Communication",
    "section": "12.2 Labels",
    "text": "12.2 Labels\nThe easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the labs() function.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = class)) +\n  geom_smooth(se = FALSE) +\n  labs(\n    x = \"Engine displacement (L)\",\n    y = \"Highway fuel economy (mpg)\",\n    color = \"Car type\",\n    title = \"Fuel efficiency generally decreases with engine size\",\n    subtitle = \"Two seaters (sports cars) are an exception because of their light weight\",\n    caption = \"Data from fueleconomy.gov\"\n  )\n\n\n\n\n\n\n\n\nThe purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g., “A scatterplot of engine displacement vs. fuel economy”.\nIf you need to add more text, there are two other useful labels: subtitle adds additional detail in a smaller font beneath the title and caption adds text at the bottom right of the plot, often used to describe the source of the data. You can also use labs() to replace the axis and legend titles. It’s usually a good idea to replace short variable names with more detailed descriptions, and to include the units.\nIt’s possible to use mathematical equations instead of text strings. Just switch \"\" out for quote() and read about the available options in ?plotmath:\n\ndf &lt;- tibble(\n  x = 1:10,\n  y = cumsum(x^2)\n)\n\nggplot(df, aes(x, y)) +\n  geom_point() +\n  labs(\n    x = quote(x[i]),\n    y = quote(sum(x[i] ^ 2, i == 1, n))\n  )\n\n\n\n\n\n\n\n\n\n12.2.1 Exercises\n\nCreate one plot on the fuel economy data with customized title, subtitle, caption, x, y, and color labels.\nRecreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.\n\n\n\n\n\n\n\n\n\nTake an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "communication.html#annotations",
    "href": "communication.html#annotations",
    "title": "12  Communication",
    "section": "12.3 Annotations",
    "text": "12.3 Annotations\nIn addition to labelling major components of your plot, it’s often useful to label individual observations or groups of observations. The first tool you have at your disposal is geom_text(). geom_text() is similar to geom_point(), but it has an additional aesthetic: label. This makes it possible to add textual labels to your plots.\nThere are two possible sources of labels. First, you might have a tibble that provides labels. In the following plot we pull out the cars with the highest engine size in each drive type and save their information as a new data frame called label_info.\n\nlabel_info &lt;- mpg |&gt;\n  group_by(drv) |&gt;\n  arrange(desc(displ)) |&gt;\n  slice_head(n = 1) |&gt;\n  mutate(\n    drive_type = case_when(\n      drv == \"f\" ~ \"front-wheel drive\",\n      drv == \"r\" ~ \"rear-wheel drive\",\n      drv == \"4\" ~ \"4-wheel drive\"\n    )\n  ) |&gt;\n  select(displ, hwy, drv, drive_type)\n\nlabel_info\n#&gt; # A tibble: 3 × 4\n#&gt; # Groups:   drv [3]\n#&gt;   displ   hwy drv   drive_type       \n#&gt;   &lt;dbl&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;            \n#&gt; 1   6.5    17 4     4-wheel drive    \n#&gt; 2   5.3    25 f     front-wheel drive\n#&gt; 3   7      24 r     rear-wheel drive\n\nThen, we use this new data frame to directly label the three groups to replace the legend with labels placed directly on the plot. Using the fontface and size arguments we can customize the look of the text labels. They’re larger than the rest of the text on the plot and bolded. (theme(legend.position = \"none\") turns all the legends off — we’ll talk about it more shortly.)\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n  geom_point(alpha = 0.3) +\n  geom_smooth(se = FALSE) +\n  geom_text(\n    data = label_info, \n    aes(x = displ, y = hwy, label = drive_type),\n    fontface = \"bold\", size = 5, hjust = \"right\", vjust = \"bottom\"\n  ) +\n  theme(legend.position = \"none\")\n#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\n\n\n\n\nNote the use of hjust (horizontal justification) and vjust (vertical justification) to control the alignment of the label.\nHowever the annotated plot we made above is hard to read because the labels overlap with each other, and with the points. We can use the geom_label_repel() function from the ggrepel package to address both of these issues. This useful package will automatically adjust labels so that they don’t overlap:\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n  geom_point(alpha = 0.3) +\n  geom_smooth(se = FALSE) +\n  geom_label_repel(\n    data = label_info, \n    aes(x = displ, y = hwy, label = drive_type),\n    fontface = \"bold\", size = 5, nudge_y = 2\n  ) +\n  theme(legend.position = \"none\")\n#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\n\n\n\n\nYou can also use the same idea to highlight certain points on a plot with geom_text_repel() from the ggrepel package. Note another handy technique used here: we added a second layer of large, hollow points to further highlight the labelled points.\n\npotential_outliers &lt;- mpg |&gt;\n  filter(hwy &gt; 40 | (hwy &gt; 20 & displ &gt; 5))\n  \nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point() +\n  geom_text_repel(data = potential_outliers, aes(label = model)) +\n  geom_point(data = potential_outliers, color = \"red\") +\n  geom_point(\n    data = potential_outliers,\n    color = \"red\", size = 3, shape = \"circle open\"\n  )\n\n\n\n\n\n\n\n\nRemember, in addition to geom_text() and geom_label(), you have many other geoms in ggplot2 available to help annotate your plot. A couple ideas:\n\nUse geom_hline() and geom_vline() to add reference lines. We often make them thick (linewidth = 2) and white (color = white), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.\nUse geom_rect() to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics xmin, xmax, ymin, ymax. Alternatively, look into the ggforce package, specifically geom_mark_hull(), which allows you to annotate subsets of points with hulls.\nUse geom_segment() with the arrow argument to draw attention to a point with an arrow. Use aesthetics x and y to define the starting location, and xend and yend to define the end location.\n\nAnother handy function for adding annotations to plots is annotate(). As a rule of thumb, geoms are generally useful for highlighting a subset of the data while annotate() is useful for adding one or few annotation elements to a plot.\nTo demonstrate using annotate(), let’s create some text to add to our plot. The text is a bit long, so we’ll use stringr::str_wrap() to automatically add line breaks to it given the number of characters you want per line:\n\ntrend_text &lt;- \"Larger engine sizes tend to have lower fuel economy.\" |&gt;\n  str_wrap(width = 30)\ntrend_text\n#&gt; [1] \"Larger engine sizes tend to\\nhave lower fuel economy.\"\n\nThen, we add two layers of annotation: one with a label geom and the other with a segment geom. The x and y aesthetics in both define where the annotation should start, and the xend and yend aesthetics in the segment annotation define the end location of the segment. Note also that the segment is styled as an arrow.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point() +\n  annotate(\n    geom = \"label\", x = 3.5, y = 38,\n    label = trend_text,\n    hjust = \"left\", color = \"red\"\n  ) +\n  annotate(\n    geom = \"segment\",\n    x = 3, y = 35, xend = 5, yend = 25, color = \"red\",\n    arrow = arrow(type = \"closed\")\n  )\n\n\n\n\n\n\n\n\nAnnotation is a powerful tool for communicating main takeaways and interesting features of your visualizations. The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!\n\n12.3.1 Exercises\n\nUse geom_text() with infinite positions to place text at the four corners of the plot.\nUse annotate() to add a point geom in the middle of your last plot without having to create a tibble. Customize the shape, size, or color of the point.\nHow do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the dataset that is being passed to geom_text().)\nWhat arguments to geom_label() control the appearance of the background box?\nWhat are the four arguments to arrow()? How do they work? Create a series of plots that demonstrate the most important options.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "communication.html#scales",
    "href": "communication.html#scales",
    "title": "12  Communication",
    "section": "12.4 Scales",
    "text": "12.4 Scales\nThe third way you can make your plot better for communication is to adjust the scales. Scales control how the aesthetic mappings manifest visually.\n\n12.4.1 Default scales\nNormally, ggplot2 automatically adds scales for you. For example, when you type:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = class))\n\nggplot2 automatically adds default scales behind the scenes:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = class)) +\n  scale_x_continuous() +\n  scale_y_continuous() +\n  scale_color_discrete()\n\nNote the naming scheme for scales: scale_ followed by the name of the aesthetic, then _, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. scale_x_continuous() puts the numeric values from displ on a continuous number line on the x-axis, scale_color_discrete() chooses colors for each of the class of car, etc. There are lots of non-default scales which you’ll learn about below.\nThe default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:\n\nYou might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.\nYou might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.\n\n\n\n12.4.2 Axis ticks and legend keys\nCollectively axes and legends are called guides. Axes are used for x and y aesthetics; legends are used for everything else.\nThere are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: breaks and labels. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key. The most common use of breaks is to override the default choice:\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n  geom_point() +\n  scale_y_continuous(breaks = seq(15, 40, by = 5)) \n\n\n\n\n\n\n\n\nYou can use labels in the same way (a character vector the same length as breaks), but you can also set it to NULL to suppress the labels altogether. This can be useful for maps, or for publishing plots where you can’t share the absolute numbers. You can also use breaks and labels to control the appearance of legends. For discrete scales for categorical variables, labels can be a named list of the existing level names and the desired labels for them.\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n  geom_point() +\n  scale_x_continuous(labels = NULL) +\n  scale_y_continuous(labels = NULL) +\n  scale_color_discrete(labels = c(\"4\" = \"4-wheel\", \"f\" = \"front\", \"r\" = \"rear\"))\n\n\n\n\n\n\n\n\nThe labels argument coupled with labelling functions from the scales package is also useful for formatting numbers as currency, percent, etc. The plot on the left shows default labelling with label_dollar(), which adds a dollar sign as well as a thousand separator comma. The plot on the right adds further customization by dividing dollar values by 1,000 and adding a suffix “K” (for “thousands”) as well as adding custom breaks. Note that breaks is in the original scale of the data.\n# Left\nggplot(diamonds, aes(x = price, y = cut)) +\n  geom_boxplot(alpha = 0.05) +\n  scale_x_continuous(labels = label_dollar())\n\n# Right\nggplot(diamonds, aes(x = price, y = cut)) +\n  geom_boxplot(alpha = 0.05) +\n  scale_x_continuous(\n    labels = label_dollar(scale = 1/1000, suffix = \"K\"), \n    breaks = seq(1000, 19000, by = 6000)\n  )\n\n\n\n\n\n\n\n\n\n\nAnother handy label function is label_percent():\n\nggplot(diamonds, aes(x = cut, fill = clarity)) +\n  geom_bar(position = \"fill\") +\n  scale_y_continuous(name = \"Percentage\", labels = label_percent())\n\n\n\n\n\n\n\n\nAnother use of breaks is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.\n\npresidential |&gt;\n  mutate(id = 33 + row_number()) |&gt;\n  ggplot(aes(x = start, y = id)) +\n  geom_point() +\n  geom_segment(aes(xend = end, yend = id)) +\n  scale_x_date(name = NULL, breaks = presidential$start, date_labels = \"'%y\")\n\n\n\n\n\n\n\n\nNote that for the breaks argument we pulled out the start variable as a vector with presidential$start because we can’t do an aesthetic mapping for this argument. Also note that the specification of breaks and labels for date and datetime scales is a little different:\n\ndate_labels takes a format specification, in the same form as parse_datetime().\ndate_breaks (not shown here), takes a string like “2 days” or “1 month”.\n\n\n\n12.4.3 Legend layout\nYou will most often use breaks and labels to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.\nTo control the overall position of the legend, you need to use a theme() setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting legend.position controls where the legend is drawn:\nbase &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = class))\n\nbase + theme(legend.position = \"right\") # the default\nbase + theme(legend.position = \"left\")\nbase + \n  theme(legend.position = \"top\") +\n  guides(color = guide_legend(nrow = 3))\nbase + \n  theme(legend.position = \"bottom\") +\n  guides(color = guide_legend(nrow = 3))\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf your plot is short and wide, place the legend at the top or bottom, and if it’s tall and narrow, place the legend at the left or right. You can also use legend.position = \"none\" to suppress the display of the legend altogether.\nTo control the display of individual legends, use guides() along with guide_legend() or guide_colorbar(). The following example shows two important settings: controlling the number of rows the legend uses with nrow, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low alpha to display many points on a plot.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = class)) +\n  geom_smooth(se = FALSE) +\n  theme(legend.position = \"bottom\") +\n  guides(color = guide_legend(nrow = 2, override.aes = list(size = 4)))\n#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'\n\n\n\n\n\n\n\n\nNote that the name of the argument in guides() matches the name of the aesthetic, just like in labs().\n\n\n12.4.4 Replacing a scale\nInstead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you’re mostly likely to want to switch out: continuous position scales and color scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and color, you’ll be able to quickly pick up other scale replacements.\nIt’s very useful to plot transformations of your variable. For example, it’s easier to see the precise relationship between carat and price if we log transform them:\n# Left\nggplot(diamonds, aes(x = carat, y = price)) +\n  geom_bin2d()\n\n# Right\nggplot(diamonds, aes(x = log10(carat), y = log10(price))) +\n  geom_bin2d()\n\n\n\n\n\n\n\n\n\n\nHowever, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.\n\nggplot(diamonds, aes(x = carat, y = price)) +\n  geom_bin2d() + \n  scale_x_log10() + \n  scale_y_log10()\n\n\n\n\n\n\n\n\nAnother scale that is frequently customized is color. The default categorical scale picks colors that are evenly spaced around the color wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of color blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green color blindness.1\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = drv))\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = drv)) +\n  scale_color_brewer(palette = \"Set1\")\n\n\n\n\n\n\n\n\n\n\nDon’t forget simpler techniques for improving accessibility. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = drv, shape = drv)) +\n  scale_color_brewer(palette = \"Set1\")\n\n\n\n\n\n\n\n\nThe ColorBrewer scales are documented online at https://colorbrewer2.org/ and made available in R via the RColorBrewer package, by Erich Neuwirth. Figure 12.1 shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used cut() to make a continuous variable into a categorical variable.\n\n\n\n\n\n\n\n\nFigure 12.1: All colorBrewer scales.\n\n\n\n\n\nWhen you have a predefined mapping between values and colors, use scale_color_manual(). For example, if we map presidential party to color, we want to use the standard mapping of red for Republicans and blue for Democrats. One approach for assigning these colors is using hex color codes:\n\npresidential |&gt;\n  mutate(id = 33 + row_number()) |&gt;\n  ggplot(aes(x = start, y = id, color = party)) +\n  geom_point() +\n  geom_segment(aes(xend = end, yend = id)) +\n  scale_color_manual(values = c(Republican = \"#E81B23\", Democratic = \"#00AEF3\"))\n\n\n\n\n\n\n\n\nFor continuous color, you can use the built-in scale_color_gradient() or scale_fill_gradient(). If you have a diverging scale, you can use scale_color_gradient2(). That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.\nAnother option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (c), discrete (d), and binned (b) palettes in ggplot2.\ndf &lt;- tibble(\n  x = rnorm(10000),\n  y = rnorm(10000)\n)\n\nggplot(df, aes(x, y)) +\n  geom_hex() +\n  coord_fixed() +\n  labs(title = \"Default, continuous\", x = NULL, y = NULL)\n#&gt; Warning: Computation failed in `stat_binhex()`.\n#&gt; Caused by error in `compute_group()`:\n#&gt; ! The package \"hexbin\" is required for `stat_bin_hex()`.\n\nggplot(df, aes(x, y)) +\n  geom_hex() +\n  coord_fixed() +\n  scale_fill_viridis_c() +\n  labs(title = \"Viridis, continuous\", x = NULL, y = NULL)\n#&gt; Warning: Computation failed in `stat_binhex()`.\n#&gt; Caused by error in `compute_group()`:\n#&gt; ! The package \"hexbin\" is required for `stat_bin_hex()`.\n\nggplot(df, aes(x, y)) +\n  geom_hex() +\n  coord_fixed() +\n  scale_fill_viridis_b() +\n  labs(title = \"Viridis, binned\", x = NULL, y = NULL)\n#&gt; Warning: Computation failed in `stat_binhex()`.\n#&gt; Caused by error in `compute_group()`:\n#&gt; ! The package \"hexbin\" is required for `stat_bin_hex()`.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNote that all color scales come in two varieties: scale_color_*() and scale_fill_*() for the color and fill aesthetics respectively (the color scales are available in both UK and US spellings).\n\n\n12.4.5 Zooming\nThere are three ways to control the plot limits:\n\nAdjusting what data are plotted.\nSetting the limits in each scale.\nSetting xlim and ylim in coord_cartesian().\n\nWe’ll demonstrate these options in a series of plots. The plot on the left shows the relationship between engine size and fuel efficiency, colored by type of drive train. The plot on the right shows the same variables, but subsets the data that are plotted. Subsetting the data has affected the x and y scales as well as the smooth curve.\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = drv)) +\n  geom_smooth()\n\n# Right\nmpg |&gt;\n  filter(displ &gt;= 5 & displ &lt;= 6 & hwy &gt;= 10 & hwy &lt;= 25) |&gt;\n  ggplot(aes(x = displ, y = hwy)) +\n  geom_point(aes(color = drv)) +\n  geom_smooth()\n\n\n\n\n\n\n\n\n\n\nLet’s compare these to the two plots below where the plot on the left sets the limits on individual scales and the plot on the right sets them in coord_cartesian(). We can see that reducing the limits is equivalent to subsetting the data. Therefore, to zoom in on a region of the plot, it’s generally best to use coord_cartesian().\n# Left\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = drv)) +\n  geom_smooth() +\n  scale_x_continuous(limits = c(5, 6)) +\n  scale_y_continuous(limits = c(10, 25))\n\n# Right\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = drv)) +\n  geom_smooth() +\n  coord_cartesian(xlim = c(5, 6), ylim = c(10, 25))\n\n\n\n\n\n\n\n\n\n\nOn the other hand, setting the limits on individual scales is generally more useful if you want to expand the limits, e.g., to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.\nsuv &lt;- mpg |&gt; filter(class == \"suv\")\ncompact &lt;- mpg |&gt; filter(class == \"compact\")\n\n# Left\nggplot(suv, aes(x = displ, y = hwy, color = drv)) +\n  geom_point()\n\n# Right\nggplot(compact, aes(x = displ, y = hwy, color = drv)) +\n  geom_point()\n\n\n\n\n\n\n\n\n\n\nOne way to overcome this problem is to share scales across multiple plots, training the scales with the limits of the full data.\nx_scale &lt;- scale_x_continuous(limits = range(mpg$displ))\ny_scale &lt;- scale_y_continuous(limits = range(mpg$hwy))\ncol_scale &lt;- scale_color_discrete(limits = unique(mpg$drv))\n\n# Left\nggplot(suv, aes(x = displ, y = hwy, color = drv)) +\n  geom_point() +\n  x_scale +\n  y_scale +\n  col_scale\n\n# Right\nggplot(compact, aes(x = displ, y = hwy, color = drv)) +\n  geom_point() +\n  x_scale +\n  y_scale +\n  col_scale\n\n\n\n\n\n\n\n\n\n\nIn this particular case, you could have simply used faceting, but this technique is useful more generally, if for instance, you want to spread plots over multiple pages of a report.\n\n\n12.4.6 Exercises\n\nWhy doesn’t the following code override the default scale?\n\ndf &lt;- tibble(\n  x = rnorm(10000),\n  y = rnorm(10000)\n)\n\nggplot(df, aes(x, y)) +\n  geom_hex() +\n  scale_color_gradient(low = \"white\", high = \"red\") +\n  coord_fixed()\n#&gt; Warning: Computation failed in `stat_binhex()`.\n#&gt; Caused by error in `compute_group()`:\n#&gt; ! The package \"hexbin\" is required for `stat_bin_hex()`.\n\nWhat is the first argument to every scale? How does it compare to labs()?\nChange the display of the presidential terms by:\n\nCombining the two variants that customize colors and x axis breaks.\nImproving the display of the y axis.\nLabelling each term with the name of the president.\nAdding informative plot labels.\nPlacing breaks every 4 years (this is trickier than it seems!).\n\nFirst, create the following plot. Then, modify the code using override.aes to make the legend easier to see.\n\nggplot(diamonds, aes(x = carat, y = price)) +\n  geom_point(aes(color = cut), alpha = 1/20)",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "communication.html#sec-themes",
    "href": "communication.html#sec-themes",
    "title": "12  Communication",
    "section": "12.5 Themes",
    "text": "12.5 Themes\nFinally, you can customize the non-data elements of your plot with a theme:\n\nggplot(mpg, aes(x = displ, y = hwy)) +\n  geom_point(aes(color = class)) +\n  geom_smooth(se = FALSE) +\n  theme_bw()\n\n\n\n\n\n\n\n\nggplot2 includes the eight themes shown in Figure 12.2, with theme_gray() as the default.2 Many more are included in add-on packages like ggthemes (https://jrnold.github.io/ggthemes), by Jeffrey Arnold. You can also create your own themes, if you are trying to match a particular corporate or journal style.\n\n\n\n\n\n\n\n\nFigure 12.2: The eight themes built-in to ggplot2.\n\n\n\n\n\nIt’s also possible to control individual components of each theme, like the size and color of the font used for the y axis. We’ve already seen that legend.position controls where the legend is drawn. There are many other aspects of the legend that can be customized with theme(). For example, in the plot below we change the direction of the legend as well as put a black border around it. Note that customization of the legend box and plot title elements of the theme are done with element_*() functions. These functions specify the styling of non-data components, e.g., the title text is bolded in the face argument of element_text() and the legend border color is defined in the color argument of element_rect(). The theme elements that control the position of the title and the caption are plot.title.position and plot.caption.position, respectively. In the following plot these are set to \"plot\" to indicate these elements are aligned to the entire plot area, instead of the plot panel (the default). A few other helpful theme() components are used to change the placement for format of the title and caption text.\n\nggplot(mpg, aes(x = displ, y = hwy, color = drv)) +\n  geom_point() +\n  labs(\n    title = \"Larger engine sizes tend to have lower fuel economy\",\n    caption = \"Source: https://fueleconomy.gov.\"\n  ) +\n  theme(\n    legend.position = c(0.6, 0.7),\n    legend.direction = \"horizontal\",\n    legend.box.background = element_rect(color = \"black\"),\n    plot.title = element_text(face = \"bold\"),\n    plot.title.position = \"plot\",\n    plot.caption.position = \"plot\",\n    plot.caption = element_text(hjust = 0)\n  )\n#&gt; Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2\n#&gt; 3.5.0.\n#&gt; ℹ Please use the `legend.position.inside` argument of `theme()` instead.\n\n\n\n\n\n\n\n\nFor an overview of all theme() components, see help with ?theme. The ggplot2 book is also a great place to go for the full details on theming.\n\n12.5.1 Exercises\n\nPick a theme offered by the ggthemes package and apply it to the last plot you made.\nMake the axis labels of your plot blue and bolded.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "communication.html#layout",
    "href": "communication.html#layout",
    "title": "12  Communication",
    "section": "12.6 Layout",
    "text": "12.6 Layout\nSo far we talked about how to create and modify a single plot. What if you have multiple plots you want to lay out in a certain way? The patchwork package allows you to combine separate plots into the same graphic. We loaded this package earlier in the chapter.\nTo place two plots next to each other, you can simply add them to each other. Note that you first need to create the plots and save them as objects (in the following example they’re called p1 and p2). Then, you place them next to each other with +.\n\np1 &lt;- ggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point() + \n  labs(title = \"Plot 1\")\np2 &lt;- ggplot(mpg, aes(x = drv, y = hwy)) + \n  geom_boxplot() + \n  labs(title = \"Plot 2\")\np1 + p2\n\n\n\n\n\n\n\n\nIt’s important to note that in the above code chunk we did not use a new function from the patchwork package. Instead, the package added a new functionality to the + operator.\nYou can also create complex plot layouts with patchwork. In the following, | places the p1 and p3 next to each other and / moves p2 to the next line.\n\np3 &lt;- ggplot(mpg, aes(x = cty, y = hwy)) + \n  geom_point() + \n  labs(title = \"Plot 3\")\n(p1 | p3) / p2\n\n\n\n\n\n\n\n\nAdditionally, patchwork allows you to collect legends from multiple plots into one common legend, customize the placement of the legend as well as dimensions of the plots, and add a common title, subtitle, caption, etc. to your plots. Below we create 5 plots. We have turned off the legends on the box plots and the scatterplot and collected the legends for the density plots at the top of the plot with & theme(legend.position = \"top\"). Note the use of the & operator here instead of the usual +. This is because we’re modifying the theme for the patchwork plot as opposed to the individual ggplots. The legend is placed on top, inside the guide_area(). Finally, we have also customized the heights of the various components of our patchwork – the guide has a height of 1, the box plots 3, density plots 2, and the faceted scatterplot 4. Patchwork divides up the area you have allotted for your plot using this scale and places the components accordingly.\n\np1 &lt;- ggplot(mpg, aes(x = drv, y = cty, color = drv)) + \n  geom_boxplot(show.legend = FALSE) + \n  labs(title = \"Plot 1\")\n\np2 &lt;- ggplot(mpg, aes(x = drv, y = hwy, color = drv)) + \n  geom_boxplot(show.legend = FALSE) + \n  labs(title = \"Plot 2\")\n\np3 &lt;- ggplot(mpg, aes(x = cty, color = drv, fill = drv)) + \n  geom_density(alpha = 0.5) + \n  labs(title = \"Plot 3\")\n\np4 &lt;- ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) + \n  geom_density(alpha = 0.5) + \n  labs(title = \"Plot 4\")\n\np5 &lt;- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) + \n  geom_point(show.legend = FALSE) + \n  facet_wrap(~drv) +\n  labs(title = \"Plot 5\")\n\n(guide_area() / (p1 + p2) / (p3 + p4) / p5) +\n  plot_annotation(\n    title = \"City and highway mileage for cars with different drive trains\",\n    caption = \"Source: https://fueleconomy.gov.\"\n  ) +\n  plot_layout(\n    guides = \"collect\",\n    heights = c(1, 3, 2, 4)\n    ) &\n  theme(legend.position = \"top\")\n\n\n\n\n\n\n\n\nIf you’d like to learn more about combining and layout out multiple plots with patchwork, we recommend looking through the guides on the package website: https://patchwork.data-imaginist.com.\n\n12.6.1 Exercises\n\nWhat happens if you omit the parentheses in the following plot layout. Can you explain why this happens?\n\np1 &lt;- ggplot(mpg, aes(x = displ, y = hwy)) + \n  geom_point() + \n  labs(title = \"Plot 1\")\np2 &lt;- ggplot(mpg, aes(x = drv, y = hwy)) + \n  geom_boxplot() + \n  labs(title = \"Plot 2\")\np3 &lt;- ggplot(mpg, aes(x = cty, y = hwy)) + \n  geom_point() + \n  labs(title = \"Plot 3\")\n\n(p1 | p2) / p3\n\nUsing the three plots from the previous exercise, recreate the following patchwork.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "communication.html#summary",
    "href": "communication.html#summary",
    "title": "12  Communication",
    "section": "12.7 Summary",
    "text": "12.7 Summary\nIn this chapter you’ve learned about adding plot labels such as title, subtitle, caption as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customizing the axis scales, and changing the theme of your plot. You’ve also learned about combining multiple plots in a single graph using both simple and complex plot layouts.\nWhile you’ve so far learned about how to make many different types of plots and how to customize them using a variety of techniques, we’ve barely scratched the surface of what you can create with ggplot2. If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, ggplot2: Elegant Graphics for Data Analysis. Other useful resources are the R Graphics Cookbook by Winston Chang and Fundamentals of Data Visualization by Claus Wilke.",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "communication.html#footnotes",
    "href": "communication.html#footnotes",
    "title": "12  Communication",
    "section": "",
    "text": "You can use a tool like SimDaltonism to simulate color blindness to test these images.↩︎\nMany people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The gray background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the gray background creates a continuous field of color which ensures that the plot is perceived as a single visual entity.↩︎",
    "crumbs": [
      "Visualize",
      "<span class='chapter-number'>12</span>  <span class='chapter-title'>Communication</span>"
    ]
  },
  {
    "objectID": "transform.html",
    "href": "transform.html",
    "title": "Transform",
    "section": "",
    "text": "The second part of the book was a deep dive into data visualization. In this part of the book, you’ll learn about the most important types of variables that you’ll encounter inside a data frame and learn the tools you can use to work with them.\n\n\n\n\n\n\n\n\nFigure 1: The options for data transformation depends heavily on the type of data involved, the subject of this part of the book.\n\n\n\n\n\nYou can read these chapters as you need them; they’re designed to be largely standalone so that they can be read out of order.\n\n13  Logical vectors teaches you about logical vectors. These are the simplest types of vectors, but are extremely powerful. You’ll learn how to create them with numeric comparisons, how to combine them with Boolean algebra, how to use them in summaries, and how to use them for conditional transformations.\n14  Numbers dives into tools for vectors of numbers, the powerhouse of data science. You’ll learn more about counting and a bunch of important transformation and summary functions.\n15  Strings will give you the tools to work with strings: you’ll slice them, you’ll dice them, and you’ll stick them back together again. This chapter mostly focuses on the stringr package, but you’ll also learn some more tidyr functions devoted to extracting data from character strings.\n16  Regular expressions introduces you to regular expressions, a powerful tool for manipulating strings. This chapter will take you from thinking that a cat walked over your keyboard to reading and writing complex string patterns.\n17  Factors introduces factors: the data type that R uses to store categorical data. You use a factor when variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.\n18  Dates and times will give you the key tools for working with dates and date-times. Unfortunately, the more you learn about date-times, the more complicated they seem to get, but with the help of the lubridate package, you’ll learn to how to overcome the most common challenges.\n19  Missing values discusses missing values in depth. We’ve discussed them a couple of times in isolation, but now it’s time to discuss them holistically, helping you come to grips with the difference between implicit and explicit missing values, and how and why you might convert between them.\n20  Joins finishes up this part of the book by giving you tools to join two (or more) data frames together. Learning about joins will force you to grapple with the idea of keys, and think about how you identify each row in a dataset.",
    "crumbs": [
      "Transform"
    ]
  },
  {
    "objectID": "logicals.html",
    "href": "logicals.html",
    "title": "13  Logical vectors",
    "section": "",
    "text": "13.1 Introduction\nIn this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: TRUE, FALSE, and NA. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.\nWe’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with if_else() and case_when(), two useful functions for making conditional changes powered by logical vectors.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>13</span>  <span class='chapter-title'>Logical vectors</span>"
    ]
  },
  {
    "objectID": "logicals.html#introduction",
    "href": "logicals.html#introduction",
    "title": "13  Logical vectors",
    "section": "",
    "text": "13.1.1 Prerequisites\nMost of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use mutate(), filter(), and friends to work with data frames. We’ll also continue to draw examples from the nycflights13::flights dataset.\n\nlibrary(tidyverse)\nlibrary(nycflights13)\n\nHowever, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with c():\n\nx &lt;- c(1, 2, 3, 5, 7, 11, 13)\nx * 2\n#&gt; [1]  2  4  6 10 14 22 26\n\nThis makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with mutate() and friends.\n\ndf &lt;- tibble(x)\ndf |&gt; \n  mutate(y = x * 2)\n#&gt; # A tibble: 7 × 2\n#&gt;       x     y\n#&gt;   &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1     2\n#&gt; 2     2     4\n#&gt; 3     3     6\n#&gt; 4     5    10\n#&gt; 5     7    14\n#&gt; 6    11    22\n#&gt; # ℹ 1 more row",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>13</span>  <span class='chapter-title'>Logical vectors</span>"
    ]
  },
  {
    "objectID": "logicals.html#comparisons",
    "href": "logicals.html#comparisons",
    "title": "13  Logical vectors",
    "section": "13.2 Comparisons",
    "text": "13.2 Comparisons\nA very common way to create a logical vector is via a numeric comparison with &lt;, &lt;=, &gt;, &gt;=, !=, and ==. So far, we’ve mostly created logical variables transiently within filter() — they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that arrive roughly on time:\n\nflights |&gt; \n  filter(dep_time &gt; 600 & dep_time &lt; 2000 & abs(arr_delay) &lt; 20)\n#&gt; # A tibble: 172,286 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      601            600         1      844            850\n#&gt; 2  2013     1     1      602            610        -8      812            820\n#&gt; 3  2013     1     1      602            605        -3      821            805\n#&gt; 4  2013     1     1      606            610        -4      858            910\n#&gt; 5  2013     1     1      606            610        -4      837            845\n#&gt; 6  2013     1     1      607            607         0      858            915\n#&gt; # ℹ 172,280 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nIt’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with mutate():\n\nflights |&gt; \n  mutate(\n    daytime = dep_time &gt; 600 & dep_time &lt; 2000,\n    approx_ontime = abs(arr_delay) &lt; 20,\n    .keep = \"used\"\n  )\n#&gt; # A tibble: 336,776 × 4\n#&gt;   dep_time arr_delay daytime approx_ontime\n#&gt;      &lt;int&gt;     &lt;dbl&gt; &lt;lgl&gt;   &lt;lgl&gt;        \n#&gt; 1      517        11 FALSE   TRUE         \n#&gt; 2      533        20 FALSE   FALSE        \n#&gt; 3      542        33 FALSE   FALSE        \n#&gt; 4      544       -18 FALSE   TRUE         \n#&gt; 5      554       -25 FALSE   FALSE        \n#&gt; 6      554        12 FALSE   TRUE         \n#&gt; # ℹ 336,770 more rows\n\nThis is particularly useful for more complicated logic because naming the intermediate steps makes it easier to both read your code and check that each step has been computed correctly.\nAll up, the initial filter is equivalent to:\n\nflights |&gt; \n  mutate(\n    daytime = dep_time &gt; 600 & dep_time &lt; 2000,\n    approx_ontime = abs(arr_delay) &lt; 20,\n  ) |&gt; \n  filter(daytime & approx_ontime)\n\n\n13.2.1 Floating point comparison\nBeware of using == with numbers. For example, it looks like this vector contains the numbers 1 and 2:\n\nx &lt;- c(1 / 49 * 49, sqrt(2) ^ 2)\nx\n#&gt; [1] 1 2\n\nBut if you test them for equality, you get FALSE:\n\nx == c(1, 2)\n#&gt; [1] FALSE FALSE\n\nWhat’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or sqrt(2) and subsequent computations will be very slightly off. We can see the exact values by calling print() with the digits1 argument:\n\nprint(x, digits = 16)\n#&gt; [1] 0.9999999999999999 2.0000000000000004\n\nYou can see why R defaults to rounding these numbers; they really are very close to what you expect.\nNow that you’ve seen why == is failing, what can you do about it? One option is to use dplyr::near() which ignores small differences:\n\nnear(x, c(1, 2))\n#&gt; [1] TRUE TRUE\n\n\n\n13.2.2 Missing values\nMissing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown:\n\nNA &gt; 5\n#&gt; [1] NA\n10 == NA\n#&gt; [1] NA\n\nThe most confusing result is this one:\n\nNA == NA\n#&gt; [1] NA\n\nIt’s easiest to understand why this is true if we artificially supply a little more context:\n\n# We don't know how old Mary is\nage_mary &lt;- NA\n\n# We don't know how old John is\nage_john &lt;- NA\n\n# Are Mary and John the same age?\nage_mary == age_john\n#&gt; [1] NA\n# We don't know!\n\nSo if you want to find all flights where dep_time is missing, the following code doesn’t work because dep_time == NA will yield NA for every single row, and filter() automatically drops missing values:\n\nflights |&gt; \n  filter(dep_time == NA)\n#&gt; # A tibble: 0 × 19\n#&gt; # ℹ 19 variables: year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;, dep_time &lt;int&gt;,\n#&gt; #   sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;, …\n\nInstead we’ll need a new tool: is.na().\n\n\n13.2.3 is.na()\nis.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:\n\nis.na(c(TRUE, NA, FALSE))\n#&gt; [1] FALSE  TRUE FALSE\nis.na(c(1, NA, 3))\n#&gt; [1] FALSE  TRUE FALSE\nis.na(c(\"a\", NA, \"b\"))\n#&gt; [1] FALSE  TRUE FALSE\n\nWe can use is.na() to find all the rows with a missing dep_time:\n\nflights |&gt; \n  filter(is.na(dep_time))\n#&gt; # A tibble: 8,255 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1       NA           1630        NA       NA           1815\n#&gt; 2  2013     1     1       NA           1935        NA       NA           2240\n#&gt; 3  2013     1     1       NA           1500        NA       NA           1825\n#&gt; 4  2013     1     1       NA            600        NA       NA            901\n#&gt; 5  2013     1     2       NA           1540        NA       NA           1747\n#&gt; 6  2013     1     2       NA           1620        NA       NA           1746\n#&gt; # ℹ 8,249 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nis.na() can also be useful in arrange(). arrange() usually puts all the missing values at the end but you can override this default by first sorting by is.na():\n\nflights |&gt; \n  filter(month == 1, day == 1) |&gt; \n  arrange(dep_time)\n#&gt; # A tibble: 842 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 836 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nflights |&gt; \n  filter(month == 1, day == 1) |&gt; \n  arrange(desc(is.na(dep_time)), dep_time)\n#&gt; # A tibble: 842 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1       NA           1630        NA       NA           1815\n#&gt; 2  2013     1     1       NA           1935        NA       NA           2240\n#&gt; 3  2013     1     1       NA           1500        NA       NA           1825\n#&gt; 4  2013     1     1       NA            600        NA       NA            901\n#&gt; 5  2013     1     1      517            515         2      830            819\n#&gt; 6  2013     1     1      533            529         4      850            830\n#&gt; # ℹ 836 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nWe’ll come back to cover missing values in more depth in Chapter 19.\n\n\n13.2.4 Exercises\n\nHow does dplyr::near() work? Type near to see the source code. Is sqrt(2)^2 near 2?\nUse mutate(), is.na(), and count() together to describe how the missing values in dep_time, sched_dep_time and dep_delay are connected.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>13</span>  <span class='chapter-title'>Logical vectors</span>"
    ]
  },
  {
    "objectID": "logicals.html#boolean-algebra",
    "href": "logicals.html#boolean-algebra",
    "title": "13  Logical vectors",
    "section": "13.3 Boolean algebra",
    "text": "13.3 Boolean algebra\nOnce you have multiple logical vectors, you can combine them together using Boolean algebra. In R, & is “and”, | is “or”, ! is “not”, and xor() is exclusive or2. For example, df |&gt; filter(!is.na(x)) finds all rows where x is not missing and df |&gt; filter(x &lt; -10 | x &gt; 0) finds all rows where x is smaller than -10 or bigger than 0. Figure 13.1 shows the complete set of Boolean operations and how they work.\n\n\n\n\n\n\n\n\nFigure 13.1: The complete set of Boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.\n\n\n\n\n\nAs well as & and |, R also has && and ||. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single TRUE or FALSE. They’re important for programming, not data science.\n\n13.3.1 Missing values\nThe rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:\n\ndf &lt;- tibble(x = c(TRUE, FALSE, NA))\n\ndf |&gt; \n  mutate(\n    and = x & NA,\n    or = x | NA\n  )\n#&gt; # A tibble: 3 × 3\n#&gt;   x     and   or   \n#&gt;   &lt;lgl&gt; &lt;lgl&gt; &lt;lgl&gt;\n#&gt; 1 TRUE  NA    TRUE \n#&gt; 2 FALSE FALSE NA   \n#&gt; 3 NA    NA    NA\n\nTo understand what’s going on, think about NA | TRUE (NA or TRUE). A missing value in a logical vector means that the value could either be TRUE or FALSE. TRUE | TRUE and FALSE | TRUE are both TRUE because at least one of them is TRUE. NA | TRUE must also be TRUE because NA can either be TRUE or FALSE. However, NA | FALSE is NA because we don’t know if NA is TRUE or FALSE. Similar reasoning applies with NA & FALSE.\n\n\n13.3.2 Order of operations\nNote that the order of operations doesn’t work like English. Take the following code that finds all flights that departed in November or December:\n\nflights |&gt; \n   filter(month == 11 | month == 12)\n\nYou might be tempted to write it like you’d say in English: “Find all flights that departed in November or December.”:\n\nflights |&gt; \n   filter(month == 11 | 12)\n#&gt; # A tibble: 336,776 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      533            529         4      850            830\n#&gt; 3  2013     1     1      542            540         2      923            850\n#&gt; 4  2013     1     1      544            545        -1     1004           1022\n#&gt; 5  2013     1     1      554            600        -6      812            837\n#&gt; 6  2013     1     1      554            558        -4      740            728\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nThis code doesn’t error but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates month == 11 creating a logical vector, which we call nov. It computes nov | 12. When you use a number with a logical operator it converts everything apart from 0 to TRUE, so this is equivalent to nov | TRUE which will always be TRUE, so every row will be selected:\n\nflights |&gt; \n  mutate(\n    nov = month == 11,\n    final = nov | 12,\n    .keep = \"used\"\n  )\n#&gt; # A tibble: 336,776 × 3\n#&gt;   month nov   final\n#&gt;   &lt;int&gt; &lt;lgl&gt; &lt;lgl&gt;\n#&gt; 1     1 FALSE TRUE \n#&gt; 2     1 FALSE TRUE \n#&gt; 3     1 FALSE TRUE \n#&gt; 4     1 FALSE TRUE \n#&gt; 5     1 FALSE TRUE \n#&gt; 6     1 FALSE TRUE \n#&gt; # ℹ 336,770 more rows\n\n\n\n13.3.3 %in%\nAn easy way to avoid the problem of getting your ==s and |s in the right order is to use %in%. x %in% y returns a logical vector the same length as x that is TRUE whenever a value in x is anywhere in y .\n\n1:12 %in% c(1, 5, 11)\n#&gt;  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE\nletters[1:10] %in% c(\"a\", \"e\", \"i\", \"o\", \"u\")\n#&gt;  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE\n\nSo to find all flights in November and December we could write:\n\nflights |&gt; \n  filter(month %in% c(11, 12))\n\nNote that %in% obeys different rules for NA to ==, as NA %in% NA is TRUE.\n\nc(1, 2, NA) == NA\n#&gt; [1] NA NA NA\nc(1, 2, NA) %in% NA\n#&gt; [1] FALSE FALSE  TRUE\n\nThis can make for a useful shortcut:\n\nflights |&gt; \n  filter(dep_time %in% c(NA, 0800))\n#&gt; # A tibble: 8,803 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      800            800         0     1022           1014\n#&gt; 2  2013     1     1      800            810       -10      949            955\n#&gt; 3  2013     1     1       NA           1630        NA       NA           1815\n#&gt; 4  2013     1     1       NA           1935        NA       NA           2240\n#&gt; 5  2013     1     1       NA           1500        NA       NA           1825\n#&gt; 6  2013     1     1       NA            600        NA       NA            901\n#&gt; # ℹ 8,797 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\n\n\n13.3.4 Exercises\n\nFind all flights where arr_delay is missing but dep_delay is not. Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.\nHow many flights have a missing dep_time? What other variables are missing in these rows? What might these rows represent?\nAssuming that a missing dep_time implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>13</span>  <span class='chapter-title'>Logical vectors</span>"
    ]
  },
  {
    "objectID": "logicals.html#sec-logical-summaries",
    "href": "logicals.html#sec-logical-summaries",
    "title": "13  Logical vectors",
    "section": "13.4 Summaries",
    "text": "13.4 Summaries\nThe following sections describe some useful techniques for summarizing logical vectors. As well as functions that only work specifically with logical vectors, you can also use functions that work with numeric vectors.\n\n13.4.1 Logical summaries\nThere are two main logical summaries: any() and all(). any(x) is the equivalent of |; it’ll return TRUE if there are any TRUE’s in x. all(x) is equivalent of &; it’ll return TRUE only if all values of x are TRUE’s. Like all summary functions, they’ll return NA if there are any missing values present, and as usual you can make the missing values go away with na.rm = TRUE.\nFor example, we could use all() and any() to find out if every flight was delayed on departure by at most an hour or if any flights were delayed on arrival by five hours or more. And using group_by() allows us to do that by day:\n\nflights |&gt; \n  group_by(year, month, day) |&gt; \n  summarize(\n    all_delayed = all(dep_delay &lt;= 60, na.rm = TRUE),\n    any_long_delay = any(arr_delay &gt;= 300, na.rm = TRUE),\n    .groups = \"drop\"\n  )\n#&gt; # A tibble: 365 × 5\n#&gt;    year month   day all_delayed any_long_delay\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;lgl&gt;       &lt;lgl&gt;         \n#&gt; 1  2013     1     1 FALSE       TRUE          \n#&gt; 2  2013     1     2 FALSE       TRUE          \n#&gt; 3  2013     1     3 FALSE       FALSE         \n#&gt; 4  2013     1     4 FALSE       FALSE         \n#&gt; 5  2013     1     5 FALSE       TRUE          \n#&gt; 6  2013     1     6 FALSE       FALSE         \n#&gt; # ℹ 359 more rows\n\nIn most cases, however, any() and all() are a little too crude, and it would be nice to be able to get a little more detail about how many values are TRUE or FALSE. That leads us to the numeric summaries.\n\n\n13.4.2 Numeric summaries of logical vectors\nWhen you use a logical vector in a numeric context, TRUE becomes 1 and FALSE becomes 0. This makes sum() and mean() very useful with logical vectors because sum(x) gives the number of TRUEs and mean(x) gives the proportion of TRUEs (because mean() is just sum() divided by length()).\nThat, for example, allows us to see the proportion of flights that were delayed on departure by at most an hour and the number of flights that were delayed on arrival by five hours or more:\n\nflights |&gt; \n  group_by(year, month, day) |&gt; \n  summarize(\n    proportion_delayed = mean(dep_delay &lt;= 60, na.rm = TRUE),\n    count_long_delay = sum(arr_delay &gt;= 300, na.rm = TRUE),\n    .groups = \"drop\"\n  )\n#&gt; # A tibble: 365 × 5\n#&gt;    year month   day proportion_delayed count_long_delay\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;              &lt;dbl&gt;            &lt;int&gt;\n#&gt; 1  2013     1     1              0.939                3\n#&gt; 2  2013     1     2              0.914                3\n#&gt; 3  2013     1     3              0.941                0\n#&gt; 4  2013     1     4              0.953                0\n#&gt; 5  2013     1     5              0.964                1\n#&gt; 6  2013     1     6              0.959                0\n#&gt; # ℹ 359 more rows\n\n\n\n13.4.3 Logical subsetting\nThere’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base [ (pronounced subset) operator, which you’ll learn more about in Section 28.2.\nImagine we wanted to look at the average delay just for flights that were actually delayed. One way to do so would be to first filter the flights and then calculate the average delay:\n\nflights |&gt; \n  filter(arr_delay &gt; 0) |&gt; \n  group_by(year, month, day) |&gt; \n  summarize(\n    behind = mean(arr_delay),\n    n = n(),\n    .groups = \"drop\"\n  )\n#&gt; # A tibble: 365 × 5\n#&gt;    year month   day behind     n\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;dbl&gt; &lt;int&gt;\n#&gt; 1  2013     1     1   32.5   461\n#&gt; 2  2013     1     2   32.0   535\n#&gt; 3  2013     1     3   27.7   460\n#&gt; 4  2013     1     4   28.3   297\n#&gt; 5  2013     1     5   22.6   238\n#&gt; 6  2013     1     6   24.4   381\n#&gt; # ℹ 359 more rows\n\nThis works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together3. Instead you could use [ to perform an inline filtering: arr_delay[arr_delay &gt; 0] will yield only the positive arrival delays.\nThis leads to:\n\nflights |&gt; \n  group_by(year, month, day) |&gt; \n  summarize(\n    behind = mean(arr_delay[arr_delay &gt; 0], na.rm = TRUE),\n    ahead = mean(arr_delay[arr_delay &lt; 0], na.rm = TRUE),\n    n = n(),\n    .groups = \"drop\"\n  )\n#&gt; # A tibble: 365 × 6\n#&gt;    year month   day behind ahead     n\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;\n#&gt; 1  2013     1     1   32.5 -12.5   842\n#&gt; 2  2013     1     2   32.0 -14.3   943\n#&gt; 3  2013     1     3   27.7 -18.2   914\n#&gt; 4  2013     1     4   28.3 -17.0   915\n#&gt; 5  2013     1     5   22.6 -14.0   720\n#&gt; 6  2013     1     6   24.4 -13.6   832\n#&gt; # ℹ 359 more rows\n\nAlso note the difference in the group size: in the first chunk n() gives the number of delayed flights per day; in the second, n() gives the total number of flights.\n\n\n13.4.4 Exercises\n\nWhat will sum(is.na(x)) tell you? How about mean(is.na(x))?\nWhat does prod() return when applied to a logical vector? What logical summary function is it equivalent to? What does min() return when applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>13</span>  <span class='chapter-title'>Logical vectors</span>"
    ]
  },
  {
    "objectID": "logicals.html#conditional-transformations",
    "href": "logicals.html#conditional-transformations",
    "title": "13  Logical vectors",
    "section": "13.5 Conditional transformations",
    "text": "13.5 Conditional transformations\nOne of the most powerful features of logical vectors are their use for conditional transformations, i.e. doing one thing for condition x, and something different for condition y. There are two important tools for this: if_else() and case_when().\n\n13.5.1 if_else()\nIf you want to use one value when a condition is TRUE and another value when it’s FALSE, you can use dplyr::if_else()4. You’ll always use the first three argument of if_else(). The first argument, condition, is a logical vector, the second, true, gives the output when the condition is true, and the third, false, gives the output if the condition is false.\nLet’s begin with a simple example of labeling a numeric vector as either “+ve” (positive) or “-ve” (negative):\n\nx &lt;- c(-3:3, NA)\nif_else(x &gt; 0, \"+ve\", \"-ve\")\n#&gt; [1] \"-ve\" \"-ve\" \"-ve\" \"-ve\" \"+ve\" \"+ve\" \"+ve\" NA\n\nThere’s an optional fourth argument, missing which will be used if the input is NA:\n\nif_else(x &gt; 0, \"+ve\", \"-ve\", \"???\")\n#&gt; [1] \"-ve\" \"-ve\" \"-ve\" \"-ve\" \"+ve\" \"+ve\" \"+ve\" \"???\"\n\nYou can also use vectors for the true and false arguments. For example, this allows us to create a minimal implementation of abs():\n\nif_else(x &lt; 0, -x, x)\n#&gt; [1]  3  2  1  0  1  2  3 NA\n\nSo far all the arguments have used the same vectors, but you can of course mix and match. For example, you could implement a simple version of coalesce() like this:\n\nx1 &lt;- c(NA, 1, 2, NA)\ny1 &lt;- c(3, NA, 4, 6)\nif_else(is.na(x1), y1, x1)\n#&gt; [1] 3 1 2 6\n\nYou might have noticed a small infelicity in our labeling example above: zero is neither positive nor negative. We could resolve this by adding an additional if_else():\n\nif_else(x == 0, \"0\", if_else(x &lt; 0, \"-ve\", \"+ve\"), \"???\")\n#&gt; [1] \"-ve\" \"-ve\" \"-ve\" \"0\"   \"+ve\" \"+ve\" \"+ve\" \"???\"\n\nThis is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to dplyr::case_when().\n\n\n13.5.2 case_when()\ndplyr’s case_when() is inspired by SQL’s CASE statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like condition ~ output. condition must be a logical vector; when it’s TRUE, output will be used.\nThis means we could recreate our previous nested if_else() as follows:\n\nx &lt;- c(-3:3, NA)\ncase_when(\n  x == 0   ~ \"0\",\n  x &lt; 0    ~ \"-ve\", \n  x &gt; 0    ~ \"+ve\",\n  is.na(x) ~ \"???\"\n)\n#&gt; [1] \"-ve\" \"-ve\" \"-ve\" \"0\"   \"+ve\" \"+ve\" \"+ve\" \"???\"\n\nThis is more code, but it’s also more explicit.\nTo explain how case_when() works, let’s explore some simpler cases. If none of the cases match, the output gets an NA:\n\ncase_when(\n  x &lt; 0 ~ \"-ve\",\n  x &gt; 0 ~ \"+ve\"\n)\n#&gt; [1] \"-ve\" \"-ve\" \"-ve\" NA    \"+ve\" \"+ve\" \"+ve\" NA\n\nUse .default if you want to create a “default”/catch all value:\n\ncase_when(\n  x &lt; 0 ~ \"-ve\",\n  x &gt; 0 ~ \"+ve\",\n  .default = \"???\"\n)\n#&gt; [1] \"-ve\" \"-ve\" \"-ve\" \"???\" \"+ve\" \"+ve\" \"+ve\" \"???\"\n\nAnd note that if multiple conditions match, only the first will be used:\n\ncase_when(\n  x &gt; 0 ~ \"+ve\",\n  x &gt; 2 ~ \"big\"\n)\n#&gt; [1] NA    NA    NA    NA    \"+ve\" \"+ve\" \"+ve\" NA\n\nJust like with if_else() you can use variables on both sides of the ~ and you can mix and match variables as needed for your problem. For example, we could use case_when() to provide some human readable labels for the arrival delay:\n\nflights |&gt; \n  mutate(\n    status = case_when(\n      is.na(arr_delay)      ~ \"cancelled\",\n      arr_delay &lt; -30       ~ \"very early\",\n      arr_delay &lt; -15       ~ \"early\",\n      abs(arr_delay) &lt;= 15  ~ \"on time\",\n      arr_delay &lt; 60        ~ \"late\",\n      arr_delay &lt; Inf       ~ \"very late\",\n    ),\n    .keep = \"used\"\n  )\n#&gt; # A tibble: 336,776 × 2\n#&gt;   arr_delay status \n#&gt;       &lt;dbl&gt; &lt;chr&gt;  \n#&gt; 1        11 on time\n#&gt; 2        20 late   \n#&gt; 3        33 late   \n#&gt; 4       -18 early  \n#&gt; 5       -25 early  \n#&gt; 6        12 on time\n#&gt; # ℹ 336,770 more rows\n\nBe wary when writing this sort of complex case_when() statement; my first two attempts used a mix of &lt; and &gt; and I kept accidentally creating overlapping conditions.\n\n\n13.5.3 Compatible types\nNote that both if_else() and case_when() require compatible types in the output. If they’re not compatible, you’ll see errors like this:\n\nif_else(TRUE, \"a\", 1)\n#&gt; Error in `if_else()`:\n#&gt; ! Can't combine `true` &lt;character&gt; and `false` &lt;double&gt;.\n\ncase_when(\n  x &lt; -1 ~ TRUE,  \n  x &gt; 0  ~ now()\n)\n#&gt; Error in `case_when()`:\n#&gt; ! Can't combine `..1 (right)` &lt;logical&gt; and `..2 (right)` &lt;datetime&lt;local&gt;&gt;.\n\nOverall, relatively few types are compatible, because automatically converting one type of vector to another is a common source of errors. Here are the most important cases that are compatible:\n\nNumeric and logical vectors are compatible, as we discussed in Section 13.4.2.\nStrings and factors (Chapter 17) are compatible, because you can think of a factor as a string with a restricted set of values.\nDates and date-times, which we’ll discuss in Chapter 18, are compatible because you can think of a date as a special case of date-time.\nNA, which is technically a logical vector, is compatible with everything because every vector has some way of representing a missing value.\n\nWe don’t expect you to memorize these rules, but they should become second nature over time because they are applied consistently throughout the tidyverse.\n\n\n13.5.4 Exercises\n\nA number is even if it’s divisible by two, which in R you can find out with x %% 2 == 0. Use this fact and if_else() to determine whether each number between 0 and 20 is even or odd.\nGiven a vector of days like x &lt;- c(\"Monday\", \"Saturday\", \"Wednesday\"), use an if_else() statement to label them as weekends or weekdays.\nUse if_else() to compute the absolute value of a numeric vector called x.\nWrite a case_when() statement that uses the month and day columns from flights to label a selection of important US holidays (e.g., New Years Day, 4th of July, Thanksgiving, and Christmas). First create a logical column that is either TRUE or FALSE, and then create a character column that either gives the name of the holiday or is NA.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>13</span>  <span class='chapter-title'>Logical vectors</span>"
    ]
  },
  {
    "objectID": "logicals.html#summary",
    "href": "logicals.html#summary",
    "title": "13  Logical vectors",
    "section": "13.6 Summary",
    "text": "13.6 Summary\nThe definition of a logical vector is simple because each value must be either TRUE, FALSE, or NA. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with &gt;, &lt;, &lt;=, &gt;=, ==, !=, and is.na(), how to combine them with !, &, and |, and how to summarize them with any(), all(), sum(), and mean(). You also learned the powerful if_else() and case_when() functions that allow you to return values depending on the value of a logical vector.\nWe’ll see logical vectors again and again in the following chapters. For example in Chapter 15 you’ll learn about str_detect(x, pattern) which returns a logical vector that’s TRUE for the elements of x that match the pattern, and in Chapter 18 you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move onto the next most important type of vector: numeric vectors.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>13</span>  <span class='chapter-title'>Logical vectors</span>"
    ]
  },
  {
    "objectID": "logicals.html#footnotes",
    "href": "logicals.html#footnotes",
    "title": "13  Logical vectors",
    "section": "",
    "text": "R normally calls print for you (i.e. x is a shortcut for print(x)), but calling it explicitly is useful if you want to provide other arguments.↩︎\nThat is, xor(x, y) is true if x is true, or y is true, but not both. This is how we usually use “or” In English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.↩︎\nWe’ll cover this in Chapter 20.↩︎\ndplyr’s if_else() is very similar to base R’s ifelse(). There are two main advantages of if_else()over ifelse(): you can choose what should happen to missing values, and if_else() is much more likely to give you a meaningful error if your variables have incompatible types.↩︎",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>13</span>  <span class='chapter-title'>Logical vectors</span>"
    ]
  },
  {
    "objectID": "numbers.html",
    "href": "numbers.html",
    "title": "14  Numbers",
    "section": "",
    "text": "14.1 Introduction\nNumeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.\nWe’ll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of count(). Then we’ll dive into various numeric transformations that pair well with mutate(), including more general transformations that can be applied to other types of vectors, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with summarize() and show you how they can also be used with mutate().",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "numbers.html#introduction",
    "href": "numbers.html#introduction",
    "title": "14  Numbers",
    "section": "",
    "text": "14.1.1 Prerequisites\nThis chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like mutate() and filter(). Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with c() and tribble().\n\nlibrary(tidyverse)\nlibrary(nycflights13)",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "numbers.html#making-numbers",
    "href": "numbers.html#making-numbers",
    "title": "14  Numbers",
    "section": "14.2 Making numbers",
    "text": "14.2 Making numbers\nIn most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.\nreadr provides two useful functions for parsing strings into numbers: parse_double() and parse_number(). Use parse_double() when you have numbers that have been written as strings:\n\nx &lt;- c(\"1.2\", \"5.6\", \"1e3\")\nparse_double(x)\n#&gt; [1]    1.2    5.6 1000.0\n\nUse parse_number() when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:\n\nx &lt;- c(\"$1,234\", \"USD 3,513\", \"59%\")\nparse_number(x)\n#&gt; [1] 1234 3513   59",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "numbers.html#sec-counts",
    "href": "numbers.html#sec-counts",
    "title": "14  Numbers",
    "section": "14.3 Counts",
    "text": "14.3 Counts\nIt’s surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with count(). This function is great for quick exploration and checks during analysis:\n\nflights |&gt; count(dest)\n#&gt; # A tibble: 105 × 2\n#&gt;   dest      n\n#&gt;   &lt;chr&gt; &lt;int&gt;\n#&gt; 1 ABQ     254\n#&gt; 2 ACK     265\n#&gt; 3 ALB     439\n#&gt; 4 ANC       8\n#&gt; 5 ATL   17215\n#&gt; 6 AUS    2439\n#&gt; # ℹ 99 more rows\n\n(Despite the advice in Chapter 5, we usually put count() on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.)\nIf you want to see the most common values, add sort = TRUE:\n\nflights |&gt; count(dest, sort = TRUE)\n#&gt; # A tibble: 105 × 2\n#&gt;   dest      n\n#&gt;   &lt;chr&gt; &lt;int&gt;\n#&gt; 1 ORD   17283\n#&gt; 2 ATL   17215\n#&gt; 3 LAX   16174\n#&gt; 4 BOS   15508\n#&gt; 5 MCO   14082\n#&gt; 6 CLT   14064\n#&gt; # ℹ 99 more rows\n\nAnd remember that if you want to see all the values, you can use |&gt; View() or |&gt; print(n = Inf).\nYou can perform the same computation “by hand” with group_by(), summarize() and n(). This is useful because it allows you to compute other summaries at the same time:\n\nflights |&gt; \n  group_by(dest) |&gt; \n  summarize(\n    n = n(),\n    delay = mean(arr_delay, na.rm = TRUE)\n  )\n#&gt; # A tibble: 105 × 3\n#&gt;   dest      n delay\n#&gt;   &lt;chr&gt; &lt;int&gt; &lt;dbl&gt;\n#&gt; 1 ABQ     254  4.38\n#&gt; 2 ACK     265  4.85\n#&gt; 3 ALB     439 14.4 \n#&gt; 4 ANC       8 -2.5 \n#&gt; 5 ATL   17215 11.3 \n#&gt; 6 AUS    2439  6.02\n#&gt; # ℹ 99 more rows\n\nn() is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:\n\nn()\n#&gt; Error in `n()`:\n#&gt; ! Must only be used inside data-masking verbs like `mutate()`,\n#&gt;   `filter()`, and `group_by()`.\n\nThere are a couple of variants of n() and count() that you might find useful:\n\nn_distinct(x) counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:\n\nflights |&gt; \n  group_by(dest) |&gt; \n  summarize(carriers = n_distinct(carrier)) |&gt; \n  arrange(desc(carriers))\n#&gt; # A tibble: 105 × 2\n#&gt;   dest  carriers\n#&gt;   &lt;chr&gt;    &lt;int&gt;\n#&gt; 1 ATL          7\n#&gt; 2 BOS          7\n#&gt; 3 CLT          7\n#&gt; 4 ORD          7\n#&gt; 5 TPA          7\n#&gt; 6 AUS          6\n#&gt; # ℹ 99 more rows\n\nA weighted count is a sum. For example you could “count” the number of miles each plane flew:\n\nflights |&gt; \n  group_by(tailnum) |&gt; \n  summarize(miles = sum(distance))\n#&gt; # A tibble: 4,044 × 2\n#&gt;   tailnum  miles\n#&gt;   &lt;chr&gt;    &lt;dbl&gt;\n#&gt; 1 D942DN    3418\n#&gt; 2 N0EGMQ  250866\n#&gt; 3 N10156  115966\n#&gt; 4 N102UW   25722\n#&gt; 5 N103US   24619\n#&gt; 6 N104UW   25157\n#&gt; # ℹ 4,038 more rows\n\nWeighted counts are a common problem so count() has a wt argument that does the same thing:\n\nflights |&gt; count(tailnum, wt = distance)\n\nYou can count missing values by combining sum() and is.na(). In the flights dataset this represents flights that are cancelled:\n\nflights |&gt; \n  group_by(dest) |&gt; \n  summarize(n_cancelled = sum(is.na(dep_time))) \n#&gt; # A tibble: 105 × 2\n#&gt;   dest  n_cancelled\n#&gt;   &lt;chr&gt;       &lt;int&gt;\n#&gt; 1 ABQ             0\n#&gt; 2 ACK             0\n#&gt; 3 ALB            20\n#&gt; 4 ANC             0\n#&gt; 5 ATL           317\n#&gt; 6 AUS            21\n#&gt; # ℹ 99 more rows\n\n\n\n14.3.1 Exercises\n\nHow can you use count() to count the number of rows with a missing value for a given variable?\nExpand the following calls to count() to instead use group_by(), summarize(), and arrange():\n\nflights |&gt; count(dest, sort = TRUE)\nflights |&gt; count(tailnum, wt = distance)",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "numbers.html#numeric-transformations",
    "href": "numbers.html#numeric-transformations",
    "title": "14  Numbers",
    "section": "14.4 Numeric transformations",
    "text": "14.4 Numeric transformations\nTransformation functions work well with mutate() because their output is the same length as the input. The vast majority of transformation functions are already built into base R. It’s impractical to list them all so this section will show the most useful ones. As an example, while R provides all the trigonometric functions that you might dream of, we don’t list them here because they’re rarely needed for data science.\n\n14.4.1 Arithmetic and recycling rules\nWe introduced the basics of arithmetic (+, -, *, /, ^) in Chapter 3 and have used them a bunch since. These functions don’t need a huge amount of explanation because they do what you learned in grade school. But we need to briefly talk about the recycling rules which determine what happens when the left and right hand sides have different lengths. This is important for operations like flights |&gt; mutate(air_time = air_time / 60) because there are 336,776 numbers on the left of / but only one on the right.\nR handles mismatched lengths by recycling, or repeating, the short vector. We can see this in operation more easily if we create some vectors outside of a data frame:\n\nx &lt;- c(1, 2, 10, 20)\nx / 5\n#&gt; [1] 0.2 0.4 2.0 4.0\n# is shorthand for\nx / c(5, 5, 5, 5)\n#&gt; [1] 0.2 0.4 2.0 4.0\n\nGenerally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector. It usually (but not always) gives you a warning if the longer vector isn’t a multiple of the shorter:\n\nx * c(1, 2)\n#&gt; [1]  1  4 10 40\nx * c(1, 2, 3)\n#&gt; Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter\n#&gt; object length\n#&gt; [1]  1  4 30 20\n\nThese recycling rules are also applied to logical comparisons (==, &lt;, &lt;=, &gt;, &gt;=, !=) and can lead to a surprising result if you accidentally use == instead of %in% and the data frame has an unfortunate number of rows. For example, take this code which attempts to find all flights in January and February:\n\nflights |&gt; \n  filter(month == c(1, 2))\n#&gt; # A tibble: 25,977 × 19\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1      542            540         2      923            850\n#&gt; 3  2013     1     1      554            600        -6      812            837\n#&gt; 4  2013     1     1      555            600        -5      913            854\n#&gt; 5  2013     1     1      557            600        -3      838            846\n#&gt; 6  2013     1     1      558            600        -2      849            851\n#&gt; # ℹ 25,971 more rows\n#&gt; # ℹ 11 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\nThe code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because flights has an even number of rows.\nTo protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function ==, not filter().\n\n\n14.4.2 Minimum and maximum\nThe arithmetic functions work with pairs of variables. Two closely related functions are pmin() and pmax(), which when given two or more variables will return the smallest or largest value in each row:\n\ndf &lt;- tribble(\n  ~x, ~y,\n  1,  3,\n  5,  2,\n  7, NA,\n)\n\ndf |&gt; \n  mutate(\n    min = pmin(x, y, na.rm = TRUE),\n    max = pmax(x, y, na.rm = TRUE)\n  )\n#&gt; # A tibble: 3 × 4\n#&gt;       x     y   min   max\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1     3     1     3\n#&gt; 2     5     2     2     5\n#&gt; 3     7    NA     7     7\n\nNote that these are different to the summary functions min() and max() which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value:\n\ndf |&gt; \n  mutate(\n    min = min(x, y, na.rm = TRUE),\n    max = max(x, y, na.rm = TRUE)\n  )\n#&gt; # A tibble: 3 × 4\n#&gt;       x     y   min   max\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1     3     1     7\n#&gt; 2     5     2     1     7\n#&gt; 3     7    NA     1     7\n\n\n\n14.4.3 Modular arithmetic\nModular arithmetic is the technical name for the type of math you did before you learned about decimal places, i.e. division that yields a whole number and a remainder. In R, %/% does integer division and %% computes the remainder:\n\n1:10 %/% 3\n#&gt;  [1] 0 0 1 1 1 2 2 2 3 3\n1:10 %% 3\n#&gt;  [1] 1 2 0 1 2 0 1 2 0 1\n\nModular arithmetic is handy for the flights dataset, because we can use it to unpack the sched_dep_time variable into hour and minute:\n\nflights |&gt; \n  mutate(\n    hour = sched_dep_time %/% 100,\n    minute = sched_dep_time %% 100,\n    .keep = \"used\"\n  )\n#&gt; # A tibble: 336,776 × 3\n#&gt;   sched_dep_time  hour minute\n#&gt;            &lt;int&gt; &lt;dbl&gt;  &lt;dbl&gt;\n#&gt; 1            515     5     15\n#&gt; 2            529     5     29\n#&gt; 3            540     5     40\n#&gt; 4            545     5     45\n#&gt; 5            600     6      0\n#&gt; 6            558     5     58\n#&gt; # ℹ 336,770 more rows\n\nWe can combine that with the mean(is.na(x)) trick from Section 13.4 to see how the proportion of cancelled flights varies over the course of the day. The results are shown in Figure 14.1.\n\nflights |&gt; \n  group_by(hour = sched_dep_time %/% 100) |&gt; \n  summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |&gt; \n  filter(hour &gt; 1) |&gt; \n  ggplot(aes(x = hour, y = prop_cancelled)) +\n  geom_line(color = \"grey50\") + \n  geom_point(aes(size = n))\n\n\n\n\n\n\n\nFigure 14.1: A line plot with scheduled departure hour on the x-axis, and proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm, very late flights are much less likely to be cancelled.\n\n\n\n\n\n\n\n14.4.4 Logarithms\nLogarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude and converting exponential growth to linear growth. In R, you have a choice of three logarithms: log() (the natural log, base e), log2() (base 2), and log10() (base 10). We recommend using log2() or log10(). log2() is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas log10() is easy to back-transform because (e.g.) 3 is 10^3 = 1000. The inverse of log() is exp(); to compute the inverse of log2() or log10() you’ll need to use 2^ or 10^.\n\n\n14.4.5 Rounding\nUse round(x) to round a number to the nearest integer:\n\nround(123.456)\n#&gt; [1] 123\n\nYou can control the precision of the rounding with the second argument, digits. round(x, digits) rounds to the nearest 10^-n so digits = 2 will round to the nearest 0.01. This definition is useful because it implies round(x, -3) will round to the nearest thousand, which indeed it does:\n\nround(123.456, 2)  # two digits\n#&gt; [1] 123.46\nround(123.456, 1)  # one digit\n#&gt; [1] 123.5\nround(123.456, -1) # round to nearest ten\n#&gt; [1] 120\nround(123.456, -2) # round to nearest hundred\n#&gt; [1] 100\n\nThere’s one weirdness with round() that seems surprising at first glance:\n\nround(c(1.5, 2.5))\n#&gt; [1] 2 2\n\nround() uses what’s known as “round half to even” or Banker’s rounding: if a number is half way between two integers, it will be rounded to the even integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.\nround() is paired with floor() which always rounds down and ceiling() which always rounds up:\n\nx &lt;- 123.456\n\nfloor(x)\n#&gt; [1] 123\nceiling(x)\n#&gt; [1] 124\n\nThese functions don’t have a digits argument, so you can instead scale down, round, and then scale back up:\n\n# Round down to nearest two digits\nfloor(x / 0.01) * 0.01\n#&gt; [1] 123.45\n# Round up to nearest two digits\nceiling(x / 0.01) * 0.01\n#&gt; [1] 123.46\n\nYou can use the same technique if you want to round() to a multiple of some other number:\n\n# Round to nearest multiple of 4\nround(x / 4) * 4\n#&gt; [1] 124\n\n# Round to nearest 0.25\nround(x / 0.25) * 0.25\n#&gt; [1] 123.5\n\n\n\n14.4.6 Cutting numbers into ranges\nUse cut()1 to break up (aka bin) a numeric vector into discrete buckets:\n\nx &lt;- c(1, 2, 5, 10, 15, 20)\ncut(x, breaks = c(0, 5, 10, 15, 20))\n#&gt; [1] (0,5]   (0,5]   (0,5]   (5,10]  (10,15] (15,20]\n#&gt; Levels: (0,5] (5,10] (10,15] (15,20]\n\nThe breaks don’t need to be evenly spaced:\n\ncut(x, breaks = c(0, 5, 10, 100))\n#&gt; [1] (0,5]    (0,5]    (0,5]    (5,10]   (10,100] (10,100]\n#&gt; Levels: (0,5] (5,10] (10,100]\n\nYou can optionally supply your own labels. Note that there should be one less labels than breaks.\n\ncut(x, \n  breaks = c(0, 5, 10, 15, 20), \n  labels = c(\"sm\", \"md\", \"lg\", \"xl\")\n)\n#&gt; [1] sm sm sm md lg xl\n#&gt; Levels: sm md lg xl\n\nAny values outside of the range of the breaks will become NA:\n\ny &lt;- c(NA, -10, 5, 10, 30)\ncut(y, breaks = c(0, 5, 10, 15, 20))\n#&gt; [1] &lt;NA&gt;   &lt;NA&gt;   (0,5]  (5,10] &lt;NA&gt;  \n#&gt; Levels: (0,5] (5,10] (10,15] (15,20]\n\nSee the documentation for other useful arguments like right and include.lowest, which control if the intervals are [a, b) or (a, b] and if the lowest interval should be [a, b].\n\n\n14.4.7 Cumulative and rolling aggregates\nBase R provides cumsum(), cumprod(), cummin(), cummax() for running, or cumulative, sums, products, mins and maxes. dplyr provides cummean() for cumulative means. Cumulative sums tend to come up the most in practice:\n\nx &lt;- 1:10\ncumsum(x)\n#&gt;  [1]  1  3  6 10 15 21 28 36 45 55\n\nIf you need more complex rolling or sliding aggregates, try the slider package.\n\n\n14.4.8 Exercises\n\nExplain in words what each line of the code used to generate Figure 14.1 does.\nWhat trigonometric functions does R provide? Guess some names and look up the documentation. Do they use degrees or radians?\nCurrently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. You can see the basic problem by running the code below: there’s a gap between each hour.\n\nflights |&gt; \n  filter(month == 1, day == 1) |&gt; \n  ggplot(aes(x = sched_dep_time, y = dep_delay)) +\n  geom_point()\n\nConvert them to a more truthful representation of time (either fractional hours or minutes since midnight).\nRound dep_time and arr_time to the nearest five minutes.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "numbers.html#general-transformations",
    "href": "numbers.html#general-transformations",
    "title": "14  Numbers",
    "section": "14.5 General transformations",
    "text": "14.5 General transformations\nThe following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.\n\n14.5.1 Ranks\ndplyr provides a number of ranking functions inspired by SQL, but you should always start with dplyr::min_rank(). It uses the typical method for dealing with ties, e.g., 1st, 2nd, 2nd, 4th.\n\nx &lt;- c(1, 2, 2, 3, 4, NA)\nmin_rank(x)\n#&gt; [1]  1  2  2  4  5 NA\n\nNote that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks:\n\nmin_rank(desc(x))\n#&gt; [1]  5  3  3  2  1 NA\n\nIf min_rank() doesn’t do what you need, look at the variants dplyr::row_number(), dplyr::dense_rank(), dplyr::percent_rank(), and dplyr::cume_dist(). See the documentation for details.\n\ndf &lt;- tibble(x = x)\ndf |&gt; \n  mutate(\n    row_number = row_number(x),\n    dense_rank = dense_rank(x),\n    percent_rank = percent_rank(x),\n    cume_dist = cume_dist(x)\n  )\n#&gt; # A tibble: 6 × 5\n#&gt;       x row_number dense_rank percent_rank cume_dist\n#&gt;   &lt;dbl&gt;      &lt;int&gt;      &lt;int&gt;        &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1     1          1          1         0          0.2\n#&gt; 2     2          2          2         0.25       0.6\n#&gt; 3     2          3          2         0.25       0.6\n#&gt; 4     3          4          3         0.75       0.8\n#&gt; 5     4          5          4         1          1  \n#&gt; 6    NA         NA         NA        NA         NA\n\nYou can achieve many of the same results by picking the appropriate ties.method argument to base R’s rank(); you’ll probably also want to set na.last = \"keep\" to keep NAs as NA.\nrow_number() can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row. When combined with %% or %/% this can be a useful tool for dividing data into similarly sized groups:\n\ndf &lt;- tibble(id = 1:10)\n\ndf |&gt; \n  mutate(\n    row0 = row_number() - 1,\n    three_groups = row0 %% 3,\n    three_in_each_group = row0 %/% 3\n  )\n#&gt; # A tibble: 10 × 4\n#&gt;      id  row0 three_groups three_in_each_group\n#&gt;   &lt;int&gt; &lt;dbl&gt;        &lt;dbl&gt;               &lt;dbl&gt;\n#&gt; 1     1     0            0                   0\n#&gt; 2     2     1            1                   0\n#&gt; 3     3     2            2                   0\n#&gt; 4     4     3            0                   1\n#&gt; 5     5     4            1                   1\n#&gt; 6     6     5            2                   1\n#&gt; # ℹ 4 more rows\n\n\n\n14.5.2 Offsets\ndplyr::lead() and dplyr::lag() allow you to refer the values just before or just after the “current” value. They return a vector of the same length as the input, padded with NAs at the start or end:\n\nx &lt;- c(2, 5, 11, 11, 19, 35)\nlag(x)\n#&gt; [1] NA  2  5 11 11 19\nlead(x)\n#&gt; [1]  5 11 11 19 35 NA\n\n\nx - lag(x) gives you the difference between the current and previous value.\n\nx - lag(x)\n#&gt; [1] NA  3  6  0  8 16\n\nx == lag(x) tells you when the current value changes.\n\nx == lag(x)\n#&gt; [1]    NA FALSE FALSE  TRUE FALSE FALSE\n\n\nYou can lead or lag by more than one position by using the second argument, n.\n\n\n14.5.3 Consecutive identifiers\nSometimes you want to start a new group every time some event occurs. For example, when you’re looking at website data, it’s common to want to break up events into sessions, where you begin a new session after gap of more than x minutes since the last activity. For example, imagine you have the times when someone visited a website:\n\nevents &lt;- tibble(\n  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)\n)\n\nAnd you’ve computed the time between each event, and figured out if there’s a gap that’s big enough to qualify:\n\nevents &lt;- events |&gt; \n  mutate(\n    diff = time - lag(time, default = first(time)),\n    has_gap = diff &gt;= 5\n  )\nevents\n#&gt; # A tibble: 14 × 3\n#&gt;    time  diff has_gap\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;lgl&gt;  \n#&gt; 1     0     0 FALSE  \n#&gt; 2     1     1 FALSE  \n#&gt; 3     2     1 FALSE  \n#&gt; 4     3     1 FALSE  \n#&gt; 5     5     2 FALSE  \n#&gt; 6    10     5 TRUE   \n#&gt; # ℹ 8 more rows\n\nBut how do we go from that logical vector to something that we can group_by()? cumsum(), from Section 14.4.7, comes to the rescue as gap, i.e. has_gap is TRUE, will increment group by one (Section 13.4.2):\n\nevents |&gt; mutate(\n  group = cumsum(has_gap)\n)\n#&gt; # A tibble: 14 × 4\n#&gt;    time  diff has_gap group\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;lgl&gt;   &lt;int&gt;\n#&gt; 1     0     0 FALSE       0\n#&gt; 2     1     1 FALSE       0\n#&gt; 3     2     1 FALSE       0\n#&gt; 4     3     1 FALSE       0\n#&gt; 5     5     2 FALSE       0\n#&gt; 6    10     5 TRUE        1\n#&gt; # ℹ 8 more rows\n\nAnother approach for creating grouping variables is consecutive_id(), which starts a new group every time one of its arguments changes. For example, inspired by this stackoverflow question, imagine you have a data frame with a bunch of repeated values:\n\ndf &lt;- tibble(\n  x = c(\"a\", \"a\", \"a\", \"b\", \"c\", \"c\", \"d\", \"e\", \"a\", \"a\", \"b\", \"b\"),\n  y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)\n)\n\nIf you want to keep the first row from each repeated x, you could use group_by(), consecutive_id(), and slice_head():\n\ndf |&gt; \n  group_by(id = consecutive_id(x)) |&gt; \n  slice_head(n = 1)\n#&gt; # A tibble: 7 × 3\n#&gt; # Groups:   id [7]\n#&gt;   x         y    id\n#&gt;   &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;\n#&gt; 1 a         1     1\n#&gt; 2 b         2     2\n#&gt; 3 c         4     3\n#&gt; 4 d         3     4\n#&gt; 5 e         9     5\n#&gt; 6 a         4     6\n#&gt; # ℹ 1 more row\n\n\n\n14.5.4 Exercises\n\nFind the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().\nWhich plane (tailnum) has the worst on-time record?\nWhat time of day should you fly if you want to avoid delays as much as possible?\nWhat does flights |&gt; group_by(dest) |&gt; filter(row_number() &lt; 4) do? What does flights |&gt; group_by(dest) |&gt; filter(row_number(dep_delay) &lt; 4) do?\nFor each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.\nDelays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the average flight delay for an hour is related to the average delay for the previous hour.\n\nflights |&gt; \n  mutate(hour = dep_time %/% 100) |&gt; \n  group_by(year, month, day, hour) |&gt; \n  summarize(\n    dep_delay = mean(dep_delay, na.rm = TRUE),\n    n = n(),\n    .groups = \"drop\"\n  ) |&gt; \n  filter(n &gt; 5)\n\nLook at each destination. Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)? Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?\nFind all destinations that are flown by at least two carriers. Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "numbers.html#numeric-summaries",
    "href": "numbers.html#numeric-summaries",
    "title": "14  Numbers",
    "section": "14.6 Numeric summaries",
    "text": "14.6 Numeric summaries\nJust using the counts, means, and sums that we’ve introduced already can get you a long way, but R provides many other useful summary functions. Here is a selection that you might find useful.\n\n14.6.1 Center\nSo far, we’ve mostly used mean() to summarize the center of a vector of values. As we’ve seen in Section 4.6, because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the median(), which finds a value that lies in the “middle” of the vector, i.e. 50% of the values is above it and 50% are below it. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.\nFigure 14.2 compares the mean vs. the median departure delay (in minutes) for each destination. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.\n\nflights |&gt;\n  group_by(year, month, day) |&gt;\n  summarize(\n    mean = mean(dep_delay, na.rm = TRUE),\n    median = median(dep_delay, na.rm = TRUE),\n    n = n(),\n    .groups = \"drop\"\n  ) |&gt; \n  ggplot(aes(x = mean, y = median)) + \n  geom_abline(slope = 1, intercept = 0, color = \"white\", linewidth = 2) +\n  geom_point()\n\n\n\n\n\n\n\nFigure 14.2: A scatterplot showing the differences of summarizing daily departure delay with median instead of mean.\n\n\n\n\n\nYou might also wonder about the mode, or the most common value. This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn’t work well for many real datasets. If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different. For these reasons, the mode tends not to be used by statisticians and there’s no mode function included in base R2.\n\n\n14.6.2 Minimum, maximum, and quantiles\nWhat if you’re interested in locations other than the center? min() and max() will give you the largest and smallest values. Another powerful tool is quantile() which is a generalization of the median: quantile(x, 0.25) will find the value of x that is greater than 25% of the values, quantile(x, 0.5) is equivalent to the median, and quantile(x, 0.95) will find the value that’s greater than 95% of the values.\nFor the flights data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.\n\nflights |&gt;\n  group_by(year, month, day) |&gt;\n  summarize(\n    max = max(dep_delay, na.rm = TRUE),\n    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),\n    .groups = \"drop\"\n  )\n#&gt; # A tibble: 365 × 5\n#&gt;    year month   day   max   q95\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  2013     1     1   853  70.1\n#&gt; 2  2013     1     2   379  85  \n#&gt; 3  2013     1     3   291  68  \n#&gt; 4  2013     1     4   288  60  \n#&gt; 5  2013     1     5   327  41  \n#&gt; 6  2013     1     6   202  51  \n#&gt; # ℹ 359 more rows\n\n\n\n14.6.3 Spread\nSometimes you’re not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, sd(x), and the inter-quartile range, IQR(). We won’t explain sd() here since you’re probably already familiar with it, but IQR() might be new — it’s quantile(x, 0.75) - quantile(x, 0.25) and gives you the range that contains the middle 50% of the data.\nWe can use this to reveal a small oddity in the flights data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below reveals a data oddity for airport EGE:\n\nflights |&gt; \n  group_by(origin, dest) |&gt; \n  summarize(\n    distance_iqr = IQR(distance), \n    n = n(),\n    .groups = \"drop\"\n  ) |&gt; \n  filter(distance_iqr &gt; 0)\n#&gt; # A tibble: 2 × 4\n#&gt;   origin dest  distance_iqr     n\n#&gt;   &lt;chr&gt;  &lt;chr&gt;        &lt;dbl&gt; &lt;int&gt;\n#&gt; 1 EWR    EGE              1   110\n#&gt; 2 JFK    EGE              1   103\n\n\n\n14.6.4 Distributions\nIt’s worth remembering that all of the summary statistics described above are a way of reducing the distribution down to a single number. This means that they’re fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups. That’s why it’s always a good idea to visualize the distribution before committing to your summary statistics.\nFigure 14.3 shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.\n\n\n\n\n\n\n\n\nFigure 14.3: (Left) The histogram of the full data is extremely skewed making it hard to get any details. (Right) Zooming into delays of less than two hours makes it possible to see what’s happening with the bulk of the observations.\n\n\n\n\n\nIt’s also a good idea to check that distributions for subgroups resemble the whole. In the following plot 365 frequency polygons of dep_delay, one for each day, are overlaid. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.\n\nflights |&gt;\n  filter(dep_delay &lt; 120) |&gt; \n  ggplot(aes(x = dep_delay, group = interaction(day, month))) + \n  geom_freqpoly(binwidth = 5, alpha = 1/5)\n\n\n\n\n\n\n\n\nDon’t be afraid to explore your own custom summaries specifically tailored for the data that you’re working with. In this case, that might mean separately summarizing the flights that left early vs. the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. Finally, don’t forget what you learned in Section 4.6: whenever creating numerical summaries, it’s a good idea to include the number of observations in each group.\n\n\n14.6.5 Positions\nThere’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position: first(x), last(x), and nth(x, n).\nFor example, we can find the first, fifth and last departure for each day:\n\nflights |&gt; \n  group_by(year, month, day) |&gt; \n  summarize(\n    first_dep = first(dep_time, na_rm = TRUE), \n    fifth_dep = nth(dep_time, 5, na_rm = TRUE),\n    last_dep = last(dep_time, na_rm = TRUE)\n  )\n#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using\n#&gt; the `.groups` argument.\n#&gt; # A tibble: 365 × 6\n#&gt; # Groups:   year, month [12]\n#&gt;    year month   day first_dep fifth_dep last_dep\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;     &lt;int&gt;     &lt;int&gt;    &lt;int&gt;\n#&gt; 1  2013     1     1       517       554     2356\n#&gt; 2  2013     1     2        42       535     2354\n#&gt; 3  2013     1     3        32       520     2349\n#&gt; 4  2013     1     4        25       531     2358\n#&gt; 5  2013     1     5        14       534     2357\n#&gt; 6  2013     1     6        16       555     2355\n#&gt; # ℹ 359 more rows\n\n(NB: Because dplyr functions use _ to separate components of function and arguments names, these functions use na_rm instead of na.rm.)\nIf you’re familiar with [, which we’ll come back to in Section 28.2, you might wonder if you ever need these functions. There are three reasons: the default argument allows you to provide a default if the specified position doesn’t exist, the order_by argument allows you to locally override the order of the rows, and the na_rm argument allows you to drop missing values.\nExtracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:\n\nflights |&gt; \n  group_by(year, month, day) |&gt; \n  mutate(r = min_rank(sched_dep_time)) |&gt; \n  filter(r %in% c(1, max(r)))\n#&gt; # A tibble: 1,195 × 20\n#&gt; # Groups:   year, month, day [365]\n#&gt;    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;          &lt;int&gt;\n#&gt; 1  2013     1     1      517            515         2      830            819\n#&gt; 2  2013     1     1     2353           2359        -6      425            445\n#&gt; 3  2013     1     1     2353           2359        -6      418            442\n#&gt; 4  2013     1     1     2356           2359        -3      425            437\n#&gt; 5  2013     1     2       42           2359        43      518            442\n#&gt; 6  2013     1     2      458            500        -2      703            650\n#&gt; # ℹ 1,189 more rows\n#&gt; # ℹ 12 more variables: arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, …\n\n\n\n14.6.6 With mutate()\nAs the names suggest, the summary functions are typically paired with summarize(). However, because of the recycling rules we discussed in Section 14.4.1 they can also be usefully paired with mutate(), particularly when you want do some sort of group standardization. For example:\n\nx / sum(x) calculates the proportion of a total.\n(x - mean(x)) / sd(x) computes a Z-score (standardized to mean 0 and sd 1).\n(x - min(x)) / (max(x) - min(x)) standardizes to range [0, 1].\nx / first(x) computes an index based on the first observation.\n\n\n\n14.6.7 Exercises\n\nBrainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. When is mean() useful? When is median() useful? When might you want to use something else? Should you use arrival delay or departure delay? Why might you want to use data from planes?\nWhich destinations show the greatest variation in air speed?\nCreate a plot to further explore the adventures of EGE. Can you find any evidence that the airport moved locations? Can you find another variable that might explain the difference?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "numbers.html#summary",
    "href": "numbers.html#summary",
    "title": "14  Numbers",
    "section": "14.7 Summary",
    "text": "14.7 Summary\nYou’re already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You’ve also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors like ranks and offsets. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.\nOver the next two chapters, we’ll dive into working with strings with the stringr package. Strings are a big topic so they get two chapters, one on the fundamentals of strings and one on regular expressions.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "numbers.html#footnotes",
    "href": "numbers.html#footnotes",
    "title": "14  Numbers",
    "section": "",
    "text": "ggplot2 provides some helpers for common cases in cut_interval(), cut_number(), and cut_width(). ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.↩︎\nThe mode() function does something quite different!↩︎",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>14</span>  <span class='chapter-title'>Numbers</span>"
    ]
  },
  {
    "objectID": "strings.html",
    "href": "strings.html",
    "title": "15  Strings",
    "section": "",
    "text": "15.1 Introduction\nSo far, you’ve used a bunch of strings without learning much about the details. Now it’s time to dive into them, learn what makes strings tick, and master some of the powerful string manipulation tools you have at your disposal.\nWe’ll begin with the details of creating strings and character vectors. You’ll then dive into creating strings from data, then the opposite: extracting strings from data. We’ll then discuss tools that work with individual letters. The chapter finishes with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.\nWe’ll keep working with strings in the next chapter, where you’ll learn more about the power of regular expressions.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "strings.html#introduction",
    "href": "strings.html#introduction",
    "title": "15  Strings",
    "section": "",
    "text": "15.1.1 Prerequisites\nIn this chapter, we’ll use functions from the stringr package, which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.\n\nlibrary(tidyverse)\nlibrary(babynames)\n\nYou can quickly tell when you’re using a stringr function because all stringr functions start with str_. This is particularly useful if you use RStudio because typing str_ will trigger autocomplete, allowing you to jog your memory of the available functions.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "strings.html#creating-a-string",
    "href": "strings.html#creating-a-string",
    "title": "15  Strings",
    "section": "15.2 Creating a string",
    "text": "15.2 Creating a string\nWe’ve created strings in passing earlier in the book but didn’t discuss the details. Firstly, you can create a string using either single quotes (') or double quotes (\"). There’s no difference in behavior between the two, so in the interests of consistency, the tidyverse style guide recommends using \", unless the string contains multiple \".\n\nstring1 &lt;- \"This is a string\"\nstring2 &lt;- 'If I want to include a \"quote\" inside a string, I use single quotes'\n\nIf you forget to close a quote, you’ll see +, the continuation prompt:\n&gt; \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK IN A STRING\nIf this happens to you and you can’t figure out which quote to close, press Escape to cancel and try again.\n\n15.2.1 Escapes\nTo include a literal single or double quote in a string, you can use \\ to “escape” it:\n\ndouble_quote &lt;- \"\\\"\" # or '\"'\nsingle_quote &lt;- '\\'' # or \"'\"\n\nSo if you want to include a literal backslash in your string, you’ll need to escape it: \"\\\\\":\n\nbackslash &lt;- \"\\\\\"\n\nBeware that the printed representation of a string is not the same as the string itself because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use str_view()1:\n\nx &lt;- c(single_quote, double_quote, backslash)\nx\n#&gt; [1] \"'\"  \"\\\"\" \"\\\\\"\n\nstr_view(x)\n#&gt; [1] │ '\n#&gt; [2] │ \"\n#&gt; [3] │ \\\n\n\n\n15.2.2 Raw strings\nCreating a string with multiple quotes or backslashes gets confusing quickly. To illustrate the problem, let’s create a string that contains the contents of the code block where we define the double_quote and single_quote variables:\n\ntricky &lt;- \"double_quote &lt;- \\\"\\\\\\\"\\\" # or '\\\"'\nsingle_quote &lt;- '\\\\'' # or \\\"'\\\"\"\nstr_view(tricky)\n#&gt; [1] │ double_quote &lt;- \"\\\"\" # or '\"'\n#&gt;     │ single_quote &lt;- '\\'' # or \"'\"\n\nThat’s a lot of backslashes! (This is sometimes called leaning toothpick syndrome.) To eliminate the escaping, you can instead use a raw string2:\n\ntricky &lt;- r\"(double_quote &lt;- \"\\\"\" # or '\"'\nsingle_quote &lt;- '\\'' # or \"'\")\"\nstr_view(tricky)\n#&gt; [1] │ double_quote &lt;- \"\\\"\" # or '\"'\n#&gt;     │ single_quote &lt;- '\\'' # or \"'\"\n\nA raw string usually starts with r\"( and finishes with )\". But if your string contains )\" you can instead use r\"[]\" or r\"{}\", and if that’s still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g., r\"--()--\", r\"---()---\", etc. Raw strings are flexible enough to handle any text.\n\n\n15.2.3 Other special characters\nAs well as \\\", \\', and \\\\, there are a handful of other special characters that may come in handy. The most common are \\n, a new line, and \\t, tab. You’ll also sometimes see strings containing Unicode escapes that start with \\u or \\U. This is a way of writing non-English characters that work on all systems. You can see the complete list of other special characters in ?Quotes.\n\nx &lt;- c(\"one\\ntwo\", \"one\\ttwo\", \"\\u00b5\", \"\\U0001f604\")\nx\n#&gt; [1] \"one\\ntwo\" \"one\\ttwo\" \"µ\"        \"😄\"\nstr_view(x)\n#&gt; [1] │ one\n#&gt;     │ two\n#&gt; [2] │ one{\\t}two\n#&gt; [3] │ µ\n#&gt; [4] │ 😄\n\nNote that str_view() uses curly braces for tabs to make them easier to spot3. One of the challenges of working with text is that there’s a variety of ways that white space can end up in the text, so this background helps you recognize that something strange is going on.\n\n\n15.2.4 Exercises\n\nCreate strings that contain the following values:\n\nHe said \"That's amazing!\"\n\\a\\b\\c\\d\n\\\\\\\\\\\\\n\nCreate the string in your R session and print it. What happens to the special “\\u00a0”? How does str_view() display it? Can you do a little googling to figure out what this special character is?\n\nx &lt;- \"This\\u00a0is\\u00a0tricky\"",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "strings.html#creating-many-strings-from-data",
    "href": "strings.html#creating-many-strings-from-data",
    "title": "15  Strings",
    "section": "15.3 Creating many strings from data",
    "text": "15.3 Creating many strings from data\nNow that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text you wrote that you want to combine with strings from a data frame. For example, you might combine “Hello” with a name variable to create a greeting. We’ll show you how to do this with str_c() and str_glue() and how you can use them with mutate(). That naturally raises the question of what stringr functions you might use with summarize(), so we’ll finish this section with a discussion of str_flatten(), which is a summary function for strings.\n\n15.3.1 str_c()\nstr_c() takes any number of vectors as arguments and returns a character vector:\n\nstr_c(\"x\", \"y\")\n#&gt; [1] \"xy\"\nstr_c(\"x\", \"y\", \"z\")\n#&gt; [1] \"xyz\"\nstr_c(\"Hello \", c(\"John\", \"Susan\"))\n#&gt; [1] \"Hello John\"  \"Hello Susan\"\n\nstr_c() is very similar to the base paste0(), but is designed to be used with mutate() by obeying the usual tidyverse rules for recycling and propagating missing values:\n\ndf &lt;- tibble(name = c(\"Flora\", \"David\", \"Terra\", NA))\ndf |&gt; mutate(greeting = str_c(\"Hi \", name, \"!\"))\n#&gt; # A tibble: 4 × 2\n#&gt;   name  greeting \n#&gt;   &lt;chr&gt; &lt;chr&gt;    \n#&gt; 1 Flora Hi Flora!\n#&gt; 2 David Hi David!\n#&gt; 3 Terra Hi Terra!\n#&gt; 4 &lt;NA&gt;  &lt;NA&gt;\n\nIf you want missing values to display in another way, use coalesce() to replace them. Depending on what you want, you might use it either inside or outside of str_c():\n\ndf |&gt; \n  mutate(\n    greeting1 = str_c(\"Hi \", coalesce(name, \"you\"), \"!\"),\n    greeting2 = coalesce(str_c(\"Hi \", name, \"!\"), \"Hi!\")\n  )\n#&gt; # A tibble: 4 × 3\n#&gt;   name  greeting1 greeting2\n#&gt;   &lt;chr&gt; &lt;chr&gt;     &lt;chr&gt;    \n#&gt; 1 Flora Hi Flora! Hi Flora!\n#&gt; 2 David Hi David! Hi David!\n#&gt; 3 Terra Hi Terra! Hi Terra!\n#&gt; 4 &lt;NA&gt;  Hi you!   Hi!\n\n\n\n15.3.2 str_glue()\nIf you are mixing many fixed and variable strings with str_c(), you’ll notice that you type a lot of \"s, making it hard to see the overall goal of the code. An alternative approach is provided by the glue package via str_glue()4. You give it a single string that has a special feature: anything inside {} will be evaluated like it’s outside of the quotes:\n\ndf |&gt; mutate(greeting = str_glue(\"Hi {name}!\"))\n#&gt; # A tibble: 4 × 2\n#&gt;   name  greeting \n#&gt;   &lt;chr&gt; &lt;glue&gt;   \n#&gt; 1 Flora Hi Flora!\n#&gt; 2 David Hi David!\n#&gt; 3 Terra Hi Terra!\n#&gt; 4 &lt;NA&gt;  Hi NA!\n\nAs you can see, str_glue() currently converts missing values to the string \"NA\", unfortunately making it inconsistent with str_c().\nYou also might wonder what happens if you need to include a regular { or } in your string. You’re on the right track if you guess you’ll need to escape it somehow. The trick is that glue uses a slightly different escaping technique: instead of prefixing with special character like \\, you double up the special characters:\n\ndf |&gt; mutate(greeting = str_glue(\"{{Hi {name}!}}\"))\n#&gt; # A tibble: 4 × 2\n#&gt;   name  greeting   \n#&gt;   &lt;chr&gt; &lt;glue&gt;     \n#&gt; 1 Flora {Hi Flora!}\n#&gt; 2 David {Hi David!}\n#&gt; 3 Terra {Hi Terra!}\n#&gt; 4 &lt;NA&gt;  {Hi NA!}\n\n\n\n15.3.3 str_flatten()\nstr_c() and str_glue() work well with mutate() because their output is the same length as their inputs. What if you want a function that works well with summarize(), i.e. something that always returns a single string? That’s the job of str_flatten()5: it takes a character vector and combines each element of the vector into a single string:\n\nstr_flatten(c(\"x\", \"y\", \"z\"))\n#&gt; [1] \"xyz\"\nstr_flatten(c(\"x\", \"y\", \"z\"), \", \")\n#&gt; [1] \"x, y, z\"\nstr_flatten(c(\"x\", \"y\", \"z\"), \", \", last = \", and \")\n#&gt; [1] \"x, y, and z\"\n\nThis makes it work well with summarize():\n\ndf &lt;- tribble(\n  ~ name, ~ fruit,\n  \"Carmen\", \"banana\",\n  \"Carmen\", \"apple\",\n  \"Marvin\", \"nectarine\",\n  \"Terence\", \"cantaloupe\",\n  \"Terence\", \"papaya\",\n  \"Terence\", \"mandarin\"\n)\ndf |&gt;\n  group_by(name) |&gt; \n  summarize(fruits = str_flatten(fruit, \", \"))\n#&gt; # A tibble: 3 × 2\n#&gt;   name    fruits                      \n#&gt;   &lt;chr&gt;   &lt;chr&gt;                       \n#&gt; 1 Carmen  banana, apple               \n#&gt; 2 Marvin  nectarine                   \n#&gt; 3 Terence cantaloupe, papaya, mandarin\n\n\n\n15.3.4 Exercises\n\nCompare and contrast the results of paste0() with str_c() for the following inputs:\n\nstr_c(\"hi \", NA)\nstr_c(letters[1:2], letters[1:3])\n\nWhat’s the difference between paste() and paste0()? How can you recreate the equivalent of paste() with str_c()?\nConvert the following expressions from str_c() to str_glue() or vice versa:\n\nstr_c(\"The price of \", food, \" is \", price)\nstr_glue(\"I'm {age} years old and live in {country}\")\nstr_c(\"\\\\section{\", title, \"}\")",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "strings.html#extracting-data-from-strings",
    "href": "strings.html#extracting-data-from-strings",
    "title": "15  Strings",
    "section": "15.4 Extracting data from strings",
    "text": "15.4 Extracting data from strings\nIt’s very common for multiple variables to be crammed together into a single string. In this section, you’ll learn how to use four tidyr functions to extract them:\n\ndf |&gt; separate_longer_delim(col, delim)\ndf |&gt; separate_longer_position(col, width)\ndf |&gt; separate_wider_delim(col, delim, names)\ndf |&gt; separate_wider_position(col, widths)\n\nIf you look closely, you can see there’s a common pattern here: separate_, then longer or wider, then _, then by delim or position. That’s because these four functions are composed of two simpler primitives:\n\nJust like with pivot_longer() and pivot_wider(), _longer functions make the input data frame longer by creating new rows and _wider functions make the input data frame wider by generating new columns.\ndelim splits up a string with a delimiter like \", \" or \" \"; position splits at specified widths, like c(3, 5, 2).\n\nWe’ll return to the last member of this family, separate_wider_regex(), in Chapter 16. It’s the most flexible of the wider functions, but you need to know something about regular expressions before you can use it.\nThe following two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns. We’ll finish off by discussing the tools that the wider functions give you to diagnose problems.\n\n15.4.1 Separating into rows\nSeparating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is requiring separate_longer_delim() to split based on a delimiter:\n\ndf1 &lt;- tibble(x = c(\"a,b,c\", \"d,e\", \"f\"))\ndf1 |&gt; \n  separate_longer_delim(x, delim = \",\")\n#&gt; # A tibble: 6 × 1\n#&gt;   x    \n#&gt;   &lt;chr&gt;\n#&gt; 1 a    \n#&gt; 2 b    \n#&gt; 3 c    \n#&gt; 4 d    \n#&gt; 5 e    \n#&gt; 6 f\n\nIt’s rarer to see separate_longer_position() in the wild, but some older datasets do use a very compact format where each character is used to record a value:\n\ndf2 &lt;- tibble(x = c(\"1211\", \"131\", \"21\"))\ndf2 |&gt; \n  separate_longer_position(x, width = 1)\n#&gt; # A tibble: 9 × 1\n#&gt;   x    \n#&gt;   &lt;chr&gt;\n#&gt; 1 1    \n#&gt; 2 2    \n#&gt; 3 1    \n#&gt; 4 1    \n#&gt; 5 1    \n#&gt; 6 3    \n#&gt; # ℹ 3 more rows\n\n\n\n15.4.2 Separating into columns\nSeparating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their longer equivalents because you need to name the columns. For example, in this following dataset, x is made up of a code, an edition number, and a year, separated by \".\". To use separate_wider_delim(), we supply the delimiter and the names in two arguments:\n\ndf3 &lt;- tibble(x = c(\"a10.1.2022\", \"b10.2.2011\", \"e15.1.2015\"))\ndf3 |&gt; \n  separate_wider_delim(\n    x,\n    delim = \".\",\n    names = c(\"code\", \"edition\", \"year\")\n  )\n#&gt; # A tibble: 3 × 3\n#&gt;   code  edition year \n#&gt;   &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;\n#&gt; 1 a10   1       2022 \n#&gt; 2 b10   2       2011 \n#&gt; 3 e15   1       2015\n\nIf a specific piece is not useful you can use an NA name to omit it from the results:\n\ndf3 |&gt; \n  separate_wider_delim(\n    x,\n    delim = \".\",\n    names = c(\"code\", NA, \"year\")\n  )\n#&gt; # A tibble: 3 × 2\n#&gt;   code  year \n#&gt;   &lt;chr&gt; &lt;chr&gt;\n#&gt; 1 a10   2022 \n#&gt; 2 b10   2011 \n#&gt; 3 e15   2015\n\nseparate_wider_position() works a little differently because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column, and the value is the number of characters it occupies. You can omit values from the output by not naming them:\n\ndf4 &lt;- tibble(x = c(\"202215TX\", \"202122LA\", \"202325CA\")) \ndf4 |&gt; \n  separate_wider_position(\n    x,\n    widths = c(year = 4, age = 2, state = 2)\n  )\n#&gt; # A tibble: 3 × 3\n#&gt;   year  age   state\n#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;\n#&gt; 1 2022  15    TX   \n#&gt; 2 2021  22    LA   \n#&gt; 3 2023  25    CA\n\n\n\n15.4.3 Diagnosing widening problems\nseparate_wider_delim()6 requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so separate_wider_delim() provides two arguments to help: too_few and too_many. Let’s first look at the too_few case with the following sample dataset:\n\ndf &lt;- tibble(x = c(\"1-1-1\", \"1-1-2\", \"1-3\", \"1-3-2\", \"1\"))\n\ndf |&gt; \n  separate_wider_delim(\n    x,\n    delim = \"-\",\n    names = c(\"x\", \"y\", \"z\")\n  )\n#&gt; Error in `separate_wider_delim()`:\n#&gt; ! Expected 3 pieces in each element of `x`.\n#&gt; ! 2 values were too short.\n#&gt; ℹ Use `too_few = \"debug\"` to diagnose the problem.\n#&gt; ℹ Use `too_few = \"align_start\"/\"align_end\"` to silence this message.\n\nYou’ll notice that we get an error, but the error gives us some suggestions on how you might proceed. Let’s start by debugging the problem:\n\ndebug &lt;- df |&gt; \n  separate_wider_delim(\n    x,\n    delim = \"-\",\n    names = c(\"x\", \"y\", \"z\"),\n    too_few = \"debug\"\n  )\n#&gt; Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and\n#&gt; `x_remainder`.\ndebug\n#&gt; # A tibble: 5 × 6\n#&gt;   x     y     z     x_ok  x_pieces x_remainder\n#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;      \n#&gt; 1 1-1-1 1     1     TRUE         3 \"\"         \n#&gt; 2 1-1-2 1     2     TRUE         3 \"\"         \n#&gt; 3 1-3   3     &lt;NA&gt;  FALSE        2 \"\"         \n#&gt; 4 1-3-2 3     2     TRUE         3 \"\"         \n#&gt; 5 1     &lt;NA&gt;  &lt;NA&gt;  FALSE        1 \"\"\n\nWhen you use the debug mode, you get three extra columns added to the output: x_ok, x_pieces, and x_remainder (if you separate a variable with a different name, you’ll get a different prefix). Here, x_ok lets you quickly find the inputs that failed:\n\ndebug |&gt; filter(!x_ok)\n#&gt; # A tibble: 2 × 6\n#&gt;   x     y     z     x_ok  x_pieces x_remainder\n#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;      \n#&gt; 1 1-3   3     &lt;NA&gt;  FALSE        2 \"\"         \n#&gt; 2 1     &lt;NA&gt;  &lt;NA&gt;  FALSE        1 \"\"\n\nx_pieces tells us how many pieces were found, compared to the expected 3 (the length of names). x_remainder isn’t useful when there are too few pieces, but we’ll see it again shortly.\nSometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove too_few = \"debug\" to ensure that new problems become errors.\nIn other cases, you may want to fill in the missing pieces with NAs and move on. That’s the job of too_few = \"align_start\" and too_few = \"align_end\" which allow you to control where the NAs should go:\n\ndf |&gt; \n  separate_wider_delim(\n    x,\n    delim = \"-\",\n    names = c(\"x\", \"y\", \"z\"),\n    too_few = \"align_start\"\n  )\n#&gt; # A tibble: 5 × 3\n#&gt;   x     y     z    \n#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;\n#&gt; 1 1     1     1    \n#&gt; 2 1     1     2    \n#&gt; 3 1     3     &lt;NA&gt; \n#&gt; 4 1     3     2    \n#&gt; 5 1     &lt;NA&gt;  &lt;NA&gt;\n\nThe same principles apply if you have too many pieces:\n\ndf &lt;- tibble(x = c(\"1-1-1\", \"1-1-2\", \"1-3-5-6\", \"1-3-2\", \"1-3-5-7-9\"))\n\ndf |&gt; \n  separate_wider_delim(\n    x,\n    delim = \"-\",\n    names = c(\"x\", \"y\", \"z\")\n  )\n#&gt; Error in `separate_wider_delim()`:\n#&gt; ! Expected 3 pieces in each element of `x`.\n#&gt; ! 2 values were too long.\n#&gt; ℹ Use `too_many = \"debug\"` to diagnose the problem.\n#&gt; ℹ Use `too_many = \"drop\"/\"merge\"` to silence this message.\n\nBut now, when we debug the result, you can see the purpose of x_remainder:\n\ndebug &lt;- df |&gt; \n  separate_wider_delim(\n    x,\n    delim = \"-\",\n    names = c(\"x\", \"y\", \"z\"),\n    too_many = \"debug\"\n  )\n#&gt; Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and\n#&gt; `x_remainder`.\ndebug |&gt; filter(!x_ok)\n#&gt; # A tibble: 2 × 6\n#&gt;   x         y     z     x_ok  x_pieces x_remainder\n#&gt;   &lt;chr&gt;     &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;    &lt;int&gt; &lt;chr&gt;      \n#&gt; 1 1-3-5-6   3     5     FALSE        4 -6         \n#&gt; 2 1-3-5-7-9 3     5     FALSE        5 -7-9\n\nYou have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:\n\ndf |&gt; \n  separate_wider_delim(\n    x,\n    delim = \"-\",\n    names = c(\"x\", \"y\", \"z\"),\n    too_many = \"drop\"\n  )\n#&gt; # A tibble: 5 × 3\n#&gt;   x     y     z    \n#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;\n#&gt; 1 1     1     1    \n#&gt; 2 1     1     2    \n#&gt; 3 1     3     5    \n#&gt; 4 1     3     2    \n#&gt; 5 1     3     5\n\n\ndf |&gt; \n  separate_wider_delim(\n    x,\n    delim = \"-\",\n    names = c(\"x\", \"y\", \"z\"),\n    too_many = \"merge\"\n  )\n#&gt; # A tibble: 5 × 3\n#&gt;   x     y     z    \n#&gt;   &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;\n#&gt; 1 1     1     1    \n#&gt; 2 1     1     2    \n#&gt; 3 1     3     5-6  \n#&gt; 4 1     3     2    \n#&gt; 5 1     3     5-7-9",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "strings.html#letters",
    "href": "strings.html#letters",
    "title": "15  Strings",
    "section": "15.5 Letters",
    "text": "15.5 Letters\nIn this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.\n\n15.5.1 Length\nstr_length() tells you the number of letters in the string:\n\nstr_length(c(\"a\", \"R for data science\", NA))\n#&gt; [1]  1 18 NA\n\nYou could use this with count() to find the distribution of lengths of US babynames and then with filter() to look at the longest names, which happen to have 15 letters7:\n\nbabynames |&gt;\n  count(length = str_length(name), wt = n)\n#&gt; # A tibble: 14 × 2\n#&gt;   length        n\n#&gt;    &lt;int&gt;    &lt;int&gt;\n#&gt; 1      2   338150\n#&gt; 2      3  8589596\n#&gt; 3      4 48506739\n#&gt; 4      5 87011607\n#&gt; 5      6 90749404\n#&gt; 6      7 72120767\n#&gt; # ℹ 8 more rows\n\nbabynames |&gt; \n  filter(str_length(name) == 15) |&gt; \n  count(name, wt = n, sort = TRUE)\n#&gt; # A tibble: 34 × 2\n#&gt;   name                n\n#&gt;   &lt;chr&gt;           &lt;int&gt;\n#&gt; 1 Franciscojavier   123\n#&gt; 2 Christopherjohn   118\n#&gt; 3 Johnchristopher   118\n#&gt; 4 Christopherjame   108\n#&gt; 5 Christophermich    52\n#&gt; 6 Ryanchristopher    45\n#&gt; # ℹ 28 more rows\n\n\n\n15.5.2 Subsetting\nYou can extract parts of a string using str_sub(string, start, end), where start and end are the positions where the substring should start and end. The start and end arguments are inclusive, so the length of the returned string will be end - start + 1:\n\nx &lt;- c(\"Apple\", \"Banana\", \"Pear\")\nstr_sub(x, 1, 3)\n#&gt; [1] \"App\" \"Ban\" \"Pea\"\n\nYou can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.\n\nstr_sub(x, -3, -1)\n#&gt; [1] \"ple\" \"ana\" \"ear\"\n\nNote that str_sub() won’t fail if the string is too short: it will just return as much as possible:\n\nstr_sub(\"a\", 1, 5)\n#&gt; [1] \"a\"\n\nWe could use str_sub() with mutate() to find the first and last letter of each name:\n\nbabynames |&gt; \n  mutate(\n    first = str_sub(name, 1, 1),\n    last = str_sub(name, -1, -1)\n  )\n#&gt; # A tibble: 1,924,665 × 7\n#&gt;    year sex   name          n   prop first last \n#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;     &lt;int&gt;  &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;\n#&gt; 1  1880 F     Mary       7065 0.0724 M     y    \n#&gt; 2  1880 F     Anna       2604 0.0267 A     a    \n#&gt; 3  1880 F     Emma       2003 0.0205 E     a    \n#&gt; 4  1880 F     Elizabeth  1939 0.0199 E     h    \n#&gt; 5  1880 F     Minnie     1746 0.0179 M     e    \n#&gt; 6  1880 F     Margaret   1578 0.0162 M     t    \n#&gt; # ℹ 1,924,659 more rows\n\n\n\n15.5.3 Exercises\n\nWhen computing the distribution of the length of babynames, why did we use wt = n?\nUse str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?\nAre there any major trends in the length of babynames over time? What about the popularity of first and last letters?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "strings.html#sec-other-languages",
    "href": "strings.html#sec-other-languages",
    "title": "15  Strings",
    "section": "15.6 Non-English text",
    "text": "15.6 Non-English text\nSo far, we’ve focused on English language text which is particularly easy to work with for two reasons. Firstly, the English alphabet is relatively simple: there are just 26 letters. Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers. Unfortunately, we don’t have room for a full treatment of non-English languages. Still, we wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.\n\n15.6.1 Encoding\nWhen working with non-English text, the first challenge is often the encoding. To understand what’s going on, we need to dive into how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():\n\ncharToRaw(\"Hadley\")\n#&gt; [1] 48 61 64 6c 65 79\n\nEach of these six hexadecimal numbers represents one letter: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case, the encoding is called ASCII. ASCII does a great job of representing English characters because it’s the American Standard Code for Information Interchange.\nThings aren’t so easy for languages other than English. In the early days of computing, there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages, and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte b1 is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today and many extra symbols like emojis.\nreadr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8. If this happens, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times, you’ll get complete gibberish. For example here are two inline CSVs with unusual encodings8:\n\nx1 &lt;- \"text\\nEl Ni\\xf1o was particularly bad this year\"\nread_csv(x1)$text\n#&gt; [1] \"El Ni\\xf1o was particularly bad this year\"\n\nx2 &lt;- \"text\\n\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd\"\nread_csv(x2)$text\n#&gt; [1] \"\\x82\\xb1\\x82\\xf1\\x82ɂ\\xbf\\x82\\xcd\"\n\nTo read these correctly, you specify the encoding via the locale argument:\n\nread_csv(x1, locale = locale(encoding = \"Latin1\"))$text\n#&gt; [1] \"El Niño was particularly bad this year\"\n\nread_csv(x2, locale = locale(encoding = \"Shift-JIS\"))$text\n#&gt; [1] \"こんにちは\"\n\nHow do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out. It’s not foolproof and works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.\nEncodings are a rich and complex topic; we’ve only scratched the surface here. If you’d like to learn more, we recommend reading the detailed explanation at http://kunststube.net/encoding/.\n\n\n15.6.2 Letter variations\nWorking in languages with accents poses a significant challenge when determining the position of letters (e.g., with str_length() and str_sub()) as accented letters might be encoded as a single individual character (e.g., ü) or as two characters by combining an unaccented letter (e.g., u) with a diacritic mark (e.g., ¨). For example, this code shows two ways of representing ü that look identical:\n\nu &lt;- c(\"\\u00fc\", \"u\\u0308\")\nstr_view(u)\n#&gt; [1] │ ü\n#&gt; [2] │ ü\n\nBut both strings differ in length, and their first characters are different:\n\nstr_length(u)\n#&gt; [1] 1 2\nstr_sub(u, 1, 1)\n#&gt; [1] \"ü\" \"u\"\n\nFinally, note that a comparison of these strings with == interprets these strings as different, while the handy str_equal() function in stringr recognizes that both have the same appearance:\n\nu[[1]] == u[[2]]\n#&gt; [1] FALSE\n\nstr_equal(u[[1]], u[[2]])\n#&gt; [1] TRUE\n\n\n\n15.6.3 Locale-dependent functions\nFinally, there are a handful of stringr functions whose behavior depends on your locale. A locale is similar to a language but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a _ and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, Wikipedia has a good list, and you can see which are supported in stringr by looking at stringi::stri_locale_list().\nBase R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to English rules by using the “en” locale and requires you to specify the locale argument to override it. Fortunately, there are two sets of functions where the locale really matters: changing case and sorting.\nThe rules for changing cases differ among languages. For example, Turkish has two i’s: with and without a dot. Since they’re two distinct letters, they’re capitalized differently:\n\nstr_to_upper(c(\"i\", \"ı\"))\n#&gt; [1] \"I\" \"I\"\nstr_to_upper(c(\"i\", \"ı\"), locale = \"tr\")\n#&gt; [1] \"İ\" \"I\"\n\nSorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language9! Here’s an example: in Czech, “ch” is a compound letter that appears after h in the alphabet.\n\nstr_sort(c(\"a\", \"c\", \"ch\", \"h\", \"z\"))\n#&gt; [1] \"a\"  \"c\"  \"ch\" \"h\"  \"z\"\nstr_sort(c(\"a\", \"c\", \"ch\", \"h\", \"z\"), locale = \"cs\")\n#&gt; [1] \"a\"  \"c\"  \"h\"  \"ch\" \"z\"\n\nThis also comes up when sorting strings with dplyr::arrange(), which is why it also has a locale argument.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "strings.html#summary",
    "href": "strings.html#summary",
    "title": "15  Strings",
    "section": "15.7 Summary",
    "text": "15.7 Summary\nIn this chapter, you’ve learned about some of the power of the stringr package: how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings and are the topic of the next chapter.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "strings.html#footnotes",
    "href": "strings.html#footnotes",
    "title": "15  Strings",
    "section": "",
    "text": "Or use the base R function writeLines().↩︎\nAvailable in R 4.0.0 and above.↩︎\nstr_view() also uses color to bring tabs, spaces, matches, etc. to your attention. The colors don’t currently show up in the book, but you’ll notice them when running code interactively.↩︎\nIf you’re not using stringr, you can also access it directly with glue::glue().↩︎\nThe base R equivalent is paste() used with the collapse argument.↩︎\nThe same principles apply to separate_wider_position() and separate_wider_regex().↩︎\nLooking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.↩︎\nHere I’m using the special \\x to encode binary data directly into a string.↩︎\nSorting in languages that don’t have an alphabet, like Chinese, is more complicated still.↩︎",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>15</span>  <span class='chapter-title'>Strings</span>"
    ]
  },
  {
    "objectID": "regexps.html",
    "href": "regexps.html",
    "title": "16  Regular expressions",
    "section": "",
    "text": "16.1 Introduction\nIn Chapter 15, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use regular expressions, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”1 or “regexp”.\nThe chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#introduction",
    "href": "regexps.html#introduction",
    "title": "16  Regular expressions",
    "section": "",
    "text": "16.1.1 Prerequisites\nIn this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.\n\nlibrary(tidyverse)\nlibrary(babynames)\n\nThrough this chapter, we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:\n\nfruit contains the names of 80 fruits.\nwords contains 980 common English words.\nsentences contains 720 short sentences.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#sec-reg-basics",
    "href": "regexps.html#sec-reg-basics",
    "title": "16  Regular expressions",
    "section": "16.2 Pattern basics",
    "text": "16.2 Pattern basics\nWe’ll use str_view() to learn how regex patterns work. We used str_view() in the last chapter to better understand a string vs. its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, str_view() will show only the elements of the string vector that match, surrounding each match with &lt;&gt;, and, where possible, highlighting the match in blue.\nThe simplest patterns consist of letters and numbers which match those characters exactly:\n\nstr_view(fruit, \"berry\")\n#&gt;  [6] │ bil&lt;berry&gt;\n#&gt;  [7] │ black&lt;berry&gt;\n#&gt; [10] │ blue&lt;berry&gt;\n#&gt; [11] │ boysen&lt;berry&gt;\n#&gt; [19] │ cloud&lt;berry&gt;\n#&gt; [21] │ cran&lt;berry&gt;\n#&gt; ... and 8 more\n\nLetters and numbers match exactly and are called literal characters. Most punctuation characters, like ., +, *, [, ], and ?, have special meanings2 and are called metacharacters. For example, . will match any character3, so \"a.\" will match any string that contains an “a” followed by another character :\n\nstr_view(c(\"a\", \"ab\", \"ae\", \"bd\", \"ea\", \"eab\"), \"a.\")\n#&gt; [2] │ &lt;ab&gt;\n#&gt; [3] │ &lt;ae&gt;\n#&gt; [6] │ e&lt;ab&gt;\n\nOr we could find all the fruits that contain an “a”, followed by three letters, followed by an “e”:\n\nstr_view(fruit, \"a...e\")\n#&gt;  [1] │ &lt;apple&gt;\n#&gt;  [7] │ bl&lt;ackbe&gt;rry\n#&gt; [48] │ mand&lt;arine&gt;\n#&gt; [51] │ nect&lt;arine&gt;\n#&gt; [62] │ pine&lt;apple&gt;\n#&gt; [64] │ pomegr&lt;anate&gt;\n#&gt; ... and 2 more\n\nQuantifiers control how many times a pattern can match:\n\n? makes a pattern optional (i.e. it matches 0 or 1 times)\n+ lets a pattern repeat (i.e. it matches at least once)\n* lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).\n\n\n# ab? matches an \"a\", optionally followed by a \"b\".\nstr_view(c(\"a\", \"ab\", \"abb\"), \"ab?\")\n#&gt; [1] │ &lt;a&gt;\n#&gt; [2] │ &lt;ab&gt;\n#&gt; [3] │ &lt;ab&gt;b\n\n# ab+ matches an \"a\", followed by at least one \"b\".\nstr_view(c(\"a\", \"ab\", \"abb\"), \"ab+\")\n#&gt; [2] │ &lt;ab&gt;\n#&gt; [3] │ &lt;abb&gt;\n\n# ab* matches an \"a\", followed by any number of \"b\"s.\nstr_view(c(\"a\", \"ab\", \"abb\"), \"ab*\")\n#&gt; [1] │ &lt;a&gt;\n#&gt; [2] │ &lt;ab&gt;\n#&gt; [3] │ &lt;abb&gt;\n\nCharacter classes are defined by [] and let you match a set of characters, e.g., [abcd] matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with ^: [^abcd] matches anything except “a”, “b”, “c”, or “d”. We can use this idea to find the words containing an “x” surrounded by vowels, or a “y” surrounded by consonants:\n\nstr_view(words, \"[aeiou]x[aeiou]\")\n#&gt; [284] │ &lt;exa&gt;ct\n#&gt; [285] │ &lt;exa&gt;mple\n#&gt; [288] │ &lt;exe&gt;rcise\n#&gt; [289] │ &lt;exi&gt;st\nstr_view(words, \"[^aeiou]y[^aeiou]\")\n#&gt; [836] │ &lt;sys&gt;tem\n#&gt; [901] │ &lt;typ&gt;e\n\nYou can use alternation, |, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “melon”, or “nut”, or a repeated vowel.\n\nstr_view(fruit, \"apple|melon|nut\")\n#&gt;  [1] │ &lt;apple&gt;\n#&gt; [13] │ canary &lt;melon&gt;\n#&gt; [20] │ coco&lt;nut&gt;\n#&gt; [52] │ &lt;nut&gt;\n#&gt; [62] │ pine&lt;apple&gt;\n#&gt; [72] │ rock &lt;melon&gt;\n#&gt; ... and 1 more\nstr_view(fruit, \"aa|ee|ii|oo|uu\")\n#&gt;  [9] │ bl&lt;oo&gt;d orange\n#&gt; [33] │ g&lt;oo&gt;seberry\n#&gt; [47] │ lych&lt;ee&gt;\n#&gt; [66] │ purple mangost&lt;ee&gt;n\n\nRegular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#sec-stringr-regex-funs",
    "href": "regexps.html#sec-stringr-regex-funs",
    "title": "16  Regular expressions",
    "section": "16.3 Key functions",
    "text": "16.3 Key functions\nNow that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.\n\n16.3.1 Detect matches\nstr_detect() returns a logical vector that is TRUE if the pattern matches an element of the character vector and FALSE otherwise:\n\nstr_detect(c(\"a\", \"b\", \"c\"), \"[aeiou]\")\n#&gt; [1]  TRUE FALSE FALSE\n\nSince str_detect() returns a logical vector of the same length as the initial vector, it pairs well with filter(). For example, this code finds all the most popular names containing a lower-case “x”:\n\nbabynames |&gt; \n  filter(str_detect(name, \"x\")) |&gt; \n  count(name, wt = n, sort = TRUE)\n#&gt; # A tibble: 974 × 2\n#&gt;   name           n\n#&gt;   &lt;chr&gt;      &lt;int&gt;\n#&gt; 1 Alexander 665492\n#&gt; 2 Alexis    399551\n#&gt; 3 Alex      278705\n#&gt; 4 Alexandra 232223\n#&gt; 5 Max       148787\n#&gt; 6 Alexa     123032\n#&gt; # ℹ 968 more rows\n\nWe can also use str_detect() with summarize() by pairing it with sum() or mean(): sum(str_detect(x, pattern)) tells you the number of observations that match and mean(str_detect(x, pattern)) tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names4 that contain “x”, broken down by year. It looks like they’ve radically increased in popularity lately!\n\nbabynames |&gt; \n  group_by(year) |&gt; \n  summarize(prop_x = mean(str_detect(name, \"x\"))) |&gt; \n  ggplot(aes(x = year, y = prop_x)) + \n  geom_line()\n\n\n\n\n\n\n\n\nThere are two functions that are closely related to str_detect(): str_subset() and str_which(). str_subset() returns a character vector containing only the strings that match. str_which() returns an integer vector giving the positions of the strings that match.\n\n\n16.3.2 Count matches\nThe next step up in complexity from str_detect() is str_count(): rather than a true or false, it tells you how many matches there are in each string.\n\nx &lt;- c(\"apple\", \"banana\", \"pear\")\nstr_count(x, \"p\")\n#&gt; [1] 2 0 1\n\nNote that each match starts at the end of the previous match, i.e. regex matches never overlap. For example, in \"abababa\", how many times will the pattern \"aba\" match? Regular expressions say two, not three:\n\nstr_count(\"abababa\", \"aba\")\n#&gt; [1] 2\nstr_view(\"abababa\", \"aba\")\n#&gt; [1] │ &lt;aba&gt;b&lt;aba&gt;\n\nIt’s natural to use str_count() with mutate(). The following example uses str_count() with character classes to count the number of vowels and consonants in each name.\n\nbabynames |&gt; \n  count(name) |&gt; \n  mutate(\n    vowels = str_count(name, \"[aeiou]\"),\n    consonants = str_count(name, \"[^aeiou]\")\n  )\n#&gt; # A tibble: 97,310 × 4\n#&gt;   name          n vowels consonants\n#&gt;   &lt;chr&gt;     &lt;int&gt;  &lt;int&gt;      &lt;int&gt;\n#&gt; 1 Aaban        10      2          3\n#&gt; 2 Aabha         5      2          3\n#&gt; 3 Aabid         2      2          3\n#&gt; 4 Aabir         1      2          3\n#&gt; 5 Aabriella     5      4          5\n#&gt; 6 Aada          1      2          2\n#&gt; # ℹ 97,304 more rows\n\nIf you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:\n\nAdd the upper case vowels to the character class: str_count(name, \"[aeiouAEIOU]\").\nTell the regular expression to ignore case: str_count(name, regex(\"[aeiou]\", ignore_case = TRUE)). We’ll talk about more in Section 16.5.1.\nUse str_to_lower() to convert the names to lower case: str_count(str_to_lower(name), \"[aeiou]\").\n\nThis variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.\nIn this case, since we’re applying two functions to the name, I think it’s easier to transform it first:\n\nbabynames |&gt; \n  count(name) |&gt; \n  mutate(\n    name = str_to_lower(name),\n    vowels = str_count(name, \"[aeiou]\"),\n    consonants = str_count(name, \"[^aeiou]\")\n  )\n#&gt; # A tibble: 97,310 × 4\n#&gt;   name          n vowels consonants\n#&gt;   &lt;chr&gt;     &lt;int&gt;  &lt;int&gt;      &lt;int&gt;\n#&gt; 1 aaban        10      3          2\n#&gt; 2 aabha         5      3          2\n#&gt; 3 aabid         2      3          2\n#&gt; 4 aabir         1      3          2\n#&gt; 5 aabriella     5      5          4\n#&gt; 6 aada          1      3          1\n#&gt; # ℹ 97,304 more rows\n\n\n\n16.3.3 Replace values\nAs well as detecting and counting matches, we can also modify them with str_replace() and str_replace_all(). str_replace() replaces the first match, and as the name suggests, str_replace_all() replaces all matches.\n\nx &lt;- c(\"apple\", \"pear\", \"banana\")\nstr_replace_all(x, \"[aeiou]\", \"-\")\n#&gt; [1] \"-ppl-\"  \"p--r\"   \"b-n-n-\"\n\nstr_remove() and str_remove_all() are handy shortcuts for str_replace(x, pattern, \"\"):\n\nx &lt;- c(\"apple\", \"pear\", \"banana\")\nstr_remove_all(x, \"[aeiou]\")\n#&gt; [1] \"ppl\" \"pr\"  \"bnn\"\n\nThese functions are naturally paired with mutate() when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.\n\n\n16.3.4 Extract variables\nThe last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: separate_wider_regex(). It’s a peer of the separate_wider_position() and separate_wider_delim() functions that you learned about in Section 15.4.2. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.\nLet’s create a simple dataset to show how it works. Here we have some data derived from babynames where we have the name, gender, and age of a bunch of people in a rather weird format5:\n\ndf &lt;- tribble(\n  ~str,\n  \"&lt;Sheryl&gt;-F_34\",\n  \"&lt;Kisha&gt;-F_45\", \n  \"&lt;Brandon&gt;-N_33\",\n  \"&lt;Sharon&gt;-F_38\", \n  \"&lt;Penny&gt;-F_58\",\n  \"&lt;Justin&gt;-M_41\", \n  \"&lt;Patricia&gt;-F_84\", \n)\n\nTo extract this data using separate_wider_regex() we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:\n\ndf |&gt; \n  separate_wider_regex(\n    str,\n    patterns = c(\n      \"&lt;\", \n      name = \"[A-Za-z]+\", \n      \"&gt;-\", \n      gender = \".\",\n      \"_\",\n      age = \"[0-9]+\"\n    )\n  )\n#&gt; # A tibble: 7 × 3\n#&gt;   name    gender age  \n#&gt;   &lt;chr&gt;   &lt;chr&gt;  &lt;chr&gt;\n#&gt; 1 Sheryl  F      34   \n#&gt; 2 Kisha   F      45   \n#&gt; 3 Brandon N      33   \n#&gt; 4 Sharon  F      38   \n#&gt; 5 Penny   F      58   \n#&gt; 6 Justin  M      41   \n#&gt; # ℹ 1 more row\n\nIf the match fails, you can use too_few = \"debug\" to figure out what went wrong, just like separate_wider_delim() and separate_wider_position().\n\n\n16.3.5 Exercises\n\nWhat baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)\nReplace all forward slashes in \"a/b/c/d/e\" with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We’ll discuss the problem very soon.)\nImplement a simple version of str_to_lower() using str_replace_all().\nCreate a regular expression that will match telephone numbers as commonly written in your country.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#pattern-details",
    "href": "regexps.html#pattern-details",
    "title": "16  Regular expressions",
    "section": "16.4 Pattern details",
    "text": "16.4 Pattern details\nNow that you understand the basics of the pattern language and how to use it with some stringr and tidyr functions, it’s time to dig into more of the details. First, we’ll start with escaping, which allows you to match metacharacters that would otherwise be treated specially. Next, you’ll learn about anchors which allow you to match the start or end of the string. Then, you’ll learn more about character classes and their shortcuts which allow you to match any character from a set. Next, you’ll learn the final details of quantifiers which control how many times a pattern can match. Then, we have to cover the important (but complex) topic of operator precedence and parentheses. And we’ll finish off with some details of grouping components of the pattern.\nThe terms we use here are the technical names for each component. They’re not always the most evocative of their purpose, but it’s very helpful to know the correct terms if you later want to Google for more details.\n\n16.4.1 Escaping\nIn order to match a literal ., you need an escape which tells the regular expression to match metacharacters6 literally. Like strings, regexps use the backslash for escaping. So, to match a ., you need the regexp \\.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \\ is also used as an escape symbol in strings. So to create the regular expression \\. we need the string \"\\\\.\", as the following example shows.\n\n# To create the regular expression \\., we need to use \\\\.\ndot &lt;- \"\\\\.\"\n\n# But the expression itself only contains one \\\nstr_view(dot)\n#&gt; [1] │ \\.\n\n# And this tells R to look for an explicit .\nstr_view(c(\"abc\", \"a.c\", \"bef\"), \"a\\\\.c\")\n#&gt; [2] │ &lt;a.c&gt;\n\nIn this book, we’ll usually write regular expression without quotes, like \\.. If we need to emphasize what you’ll actually type, we’ll surround it with quotes and add extra escapes, like \"\\\\.\".\nIf \\ is used as an escape character in regular expressions, how do you match a literal \\? Well, you need to escape it, creating the regular expression \\\\. To create that regular expression, you need to use a string, which also needs to escape \\. That means to match a literal \\ you need to write \"\\\\\\\\\" — you need four backslashes to match one!\n\nx &lt;- \"a\\\\b\"\nstr_view(x)\n#&gt; [1] │ a\\b\nstr_view(x, \"\\\\\\\\\")\n#&gt; [1] │ a&lt;\\&gt;b\n\nAlternatively, you might find it easier to use the raw strings you learned about in Section 15.2.2). That lets you avoid one layer of escaping:\n\nstr_view(x, r\"{\\\\}\")\n#&gt; [1] │ a&lt;\\&gt;b\n\nIf you’re trying to match a literal ., $, |, *, +, ?, {, }, (, ), there’s an alternative to using a backslash escape: you can use a character class: [.], [$], [|], … all match the literal values.\n\nstr_view(c(\"abc\", \"a.c\", \"a*c\", \"a c\"), \"a[.]c\")\n#&gt; [2] │ &lt;a.c&gt;\nstr_view(c(\"abc\", \"a.c\", \"a*c\", \"a c\"), \".[*]c\")\n#&gt; [3] │ &lt;a*c&gt;\n\n\n\n16.4.2 Anchors\nBy default, regular expressions will match any part of a string. If you want to match at the start or end you need to anchor the regular expression using ^ to match the start or $ to match the end:\n\nstr_view(fruit, \"^a\")\n#&gt; [1] │ &lt;a&gt;pple\n#&gt; [2] │ &lt;a&gt;pricot\n#&gt; [3] │ &lt;a&gt;vocado\nstr_view(fruit, \"a$\")\n#&gt;  [4] │ banan&lt;a&gt;\n#&gt; [15] │ cherimoy&lt;a&gt;\n#&gt; [30] │ feijo&lt;a&gt;\n#&gt; [36] │ guav&lt;a&gt;\n#&gt; [56] │ papay&lt;a&gt;\n#&gt; [74] │ satsum&lt;a&gt;\n\nIt’s tempting to think that $ should match the start of a string, because that’s how we write dollar amounts, but that’s not what regular expressions want.\nTo force a regular expression to match only the full string, anchor it with both ^ and $:\n\nstr_view(fruit, \"apple\")\n#&gt;  [1] │ &lt;apple&gt;\n#&gt; [62] │ pine&lt;apple&gt;\nstr_view(fruit, \"^apple$\")\n#&gt; [1] │ &lt;apple&gt;\n\nYou can also match the boundary between words (i.e. the start or end of a word) with \\b. This can be particularly useful when using RStudio’s find and replace tool. For example, if to find all uses of sum(), you can search for \\bsum\\b to avoid matching summarize, summary, rowsum and so on:\n\nx &lt;- c(\"summary(x)\", \"summarize(df)\", \"rowsum(x)\", \"sum(x)\")\nstr_view(x, \"sum\")\n#&gt; [1] │ &lt;sum&gt;mary(x)\n#&gt; [2] │ &lt;sum&gt;marize(df)\n#&gt; [3] │ row&lt;sum&gt;(x)\n#&gt; [4] │ &lt;sum&gt;(x)\nstr_view(x, \"\\\\bsum\\\\b\")\n#&gt; [4] │ &lt;sum&gt;(x)\n\nWhen used alone, anchors will produce a zero-width match:\n\nstr_view(\"abc\", c(\"$\", \"^\", \"\\\\b\"))\n#&gt; [1] │ abc&lt;&gt;\n#&gt; [2] │ &lt;&gt;abc\n#&gt; [3] │ &lt;&gt;abc&lt;&gt;\n\nThis helps you understand what happens when you replace a standalone anchor:\n\nstr_replace_all(\"abc\", c(\"$\", \"^\", \"\\\\b\"), \"--\")\n#&gt; [1] \"abc--\"   \"--abc\"   \"--abc--\"\n\n\n\n16.4.3 Character classes\nA character class, or character set, allows you to match any character in a set. As we discussed above, you can construct your own sets with [], where [abc] matches “a”, “b”, or “c” and [^abc] matches any character except “a”, “b”, or “c”. Apart from ^ there are two other characters that have special meaning inside of []:\n\n- defines a range, e.g., [a-z] matches any lower case letter and [0-9] matches any number.\n\\ escapes special characters, so [\\^\\-\\]] matches ^, -, or ].\n\nHere are few examples:\n\nx &lt;- \"abcd ABCD 12345 -!@#%.\"\nstr_view(x, \"[abc]+\")\n#&gt; [1] │ &lt;abc&gt;d ABCD 12345 -!@#%.\nstr_view(x, \"[a-z]+\")\n#&gt; [1] │ &lt;abcd&gt; ABCD 12345 -!@#%.\nstr_view(x, \"[^a-z0-9]+\")\n#&gt; [1] │ abcd&lt; ABCD &gt;12345&lt; -!@#%.&gt;\n\n# You need an escape to match characters that are otherwise\n# special inside of []\nstr_view(\"a-b-c\", \"[a-c]\")\n#&gt; [1] │ &lt;a&gt;-&lt;b&gt;-&lt;c&gt;\nstr_view(\"a-b-c\", \"[a\\\\-c]\")\n#&gt; [1] │ &lt;a&gt;&lt;-&gt;b&lt;-&gt;&lt;c&gt;\n\nSome character classes are used so commonly that they get their own shortcut. You’ve already seen ., which matches any character apart from a newline. There are three other particularly useful pairs7:\n\n\\d matches any digit;\n\\D matches anything that isn’t a digit.\n\\s matches any whitespace (e.g., space, tab, newline);\n\\S matches anything that isn’t whitespace.\n\\w matches any “word” character, i.e. letters and numbers;\n\\W matches any “non-word” character.\n\nThe following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.\n\nx &lt;- \"abcd ABCD 12345 -!@#%.\"\nstr_view(x, \"\\\\d+\")\n#&gt; [1] │ abcd ABCD &lt;12345&gt; -!@#%.\nstr_view(x, \"\\\\D+\")\n#&gt; [1] │ &lt;abcd ABCD &gt;12345&lt; -!@#%.&gt;\nstr_view(x, \"\\\\s+\")\n#&gt; [1] │ abcd&lt; &gt;ABCD&lt; &gt;12345&lt; &gt;-!@#%.\nstr_view(x, \"\\\\S+\")\n#&gt; [1] │ &lt;abcd&gt; &lt;ABCD&gt; &lt;12345&gt; &lt;-!@#%.&gt;\nstr_view(x, \"\\\\w+\")\n#&gt; [1] │ &lt;abcd&gt; &lt;ABCD&gt; &lt;12345&gt; -!@#%.\nstr_view(x, \"\\\\W+\")\n#&gt; [1] │ abcd&lt; &gt;ABCD&lt; &gt;12345&lt; -!@#%.&gt;\n\n\n\n16.4.4 Quantifiers\nQuantifiers control how many times a pattern matches. In Section 16.2 you learned about ? (0 or 1 matches), + (1 or more matches), and * (0 or more matches). For example, colou?r will match American or British spelling, \\d+ will match one or more digits, and \\s? will optionally match a single item of whitespace. You can also specify the number of matches precisely with {}:\n\n{n} matches exactly n times.\n{n,} matches at least n times.\n{n,m} matches between n and m times.\n\n\n\n16.4.5 Operator precedence and parentheses\nWhat does ab+ match? Does it match “a” followed by one or more “b”s, or does it match “ab” repeated any number of times? What does ^a|b$ match? Does it match the complete string a or the complete string b, or does it match a string starting with a or a string ending with b?\nThe answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school. You know that a + b * c is equivalent to a + (b * c) not (a + b) * c because * has higher precedence and + has lower precedence: you compute * before +.\nSimilarly, regular expressions have their own precedence rules: quantifiers have high precedence and alternation has low precedence which means that ab+ is equivalent to a(b+), and ^a|b$ is equivalent to (^a)|(b$). Just like with algebra, you can use parentheses to override the usual order. But unlike algebra you’re unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.\n\n\n16.4.6 Grouping and capturing\nAs well as overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.\nThe first way to use a capturing group is to refer back to it within a match with back reference: \\1 refers to the match contained in the first parenthesis, \\2 in the second parenthesis, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:\n\nstr_view(fruit, \"(..)\\\\1\")\n#&gt;  [4] │ b&lt;anan&gt;a\n#&gt; [20] │ &lt;coco&gt;nut\n#&gt; [22] │ &lt;cucu&gt;mber\n#&gt; [41] │ &lt;juju&gt;be\n#&gt; [56] │ &lt;papa&gt;ya\n#&gt; [73] │ s&lt;alal&gt; berry\n\nAnd this one finds all words that start and end with the same pair of letters:\n\nstr_view(words, \"^(..).*\\\\1$\")\n#&gt; [152] │ &lt;church&gt;\n#&gt; [217] │ &lt;decide&gt;\n#&gt; [617] │ &lt;photograph&gt;\n#&gt; [699] │ &lt;require&gt;\n#&gt; [739] │ &lt;sense&gt;\n\nYou can also use back references in str_replace(). For example, this code switches the order of the second and third words in sentences:\n\nsentences |&gt; \n  str_replace(\"(\\\\w+) (\\\\w+) (\\\\w+)\", \"\\\\1 \\\\3 \\\\2\") |&gt; \n  str_view()\n#&gt; [1] │ The canoe birch slid on the smooth planks.\n#&gt; [2] │ Glue sheet the to the dark blue background.\n#&gt; [3] │ It's to easy tell the depth of a well.\n#&gt; [4] │ These a days chicken leg is a rare dish.\n#&gt; [5] │ Rice often is served in round bowls.\n#&gt; [6] │ The of juice lemons makes fine punch.\n#&gt; ... and 714 more\n\nIf you want to extract the matches for each group you can use str_match(). But str_match() returns a matrix, so it’s not particularly easy to work with8:\n\nsentences |&gt; \n  str_match(\"the (\\\\w+) (\\\\w+)\") |&gt; \n  head()\n#&gt;      [,1]                [,2]     [,3]    \n#&gt; [1,] \"the smooth planks\" \"smooth\" \"planks\"\n#&gt; [2,] \"the sheet to\"      \"sheet\"  \"to\"    \n#&gt; [3,] \"the depth of\"      \"depth\"  \"of\"    \n#&gt; [4,] NA                  NA       NA      \n#&gt; [5,] NA                  NA       NA      \n#&gt; [6,] NA                  NA       NA\n\nYou could convert to a tibble and name the columns:\n\nsentences |&gt; \n  str_match(\"the (\\\\w+) (\\\\w+)\") |&gt; \n  as_tibble(.name_repair = \"minimal\") |&gt; \n  set_names(\"match\", \"word1\", \"word2\")\n#&gt; # A tibble: 720 × 3\n#&gt;   match             word1  word2 \n#&gt;   &lt;chr&gt;             &lt;chr&gt;  &lt;chr&gt; \n#&gt; 1 the smooth planks smooth planks\n#&gt; 2 the sheet to      sheet  to    \n#&gt; 3 the depth of      depth  of    \n#&gt; 4 &lt;NA&gt;              &lt;NA&gt;   &lt;NA&gt;  \n#&gt; 5 &lt;NA&gt;              &lt;NA&gt;   &lt;NA&gt;  \n#&gt; 6 &lt;NA&gt;              &lt;NA&gt;   &lt;NA&gt;  \n#&gt; # ℹ 714 more rows\n\nBut then you’ve basically recreated your own version of separate_wider_regex(). Indeed, behind the scenes, separate_wider_regex() converts your vector of patterns to a single regex that uses grouping to capture the named components.\nOccasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with (?:).\n\nx &lt;- c(\"a gray cat\", \"a grey dog\")\nstr_match(x, \"gr(e|a)y\")\n#&gt;      [,1]   [,2]\n#&gt; [1,] \"gray\" \"a\" \n#&gt; [2,] \"grey\" \"e\"\nstr_match(x, \"gr(?:e|a)y\")\n#&gt;      [,1]  \n#&gt; [1,] \"gray\"\n#&gt; [2,] \"grey\"\n\n\n\n16.4.7 Exercises\n\nHow would you match the literal string \"'\\? How about \"$^$\"?\nExplain why each of these patterns don’t match a \\: \"\\\", \"\\\\\", \"\\\\\\\".\nGiven the corpus of common words in stringr::words, create regular expressions that find all words that:\n\nStart with “y”.\nDon’t start with “y”.\nEnd with “x”.\nAre exactly three letters long. (Don’t cheat by using str_length()!)\nHave seven letters or more.\nContain a vowel-consonant pair.\nContain at least two vowel-consonant pairs in a row.\nOnly consist of repeated vowel-consonant pairs.\n\nCreate 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try and make the shortest possible regex!\nSwitch the first and last letters in words. Which of those strings are still words?\nDescribe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)\n\n^.*$\n\"\\\\{.+\\\\}\"\n\\d{4}-\\d{2}-\\d{2}\n\"\\\\\\\\{4}\"\n\\..\\..\\..\n(.)\\1\\1\n\"(..)\\\\1\"\n\nSolve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#pattern-control",
    "href": "regexps.html#pattern-control",
    "title": "16  Regular expressions",
    "section": "16.5 Pattern control",
    "text": "16.5 Pattern control\nIt’s possible to exercise extra control over the details of the match by using a pattern object instead of just a string. This allows you to control the so called regex flags and match various types of fixed strings, as described below.\n\n16.5.1 Regex flags\nThere are a number of settings that can be used to control the details of the regexp. These settings are often called flags in other programming languages. In stringr, you can use these by wrapping the pattern in a call to regex(). The most useful flag is probably ignore_case = TRUE because it allows characters to match either their uppercase or lowercase forms:\n\nbananas &lt;- c(\"banana\", \"Banana\", \"BANANA\")\nstr_view(bananas, \"banana\")\n#&gt; [1] │ &lt;banana&gt;\nstr_view(bananas, regex(\"banana\", ignore_case = TRUE))\n#&gt; [1] │ &lt;banana&gt;\n#&gt; [2] │ &lt;Banana&gt;\n#&gt; [3] │ &lt;BANANA&gt;\n\nIf you’re doing a lot of work with multiline strings (i.e. strings that contain \\n), dotalland multiline may also be useful:\n\ndotall = TRUE lets . match everything, including \\n:\n\nx &lt;- \"Line 1\\nLine 2\\nLine 3\"\nstr_view(x, \".Line\")\nstr_view(x, regex(\".Line\", dotall = TRUE))\n#&gt; [1] │ Line 1&lt;\n#&gt;     │ Line&gt; 2&lt;\n#&gt;     │ Line&gt; 3\n\nmultiline = TRUE makes ^ and $ match the start and end of each line rather than the start and end of the complete string:\n\nx &lt;- \"Line 1\\nLine 2\\nLine 3\"\nstr_view(x, \"^Line\")\n#&gt; [1] │ &lt;Line&gt; 1\n#&gt;     │ Line 2\n#&gt;     │ Line 3\nstr_view(x, regex(\"^Line\", multiline = TRUE))\n#&gt; [1] │ &lt;Line&gt; 1\n#&gt;     │ &lt;Line&gt; 2\n#&gt;     │ &lt;Line&gt; 3\n\n\nFinally, if you’re writing a complicated regular expression and you’re worried you might not understand it in the future, you might try comments = TRUE. It tweaks the pattern language to ignore spaces and new lines, as well as everything after #. This allows you to use comments and whitespace to make complex regular expressions more understandable9, as in the following example:\n\nphone &lt;- regex(\n  r\"(\n    \\(?     # optional opening parens\n    (\\d{3}) # area code\n    [)\\-]?  # optional closing parens or dash\n    \\ ?     # optional space\n    (\\d{3}) # another three numbers\n    [\\ -]?  # optional space or dash\n    (\\d{4}) # four more numbers\n  )\", \n  comments = TRUE\n)\n\nstr_extract(c(\"514-791-8141\", \"(123) 456 7890\", \"123456\"), phone)\n#&gt; [1] \"514-791-8141\"   \"(123) 456 7890\" NA\n\nIf you’re using comments and want to match a space, newline, or #, you’ll need to escape it with \\.\n\n\n16.5.2 Fixed matches\nYou can opt-out of the regular expression rules by using fixed():\n\nstr_view(c(\"\", \"a\", \".\"), fixed(\".\"))\n#&gt; [3] │ &lt;.&gt;\n\nfixed() also gives you the ability to ignore case:\n\nstr_view(\"x X\", \"X\")\n#&gt; [1] │ x &lt;X&gt;\nstr_view(\"x X\", fixed(\"X\", ignore_case = TRUE))\n#&gt; [1] │ &lt;x&gt; &lt;X&gt;\n\nIf you’re working with non-English text, you will probably want coll() instead of fixed(), as it implements the full rules for capitalization as used by the locale you specify. See Section 15.6 for more details on locales.\n\nstr_view(\"i İ ı I\", fixed(\"İ\", ignore_case = TRUE))\n#&gt; [1] │ i &lt;İ&gt; ı I\nstr_view(\"i İ ı I\", coll(\"İ\", ignore_case = TRUE, locale = \"tr\"))\n#&gt; [1] │ &lt;i&gt; &lt;İ&gt; ı I",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#practice",
    "href": "regexps.html#practice",
    "title": "16  Regular expressions",
    "section": "16.6 Practice",
    "text": "16.6 Practice\nTo put these ideas into practice we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:\n\nchecking your work by creating simple positive and negative controls\ncombining regular expressions with Boolean algebra\ncreating complex patterns using string manipulation\n\n\n16.6.1 Check your work\nFirst, let’s find all sentences that start with “The”. Using the ^ anchor alone is not enough:\n\nstr_view(sentences, \"^The\")\n#&gt;  [1] │ &lt;The&gt; birch canoe slid on the smooth planks.\n#&gt;  [4] │ &lt;The&gt;se days a chicken leg is a rare dish.\n#&gt;  [6] │ &lt;The&gt; juice of lemons makes fine punch.\n#&gt;  [7] │ &lt;The&gt; box was thrown beside the parked truck.\n#&gt;  [8] │ &lt;The&gt; hogs were fed chopped corn and garbage.\n#&gt; [11] │ &lt;The&gt; boy was there when the sun rose.\n#&gt; ... and 271 more\n\nBecause that pattern also matches sentences starting with words like They or These. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:\n\nstr_view(sentences, \"^The\\\\b\")\n#&gt;  [1] │ &lt;The&gt; birch canoe slid on the smooth planks.\n#&gt;  [6] │ &lt;The&gt; juice of lemons makes fine punch.\n#&gt;  [7] │ &lt;The&gt; box was thrown beside the parked truck.\n#&gt;  [8] │ &lt;The&gt; hogs were fed chopped corn and garbage.\n#&gt; [11] │ &lt;The&gt; boy was there when the sun rose.\n#&gt; [13] │ &lt;The&gt; source of the huge river is the clear spring.\n#&gt; ... and 250 more\n\nWhat about finding all sentences that begin with a pronoun?\n\nstr_view(sentences, \"^She|He|It|They\\\\b\")\n#&gt;  [3] │ &lt;It&gt;'s easy to tell the depth of a well.\n#&gt; [15] │ &lt;He&gt;lp the woman get back to her feet.\n#&gt; [27] │ &lt;He&gt;r purse was full of useless trash.\n#&gt; [29] │ &lt;It&gt; snowed, rained, and hailed the same morning.\n#&gt; [63] │ &lt;He&gt; ran half way to the hardware store.\n#&gt; [90] │ &lt;He&gt; lay prone and hardly moved a limb.\n#&gt; ... and 57 more\n\nA quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:\n\nstr_view(sentences, \"^(She|He|It|They)\\\\b\")\n#&gt;   [3] │ &lt;It&gt;'s easy to tell the depth of a well.\n#&gt;  [29] │ &lt;It&gt; snowed, rained, and hailed the same morning.\n#&gt;  [63] │ &lt;He&gt; ran half way to the hardware store.\n#&gt;  [90] │ &lt;He&gt; lay prone and hardly moved a limb.\n#&gt; [116] │ &lt;He&gt; ordered peach pie with ice cream.\n#&gt; [127] │ &lt;It&gt; caught its hind paw in a rusty trap.\n#&gt; ... and 51 more\n\nYou might wonder how you might spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:\n\npos &lt;- c(\"He is a boy\", \"She had a good time\")\nneg &lt;- c(\"Shells come from the sea\", \"Hadley said 'It's a great day'\")\n\npattern &lt;- \"^(She|He|It|They)\\\\b\"\nstr_detect(pos, pattern)\n#&gt; [1] TRUE TRUE\nstr_detect(neg, pattern)\n#&gt; [1] FALSE FALSE\n\nIt’s typically much easier to come up with good positive examples than negative examples, because it takes a while before you’re good enough with regular expressions to predict where your weaknesses are. Nevertheless, they’re still useful: as you work on the problem you can slowly accumulate a collection of your mistakes, ensuring that you never make the same mistake twice.\n\n\n16.6.2 Boolean operations\nImagine we want to find words that only contain consonants. One technique is to create a character class that contains all letters except for the vowels ([^aeiou]), then allow that to match any number of letters ([^aeiou]+), then force it to match the whole string by anchoring to the beginning and the end (^[^aeiou]+$):\n\nstr_view(words, \"^[^aeiou]+$\")\n#&gt; [123] │ &lt;by&gt;\n#&gt; [249] │ &lt;dry&gt;\n#&gt; [328] │ &lt;fly&gt;\n#&gt; [538] │ &lt;mrs&gt;\n#&gt; [895] │ &lt;try&gt;\n#&gt; [952] │ &lt;why&gt;\n\nBut you can make this problem a bit easier by flipping the problem around. Instead of looking for words that contain only consonants, we could look for words that don’t contain any vowels:\n\nstr_view(words[!str_detect(words, \"[aeiou]\")])\n#&gt; [1] │ by\n#&gt; [2] │ dry\n#&gt; [3] │ fly\n#&gt; [4] │ mrs\n#&gt; [5] │ try\n#&gt; [6] │ why\n\nThis is a useful technique whenever you’re dealing with logical combinations, particularly those involving “and” or “not”. For example, imagine if you want to find all words that contain “a” and “b”. There’s no “and” operator built in to regular expressions so we have to tackle it by looking for all words that contain an “a” followed by a “b”, or a “b” followed by an “a”:\n\nstr_view(words, \"a.*b|b.*a\")\n#&gt;  [2] │ &lt;ab&gt;le\n#&gt;  [3] │ &lt;ab&gt;out\n#&gt;  [4] │ &lt;ab&gt;solute\n#&gt; [62] │ &lt;availab&gt;le\n#&gt; [66] │ &lt;ba&gt;by\n#&gt; [67] │ &lt;ba&gt;ck\n#&gt; ... and 24 more\n\nIt’s simpler to combine the results of two calls to str_detect():\n\nwords[str_detect(words, \"a\") & str_detect(words, \"b\")]\n#&gt;  [1] \"able\"      \"about\"     \"absolute\"  \"available\" \"baby\"      \"back\"     \n#&gt;  [7] \"bad\"       \"bag\"       \"balance\"   \"ball\"      \"bank\"      \"bar\"      \n#&gt; [13] \"base\"      \"basis\"     \"bear\"      \"beat\"      \"beauty\"    \"because\"  \n#&gt; [19] \"black\"     \"board\"     \"boat\"      \"break\"     \"brilliant\" \"britain\"  \n#&gt; [25] \"debate\"    \"husband\"   \"labour\"    \"maybe\"     \"probable\"  \"table\"\n\nWhat if we wanted to see if there was a word that contains all vowels? If we did it with patterns we’d need to generate 5! (120) different patterns:\n\nwords[str_detect(words, \"a.*e.*i.*o.*u\")]\n# ...\nwords[str_detect(words, \"u.*o.*i.*e.*a\")]\n\nIt’s much simpler to combine five calls to str_detect():\n\nwords[\n  str_detect(words, \"a\") &\n  str_detect(words, \"e\") &\n  str_detect(words, \"i\") &\n  str_detect(words, \"o\") &\n  str_detect(words, \"u\")\n]\n#&gt; character(0)\n\nIn general, if you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.\n\n\n16.6.3 Creating a pattern with code\nWhat if we wanted to find all sentences that mention a color? The basic idea is simple: we just combine alternation with word boundaries.\n\nstr_view(sentences, \"\\\\b(red|green|blue)\\\\b\")\n#&gt;   [2] │ Glue the sheet to the dark &lt;blue&gt; background.\n#&gt;  [26] │ Two &lt;blue&gt; fish swam in the tank.\n#&gt;  [92] │ A wisp of cloud hung in the &lt;blue&gt; air.\n#&gt; [148] │ The spot on the blotter was made by &lt;green&gt; ink.\n#&gt; [160] │ The sofa cushion is &lt;red&gt; and of light weight.\n#&gt; [174] │ The sky that morning was clear and bright &lt;blue&gt;.\n#&gt; ... and 20 more\n\nBut as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?\n\nrgb &lt;- c(\"red\", \"green\", \"blue\")\n\nWell, we can! We’d just need to create the pattern from the vector using str_c() and str_flatten():\n\nstr_c(\"\\\\b(\", str_flatten(rgb, \"|\"), \")\\\\b\")\n#&gt; [1] \"\\\\b(red|green|blue)\\\\b\"\n\nWe could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:\n\nstr_view(colors())\n#&gt; [1] │ white\n#&gt; [2] │ aliceblue\n#&gt; [3] │ antiquewhite\n#&gt; [4] │ antiquewhite1\n#&gt; [5] │ antiquewhite2\n#&gt; [6] │ antiquewhite3\n#&gt; ... and 651 more\n\nBut lets first eliminate the numbered variants:\n\ncols &lt;- colors()\ncols &lt;- cols[!str_detect(cols, \"\\\\d\")]\nstr_view(cols)\n#&gt; [1] │ white\n#&gt; [2] │ aliceblue\n#&gt; [3] │ antiquewhite\n#&gt; [4] │ aquamarine\n#&gt; [5] │ azure\n#&gt; [6] │ beige\n#&gt; ... and 137 more\n\nThen we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:\n\npattern &lt;- str_c(\"\\\\b(\", str_flatten(cols, \"|\"), \")\\\\b\")\nstr_view(sentences, pattern)\n#&gt;   [2] │ Glue the sheet to the dark &lt;blue&gt; background.\n#&gt;  [12] │ A rod is used to catch &lt;pink&gt; &lt;salmon&gt;.\n#&gt;  [26] │ Two &lt;blue&gt; fish swam in the tank.\n#&gt;  [66] │ Cars and busses stalled in &lt;snow&gt; drifts.\n#&gt;  [92] │ A wisp of cloud hung in the &lt;blue&gt; air.\n#&gt; [112] │ Leaves turn &lt;brown&gt; and &lt;yellow&gt; in the fall.\n#&gt; ... and 57 more\n\nIn this example, cols only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through str_escape() to ensure they match literally.\n\n\n16.6.4 Exercises\n\nFor each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.\n\nFind all words that start or end with x.\nFind all words that start with a vowel and end with a consonant.\nAre there any words that contain at least one of each different vowel?\n\nConstruct patterns to find evidence for and against the rule “i before e except after c”?\ncolors() contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified).\nCreate a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the data() function: data(package = \"datasets\")$results[, \"Item\"]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#regular-expressions-in-other-places",
    "href": "regexps.html#regular-expressions-in-other-places",
    "title": "16  Regular expressions",
    "section": "16.7 Regular expressions in other places",
    "text": "16.7 Regular expressions in other places\nJust like in the stringr and tidyr functions, there are many other places in R where you can use regular expressions. The following sections describe some other useful functions in the wider tidyverse and base R.\n\n16.7.1 tidyverse\nThere are three other particularly useful places where you might want to use a regular expressions\n\nmatches(pattern) will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use anywhere in any tidyverse function that selects variables (e.g., select(), rename_with() and across()).\npivot_longer()'s names_pattern argument takes a vector of regular expressions, just like separate_wider_regex(). It’s useful when extracting data out of variable names with a complex structure\nThe delim argument in separate_longer_delim() and separate_wider_delim() usually matches a fixed string, but you can use regex() to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. regex(\", ?\").\n\n\n\n16.7.2 Base R\napropos(pattern) searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:\n\napropos(\"replace\")\n#&gt; [1] \"%+replace%\"       \"replace\"          \"replace_na\"      \n#&gt; [4] \"setReplaceMethod\" \"str_replace\"      \"str_replace_all\" \n#&gt; [7] \"str_replace_na\"   \"theme_replace\"\n\nlist.files(path, pattern) lists all files in path that match a regular expression pattern. For example, you can find all the R Markdown files in the current directory with:\n\nhead(list.files(pattern = \"\\\\.Rmd$\"))\n#&gt; character(0)\n\nIt’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the stringi package, which is in turn built on top of the ICU engine, whereas base R functions use either the TRE engine or the PCRE engine, depending on whether or not you’ve set perl = TRUE. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the (?…) syntax.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#summary",
    "href": "regexps.html#summary",
    "title": "16  Regular expressions",
    "section": "16.8 Summary",
    "text": "16.8 Summary\nWith every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.\nIn this chapter, you’ve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.\nA good place to start is vignette(\"regular-expressions\", package = \"stringr\"): it documents the full set of syntax supported by stringr. Another useful reference is https://www.regular-expressions.info/. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.\nIt’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the the same conventions as stringr.\nIn the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "regexps.html#footnotes",
    "href": "regexps.html#footnotes",
    "title": "16  Regular expressions",
    "section": "",
    "text": "You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).↩︎\nYou’ll learn how to escape these special meanings in Section 16.4.1.↩︎\nWell, any character apart from \\n.↩︎\nThis gives us the proportion of names that contain an “x”; if you wanted the proportion of babies with a name containing an x, you’d need to perform a weighted mean.↩︎\nWe wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!↩︎\nThe complete set of metacharacters is .^$\\|*+?{}[]()↩︎\nRemember, to create a regular expression containing \\d or \\s, you’ll need to escape the \\ for the string, so you’ll type \"\\\\d\" or \"\\\\s\".↩︎\nMostly because we never discuss matrices in this book!↩︎\ncomments = TRUE is particularly effective in combination with a raw string, as we use here.↩︎",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>16</span>  <span class='chapter-title'>Regular expressions</span>"
    ]
  },
  {
    "objectID": "factors.html",
    "href": "factors.html",
    "title": "17  Factors",
    "section": "",
    "text": "17.1 Introduction\nFactors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.\nWe’ll start by motivating why factors are needed for data analysis1 and how you can create them with factor(). We’ll then introduce you to the gss_cat dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "factors.html#introduction",
    "href": "factors.html#introduction",
    "title": "17  Factors",
    "section": "",
    "text": "17.1.1 Prerequisites\nBase R provides some basic tools for creating and manipulating factors. We’ll supplement these with the forcats package, which is part of the core tidyverse. It provides tools for dealing with categorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.\n\nlibrary(tidyverse)",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "factors.html#factor-basics",
    "href": "factors.html#factor-basics",
    "title": "17  Factors",
    "section": "17.2 Factor basics",
    "text": "17.2 Factor basics\nImagine that you have a variable that records month:\n\nx1 &lt;- c(\"Dec\", \"Apr\", \"Jan\", \"Mar\")\n\nUsing a string to record this variable has two problems:\n\nThere are only twelve possible months, and there’s nothing saving you from typos:\n\nx2 &lt;- c(\"Dec\", \"Apr\", \"Jam\", \"Mar\")\n\nIt doesn’t sort in a useful way:\n\nsort(x1)\n#&gt; [1] \"Apr\" \"Dec\" \"Jan\" \"Mar\"\n\n\nYou can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid levels:\n\nmonth_levels &lt;- c(\n  \"Jan\", \"Feb\", \"Mar\", \"Apr\", \"May\", \"Jun\",\n  \"Jul\", \"Aug\", \"Sep\", \"Oct\", \"Nov\", \"Dec\"\n)\n\nNow you can create a factor:\n\ny1 &lt;- factor(x1, levels = month_levels)\ny1\n#&gt; [1] Dec Apr Jan Mar\n#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec\n\nsort(y1)\n#&gt; [1] Jan Mar Apr Dec\n#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec\n\nAnd any values not in the level will be silently converted to NA:\n\ny2 &lt;- factor(x2, levels = month_levels)\ny2\n#&gt; [1] Dec  Apr  &lt;NA&gt; Mar \n#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec\n\nThis seems risky, so you might want to use forcats::fct() instead:\n\ny2 &lt;- fct(x2, levels = month_levels)\n#&gt; Error in `fct()`:\n#&gt; ! All values of `x` must appear in `levels` or `na`\n#&gt; ℹ Missing level: \"Jam\"\n\nIf you omit the levels, they’ll be taken from the data in alphabetical order:\n\nfactor(x1)\n#&gt; [1] Dec Apr Jan Mar\n#&gt; Levels: Apr Dec Jan Mar\n\nSorting alphabetically is slightly risky because not every computer will sort strings in the same way. So forcats::fct() orders by first appearance:\n\nfct(x1)\n#&gt; [1] Dec Apr Jan Mar\n#&gt; Levels: Dec Apr Jan Mar\n\nIf you ever need to access the set of valid levels directly, you can do so with levels():\n\nlevels(y2)\n#&gt;  [1] \"Jan\" \"Feb\" \"Mar\" \"Apr\" \"May\" \"Jun\" \"Jul\" \"Aug\" \"Sep\" \"Oct\" \"Nov\" \"Dec\"\n\nYou can also create a factor when reading your data with readr with col_factor():\n\ncsv &lt;- \"\nmonth,value\nJan,12\nFeb,56\nMar,12\"\n\ndf &lt;- read_csv(csv, col_types = cols(month = col_factor(month_levels)))\ndf$month\n#&gt; [1] Jan Feb Mar\n#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "factors.html#general-social-survey",
    "href": "factors.html#general-social-survey",
    "title": "17  Factors",
    "section": "17.3 General Social Survey",
    "text": "17.3 General Social Survey\nFor the rest of this chapter, we’re going to use forcats::gss_cat. It’s a sample of data from the General Social Survey, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in gss_cat Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.\n\ngss_cat\n#&gt; # A tibble: 21,483 × 9\n#&gt;    year marital         age race  rincome        partyid           \n#&gt;   &lt;int&gt; &lt;fct&gt;         &lt;int&gt; &lt;fct&gt; &lt;fct&gt;          &lt;fct&gt;             \n#&gt; 1  2000 Never married    26 White $8000 to 9999  Ind,near rep      \n#&gt; 2  2000 Divorced         48 White $8000 to 9999  Not str republican\n#&gt; 3  2000 Widowed          67 White Not applicable Independent       \n#&gt; 4  2000 Never married    39 White Not applicable Ind,near rep      \n#&gt; 5  2000 Divorced         25 White Not applicable Not str democrat  \n#&gt; 6  2000 Married          25 White $20000 - 24999 Strong democrat   \n#&gt; # ℹ 21,477 more rows\n#&gt; # ℹ 3 more variables: relig &lt;fct&gt;, denom &lt;fct&gt;, tvhours &lt;int&gt;\n\n(Remember, since this dataset is provided by a package, you can get more information about the variables with ?gss_cat.)\nWhen factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with count():\n\ngss_cat |&gt;\n  count(race)\n#&gt; # A tibble: 3 × 2\n#&gt;   race      n\n#&gt;   &lt;fct&gt; &lt;int&gt;\n#&gt; 1 Other  1959\n#&gt; 2 Black  3129\n#&gt; 3 White 16395\n\nWhen working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.\n\n17.3.1 Exercises\n\nExplore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?\nWhat is the most common relig in this survey? What’s the most common partyid?\nWhich relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualization?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "factors.html#sec-modifying-factor-order",
    "href": "factors.html#sec-modifying-factor-order",
    "title": "17  Factors",
    "section": "17.4 Modifying factor order",
    "text": "17.4 Modifying factor order\nIt’s often useful to change the order of the factor levels in a visualization. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:\n\nrelig_summary &lt;- gss_cat |&gt;\n  group_by(relig) |&gt;\n  summarize(\n    tvhours = mean(tvhours, na.rm = TRUE),\n    n = n()\n  )\n\nggplot(relig_summary, aes(x = tvhours, y = relig)) +\n  geom_point()\n\n\n\n\n\n\n\n\nIt is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:\n\n.f, the factor whose levels you want to modify.\n.x, a numeric vector that you want to use to reorder the levels.\nOptionally, .fun, a function that’s used if there are multiple values of .x for each value of .f. The default value is median.\n\n\nggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +\n  geom_point()\n\n\n\n\n\n\n\n\nReordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.\nAs you start making more complicated transformations, we recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:\n\nrelig_summary |&gt;\n  mutate(\n    relig = fct_reorder(relig, tvhours)\n  ) |&gt;\n  ggplot(aes(x = tvhours, y = relig)) +\n  geom_point()\n\nWhat if we create a similar plot looking at how average age varies across reported income level?\n\nrincome_summary &lt;- gss_cat |&gt;\n  group_by(rincome) |&gt;\n  summarize(\n    age = mean(age, na.rm = TRUE),\n    n = n()\n  )\n\nggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +\n  geom_point()\n\n\n\n\n\n\n\n\nHere, arbitrarily reordering the levels isn’t a good idea! That’s because rincome already has a principled order that we shouldn’t mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.\nHowever, it does make sense to pull “Not applicable” to the front with the other special levels. You can use fct_relevel(). It takes a factor, .f, and then any number of levels that you want to move to the front of the line.\n\nggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, \"Not applicable\"))) +\n  geom_point()\n\n\n\n\n\n\n\n\nWhy do you think the average age for “Not applicable” is so high?\nAnother type of reordering is useful when you are coloring the lines on a plot. fct_reorder2(.f, .x, .y) reorders the factor .f by the .y values associated with the largest .x values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.\nby_age &lt;- gss_cat |&gt;\n  filter(!is.na(age)) |&gt;\n  count(age, marital) |&gt;\n  group_by(age) |&gt;\n  mutate(\n    prop = n / sum(n)\n  )\n\nggplot(by_age, aes(x = age, y = prop, color = marital)) +\n  geom_line(linewidth = 1) +\n  scale_color_brewer(palette = \"Set1\")\n\nggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +\n  geom_line(linewidth = 1) +\n  scale_color_brewer(palette = \"Set1\") +\n  labs(color = \"marital\")\n\n\n\n\n\n\n\n\n\n\nFinally, for bar plots, you can use fct_infreq() to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with fct_rev() if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.\n\ngss_cat |&gt;\n  mutate(marital = marital |&gt; fct_infreq() |&gt; fct_rev()) |&gt;\n  ggplot(aes(x = marital)) +\n  geom_bar()\n\n\n\n\n\n\n\n\n\n17.4.1 Exercises\n\nThere are some suspiciously high numbers in tvhours. Is the mean a good summary?\nFor each factor in gss_cat identify whether the order of the levels is arbitrary or principled.\nWhy did moving “Not applicable” to the front of the levels move it to the bottom of the plot?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "factors.html#modifying-factor-levels",
    "href": "factors.html#modifying-factor-levels",
    "title": "17  Factors",
    "section": "17.5 Modifying factor levels",
    "text": "17.5 Modifying factor levels\nMore powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level. For example, take the partyid variable from the gss_cat data frame:\n\ngss_cat |&gt; count(partyid)\n#&gt; # A tibble: 10 × 2\n#&gt;   partyid                n\n#&gt;   &lt;fct&gt;              &lt;int&gt;\n#&gt; 1 No answer            154\n#&gt; 2 Don't know             1\n#&gt; 3 Other party          393\n#&gt; 4 Strong republican   2314\n#&gt; 5 Not str republican  3032\n#&gt; 6 Ind,near rep        1791\n#&gt; # ℹ 4 more rows\n\nThe levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction. Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:\n\ngss_cat |&gt;\n  mutate(\n    partyid = fct_recode(partyid,\n      \"Republican, strong\"    = \"Strong republican\",\n      \"Republican, weak\"      = \"Not str republican\",\n      \"Independent, near rep\" = \"Ind,near rep\",\n      \"Independent, near dem\" = \"Ind,near dem\",\n      \"Democrat, weak\"        = \"Not str democrat\",\n      \"Democrat, strong\"      = \"Strong democrat\"\n    )\n  ) |&gt;\n  count(partyid)\n#&gt; # A tibble: 10 × 2\n#&gt;   partyid                   n\n#&gt;   &lt;fct&gt;                 &lt;int&gt;\n#&gt; 1 No answer               154\n#&gt; 2 Don't know                1\n#&gt; 3 Other party             393\n#&gt; 4 Republican, strong     2314\n#&gt; 5 Republican, weak       3032\n#&gt; 6 Independent, near rep  1791\n#&gt; # ℹ 4 more rows\n\nfct_recode() will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.\nTo combine groups, you can assign multiple old levels to the same new level:\n\ngss_cat |&gt;\n  mutate(\n    partyid = fct_recode(partyid,\n      \"Republican, strong\"    = \"Strong republican\",\n      \"Republican, weak\"      = \"Not str republican\",\n      \"Independent, near rep\" = \"Ind,near rep\",\n      \"Independent, near dem\" = \"Ind,near dem\",\n      \"Democrat, weak\"        = \"Not str democrat\",\n      \"Democrat, strong\"      = \"Strong democrat\",\n      \"Other\"                 = \"No answer\",\n      \"Other\"                 = \"Don't know\",\n      \"Other\"                 = \"Other party\"\n    )\n  )\n\nUse this technique with care: if you group together categories that are truly different you will end up with misleading results.\nIf you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new variable, you can provide a vector of old levels:\n\ngss_cat |&gt;\n  mutate(\n    partyid = fct_collapse(partyid,\n      \"other\" = c(\"No answer\", \"Don't know\", \"Other party\"),\n      \"rep\" = c(\"Strong republican\", \"Not str republican\"),\n      \"ind\" = c(\"Ind,near rep\", \"Independent\", \"Ind,near dem\"),\n      \"dem\" = c(\"Not str democrat\", \"Strong democrat\")\n    )\n  ) |&gt;\n  count(partyid)\n#&gt; # A tibble: 4 × 2\n#&gt;   partyid     n\n#&gt;   &lt;fct&gt;   &lt;int&gt;\n#&gt; 1 other     548\n#&gt; 2 rep      5346\n#&gt; 3 ind      8409\n#&gt; 4 dem      7180\n\nSometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the fct_lump_*() family of functions. fct_lump_lowfreq() is a simple starting point that progressively lumps the smallest groups categories into “Other”, always keeping “Other” as the smallest category.\n\ngss_cat |&gt;\n  mutate(relig = fct_lump_lowfreq(relig)) |&gt;\n  count(relig)\n#&gt; # A tibble: 2 × 2\n#&gt;   relig          n\n#&gt;   &lt;fct&gt;      &lt;int&gt;\n#&gt; 1 Protestant 10846\n#&gt; 2 Other      10637\n\nIn this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details! Instead, we can use the fct_lump_n() to specify that we want exactly 10 groups:\n\ngss_cat |&gt;\n  mutate(relig = fct_lump_n(relig, n = 10)) |&gt;\n  count(relig, sort = TRUE)\n#&gt; # A tibble: 10 × 2\n#&gt;   relig          n\n#&gt;   &lt;fct&gt;      &lt;int&gt;\n#&gt; 1 Protestant 10846\n#&gt; 2 Catholic    5124\n#&gt; 3 None        3523\n#&gt; 4 Christian    689\n#&gt; 5 Other        458\n#&gt; 6 Jewish       388\n#&gt; # ℹ 4 more rows\n\nRead the documentation to learn about fct_lump_min() and fct_lump_prop() which are useful in other cases.\n\n17.5.1 Exercises\n\nHow have the proportions of people identifying as Democrat, Republican, and Independent changed over time?\nHow could you collapse rincome into a small set of categories?\nNotice there are 9 groups (excluding other) in the fct_lump example above. Why not 10? (Hint: type ?fct_lump, and find the default for the argument other_level is “Other”.)",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "factors.html#sec-ordered-factors",
    "href": "factors.html#sec-ordered-factors",
    "title": "17  Factors",
    "section": "17.6 Ordered factors",
    "text": "17.6 Ordered factors\nBefore we continue, it’s important to briefly mention a special type of factor: ordered factors. Created with the ordered() function, ordered factors imply a strict ordering between levels, but don’t specify anything about the magnitude of the differences between the levels. You use ordered factors when you know there the levels are ranked, but there’s no precise numerical ranking.\nYou can identify an ordered factor when its printed because it uses &lt; symbols between the factor levels:\n\nordered(c(\"a\", \"b\", \"c\"))\n#&gt; [1] a b c\n#&gt; Levels: a &lt; b &lt; c\n\nIn both base R and the tidyverse, ordered factors behave very similarly to regular factors. There are only two places where you might notice different behavior:\n\nIf you map an ordered factor to color or fill in ggplot2, it will default to scale_color_viridis()/scale_fill_viridis(), a color scale that implies a ranking.\nIf you use an ordered predictor in a linear model, it will use “polynomial contrasts”. These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don’t routinely interpret them. If you want to learn more, we recommend vignette(\"contrasts\", package = \"faux\") by Lisa DeBruine.\n\nFor the purposes of this book, correctly distinguishing between regular and ordered factors is not particularly important. More broadly, however, certain fields (particularly the social sciences) do use ordered factors extensively. In these contexts, it’s important to correctly identify them so that other analysis packages can offer the appropriate behavior.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "factors.html#summary",
    "href": "factors.html#summary",
    "title": "17  Factors",
    "section": "17.7 Summary",
    "text": "17.7 Summary\nThis chapter introduced you to the handy forcats package for working with factors, introducing you to the most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor analysis challenge that you haven’t encountered before, I highly recommend skimming the reference index to see if there’s a canned function that can help solve your problem.\nIf you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton’s paper, Wrangling categorical data in R. This paper lays out some of the history discussed in stringsAsFactors: An unauthorized biography and stringsAsFactors = &lt;sigh&gt;, and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!\nIn the next chapter we’ll switch gears to start learning about dates and times in R. Dates and times seem deceptively simple, but as you’ll soon see, the more you learn about them, the more complex they seem to get!",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "factors.html#footnotes",
    "href": "factors.html#footnotes",
    "title": "17  Factors",
    "section": "",
    "text": "They’re also really important for modelling.↩︎",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>17</span>  <span class='chapter-title'>Factors</span>"
    ]
  },
  {
    "objectID": "datetimes.html",
    "href": "datetimes.html",
    "title": "18  Dates and times",
    "section": "",
    "text": "18.1 Introduction\nThis chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get!\nTo warm up think about how many days there are in a year, and how many hours there are in a day. You probably remembered that most years have 365 days, but leap years have 366. Do you know the full rule for determining if a year is a leap year1? The number of hours in a day is a little less obvious: most days have 24 hours, but in places that use daylight saving time (DST), one day each year has 23 hours and another has 25.\nDates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won’t teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.\nWe’ll begin by showing you how to create date-times from various inputs, and then once you’ve got a date-time, how you can extract components like year, month, and day. We’ll then dive into the tricky topic of working with time spans, which come in a variety of flavors depending on what you’re trying to do. We’ll conclude with a brief discussion of the additional challenges posed by time zones.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>18</span>  <span class='chapter-title'>Dates and times</span>"
    ]
  },
  {
    "objectID": "datetimes.html#introduction",
    "href": "datetimes.html#introduction",
    "title": "18  Dates and times",
    "section": "",
    "text": "18.1.1 Prerequisites\nThis chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. As of the latest tidyverse release, lubridate is part of core tidyverse. We will also need nycflights13 for practice data.\n\nlibrary(tidyverse)\nlibrary(nycflights13)",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>18</span>  <span class='chapter-title'>Dates and times</span>"
    ]
  },
  {
    "objectID": "datetimes.html#sec-creating-datetimes",
    "href": "datetimes.html#sec-creating-datetimes",
    "title": "18  Dates and times",
    "section": "18.2 Creating date/times",
    "text": "18.2 Creating date/times\nThere are three types of date/time data that refer to an instant in time:\n\nA date. Tibbles print this as &lt;date&gt;.\nA time within a day. Tibbles print this as &lt;time&gt;.\nA date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as &lt;dttm&gt;. Base R calls these POSIXct, but that doesn’t exactly trip off the tongue.\n\nIn this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.\nYou should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.\nTo get the current date or date-time you can use today() or now():\n\ntoday()\n#&gt; [1] \"2024-11-14\"\nnow()\n#&gt; [1] \"2024-11-14 20:26:59 JST\"\n\nOtherwise, the following sections describe the four ways you’re likely to create a date/time:\n\nWhile reading a file with readr.\nFrom a string.\nFrom individual date-time components.\nFrom an existing date/time object.\n\n\n18.2.1 During import\nIf your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it:\n\ncsv &lt;- \"\n  date,datetime\n  2022-01-02,2022-01-02 05:12\n\"\nread_csv(csv)\n#&gt; # A tibble: 1 × 2\n#&gt;   date       datetime           \n#&gt;   &lt;date&gt;     &lt;dttm&gt;             \n#&gt; 1 2022-01-02 2022-01-02 05:12:00\n\nIf you haven’t heard of ISO8601 before, it’s an international standard2 for writing dates where the components of a date are organized from biggest to smallest separated by -. For example, in ISO8601 May 3 2022 is 2022-05-03. ISO8601 dates can also include times, where hour, minute, and second are separated by :, and the date and time components are separated by either a T or a space. For example, you could write 4:26pm on May 3 2022 as either 2022-05-03 16:26 or 2022-05-03T16:26.\nFor other date-time formats, you’ll need to use col_types plus col_date() or col_datetime() along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a % followed by a single character. For example, %Y-%m-%d specifies a date that’s a year, -, month (as number) -, day. Table Table 18.1 lists all the options.\n\n\n\nTable 18.1: All date formats understood by readr\n\n\n\n\n\nType\nCode\nMeaning\nExample\n\n\n\n\nYear\n%Y\n4 digit year\n2021\n\n\n\n%y\n2 digit year\n21\n\n\nMonth\n%m\nNumber\n2\n\n\n\n%b\nAbbreviated name\nFeb\n\n\n\n%B\nFull name\nFebruary\n\n\nDay\n%d\nOne or two digits\n2\n\n\n\n%e\nTwo digits\n02\n\n\nTime\n%H\n24-hour hour\n13\n\n\n\n%I\n12-hour hour\n1\n\n\n\n%p\nAM/PM\npm\n\n\n\n%M\nMinutes\n35\n\n\n\n%S\nSeconds\n45\n\n\n\n%OS\nSeconds with decimal component\n45.35\n\n\n\n%Z\nTime zone name\nAmerica/Chicago\n\n\n\n%z\nOffset from UTC\n+0800\n\n\nOther\n%.\nSkip one non-digit\n:\n\n\n\n%*\nSkip any number of non-digits\n\n\n\n\n\n\n\nAnd this code shows a few options applied to a very ambiguous date:\n\ncsv &lt;- \"\n  date\n  01/02/15\n\"\n\nread_csv(csv, col_types = cols(date = col_date(\"%m/%d/%y\")))\n#&gt; # A tibble: 1 × 1\n#&gt;   date      \n#&gt;   &lt;date&gt;    \n#&gt; 1 2015-01-02\n\nread_csv(csv, col_types = cols(date = col_date(\"%d/%m/%y\")))\n#&gt; # A tibble: 1 × 1\n#&gt;   date      \n#&gt;   &lt;date&gt;    \n#&gt; 1 2015-02-01\n\nread_csv(csv, col_types = cols(date = col_date(\"%y/%m/%d\")))\n#&gt; # A tibble: 1 × 1\n#&gt;   date      \n#&gt;   &lt;date&gt;    \n#&gt; 1 2001-02-15\n\nNote that no matter how you specify the date format, it’s always displayed the same way once you get it into R.\nIf you’re using %b or %B and working with non-English dates, you’ll also need to provide a locale(). See the list of built-in languages in date_names_langs(), or create your own with date_names(),\n\n\n18.2.2 From strings\nThe date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate’s helpers which attempt to automatically determine the format once you specify the order of the component. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:\n\nymd(\"2017-01-31\")\n#&gt; [1] \"2017-01-31\"\nmdy(\"January 31st, 2017\")\n#&gt; [1] \"2017-01-31\"\ndmy(\"31-Jan-2017\")\n#&gt; [1] \"2017-01-31\"\n\nymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:\n\nymd_hms(\"2017-01-31 20:11:59\")\n#&gt; [1] \"2017-01-31 20:11:59 UTC\"\nmdy_hm(\"01/31/2017 08:01\")\n#&gt; [1] \"2017-01-31 08:01:00 UTC\"\n\nYou can also force the creation of a date-time from a date by supplying a timezone:\n\nymd(\"2017-01-31\", tz = \"UTC\")\n#&gt; [1] \"2017-01-31 UTC\"\n\nHere I use the UTC3 timezone which you might also know as GMT, or Greenwich Mean Time, the time at 0° longitude4 . It doesn’t use daylight saving time, making it a bit easier to compute with .\n\n\n18.2.3 From individual components\nInstead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:\n\nflights |&gt; \n  select(year, month, day, hour, minute)\n#&gt; # A tibble: 336,776 × 5\n#&gt;    year month   day  hour minute\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;  &lt;dbl&gt;\n#&gt; 1  2013     1     1     5     15\n#&gt; 2  2013     1     1     5     29\n#&gt; 3  2013     1     1     5     40\n#&gt; 4  2013     1     1     5     45\n#&gt; 5  2013     1     1     6      0\n#&gt; 6  2013     1     1     5     58\n#&gt; # ℹ 336,770 more rows\n\nTo create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:\n\nflights |&gt; \n  select(year, month, day, hour, minute) |&gt; \n  mutate(departure = make_datetime(year, month, day, hour, minute))\n#&gt; # A tibble: 336,776 × 6\n#&gt;    year month   day  hour minute departure          \n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dttm&gt;             \n#&gt; 1  2013     1     1     5     15 2013-01-01 05:15:00\n#&gt; 2  2013     1     1     5     29 2013-01-01 05:29:00\n#&gt; 3  2013     1     1     5     40 2013-01-01 05:40:00\n#&gt; 4  2013     1     1     5     45 2013-01-01 05:45:00\n#&gt; 5  2013     1     1     6      0 2013-01-01 06:00:00\n#&gt; 6  2013     1     1     5     58 2013-01-01 05:58:00\n#&gt; # ℹ 336,770 more rows\n\nLet’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once we’ve created the date-time variables, we focus in on the variables we’ll explore in the rest of the chapter.\n\nmake_datetime_100 &lt;- function(year, month, day, time) {\n  make_datetime(year, month, day, time %/% 100, time %% 100)\n}\n\nflights_dt &lt;- flights |&gt; \n  filter(!is.na(dep_time), !is.na(arr_time)) |&gt; \n  mutate(\n    dep_time = make_datetime_100(year, month, day, dep_time),\n    arr_time = make_datetime_100(year, month, day, arr_time),\n    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),\n    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)\n  ) |&gt; \n  select(origin, dest, ends_with(\"delay\"), ends_with(\"time\"))\n\nflights_dt\n#&gt; # A tibble: 328,063 × 9\n#&gt;   origin dest  dep_delay arr_delay dep_time            sched_dep_time     \n#&gt;   &lt;chr&gt;  &lt;chr&gt;     &lt;dbl&gt;     &lt;dbl&gt; &lt;dttm&gt;              &lt;dttm&gt;             \n#&gt; 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00\n#&gt; 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00\n#&gt; 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00\n#&gt; 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00\n#&gt; 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00\n#&gt; 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00\n#&gt; # ℹ 328,057 more rows\n#&gt; # ℹ 3 more variables: arr_time &lt;dttm&gt;, sched_arr_time &lt;dttm&gt;, …\n\nWith this data, we can visualize the distribution of departure times across the year:\n\nflights_dt |&gt; \n  ggplot(aes(x = dep_time)) + \n  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day\n\n\n\n\n\n\n\n\nOr within a single day:\n\nflights_dt |&gt; \n  filter(dep_time &lt; ymd(20130102)) |&gt; \n  ggplot(aes(x = dep_time)) + \n  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes\n\n\n\n\n\n\n\n\nNote that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.\n\n\n18.2.4 From other types\nYou may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():\n\nas_datetime(today())\n#&gt; [1] \"2024-11-14 UTC\"\nas_date(now())\n#&gt; [1] \"2024-11-14\"\n\nSometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().\n\nas_datetime(60 * 60 * 10)\n#&gt; [1] \"1970-01-01 10:00:00 UTC\"\nas_date(365 * 10 + 2)\n#&gt; [1] \"1980-01-01\"\n\n\n\n18.2.5 Exercises\n\nWhat happens if you parse a string that contains invalid dates?\n\nymd(c(\"2010-10-10\", \"bananas\"))\n\nWhat does the tzone argument to today() do? Why is it important?\nFor each of the following date-times, show how you’d parse it using a readr column specification and a lubridate function.\n\nd1 &lt;- \"January 1, 2010\"\nd2 &lt;- \"2015-Mar-07\"\nd3 &lt;- \"06-Jun-2017\"\nd4 &lt;- c(\"August 19 (2015)\", \"July 1 (2015)\")\nd5 &lt;- \"12/30/14\" # Dec 30, 2014\nt1 &lt;- \"1705\"\nt2 &lt;- \"11:15:10.12 PM\"",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>18</span>  <span class='chapter-title'>Dates and times</span>"
    ]
  },
  {
    "objectID": "datetimes.html#date-time-components",
    "href": "datetimes.html#date-time-components",
    "title": "18  Dates and times",
    "section": "18.3 Date-time components",
    "text": "18.3 Date-time components\nNow that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.\n\n18.3.1 Getting components\nYou can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second(). These are effectively the opposites of make_datetime().\n\ndatetime &lt;- ymd_hms(\"2026-07-08 12:34:56\")\n\nyear(datetime)\n#&gt; [1] 2026\nmonth(datetime)\n#&gt; [1] 7\nmday(datetime)\n#&gt; [1] 8\n\nyday(datetime)\n#&gt; [1] 189\nwday(datetime)\n#&gt; [1] 4\n\nFor month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.\n\nmonth(datetime, label = TRUE)\n#&gt; [1] Jul\n#&gt; 12 Levels: Jan &lt; Feb &lt; Mar &lt; Apr &lt; May &lt; Jun &lt; Jul &lt; Aug &lt; Sep &lt; ... &lt; Dec\nwday(datetime, label = TRUE, abbr = FALSE)\n#&gt; [1] Wednesday\n#&gt; 7 Levels: Sunday &lt; Monday &lt; Tuesday &lt; Wednesday &lt; Thursday &lt; ... &lt; Saturday\n\nWe can use wday() to see that more flights depart during the week than on the weekend:\n\nflights_dt |&gt; \n  mutate(wday = wday(dep_time, label = TRUE)) |&gt; \n  ggplot(aes(x = wday)) +\n  geom_bar()\n\n\n\n\n\n\n\n\nWe can also look at the average departure delay by minute within the hour. There’s an interesting pattern: flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!\n\nflights_dt |&gt; \n  mutate(minute = minute(dep_time)) |&gt; \n  group_by(minute) |&gt; \n  summarize(\n    avg_delay = mean(dep_delay, na.rm = TRUE),\n    n = n()\n  ) |&gt; \n  ggplot(aes(x = minute, y = avg_delay)) +\n  geom_line()\n\n\n\n\n\n\n\n\nInterestingly, if we look at the scheduled departure time we don’t see such a strong pattern:\n\nsched_dep &lt;- flights_dt |&gt; \n  mutate(minute = minute(sched_dep_time)) |&gt; \n  group_by(minute) |&gt; \n  summarize(\n    avg_delay = mean(arr_delay, na.rm = TRUE),\n    n = n()\n  )\n\nggplot(sched_dep, aes(x = minute, y = avg_delay)) +\n  geom_line()\n\n\n\n\n\n\n\n\nSo why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times, as Figure 18.1 shows. Always be alert for this sort of pattern whenever you work with data that involves human judgement!\n\n\n\n\n\n\n\n\nFigure 18.1: A frequency polygon showing the number of flights scheduled to depart each hour. You can see a strong preference for round numbers like 0 and 30 and generally for numbers that are a multiple of five.\n\n\n\n\n\n\n\n18.3.2 Rounding\nAn alternative approach to plotting individual components is to round the date to a nearby unit of time, with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to. This, for example, allows us to plot the number of flights per week:\n\nflights_dt |&gt; \n  count(week = floor_date(dep_time, \"week\")) |&gt; \n  ggplot(aes(x = week, y = n)) +\n  geom_line() + \n  geom_point()\n\n\n\n\n\n\n\n\nYou can use rounding to show the distribution of flights across the course of a day by computing the difference between dep_time and the earliest instant of that day:\n\nflights_dt |&gt; \n  mutate(dep_hour = dep_time - floor_date(dep_time, \"day\")) |&gt; \n  ggplot(aes(x = dep_hour)) +\n  geom_freqpoly(binwidth = 60 * 30)\n#&gt; Don't know how to automatically pick scale for object of type &lt;difftime&gt;.\n#&gt; Defaulting to continuous.\n\n\n\n\n\n\n\n\nComputing the difference between a pair of date-times yields a difftime (more on that in Section 18.4.3). We can convert that to an hms object to get a more useful x-axis:\n\nflights_dt |&gt; \n  mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, \"day\"))) |&gt; \n  ggplot(aes(x = dep_hour)) +\n  geom_freqpoly(binwidth = 60 * 30)\n\n\n\n\n\n\n\n\n\n\n18.3.3 Modifying components\nYou can also use each accessor function to modify the components of a date/time. This doesn’t come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.\n\n(datetime &lt;- ymd_hms(\"2026-07-08 12:34:56\"))\n#&gt; [1] \"2026-07-08 12:34:56 UTC\"\n\nyear(datetime) &lt;- 2030\ndatetime\n#&gt; [1] \"2030-07-08 12:34:56 UTC\"\nmonth(datetime) &lt;- 01\ndatetime\n#&gt; [1] \"2030-01-08 12:34:56 UTC\"\nhour(datetime) &lt;- hour(datetime) + 1\ndatetime\n#&gt; [1] \"2030-01-08 13:34:56 UTC\"\n\nAlternatively, rather than modifying an existing variable, you can create a new date-time with update(). This also allows you to set multiple values in one step:\n\nupdate(datetime, year = 2030, month = 2, mday = 2, hour = 2)\n#&gt; [1] \"2030-02-02 02:34:56 UTC\"\n\nIf values are too big, they will roll-over:\n\nupdate(ymd(\"2023-02-01\"), mday = 30)\n#&gt; [1] \"2023-03-02\"\nupdate(ymd(\"2023-02-01\"), hour = 400)\n#&gt; [1] \"2023-02-17 16:00:00 UTC\"\n\n\n\n18.3.4 Exercises\n\nHow does the distribution of flight times within a day change over the course of the year?\nCompare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.\nCompare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)\nHow does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?\nOn what day of the week should you leave if you want to minimise the chance of a delay?\nWhat makes the distribution of diamonds$carat and flights$sched_dep_time similar?\nConfirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>18</span>  <span class='chapter-title'>Dates and times</span>"
    ]
  },
  {
    "objectID": "datetimes.html#time-spans",
    "href": "datetimes.html#time-spans",
    "title": "18  Dates and times",
    "section": "18.4 Time spans",
    "text": "18.4 Time spans\nNext you’ll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you’ll learn about three important classes that represent time spans:\n\nDurations, which represent an exact number of seconds.\nPeriods, which represent human units like weeks and months.\nIntervals, which represent a starting and ending point.\n\nHow do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.\n\n18.4.1 Durations\nIn R, when you subtract two dates, you get a difftime object:\n\n# How old is Hadley?\nh_age &lt;- today() - ymd(\"1979-10-14\")\nh_age\n#&gt; Time difference of 16468 days\n\nA difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.\n\nas.duration(h_age)\n#&gt; [1] \"1422835200s (~45.09 years)\"\n\nDurations come with a bunch of convenient constructors:\n\ndseconds(15)\n#&gt; [1] \"15s\"\ndminutes(10)\n#&gt; [1] \"600s (~10 minutes)\"\ndhours(c(12, 24))\n#&gt; [1] \"43200s (~12 hours)\" \"86400s (~1 days)\"\nddays(0:5)\n#&gt; [1] \"0s\"                \"86400s (~1 days)\"  \"172800s (~2 days)\"\n#&gt; [4] \"259200s (~3 days)\" \"345600s (~4 days)\" \"432000s (~5 days)\"\ndweeks(3)\n#&gt; [1] \"1814400s (~3 weeks)\"\ndyears(1)\n#&gt; [1] \"31557600s (~1 years)\"\n\nDurations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.\nYou can add and multiply durations:\n\n2 * dyears(1)\n#&gt; [1] \"63115200s (~2 years)\"\ndyears(1) + dweeks(12) + dhours(15)\n#&gt; [1] \"38869200s (~1.23 years)\"\n\nYou can add and subtract durations to and from days:\n\ntomorrow &lt;- today() + ddays(1)\nlast_year &lt;- today() - dyears(1)\n\nHowever, because durations represent an exact number of seconds, sometimes you might get an unexpected result:\n\none_am &lt;- ymd_hms(\"2026-03-08 01:00:00\", tz = \"America/New_York\")\n\none_am\n#&gt; [1] \"2026-03-08 01:00:00 EST\"\none_am + ddays(1)\n#&gt; [1] \"2026-03-09 02:00:00 EDT\"\n\nWhy is one day after 1am March 8, 2am March 9? If you look carefully at the date you might also notice that the time zones have changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full days worth of seconds we end up with a different time.\n\n\n18.4.2 Periods\nTo solve this problem, lubridate provides periods. Periods are time spans but don’t have a fixed length in seconds, instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:\n\none_am\n#&gt; [1] \"2026-03-08 01:00:00 EST\"\none_am + days(1)\n#&gt; [1] \"2026-03-09 01:00:00 EDT\"\n\nLike durations, periods can be created with a number of friendly constructor functions.\n\nhours(c(12, 24))\n#&gt; [1] \"12H 0M 0S\" \"24H 0M 0S\"\ndays(7)\n#&gt; [1] \"7d 0H 0M 0S\"\nmonths(1:6)\n#&gt; [1] \"1m 0d 0H 0M 0S\" \"2m 0d 0H 0M 0S\" \"3m 0d 0H 0M 0S\" \"4m 0d 0H 0M 0S\"\n#&gt; [5] \"5m 0d 0H 0M 0S\" \"6m 0d 0H 0M 0S\"\n\nYou can add and multiply periods:\n\n10 * (months(6) + days(1))\n#&gt; [1] \"60m 10d 0H 0M 0S\"\ndays(50) + hours(25) + minutes(2)\n#&gt; [1] \"50d 25H 2M 0S\"\n\nAnd of course, add them to dates. Compared to durations, periods are more likely to do what you expect:\n\n# A leap year\nymd(\"2024-01-01\") + dyears(1)\n#&gt; [1] \"2024-12-31 06:00:00 UTC\"\nymd(\"2024-01-01\") + years(1)\n#&gt; [1] \"2025-01-01\"\n\n# Daylight saving time\none_am + ddays(1)\n#&gt; [1] \"2026-03-09 02:00:00 EDT\"\none_am + days(1)\n#&gt; [1] \"2026-03-09 01:00:00 EDT\"\n\nLet’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.\n\nflights_dt |&gt; \n  filter(arr_time &lt; dep_time) \n#&gt; # A tibble: 10,633 × 9\n#&gt;   origin dest  dep_delay arr_delay dep_time            sched_dep_time     \n#&gt;   &lt;chr&gt;  &lt;chr&gt;     &lt;dbl&gt;     &lt;dbl&gt; &lt;dttm&gt;              &lt;dttm&gt;             \n#&gt; 1 EWR    BQN           9        -4 2013-01-01 19:29:00 2013-01-01 19:20:00\n#&gt; 2 JFK    DFW          59        NA 2013-01-01 19:39:00 2013-01-01 18:40:00\n#&gt; 3 EWR    TPA          -2         9 2013-01-01 20:58:00 2013-01-01 21:00:00\n#&gt; 4 EWR    SJU          -6       -12 2013-01-01 21:02:00 2013-01-01 21:08:00\n#&gt; 5 EWR    SFO          11       -14 2013-01-01 21:08:00 2013-01-01 20:57:00\n#&gt; 6 LGA    FLL         -10        -2 2013-01-01 21:20:00 2013-01-01 21:30:00\n#&gt; # ℹ 10,627 more rows\n#&gt; # ℹ 3 more variables: arr_time &lt;dttm&gt;, sched_arr_time &lt;dttm&gt;, …\n\nThese are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.\n\nflights_dt &lt;- flights_dt |&gt; \n  mutate(\n    overnight = arr_time &lt; dep_time,\n    arr_time = arr_time + days(overnight),\n    sched_arr_time = sched_arr_time + days(overnight)\n  )\n\nNow all of our flights obey the laws of physics.\n\nflights_dt |&gt; \n  filter(arr_time &lt; dep_time) \n#&gt; # A tibble: 0 × 10\n#&gt; # ℹ 10 variables: origin &lt;chr&gt;, dest &lt;chr&gt;, dep_delay &lt;dbl&gt;,\n#&gt; #   arr_delay &lt;dbl&gt;, dep_time &lt;dttm&gt;, sched_dep_time &lt;dttm&gt;, …\n\n\n\n18.4.3 Intervals\nWhat does dyears(1) / ddays(365) return? It’s not quite one, because dyears() is defined as the number of seconds per average year, which is 365.25 days.\nWhat does years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:\n\nyears(1) / days(1)\n#&gt; [1] 365.25\n\nIf you want a more accurate measurement, you’ll have to use an interval. An interval is a pair of starting and ending date times, or you can think of it as a duration with a starting point.\nYou can create an interval by writing start %--% end:\n\ny2023 &lt;- ymd(\"2023-01-01\") %--% ymd(\"2024-01-01\")\ny2024 &lt;- ymd(\"2024-01-01\") %--% ymd(\"2025-01-01\")\n\ny2023\n#&gt; [1] 2023-01-01 UTC--2024-01-01 UTC\ny2024\n#&gt; [1] 2024-01-01 UTC--2025-01-01 UTC\n\nYou could then divide it by days() to find out how many days fit in the year:\n\ny2023 / days(1)\n#&gt; [1] 365\ny2024 / days(1)\n#&gt; [1] 366\n\n\n\n18.4.4 Exercises\n\nExplain days(!overnight) and days(overnight) to someone who has just started learning R. What is the key fact you need to know?\nCreate a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.\nWrite a function that given your birthday (as a date), returns how old you are in years.\nWhy can’t (today() %--% (today() + years(1))) / months(1) work?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>18</span>  <span class='chapter-title'>Dates and times</span>"
    ]
  },
  {
    "objectID": "datetimes.html#time-zones",
    "href": "datetimes.html#time-zones",
    "title": "18  Dates and times",
    "section": "18.5 Time zones",
    "text": "18.5 Time zones\nTime zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don’t need to dig into all the details as they’re not all important for data analysis, but there are a few challenges we’ll need to tackle head on.\n\nThe first challenge is that everyday names of time zones tend to be ambiguous. For example, if you’re American you’re probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme {area}/{location}, typically in the form {continent}/{city} or {ocean}/{city}. Examples include “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”.\nYou might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located), did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at https://www.iana.org/time-zones) just to read some of these stories!\nYou can find out what R thinks your current time zone is with Sys.timezone():\n\nSys.timezone()\n#&gt; [1] \"Asia/Tokyo\"\n\n(If R doesn’t know, you’ll get an NA.)\nAnd see the complete list of all time zone names with OlsonNames():\n\nlength(OlsonNames())\n#&gt; [1] 597\nhead(OlsonNames())\n#&gt; [1] \"Africa/Abidjan\"     \"Africa/Accra\"       \"Africa/Addis_Ababa\"\n#&gt; [4] \"Africa/Algiers\"     \"Africa/Asmara\"      \"Africa/Asmera\"\n\nIn R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:\n\nx1 &lt;- ymd_hms(\"2024-06-01 12:00:00\", tz = \"America/New_York\")\nx1\n#&gt; [1] \"2024-06-01 12:00:00 EDT\"\n\nx2 &lt;- ymd_hms(\"2024-06-01 18:00:00\", tz = \"Europe/Copenhagen\")\nx2\n#&gt; [1] \"2024-06-01 18:00:00 CEST\"\n\nx3 &lt;- ymd_hms(\"2024-06-02 04:00:00\", tz = \"Pacific/Auckland\")\nx3\n#&gt; [1] \"2024-06-02 04:00:00 NZST\"\n\nYou can verify that they’re the same time using subtraction:\n\nx1 - x2\n#&gt; Time difference of 0 secs\nx1 - x3\n#&gt; Time difference of 0 secs\n\nUnless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes a convenient representation for computation. Operations that combine date-times, like c(), will often drop the time zone. In that case, the date-times will display in the time zone of the first element:\n\nx4 &lt;- c(x1, x2, x3)\nx4\n#&gt; [1] \"2024-06-01 12:00:00 EDT\" \"2024-06-01 12:00:00 EDT\"\n#&gt; [3] \"2024-06-01 12:00:00 EDT\"\n\nYou can change the time zone in two ways:\n\nKeep the instant in time the same, and change how it’s displayed. Use this when the instant is correct, but you want a more natural display.\n\nx4a &lt;- with_tz(x4, tzone = \"Australia/Lord_Howe\")\nx4a\n#&gt; [1] \"2024-06-02 02:30:00 +1030\" \"2024-06-02 02:30:00 +1030\"\n#&gt; [3] \"2024-06-02 02:30:00 +1030\"\nx4a - x4\n#&gt; Time differences in secs\n#&gt; [1] 0 0 0\n\n(This also illustrates another challenge of times zones: they’re not all integer hour offsets!)\nChange the underlying instant in time. Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.\n\nx4b &lt;- force_tz(x4, tzone = \"Australia/Lord_Howe\")\nx4b\n#&gt; [1] \"2024-06-01 12:00:00 +1030\" \"2024-06-01 12:00:00 +1030\"\n#&gt; [3] \"2024-06-01 12:00:00 +1030\"\nx4b - x4\n#&gt; Time differences in hours\n#&gt; [1] -14.5 -14.5 -14.5",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>18</span>  <span class='chapter-title'>Dates and times</span>"
    ]
  },
  {
    "objectID": "datetimes.html#summary",
    "href": "datetimes.html#summary",
    "title": "18  Dates and times",
    "section": "18.6 Summary",
    "text": "18.6 Summary\nThis chapter has introduced you to the tools that lubridate provides to help you work with date-time data. Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why — date-times are more complex than they seem at first glance, and handling every possible situation adds complexity. Even if your data never crosses a day light savings boundary or involves a leap year, the functions need to be able to handle it.\nThe next chapter gives a round up of missing values. You’ve seen them in a few places and have no doubt encounter in your own analysis, and it’s now time to provide a grab bag of useful techniques for dealing with them.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>18</span>  <span class='chapter-title'>Dates and times</span>"
    ]
  },
  {
    "objectID": "datetimes.html#footnotes",
    "href": "datetimes.html#footnotes",
    "title": "18  Dates and times",
    "section": "",
    "text": "A year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400. In other words, in every set of 400 years, there’s 97 leap years.↩︎\nhttps://xkcd.com/1179/↩︎\nYou might wonder what UTC stands for. It’s a compromise between the English “Coordinated Universal Time” and French “Temps Universel Coordonné”.↩︎\nNo prizes for guessing which country came up with the longitude system.↩︎",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>18</span>  <span class='chapter-title'>Dates and times</span>"
    ]
  },
  {
    "objectID": "missing-values.html",
    "href": "missing-values.html",
    "title": "19  Missing values",
    "section": "",
    "text": "19.1 Introduction\nYou’ve already learned the basics of missing values earlier in the book. You first saw them in Chapter 2 where they resulted in a warning when making a plot as well as in Section 4.5.2 where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in Section 13.2.2. Now we’ll come back to them in more depth, so you can learn more of the details.\nWe’ll start by discussing some general tools for working with missing values recorded as NAs. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>19</span>  <span class='chapter-title'>Missing values</span>"
    ]
  },
  {
    "objectID": "missing-values.html#introduction",
    "href": "missing-values.html#introduction",
    "title": "19  Missing values",
    "section": "",
    "text": "19.1.1 Prerequisites\nThe functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.\n\nlibrary(tidyverse)",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>19</span>  <span class='chapter-title'>Missing values</span>"
    ]
  },
  {
    "objectID": "missing-values.html#explicit-missing-values",
    "href": "missing-values.html#explicit-missing-values",
    "title": "19  Missing values",
    "section": "19.2 Explicit missing values",
    "text": "19.2 Explicit missing values\nTo begin, let’s explore a few handy tools for creating or eliminating missing explicit values, i.e. cells where you see an NA.\n\n19.2.1 Last observation carried forward\nA common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):\n\ntreatment &lt;- tribble(\n  ~person,           ~treatment, ~response,\n  \"Derrick Whitmore\", 1,         7,\n  NA,                 2,         10,\n  NA,                 3,         NA,\n  \"Katherine Burke\",  1,         4\n)\n\nYou can fill in these missing values with tidyr::fill(). It works like select(), taking a set of columns:\n\ntreatment |&gt;\n  fill(everything())\n#&gt; # A tibble: 4 × 3\n#&gt;   person           treatment response\n#&gt;   &lt;chr&gt;                &lt;dbl&gt;    &lt;dbl&gt;\n#&gt; 1 Derrick Whitmore         1        7\n#&gt; 2 Derrick Whitmore         2       10\n#&gt; 3 Derrick Whitmore         3       10\n#&gt; 4 Katherine Burke          1        4\n\nThis treatment is sometimes called “last observation carried forward”, or locf for short. You can use the .direction argument to fill in missing values that have been generated in more exotic ways.\n\n\n19.2.2 Fixed values\nSome times missing values represent some fixed and known value, most commonly 0. You can use dplyr::coalesce() to replace them:\n\nx &lt;- c(1, 4, 5, 7, NA)\ncoalesce(x, 0)\n#&gt; [1] 1 4 5 7 0\n\nSometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.\nIf possible, handle this when reading in the data, for example, by using the na argument to readr::read_csv(), e.g., read_csv(path, na = \"99\"). If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use dplyr::na_if():\n\nx &lt;- c(1, 4, 5, 7, -99)\nna_if(x, -99)\n#&gt; [1]  1  4  5  7 NA\n\n\n\n19.2.3 NaN\nBefore we continue, there’s one special type of missing value that you’ll encounter from time to time: a NaN (pronounced “nan”), or not a number. It’s not that important to know about because it generally behaves just like NA:\n\nx &lt;- c(NA, NaN)\nx * 10\n#&gt; [1]  NA NaN\nx == 1\n#&gt; [1] NA NA\nis.na(x)\n#&gt; [1] TRUE TRUE\n\nIn the rare case you need to distinguish an NA from a NaN, you can use is.nan(x).\nYou’ll generally encounter a NaN when you perform a mathematical operation that has an indeterminate result:\n\n0 / 0 \n#&gt; [1] NaN\n0 * Inf\n#&gt; [1] NaN\nInf - Inf\n#&gt; [1] NaN\nsqrt(-1)\n#&gt; Warning in sqrt(-1): NaNs produced\n#&gt; [1] NaN",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>19</span>  <span class='chapter-title'>Missing values</span>"
    ]
  },
  {
    "objectID": "missing-values.html#sec-missing-implicit",
    "href": "missing-values.html#sec-missing-implicit",
    "title": "19  Missing values",
    "section": "19.3 Implicit missing values",
    "text": "19.3 Implicit missing values\nSo far we’ve talked about missing values that are explicitly missing, i.e. you can see an NA in your data. But missing values can also be implicitly missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple dataset that records the price of some stock each quarter:\n\nstocks &lt;- tibble(\n  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),\n  qtr   = c(   1,    2,    3,    4,    2,    3,    4),\n  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)\n)\n\nThis dataset has two missing observations:\n\nThe price in the fourth quarter of 2020 is explicitly missing, because its value is NA.\nThe price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.\n\nOne way to think about the difference is with this Zen-like koan:\n\nAn explicit missing value is the presence of an absence.\n\nAn implicit missing value is the absence of a presence.\n\nSometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.\n\n19.3.1 Pivoting\nYou’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot stocks to put the quarter in the columns, both missing values become explicit:\n\nstocks |&gt;\n  pivot_wider(\n    names_from = qtr, \n    values_from = price\n  )\n#&gt; # A tibble: 2 × 5\n#&gt;    year   `1`   `2`   `3`   `4`\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  2020  1.88  0.59  0.35 NA   \n#&gt; 2  2021 NA     0.92  0.17  2.66\n\nBy default, making data longer preserves explicit missing values, but if they are structurally missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting values_drop_na = TRUE. See the examples in Section 6.2 for more details.\n\n\n19.3.2 Complete\ntidyr::complete() allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of year and qtr should exist in the stocks data:\n\nstocks |&gt;\n  complete(year, qtr)\n#&gt; # A tibble: 8 × 3\n#&gt;    year   qtr price\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  2020     1  1.88\n#&gt; 2  2020     2  0.59\n#&gt; 3  2020     3  0.35\n#&gt; 4  2020     4 NA   \n#&gt; 5  2021     1 NA   \n#&gt; 6  2021     2  0.92\n#&gt; # ℹ 2 more rows\n\nTypically, you’ll call complete() with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the stocks dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for year:\n\nstocks |&gt;\n  complete(year = 2019:2021, qtr)\n#&gt; # A tibble: 12 × 3\n#&gt;    year   qtr price\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  2019     1 NA   \n#&gt; 2  2019     2 NA   \n#&gt; 3  2019     3 NA   \n#&gt; 4  2019     4 NA   \n#&gt; 5  2020     1  1.88\n#&gt; 6  2020     2  0.59\n#&gt; # ℹ 6 more rows\n\nIf the range of a variable is correct, but not all values are present, you could use full_seq(x, 1) to generate all values from min(x) to max(x) spaced out by 1.\nIn some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what complete() does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with dplyr::full_join().\n\n\n19.3.3 Joins\nThis brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in Chapter 20, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.\ndplyr::anti_join(x, y) is a particularly useful tool here because it selects only the rows in x that don’t have a match in y. For example, we can use two anti_join()s to reveal that we’re missing information for four airports and 722 planes mentioned in flights:\n\nlibrary(nycflights13)\n\nflights |&gt; \n  distinct(faa = dest) |&gt; \n  anti_join(airports)\n#&gt; Joining with `by = join_by(faa)`\n#&gt; # A tibble: 4 × 1\n#&gt;   faa  \n#&gt;   &lt;chr&gt;\n#&gt; 1 BQN  \n#&gt; 2 SJU  \n#&gt; 3 STT  \n#&gt; 4 PSE\n\nflights |&gt; \n  distinct(tailnum) |&gt; \n  anti_join(planes)\n#&gt; Joining with `by = join_by(tailnum)`\n#&gt; # A tibble: 722 × 1\n#&gt;   tailnum\n#&gt;   &lt;chr&gt;  \n#&gt; 1 N3ALAA \n#&gt; 2 N3DUAA \n#&gt; 3 N542MQ \n#&gt; 4 N730MQ \n#&gt; 5 N9EAMQ \n#&gt; 6 N532UA \n#&gt; # ℹ 716 more rows\n\n\n\n19.3.4 Exercises\n\nCan you find any relationship between the carrier and the rows that appear to be missing from planes?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>19</span>  <span class='chapter-title'>Missing values</span>"
    ]
  },
  {
    "objectID": "missing-values.html#factors-and-empty-groups",
    "href": "missing-values.html#factors-and-empty-groups",
    "title": "19  Missing values",
    "section": "19.4 Factors and empty groups",
    "text": "19.4 Factors and empty groups\nA final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:\n\nhealth &lt;- tibble(\n  name   = c(\"Ikaia\", \"Oletta\", \"Leriah\", \"Dashay\", \"Tresaun\"),\n  smoker = factor(c(\"no\", \"no\", \"no\", \"no\", \"no\"), levels = c(\"yes\", \"no\")),\n  age    = c(34, 88, 75, 47, 56),\n)\n\nAnd we want to count the number of smokers with dplyr::count():\n\nhealth |&gt; count(smoker)\n#&gt; # A tibble: 1 × 2\n#&gt;   smoker     n\n#&gt;   &lt;fct&gt;  &lt;int&gt;\n#&gt; 1 no         5\n\nThis dataset only contains non-smokers, but we know that smokers exist; the group of non-smokers is empty. We can request count() to keep all the groups, even those not seen in the data by using .drop = FALSE:\n\nhealth |&gt; count(smoker, .drop = FALSE)\n#&gt; # A tibble: 2 × 2\n#&gt;   smoker     n\n#&gt;   &lt;fct&gt;  &lt;int&gt;\n#&gt; 1 yes        0\n#&gt; 2 no         5\n\nThe same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying drop = FALSE to the appropriate discrete axis:\nggplot(health, aes(x = smoker)) +\n  geom_bar() +\n  scale_x_discrete()\n\nggplot(health, aes(x = smoker)) +\n  geom_bar() +\n  scale_x_discrete(drop = FALSE)\n\n\n\n\n\n\n\n\n\n\nThe same problem comes up more generally with dplyr::group_by(). And again you can use .drop = FALSE to preserve all factor levels:\n\nhealth |&gt; \n  group_by(smoker, .drop = FALSE) |&gt; \n  summarize(\n    n = n(),\n    mean_age = mean(age),\n    min_age = min(age),\n    max_age = max(age),\n    sd_age = sd(age)\n  )\n#&gt; # A tibble: 2 × 6\n#&gt;   smoker     n mean_age min_age max_age sd_age\n#&gt;   &lt;fct&gt;  &lt;int&gt;    &lt;dbl&gt;   &lt;dbl&gt;   &lt;dbl&gt;  &lt;dbl&gt;\n#&gt; 1 yes        0      NaN     Inf    -Inf   NA  \n#&gt; 2 no         5       60      34      88   21.6\n\nWe get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.\n\n# A vector containing two missing values\nx1 &lt;- c(NA, NA)\nlength(x1)\n#&gt; [1] 2\n\n# A vector containing nothing\nx2 &lt;- numeric()\nlength(x2)\n#&gt; [1] 0\n\nAll summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see mean(age) returning NaN because mean(age) = sum(age)/length(age) which here is 0/0. max() and min() return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you’ll get the minimum or maximum of the new data1.\nSometimes a simpler approach is to perform the summary and then make the implicit missings explicit with complete().\n\nhealth |&gt; \n  group_by(smoker) |&gt; \n  summarize(\n    n = n(),\n    mean_age = mean(age),\n    min_age = min(age),\n    max_age = max(age),\n    sd_age = sd(age)\n  ) |&gt; \n  complete(smoker)\n#&gt; # A tibble: 2 × 6\n#&gt;   smoker     n mean_age min_age max_age sd_age\n#&gt;   &lt;fct&gt;  &lt;int&gt;    &lt;dbl&gt;   &lt;dbl&gt;   &lt;dbl&gt;  &lt;dbl&gt;\n#&gt; 1 yes       NA       NA      NA      NA   NA  \n#&gt; 2 no         5       60      34      88   21.6\n\nThe main drawback of this approach is that you get an NA for the count, even though you know that it should be zero.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>19</span>  <span class='chapter-title'>Missing values</span>"
    ]
  },
  {
    "objectID": "missing-values.html#summary",
    "href": "missing-values.html#summary",
    "title": "19  Missing values",
    "section": "19.5 Summary",
    "text": "19.5 Summary\nMissing values are weird! Sometimes they’re recorded as an explicit NA but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values, tools for uncovering implicit missing values, and discussed some of the ways that implicit can become explicit and vice versa.\nIn the next chapter, we tackle the final chapter in this part of the book: joins. This is a bit of a change from the chapters so far because we’re going to discuss tools that work with data frames as a whole, not something that you put inside a data frame.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>19</span>  <span class='chapter-title'>Missing values</span>"
    ]
  },
  {
    "objectID": "missing-values.html#footnotes",
    "href": "missing-values.html#footnotes",
    "title": "19  Missing values",
    "section": "",
    "text": "In other words, min(c(x, y)) is always equal to min(min(x), min(y)).↩︎",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>19</span>  <span class='chapter-title'>Missing values</span>"
    ]
  },
  {
    "objectID": "joins.html",
    "href": "joins.html",
    "title": "20  Joins",
    "section": "",
    "text": "20.1 Introduction\nIt’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in. This chapter will introduce you to two important types of joins:\nWe’ll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the datasets from the nycflights13 package, then use that knowledge to start joining data frames together. Next we’ll discuss how joins work, focusing on their action on the rows. We’ll finish up with a discussion of non-equi joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>20</span>  <span class='chapter-title'>Joins</span>"
    ]
  },
  {
    "objectID": "joins.html#introduction",
    "href": "joins.html#introduction",
    "title": "20  Joins",
    "section": "",
    "text": "Mutating joins, which add new variables to one data frame from matching observations in another.\nFiltering joins, which filter observations from one data frame based on whether or not they match an observation in another.\n\n\n\n20.1.1 Prerequisites\nIn this chapter, we’ll explore the five related datasets from nycflights13 using the join functions from dplyr.\n\nlibrary(tidyverse)\nlibrary(nycflights13)",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>20</span>  <span class='chapter-title'>Joins</span>"
    ]
  },
  {
    "objectID": "joins.html#keys",
    "href": "joins.html#keys",
    "title": "20  Joins",
    "section": "20.2 Keys",
    "text": "20.2 Keys\nTo understand joins, you need to first understand how two tables can be connected through a pair of keys, within each table. In this section, you’ll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.\n\n20.2.1 Primary and foreign keys\nEvery join involves a pair of keys: a primary key and a foreign key. A primary key is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a compound key. For example, in nycflights13:\n\nairlines records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two letter carrier code, making carrier the primary key.\n\nairlines\n#&gt; # A tibble: 16 × 2\n#&gt;   carrier name                    \n#&gt;   &lt;chr&gt;   &lt;chr&gt;                   \n#&gt; 1 9E      Endeavor Air Inc.       \n#&gt; 2 AA      American Airlines Inc.  \n#&gt; 3 AS      Alaska Airlines Inc.    \n#&gt; 4 B6      JetBlue Airways         \n#&gt; 5 DL      Delta Air Lines Inc.    \n#&gt; 6 EV      ExpressJet Airlines Inc.\n#&gt; # ℹ 10 more rows\n\nairports records data about each airport. You can identify each airport by its three letter airport code, making faa the primary key.\n\nairports\n#&gt; # A tibble: 1,458 × 8\n#&gt;   faa   name                            lat   lon   alt    tz dst  \n#&gt;   &lt;chr&gt; &lt;chr&gt;                         &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt;\n#&gt; 1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A    \n#&gt; 2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A    \n#&gt; 3 06C   Schaumburg Regional            42.0 -88.1   801    -6 A    \n#&gt; 4 06N   Randall Airport                41.4 -74.4   523    -5 A    \n#&gt; 5 09J   Jekyll Island Airport          31.1 -81.4    11    -5 A    \n#&gt; 6 0A9   Elizabethton Municipal Airpo…  36.4 -82.2  1593    -5 A    \n#&gt; # ℹ 1,452 more rows\n#&gt; # ℹ 1 more variable: tzone &lt;chr&gt;\n\nplanes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.\n\nplanes\n#&gt; # A tibble: 3,322 × 9\n#&gt;   tailnum  year type              manufacturer    model     engines\n#&gt;   &lt;chr&gt;   &lt;int&gt; &lt;chr&gt;             &lt;chr&gt;           &lt;chr&gt;       &lt;int&gt;\n#&gt; 1 N10156   2004 Fixed wing multi… EMBRAER         EMB-145XR       2\n#&gt; 2 N102UW   1998 Fixed wing multi… AIRBUS INDUSTR… A320-214        2\n#&gt; 3 N103US   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2\n#&gt; 4 N104UW   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2\n#&gt; 5 N10575   2002 Fixed wing multi… EMBRAER         EMB-145LR       2\n#&gt; 6 N105UW   1999 Fixed wing multi… AIRBUS INDUSTR… A320-214        2\n#&gt; # ℹ 3,316 more rows\n#&gt; # ℹ 3 more variables: seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;\n\nweather records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.\n\nweather\n#&gt; # A tibble: 26,115 × 15\n#&gt;   origin  year month   day  hour  temp  dewp humid wind_dir\n#&gt;   &lt;chr&gt;  &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;    &lt;dbl&gt;\n#&gt; 1 EWR     2013     1     1     1  39.0  26.1  59.4      270\n#&gt; 2 EWR     2013     1     1     2  39.0  27.0  61.6      250\n#&gt; 3 EWR     2013     1     1     3  39.0  28.0  64.4      240\n#&gt; 4 EWR     2013     1     1     4  39.9  28.0  62.2      250\n#&gt; 5 EWR     2013     1     1     5  39.0  28.0  64.4      260\n#&gt; 6 EWR     2013     1     1     6  37.9  28.0  67.2      240\n#&gt; # ℹ 26,109 more rows\n#&gt; # ℹ 6 more variables: wind_speed &lt;dbl&gt;, wind_gust &lt;dbl&gt;, …\n\n\nA foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:\n\nflights$tailnum is a foreign key that corresponds to the primary key planes$tailnum.\nflights$carrier is a foreign key that corresponds to the primary key airlines$carrier.\nflights$origin is a foreign key that corresponds to the primary key airports$faa.\nflights$dest is a foreign key that corresponds to the primary key airports$faa.\nflights$origin-flights$time_hour is a compound foreign key that corresponds to the compound primary key weather$origin-weather$time_hour.\n\nThese relationships are summarized visually in Figure 20.1.\n\n\n\n\n\n\n\n\nFigure 20.1: Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.\n\n\n\n\n\nYou’ll notice a nice feature in the design of these keys: the primary and foreign keys almost always have the same names, which, as you’ll see shortly, will make your joining life much easier. It’s also worth noting the opposite relationship: almost every variable name used in multiple tables has the same meaning in each place. There’s only one exception: year means year of departure in flights and year of manufacturer in planes. This will become important when we start actually joining tables together.\n\n\n20.2.2 Checking primary keys\nNow that that we’ve identified the primary keys in each table, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one. This reveals that planes and weather both look good:\n\nplanes |&gt; \n  count(tailnum) |&gt; \n  filter(n &gt; 1)\n#&gt; # A tibble: 0 × 2\n#&gt; # ℹ 2 variables: tailnum &lt;chr&gt;, n &lt;int&gt;\n\nweather |&gt; \n  count(time_hour, origin) |&gt; \n  filter(n &gt; 1)\n#&gt; # A tibble: 0 × 3\n#&gt; # ℹ 3 variables: time_hour &lt;dttm&gt;, origin &lt;chr&gt;, n &lt;int&gt;\n\nYou should also check for missing values in your primary keys — if a value is missing then it can’t identify an observation!\n\nplanes |&gt; \n  filter(is.na(tailnum))\n#&gt; # A tibble: 0 × 9\n#&gt; # ℹ 9 variables: tailnum &lt;chr&gt;, year &lt;int&gt;, type &lt;chr&gt;, manufacturer &lt;chr&gt;,\n#&gt; #   model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;\n\nweather |&gt; \n  filter(is.na(time_hour) | is.na(origin))\n#&gt; # A tibble: 0 × 15\n#&gt; # ℹ 15 variables: origin &lt;chr&gt;, year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;,\n#&gt; #   hour &lt;int&gt;, temp &lt;dbl&gt;, dewp &lt;dbl&gt;, humid &lt;dbl&gt;, wind_dir &lt;dbl&gt;, …\n\n\n\n20.2.3 Surrogate keys\nSo far we haven’t talked about the primary key for flights. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if we have some way to describe them to others.\nAfter a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:\n\nflights |&gt; \n  count(time_hour, carrier, flight) |&gt; \n  filter(n &gt; 1)\n#&gt; # A tibble: 0 × 4\n#&gt; # ℹ 4 variables: time_hour &lt;dttm&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, n &lt;int&gt;\n\nDoes the absence of duplicates automatically make time_hour-carrier-flight a primary key? It’s certainly a good start, but it doesn’t guarantee it. For example, are altitude and latitude a good primary key for airports?\n\nairports |&gt;\n  count(alt, lat) |&gt; \n  filter(n &gt; 1)\n#&gt; # A tibble: 1 × 3\n#&gt;     alt   lat     n\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;\n#&gt; 1    13  40.6     2\n\nIdentifying an airport by its altitude and latitude is clearly a bad idea, and in general it’s not possible to know from the data alone whether or not a combination of variables makes a good a primary key. But for flights, the combination of time_hour, carrier, and flight seems reasonable because it would be really confusing for an airline and its customers if there were multiple flights with the same flight number in the air at the same time.\nThat said, we might be better off introducing a simple numeric surrogate key using the row number:\n\nflights2 &lt;- flights |&gt; \n  mutate(id = row_number(), .before = 1)\nflights2\n#&gt; # A tibble: 336,776 × 20\n#&gt;      id  year month   day dep_time sched_dep_time dep_delay arr_time\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;    &lt;int&gt;          &lt;int&gt;     &lt;dbl&gt;    &lt;int&gt;\n#&gt; 1     1  2013     1     1      517            515         2      830\n#&gt; 2     2  2013     1     1      533            529         4      850\n#&gt; 3     3  2013     1     1      542            540         2      923\n#&gt; 4     4  2013     1     1      544            545        -1     1004\n#&gt; 5     5  2013     1     1      554            600        -6      812\n#&gt; 6     6  2013     1     1      554            558        -4      740\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 12 more variables: sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, …\n\nSurrogate keys can be particularly useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430 which departed 9am 2013-01-03.\n\n\n20.2.4 Exercises\n\nWe forgot to draw the relationship between weather and airports in Figure 20.1. What is the relationship and how should it appear in the diagram?\nweather only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to flights?\nThe year, month, day, hour, and origin variables almost form a compound key for weather, but there’s one hour that has duplicate observations. Can you figure out what’s special about that hour?\nWe know that some days of the year are special and fewer people than usual fly on them (e.g., Christmas eve and Christmas day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?\nDraw a diagram illustrating the connections between the Batting, People, and Salaries data frames in the Lahman package. Draw another diagram that shows the relationship between People, Managers, AwardsManagers. How would you characterize the relationship between the Batting, Pitching, and Fielding data frames?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>20</span>  <span class='chapter-title'>Joins</span>"
    ]
  },
  {
    "objectID": "joins.html#sec-mutating-joins",
    "href": "joins.html#sec-mutating-joins",
    "title": "20  Joins",
    "section": "20.3 Basic joins",
    "text": "20.3 Basic joins\nNow that you understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. dplyr provides six join functions: left_join(), inner_join(), right_join(), full_join(), semi_join(), and anti_join(). They all have the same interface: they take a pair of data frames (x and y) and return a data frame. The order of the rows and columns in the output is primarily determined by x.\nIn this section, you’ll learn how to use one mutating join, left_join(), and two filtering joins, semi_join() and anti_join(). In the next section, you’ll learn exactly how these functions work, and about the remaining inner_join(), right_join() and full_join().\n\n20.3.1 Mutating joins\nA mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables1:\n\nflights2 &lt;- flights |&gt; \n  select(year, time_hour, origin, dest, tailnum, carrier)\nflights2\n#&gt; # A tibble: 336,776 × 6\n#&gt;    year time_hour           origin dest  tailnum carrier\n#&gt;   &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;  \n#&gt; 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA     \n#&gt; 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA     \n#&gt; 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA     \n#&gt; 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6     \n#&gt; 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL     \n#&gt; 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA     \n#&gt; # ℹ 336,770 more rows\n\nThere are four types of mutating join, but there’s one that you’ll use almost all of the time: left_join(). It’s special because the output will always have the same rows as x, the data frame you’re joining to2. The primary use of left_join() is to add in additional metadata. For example, we can use left_join() to add the full airline name to the flights2 data:\n\nflights2 |&gt;\n  left_join(airlines)\n#&gt; Joining with `by = join_by(carrier)`\n#&gt; # A tibble: 336,776 × 7\n#&gt;    year time_hour           origin dest  tailnum carrier name                \n#&gt;   &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;   &lt;chr&gt;               \n#&gt; 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines In…\n#&gt; 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines In…\n#&gt; 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines I…\n#&gt; 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways     \n#&gt; 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc.\n#&gt; 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines In…\n#&gt; # ℹ 336,770 more rows\n\nOr we could find out the temperature and wind speed when each plane departed:\n\nflights2 |&gt; \n  left_join(weather |&gt; select(origin, time_hour, temp, wind_speed))\n#&gt; Joining with `by = join_by(time_hour, origin)`\n#&gt; # A tibble: 336,776 × 8\n#&gt;    year time_hour           origin dest  tailnum carrier  temp wind_speed\n#&gt;   &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;   &lt;dbl&gt;      &lt;dbl&gt;\n#&gt; 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA       39.0       12.7\n#&gt; 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA       39.9       15.0\n#&gt; 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA       39.0       15.0\n#&gt; 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6       39.0       15.0\n#&gt; 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL       39.9       16.1\n#&gt; 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA       39.0       12.7\n#&gt; # ℹ 336,770 more rows\n\nOr what size of plane was flying:\n\nflights2 |&gt; \n  left_join(planes |&gt; select(tailnum, type, engines, seats))\n#&gt; Joining with `by = join_by(tailnum)`\n#&gt; # A tibble: 336,776 × 9\n#&gt;    year time_hour           origin dest  tailnum carrier type                \n#&gt;   &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;   &lt;chr&gt;               \n#&gt; 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed wing multi en…\n#&gt; 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed wing multi en…\n#&gt; 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed wing multi en…\n#&gt; 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed wing multi en…\n#&gt; 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed wing multi en…\n#&gt; 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed wing multi en…\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 2 more variables: engines &lt;int&gt;, seats &lt;int&gt;\n\nWhen left_join() fails to find a match for a row in x, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number N3ALAA so the type, engines, and seats will be missing:\n\nflights2 |&gt; \n  filter(tailnum == \"N3ALAA\") |&gt; \n  left_join(planes |&gt; select(tailnum, type, engines, seats))\n#&gt; Joining with `by = join_by(tailnum)`\n#&gt; # A tibble: 63 × 9\n#&gt;    year time_hour           origin dest  tailnum carrier type  engines seats\n#&gt;   &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;   &lt;chr&gt;   &lt;int&gt; &lt;int&gt;\n#&gt; 1  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      &lt;NA&gt;       NA    NA\n#&gt; 2  2013 2013-01-02 18:00:00 LGA    ORD   N3ALAA  AA      &lt;NA&gt;       NA    NA\n#&gt; 3  2013 2013-01-03 06:00:00 LGA    ORD   N3ALAA  AA      &lt;NA&gt;       NA    NA\n#&gt; 4  2013 2013-01-07 19:00:00 LGA    ORD   N3ALAA  AA      &lt;NA&gt;       NA    NA\n#&gt; 5  2013 2013-01-08 17:00:00 JFK    ORD   N3ALAA  AA      &lt;NA&gt;       NA    NA\n#&gt; 6  2013 2013-01-16 06:00:00 LGA    ORD   N3ALAA  AA      &lt;NA&gt;       NA    NA\n#&gt; # ℹ 57 more rows\n\nWe’ll come back to this problem a few times in the rest of the chapter.\n\n\n20.3.2 Specifying join keys\nBy default, left_join() will use all variables that appear in both data frames as the join key, the so called natural join. This is a useful heuristic, but it doesn’t always work. For example, what happens if we try to join flights2 with the complete planes dataset?\n\nflights2 |&gt; \n  left_join(planes)\n#&gt; Joining with `by = join_by(year, tailnum)`\n#&gt; # A tibble: 336,776 × 13\n#&gt;    year time_hour           origin dest  tailnum carrier type  manufacturer\n#&gt;   &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;   &lt;chr&gt; &lt;chr&gt;       \n#&gt; 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      &lt;NA&gt;  &lt;NA&gt;        \n#&gt; 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      &lt;NA&gt;  &lt;NA&gt;        \n#&gt; 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      &lt;NA&gt;  &lt;NA&gt;        \n#&gt; 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      &lt;NA&gt;  &lt;NA&gt;        \n#&gt; 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      &lt;NA&gt;  &lt;NA&gt;        \n#&gt; 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      &lt;NA&gt;  &lt;NA&gt;        \n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 5 more variables: model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, …\n\nWe get a lot of missing matches because our join is trying to use tailnum and year as a compound key. Both flights and planes have a year column but they mean different things: flights$year is the year the flight occurred and planes$year is the year the plane was built. We only want to join on tailnum so we need to provide an explicit specification with join_by():\n\nflights2 |&gt; \n  left_join(planes, join_by(tailnum))\n#&gt; # A tibble: 336,776 × 14\n#&gt;   year.x time_hour           origin dest  tailnum carrier year.y\n#&gt;    &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;    &lt;int&gt;\n#&gt; 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999\n#&gt; 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998\n#&gt; 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990\n#&gt; 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012\n#&gt; 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991\n#&gt; 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 7 more variables: type &lt;chr&gt;, manufacturer &lt;chr&gt;, model &lt;chr&gt;, …\n\nNote that the year variables are disambiguated in the output with a suffix (year.x and year.y), which tells you whether the variable came from the x or y argument. You can override the default suffixes with the suffix argument.\njoin_by(tailnum) is short for join_by(tailnum == tailnum). It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an equi join. You’ll learn about non-equi joins in Section 20.5.\nSecondly, it’s how you specify different join keys in each table. For example, there are two ways to join the flight2 and airports table: either by dest or origin:\n\nflights2 |&gt; \n  left_join(airports, join_by(dest == faa))\n#&gt; # A tibble: 336,776 × 13\n#&gt;    year time_hour           origin dest  tailnum carrier name                \n#&gt;   &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;   &lt;chr&gt;               \n#&gt; 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George Bush Interco…\n#&gt; 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George Bush Interco…\n#&gt; 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami Intl          \n#&gt; 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      &lt;NA&gt;                \n#&gt; 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfield Jackson …\n#&gt; 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago Ohare Intl  \n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 6 more variables: lat &lt;dbl&gt;, lon &lt;dbl&gt;, alt &lt;dbl&gt;, tz &lt;dbl&gt;, …\n\nflights2 |&gt; \n  left_join(airports, join_by(origin == faa))\n#&gt; # A tibble: 336,776 × 13\n#&gt;    year time_hour           origin dest  tailnum carrier name               \n#&gt;   &lt;int&gt; &lt;dttm&gt;              &lt;chr&gt;  &lt;chr&gt; &lt;chr&gt;   &lt;chr&gt;   &lt;chr&gt;              \n#&gt; 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newark Liberty Intl\n#&gt; 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La Guardia         \n#&gt; 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John F Kennedy Intl\n#&gt; 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John F Kennedy Intl\n#&gt; 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La Guardia         \n#&gt; 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newark Liberty Intl\n#&gt; # ℹ 336,770 more rows\n#&gt; # ℹ 6 more variables: lat &lt;dbl&gt;, lon &lt;dbl&gt;, alt &lt;dbl&gt;, tz &lt;dbl&gt;, …\n\nIn older code you might see a different way of specifying the join keys, using a character vector:\n\nby = \"x\" corresponds to join_by(x).\nby = c(\"a\" = \"x\") corresponds to join_by(a == x).\n\nNow that it exists, we prefer join_by() since it provides a clearer and more flexible specification.\ninner_join(), right_join(), full_join() have the same interface as left_join(). The difference is which rows they keep: left join keeps all the rows in x, the right join keeps all rows in y, the full join keeps all rows in either x or y, and the inner join only keeps rows that occur in both x and y. We’ll come back to these in more detail later.\n\n\n20.3.3 Filtering joins\nAs you might guess the primary action of a filtering join is to filter the rows. There are two types: semi-joins and anti-joins. Semi-joins keep all rows in x that have a match in y. For example, we could use a semi-join to filter the airports dataset to show just the origin airports:\n\nairports |&gt; \n  semi_join(flights2, join_by(faa == origin))\n#&gt; # A tibble: 3 × 8\n#&gt;   faa   name                  lat   lon   alt    tz dst   tzone           \n#&gt;   &lt;chr&gt; &lt;chr&gt;               &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;           \n#&gt; 1 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York\n#&gt; 2 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York\n#&gt; 3 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York\n\nOr just the destinations:\n\nairports |&gt; \n  semi_join(flights2, join_by(faa == dest))\n#&gt; # A tibble: 101 × 8\n#&gt;   faa   name                     lat    lon   alt    tz dst   tzone          \n#&gt;   &lt;chr&gt; &lt;chr&gt;                  &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;          \n#&gt; 1 ABQ   Albuquerque Internati…  35.0 -107.   5355    -7 A     America/Denver \n#&gt; 2 ACK   Nantucket Mem           41.3  -70.1    48    -5 A     America/New_Yo…\n#&gt; 3 ALB   Albany Intl             42.7  -73.8   285    -5 A     America/New_Yo…\n#&gt; 4 ANC   Ted Stevens Anchorage…  61.2 -150.    152    -9 A     America/Anchor…\n#&gt; 5 ATL   Hartsfield Jackson At…  33.6  -84.4  1026    -5 A     America/New_Yo…\n#&gt; 6 AUS   Austin Bergstrom Intl   30.2  -97.7   542    -6 A     America/Chicago\n#&gt; # ℹ 95 more rows\n\nAnti-joins are the opposite: they return all rows in x that don’t have a match in y. They’re useful for finding missing values that are implicit in the data, the topic of Section 19.3. Implicitly missing values don’t show up as NAs but instead only exist as an absence. For example, we can find rows that are missing from airports by looking for flights that don’t have a matching destination airport:\n\nflights2 |&gt; \n  anti_join(airports, join_by(dest == faa)) |&gt; \n  distinct(dest)\n#&gt; # A tibble: 4 × 1\n#&gt;   dest \n#&gt;   &lt;chr&gt;\n#&gt; 1 BQN  \n#&gt; 2 SJU  \n#&gt; 3 STT  \n#&gt; 4 PSE\n\nOr we can find which tailnums are missing from planes:\n\nflights2 |&gt;\n  anti_join(planes, join_by(tailnum)) |&gt; \n  distinct(tailnum)\n#&gt; # A tibble: 722 × 1\n#&gt;   tailnum\n#&gt;   &lt;chr&gt;  \n#&gt; 1 N3ALAA \n#&gt; 2 N3DUAA \n#&gt; 3 N542MQ \n#&gt; 4 N730MQ \n#&gt; 5 N9EAMQ \n#&gt; 6 N532UA \n#&gt; # ℹ 716 more rows\n\n\n\n20.3.4 Exercises\n\nFind the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?\nImagine you’ve found the top 10 most popular destinations using this code:\n\ntop_dest &lt;- flights2 |&gt;\n  count(dest, sort = TRUE) |&gt;\n  head(10)\n\nHow can you find all flights to those destinations?\nDoes every departing flight have corresponding weather data for that hour?\nWhat do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)\nAdd a column to planes that lists every carrier that has flown that plane. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned in previous chapters.\nAdd the latitude and the longitude of the origin and destination airport to flights. Is it easier to rename the columns before or after the join?\nCompute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:\n\nairports |&gt;\n  semi_join(flights, join_by(faa == dest)) |&gt;\n  ggplot(aes(x = lon, y = lat)) +\n    borders(\"state\") +\n    geom_point() +\n    coord_quickmap()\n\nYou might want to use the size or color of the points to display the average delay for each airport.\nWhat happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>20</span>  <span class='chapter-title'>Joins</span>"
    ]
  },
  {
    "objectID": "joins.html#how-do-joins-work",
    "href": "joins.html#how-do-joins-work",
    "title": "20  Joins",
    "section": "20.4 How do joins work?",
    "text": "20.4 How do joins work?\nNow that you’ve used joins a few times it’s time to learn more about how they work, focusing on how each row in x matches rows in y. We’ll begin by introducing a visual representation of joins, using the simple tibbles defined below and shown in Figure 20.2. In these examples we’ll use a single key called key and a single value column (val_x and val_y), but the ideas all generalize to multiple keys and multiple values.\n\nx &lt;- tribble(\n  ~key, ~val_x,\n     1, \"x1\",\n     2, \"x2\",\n     3, \"x3\"\n)\ny &lt;- tribble(\n  ~key, ~val_y,\n     1, \"y1\",\n     2, \"y2\",\n     4, \"y3\"\n)\n\n\n\n\n\n\n\n\n\nFigure 20.2: Graphical representation of two simple tables. The colored key columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride.\n\n\n\n\n\nFigure 20.3 introduces the foundation for our visual representation. It shows all potential matches between x and y as the intersection between lines drawn from each row of x and each row of y. The rows and columns in the output are primarily determined by x, so the x table is horizontal and lines up with the output.\n\n\n\n\n\n\n\n\nFigure 20.3: To understand how joins work, it’s useful to think of every possible match. Here we show that with a grid of connecting lines.\n\n\n\n\n\nTo describe a specific type of join, we indicate matches with dots. The matches determine the rows in the output, a new data frame that contains the key, the x values, and the y values. For example, Figure 20.4 shows an inner join, where rows are retained if and only if the keys are equal.\n\n\n\n\n\n\n\n\nFigure 20.4: An inner join matches each row in x to the row in y that has the same value of key. Each match becomes a row in the output.\n\n\n\n\n\nWe can apply the same principles to explain the outer joins, which keep observations that appear in at least one of the data frames. These joins work by adding an additional “virtual” observation to each data frame. This observation has a key that matches if no other key matches, and values filled with NA. There are three types of outer joins:\n\nA left join keeps all observations in x, Figure 20.5. Every row of x is preserved in the output because it can fall back to matching a row of NAs in y.\n\n\n\n\n\n\n\n\nFigure 20.5: A visual representation of the left join where every row in x appears in the output.\n\n\n\n\n\nA right join keeps all observations in y, Figure 20.6. Every row of y is preserved in the output because it can fall back to matching a row of NAs in x. The output still matches x as much as possible; any extra rows from y are added to the end.\n\n\n\n\n\n\n\n\nFigure 20.6: A visual representation of the right join where every row of y appears in the output.\n\n\n\n\n\nA full join keeps all observations that appear in x or y, Figure 20.7. Every row of x and y is included in the output because both x and y have a fall back row of NAs. Again, the output starts with all rows from x, followed by the remaining unmatched y rows.\n\n\n\n\n\n\n\n\nFigure 20.7: A visual representation of the full join where every row in x and y appears in the output.\n\n\n\n\n\n\nAnother way to show how the types of outer join differ is with a Venn diagram, as in Figure 20.8. However, this is not a great representation because while it might jog your memory about which rows are preserved, it fails to illustrate what’s happening with the columns.\n\n\n\n\n\n\n\n\nFigure 20.8: Venn diagrams showing the difference between inner, left, right, and full joins.\n\n\n\n\n\nThe joins shown here are the so-called equi joins, where rows match if the keys are equal. Equi joins are the most common type of join, so we’ll typically omit the equi prefix, and just say “inner join” rather than “equi inner join”. We’ll come back to non-equi joins in Section 20.5.\n\n20.4.1 Row matching\nSo far we’ve explored what happens if a row in x matches zero or one row in y. What happens if it matches more than one row? To understand what’s going let’s first narrow our focus to the inner_join() and then draw a picture, Figure 20.9.\n\n\n\n\n\n\n\n\nFigure 20.9: The three ways a row in x can match. x1 matches one row in y, x2 matches two rows in y, x3 matches zero rows in y. Note that while there are three rows in x and three rows in the output, there isn’t a direct correspondence between the rows.\n\n\n\n\n\nThere are three possible outcomes for a row in x:\n\nIf it doesn’t match anything, it’s dropped.\nIf it matches 1 row in y, it’s preserved.\nIf it matches more than 1 row in y, it’s duplicated once for each match.\n\nIn principle, this means that there’s no guaranteed correspondence between the rows in the output and the rows in x, but in practice, this rarely causes problems. There is, however, one particularly dangerous case which can cause a combinatorial explosion of rows. Imagine joining the following two tables:\n\ndf1 &lt;- tibble(key = c(1, 2, 2), val_x = c(\"x1\", \"x2\", \"x3\"))\ndf2 &lt;- tibble(key = c(1, 2, 2), val_y = c(\"y1\", \"y2\", \"y3\"))\n\nWhile the first row in df1 only matches one row in df2, the second and third rows both match two rows. This is sometimes called a many-to-many join, and will cause dplyr to emit a warning:\n\ndf1 |&gt; \n  inner_join(df2, join_by(key))\n#&gt; Warning in inner_join(df1, df2, join_by(key)): Detected an unexpected many-to-many relationship between `x` and `y`.\n#&gt; ℹ Row 2 of `x` matches multiple rows in `y`.\n#&gt; ℹ Row 2 of `y` matches multiple rows in `x`.\n#&gt; ℹ If a many-to-many relationship is expected, set `relationship =\n#&gt;   \"many-to-many\"` to silence this warning.\n#&gt; # A tibble: 5 × 3\n#&gt;     key val_x val_y\n#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;\n#&gt; 1     1 x1    y1   \n#&gt; 2     2 x2    y2   \n#&gt; 3     2 x2    y3   \n#&gt; 4     2 x3    y2   \n#&gt; 5     2 x3    y3\n\nIf you are doing this deliberately, you can set relationship = \"many-to-many\", as the warning suggests.\n\n\n20.4.2 Filtering joins\nThe number of matches also determines the behavior of the filtering joins. The semi-join keeps rows in x that have one or more matches in y, as in Figure 20.10. The anti-join keeps rows in x that match zero rows in y, as in Figure 20.11. In both cases, only the existence of a match is important; it doesn’t matter how many times it matches. This means that filtering joins never duplicate rows like mutating joins do.\n\n\n\n\n\n\n\n\nFigure 20.10: In a semi-join it only matters that there is a match; otherwise values in y don’t affect the output.\n\n\n\n\n\n\n\n\n\n\n\n\n\nFigure 20.11: An anti-join is the inverse of a semi-join, dropping rows from x that have a match in y.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>20</span>  <span class='chapter-title'>Joins</span>"
    ]
  },
  {
    "objectID": "joins.html#sec-non-equi-joins",
    "href": "joins.html#sec-non-equi-joins",
    "title": "20  Joins",
    "section": "20.5 Non-equi joins",
    "text": "20.5 Non-equi joins\nSo far you’ve only seen equi joins, joins where the rows match if the x key equals the y key. Now we’re going to relax that restriction and discuss other ways of determining if a pair of rows match.\nBut before we can do that, we need to revisit a simplification we made above. In equi joins the x keys and y are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with keep = TRUE, leading to the code below and the re-drawn inner_join() in Figure 20.12.\n\nx |&gt; inner_join(y, join_by(key == key), keep = TRUE)\n#&gt; # A tibble: 2 × 4\n#&gt;   key.x val_x key.y val_y\n#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;\n#&gt; 1     1 x1        1 y1   \n#&gt; 2     2 x2        2 y2\n\n\n\n\n\n\n\n\n\nFigure 20.12: An inner join showing both x and y keys in the output.\n\n\n\n\n\nWhen we move away from equi joins we’ll always show the keys, because the key values will often be different. For example, instead of matching only when the x$key and y$key are equal, we could match whenever the x$key is greater than or equal to the y$key, leading to Figure 20.13. dplyr’s join functions understand this distinction equi and non-equi joins so will always show both keys when you perform a non-equi join.\n\n\n\n\n\n\n\n\nFigure 20.13: A non-equi join where the x key must be greater than or equal to the y key. Many rows generate multiple matches.\n\n\n\n\n\nNon-equi join isn’t a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps by identifying four particularly useful types of non-equi join:\n\nCross joins match every pair of rows.\nInequality joins use &lt;, &lt;=, &gt;, and &gt;= instead of ==.\nRolling joins are similar to inequality joins but only find the closest match.\nOverlap joins are a special type of inequality join designed to work with ranges.\n\nEach of these is described in more detail in the following sections.\n\n20.5.1 Cross joins\nA cross join matches everything, as in Figure 20.14, generating the Cartesian product of rows. This means the output will have nrow(x) * nrow(y) rows.\n\n\n\n\n\n\n\n\nFigure 20.14: A cross join matches each row in x with every row in y.\n\n\n\n\n\nCross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining df to itself, this is sometimes called a self-join. Cross joins use a different join function because there’s no distinction between inner/left/right/full when you’re matching every row.\n\ndf &lt;- tibble(name = c(\"John\", \"Simon\", \"Tracy\", \"Max\"))\ndf |&gt; cross_join(df)\n#&gt; # A tibble: 16 × 2\n#&gt;   name.x name.y\n#&gt;   &lt;chr&gt;  &lt;chr&gt; \n#&gt; 1 John   John  \n#&gt; 2 John   Simon \n#&gt; 3 John   Tracy \n#&gt; 4 John   Max   \n#&gt; 5 Simon  John  \n#&gt; 6 Simon  Simon \n#&gt; # ℹ 10 more rows\n\n\n\n20.5.2 Inequality joins\nInequality joins use &lt;, &lt;=, &gt;=, or &gt; to restrict the set of possible matches, as in Figure 20.13 and Figure 20.15.\n\n\n\n\n\n\n\n\nFigure 20.15: An inequality join where x is joined to y on rows where the key of x is less than the key of y. This makes a triangular shape in the top-left corner.\n\n\n\n\n\nInequality joins are extremely general, so general that it’s hard to come up with meaningful specific use cases. One small useful technique is to use them to restrict the cross join so that instead of generating all permutations, we generate all combinations:\n\ndf &lt;- tibble(id = 1:4, name = c(\"John\", \"Simon\", \"Tracy\", \"Max\"))\n\ndf |&gt; inner_join(df, join_by(id &lt; id))\n#&gt; # A tibble: 6 × 4\n#&gt;    id.x name.x  id.y name.y\n#&gt;   &lt;int&gt; &lt;chr&gt;  &lt;int&gt; &lt;chr&gt; \n#&gt; 1     1 John       2 Simon \n#&gt; 2     1 John       3 Tracy \n#&gt; 3     1 John       4 Max   \n#&gt; 4     2 Simon      3 Tracy \n#&gt; 5     2 Simon      4 Max   \n#&gt; 6     3 Tracy      4 Max\n\n\n\n20.5.3 Rolling joins\nRolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, you get just the closest row, as in Figure 20.16. You can turn any inequality join into a rolling join by adding closest(). For example join_by(closest(x &lt;= y)) matches the smallest y that’s greater than or equal to x, and join_by(closest(x &gt; y)) matches the biggest y that’s less than x.\n\n\n\n\n\n\n\n\nFigure 20.16: A rolling join is similar to a greater-than-or-equal inequality join but only matches the first value.\n\n\n\n\n\nRolling joins are particularly useful when you have two tables of dates that don’t perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.\nFor example, imagine that you’re in charge of the party planning commission for your office. Your company is rather cheap so instead of having individual parties, you only have a party once each quarter. The rules for determining when a party will be held are a little complex: parties are always on a Monday, you skip the first week of January since a lot of people are on holiday, and the first Monday of Q3 2022 is July 4, so that has to be pushed back a week. That leads to the following party days:\n\nparties &lt;- tibble(\n  q = 1:4,\n  party = ymd(c(\"2022-01-10\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\"))\n)\n\nNow imagine that you have a table of employee birthdays:\n\nset.seed(123)\nemployees &lt;- tibble(\n  name = sample(babynames::babynames$name, 100),\n  birthday = ymd(\"2022-01-01\") + (sample(365, 100, replace = TRUE) - 1)\n)\nemployees\n#&gt; # A tibble: 100 × 2\n#&gt;   name     birthday  \n#&gt;   &lt;chr&gt;    &lt;date&gt;    \n#&gt; 1 Kemba    2022-01-22\n#&gt; 2 Orean    2022-06-26\n#&gt; 3 Kirstyn  2022-02-11\n#&gt; 4 Amparo   2022-11-11\n#&gt; 5 Belen    2022-03-25\n#&gt; 6 Rayshaun 2022-01-11\n#&gt; # ℹ 94 more rows\n\nAnd for each employee we want to find the first party date that comes after (or on) their birthday. We can express that with a rolling join:\n\nemployees |&gt; \n  left_join(parties, join_by(closest(birthday &gt;= party)))\n#&gt; # A tibble: 100 × 4\n#&gt;   name     birthday       q party     \n#&gt;   &lt;chr&gt;    &lt;date&gt;     &lt;int&gt; &lt;date&gt;    \n#&gt; 1 Kemba    2022-01-22     1 2022-01-10\n#&gt; 2 Orean    2022-06-26     2 2022-04-04\n#&gt; 3 Kirstyn  2022-02-11     1 2022-01-10\n#&gt; 4 Amparo   2022-11-11     4 2022-10-03\n#&gt; 5 Belen    2022-03-25     1 2022-01-10\n#&gt; 6 Rayshaun 2022-01-11     1 2022-01-10\n#&gt; # ℹ 94 more rows\n\nThere is, however, one problem with this approach: the folks with birthdays before January 10 don’t get a party:\n\nemployees |&gt; \n  anti_join(parties, join_by(closest(birthday &gt;= party)))\n#&gt; # A tibble: 2 × 2\n#&gt;   name   birthday  \n#&gt;   &lt;chr&gt;  &lt;date&gt;    \n#&gt; 1 Maks   2022-01-07\n#&gt; 2 Nalani 2022-01-04\n\nTo resolve that issue we’ll need to tackle the problem a different way, with overlap joins.\n\n\n20.5.4 Overlap joins\nOverlap joins provide three helpers that use inequality joins to make it easier to work with intervals:\n\nbetween(x, y_lower, y_upper) is short for x &gt;= y_lower, x &lt;= y_upper.\nwithin(x_lower, x_upper, y_lower, y_upper) is short for x_lower &gt;= y_lower, x_upper &lt;= y_upper.\noverlaps(x_lower, x_upper, y_lower, y_upper) is short for x_lower &lt;= y_upper, x_upper &gt;= y_lower.\n\nLet’s continue the birthday example to see how you might use them. There’s one problem with the strategy we used above: there’s no party preceding the birthdays Jan 1-9. So it might be better to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:\n\nparties &lt;- tibble(\n  q = 1:4,\n  party = ymd(c(\"2022-01-10\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\")),\n  start = ymd(c(\"2022-01-01\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\")),\n  end = ymd(c(\"2022-04-03\", \"2022-07-11\", \"2022-10-02\", \"2022-12-31\"))\n)\nparties\n#&gt; # A tibble: 4 × 4\n#&gt;       q party      start      end       \n#&gt;   &lt;int&gt; &lt;date&gt;     &lt;date&gt;     &lt;date&gt;    \n#&gt; 1     1 2022-01-10 2022-01-01 2022-04-03\n#&gt; 2     2 2022-04-04 2022-04-04 2022-07-11\n#&gt; 3     3 2022-07-11 2022-07-11 2022-10-02\n#&gt; 4     4 2022-10-03 2022-10-03 2022-12-31\n\nHadley is hopelessly bad at data entry so he also wanted to check that the party periods don’t overlap. One way to do this is by using a self-join to check if any start-end interval overlap with another:\n\nparties |&gt; \n  inner_join(parties, join_by(overlaps(start, end, start, end), q &lt; q)) |&gt; \n  select(start.x, end.x, start.y, end.y)\n#&gt; # A tibble: 1 × 4\n#&gt;   start.x    end.x      start.y    end.y     \n#&gt;   &lt;date&gt;     &lt;date&gt;     &lt;date&gt;     &lt;date&gt;    \n#&gt; 1 2022-04-04 2022-07-11 2022-07-11 2022-10-02\n\nOoops, there is an overlap, so let’s fix that problem and continue:\n\nparties &lt;- tibble(\n  q = 1:4,\n  party = ymd(c(\"2022-01-10\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\")),\n  start = ymd(c(\"2022-01-01\", \"2022-04-04\", \"2022-07-11\", \"2022-10-03\")),\n  end = ymd(c(\"2022-04-03\", \"2022-07-10\", \"2022-10-02\", \"2022-12-31\"))\n)\n\nNow we can match each employee to their party. This is a good place to use unmatched = \"error\" because we want to quickly find out if any employees didn’t get assigned a party.\n\nemployees |&gt; \n  inner_join(parties, join_by(between(birthday, start, end)), unmatched = \"error\")\n#&gt; # A tibble: 100 × 6\n#&gt;   name     birthday       q party      start      end       \n#&gt;   &lt;chr&gt;    &lt;date&gt;     &lt;int&gt; &lt;date&gt;     &lt;date&gt;     &lt;date&gt;    \n#&gt; 1 Kemba    2022-01-22     1 2022-01-10 2022-01-01 2022-04-03\n#&gt; 2 Orean    2022-06-26     2 2022-04-04 2022-04-04 2022-07-10\n#&gt; 3 Kirstyn  2022-02-11     1 2022-01-10 2022-01-01 2022-04-03\n#&gt; 4 Amparo   2022-11-11     4 2022-10-03 2022-10-03 2022-12-31\n#&gt; 5 Belen    2022-03-25     1 2022-01-10 2022-01-01 2022-04-03\n#&gt; 6 Rayshaun 2022-01-11     1 2022-01-10 2022-01-01 2022-04-03\n#&gt; # ℹ 94 more rows\n\n\n\n20.5.5 Exercises\n\nCan you explain what’s happening with the keys in this equi join? Why are they different?\n\nx |&gt; full_join(y, join_by(key == key))\n#&gt; # A tibble: 4 × 3\n#&gt;     key val_x val_y\n#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;\n#&gt; 1     1 x1    y1   \n#&gt; 2     2 x2    y2   \n#&gt; 3     3 x3    &lt;NA&gt; \n#&gt; 4     4 &lt;NA&gt;  y3\n\nx |&gt; full_join(y, join_by(key == key), keep = TRUE)\n#&gt; # A tibble: 4 × 4\n#&gt;   key.x val_x key.y val_y\n#&gt;   &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;\n#&gt; 1     1 x1        1 y1   \n#&gt; 2     2 x2        2 y2   \n#&gt; 3     3 x3       NA &lt;NA&gt; \n#&gt; 4    NA &lt;NA&gt;      4 y3\n\nWhen finding if any party period overlapped with another party period we used q &lt; q in the join_by()? Why? What happens if you remove this inequality?",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>20</span>  <span class='chapter-title'>Joins</span>"
    ]
  },
  {
    "objectID": "joins.html#summary",
    "href": "joins.html#summary",
    "title": "20  Joins",
    "section": "20.6 Summary",
    "text": "20.6 Summary\nIn this chapter, you’ve learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you’ve gained a glimpse into the power of non-equi joins and seen a few interesting use cases.\nThis chapter concludes the “Transform” part of the book where the focus was on the tools you could use with individual columns and tibbles. You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working with strings, lubridate functions for working with date-times, and forcats functions for working with factors.\nIn the next part of the book, you’ll learn more about getting various types of data into R in a tidy form.",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>20</span>  <span class='chapter-title'>Joins</span>"
    ]
  },
  {
    "objectID": "joins.html#footnotes",
    "href": "joins.html#footnotes",
    "title": "20  Joins",
    "section": "",
    "text": "Remember that in RStudio you can also use View() to avoid this problem.↩︎\nThat’s not 100% true, but you’ll get a warning whenever it isn’t.↩︎",
    "crumbs": [
      "Transform",
      "<span class='chapter-number'>20</span>  <span class='chapter-title'>Joins</span>"
    ]
  },
  {
    "objectID": "import.html",
    "href": "import.html",
    "title": "Import",
    "section": "",
    "text": "In this part of the book, you’ll learn how to import a wider range of data into R, as well as how to get it into a form useful for analysis. Sometimes this is just a matter of calling a function from the appropriate data import package. But in more complex cases it might require both tidying and transformation in order to get to the tidy rectangle that you’d prefer to work with.\n\n\n\n\n\n\n\n\nFigure 1: Data import is the beginning of the data science process; without data you can’t do data science!\n\n\n\n\n\nIn this part of the book you’ll learn how to access data stored in the following ways:\n\nIn 21  Spreadsheets, you’ll learn how to import data from Excel spreadsheets and Google Sheets.\nIn 22  Databases, you’ll learn about getting data out of a database and into R (and you’ll also learn a little about how to get data out of R and into a database).\nIn 23  Arrow, you’ll learn about Arrow, a powerful tool for working with out-of-memory data, particularly when it’s stored in the parquet format.\nIn 24  Hierarchical data, you’ll learn how to work with hierarchical data, including the deeply nested lists produced by data stored in the JSON format.\nIn 25  Web scraping, you’ll learn web “scraping”, the art and science of extracting data from web pages.\n\nThere are two important tidyverse packages that we don’t discuss here: haven and xml2. If you’re working with data from SPSS, Stata, and SAS files, check out the haven package, https://haven.tidyverse.org. If you’re working with XML data, check out the xml2 package, https://xml2.r-lib.org. Otherwise, you’ll need to do some research to figure which package you’ll need to use; google is your friend here 😃.",
    "crumbs": [
      "Import"
    ]
  },
  {
    "objectID": "spreadsheets.html",
    "href": "spreadsheets.html",
    "title": "21  Spreadsheets",
    "section": "",
    "text": "21.1 Introduction\nIn Chapter 8 you learned about importing data from plain text files like .csv and .tsv. Now it’s time to learn how to get data out of a spreadsheet, either an Excel spreadsheet or a Google Sheet. This will build on much of what you’ve learned in Chapter 8, but we will also discuss additional considerations and complexities when working with data from spreadsheets.\nIf you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper “Data Organization in Spreadsheets” by Karl Broman and Kara Woo: https://doi.org/10.1080/00031305.2017.1375989. The best practices presented in this paper will save you much headache when you import data from a spreadsheet into R to analyze and visualize.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>21</span>  <span class='chapter-title'>Spreadsheets</span>"
    ]
  },
  {
    "objectID": "spreadsheets.html#excel",
    "href": "spreadsheets.html#excel",
    "title": "21  Spreadsheets",
    "section": "21.2 Excel",
    "text": "21.2 Excel\nMicrosoft Excel is a widely used spreadsheet software program where data are organized in worksheets inside of spreadsheet files.\n\n21.2.1 Prerequisites\nIn this section, you’ll learn how to load data from Excel spreadsheets in R with the readxl package. This package is non-core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package. Later, we’ll also use the writexl package, which allows us to create Excel spreadsheets.\n\nlibrary(readxl)\nlibrary(tidyverse)\nlibrary(writexl)\n\n\n\n21.2.2 Getting started\nMost of readxl’s functions allow you to load Excel spreadsheets into R:\n\nread_xls() reads Excel files with xls format.\nread_xlsx() read Excel files with xlsx format.\nread_excel() can read files with both xls and xlsx format. It guesses the file type based on the input.\n\nThese functions all have similar syntax just like other functions we have previously introduced for reading other types of files, e.g., read_csv(), read_table(), etc. For the rest of the chapter we will focus on using read_excel().\n\n\n21.2.3 Reading Excel spreadsheets\nFigure 21.1 shows what the spreadsheet we’re going to read into R looks like in Excel. This spreadsheet can be downloaded an Excel file from https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w/.\n\n\n\n\n\n\n\n\nFigure 21.1: Spreadsheet called students.xlsx in Excel.\n\n\n\n\n\nThe first argument to read_excel() is the path to the file to read.\n\nstudents &lt;- read_excel(\"data/students.xlsx\")\n\nread_excel() will read the file in as a tibble.\n\nstudents\n#&gt; # A tibble: 6 × 5\n#&gt;   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  \n#&gt;          &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2            2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    \n#&gt; 4            4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6            6 Güvenç Attila    Ice cream          Lunch only          6\n\nWe have six students in the data and five variables on each student. However there are a few things we might want to address in this dataset:\n\nThe column names are all over the place. You can provide column names that follow a consistent format; we recommend snake_case using the col_names argument.\n\nread_excel(\n  \"data/students.xlsx\",\n  col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\")\n)\n#&gt; # A tibble: 7 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan           age  \n#&gt;   &lt;chr&gt;      &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1 Student ID Full Name        favourite.food     mealPlan            AGE  \n#&gt; 2 1          Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 3 2          Barclay Lynn     French fries       Lunch only          5    \n#&gt; 4 3          Jayendra Lyne    N/A                Breakfast and lunch 7    \n#&gt; 5 4          Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 6 5          Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 7 6          Güvenç Attila    Ice cream          Lunch only          6\n\nUnfortunately, this didn’t quite do the trick. We now have the variable names we want, but what was previously the header row now shows up as the first observation in the data. You can explicitly skip that row using the skip argument.\n\nread_excel(\n  \"data/students.xlsx\",\n  col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n  skip = 1\n)\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan           age  \n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2          2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3          3 Jayendra Lyne    N/A                Breakfast and lunch 7    \n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only          6\n\nIn the favourite_food column, one of the observations is N/A, which stands for “not available” but it’s currently not recognized as an NA (note the contrast between this N/A and the age of the fourth student in the list). You can specify which character strings should be recognized as NAs with the na argument. By default, only \"\" (empty string, or, in the case of reading from a spreadsheet, an empty cell or a cell with the formula =NA()) is recognized as an NA.\n\nread_excel(\n  \"data/students.xlsx\",\n  col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n  skip = 1,\n  na = c(\"\", \"N/A\")\n)\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan           age  \n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2          2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch 7    \n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only          6\n\nOne other remaining issue is that age is read in as a character variable, but it really should be numeric. Just like with read_csv() and friends for reading data from flat files, you can supply a col_types argument to read_excel() and specify the column types for the variables you read in. The syntax is a bit different, though. Your options are \"skip\", \"guess\", \"logical\", \"numeric\", \"date\", \"text\" or \"list\".\n\nread_excel(\n  \"data/students.xlsx\",\n  col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n  skip = 1,\n  na = c(\"\", \"N/A\"),\n  col_types = c(\"numeric\", \"text\", \"text\", \"text\", \"numeric\")\n)\n#&gt; Warning: Expecting numeric in E6 / R6C5: got 'five'\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan             age\n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;dbl&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4\n#&gt; 2          2 Barclay Lynn     French fries       Lunch only              5\n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch     7\n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only             NA\n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch    NA\n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only              6\n\nHowever, this didn’t quite produce the desired result either. By specifying that age should be numeric, we have turned the one cell with the non-numeric entry (which had the value five) into an NA. In this case, we should read age in as \"text\" and then make the change once the data is loaded in R.\n\nstudents &lt;- read_excel(\n  \"data/students.xlsx\",\n  col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n  skip = 1,\n  na = c(\"\", \"N/A\"),\n  col_types = c(\"numeric\", \"text\", \"text\", \"text\", \"text\")\n)\n\nstudents &lt;- students |&gt;\n  mutate(\n    age = if_else(age == \"five\", \"5\", age),\n    age = parse_number(age)\n  )\n\nstudents\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan             age\n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;dbl&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4\n#&gt; 2          2 Barclay Lynn     French fries       Lunch only              5\n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch     7\n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only             NA\n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5\n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only              6\n\n\nIt took us multiple steps and trial-and-error to load the data in exactly the format we want, and this is not unexpected. Data science is an iterative process, and the process of iteration can be even more tedious when reading data in from spreadsheets compared to other plain text, rectangular data files because humans tend to input data into spreadsheets and use them not just for data storage but also for sharing and communication.\nThere is no way to know exactly what the data will look like until you load it and take a look at it. Well, there is one way, actually. You can open the file in Excel and take a peek. If you’re going to do so, we recommend making a copy of the Excel file to open and browse interactively while leaving the original data file untouched and reading into R from the untouched file. This will ensure you don’t accidentally overwrite anything in the spreadsheet while inspecting it. You should also not be afraid of doing what we did here: load the data, take a peek, make adjustments to your code, load it again, and repeat until you’re happy with the result.\n\n\n21.2.4 Reading worksheets\nAn important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets, called worksheets. Figure 21.2 shows an Excel spreadsheet with multiple worksheets. The data come from the palmerpenguins package, and you can download this spreadsheet as an Excel file from https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY/. Each worksheet contains information on penguins from a different island where data were collected.\n\n\n\n\n\n\n\n\nFigure 21.2: Spreadsheet called penguins.xlsx in Excel containing three worksheets.\n\n\n\n\n\nYou can read a single worksheet from a spreadsheet with the sheet argument in read_excel(). The default, which we’ve been relying on up until now, is the first sheet.\n\nread_excel(\"data/penguins.xlsx\", sheet = \"Torgersen Island\")\n#&gt; # A tibble: 52 × 8\n#&gt;   species island    bill_length_mm     bill_depth_mm      flipper_length_mm\n#&gt;   &lt;chr&gt;   &lt;chr&gt;     &lt;chr&gt;              &lt;chr&gt;              &lt;chr&gt;            \n#&gt; 1 Adelie  Torgersen 39.1               18.7               181              \n#&gt; 2 Adelie  Torgersen 39.5               17.399999999999999 186              \n#&gt; 3 Adelie  Torgersen 40.299999999999997 18                 195              \n#&gt; 4 Adelie  Torgersen NA                 NA                 NA               \n#&gt; 5 Adelie  Torgersen 36.700000000000003 19.3               193              \n#&gt; 6 Adelie  Torgersen 39.299999999999997 20.6               190              \n#&gt; # ℹ 46 more rows\n#&gt; # ℹ 3 more variables: body_mass_g &lt;chr&gt;, sex &lt;chr&gt;, year &lt;dbl&gt;\n\nSome variables that appear to contain numerical data are read in as characters due to the character string \"NA\" not being recognized as a true NA.\n\npenguins_torgersen &lt;- read_excel(\"data/penguins.xlsx\", sheet = \"Torgersen Island\", na = \"NA\")\n\npenguins_torgersen\n#&gt; # A tibble: 52 × 8\n#&gt;   species island    bill_length_mm bill_depth_mm flipper_length_mm\n#&gt;   &lt;chr&gt;   &lt;chr&gt;              &lt;dbl&gt;         &lt;dbl&gt;             &lt;dbl&gt;\n#&gt; 1 Adelie  Torgersen           39.1          18.7               181\n#&gt; 2 Adelie  Torgersen           39.5          17.4               186\n#&gt; 3 Adelie  Torgersen           40.3          18                 195\n#&gt; 4 Adelie  Torgersen           NA            NA                  NA\n#&gt; 5 Adelie  Torgersen           36.7          19.3               193\n#&gt; 6 Adelie  Torgersen           39.3          20.6               190\n#&gt; # ℹ 46 more rows\n#&gt; # ℹ 3 more variables: body_mass_g &lt;dbl&gt;, sex &lt;chr&gt;, year &lt;dbl&gt;\n\nAlternatively, you can use excel_sheets() to get information on all worksheets in an Excel spreadsheet, and then read the one(s) you’re interested in.\n\nexcel_sheets(\"data/penguins.xlsx\")\n#&gt; [1] \"Torgersen Island\" \"Biscoe Island\"    \"Dream Island\"\n\nOnce you know the names of the worksheets, you can read them in individually with read_excel().\n\npenguins_biscoe &lt;- read_excel(\"data/penguins.xlsx\", sheet = \"Biscoe Island\", na = \"NA\")\npenguins_dream  &lt;- read_excel(\"data/penguins.xlsx\", sheet = \"Dream Island\", na = \"NA\")\n\nIn this case the full penguins dataset is spread across three worksheets in the spreadsheet. Each worksheet has the same number of columns but different numbers of rows.\n\ndim(penguins_torgersen)\n#&gt; [1] 52  8\ndim(penguins_biscoe)\n#&gt; [1] 168   8\ndim(penguins_dream)\n#&gt; [1] 124   8\n\nWe can put them together with bind_rows().\n\npenguins &lt;- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)\npenguins\n#&gt; # A tibble: 344 × 8\n#&gt;   species island    bill_length_mm bill_depth_mm flipper_length_mm\n#&gt;   &lt;chr&gt;   &lt;chr&gt;              &lt;dbl&gt;         &lt;dbl&gt;             &lt;dbl&gt;\n#&gt; 1 Adelie  Torgersen           39.1          18.7               181\n#&gt; 2 Adelie  Torgersen           39.5          17.4               186\n#&gt; 3 Adelie  Torgersen           40.3          18                 195\n#&gt; 4 Adelie  Torgersen           NA            NA                  NA\n#&gt; 5 Adelie  Torgersen           36.7          19.3               193\n#&gt; 6 Adelie  Torgersen           39.3          20.6               190\n#&gt; # ℹ 338 more rows\n#&gt; # ℹ 3 more variables: body_mass_g &lt;dbl&gt;, sex &lt;chr&gt;, year &lt;dbl&gt;\n\nIn Chapter 27 we’ll talk about ways of doing this sort of task without repetitive code.\n\n\n21.2.5 Reading part of a sheet\nSince many use Excel spreadsheets for presentation as well as for data storage, it’s quite common to find cell entries in a spreadsheet that are not part of the data you want to read into R. Figure 21.3 shows such a spreadsheet: in the middle of the sheet is what looks like a data frame but there is extraneous text in cells above and below the data.\n\n\n\n\n\n\n\n\nFigure 21.3: Spreadsheet called deaths.xlsx in Excel.\n\n\n\n\n\nThis spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the readxl_example() function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in read_excel() as usual.\n\ndeaths_path &lt;- readxl_example(\"deaths.xlsx\")\ndeaths &lt;- read_excel(deaths_path)\n#&gt; New names:\n#&gt; • `` -&gt; `...2`\n#&gt; • `` -&gt; `...3`\n#&gt; • `` -&gt; `...4`\n#&gt; • `` -&gt; `...5`\n#&gt; • `` -&gt; `...6`\ndeaths\n#&gt; # A tibble: 18 × 6\n#&gt;   `Lots of people`    ...2       ...3  ...4     ...5          ...6           \n#&gt;   &lt;chr&gt;               &lt;chr&gt;      &lt;chr&gt; &lt;chr&gt;    &lt;chr&gt;         &lt;chr&gt;          \n#&gt; 1 simply cannot resi… &lt;NA&gt;       &lt;NA&gt;  &lt;NA&gt;     &lt;NA&gt;          some notes     \n#&gt; 2 at                  the        top   &lt;NA&gt;     of            their spreadsh…\n#&gt; 3 or                  merging    &lt;NA&gt;  &lt;NA&gt;     &lt;NA&gt;          cells          \n#&gt; 4 Name                Profession Age   Has kids Date of birth Date of death  \n#&gt; 5 David Bowie         musician   69    TRUE     17175         42379          \n#&gt; 6 Carrie Fisher       actor      60    TRUE     20749         42731          \n#&gt; # ℹ 12 more rows\n\nThe top three rows and the bottom four rows are not part of the data frame. It’s possible to eliminate these extraneous rows using the skip and n_max arguments, but we recommend using cell ranges. In Excel, the top left cell is A1. As you move across columns to the right, the cell label moves down the alphabet, i.e. B1, C1, etc. And as you move down a column, the number in the cell label increases, i.e. A2, A3, etc.\nHere the data we want to read in starts in cell A5 and ends in cell F15. In spreadsheet notation, this is A5:F15, which we supply to the range argument:\n\nread_excel(deaths_path, range = \"A5:F15\")\n#&gt; # A tibble: 10 × 6\n#&gt;   Name          Profession   Age `Has kids` `Date of birth`    \n#&gt;   &lt;chr&gt;         &lt;chr&gt;      &lt;dbl&gt; &lt;lgl&gt;      &lt;dttm&gt;             \n#&gt; 1 David Bowie   musician      69 TRUE       1947-01-08 00:00:00\n#&gt; 2 Carrie Fisher actor         60 TRUE       1956-10-21 00:00:00\n#&gt; 3 Chuck Berry   musician      90 TRUE       1926-10-18 00:00:00\n#&gt; 4 Bill Paxton   actor         61 TRUE       1955-05-17 00:00:00\n#&gt; 5 Prince        musician      57 TRUE       1958-06-07 00:00:00\n#&gt; 6 Alan Rickman  actor         69 FALSE      1946-02-21 00:00:00\n#&gt; # ℹ 4 more rows\n#&gt; # ℹ 1 more variable: `Date of death` &lt;dttm&gt;\n\n\n\n21.2.6 Data types\nIn CSV files, all values are strings. This is not particularly true to the data, but it is simple: everything is a string.\nThe underlying data in Excel spreadsheets is more complex. A cell can be one of four things:\n\nA boolean, like TRUE, FALSE, or NA.\nA number, like “10” or “10.5”.\nA datetime, which can also include time like “11/1/21” or “11/1/21 3:00 PM”.\nA text string, like “ten”.\n\nWhen working with spreadsheet data, it’s important to keep in mind that the underlying data can be very different than what you see in the cell. For example, Excel has no notion of an integer. All numbers are stored as floating points, but you can choose to display the data with a customizable number of decimal points. Similarly, dates are actually stored as numbers, specifically the number of seconds since January 1, 1970. You can customize how you display the date by applying formatting in Excel. Confusingly, it’s also possible to have something that looks like a number but is actually a string (e.g., type '10 into a cell in Excel).\nThese differences between how the underlying data are stored vs. how they’re displayed can cause surprises when the data are loaded into R. By default readxl will guess the data type in a given column. A recommended workflow is to let readxl guess the column types, confirm that you’re happy with the guessed column types, and if not, go back and re-import specifying col_types as shown in Section 21.2.3.\nAnother challenge is when you have a column in your Excel spreadsheet that has a mix of these types, e.g., some cells are numeric, others text, others dates. When importing the data into R readxl has to make some decisions. In these cases you can set the type for this column to \"list\", which will load the column as a list of length 1 vectors, where the type of each element of the vector is guessed.\n\n\n\n\n\n\nSometimes data is stored in more exotic ways, like the color of the cell background, or whether or not the text is bold. In such cases, you might find the tidyxl package useful. See https://nacnudus.github.io/spreadsheet-munging-strategies/ for more on strategies for working with non-tabular data from Excel.\n\n\n\n\n\n21.2.7 Writing to Excel\nLet’s create a small data frame that we can then write out. Note that item is a factor and quantity is an integer.\n\nbake_sale &lt;- tibble(\n  item     = factor(c(\"brownie\", \"cupcake\", \"cookie\")),\n  quantity = c(10, 5, 8)\n)\n\nbake_sale\n#&gt; # A tibble: 3 × 2\n#&gt;   item    quantity\n#&gt;   &lt;fct&gt;      &lt;dbl&gt;\n#&gt; 1 brownie       10\n#&gt; 2 cupcake        5\n#&gt; 3 cookie         8\n\nYou can write data back to disk as an Excel file using the write_xlsx() from the writexl package:\n\nwrite_xlsx(bake_sale, path = \"data/bake-sale.xlsx\")\n\nFigure 21.4 shows what the data looks like in Excel. Note that column names are included and bolded. These can be turned off by setting col_names and format_headers arguments to FALSE.\n\n\n\n\n\n\n\n\nFigure 21.4: Spreadsheet called bake_sale.xlsx in Excel.\n\n\n\n\n\nJust like reading from a CSV, information on data type is lost when we read the data back in. This makes Excel files unreliable for caching interim results as well. For alternatives, see Section 8.5.\n\nread_excel(\"data/bake-sale.xlsx\")\n#&gt; # A tibble: 3 × 2\n#&gt;   item    quantity\n#&gt;   &lt;chr&gt;      &lt;dbl&gt;\n#&gt; 1 brownie       10\n#&gt; 2 cupcake        5\n#&gt; 3 cookie         8\n\n\n\n21.2.8 Formatted output\nThe writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the openxlsx package. We won’t go into the details of using this package here, but we recommend reading https://ycphs.github.io/openxlsx/articles/Formatting.html for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.\nNote that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.\n\n\n21.2.9 Exercises\n\nIn an Excel file, create the following dataset and save it as survey.xlsx. Alternatively, you can download it as an Excel file from here.\n\n\n\n\n\n\n\n\n\nThen, read it into R, with survey_id as a character variable and n_pets as a numerical variable.\n\n#&gt; # A tibble: 6 × 2\n#&gt;   survey_id n_pets\n#&gt;   &lt;chr&gt;      &lt;dbl&gt;\n#&gt; 1 1              0\n#&gt; 2 2              1\n#&gt; 3 3             NA\n#&gt; 4 4              2\n#&gt; 5 5              2\n#&gt; 6 6             NA\n\nIn another Excel file, create the following dataset and save it as roster.xlsx. Alternatively, you can download it as an Excel file from here.\n\n\n\n\n\n\n\n\n\nThen, read it into R. The resulting data frame should be called roster and should look like the following.\n\n#&gt; # A tibble: 12 × 3\n#&gt;    group subgroup    id\n#&gt;    &lt;dbl&gt; &lt;chr&gt;    &lt;dbl&gt;\n#&gt;  1     1 A            1\n#&gt;  2     1 A            2\n#&gt;  3     1 A            3\n#&gt;  4     1 B            4\n#&gt;  5     1 B            5\n#&gt;  6     1 B            6\n#&gt;  7     1 B            7\n#&gt;  8     2 A            8\n#&gt;  9     2 A            9\n#&gt; 10     2 B           10\n#&gt; 11     2 B           11\n#&gt; 12     2 B           12\n\nIn a new Excel file, create the following dataset and save it as sales.xlsx. Alternatively, you can download it as an Excel file from here.\n\n\n\n\n\n\n\n\n\na. Read sales.xlsx in and save as sales. The data frame should look like the following, with id and n as column names and with 9 rows.\n\n#&gt; # A tibble: 9 × 2\n#&gt;   id      n    \n#&gt;   &lt;chr&gt;   &lt;chr&gt;\n#&gt; 1 Brand 1 n    \n#&gt; 2 1234    8    \n#&gt; 3 8721    2    \n#&gt; 4 1822    3    \n#&gt; 5 Brand 2 n    \n#&gt; 6 3333    1    \n#&gt; 7 2156    3    \n#&gt; 8 3987    6    \n#&gt; 9 3216    5\n\nb. Modify sales further to get it into the following tidy format with three columns (brand, id, and n) and 7 rows of data. Note that id and n are numeric, brand is a character variable.\n\n#&gt; # A tibble: 7 × 3\n#&gt;   brand      id     n\n#&gt;   &lt;chr&gt;   &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 Brand 1  1234     8\n#&gt; 2 Brand 1  8721     2\n#&gt; 3 Brand 1  1822     3\n#&gt; 4 Brand 2  3333     1\n#&gt; 5 Brand 2  2156     3\n#&gt; 6 Brand 2  3987     6\n#&gt; 7 Brand 2  3216     5\n\nRecreate the bake_sale data frame, write it out to an Excel file using the write.xlsx() function from the openxlsx package.\nIn Chapter 8 you learned about the janitor::clean_names() function to turn column names into snake case. Read the students.xlsx file that we introduced earlier in this section and use this function to “clean” the column names.\nWhat happens if you try to read in a file with .xlsx extension with read_xls()?",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>21</span>  <span class='chapter-title'>Spreadsheets</span>"
    ]
  },
  {
    "objectID": "spreadsheets.html#google-sheets",
    "href": "spreadsheets.html#google-sheets",
    "title": "21  Spreadsheets",
    "section": "21.3 Google Sheets",
    "text": "21.3 Google Sheets\nGoogle Sheets is another widely used spreadsheet program. It’s free and web-based. Just like with Excel, in Google Sheets data are organized in worksheets (also called sheets) inside of spreadsheet files.\n\n21.3.1 Prerequisites\nThis section will also focus on spreadsheets, but this time you’ll be loading data from a Google Sheet with the googlesheets4 package. This package is non-core tidyverse as well, you need to load it explicitly.\n\nlibrary(googlesheets4)\nlibrary(tidyverse)\n\nA quick note about the name of the package: googlesheets4 uses v4 of the Sheets API v4 to provide an R interface to Google Sheets, hence the name.\n\n\n21.3.2 Getting started\nThe main function of the googlesheets4 package is read_sheet(), which reads a Google Sheet from a URL or a file id. This function also goes by the name range_read().\nYou can also create a brand new sheet with gs4_create() or write to an existing sheet with sheet_write() and friends.\nIn this section we’ll work with the same datasets as the ones in the Excel section to highlight similarities and differences between workflows for reading data from Excel and Google Sheets. readxl and googlesheets4 packages are both designed to mimic the functionality of the readr package, which provides the read_csv() function you’ve seen in Chapter 8. Therefore, many of the tasks can be accomplished with simply swapping out read_excel() for read_sheet(). However you’ll also see that Excel and Google Sheets don’t behave in exactly the same way, therefore other tasks may require further updates to the function calls.\n\n\n21.3.3 Reading Google Sheets\nFigure 21.5 shows what the spreadsheet we’re going to read into R looks like in Google Sheets. This is the same dataset as in Figure 21.1, except it’s stored in a Google Sheet instead of Excel.\n\n\n\n\n\n\n\n\nFigure 21.5: Google Sheet called students in a browser window.\n\n\n\n\n\nThe first argument to read_sheet() is the URL of the file to read, and it returns a tibble:\nhttps://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w. These URLs are not pleasant to work with, so you’ll often want to identify a sheet by its ID.\n\ngs4_deauth()\n\n\nstudents_sheet_id &lt;- \"1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w\"\nstudents &lt;- read_sheet(students_sheet_id)\n#&gt; ✔ Reading from students.\n#&gt; ✔ Range Sheet1.\nstudents\n#&gt; # A tibble: 6 × 5\n#&gt;   `Student ID` `Full Name`      favourite.food     mealPlan            AGE   \n#&gt;          &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;list&gt;\n#&gt; 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          &lt;dbl&gt; \n#&gt; 2            2 Barclay Lynn     French fries       Lunch only          &lt;dbl&gt; \n#&gt; 3            3 Jayendra Lyne    N/A                Breakfast and lunch &lt;dbl&gt; \n#&gt; 4            4 Leon Rossini     Anchovies          Lunch only          &lt;NULL&gt;\n#&gt; 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch &lt;chr&gt; \n#&gt; 6            6 Güvenç Attila    Ice cream          Lunch only          &lt;dbl&gt;\n\nJust like we did with read_excel(), we can supply column names, NA strings, and column types to read_sheet().\n\nstudents &lt;- read_sheet(\n  students_sheet_id,\n  col_names = c(\"student_id\", \"full_name\", \"favourite_food\", \"meal_plan\", \"age\"),\n  skip = 1,\n  na = c(\"\", \"N/A\"),\n  col_types = \"dcccc\"\n)\n#&gt; ✔ Reading from students.\n#&gt; ✔ Range 2:10000000.\n\nstudents\n#&gt; # A tibble: 6 × 5\n#&gt;   student_id full_name        favourite_food     meal_plan           age  \n#&gt;        &lt;dbl&gt; &lt;chr&gt;            &lt;chr&gt;              &lt;chr&gt;               &lt;chr&gt;\n#&gt; 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    \n#&gt; 2          2 Barclay Lynn     French fries       Lunch only          5    \n#&gt; 3          3 Jayendra Lyne    &lt;NA&gt;               Breakfast and lunch 7    \n#&gt; 4          4 Leon Rossini     Anchovies          Lunch only          &lt;NA&gt; \n#&gt; 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five \n#&gt; 6          6 Güvenç Attila    Ice cream          Lunch only          6\n\nNote that we defined column types a bit differently here, using short codes. For example, “dcccc” stands for “double, character, character, character, character”.\nIt’s also possible to read individual sheets from Google Sheets as well. Let’s read the “Torgersen Island” sheet from the penguins Google Sheet:\n\npenguins_sheet_id &lt;- \"1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY\"\nread_sheet(penguins_sheet_id, sheet = \"Torgersen Island\")\n#&gt; ✔ Reading from penguins.\n#&gt; ✔ Range ''Torgersen Island''.\n#&gt; # A tibble: 52 × 8\n#&gt;   species island    bill_length_mm bill_depth_mm flipper_length_mm\n#&gt;   &lt;chr&gt;   &lt;chr&gt;     &lt;list&gt;         &lt;list&gt;        &lt;list&gt;           \n#&gt; 1 Adelie  Torgersen &lt;dbl [1]&gt;      &lt;dbl [1]&gt;     &lt;dbl [1]&gt;        \n#&gt; 2 Adelie  Torgersen &lt;dbl [1]&gt;      &lt;dbl [1]&gt;     &lt;dbl [1]&gt;        \n#&gt; 3 Adelie  Torgersen &lt;dbl [1]&gt;      &lt;dbl [1]&gt;     &lt;dbl [1]&gt;        \n#&gt; 4 Adelie  Torgersen &lt;chr [1]&gt;      &lt;chr [1]&gt;     &lt;chr [1]&gt;        \n#&gt; 5 Adelie  Torgersen &lt;dbl [1]&gt;      &lt;dbl [1]&gt;     &lt;dbl [1]&gt;        \n#&gt; 6 Adelie  Torgersen &lt;dbl [1]&gt;      &lt;dbl [1]&gt;     &lt;dbl [1]&gt;        \n#&gt; # ℹ 46 more rows\n#&gt; # ℹ 3 more variables: body_mass_g &lt;list&gt;, sex &lt;chr&gt;, year &lt;dbl&gt;\n\nYou can obtain a list of all sheets within a Google Sheet with sheet_names():\n\nsheet_names(penguins_sheet_id)\n#&gt; [1] \"Torgersen Island\" \"Biscoe Island\"    \"Dream Island\"\n\nFinally, just like with read_excel(), we can read in a portion of a Google Sheet by defining a range in read_sheet(). Note that we’re also using the gs4_example() function below to locate an example Google Sheet that comes with the googlesheets4 package.\n\ndeaths_url &lt;- gs4_example(\"deaths\")\ndeaths &lt;- read_sheet(deaths_url, range = \"A5:F15\")\n#&gt; ✔ Reading from deaths.\n#&gt; ✔ Range A5:F15.\ndeaths\n#&gt; # A tibble: 10 × 6\n#&gt;   Name          Profession   Age `Has kids` `Date of birth`    \n#&gt;   &lt;chr&gt;         &lt;chr&gt;      &lt;dbl&gt; &lt;lgl&gt;      &lt;dttm&gt;             \n#&gt; 1 David Bowie   musician      69 TRUE       1947-01-08 00:00:00\n#&gt; 2 Carrie Fisher actor         60 TRUE       1956-10-21 00:00:00\n#&gt; 3 Chuck Berry   musician      90 TRUE       1926-10-18 00:00:00\n#&gt; 4 Bill Paxton   actor         61 TRUE       1955-05-17 00:00:00\n#&gt; 5 Prince        musician      57 TRUE       1958-06-07 00:00:00\n#&gt; 6 Alan Rickman  actor         69 FALSE      1946-02-21 00:00:00\n#&gt; # ℹ 4 more rows\n#&gt; # ℹ 1 more variable: `Date of death` &lt;dttm&gt;\n\n\n\n21.3.4 Writing to Google Sheets\nYou can write from R to Google Sheets with write_sheet(). The first argument is the data frame to write, and the second argument is the name (or other identifier) of the Google Sheet to write to:\n\nwrite_sheet(bake_sale, ss = \"bake-sale\")\n\nIf you’d like to write your data to a specific (work)sheet inside a Google Sheet, you can specify that with the sheet argument as well.\n\nwrite_sheet(bake_sale, ss = \"bake-sale\", sheet = \"Sales\")\n\n\n\n21.3.5 Authentication\nWhile you can read from a public Google Sheet without authenticating with your Google account and with gs4_deauth(), reading a private sheet or writing to a sheet requires authentication so that googlesheets4 can view and manage your Google Sheets.\nWhen you attempt to read in a sheet that requires authentication, googlesheets4 will direct you to a web browser with a prompt to sign in to your Google account and grant permission to operate on your behalf with Google Sheets. However, if you want to specify a specific Google account, authentication scope, etc. you can do so with gs4_auth(), e.g., gs4_auth(email = \"mine@example.com\"), which will force the use of a token associated with a specific email. For further authentication details, we recommend reading the documentation googlesheets4 auth vignette: https://googlesheets4.tidyverse.org/articles/auth.html.\n\n\n21.3.6 Exercises\n\nRead the students dataset from earlier in the chapter from Excel and also from Google Sheets, with no additional arguments supplied to the read_excel() and read_sheet() functions. Are the resulting data frames in R exactly the same? If not, how are they different?\nRead the Google Sheet titled survey from https://pos.it/r4ds-survey, with survey_id as a character variable and n_pets as a numerical variable.\nRead the Google Sheet titled roster from https://pos.it/r4ds-roster. The resulting data frame should be called roster and should look like the following.\n\n#&gt; # A tibble: 12 × 3\n#&gt;    group subgroup    id\n#&gt;    &lt;dbl&gt; &lt;chr&gt;    &lt;dbl&gt;\n#&gt;  1     1 A            1\n#&gt;  2     1 A            2\n#&gt;  3     1 A            3\n#&gt;  4     1 B            4\n#&gt;  5     1 B            5\n#&gt;  6     1 B            6\n#&gt;  7     1 B            7\n#&gt;  8     2 A            8\n#&gt;  9     2 A            9\n#&gt; 10     2 B           10\n#&gt; 11     2 B           11\n#&gt; 12     2 B           12",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>21</span>  <span class='chapter-title'>Spreadsheets</span>"
    ]
  },
  {
    "objectID": "spreadsheets.html#summary",
    "href": "spreadsheets.html#summary",
    "title": "21  Spreadsheets",
    "section": "21.4 Summary",
    "text": "21.4 Summary\nMicrosoft Excel and Google Sheets are two of the most popular spreadsheet systems. Being able to interact with data stored in Excel and Google Sheets files directly from R is a superpower! In this chapter you learned how to read data into R from spreadsheets from Excel with read_excel() from the readxl package and from Google Sheets with read_sheet() from the googlesheets4 package. These functions work very similarly to each other and have similar arguments for specifying column names, NA strings, rows to skip on top of the file you’re reading in, etc. Additionally, both functions make it possible to read a single sheet from a spreadsheet as well.\nOn the other hand, writing to an Excel file requires a different package and function (writexl::write_xlsx()) while you can write to a Google Sheet with the googlesheets4 package, with write_sheet().\nIn the next chapter, you’ll learn about a different data source and how to read data from that source into R: databases.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>21</span>  <span class='chapter-title'>Spreadsheets</span>"
    ]
  },
  {
    "objectID": "databases.html",
    "href": "databases.html",
    "title": "22  Databases",
    "section": "",
    "text": "22.1 Introduction\nA huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.\nIn this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL1 query. SQL, short for structured query language, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to the SQL. We’ll use that as a way to teach you some of the most important features of SQL. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "databases.html#introduction",
    "href": "databases.html#introduction",
    "title": "22  Databases",
    "section": "",
    "text": "22.1.1 Prerequisites\nIn this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.\n\nlibrary(DBI)\nlibrary(dbplyr)\nlibrary(tidyverse)",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "databases.html#database-basics",
    "href": "databases.html#database-basics",
    "title": "22  Databases",
    "section": "22.2 Database basics",
    "text": "22.2 Database basics\nAt the simplest level, you can think about a database as a collection of data frames, called tables in database terminology. Like a data frame, a database table is a collection of named columns, where every value in the column is the same type. There are three high level differences between data frames and database tables:\n\nDatabase tables are stored on disk and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).\nDatabase tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row. Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.\nMost classical databases are optimized for rapidly collecting data, not analyzing existing data. These databases are called row-oriented because the data is stored row-by-row, rather than column-by-column like R. More recently, there’s been much development of column-oriented databases that make analyzing the existing data much faster.\n\nDatabases are run by database management systems (DBMS’s for short), which come in three basic forms:\n\nClient-server DBMS’s run on a powerful central server, which you connect from your computer (the client). They are great for sharing data with multiple people in an organization. Popular client-server DBMS’s include PostgreSQL, MariaDB, SQL Server, and Oracle.\nCloud DBMS’s, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.\nIn-process DBMS’s, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "databases.html#connecting-to-a-database",
    "href": "databases.html#connecting-to-a-database",
    "title": "22  Databases",
    "section": "22.3 Connecting to a database",
    "text": "22.3 Connecting to a database\nTo connect to the database from R, you’ll use a pair of packages:\n\nYou’ll always use DBI (database interface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.\nYou’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for PostgreSQL and RMariaDB for MySQL.\n\nIf you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMS. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.\nConcretely, you create a database connection using DBI::dbConnect(). The first argument selects the DBMS2, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:\n\ncon &lt;- DBI::dbConnect(\n  RMariaDB::MariaDB(), \n  username = \"foo\"\n)\ncon &lt;- DBI::dbConnect(\n  RPostgres::Postgres(), \n  hostname = \"databases.mycompany.com\", \n  port = 1234\n)\n\nThe precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. This means you’ll need to do a little research on your own. Typically you can ask the other data scientists in your team or talk to your DBA (database administrator). The initial setup will often take a little fiddling (and maybe some googling) to get it right, but you’ll generally only need to do it once.\n\n22.3.1 In this book\nSetting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.\nConnecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:\n\ncon &lt;- DBI::dbConnect(duckdb::duckdb())\n\nduckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the dbdir argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (Chapter 7), it’s reasonable to store it in the duckdb directory of the current project:\n\ncon &lt;- DBI::dbConnect(duckdb::duckdb(), dbdir = \"duckdb\")\n\n\n\n22.3.2 Load some data\nSince this is a new database, we need to start by adding some data. Here we’ll add mpg and diamonds datasets from ggplot2 using DBI::dbWriteTable(). The simplest usage of dbWriteTable() needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.\n\ndbWriteTable(con, \"mpg\", ggplot2::mpg)\ndbWriteTable(con, \"diamonds\", ggplot2::diamonds)\n\nIf you’re using duckdb in a real project, we highly recommend learning about duckdb_read_csv() and duckdb_register_arrow(). These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R. We’ll also show off a useful technique for loading multiple files into a database in Section 27.4.1.\n\n\n22.3.3 DBI basics\nYou can check that the data is loaded correctly by using a couple of other DBI functions: dbListTables() lists all tables in the database3 and dbReadTable() retrieves the contents of a table.\n\ndbListTables(con)\n#&gt; [1] \"diamonds\" \"mpg\"\n\ncon |&gt; \n  dbReadTable(\"diamonds\") |&gt; \n  as_tibble()\n#&gt; # A tibble: 53,940 × 10\n#&gt;   carat cut       color clarity depth table price     x     y     z\n#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt; &lt;fct&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43\n#&gt; 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31\n#&gt; 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31\n#&gt; 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63\n#&gt; 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75\n#&gt; 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48\n#&gt; # ℹ 53,934 more rows\n\ndbReadTable() returns a data.frame so we use as_tibble() to convert it into a tibble so that it prints nicely.\nIf you already know SQL, you can use dbGetQuery() to get the results of running a query on the database:\n\nsql &lt;- \"\n  SELECT carat, cut, clarity, color, price \n  FROM diamonds \n  WHERE price &gt; 15000\n\"\nas_tibble(dbGetQuery(con, sql))\n#&gt; # A tibble: 1,655 × 5\n#&gt;   carat cut       clarity color price\n#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt;   &lt;fct&gt; &lt;int&gt;\n#&gt; 1  1.54 Premium   VS2     E     15002\n#&gt; 2  1.19 Ideal     VVS1    F     15005\n#&gt; 3  2.1  Premium   SI1     I     15007\n#&gt; 4  1.69 Ideal     SI1     D     15011\n#&gt; 5  1.5  Very Good VVS2    G     15013\n#&gt; 6  1.73 Very Good VS1     G     15014\n#&gt; # ℹ 1,649 more rows\n\nIf you’ve never seen SQL before, don’t worry! You’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where price is greater than 15,000.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "databases.html#dbplyr-basics",
    "href": "databases.html#dbplyr-basics",
    "title": "22  Databases",
    "section": "22.4 dbplyr basics",
    "text": "22.4 dbplyr basics\nNow that we’ve connected to a database and loaded up some data, we can start to learn about dbplyr. dbplyr is a dplyr backend, which means that you keep writing dplyr code but the backend executes it differently. In this, dbplyr translates to SQL; other backends include dtplyr which translates to data.table, and multidplyr which executes your code on multiple cores.\nTo use dbplyr, you must first use tbl() to create an object that represents a database table:\n\ndiamonds_db &lt;- tbl(con, \"diamonds\")\ndiamonds_db\n#&gt; # Source:   table&lt;diamonds&gt; [?? x 10]\n#&gt; # Database: DuckDB v1.1.2 [root@Darwin 23.4.0:R 4.4.1/:memory:]\n#&gt;   carat cut       color clarity depth table price     x     y     z\n#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt; &lt;fct&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43\n#&gt; 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31\n#&gt; 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31\n#&gt; 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63\n#&gt; 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75\n#&gt; 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48\n#&gt; # ℹ more rows\n\n\n\n\n\n\n\nThere are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organized. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:\n\ndiamonds_db &lt;- tbl(con, in_schema(\"sales\", \"diamonds\"))\ndiamonds_db &lt;- tbl(con, in_catalog(\"north_america\", \"sales\", \"diamonds\"))\n\nOther times you might want to use your own SQL query as a starting point:\n\ndiamonds_db &lt;- tbl(con, sql(\"SELECT * FROM diamonds\"))\n\n\n\n\nThis object is lazy; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:\n\nbig_diamonds_db &lt;- diamonds_db |&gt; \n  filter(price &gt; 15000) |&gt; \n  select(carat:clarity, price)\n\nbig_diamonds_db\n#&gt; # Source:   SQL [?? x 5]\n#&gt; # Database: DuckDB v1.1.2 [root@Darwin 23.4.0:R 4.4.1/:memory:]\n#&gt;   carat cut       color clarity price\n#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt; &lt;fct&gt;   &lt;int&gt;\n#&gt; 1  1.54 Premium   E     VS2     15002\n#&gt; 2  1.19 Ideal     F     VVS1    15005\n#&gt; 3  2.1  Premium   I     SI1     15007\n#&gt; 4  1.69 Ideal     D     SI1     15011\n#&gt; 5  1.5  Very Good G     VVS2    15013\n#&gt; 6  1.73 Very Good G     VS1     15014\n#&gt; # ℹ more rows\n\nYou can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.\nYou can see the SQL code generated by the dplyr function show_query(). If you know dplyr, this is a great way to learn SQL! Write some dplyr code, get dbplyr to translate it to SQL, and then try to figure out how the two languages match up.\n\nbig_diamonds_db |&gt;\n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT carat, cut, color, clarity, price\n#&gt; FROM diamonds\n#&gt; WHERE (price &gt; 15000.0)\n\nTo get all the data back into R, you call collect(). Behind the scenes, this generates the SQL, calls dbGetQuery() to get the data, then turns the result into a tibble:\n\nbig_diamonds &lt;- big_diamonds_db |&gt; \n  collect()\nbig_diamonds\n#&gt; # A tibble: 1,655 × 5\n#&gt;   carat cut       color clarity price\n#&gt;   &lt;dbl&gt; &lt;fct&gt;     &lt;fct&gt; &lt;fct&gt;   &lt;int&gt;\n#&gt; 1  1.54 Premium   E     VS2     15002\n#&gt; 2  1.19 Ideal     F     VVS1    15005\n#&gt; 3  2.1  Premium   I     SI1     15007\n#&gt; 4  1.69 Ideal     D     SI1     15011\n#&gt; 5  1.5  Very Good G     VVS2    15013\n#&gt; 6  1.73 Very Good G     VS1     15014\n#&gt; # ℹ 1,649 more rows\n\nTypically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll collect() the data to get an in-memory tibble, and continue your work with pure R code.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "databases.html#sql",
    "href": "databases.html#sql",
    "title": "22  Databases",
    "section": "22.5 SQL",
    "text": "22.5 SQL\nThe rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.\nWe’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: flights and planes. These datasets are easy to get into our learning database because dbplyr comes with a function that copies the tables from nycflights13 to our database:\n\ndbplyr::copy_nycflights13(con)\n#&gt; Creating table: airlines\n#&gt; Creating table: airports\n#&gt; Creating table: flights\n#&gt; Creating table: planes\n#&gt; Creating table: weather\nflights &lt;- tbl(con, \"flights\")\nplanes &lt;- tbl(con, \"planes\")\n\n\n22.5.1 SQL basics\nThe top-level components of SQL are called statements. Common statements include CREATE for defining new tables, INSERT for adding data, and SELECT for retrieving data. We will focus on SELECT statements, also called queries, because they are almost exclusively what you’ll use as a data scientist.\nA query is made up of clauses. There are five important clauses: SELECT, FROM, WHERE, ORDER BY, and GROUP BY. Every query must have the SELECT4 and FROM5 clauses and the simplest query is SELECT * FROM table, which selects all columns from the specified table . This is what dbplyr generates for an unadulterated table :\n\nflights |&gt; show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT *\n#&gt; FROM flights\nplanes |&gt; show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT *\n#&gt; FROM planes\n\nWHERE and ORDER BY control which rows are included and how they are ordered:\n\nflights |&gt; \n  filter(dest == \"IAH\") |&gt; \n  arrange(dep_delay) |&gt;\n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT flights.*\n#&gt; FROM flights\n#&gt; WHERE (dest = 'IAH')\n#&gt; ORDER BY dep_delay\n\nGROUP BY converts the query to a summary, causing aggregation to happen:\n\nflights |&gt; \n  group_by(dest) |&gt; \n  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT dest, AVG(dep_delay) AS dep_delay\n#&gt; FROM flights\n#&gt; GROUP BY dest\n\nThere are two important differences between dplyr verbs and SELECT clauses:\n\nIn SQL, case doesn’t matter: you can write select, SELECT, or even SeLeCt. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variables names.\nIn SQL, order matters: you must always write the clauses in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY. Confusingly, this order doesn’t match how the clauses actually evaluated which is first FROM, then WHERE, GROUP BY, SELECT, and ORDER BY.\n\nThe following sections explore each clause in more detail.\n\n\n\n\n\n\nNote that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMS’s, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue on GitHub to help us do better.\n\n\n\n\n\n22.5.2 SELECT\nThe SELECT clause is the workhorse of queries and performs the same job as select(), mutate(), rename(), relocate(), and, as you’ll learn in the next section, summarize().\nselect(), rename(), and relocate() have very direct translations to SELECT as they just affect where a column appears (if at all) along with its name:\n\nplanes |&gt; \n  select(tailnum, type, manufacturer, model, year) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT tailnum, \"type\", manufacturer, model, \"year\"\n#&gt; FROM planes\n\nplanes |&gt; \n  select(tailnum, type, manufacturer, model, year) |&gt; \n  rename(year_built = year) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT tailnum, \"type\", manufacturer, model, \"year\" AS year_built\n#&gt; FROM planes\n\nplanes |&gt; \n  select(tailnum, type, manufacturer, model, year) |&gt; \n  relocate(manufacturer, model, .before = type) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT tailnum, manufacturer, model, \"type\", \"year\"\n#&gt; FROM planes\n\nThis example also shows you how SQL does renaming. In SQL terminology renaming is called aliasing and is done with AS. Note that unlike mutate(), the old name is on the left and the new name is on the right.\n\n\n\n\n\n\nIn the examples above note that \"year\" and \"type\" are wrapped in double quotes. That’s because these are reserved words in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.\nWhen working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.\nSELECT \"tailnum\", \"type\", \"manufacturer\", \"model\", \"year\"\nFROM \"planes\"\nSome other database systems use backticks instead of quotes:\nSELECT `tailnum`, `type`, `manufacturer`, `model`, `year`\nFROM `planes`\n\n\n\nThe translations for mutate() are similarly straightforward: each variable becomes a new expression in SELECT:\n\nflights |&gt; \n  mutate(\n    speed = distance / (air_time / 60)\n  ) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT flights.*, distance / (air_time / 60.0) AS speed\n#&gt; FROM flights\n\nWe’ll come back to the translation of individual components (like /) in Section 22.6.\n\n\n22.5.3 FROM\nThe FROM clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.\n\n\n22.5.4 GROUP BY\ngroup_by() is translated to the GROUP BY6 clause and summarize() is translated to the SELECT clause:\n\ndiamonds_db |&gt; \n  group_by(cut) |&gt; \n  summarize(\n    n = n(),\n    avg_price = mean(price, na.rm = TRUE)\n  ) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT cut, COUNT(*) AS n, AVG(price) AS avg_price\n#&gt; FROM diamonds\n#&gt; GROUP BY cut\n\nWe’ll come back to what’s happening with translation n() and mean() in Section 22.6.\n\n\n22.5.5 WHERE\nfilter() is translated to the WHERE clause:\n\nflights |&gt; \n  filter(dest == \"IAH\" | dest == \"HOU\") |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT flights.*\n#&gt; FROM flights\n#&gt; WHERE (dest = 'IAH' OR dest = 'HOU')\n\nflights |&gt; \n  filter(arr_delay &gt; 0 & arr_delay &lt; 20) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT flights.*\n#&gt; FROM flights\n#&gt; WHERE (arr_delay &gt; 0.0 AND arr_delay &lt; 20.0)\n\nThere are a few important details to note here:\n\n| becomes OR and & becomes AND.\nSQL uses = for comparison, not ==. SQL doesn’t have assignment, so there’s no potential for confusion there.\nSQL uses only '' for strings, not \"\". In SQL, \"\" is used to identify variables, like R’s ``.\n\nAnother useful SQL operator is IN, which is very close to R’s %in%:\n\nflights |&gt; \n  filter(dest %in% c(\"IAH\", \"HOU\")) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT flights.*\n#&gt; FROM flights\n#&gt; WHERE (dest IN ('IAH', 'HOU'))\n\nSQL uses NULL instead of NA. NULLs behave similarly to NAs. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:\n\nflights |&gt; \n  group_by(dest) |&gt; \n  summarize(delay = mean(arr_delay))\n#&gt; Warning: Missing values are always removed in SQL aggregation functions.\n#&gt; Use `na.rm = TRUE` to silence this warning\n#&gt; This warning is displayed once every 8 hours.\n#&gt; # Source:   SQL [?? x 2]\n#&gt; # Database: DuckDB v1.1.2 [root@Darwin 23.4.0:R 4.4.1/:memory:]\n#&gt;   dest  delay\n#&gt;   &lt;chr&gt; &lt;dbl&gt;\n#&gt; 1 ATL   11.3 \n#&gt; 2 MCO    5.45\n#&gt; 3 MSY    6.49\n#&gt; 4 XNA    7.47\n#&gt; 5 BNA   11.8 \n#&gt; 6 ALB   14.4 \n#&gt; # ℹ more rows\n\nIf you want to learn more about how NULLs work, you might enjoy “The Three-Valued Logic of SQL” by Markus Winand.\nIn general, you can work with NULLs using the functions you’d use for NAs in R:\n\nflights |&gt; \n  filter(!is.na(dep_delay)) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT flights.*\n#&gt; FROM flights\n#&gt; WHERE (NOT((dep_delay IS NULL)))\n\nThis SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:\nWHERE \"dep_delay\" IS NOT NULL\nNote that if you filter() a variable that you created using a summarize, dbplyr will generate a HAVING clause, rather than a WHERE clause. This is a one of the idiosyncrasies of SQL: WHERE is evaluated before SELECT and GROUP BY, so SQL needs another clause that’s evaluated afterwards.\n\ndiamonds_db |&gt; \n  group_by(cut) |&gt; \n  summarize(n = n()) |&gt; \n  filter(n &gt; 100) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT cut, COUNT(*) AS n\n#&gt; FROM diamonds\n#&gt; GROUP BY cut\n#&gt; HAVING (COUNT(*) &gt; 100.0)\n\n\n\n22.5.6 ORDER BY\nOrdering rows involves a straightforward translation from arrange() to the ORDER BY clause:\n\nflights |&gt; \n  arrange(year, month, day, desc(dep_delay)) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT flights.*\n#&gt; FROM flights\n#&gt; ORDER BY \"year\", \"month\", \"day\", dep_delay DESC\n\nNotice how desc() is translated to DESC: this is one of the many dplyr functions whose name was directly inspired by SQL.\n\n\n22.5.7 Subqueries\nSometimes it’s not possible to translate a dplyr pipeline into a single SELECT statement and you need to use a subquery. A subquery is just a query used as a data source in the FROM clause, instead of the usual table.\ndbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the SELECT clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes year1 and then the second (outer) query can compute year2.\n\nflights |&gt; \n  mutate(\n    year1 = year + 1,\n    year2 = year1 + 1\n  ) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT q01.*, year1 + 1.0 AS year2\n#&gt; FROM (\n#&gt;   SELECT flights.*, \"year\" + 1.0 AS year1\n#&gt;   FROM flights\n#&gt; ) q01\n\nYou’ll also see this if you attempted to filter() a variable that you just created. Remember, even though WHERE is written after SELECT, it’s evaluated before it, so we need a subquery in this (silly) example:\n\nflights |&gt; \n  mutate(year1 = year + 1) |&gt; \n  filter(year1 == 2014) |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT q01.*\n#&gt; FROM (\n#&gt;   SELECT flights.*, \"year\" + 1.0 AS year1\n#&gt;   FROM flights\n#&gt; ) q01\n#&gt; WHERE (year1 = 2014.0)\n\nSometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.\n\n\n22.5.8 Joins\nIf you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:\n\nflights |&gt; \n  left_join(planes |&gt; rename(year_built = year), by = \"tailnum\") |&gt; \n  show_query()\n#&gt; &lt;SQL&gt;\n#&gt; SELECT\n#&gt;   flights.*,\n#&gt;   planes.\"year\" AS year_built,\n#&gt;   \"type\",\n#&gt;   manufacturer,\n#&gt;   model,\n#&gt;   engines,\n#&gt;   seats,\n#&gt;   speed,\n#&gt;   engine\n#&gt; FROM flights\n#&gt; LEFT JOIN planes\n#&gt;   ON (flights.tailnum = planes.tailnum)\n\nThe main thing to notice here is the syntax: SQL joins use sub-clauses of the FROM clause to bring in additional tables, using ON to define how the tables are related.\ndplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for inner_join(), right_join(), and full_join():\nSELECT flights.*, \"type\", manufacturer, model, engines, seats, speed\nFROM flights\nINNER JOIN planes ON (flights.tailnum = planes.tailnum)\n\nSELECT flights.*, \"type\", manufacturer, model, engines, seats, speed\nFROM flights\nRIGHT JOIN planes ON (flights.tailnum = planes.tailnum)\n\nSELECT flights.*, \"type\", manufacturer, model, engines, seats, speed\nFROM flights\nFULL JOIN planes ON (flights.tailnum = planes.tailnum)\nYou’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place and to keep a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the dm package, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.\n\n\n22.5.9 Other verbs\ndbplyr also translates other verbs like distinct(), slice_*(), and intersect(), and a growing selection of tidyr functions like pivot_longer() and pivot_wider(). The easiest way to see the full set of what’s currently available is to visit the dbplyr website: https://dbplyr.tidyverse.org/reference/.\n\n\n22.5.10 Exercises\n\nWhat is distinct() translated to? How about head()?\nExplain what each of the following SQL queries do and try recreate them using dbplyr.\nSELECT * \nFROM flights\nWHERE dep_delay &lt; arr_delay\n\nSELECT *, distance / (air_time / 60) AS speed\nFROM flights",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "databases.html#sec-sql-expressions",
    "href": "databases.html#sec-sql-expressions",
    "title": "22  Databases",
    "section": "22.6 Function translations",
    "text": "22.6 Function translations\nSo far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g., what happens when you use mean(x) in a summarize()?\nTo help see what’s going on, we’ll use a couple of little helper functions that run a summarize() or mutate() and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.\n\nsummarize_query &lt;- function(df, ...) {\n  df |&gt; \n    summarize(...) |&gt; \n    show_query()\n}\nmutate_query &lt;- function(df, ...) {\n  df |&gt; \n    mutate(..., .keep = \"none\") |&gt; \n    show_query()\n}\n\nLet’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like mean(), have a relatively simple translation while others, like median(), are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.\n\nflights |&gt; \n  group_by(year, month, day) |&gt;  \n  summarize_query(\n    mean = mean(arr_delay, na.rm = TRUE),\n    median = median(arr_delay, na.rm = TRUE)\n  )\n#&gt; `summarise()` has grouped output by \"year\" and \"month\". You can override\n#&gt; using the `.groups` argument.\n#&gt; &lt;SQL&gt;\n#&gt; SELECT\n#&gt;   \"year\",\n#&gt;   \"month\",\n#&gt;   \"day\",\n#&gt;   AVG(arr_delay) AS mean,\n#&gt;   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY arr_delay) AS median\n#&gt; FROM flights\n#&gt; GROUP BY \"year\", \"month\", \"day\"\n\nThe translation of summary functions becomes more complicated when you use them inside a mutate() because they have to turn into so-called window functions. In SQL, you turn an ordinary aggregation function into a window function by adding OVER after it:\n\nflights |&gt; \n  group_by(year, month, day) |&gt;  \n  mutate_query(\n    mean = mean(arr_delay, na.rm = TRUE),\n  )\n#&gt; &lt;SQL&gt;\n#&gt; SELECT\n#&gt;   \"year\",\n#&gt;   \"month\",\n#&gt;   \"day\",\n#&gt;   AVG(arr_delay) OVER (PARTITION BY \"year\", \"month\", \"day\") AS mean\n#&gt; FROM flights\n\nIn SQL, the GROUP BY clause is used exclusively for summaries so here you can see that the grouping has moved from the PARTITION BY argument to OVER.\nWindow functions include all functions that look forward or backwards, like lead() and lag() which look at the “previous” or “next” value respectively:\n\nflights |&gt; \n  group_by(dest) |&gt;  \n  arrange(time_hour) |&gt; \n  mutate_query(\n    lead = lead(arr_delay),\n    lag = lag(arr_delay)\n  )\n#&gt; &lt;SQL&gt;\n#&gt; SELECT\n#&gt;   dest,\n#&gt;   LEAD(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lead,\n#&gt;   LAG(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lag\n#&gt; FROM flights\n#&gt; ORDER BY time_hour\n\nHere it’s important to arrange() the data, because SQL tables have no intrinsic order. In fact, if you don’t use arrange() you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the ORDER BY clause of the main query doesn’t automatically apply to window functions.\nAnother important SQL function is CASE WHEN. It’s used as the translation of if_else() and case_when(), the dplyr function that it directly inspired. Here are a couple of simple examples:\n\nflights |&gt; \n  mutate_query(\n    description = if_else(arr_delay &gt; 0, \"delayed\", \"on-time\")\n  )\n#&gt; &lt;SQL&gt;\n#&gt; SELECT CASE WHEN (arr_delay &gt; 0.0) THEN 'delayed' WHEN NOT (arr_delay &gt; 0.0) THEN 'on-time' END AS description\n#&gt; FROM flights\nflights |&gt; \n  mutate_query(\n    description = \n      case_when(\n        arr_delay &lt; -5 ~ \"early\", \n        arr_delay &lt; 5 ~ \"on-time\",\n        arr_delay &gt;= 5 ~ \"late\"\n      )\n  )\n#&gt; &lt;SQL&gt;\n#&gt; SELECT CASE\n#&gt; WHEN (arr_delay &lt; -5.0) THEN 'early'\n#&gt; WHEN (arr_delay &lt; 5.0) THEN 'on-time'\n#&gt; WHEN (arr_delay &gt;= 5.0) THEN 'late'\n#&gt; END AS description\n#&gt; FROM flights\n\nCASE WHEN is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is cut():\n\nflights |&gt; \n  mutate_query(\n    description =  cut(\n      arr_delay, \n      breaks = c(-Inf, -5, 5, Inf), \n      labels = c(\"early\", \"on-time\", \"late\")\n    )\n  )\n#&gt; &lt;SQL&gt;\n#&gt; SELECT CASE\n#&gt; WHEN (arr_delay &lt;= -5.0) THEN 'early'\n#&gt; WHEN (arr_delay &lt;= 5.0) THEN 'on-time'\n#&gt; WHEN (arr_delay &gt; 5.0) THEN 'late'\n#&gt; END AS description\n#&gt; FROM flights\n\ndbplyr also translates common string and date-time manipulation functions, which you can learn about in vignette(\"translation-function\", package = \"dbplyr\"). dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "databases.html#summary",
    "href": "databases.html#summary",
    "title": "22  Databases",
    "section": "22.7 Summary",
    "text": "22.7 Summary\nIn this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s the most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL. We have two recommendations:\n\nSQL for Data Scientists by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organizations.\nPractical SQL by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.\n\nIn the next chapter, we’ll learn about another dplyr backend for working with large data: arrow. Arrow is designed for working with large files on disk, and is a natural complement to databases.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "databases.html#footnotes",
    "href": "databases.html#footnotes",
    "title": "22  Databases",
    "section": "",
    "text": "SQL is either pronounced “s”-“q”-“l” or “sequel”.↩︎\nTypically, this is the only function you’ll use from the client package, so we recommend using :: to pull out that one function, rather than loading the complete package with library().↩︎\nAt least, all the tables that you have permission to see.↩︎\nConfusingly, depending on the context, SELECT is either a statement or a clause. To avoid this confusion, we’ll generally use SELECT query instead of SELECT statement.↩︎\nOk, technically, only the SELECT is required, since you can write queries like SELECT 1+1 to perform basic calculations. But if you want to work with data (as you always do!) you’ll also need a FROM clause.↩︎\nThis is no coincidence: the dplyr function name was inspired by the SQL clause.↩︎",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>22</span>  <span class='chapter-title'>Databases</span>"
    ]
  },
  {
    "objectID": "arrow.html",
    "href": "arrow.html",
    "title": "23  Arrow",
    "section": "",
    "text": "23.1 Introduction\nCSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the parquet format, an open standards-based format widely used by big data systems.\nWe’ll pair parquet files with Apache Arrow, a multi-language toolbox designed for efficient analysis and transport of large datasets. We’ll use Apache Arrow via the arrow package, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.\nBoth arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, as the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works the best for you.\n(A big thanks to Danielle Navarro who contributed the initial version of this chapter.)",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>23</span>  <span class='chapter-title'>Arrow</span>"
    ]
  },
  {
    "objectID": "arrow.html#introduction",
    "href": "arrow.html#introduction",
    "title": "23  Arrow",
    "section": "",
    "text": "23.1.1 Prerequisites\nIn this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.\n\nlibrary(tidyverse)\nlibrary(arrow)\n\nLater in the chapter, we’ll also see some connections between arrow and duckdb, so we’ll also need dbplyr and duckdb.\n\nlibrary(dbplyr, warn.conflicts = FALSE)\nlibrary(duckdb)\n#&gt; Loading required package: DBI",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>23</span>  <span class='chapter-title'>Arrow</span>"
    ]
  },
  {
    "objectID": "arrow.html#getting-the-data",
    "href": "arrow.html#getting-the-data",
    "title": "23  Arrow",
    "section": "23.2 Getting the data",
    "text": "23.2 Getting the data\nWe begin by getting a dataset worthy of these tools: a dataset of item checkouts from Seattle public libraries, available online at data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6. This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.\nThe following code will get you a cached copy of the data. The data is a 9GB CSV file, so it will take some time to download. I highly recommend using curl::multi_download() to get very large files as it’s built for exactly this purpose: it gives you a progress bar and it can resume the download if its interrupted.\n\ndir.create(\"data\", showWarnings = FALSE)\n\ncurl::multi_download(\n  \"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv\",\n  \"data/seattle-library-checkouts.csv\",\n  resume = TRUE\n)\n#&gt; # A tibble: 1 × 10\n#&gt;   success status_code resumefrom url                    destfile        error\n#&gt;   &lt;lgl&gt;         &lt;int&gt;      &lt;dbl&gt; &lt;chr&gt;                  &lt;chr&gt;           &lt;chr&gt;\n#&gt; 1 TRUE            200          0 https://r4ds.s3.us-we… data/seattle-l… &lt;NA&gt; \n#&gt; # ℹ 4 more variables: type &lt;chr&gt;, modified &lt;dttm&gt;, time &lt;dbl&gt;,\n#&gt; #   headers &lt;list&gt;",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>23</span>  <span class='chapter-title'>Arrow</span>"
    ]
  },
  {
    "objectID": "arrow.html#opening-a-dataset",
    "href": "arrow.html#opening-a-dataset",
    "title": "23  Arrow",
    "section": "23.3 Opening a dataset",
    "text": "23.3 Opening a dataset\nLet’s start by taking a look at the data. At 9 GB, this file is large enough that we probably don’t want to load the whole thing into memory. A good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 GB. This means we want to avoid read_csv() and instead use the arrow::open_dataset():\n\nseattle_csv &lt;- open_dataset(\n  sources = \"data/seattle-library-checkouts.csv\", \n  col_types = schema(ISBN = string()),\n  format = \"csv\"\n)\n\nWhat happens when this code is run? open_dataset() will scan a few thousand rows to figure out the structure of the dataset. The ISBN column contains blank values for the first 80,000 rows, so we have to specify the column type to help arrow work out the data structure. Once the data has been scanned by open_dataset(), it records what it’s found and stops; it will only read further rows as you specifically request them. This metadata is what we see if we print seattle_csv:\n\nseattle_csv\n#&gt; FileSystemDataset with 1 csv file\n#&gt; UsageClass: string\n#&gt; CheckoutType: string\n#&gt; MaterialType: string\n#&gt; CheckoutYear: int64\n#&gt; CheckoutMonth: int64\n#&gt; Checkouts: int64\n#&gt; Title: string\n#&gt; ISBN: string\n#&gt; Creator: string\n#&gt; Subjects: string\n#&gt; Publisher: string\n#&gt; PublicationYear: string\n\nThe first line in the output tells you that seattle_csv is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed. The remainder of the output tells you the column type that arrow has imputed for each column.\nWe can see what’s actually in with glimpse(). This reveals that there are ~41 million rows and 12 columns, and shows us a few values.\n\nseattle_csv |&gt; glimpse()\n#&gt; FileSystemDataset with 1 csv file\n#&gt; 41,389,465 rows x 12 columns\n#&gt; $ UsageClass      &lt;string&gt; \"Physical\", \"Physical\", \"Digital\", \"Physical\", \"Ph…\n#&gt; $ CheckoutType    &lt;string&gt; \"Horizon\", \"Horizon\", \"OverDrive\", \"Horizon\", \"Hor…\n#&gt; $ MaterialType    &lt;string&gt; \"BOOK\", \"BOOK\", \"EBOOK\", \"BOOK\", \"SOUNDDISC\", \"BOO…\n#&gt; $ CheckoutYear     &lt;int64&gt; 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…\n#&gt; $ CheckoutMonth    &lt;int64&gt; 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…\n#&gt; $ Checkouts        &lt;int64&gt; 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…\n#&gt; $ Title           &lt;string&gt; \"Super rich : a guide to having it all / Russell S…\n#&gt; $ ISBN            &lt;string&gt; \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\", \"\"…\n#&gt; $ Creator         &lt;string&gt; \"Simmons, Russell\", \"Barclay, James, 1965-\", \"Tim …\n#&gt; $ Subjects        &lt;string&gt; \"Self realization, Conduct of life, Attitude Psych…\n#&gt; $ Publisher       &lt;string&gt; \"Gotham Books,\", \"Pyr,\", \"Random House, Inc.\", \"Di…\n#&gt; $ PublicationYear &lt;string&gt; \"c2011.\", \"2010.\", \"2015\", \"2005.\", \"c2004.\", \"c20…\n\nWe can start to use this dataset with dplyr verbs, using collect() to force arrow to perform the computation and return some data. For example, this code tells us the total number of checkouts per year:\n\nseattle_csv |&gt; \n  group_by(CheckoutYear) |&gt; \n  summarise(Checkouts = sum(Checkouts)) |&gt; \n  arrange(CheckoutYear) |&gt; \n  collect()\n#&gt; # A tibble: 18 × 2\n#&gt;   CheckoutYear Checkouts\n#&gt;          &lt;int&gt;     &lt;int&gt;\n#&gt; 1         2005   3798685\n#&gt; 2         2006   6599318\n#&gt; 3         2007   7126627\n#&gt; 4         2008   8438486\n#&gt; 5         2009   9135167\n#&gt; 6         2010   8608966\n#&gt; # ℹ 12 more rows\n\nThanks to arrow, this code will work regardless of how large the underlying dataset is. But it’s currently rather slow: on Hadley’s computer, it took ~10s to run. That’s not terrible given how much data we have, but we can make it much faster by switching to a better format.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>23</span>  <span class='chapter-title'>Arrow</span>"
    ]
  },
  {
    "objectID": "arrow.html#sec-parquet",
    "href": "arrow.html#sec-parquet",
    "title": "23  Arrow",
    "section": "23.4 The parquet format",
    "text": "23.4 The parquet format\nTo make this data easier to work with, let’s switch to the parquet file format and split it up into multiple files. The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.\n\n23.4.1 Advantages of parquet\nLike CSV, parquet is used for rectangular data, but instead of being a text format that you can read with any file editor, it’s a custom binary format designed specifically for the needs of big data. This means that:\n\nParquet files are usually smaller than the equivalent CSV file. Parquet relies on efficient encodings to keep file size down, and supports file compression. This helps make parquet files fast because there’s less data to move from disk to memory.\nParquet files have a rich type system. As we talked about in Section 8.3, a CSV file does not provide any information about column types. For example, a CSV reader has to guess whether \"08-10-2022\" should be parsed as a string or a date. In contrast, parquet files store data in a way that records the type along with the data.\nParquet files are “column-oriented”. This means that they’re organized column-by-column, much like R’s data frame. This typically leads to better performance for data analysis tasks compared to CSV files, which are organized row-by-row.\nParquet files are “chunked”, which makes it possible to work on different parts of the file at the same time, and, if you’re lucky, to skip some chunks altogether.\n\nThere’s one primary disadvantage to parquet files: they are no longer “human readable”, i.e. if you look at a parquet file using readr::read_file(), you’ll just see a bunch of gibberish.\n\n\n23.4.2 Partitioning\nAs datasets get larger and larger, storing all the data in a single file gets increasingly painful and it’s often useful to split large datasets across many files. When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.\nThere are no hard and fast rules about how to partition your dataset: the results will depend on your data, access patterns, and the systems that read the data. You’re likely to need to do some experimentation before you find the ideal partitioning for your situation. As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files. You should also try to partition by variables that you filter by; as you’ll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.\n\n\n23.4.3 Rewriting the Seattle library data\nLet’s apply these ideas to the Seattle library data to see how they play out in practice. We’re going to partition by CheckoutYear, since it’s likely some analyses will only want to look at recent data and partitioning by year yields 18 chunks of a reasonable size.\nTo rewrite the data we define the partition using dplyr::group_by() and then save the partitions to a directory with arrow::write_dataset(). write_dataset() has two important arguments: a directory where we’ll create the files and the format we’ll use.\n\npq_path &lt;- \"data/seattle-library-checkouts\"\n\n\nseattle_csv |&gt;\n  group_by(CheckoutYear) |&gt;\n  write_dataset(path = pq_path, format = \"parquet\")\n\nThis takes about a minute to run; as we’ll see shortly this is an initial investment that pays off by making future operations much much faster.\nLet’s take a look at what we just produced:\n\ntibble(\n  files = list.files(pq_path, recursive = TRUE),\n  size_MB = file.size(file.path(pq_path, files)) / 1024^2\n)\n#&gt; # A tibble: 18 × 2\n#&gt;   files                            size_MB\n#&gt;   &lt;chr&gt;                              &lt;dbl&gt;\n#&gt; 1 CheckoutYear=2005/part-0.parquet    109.\n#&gt; 2 CheckoutYear=2006/part-0.parquet    164.\n#&gt; 3 CheckoutYear=2007/part-0.parquet    178.\n#&gt; 4 CheckoutYear=2008/part-0.parquet    195.\n#&gt; 5 CheckoutYear=2009/part-0.parquet    214.\n#&gt; 6 CheckoutYear=2010/part-0.parquet    222.\n#&gt; # ℹ 12 more rows\n\nOur single 9GB CSV file has been rewritten into 18 parquet files. The file names use a “self-describing” convention used by the Apache Hive project. Hive-style partitions name folders with a “key=value” convention, so as you might guess, the CheckoutYear=2005 directory contains all the data where CheckoutYear is 2005. Each file is between 100 and 300 MB and the total size is now around 4 GB, a little over half the size of the original CSV file. This is as we expect since parquet is a much more efficient format.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>23</span>  <span class='chapter-title'>Arrow</span>"
    ]
  },
  {
    "objectID": "arrow.html#using-dplyr-with-arrow",
    "href": "arrow.html#using-dplyr-with-arrow",
    "title": "23  Arrow",
    "section": "23.5 Using dplyr with arrow",
    "text": "23.5 Using dplyr with arrow\nNow we’ve created these parquet files, we’ll need to read them in again. We use open_dataset() again, but this time we give it a directory:\n\nseattle_pq &lt;- open_dataset(pq_path)\n\nNow we can write our dplyr pipeline. For example, we could count the total number of books checked out in each month for the last five years:\n\nquery &lt;- seattle_pq |&gt; \n  filter(CheckoutYear &gt;= 2018, MaterialType == \"BOOK\") |&gt;\n  group_by(CheckoutYear, CheckoutMonth) |&gt;\n  summarize(TotalCheckouts = sum(Checkouts)) |&gt;\n  arrange(CheckoutYear, CheckoutMonth)\n\nWriting dplyr code for arrow data is conceptually similar to dbplyr, Chapter 22: you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call collect(). If we print out the query object we can see a little information about what we expect Arrow to return when the execution takes place:\n\nquery\n#&gt; FileSystemDataset (query)\n#&gt; CheckoutYear: int32\n#&gt; CheckoutMonth: int64\n#&gt; TotalCheckouts: int64\n#&gt; \n#&gt; * Grouped by CheckoutYear\n#&gt; * Sorted by CheckoutYear [asc], CheckoutMonth [asc]\n#&gt; See $.data for the source Arrow object\n\nAnd we can get the results by calling collect():\n\nquery |&gt; collect()\n#&gt; # A tibble: 58 × 3\n#&gt; # Groups:   CheckoutYear [5]\n#&gt;   CheckoutYear CheckoutMonth TotalCheckouts\n#&gt;          &lt;int&gt;         &lt;int&gt;          &lt;int&gt;\n#&gt; 1         2018             1         355101\n#&gt; 2         2018             2         309813\n#&gt; 3         2018             3         344487\n#&gt; 4         2018             4         330988\n#&gt; 5         2018             5         318049\n#&gt; 6         2018             6         341825\n#&gt; # ℹ 52 more rows\n\nLike dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would. However, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in ?acero.\n\n23.5.1 Performance\nLet’s take a quick look at the performance impact of switching from CSV to parquet. First, let’s time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large csv:\n\nseattle_csv |&gt; \n  filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |&gt;\n  group_by(CheckoutMonth) |&gt;\n  summarize(TotalCheckouts = sum(Checkouts)) |&gt;\n  arrange(desc(CheckoutMonth)) |&gt;\n  collect() |&gt; \n  system.time()\n#&gt;    user  system elapsed \n#&gt;  11.951   1.297  11.387\n\nNow let’s use our new version of the dataset in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:\n\nseattle_pq |&gt; \n  filter(CheckoutYear == 2021, MaterialType == \"BOOK\") |&gt;\n  group_by(CheckoutMonth) |&gt;\n  summarize(TotalCheckouts = sum(Checkouts)) |&gt;\n  arrange(desc(CheckoutMonth)) |&gt;\n  collect() |&gt; \n  system.time()\n#&gt;    user  system elapsed \n#&gt;   0.263   0.058   0.063\n\nThe ~100x speedup in performance is attributable to two factors: the multi-file partitioning, and the format of individual files:\n\nPartitioning improves performance because this query uses CheckoutYear == 2021 to filter the data, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.\nThe parquet format improves performance by storing data in a binary format that can be read more directly into memory. The column-wise format and rich metadata means that arrow only needs to read the four columns actually used in the query (CheckoutYear, MaterialType, CheckoutMonth, and Checkouts).\n\nThis massive difference in performance is why it pays off to convert large CSVs to parquet!\n\n\n23.5.2 Using duckdb with arrow\nThere’s one last advantage of parquet and arrow — it’s very easy to turn an arrow dataset into a DuckDB database (Chapter 22) by calling arrow::to_duckdb():\n\nseattle_pq |&gt; \n  to_duckdb() |&gt;\n  filter(CheckoutYear &gt;= 2018, MaterialType == \"BOOK\") |&gt;\n  group_by(CheckoutYear) |&gt;\n  summarize(TotalCheckouts = sum(Checkouts)) |&gt;\n  arrange(desc(CheckoutYear)) |&gt;\n  collect()\n#&gt; Warning: Missing values are always removed in SQL aggregation functions.\n#&gt; Use `na.rm = TRUE` to silence this warning\n#&gt; This warning is displayed once every 8 hours.\n#&gt; # A tibble: 5 × 2\n#&gt;   CheckoutYear TotalCheckouts\n#&gt;          &lt;int&gt;          &lt;dbl&gt;\n#&gt; 1         2022        2431502\n#&gt; 2         2021        2266438\n#&gt; 3         2020        1241999\n#&gt; 4         2019        3931688\n#&gt; 5         2018        3987569\n\nThe neat thing about to_duckdb() is that the transfer doesn’t involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.\n\n\n23.5.3 Exercises\n\nFigure out the most popular book each year.\nWhich author has the most books in the Seattle library system?\nHow has checkouts of books vs ebooks changed over the last 10 years?",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>23</span>  <span class='chapter-title'>Arrow</span>"
    ]
  },
  {
    "objectID": "arrow.html#summary",
    "href": "arrow.html#summary",
    "title": "23  Arrow",
    "section": "23.6 Summary",
    "text": "23.6 Summary\nIn this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. It can work with CSV files, and it’s much much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.\nNext up you’ll learn about your first non-rectangular data source, which you’ll handle using tools provided by the tidyr package. We’ll focus on data that comes from JSON files, but the general principles apply to tree-like data regardless of its source.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>23</span>  <span class='chapter-title'>Arrow</span>"
    ]
  },
  {
    "objectID": "rectangling.html",
    "href": "rectangling.html",
    "title": "24  Hierarchical data",
    "section": "",
    "text": "24.1 Introduction\nIn this chapter, you’ll learn the art of data rectangling: taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.\nTo learn about rectangling, you’ll need to first learn about lists, the data structure that makes hierarchical data possible. Then you’ll learn about two crucial tidyr functions: tidyr::unnest_longer() and tidyr::unnest_wider(). We’ll then show you a few case studies, applying these simple functions again and again to solve real problems. We’ll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>24</span>  <span class='chapter-title'>Hierarchical data</span>"
    ]
  },
  {
    "objectID": "rectangling.html#introduction",
    "href": "rectangling.html#introduction",
    "title": "24  Hierarchical data",
    "section": "",
    "text": "24.1.1 Prerequisites\nIn this chapter, we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.\n\nlibrary(tidyverse)\nlibrary(repurrrsive)\nlibrary(jsonlite)",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>24</span>  <span class='chapter-title'>Hierarchical data</span>"
    ]
  },
  {
    "objectID": "rectangling.html#lists",
    "href": "rectangling.html#lists",
    "title": "24  Hierarchical data",
    "section": "24.2 Lists",
    "text": "24.2 Lists\nSo far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is of the same data type. If you want to store elements of different types in the same vector, you’ll need a list, which you create with list():\n\nx1 &lt;- list(1:4, \"a\", TRUE)\nx1\n#&gt; [[1]]\n#&gt; [1] 1 2 3 4\n#&gt; \n#&gt; [[2]]\n#&gt; [1] \"a\"\n#&gt; \n#&gt; [[3]]\n#&gt; [1] TRUE\n\nIt’s often convenient to name the components, or children, of a list, which you can do in the same way as naming the columns of a tibble:\n\nx2 &lt;- list(a = 1:2, b = 1:3, c = 1:4)\nx2\n#&gt; $a\n#&gt; [1] 1 2\n#&gt; \n#&gt; $b\n#&gt; [1] 1 2 3\n#&gt; \n#&gt; $c\n#&gt; [1] 1 2 3 4\n\nEven for these very simple lists, printing takes up quite a lot of space. A useful alternative is str(), which generates a compact display of the structure, de-emphasizing the contents:\n\nstr(x1)\n#&gt; List of 3\n#&gt;  $ : int [1:4] 1 2 3 4\n#&gt;  $ : chr \"a\"\n#&gt;  $ : logi TRUE\nstr(x2)\n#&gt; List of 3\n#&gt;  $ a: int [1:2] 1 2\n#&gt;  $ b: int [1:3] 1 2 3\n#&gt;  $ c: int [1:4] 1 2 3 4\n\nAs you can see, str() displays each child of the list on its own line. It displays the name, if present, then an abbreviation of the type, then the first few values.\n\n24.2.1 Hierarchy\nLists can contain any type of object, including other lists. This makes them suitable for representing hierarchical (tree-like) structures:\n\nx3 &lt;- list(list(1, 2), list(3, 4))\nstr(x3)\n#&gt; List of 2\n#&gt;  $ :List of 2\n#&gt;   ..$ : num 1\n#&gt;   ..$ : num 2\n#&gt;  $ :List of 2\n#&gt;   ..$ : num 3\n#&gt;   ..$ : num 4\n\nThis is notably different to c(), which generates a flat vector:\n\nc(c(1, 2), c(3, 4))\n#&gt; [1] 1 2 3 4\n\nx4 &lt;- c(list(1, 2), list(3, 4))\nstr(x4)\n#&gt; List of 4\n#&gt;  $ : num 1\n#&gt;  $ : num 2\n#&gt;  $ : num 3\n#&gt;  $ : num 4\n\nAs lists get more complex, str() gets more useful, as it lets you see the hierarchy at a glance:\n\nx5 &lt;- list(1, list(2, list(3, list(4, list(5)))))\nstr(x5)\n#&gt; List of 2\n#&gt;  $ : num 1\n#&gt;  $ :List of 2\n#&gt;   ..$ : num 2\n#&gt;   ..$ :List of 2\n#&gt;   .. ..$ : num 3\n#&gt;   .. ..$ :List of 2\n#&gt;   .. .. ..$ : num 4\n#&gt;   .. .. ..$ :List of 1\n#&gt;   .. .. .. ..$ : num 5\n\nAs lists get even larger and more complex, str() eventually starts to fail, and you’ll need to switch to View()1. Figure 24.1 shows the result of calling View(x5). The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in Figure 24.2. RStudio will also show you the code you need to access that element, as in Figure 24.3. We’ll come back to how this code works in Section 28.3.\n\n\n\n\n\n\n\n\nFigure 24.1: The RStudio view lets you interactively explore a complex list. The viewer opens showing only the top level of the list.\n\n\n\n\n\n\n\n\n\n\n\n\n\nFigure 24.2: Clicking on the rightward facing triangle expands that component of the list so that you can also see its children.\n\n\n\n\n\n\n\n\n\n\n\n\n\nFigure 24.3: You can repeat this operation as many times as needed to get to the data you’re interested in. Note the bottom-left corner: if you click an element of the list, RStudio will give you the subsetting code needed to access it, in this case x5[[2]][[2]][[2]].\n\n\n\n\n\n\n\n24.2.2 List-columns\nLists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to place objects in a tibble that wouldn’t usually belong in there. In particular, list-columns are used a lot in the tidymodels ecosystem, because they allow you to store things like model outputs or resamples in a data frame.\nHere’s a simple example of a list-column:\n\ndf &lt;- tibble(\n  x = 1:2, \n  y = c(\"a\", \"b\"),\n  z = list(list(1, 2), list(3, 4, 5))\n)\ndf\n#&gt; # A tibble: 2 × 3\n#&gt;       x y     z         \n#&gt;   &lt;int&gt; &lt;chr&gt; &lt;list&gt;    \n#&gt; 1     1 a     &lt;list [2]&gt;\n#&gt; 2     2 b     &lt;list [3]&gt;\n\nThere’s nothing special about lists in a tibble; they behave like any other column:\n\ndf |&gt; \n  filter(x == 1)\n#&gt; # A tibble: 1 × 3\n#&gt;       x y     z         \n#&gt;   &lt;int&gt; &lt;chr&gt; &lt;list&gt;    \n#&gt; 1     1 a     &lt;list [2]&gt;\n\nComputing with list-columns is harder, but that’s because computing with lists is harder in general; we’ll come back to that in Chapter 27. In this chapter, we’ll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.\nThe default print method just displays a rough summary of the contents. The list column could be arbitrarily complex, so there’s no good way to print it. If you want to see it, you’ll need to pull out just the one list-column and apply one of the techniques that you’ve learned above, like df |&gt; pull(z) |&gt; str() or df |&gt; pull(z) |&gt; View().\n\n\n\n\n\n\nBase R\n\n\n\nIt’s possible to put a list in a column of a data.frame, but it’s a lot fiddlier because data.frame() treats a list as a list of columns:\n\ndata.frame(x = list(1:3, 3:5))\n#&gt;   x.1.3 x.3.5\n#&gt; 1     1     3\n#&gt; 2     2     4\n#&gt; 3     3     5\n\nYou can force data.frame() to treat a list as a list of rows by wrapping it in list I(), but the result doesn’t print particularly well:\n\ndata.frame(\n  x = I(list(1:2, 3:5)), \n  y = c(\"1, 2\", \"3, 4, 5\")\n)\n#&gt;         x       y\n#&gt; 1    1, 2    1, 2\n#&gt; 2 3, 4, 5 3, 4, 5\n\nIt’s easier to use list-columns with tibbles because tibble() treats lists like vectors and the print method has been designed with lists in mind.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>24</span>  <span class='chapter-title'>Hierarchical data</span>"
    ]
  },
  {
    "objectID": "rectangling.html#unnesting",
    "href": "rectangling.html#unnesting",
    "title": "24  Hierarchical data",
    "section": "24.3 Unnesting",
    "text": "24.3 Unnesting\nNow that you’ve learned the basics of lists and list-columns, let’s explore how you can turn them back into regular rows and columns. Here we’ll use very simple sample data so you can get the basic idea; in the next section we’ll switch to real data.\nList-columns tend to come in two basic forms: named and unnamed. When the children are named, they tend to have the same names in every row. For example, in df1, every element of list-column y has two elements named a and b. Named list-columns naturally unnest into columns: each named element becomes a new named column.\n\ndf1 &lt;- tribble(\n  ~x, ~y,\n  1, list(a = 11, b = 12),\n  2, list(a = 21, b = 22),\n  3, list(a = 31, b = 32),\n)\n\nWhen the children are unnamed, the number of elements tends to vary from row-to-row. For example, in df2, the elements of list-column y are unnamed and vary in length from one to three. Unnamed list-columns naturally unnest into rows: you’ll get one row for each child.\n\n\ndf2 &lt;- tribble(\n  ~x, ~y,\n  1, list(11, 12, 13),\n  2, list(21),\n  3, list(31, 32),\n)\n\ntidyr provides two functions for these two cases: unnest_wider() and unnest_longer(). The following sections explain how they work.\n\n24.3.1 unnest_wider()\nWhen each row has the same number of elements with the same names, like df1, it’s natural to put each component into its own column with unnest_wider():\n\ndf1 |&gt; \n  unnest_wider(y)\n#&gt; # A tibble: 3 × 3\n#&gt;       x     a     b\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1    11    12\n#&gt; 2     2    21    22\n#&gt; 3     3    31    32\n\nBy default, the names of the new columns come exclusively from the names of the list elements, but you can use the names_sep argument to request that they combine the column name and the element name. This is useful for disambiguating repeated names.\n\ndf1 |&gt; \n  unnest_wider(y, names_sep = \"_\")\n#&gt; # A tibble: 3 × 3\n#&gt;       x   y_a   y_b\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1    11    12\n#&gt; 2     2    21    22\n#&gt; 3     3    31    32\n\n\n\n24.3.2 unnest_longer()\nWhen each row contains an unnamed list, it’s most natural to put each element into its own row with unnest_longer():\n\ndf2 |&gt; \n  unnest_longer(y)\n#&gt; # A tibble: 6 × 2\n#&gt;       x     y\n#&gt;   &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1    11\n#&gt; 2     1    12\n#&gt; 3     1    13\n#&gt; 4     2    21\n#&gt; 5     3    31\n#&gt; 6     3    32\n\nNote how x is duplicated for each element inside of y: we get one row of output for each element inside the list-column. But what happens if one of the elements is empty, as in the following example?\n\ndf6 &lt;- tribble(\n  ~x, ~y,\n  \"a\", list(1, 2),\n  \"b\", list(3),\n  \"c\", list()\n)\ndf6 |&gt; unnest_longer(y)\n#&gt; # A tibble: 3 × 2\n#&gt;   x         y\n#&gt;   &lt;chr&gt; &lt;dbl&gt;\n#&gt; 1 a         1\n#&gt; 2 a         2\n#&gt; 3 b         3\n\nWe get zero rows in the output, so the row effectively disappears. If you want to preserve that row, adding NA in y, set keep_empty = TRUE.\n\n\n24.3.3 Inconsistent types\nWhat happens if you unnest a list-column that contains different types of vector? For example, take the following dataset where the list-column y contains two numbers, a character, and a logical, which can’t normally be mixed in a single column.\n\ndf4 &lt;- tribble(\n  ~x, ~y,\n  \"a\", list(1),\n  \"b\", list(\"a\", TRUE, 5)\n)\n\nunnest_longer() always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does unnest_longer() produce five rows while keeping everything in y?\n\ndf4 |&gt; \n  unnest_longer(y)\n#&gt; # A tibble: 4 × 2\n#&gt;   x     y        \n#&gt;   &lt;chr&gt; &lt;list&gt;   \n#&gt; 1 a     &lt;dbl [1]&gt;\n#&gt; 2 b     &lt;chr [1]&gt;\n#&gt; 3 b     &lt;lgl [1]&gt;\n#&gt; 4 b     &lt;dbl [1]&gt;\n\nAs you can see, the output contains a list-column, but every element of the list-column contains a single element. Because unnest_longer() can’t find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. It doesn’t: every element is a list, even though the contents are of different types.\nDealing with inconsistent types is challenging and the details depend on the precise nature of the problem and your goals, but you’ll most likely need tools from Chapter 27.\n\n\n24.3.4 Other functions\ntidyr has a few other useful rectangling functions that we’re not going to cover in this book:\n\nunnest_auto() automatically picks between unnest_longer() and unnest_wider() based on the structure of the list-column. It’s great for rapid exploration, but ultimately it’s a bad idea because it doesn’t force you to understand how your data is structured, and makes your code harder to understand.\nunnest() expands both rows and columns. It’s useful when you have a list-column that contains a 2d structure like a data frame, which you don’t see in this book, but you might encounter if you use the tidymodels ecosystem.\n\nThese functions are good to know about as you might encounter them when reading other people’s code or tackling rarer rectangling challenges yourself.\n\n\n24.3.5 Exercises\n\nWhat happens when you use unnest_wider() with unnamed list-columns like df2? What argument is now necessary? What happens to missing values?\nWhat happens when you use unnest_longer() with named list-columns like df1? What additional information do you get in the output? How can you suppress that extra detail?\nFrom time-to-time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of y and z are aligned (i.e. y and z will always have the same length within a row, and the first value of y corresponds to the first value of z). What happens if you apply two unnest_longer() calls to this data frame? How can you preserve the relationship between x and y? (Hint: carefully read the docs).\n\ndf4 &lt;- tribble(\n  ~x, ~y, ~z,\n  \"a\", list(\"y-a-1\", \"y-a-2\"), list(\"z-a-1\", \"z-a-2\"),\n  \"b\", list(\"y-b-1\", \"y-b-2\", \"y-b-3\"), list(\"z-b-1\", \"z-b-2\", \"z-b-3\")\n)",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>24</span>  <span class='chapter-title'>Hierarchical data</span>"
    ]
  },
  {
    "objectID": "rectangling.html#case-studies",
    "href": "rectangling.html#case-studies",
    "title": "24  Hierarchical data",
    "section": "24.4 Case studies",
    "text": "24.4 Case studies\nThe main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to unnest_longer() and/or unnest_wider(). To show that in action, this section works through three real rectangling challenges using datasets from the repurrrsive package.\n\n24.4.1 Very wide data\nWe’ll start with gh_repos. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; we recommend exploring a little on your own with View(gh_repos) before we continue.\ngh_repos is a list, but our tools work with list-columns, so we’ll begin by putting it into a tibble. We call this column json for reasons we’ll get to later.\n\nrepos &lt;- tibble(json = gh_repos)\nrepos\n#&gt; # A tibble: 6 × 1\n#&gt;   json       \n#&gt;   &lt;list&gt;     \n#&gt; 1 &lt;list [30]&gt;\n#&gt; 2 &lt;list [30]&gt;\n#&gt; 3 &lt;list [30]&gt;\n#&gt; 4 &lt;list [26]&gt;\n#&gt; 5 &lt;list [30]&gt;\n#&gt; 6 &lt;list [30]&gt;\n\nThis tibble contains 6 rows, one row for each child of gh_repos. Each row contains an unnamed list with either 26 or 30 rows. Since these are unnamed, we’ll start with unnest_longer() to put each child in its own row:\n\nrepos |&gt; \n  unnest_longer(json)\n#&gt; # A tibble: 176 × 1\n#&gt;   json             \n#&gt;   &lt;list&gt;           \n#&gt; 1 &lt;named list [68]&gt;\n#&gt; 2 &lt;named list [68]&gt;\n#&gt; 3 &lt;named list [68]&gt;\n#&gt; 4 &lt;named list [68]&gt;\n#&gt; 5 &lt;named list [68]&gt;\n#&gt; 6 &lt;named list [68]&gt;\n#&gt; # ℹ 170 more rows\n\nAt first glance, it might seem like we haven’t improved the situation: while we have more rows (176 instead of 6) each element of json is still a list. However, there’s an important difference: now each element is a named list so we can use unnest_wider() to put each element into its own column:\n\nrepos |&gt; \n  unnest_longer(json) |&gt; \n  unnest_wider(json) \n#&gt; # A tibble: 176 × 68\n#&gt;         id name        full_name         owner        private html_url       \n#&gt;      &lt;int&gt; &lt;chr&gt;       &lt;chr&gt;             &lt;list&gt;       &lt;lgl&gt;   &lt;chr&gt;          \n#&gt; 1 61160198 after       gaborcsardi/after &lt;named list&gt; FALSE   https://github…\n#&gt; 2 40500181 argufy      gaborcsardi/argu… &lt;named list&gt; FALSE   https://github…\n#&gt; 3 36442442 ask         gaborcsardi/ask   &lt;named list&gt; FALSE   https://github…\n#&gt; 4 34924886 baseimports gaborcsardi/base… &lt;named list&gt; FALSE   https://github…\n#&gt; 5 61620661 citest      gaborcsardi/cite… &lt;named list&gt; FALSE   https://github…\n#&gt; 6 33907457 clisymbols  gaborcsardi/clis… &lt;named list&gt; FALSE   https://github…\n#&gt; # ℹ 170 more rows\n#&gt; # ℹ 62 more variables: description &lt;chr&gt;, fork &lt;lgl&gt;, url &lt;chr&gt;, …\n\nThis has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with names(); and here we look at the first 10:\n\nrepos |&gt; \n  unnest_longer(json) |&gt; \n  unnest_wider(json) |&gt; \n  names() |&gt; \n  head(10)\n#&gt;  [1] \"id\"          \"name\"        \"full_name\"   \"owner\"       \"private\"    \n#&gt;  [6] \"html_url\"    \"description\" \"fork\"        \"url\"         \"forks_url\"\n\nLet’s pull out a few that look interesting:\n\nrepos |&gt; \n  unnest_longer(json) |&gt; \n  unnest_wider(json) |&gt; \n  select(id, full_name, owner, description)\n#&gt; # A tibble: 176 × 4\n#&gt;         id full_name               owner             description             \n#&gt;      &lt;int&gt; &lt;chr&gt;                   &lt;list&gt;            &lt;chr&gt;                   \n#&gt; 1 61160198 gaborcsardi/after       &lt;named list [17]&gt; Run Code in the Backgro…\n#&gt; 2 40500181 gaborcsardi/argufy      &lt;named list [17]&gt; Declarative function ar…\n#&gt; 3 36442442 gaborcsardi/ask         &lt;named list [17]&gt; Friendly CLI interactio…\n#&gt; 4 34924886 gaborcsardi/baseimports &lt;named list [17]&gt; Do we get warnings for …\n#&gt; 5 61620661 gaborcsardi/citest      &lt;named list [17]&gt; Test R package and repo…\n#&gt; 6 33907457 gaborcsardi/clisymbols  &lt;named list [17]&gt; Unicode symbols for CLI…\n#&gt; # ℹ 170 more rows\n\nYou can use this to work back to understand how gh_repos was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.\nowner is another list-column, and since it contains a named list, we can use unnest_wider() to get at the values:\n\nrepos |&gt; \n  unnest_longer(json) |&gt; \n  unnest_wider(json) |&gt; \n  select(id, full_name, owner, description) |&gt; \n  unnest_wider(owner)\n#&gt; Error in `unnest_wider()`:\n#&gt; ! Can't duplicate names between the affected columns and the original\n#&gt;   data.\n#&gt; ✖ These names are duplicated:\n#&gt;   ℹ `id`, from `owner`.\n#&gt; ℹ Use `names_sep` to disambiguate using the column name.\n#&gt; ℹ Or use `names_repair` to specify a repair strategy.\n\nUh oh, this list column also contains an id column and we can’t have two id columns in the same data frame. As suggested, lets use names_sep to resolve the problem:\n\nrepos |&gt; \n  unnest_longer(json) |&gt; \n  unnest_wider(json) |&gt; \n  select(id, full_name, owner, description) |&gt; \n  unnest_wider(owner, names_sep = \"_\")\n#&gt; # A tibble: 176 × 20\n#&gt;         id full_name               owner_login owner_id owner_avatar_url     \n#&gt;      &lt;int&gt; &lt;chr&gt;                   &lt;chr&gt;          &lt;int&gt; &lt;chr&gt;                \n#&gt; 1 61160198 gaborcsardi/after       gaborcsardi   660288 https://avatars.gith…\n#&gt; 2 40500181 gaborcsardi/argufy      gaborcsardi   660288 https://avatars.gith…\n#&gt; 3 36442442 gaborcsardi/ask         gaborcsardi   660288 https://avatars.gith…\n#&gt; 4 34924886 gaborcsardi/baseimports gaborcsardi   660288 https://avatars.gith…\n#&gt; 5 61620661 gaborcsardi/citest      gaborcsardi   660288 https://avatars.gith…\n#&gt; 6 33907457 gaborcsardi/clisymbols  gaborcsardi   660288 https://avatars.gith…\n#&gt; # ℹ 170 more rows\n#&gt; # ℹ 15 more variables: owner_gravatar_id &lt;chr&gt;, owner_url &lt;chr&gt;, …\n\nThis gives another wide dataset, but you can get the sense that owner appears to contain a lot of additional data about the person who “owns” the repository.\n\n\n24.4.2 Relational data\nNested data is sometimes used to represent data that we’d usually spread across multiple data frames. For example, take got_chars which contains data about characters that appear in the Game of Thrones books and TV series. Like gh_repos it’s a list, so we start by turning it into a list-column of a tibble:\n\nchars &lt;- tibble(json = got_chars)\nchars\n#&gt; # A tibble: 30 × 1\n#&gt;   json             \n#&gt;   &lt;list&gt;           \n#&gt; 1 &lt;named list [18]&gt;\n#&gt; 2 &lt;named list [18]&gt;\n#&gt; 3 &lt;named list [18]&gt;\n#&gt; 4 &lt;named list [18]&gt;\n#&gt; 5 &lt;named list [18]&gt;\n#&gt; 6 &lt;named list [18]&gt;\n#&gt; # ℹ 24 more rows\n\nThe json column contains named elements, so we’ll start by widening it:\n\nchars |&gt; \n  unnest_wider(json)\n#&gt; # A tibble: 30 × 18\n#&gt;   url                    id name            gender culture    born           \n#&gt;   &lt;chr&gt;               &lt;int&gt; &lt;chr&gt;           &lt;chr&gt;  &lt;chr&gt;      &lt;chr&gt;          \n#&gt; 1 https://www.anapio…  1022 Theon Greyjoy   Male   \"Ironborn\" \"In 278 AC or …\n#&gt; 2 https://www.anapio…  1052 Tyrion Lannist… Male   \"\"         \"In 273 AC, at…\n#&gt; 3 https://www.anapio…  1074 Victarion Grey… Male   \"Ironborn\" \"In 268 AC or …\n#&gt; 4 https://www.anapio…  1109 Will            Male   \"\"         \"\"             \n#&gt; 5 https://www.anapio…  1166 Areo Hotah      Male   \"Norvoshi\" \"In 257 AC or …\n#&gt; 6 https://www.anapio…  1267 Chett           Male   \"\"         \"At Hag's Mire\"\n#&gt; # ℹ 24 more rows\n#&gt; # ℹ 12 more variables: died &lt;chr&gt;, alive &lt;lgl&gt;, titles &lt;list&gt;, …\n\nAnd selecting a few columns to make it easier to read:\n\ncharacters &lt;- chars |&gt; \n  unnest_wider(json) |&gt; \n  select(id, name, gender, culture, born, died, alive)\ncharacters\n#&gt; # A tibble: 30 × 7\n#&gt;      id name              gender culture    born              died           \n#&gt;   &lt;int&gt; &lt;chr&gt;             &lt;chr&gt;  &lt;chr&gt;      &lt;chr&gt;             &lt;chr&gt;          \n#&gt; 1  1022 Theon Greyjoy     Male   \"Ironborn\" \"In 278 AC or 27… \"\"             \n#&gt; 2  1052 Tyrion Lannister  Male   \"\"         \"In 273 AC, at C… \"\"             \n#&gt; 3  1074 Victarion Greyjoy Male   \"Ironborn\" \"In 268 AC or be… \"\"             \n#&gt; 4  1109 Will              Male   \"\"         \"\"                \"In 297 AC, at…\n#&gt; 5  1166 Areo Hotah        Male   \"Norvoshi\" \"In 257 AC or be… \"\"             \n#&gt; 6  1267 Chett             Male   \"\"         \"At Hag's Mire\"   \"In 299 AC, at…\n#&gt; # ℹ 24 more rows\n#&gt; # ℹ 1 more variable: alive &lt;lgl&gt;\n\nThis dataset contains also many list-columns:\n\nchars |&gt; \n  unnest_wider(json) |&gt; \n  select(id, where(is.list))\n#&gt; # A tibble: 30 × 8\n#&gt;      id titles    aliases    allegiances books     povBooks tvSeries playedBy\n#&gt;   &lt;int&gt; &lt;list&gt;    &lt;list&gt;     &lt;list&gt;      &lt;list&gt;    &lt;list&gt;   &lt;list&gt;   &lt;list&gt;  \n#&gt; 1  1022 &lt;chr [2]&gt; &lt;chr [4]&gt;  &lt;chr [1]&gt;   &lt;chr [3]&gt; &lt;chr&gt;    &lt;chr&gt;    &lt;chr&gt;   \n#&gt; 2  1052 &lt;chr [2]&gt; &lt;chr [11]&gt; &lt;chr [1]&gt;   &lt;chr [2]&gt; &lt;chr&gt;    &lt;chr&gt;    &lt;chr&gt;   \n#&gt; 3  1074 &lt;chr [2]&gt; &lt;chr [1]&gt;  &lt;chr [1]&gt;   &lt;chr [3]&gt; &lt;chr&gt;    &lt;chr&gt;    &lt;chr&gt;   \n#&gt; 4  1109 &lt;chr [1]&gt; &lt;chr [1]&gt;  &lt;NULL&gt;      &lt;chr [1]&gt; &lt;chr&gt;    &lt;chr&gt;    &lt;chr&gt;   \n#&gt; 5  1166 &lt;chr [1]&gt; &lt;chr [1]&gt;  &lt;chr [1]&gt;   &lt;chr [3]&gt; &lt;chr&gt;    &lt;chr&gt;    &lt;chr&gt;   \n#&gt; 6  1267 &lt;chr [1]&gt; &lt;chr [1]&gt;  &lt;NULL&gt;      &lt;chr [2]&gt; &lt;chr&gt;    &lt;chr&gt;    &lt;chr&gt;   \n#&gt; # ℹ 24 more rows\n\nLet’s explore the titles column. It’s an unnamed list-column, so we’ll unnest it into rows:\n\nchars |&gt; \n  unnest_wider(json) |&gt; \n  select(id, titles) |&gt; \n  unnest_longer(titles)\n#&gt; # A tibble: 59 × 2\n#&gt;      id titles                                              \n#&gt;   &lt;int&gt; &lt;chr&gt;                                               \n#&gt; 1  1022 Prince of Winterfell                                \n#&gt; 2  1022 Lord of the Iron Islands (by law of the green lands)\n#&gt; 3  1052 Acting Hand of the King (former)                    \n#&gt; 4  1052 Master of Coin (former)                             \n#&gt; 5  1074 Lord Captain of the Iron Fleet                      \n#&gt; 6  1074 Master of the Iron Victory                          \n#&gt; # ℹ 53 more rows\n\nYou might expect to see this data in its own table because it would be easy to join to the characters data as needed. Let’s do that, which requires little cleaning: removing the rows containing empty strings and renaming titles to title since each row now only contains a single title.\n\ntitles &lt;- chars |&gt; \n  unnest_wider(json) |&gt; \n  select(id, titles) |&gt; \n  unnest_longer(titles) |&gt; \n  filter(titles != \"\") |&gt; \n  rename(title = titles)\ntitles\n#&gt; # A tibble: 52 × 2\n#&gt;      id title                                               \n#&gt;   &lt;int&gt; &lt;chr&gt;                                               \n#&gt; 1  1022 Prince of Winterfell                                \n#&gt; 2  1022 Lord of the Iron Islands (by law of the green lands)\n#&gt; 3  1052 Acting Hand of the King (former)                    \n#&gt; 4  1052 Master of Coin (former)                             \n#&gt; 5  1074 Lord Captain of the Iron Fleet                      \n#&gt; 6  1074 Master of the Iron Victory                          \n#&gt; # ℹ 46 more rows\n\nYou could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.\n\n\n24.4.3 Deeply nested\nWe’ll finish off these case studies with a list-column that’s very deeply nested and requires repeated rounds of unnest_wider() and unnest_longer() to unravel: gmaps_cities. This is a two column tibble containing five city names and the results of using Google’s geocoding API to determine their location:\n\ngmaps_cities\n#&gt; # A tibble: 5 × 2\n#&gt;   city       json            \n#&gt;   &lt;chr&gt;      &lt;list&gt;          \n#&gt; 1 Houston    &lt;named list [2]&gt;\n#&gt; 2 Washington &lt;named list [2]&gt;\n#&gt; 3 New York   &lt;named list [2]&gt;\n#&gt; 4 Chicago    &lt;named list [2]&gt;\n#&gt; 5 Arlington  &lt;named list [2]&gt;\n\njson is a list-column with internal names, so we start with an unnest_wider():\n\ngmaps_cities |&gt; \n  unnest_wider(json)\n#&gt; # A tibble: 5 × 3\n#&gt;   city       results    status\n#&gt;   &lt;chr&gt;      &lt;list&gt;     &lt;chr&gt; \n#&gt; 1 Houston    &lt;list [1]&gt; OK    \n#&gt; 2 Washington &lt;list [2]&gt; OK    \n#&gt; 3 New York   &lt;list [1]&gt; OK    \n#&gt; 4 Chicago    &lt;list [1]&gt; OK    \n#&gt; 5 Arlington  &lt;list [2]&gt; OK\n\nThis gives us the status and the results. We’ll drop the status column since they’re all OK; in a real analysis, you’d also want to capture all the rows where status != \"OK\" and figure out what went wrong. results is an unnamed list, with either one or two elements (we’ll see why shortly) so we’ll unnest it into rows:\n\ngmaps_cities |&gt; \n  unnest_wider(json) |&gt; \n  select(-status) |&gt; \n  unnest_longer(results)\n#&gt; # A tibble: 7 × 2\n#&gt;   city       results         \n#&gt;   &lt;chr&gt;      &lt;list&gt;          \n#&gt; 1 Houston    &lt;named list [5]&gt;\n#&gt; 2 Washington &lt;named list [5]&gt;\n#&gt; 3 Washington &lt;named list [5]&gt;\n#&gt; 4 New York   &lt;named list [5]&gt;\n#&gt; 5 Chicago    &lt;named list [5]&gt;\n#&gt; 6 Arlington  &lt;named list [5]&gt;\n#&gt; # ℹ 1 more row\n\nNow results is a named list, so we’ll use unnest_wider():\n\nlocations &lt;- gmaps_cities |&gt; \n  unnest_wider(json) |&gt; \n  select(-status) |&gt; \n  unnest_longer(results) |&gt; \n  unnest_wider(results)\nlocations\n#&gt; # A tibble: 7 × 6\n#&gt;   city       address_components formatted_address   geometry        \n#&gt;   &lt;chr&gt;      &lt;list&gt;             &lt;chr&gt;               &lt;list&gt;          \n#&gt; 1 Houston    &lt;list [4]&gt;         Houston, TX, USA    &lt;named list [4]&gt;\n#&gt; 2 Washington &lt;list [2]&gt;         Washington, USA     &lt;named list [4]&gt;\n#&gt; 3 Washington &lt;list [4]&gt;         Washington, DC, USA &lt;named list [4]&gt;\n#&gt; 4 New York   &lt;list [3]&gt;         New York, NY, USA   &lt;named list [4]&gt;\n#&gt; 5 Chicago    &lt;list [4]&gt;         Chicago, IL, USA    &lt;named list [4]&gt;\n#&gt; 6 Arlington  &lt;list [4]&gt;         Arlington, TX, USA  &lt;named list [4]&gt;\n#&gt; # ℹ 1 more row\n#&gt; # ℹ 2 more variables: place_id &lt;chr&gt;, types &lt;list&gt;\n\nNow we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.\nThere are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the geometry list-column:\n\nlocations |&gt; \n  select(city, formatted_address, geometry) |&gt; \n  unnest_wider(geometry)\n#&gt; # A tibble: 7 × 6\n#&gt;   city       formatted_address   bounds           location     location_type\n#&gt;   &lt;chr&gt;      &lt;chr&gt;               &lt;list&gt;           &lt;list&gt;       &lt;chr&gt;        \n#&gt; 1 Houston    Houston, TX, USA    &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE  \n#&gt; 2 Washington Washington, USA     &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE  \n#&gt; 3 Washington Washington, DC, USA &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE  \n#&gt; 4 New York   New York, NY, USA   &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE  \n#&gt; 5 Chicago    Chicago, IL, USA    &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE  \n#&gt; 6 Arlington  Arlington, TX, USA  &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE  \n#&gt; # ℹ 1 more row\n#&gt; # ℹ 1 more variable: viewport &lt;list&gt;\n\nThat gives us new bounds (a rectangular region) and location (a point). We can unnest location to see the latitude (lat) and longitude (lng):\n\nlocations |&gt; \n  select(city, formatted_address, geometry) |&gt; \n  unnest_wider(geometry) |&gt; \n  unnest_wider(location)\n#&gt; # A tibble: 7 × 7\n#&gt;   city       formatted_address   bounds             lat    lng location_type\n#&gt;   &lt;chr&gt;      &lt;chr&gt;               &lt;list&gt;           &lt;dbl&gt;  &lt;dbl&gt; &lt;chr&gt;        \n#&gt; 1 Houston    Houston, TX, USA    &lt;named list [2]&gt;  29.8  -95.4 APPROXIMATE  \n#&gt; 2 Washington Washington, USA     &lt;named list [2]&gt;  47.8 -121.  APPROXIMATE  \n#&gt; 3 Washington Washington, DC, USA &lt;named list [2]&gt;  38.9  -77.0 APPROXIMATE  \n#&gt; 4 New York   New York, NY, USA   &lt;named list [2]&gt;  40.7  -74.0 APPROXIMATE  \n#&gt; 5 Chicago    Chicago, IL, USA    &lt;named list [2]&gt;  41.9  -87.6 APPROXIMATE  \n#&gt; 6 Arlington  Arlington, TX, USA  &lt;named list [2]&gt;  32.7  -97.1 APPROXIMATE  \n#&gt; # ℹ 1 more row\n#&gt; # ℹ 1 more variable: viewport &lt;list&gt;\n\nExtracting the bounds requires a few more steps:\n\nlocations |&gt; \n  select(city, formatted_address, geometry) |&gt; \n  unnest_wider(geometry) |&gt; \n  # focus on the variables of interest\n  select(!location:viewport) |&gt;\n  unnest_wider(bounds)\n#&gt; # A tibble: 7 × 4\n#&gt;   city       formatted_address   northeast        southwest       \n#&gt;   &lt;chr&gt;      &lt;chr&gt;               &lt;list&gt;           &lt;list&gt;          \n#&gt; 1 Houston    Houston, TX, USA    &lt;named list [2]&gt; &lt;named list [2]&gt;\n#&gt; 2 Washington Washington, USA     &lt;named list [2]&gt; &lt;named list [2]&gt;\n#&gt; 3 Washington Washington, DC, USA &lt;named list [2]&gt; &lt;named list [2]&gt;\n#&gt; 4 New York   New York, NY, USA   &lt;named list [2]&gt; &lt;named list [2]&gt;\n#&gt; 5 Chicago    Chicago, IL, USA    &lt;named list [2]&gt; &lt;named list [2]&gt;\n#&gt; 6 Arlington  Arlington, TX, USA  &lt;named list [2]&gt; &lt;named list [2]&gt;\n#&gt; # ℹ 1 more row\n\nWe then rename southwest and northeast (the corners of the rectangle) so we can use names_sep to create short but evocative names:\n\nlocations |&gt; \n  select(city, formatted_address, geometry) |&gt; \n  unnest_wider(geometry) |&gt; \n  select(!location:viewport) |&gt;\n  unnest_wider(bounds) |&gt; \n  rename(ne = northeast, sw = southwest) |&gt; \n  unnest_wider(c(ne, sw), names_sep = \"_\") \n#&gt; # A tibble: 7 × 6\n#&gt;   city       formatted_address   ne_lat ne_lng sw_lat sw_lng\n#&gt;   &lt;chr&gt;      &lt;chr&gt;                &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;\n#&gt; 1 Houston    Houston, TX, USA      30.1  -95.0   29.5  -95.8\n#&gt; 2 Washington Washington, USA       49.0 -117.    45.5 -125. \n#&gt; 3 Washington Washington, DC, USA   39.0  -76.9   38.8  -77.1\n#&gt; 4 New York   New York, NY, USA     40.9  -73.7   40.5  -74.3\n#&gt; 5 Chicago    Chicago, IL, USA      42.0  -87.5   41.6  -87.9\n#&gt; 6 Arlington  Arlington, TX, USA    32.8  -97.0   32.6  -97.2\n#&gt; # ℹ 1 more row\n\nNote how we unnest two columns simultaneously by supplying a vector of variable names to unnest_wider().\nOnce you’ve discovered the path to get to the components you’re interested in, you can extract them directly using another tidyr function, hoist():\n\nlocations |&gt; \n  select(city, formatted_address, geometry) |&gt; \n  hoist(\n    geometry,\n    ne_lat = c(\"bounds\", \"northeast\", \"lat\"),\n    sw_lat = c(\"bounds\", \"southwest\", \"lat\"),\n    ne_lng = c(\"bounds\", \"northeast\", \"lng\"),\n    sw_lng = c(\"bounds\", \"southwest\", \"lng\"),\n  )\n\nIf these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in vignette(\"rectangling\", package = \"tidyr\").\n\n\n24.4.4 Exercises\n\nRoughly estimate when gh_repos was created. Why can you only roughly estimate the date?\nThe owner column of gh_repo contains a lot of duplicated information because each owner can have many repos. Can you construct an owners data frame that contains one row for each owner? (Hint: does distinct() work with list-cols?)\nFollow the steps used for titles to create similar tables for the aliases, allegiances, books, and TV series for the Game of Thrones characters.\nExplain the following code line-by-line. Why is it interesting? Why does it work for got_chars but might not work in general?\n\ntibble(json = got_chars) |&gt; \n  unnest_wider(json) |&gt; \n  select(id, where(is.list)) |&gt; \n  pivot_longer(\n    where(is.list), \n    names_to = \"name\", \n    values_to = \"value\"\n  ) |&gt;  \n  unnest_longer(value)\n\nIn gmaps_cities, what does address_components contain? Why does the length vary between rows? Unnest it appropriately to figure it out. (Hint: types always appears to contain two elements. Does unnest_wider() make it easier to work with than unnest_longer()?) .",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>24</span>  <span class='chapter-title'>Hierarchical data</span>"
    ]
  },
  {
    "objectID": "rectangling.html#json",
    "href": "rectangling.html#json",
    "title": "24  Hierarchical data",
    "section": "24.5 JSON",
    "text": "24.5 JSON\nAll of the case studies in the previous section were sourced from wild-caught JSON. JSON is short for javascript object notation and is the way that most web APIs return data. It’s important to understand it because while JSON and R’s data types are pretty similar, there isn’t a perfect 1-to-1 mapping, so it’s good to understand a bit about JSON if things go wrong.\n\n24.5.1 Data types\nJSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:\n\nThe simplest type is a null (null) which plays the same role as NA in R. It represents the absence of data.\nA string is much like a string in R, but must always use double quotes.\nA number is similar to R’s numbers: they can use integer (e.g., 123), decimal (e.g., 123.45), or scientific (e.g., 1.23e3) notation. JSON doesn’t support Inf, -Inf, or NaN.\nA boolean is similar to R’s TRUE and FALSE, but uses lowercase true and false.\n\nJSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.\nBoth arrays and objects are similar to lists in R; the difference is whether or not they’re named. An array is like an unnamed list, and is written with []. For example [1, 2, 3] is an array containing 3 numbers, and [null, 1, \"string\", false] is an array that contains a null, a number, a string, and a boolean. An object is like a named list, and is written with {}. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, {\"x\": 1, \"y\": 2} is an object that maps x to 1 and y to 2.\nNote that JSON doesn’t have any native way to represent dates or date-times, so they’re often stored as strings, and you’ll need to use readr::parse_date() or readr::parse_datetime() to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers in JSON are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply readr::parse_double() as needed to get the correct variable type.\n\n\n24.5.2 jsonlite\nTo convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. We’ll use only two jsonlite functions: read_json() and parse_json(). In real life, you’ll use read_json() to read a JSON file from disk. For example, the repurrsive package also provides the source for gh_user as a JSON file and you can read it with read_json():\n\n# A path to a json file inside the package:\ngh_users_json()\n#&gt; [1] \"/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/repurrrsive/extdata/gh_users.json\"\n\n# Read it with read_json()\ngh_users2 &lt;- read_json(gh_users_json())\n\n# Check it's the same as the data we were using previously\nidentical(gh_users, gh_users2)\n#&gt; [1] TRUE\n\nIn this book, we’ll also use parse_json(), since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:\n\nstr(parse_json('1'))\n#&gt;  int 1\nstr(parse_json('[1, 2, 3]'))\n#&gt; List of 3\n#&gt;  $ : int 1\n#&gt;  $ : int 2\n#&gt;  $ : int 3\nstr(parse_json('{\"x\": [1, 2, 3]}'))\n#&gt; List of 1\n#&gt;  $ x:List of 3\n#&gt;   ..$ : int 1\n#&gt;   ..$ : int 2\n#&gt;   ..$ : int 3\n\njsonlite has another important function called fromJSON(). We don’t use it here because it performs automatic simplification (simplifyVector = TRUE). This often works well, particularly in simple cases, but we think you’re better off doing the rectangling yourself so you know exactly what’s happening and can more easily handle the most complicated nested structures.\n\n\n24.5.3 Starting the rectangling process\nIn most cases, JSON files contain a single top-level array, because they’re designed to provide data about multiple “things”, e.g., multiple pages, or multiple records, or multiple results. In this case, you’ll start your rectangling with tibble(json) so that each element becomes a row:\n\njson &lt;- '[\n  {\"name\": \"John\", \"age\": 34},\n  {\"name\": \"Susan\", \"age\": 27}\n]'\ndf &lt;- tibble(json = parse_json(json))\ndf\n#&gt; # A tibble: 2 × 1\n#&gt;   json            \n#&gt;   &lt;list&gt;          \n#&gt; 1 &lt;named list [2]&gt;\n#&gt; 2 &lt;named list [2]&gt;\n\ndf |&gt; \n  unnest_wider(json)\n#&gt; # A tibble: 2 × 2\n#&gt;   name    age\n#&gt;   &lt;chr&gt; &lt;int&gt;\n#&gt; 1 John     34\n#&gt; 2 Susan    27\n\nIn rarer cases, the JSON file consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it in a list, before you put it in a tibble.\n\njson &lt;- '{\n  \"status\": \"OK\", \n  \"results\": [\n    {\"name\": \"John\", \"age\": 34},\n    {\"name\": \"Susan\", \"age\": 27}\n ]\n}\n'\ndf &lt;- tibble(json = list(parse_json(json)))\ndf\n#&gt; # A tibble: 1 × 1\n#&gt;   json            \n#&gt;   &lt;list&gt;          \n#&gt; 1 &lt;named list [2]&gt;\n\ndf |&gt; \n  unnest_wider(json) |&gt; \n  unnest_longer(results) |&gt; \n  unnest_wider(results)\n#&gt; # A tibble: 2 × 3\n#&gt;   status name    age\n#&gt;   &lt;chr&gt;  &lt;chr&gt; &lt;int&gt;\n#&gt; 1 OK     John     34\n#&gt; 2 OK     Susan    27\n\nAlternatively, you can reach inside the parsed JSON and start with the bit that you actually care about:\n\ndf &lt;- tibble(results = parse_json(json)$results)\ndf |&gt; \n  unnest_wider(results)\n#&gt; # A tibble: 2 × 2\n#&gt;   name    age\n#&gt;   &lt;chr&gt; &lt;int&gt;\n#&gt; 1 John     34\n#&gt; 2 Susan    27\n\n\n\n24.5.4 Exercises\n\nRectangle the df_col and df_row below. They represent the two ways of encoding a data frame in JSON.\n\njson_col &lt;- parse_json('\n  {\n    \"x\": [\"a\", \"x\", \"z\"],\n    \"y\": [10, null, 3]\n  }\n')\njson_row &lt;- parse_json('\n  [\n    {\"x\": \"a\", \"y\": 10},\n    {\"x\": \"x\", \"y\": null},\n    {\"x\": \"z\", \"y\": 3}\n  ]\n')\n\ndf_col &lt;- tibble(json = list(json_col)) \ndf_row &lt;- tibble(json = json_row)",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>24</span>  <span class='chapter-title'>Hierarchical data</span>"
    ]
  },
  {
    "objectID": "rectangling.html#summary",
    "href": "rectangling.html#summary",
    "title": "24  Hierarchical data",
    "section": "24.6 Summary",
    "text": "24.6 Summary\nIn this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly we only need two new functions: unnest_longer() to put list elements into rows and unnest_wider() to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.\nJSON is the most common data format returned by web APIs. What happens if the website doesn’t have an API, but you can see data you want on the website? That’s the topic of the next chapter: web scraping, extracting data from HTML webpages.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>24</span>  <span class='chapter-title'>Hierarchical data</span>"
    ]
  },
  {
    "objectID": "rectangling.html#footnotes",
    "href": "rectangling.html#footnotes",
    "title": "24  Hierarchical data",
    "section": "",
    "text": "This is an RStudio feature.↩︎",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>24</span>  <span class='chapter-title'>Hierarchical data</span>"
    ]
  },
  {
    "objectID": "webscraping.html",
    "href": "webscraping.html",
    "title": "25  Web scraping",
    "section": "",
    "text": "25.1 Introduction\nThis chapter introduces you to the basics of web scraping with rvest. Web scraping is a very useful tool for extracting data from web pages. Some websites will offer an API, a set of structured HTTP requests that return data as JSON, which you handle using the techniques from Chapter 24. Where possible, you should use the API1, because typically it will give you more reliable data. Unfortunately, however, programming with web APIs is out of scope for this book. Instead, we are teaching scraping, a technique that works whether or not a site provides an API.\nIn this chapter, we’ll first discuss the ethics and legalities of scraping before we dive into the basics of HTML. You’ll then learn the basics of CSS selectors to locate specific elements on the page, and how to use rvest functions to get data from text and attributes out of HTML and into R. We’ll then discuss some techniques to figure out what CSS selector you need for the page you’re scraping, before finishing up with a couple of case studies, and a brief discussion of dynamic websites.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#introduction",
    "href": "webscraping.html#introduction",
    "title": "25  Web scraping",
    "section": "",
    "text": "25.1.1 Prerequisites\nIn this chapter, we’ll focus on tools provided by rvest. rvest is a member of the tidyverse, but is not a core member so you’ll need to load it explicitly. We’ll also load the full tidyverse since we’ll find it generally useful working with the data we’ve scraped.\n\nlibrary(tidyverse)\nlibrary(rvest)",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#scraping-ethics-and-legalities",
    "href": "webscraping.html#scraping-ethics-and-legalities",
    "title": "25  Web scraping",
    "section": "25.2 Scraping ethics and legalities",
    "text": "25.2 Scraping ethics and legalities\nBefore we get started discussing the code you’ll need to perform web scraping, we need to talk about whether it’s legal and ethical for you to do so. Overall, the situation is complicated with regards to both of these.\nLegalities depend a lot on where you live. However, as a general principle, if the data is public, non-personal, and factual, you’re likely to be ok2. These three factors are important because they’re connected to the site’s terms and conditions, personally identifiable information, and copyright, as we’ll discuss below.\nIf the data isn’t public, non-personal, or factual or you’re scraping the data specifically to make money with it, you’ll need to talk to a lawyer. In any case, you should be respectful of the resources of the server hosting the pages you are scraping. Most importantly, this means that if you’re scraping many pages, you should make sure to wait a little between each request. One easy way to do so is to use the polite package by Dmytro Perepolkin. It will automatically pause between requests and cache the results so you never ask for the same page twice.\n\n25.2.1 Terms of service\nIf you look closely, you’ll find many websites include a “terms and conditions” or “terms of service” link somewhere on the page, and if you read that page closely you’ll often discover that the site specifically prohibits web scraping. These pages tend to be a legal land grab where companies make very broad claims. It’s polite to respect these terms of service where possible, but take any claims with a grain of salt.\nUS courts have generally found that simply putting the terms of service in the footer of the website isn’t sufficient for you to be bound by them, e.g., HiQ Labs v. LinkedIn. Generally, to be bound to the terms of service, you must have taken some explicit action like creating an account or checking a box. This is why whether or not the data is public is important; if you don’t need an account to access them, it is unlikely that you are bound to the terms of service. Note, however, the situation is rather different in Europe where courts have found that terms of service are enforceable even if you don’t explicitly agree to them.\n\n\n25.2.2 Personally identifiable information\nEven if the data is public, you should be extremely careful about scraping personally identifiable information like names, email addresses, phone numbers, dates of birth, etc. Europe has particularly strict laws about the collection or storage of such data (GDPR), and regardless of where you live you’re likely to be entering an ethical quagmire. For example, in 2016, a group of researchers scraped public profile information (e.g., usernames, age, gender, location, etc.) about 70,000 people on the dating site OkCupid and they publicly released these data without any attempts for anonymization. While the researchers felt that there was nothing wrong with this since the data were already public, this work was widely condemned due to ethics concerns around identifiability of users whose information was released in the dataset. If your work involves scraping personally identifiable information, we strongly recommend reading about the OkCupid study3 as well as similar studies with questionable research ethics involving the acquisition and release of personally identifiable information.\n\n\n25.2.3 Copyright\nFinally, you also need to worry about copyright law. Copyright law is complicated, but it’s worth taking a look at the US law which describes exactly what’s protected: “[…] original works of authorship fixed in any tangible medium of expression, […]”. It then goes on to describe specific categories that it applies like literary works, musical works, motion pictures and more. Notably absent from copyright protection are data. This means that as long as you limit your scraping to facts, copyright protection does not apply. (But note that Europe has a separate “sui generis” right that protects databases.)\nAs a brief example, in the US, lists of ingredients and instructions are not copyrightable, so copyright can not be used to protect a recipe. But if that list of recipes is accompanied by substantial novel literary content, that is copyrightable. This is why when you’re looking for a recipe on the internet there’s always so much content beforehand.\nIf you do need to scrape original content (like text or images), you may still be protected under the doctrine of fair use. Fair use is not a hard and fast rule, but weighs up a number of factors. It’s more likely to apply if you are collecting the data for research or non-commercial purposes and if you limit what you scrape to just what you need.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#html-basics",
    "href": "webscraping.html#html-basics",
    "title": "25  Web scraping",
    "section": "25.3 HTML basics",
    "text": "25.3 HTML basics\nTo scrape webpages, you need to first understand a little bit about HTML, the language that describes web pages. HTML stands for HyperText Markup Language and looks something like this:\n&lt;html&gt;\n&lt;head&gt;\n  &lt;title&gt;Page title&lt;/title&gt;\n&lt;/head&gt;\n&lt;body&gt;\n  &lt;h1 id='first'&gt;A heading&lt;/h1&gt;\n  &lt;p&gt;Some text &amp; &lt;b&gt;some bold text.&lt;/b&gt;&lt;/p&gt;\n  &lt;img src='myimg.png' width='100' height='100'&gt;\n&lt;/body&gt;\nHTML has a hierarchical structure formed by elements which consist of a start tag (e.g., &lt;tag&gt;), optional attributes (id='first'), an end tag4 (like &lt;/tag&gt;), and contents (everything in between the start and end tag).\nSince &lt; and &gt; are used for start and end tags, you can’t write them directly. Instead you have to use the HTML escapes &gt; (greater than) and &lt; (less than). And since those escapes use &, if you want a literal ampersand you have to escape it as &amp;. There are a wide range of possible HTML escapes but you don’t need to worry about them too much because rvest automatically handles them for you.\nWeb scraping is possible because most pages that contain data that you want to scrape generally have a consistent structure.\n\n25.3.1 Elements\nThere are over 100 HTML elements. Some of the most important are:\n\nEvery HTML page must be in an &lt;html&gt; element, and it must have two children: &lt;head&gt;, which contains document metadata like the page title, and &lt;body&gt;, which contains the content you see in the browser.\nBlock tags like &lt;h1&gt; (heading 1), &lt;section&gt; (section), &lt;p&gt; (paragraph), and &lt;ol&gt; (ordered list) form the overall structure of the page.\nInline tags like &lt;b&gt; (bold), &lt;i&gt; (italics), and &lt;a&gt; (link) format text inside block tags.\n\nIf you encounter a tag that you’ve never seen before, you can find out what it does with a little googling. Another good place to start are the MDN Web Docs which describe just about every aspect of web programming.\nMost elements can have content in between their start and end tags. This content can either be text or more elements. For example, the following HTML contains paragraph of text, with one word in bold.\n&lt;p&gt;\n  Hi! My &lt;b&gt;name&lt;/b&gt; is Hadley.\n&lt;/p&gt;\nThe children are the elements it contains, so the &lt;p&gt; element above has one child, the &lt;b&gt; element. The &lt;b&gt; element has no children, but it does have contents (the text “name”).\n\n\n25.3.2 Attributes\nTags can have named attributes which look like name1='value1' name2='value2'. Two of the most important attributes are id and class, which are used in conjunction with CSS (Cascading Style Sheets) to control the visual appearance of the page. These are often useful when scraping data off a page. Attributes are also used to record the destination of links (the href attribute of &lt;a&gt; elements) and the source of images (the src attribute of the &lt;img&gt; element).",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#extracting-data",
    "href": "webscraping.html#extracting-data",
    "title": "25  Web scraping",
    "section": "25.4 Extracting data",
    "text": "25.4 Extracting data\nTo get started scraping, you’ll need the URL of the page you want to scrape, which you can usually copy from your web browser. You’ll then need to read the HTML for that page into R with read_html(). This returns an xml_document5 object which you’ll then manipulate using rvest functions:\n\nhtml &lt;- read_html(\"http://rvest.tidyverse.org/\")\nhtml\n#&gt; {html_document}\n#&gt; &lt;html lang=\"en\"&gt;\n#&gt; [1] &lt;head&gt;\\n&lt;meta http-equiv=\"Content-Type\" content=\"text/html; charset=UT ...\n#&gt; [2] &lt;body&gt;\\n    &lt;a href=\"#container\" class=\"visually-hidden-focusable\"&gt;Ski ...\n\nrvest also includes a function that lets you write HTML inline. We’ll use this a bunch in this chapter as we teach how the various rvest functions work with simple examples.\n\nhtml &lt;- minimal_html(\"\n  &lt;p&gt;This is a paragraph&lt;/p&gt;\n  &lt;ul&gt;\n    &lt;li&gt;This is a bulleted list&lt;/li&gt;\n  &lt;/ul&gt;\n\")\nhtml\n#&gt; {html_document}\n#&gt; &lt;html&gt;\n#&gt; [1] &lt;head&gt;\\n&lt;meta http-equiv=\"Content-Type\" content=\"text/html; charset=UT ...\n#&gt; [2] &lt;body&gt;\\n&lt;p&gt;This is a paragraph&lt;/p&gt;\\n  &lt;ul&gt;\\n&lt;li&gt;This is a bulleted lis ...\n\nNow that you have the HTML in R, it’s time to extract the data of interest. You’ll first learn about the CSS selectors that allow you to identify the elements of interest and the rvest functions that you can use to extract data from them. Then we’ll briefly cover HTML tables, which have some special tools.\n\n25.4.1 Find elements\nCSS is short for cascading style sheets, and is a tool for defining the visual styling of HTML documents. CSS includes a miniature language for selecting elements on a page called CSS selectors. CSS selectors define patterns for locating HTML elements, and are useful for scraping because they provide a concise way of describing which elements you want to extract.\nWe’ll come back to CSS selectors in more detail in Section 25.5, but luckily you can get a long way with just three:\n\np selects all &lt;p&gt; elements.\n.title selects all elements with class “title”.\n#title selects the element with the id attribute that equals “title”. Id attributes must be unique within a document, so this will only ever select a single element.\n\nLet’s try out these selectors with a simple example:\n\nhtml &lt;- minimal_html(\"\n  &lt;h1&gt;This is a heading&lt;/h1&gt;\n  &lt;p id='first'&gt;This is a paragraph&lt;/p&gt;\n  &lt;p class='important'&gt;This is an important paragraph&lt;/p&gt;\n\")\n\nUse html_elements() to find all elements that match the selector:\n\nhtml |&gt; html_elements(\"p\")\n#&gt; {xml_nodeset (2)}\n#&gt; [1] &lt;p id=\"first\"&gt;This is a paragraph&lt;/p&gt;\n#&gt; [2] &lt;p class=\"important\"&gt;This is an important paragraph&lt;/p&gt;\nhtml |&gt; html_elements(\".important\")\n#&gt; {xml_nodeset (1)}\n#&gt; [1] &lt;p class=\"important\"&gt;This is an important paragraph&lt;/p&gt;\nhtml |&gt; html_elements(\"#first\")\n#&gt; {xml_nodeset (1)}\n#&gt; [1] &lt;p id=\"first\"&gt;This is a paragraph&lt;/p&gt;\n\nAnother important function is html_element() which always returns the same number of outputs as inputs. If you apply it to a whole document it’ll give you the first match:\n\nhtml |&gt; html_element(\"p\")\n#&gt; {html_node}\n#&gt; &lt;p id=\"first\"&gt;\n\nThere’s an important difference between html_element() and html_elements() when you use a selector that doesn’t match any elements. html_elements() returns a vector of length 0, where html_element() returns a missing value. This will be important shortly.\n\nhtml |&gt; html_elements(\"b\")\n#&gt; {xml_nodeset (0)}\nhtml |&gt; html_element(\"b\")\n#&gt; {xml_missing}\n#&gt; &lt;NA&gt;\n\n\n\n25.4.2 Nesting selections\nIn most cases, you’ll use html_elements() and html_element() together, typically using html_elements() to identify elements that will become observations then using html_element() to find elements that will become variables. Let’s see this in action using a simple example. Here we have an unordered list (&lt;ul&gt;) where each list item (&lt;li&gt;) contains some information about four characters from StarWars:\n\nhtml &lt;- minimal_html(\"\n  &lt;ul&gt;\n    &lt;li&gt;&lt;b&gt;C-3PO&lt;/b&gt; is a &lt;i&gt;droid&lt;/i&gt; that weighs &lt;span class='weight'&gt;167 kg&lt;/span&gt;&lt;/li&gt;\n    &lt;li&gt;&lt;b&gt;R4-P17&lt;/b&gt; is a &lt;i&gt;droid&lt;/i&gt;&lt;/li&gt;\n    &lt;li&gt;&lt;b&gt;R2-D2&lt;/b&gt; is a &lt;i&gt;droid&lt;/i&gt; that weighs &lt;span class='weight'&gt;96 kg&lt;/span&gt;&lt;/li&gt;\n    &lt;li&gt;&lt;b&gt;Yoda&lt;/b&gt; weighs &lt;span class='weight'&gt;66 kg&lt;/span&gt;&lt;/li&gt;\n  &lt;/ul&gt;\n  \")\n\nWe can use html_elements() to make a vector where each element corresponds to a different character:\n\ncharacters &lt;- html |&gt; html_elements(\"li\")\ncharacters\n#&gt; {xml_nodeset (4)}\n#&gt; [1] &lt;li&gt;\\n&lt;b&gt;C-3PO&lt;/b&gt; is a &lt;i&gt;droid&lt;/i&gt; that weighs &lt;span class=\"weight\"&gt; ...\n#&gt; [2] &lt;li&gt;\\n&lt;b&gt;R4-P17&lt;/b&gt; is a &lt;i&gt;droid&lt;/i&gt;\\n&lt;/li&gt;\n#&gt; [3] &lt;li&gt;\\n&lt;b&gt;R2-D2&lt;/b&gt; is a &lt;i&gt;droid&lt;/i&gt; that weighs &lt;span class=\"weight\"&gt; ...\n#&gt; [4] &lt;li&gt;\\n&lt;b&gt;Yoda&lt;/b&gt; weighs &lt;span class=\"weight\"&gt;66 kg&lt;/span&gt;\\n&lt;/li&gt;\n\nTo extract the name of each character, we use html_element(), because when applied to the output of html_elements() it’s guaranteed to return one response per element:\n\ncharacters |&gt; html_element(\"b\")\n#&gt; {xml_nodeset (4)}\n#&gt; [1] &lt;b&gt;C-3PO&lt;/b&gt;\n#&gt; [2] &lt;b&gt;R4-P17&lt;/b&gt;\n#&gt; [3] &lt;b&gt;R2-D2&lt;/b&gt;\n#&gt; [4] &lt;b&gt;Yoda&lt;/b&gt;\n\nThe distinction between html_element() and html_elements() isn’t important for name, but it is important for weight. We want to get one weight for each character, even if there’s no weight &lt;span&gt;. That’s what html_element() does:\n\ncharacters |&gt; html_element(\".weight\")\n#&gt; {xml_nodeset (4)}\n#&gt; [1] &lt;span class=\"weight\"&gt;167 kg&lt;/span&gt;\n#&gt; [2] NA\n#&gt; [3] &lt;span class=\"weight\"&gt;96 kg&lt;/span&gt;\n#&gt; [4] &lt;span class=\"weight\"&gt;66 kg&lt;/span&gt;\n\nhtml_elements() finds all weight &lt;span&gt;s that are children of characters. There’s only three of these, so we lose the connection between names and weights:\n\ncharacters |&gt; html_elements(\".weight\")\n#&gt; {xml_nodeset (3)}\n#&gt; [1] &lt;span class=\"weight\"&gt;167 kg&lt;/span&gt;\n#&gt; [2] &lt;span class=\"weight\"&gt;96 kg&lt;/span&gt;\n#&gt; [3] &lt;span class=\"weight\"&gt;66 kg&lt;/span&gt;\n\nNow that you’ve selected the elements of interest, you’ll need to extract the data, either from the text contents or some attributes.\n\n\n25.4.3 Text and attributes\nhtml_text2()6 extracts the plain text contents of an HTML element:\n\ncharacters |&gt; \n  html_element(\"b\") |&gt; \n  html_text2()\n#&gt; [1] \"C-3PO\"  \"R4-P17\" \"R2-D2\"  \"Yoda\"\n\ncharacters |&gt; \n  html_element(\".weight\") |&gt; \n  html_text2()\n#&gt; [1] \"167 kg\" NA       \"96 kg\"  \"66 kg\"\n\nNote that any escapes will be automatically handled; you’ll only ever see HTML escapes in the source HTML, not in the data returned by rvest.\nhtml_attr() extracts data from attributes:\n\nhtml &lt;- minimal_html(\"\n  &lt;p&gt;&lt;a href='https://en.wikipedia.org/wiki/Cat'&gt;cats&lt;/a&gt;&lt;/p&gt;\n  &lt;p&gt;&lt;a href='https://en.wikipedia.org/wiki/Dog'&gt;dogs&lt;/a&gt;&lt;/p&gt;\n\")\n\nhtml |&gt; \n  html_elements(\"p\") |&gt; \n  html_element(\"a\") |&gt; \n  html_attr(\"href\")\n#&gt; [1] \"https://en.wikipedia.org/wiki/Cat\" \"https://en.wikipedia.org/wiki/Dog\"\n\nhtml_attr() always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.\n\n\n25.4.4 Tables\nIf you’re lucky, your data will be already stored in an HTML table, and it’ll be a matter of just reading it from that table. It’s usually straightforward to recognize a table in your browser: it’ll have a rectangular structure of rows and columns, and you can copy and paste it into a tool like Excel.\nHTML tables are built up from four main elements: &lt;table&gt;, &lt;tr&gt; (table row), &lt;th&gt; (table heading), and &lt;td&gt; (table data). Here’s a simple HTML table with two columns and three rows:\n\nhtml &lt;- minimal_html(\"\n  &lt;table class='mytable'&gt;\n    &lt;tr&gt;&lt;th&gt;x&lt;/th&gt;   &lt;th&gt;y&lt;/th&gt;&lt;/tr&gt;\n    &lt;tr&gt;&lt;td&gt;1.5&lt;/td&gt; &lt;td&gt;2.7&lt;/td&gt;&lt;/tr&gt;\n    &lt;tr&gt;&lt;td&gt;4.9&lt;/td&gt; &lt;td&gt;1.3&lt;/td&gt;&lt;/tr&gt;\n    &lt;tr&gt;&lt;td&gt;7.2&lt;/td&gt; &lt;td&gt;8.1&lt;/td&gt;&lt;/tr&gt;\n  &lt;/table&gt;\n  \")\n\nrvest provides a function that knows how to read this sort of data: html_table(). It returns a list containing one tibble for each table found on the page. Use html_element() to identify the table you want to extract:\n\nhtml |&gt; \n  html_element(\".mytable\") |&gt; \n  html_table()\n#&gt; # A tibble: 3 × 2\n#&gt;       x     y\n#&gt;   &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1   1.5   2.7\n#&gt; 2   4.9   1.3\n#&gt; 3   7.2   8.1\n\nNote that x and y have automatically been converted to numbers. This automatic conversion doesn’t always work, so in more complex scenarios you may want to turn it off with convert = FALSE and then do your own conversion.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#sec-css-selectors",
    "href": "webscraping.html#sec-css-selectors",
    "title": "25  Web scraping",
    "section": "25.5 Finding the right selectors",
    "text": "25.5 Finding the right selectors\nFiguring out the selector you need for your data is typically the hardest part of the problem. You’ll often need to do some experimenting to find a selector that is both specific (i.e. it doesn’t select things you don’t care about) and sensitive (i.e. it does select everything you care about). Lots of trial and error is a normal part of the process! There are two main tools that are available to help you with this process: SelectorGadget and your browser’s developer tools.\nSelectorGadget is a javascript bookmarklet that automatically generates CSS selectors based on the positive and negative examples that you provide. It doesn’t always work, but when it does, it’s magic! You can learn how to install and use SelectorGadget either by reading https://rvest.tidyverse.org/articles/selectorgadget.html or watching Mine’s video at https://www.youtube.com/watch?v=PetWV5g1Xsc.\nEvery modern browser comes with some toolkit for developers, but we recommend Chrome, even if it isn’t your regular browser: its web developer tools are some of the best and they’re immediately available. Right click on an element on the page and click Inspect. This will open an expandable view of the complete HTML page, centered on the element that you just clicked. You can use this to explore the page and get a sense of what selectors might work. Pay particular attention to the class and id attributes, since these are often used to form the visual structure of the page, and hence make for good tools to extract the data that you’re looking for.\nInside the Elements view, you can also right click on an element and choose Copy as Selector to generate a selector that will uniquely identify the element of interest.\nIf either SelectorGadget or Chrome DevTools have generated a CSS selector that you don’t understand, try Selectors Explained which translates CSS selectors into plain English. If you find yourself doing this a lot, you might want to learn more about CSS selectors generally. We recommend starting with the fun CSS dinner tutorial and then referring to the MDN web docs.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#putting-it-all-together",
    "href": "webscraping.html#putting-it-all-together",
    "title": "25  Web scraping",
    "section": "25.6 Putting it all together",
    "text": "25.6 Putting it all together\nLet’s put this all together to scrape some websites. There’s some risk that these examples may no longer work when you run them — that’s the fundamental challenge of web scraping; if the structure of the site changes, then you’ll have to change your scraping code.\n\n25.6.1 StarWars\nrvest includes a very simple example in vignette(\"starwars\"). This is a simple page with minimal HTML so it’s a good place to start. I’d encourage you to navigate to that page now and use “Inspect Element” to inspect one of the headings that’s the title of a Star Wars movie. Use the keyboard or mouse to explore the hierarchy of the HTML and see if you can get a sense of the shared structure used by each movie.\nYou should be able to see that each movie has a shared structure that looks like this:\n&lt;section&gt;\n  &lt;h2 data-id=\"1\"&gt;The Phantom Menace&lt;/h2&gt;\n  &lt;p&gt;Released: 1999-05-19&lt;/p&gt;\n  &lt;p&gt;Director: &lt;span class=\"director\"&gt;George Lucas&lt;/span&gt;&lt;/p&gt;\n  \n  &lt;div class=\"crawl\"&gt;\n    &lt;p&gt;...&lt;/p&gt;\n    &lt;p&gt;...&lt;/p&gt;\n    &lt;p&gt;...&lt;/p&gt;\n  &lt;/div&gt;\n&lt;/section&gt;\nOur goal is to turn this data into a 7 row data frame with variables title, year, director, and intro. We’ll start by reading the HTML and extracting all the &lt;section&gt; elements:\n\nurl &lt;- \"https://rvest.tidyverse.org/articles/starwars.html\"\nhtml &lt;- read_html(url)\n\nsection &lt;- html |&gt; html_elements(\"section\")\nsection\n#&gt; {xml_nodeset (7)}\n#&gt; [1] &lt;section&gt;&lt;h2 data-id=\"1\"&gt;\\nThe Phantom Menace\\n&lt;/h2&gt;\\n&lt;p&gt;\\nReleased: 1 ...\n#&gt; [2] &lt;section&gt;&lt;h2 data-id=\"2\"&gt;\\nAttack of the Clones\\n&lt;/h2&gt;\\n&lt;p&gt;\\nReleased: ...\n#&gt; [3] &lt;section&gt;&lt;h2 data-id=\"3\"&gt;\\nRevenge of the Sith\\n&lt;/h2&gt;\\n&lt;p&gt;\\nReleased:  ...\n#&gt; [4] &lt;section&gt;&lt;h2 data-id=\"4\"&gt;\\nA New Hope\\n&lt;/h2&gt;\\n&lt;p&gt;\\nReleased: 1977-05-2 ...\n#&gt; [5] &lt;section&gt;&lt;h2 data-id=\"5\"&gt;\\nThe Empire Strikes Back\\n&lt;/h2&gt;\\n&lt;p&gt;\\nReleas ...\n#&gt; [6] &lt;section&gt;&lt;h2 data-id=\"6\"&gt;\\nReturn of the Jedi\\n&lt;/h2&gt;\\n&lt;p&gt;\\nReleased: 1 ...\n#&gt; [7] &lt;section&gt;&lt;h2 data-id=\"7\"&gt;\\nThe Force Awakens\\n&lt;/h2&gt;\\n&lt;p&gt;\\nReleased: 20 ...\n\nThis retrieves seven elements matching the seven movies found on that page, suggesting that using section as a selector is good. Extracting the individual elements is straightforward since the data is always found in the text. It’s just a matter of finding the right selector:\n\nsection |&gt; html_element(\"h2\") |&gt; html_text2()\n#&gt; [1] \"The Phantom Menace\"      \"Attack of the Clones\"   \n#&gt; [3] \"Revenge of the Sith\"     \"A New Hope\"             \n#&gt; [5] \"The Empire Strikes Back\" \"Return of the Jedi\"     \n#&gt; [7] \"The Force Awakens\"\n\nsection |&gt; html_element(\".director\") |&gt; html_text2()\n#&gt; [1] \"George Lucas\"     \"George Lucas\"     \"George Lucas\"    \n#&gt; [4] \"George Lucas\"     \"Irvin Kershner\"   \"Richard Marquand\"\n#&gt; [7] \"J. J. Abrams\"\n\nOnce we’ve done that for each component, we can wrap all the results up into a tibble:\n\ntibble(\n  title = section |&gt; \n    html_element(\"h2\") |&gt; \n    html_text2(),\n  released = section |&gt; \n    html_element(\"p\") |&gt; \n    html_text2() |&gt; \n    str_remove(\"Released: \") |&gt; \n    parse_date(),\n  director = section |&gt; \n    html_element(\".director\") |&gt; \n    html_text2(),\n  intro = section |&gt; \n    html_element(\".crawl\") |&gt; \n    html_text2()\n)\n#&gt; # A tibble: 7 × 4\n#&gt;   title                   released   director         intro                  \n#&gt;   &lt;chr&gt;                   &lt;date&gt;     &lt;chr&gt;            &lt;chr&gt;                  \n#&gt; 1 The Phantom Menace      1999-05-19 George Lucas     \"Turmoil has engulfed …\n#&gt; 2 Attack of the Clones    2002-05-16 George Lucas     \"There is unrest in th…\n#&gt; 3 Revenge of the Sith     2005-05-19 George Lucas     \"War! The Republic is …\n#&gt; 4 A New Hope              1977-05-25 George Lucas     \"It is a period of civ…\n#&gt; 5 The Empire Strikes Back 1980-05-17 Irvin Kershner   \"It is a dark time for…\n#&gt; 6 Return of the Jedi      1983-05-25 Richard Marquand \"Luke Skywalker has re…\n#&gt; # ℹ 1 more row\n\nWe did a little more processing of released to get a variable that will be easy to use later in our analysis.\n\n\n25.6.2 IMDB top films\nFor our next task we’ll tackle something a little trickier, extracting the top 250 movies from the internet movie database (IMDb). At the time we wrote this chapter, the page looked like Figure 25.1.\n\n\n\n\n\n\n\n\nFigure 25.1: Screenshot of the IMDb top movies web page taken on 2022-12-05.\n\n\n\n\n\nThis data has a clear tabular structure so it’s worth starting with html_table():\n\nurl &lt;- \"https://web.archive.org/web/20220201012049/https://www.imdb.com/chart/top/\"\nhtml &lt;- read_html(url)\n\ntable &lt;- html |&gt; \n  html_element(\"table\") |&gt; \n  html_table()\ntable\n#&gt; # A tibble: 250 × 5\n#&gt;   ``    `Rank & Title`                    `IMDb Rating` `Your Rating`   ``   \n#&gt;   &lt;lgl&gt; &lt;chr&gt;                                     &lt;dbl&gt; &lt;chr&gt;           &lt;lgl&gt;\n#&gt; 1 NA    \"1.\\n      The Shawshank Redempt…           9.2 \"12345678910\\n… NA   \n#&gt; 2 NA    \"2.\\n      The Godfather\\n      …           9.1 \"12345678910\\n… NA   \n#&gt; 3 NA    \"3.\\n      The Godfather: Part I…           9   \"12345678910\\n… NA   \n#&gt; 4 NA    \"4.\\n      The Dark Knight\\n    …           9   \"12345678910\\n… NA   \n#&gt; 5 NA    \"5.\\n      12 Angry Men\\n       …           8.9 \"12345678910\\n… NA   \n#&gt; 6 NA    \"6.\\n      Schindler's List\\n   …           8.9 \"12345678910\\n… NA   \n#&gt; # ℹ 244 more rows\n\nThis includes a few empty columns, but overall does a good job of capturing the information from the table. However, we need to do some more processing to make it easier to use. First, we’ll rename the columns to be easier to work with, and remove the extraneous whitespace in rank and title. We will do this with select() (instead of rename()) to do the renaming and selecting of just these two columns in one step. Then we’ll remove the new lines and extra spaces, and then apply separate_wider_regex() (from Section 16.3.4) to pull out the title, year, and rank into their own variables.\n\nratings &lt;- table |&gt;\n  select(\n    rank_title_year = `Rank & Title`,\n    rating = `IMDb Rating`\n  ) |&gt; \n  mutate(\n    rank_title_year = str_replace_all(rank_title_year, \"\\n +\", \" \")\n  ) |&gt; \n  separate_wider_regex(\n    rank_title_year,\n    patterns = c(\n      rank = \"\\\\d+\", \"\\\\. \",\n      title = \".+\", \" +\\\\(\",\n      year = \"\\\\d+\", \"\\\\)\"\n    )\n  )\nratings\n#&gt; # A tibble: 250 × 4\n#&gt;   rank  title                    year  rating\n#&gt;   &lt;chr&gt; &lt;chr&gt;                    &lt;chr&gt;  &lt;dbl&gt;\n#&gt; 1 1     The Shawshank Redemption 1994     9.2\n#&gt; 2 2     The Godfather            1972     9.1\n#&gt; 3 3     The Godfather: Part II   1974     9  \n#&gt; 4 4     The Dark Knight          2008     9  \n#&gt; 5 5     12 Angry Men             1957     8.9\n#&gt; 6 6     Schindler's List         1993     8.9\n#&gt; # ℹ 244 more rows\n\nEven in this case where most of the data comes from table cells, it’s still worth looking at the raw HTML. If you do so, you’ll discover that we can add a little extra data by using one of the attributes. This is one of the reasons it’s worth spending a little time spelunking the source of the page; you might find extra data, or might find a parsing route that’s slightly easier.\n\nhtml |&gt; \n  html_elements(\"td strong\") |&gt; \n  head() |&gt; \n  html_attr(\"title\")\n#&gt; [1] \"9.2 based on 2,536,415 user ratings\"\n#&gt; [2] \"9.1 based on 1,745,675 user ratings\"\n#&gt; [3] \"9.0 based on 1,211,032 user ratings\"\n#&gt; [4] \"9.0 based on 2,486,931 user ratings\"\n#&gt; [5] \"8.9 based on 749,563 user ratings\"  \n#&gt; [6] \"8.9 based on 1,295,705 user ratings\"\n\nWe can combine this with the tabular data and again apply separate_wider_regex() to extract out the bit of data we care about:\n\nratings |&gt;\n  mutate(\n    rating_n = html |&gt; html_elements(\"td strong\") |&gt; html_attr(\"title\")\n  ) |&gt; \n  separate_wider_regex(\n    rating_n,\n    patterns = c(\n      \"[0-9.]+ based on \",\n      number = \"[0-9,]+\",\n      \" user ratings\"\n    )\n  ) |&gt; \n  mutate(\n    number = parse_number(number)\n  )\n#&gt; # A tibble: 250 × 5\n#&gt;   rank  title                    year  rating  number\n#&gt;   &lt;chr&gt; &lt;chr&gt;                    &lt;chr&gt;  &lt;dbl&gt;   &lt;dbl&gt;\n#&gt; 1 1     The Shawshank Redemption 1994     9.2 2536415\n#&gt; 2 2     The Godfather            1972     9.1 1745675\n#&gt; 3 3     The Godfather: Part II   1974     9   1211032\n#&gt; 4 4     The Dark Knight          2008     9   2486931\n#&gt; 5 5     12 Angry Men             1957     8.9  749563\n#&gt; 6 6     Schindler's List         1993     8.9 1295705\n#&gt; # ℹ 244 more rows",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#dynamic-sites",
    "href": "webscraping.html#dynamic-sites",
    "title": "25  Web scraping",
    "section": "25.7 Dynamic sites",
    "text": "25.7 Dynamic sites\nSo far we have focused on websites where html_elements() returns what you see in the browser and discussed how to parse what it returns and how to organize that information in tidy data frames. From time-to-time, however, you’ll hit a site where html_elements() and friends don’t return anything like what you see in the browser. In many cases, that’s because you’re trying to scrape a website that dynamically generates the content of the page with javascript. This doesn’t currently work with rvest, because rvest downloads the raw HTML and doesn’t run any javascript.\nIt’s still possible to scrape these types of sites, but rvest needs to use a more expensive process: fully simulating the web browser including running all javascript. This functionality is not available at the time of writing, but it’s something we’re actively working on and might be available by the time you read this. It uses the chromote package which actually runs the Chrome browser in the background, and gives you additional tools to interact with the site, like a human typing text and clicking buttons. Check out the rvest website for more details.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#summary",
    "href": "webscraping.html#summary",
    "title": "25  Web scraping",
    "section": "25.8 Summary",
    "text": "25.8 Summary\nIn this chapter, you’ve learned about the why, the why not, and the how of scraping data from web pages. First, you’ve learned about the basics of HTML and using CSS selectors to refer to specific elements, then you’ve learned about using the rvest package to get data out of HTML into R. We then demonstrated web scraping with two case studies: a simpler scenario on scraping data on StarWars films from the rvest package website and a more complex scenario on scraping the top 250 films from IMDB.\nTechnical details of scraping data off the web can be complex, particularly when dealing with sites, however legal and ethical considerations can be even more complex. It’s important for you to educate yourself about both of these before setting out to scrape data.\nThis brings us to the end of the import part of the book where you’ve learned techniques to get data from where it lives (spreadsheets, databases, JSON files, and web sites) into a tidy form in R. Now it’s time to turn our sights to a new topic: making the most of R as a programming language.",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "webscraping.html#footnotes",
    "href": "webscraping.html#footnotes",
    "title": "25  Web scraping",
    "section": "",
    "text": "And many popular APIs already have CRAN packages that wrap them, so start with a little research first!↩︎\nObviously we’re not lawyers, and this is not legal advice. But this is the best summary we can give having read a bunch about this topic.↩︎\nOne example of an article on the OkCupid study was published by Wired, https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science.↩︎\nA number of tags (including &lt;p&gt; and &lt;li&gt;) don’t require end tags, but we think it’s best to include them because it makes seeing the structure of the HTML a little easier.↩︎\nThis class comes from the xml2 package. xml2 is a low-level package that rvest builds on top of.↩︎\nrvest also provides html_text() but you should almost always use html_text2() since it does a better job of converting nested HTML to text.↩︎",
    "crumbs": [
      "Import",
      "<span class='chapter-number'>25</span>  <span class='chapter-title'>Web scraping</span>"
    ]
  },
  {
    "objectID": "program.html",
    "href": "program.html",
    "title": "Program",
    "section": "",
    "text": "In this part of the book, you’ll improve your programming skills. Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.\n\n\n\n\n\n\n\n\nFigure 1: Programming is the water in which all the other components swim.\n\n\n\n\n\nProgramming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.\nIn the following three chapters, you’ll learn skills to improve your programming skills:\n\nCopy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. Instead, in 26  Functions, you’ll learn how to write functions which let you extract out repeated tidyverse code so that it can be easily reused.\nFunctions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for iteration that let you do similar things again and again. These tools include for loops and functional programming, which you’ll learn about in 27  Iteration.\nAs you read more code written by others, you’ll see more code that doesn’t use the tidyverse. In 28  A field guide to base R, you’ll learn some of the most important base R functions that you’ll see in the wild.\n\nThe goal of these chapters is to teach you the minimum about programming that you need for data science. Once you have mastered the material here, we strongly recommend that you continue to invest in your programming skills. We’ve written two books that you might find helpful. Hands on Programming with R, by Garrett Grolemund, is an introduction to R as a programming language and is a great place to start if R is your first programming language. Advanced R by Hadley Wickham dives into the details of R the programming language; it’s great place to start if you have existing programming experience and great next step once you’ve internalized the ideas in these chapters.",
    "crumbs": [
      "Program"
    ]
  },
  {
    "objectID": "functions.html",
    "href": "functions.html",
    "title": "26  Functions",
    "section": "",
    "text": "26.1 Introduction\nOne of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has four big advantages over using copy-and-paste:\nA good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). In this chapter, you’ll learn about three useful types of functions:\nEach of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks of twitter, and we encourage follow the links in the comment to see original inspirations. You might also want to read the original motivating tweets for general functions and plotting functions to see even more functions.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>26</span>  <span class='chapter-title'>Functions</span>"
    ]
  },
  {
    "objectID": "functions.html#introduction",
    "href": "functions.html#introduction",
    "title": "26  Functions",
    "section": "",
    "text": "You can give a function an evocative name that makes your code easier to understand.\nAs requirements change, you only need to update code in one place, instead of many.\nYou eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).\nIt makes it easier to reuse work from project-to-project, increasing your productivity over time.\n\n\n\nVector functions take one or more vectors as input and return a vector as output.\nData frame functions take a data frame as input and return a data frame as output.\nPlot functions that take a data frame as input and return a plot as output.\n\n\n\n26.1.1 Prerequisites\nWe’ll wrap up a variety of functions from around the tidyverse. We’ll also use nycflights13 as a source of familiar data to use our functions with.\n\nlibrary(tidyverse)\nlibrary(nycflights13)",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>26</span>  <span class='chapter-title'>Functions</span>"
    ]
  },
  {
    "objectID": "functions.html#vector-functions",
    "href": "functions.html#vector-functions",
    "title": "26  Functions",
    "section": "26.2 Vector functions",
    "text": "26.2 Vector functions\nWe’ll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?\n\ndf &lt;- tibble(\n  a = rnorm(5),\n  b = rnorm(5),\n  c = rnorm(5),\n  d = rnorm(5),\n)\n\ndf |&gt; mutate(\n  a = (a - min(a, na.rm = TRUE)) / \n    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),\n  b = (b - min(a, na.rm = TRUE)) / \n    (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)),\n  c = (c - min(c, na.rm = TRUE)) / \n    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),\n  d = (d - min(d, na.rm = TRUE)) / \n    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),\n)\n#&gt; # A tibble: 5 × 4\n#&gt;       a       b     c     d\n#&gt;   &lt;dbl&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 0.339  0.387  0.291 0    \n#&gt; 2 0.880 -0.613  0.611 0.557\n#&gt; 3 0     -0.0833 1     0.752\n#&gt; 4 0.795 -0.0822 0     1    \n#&gt; 5 1     -0.0952 0.580 0.394\n\nYou might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an a to a b. Preventing this type of mistake is one very good reason to learn how to write functions.\n\n26.2.1 Writing a function\nTo write a function you need to first analyse your repeated code to figure what parts are constant and what parts vary. If we take the code above and pull it outside of mutate(), it’s a little easier to see the pattern because each repetition is now one line:\n\n(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))\n(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))\n(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))\n(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))  \n\nTo make this a bit clearer we can replace the bit that varies with █:\n\n(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))\n\nTo turn this into a function you need three things:\n\nA name. Here we’ll use rescale01 because this function rescales a vector to lie between 0 and 1.\nThe arguments. The arguments are things that vary across calls and our analysis above tells us that we have just one. We’ll call it x because this is the conventional name for a numeric vector.\nThe body. The body is the code that’s repeated across all the calls.\n\nThen you create a function by following the template:\n\nname &lt;- function(arguments) {\n  body\n}\n\nFor this case that leads to:\n\nrescale01 &lt;- function(x) {\n  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))\n}\n\nAt this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:\n\nrescale01(c(-10, 0, 10))\n#&gt; [1] 0.0 0.5 1.0\nrescale01(c(1, 2, 3, NA, 5))\n#&gt; [1] 0.00 0.25 0.50   NA 1.00\n\nThen you can rewrite the call to mutate() as:\n\ndf |&gt; mutate(\n  a = rescale01(a),\n  b = rescale01(b),\n  c = rescale01(c),\n  d = rescale01(d),\n)\n#&gt; # A tibble: 5 × 4\n#&gt;       a     b     c     d\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 0.339 1     0.291 0    \n#&gt; 2 0.880 0     0.611 0.557\n#&gt; 3 0     0.530 1     0.752\n#&gt; 4 0.795 0.531 0     1    \n#&gt; 5 1     0.518 0.580 0.394\n\n(In Chapter 27, you’ll learn how to use across() to reduce the duplication even further so all you need is df |&gt; mutate(across(a:d, rescale01))).\n\n\n26.2.2 Improving our function\nYou might notice that the rescale01() function does some unnecessary work — instead of computing min() twice and max() once we could instead compute both the minimum and maximum in one step with range():\n\nrescale01 &lt;- function(x) {\n  rng &lt;- range(x, na.rm = TRUE)\n  (x - rng[1]) / (rng[2] - rng[1])\n}\n\nOr you might try this function on a vector that includes an infinite value:\n\nx &lt;- c(1:10, Inf)\nrescale01(x)\n#&gt;  [1]   0   0   0   0   0   0   0   0   0   0 NaN\n\nThat result is not particularly useful so we could ask range() to ignore infinite values:\n\nrescale01 &lt;- function(x) {\n  rng &lt;- range(x, na.rm = TRUE, finite = TRUE)\n  (x - rng[1]) / (rng[2] - rng[1])\n}\n\nrescale01(x)\n#&gt;  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667\n#&gt;  [8] 0.7777778 0.8888889 1.0000000       Inf\n\nThese changes illustrate an important benefit of functions: because we’ve moved the repeated code into a function, we only need to make the change in one place.\n\n\n26.2.3 Mutate functions\nNow you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, i.e. functions that work well inside of mutate() and filter() because they return an output of the same length as the input.\nLet’s start with a simple variation of rescale01(). Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:\n\nz_score &lt;- function(x) {\n  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)\n}\n\nOr maybe you want to wrap up a straightforward case_when() and give it a useful name. For example, this clamp() function ensures all values of a vector lie in between a minimum or a maximum:\n\nclamp &lt;- function(x, min, max) {\n  case_when(\n    x &lt; min ~ min,\n    x &gt; max ~ max,\n    .default = x\n  )\n}\n\nclamp(1:10, min = 3, max = 7)\n#&gt;  [1] 3 3 3 4 5 6 7 7 7 7\n\nOf course functions don’t just need to work with numeric variables. You might want to do some repeated string manipulation. Maybe you need to make the first character upper case:\n\nfirst_upper &lt;- function(x) {\n  str_sub(x, 1, 1) &lt;- str_to_upper(str_sub(x, 1, 1))\n  x\n}\n\nfirst_upper(\"hello\")\n#&gt; [1] \"Hello\"\n\nOr maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:\n\n# https://twitter.com/NVlabormarket/status/1571939851922198530\nclean_number &lt;- function(x) {\n  is_pct &lt;- str_detect(x, \"%\")\n  num &lt;- x |&gt; \n    str_remove_all(\"%\") |&gt; \n    str_remove_all(\",\") |&gt; \n    str_remove_all(fixed(\"$\")) |&gt; \n    as.numeric()\n  if_else(is_pct, num / 100, num)\n}\n\nclean_number(\"$12,300\")\n#&gt; [1] 12300\nclean_number(\"45%\")\n#&gt; [1] 0.45\n\nSometimes your functions will be highly specialized for one data analysis step. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with NA:\n\nfix_na &lt;- function(x) {\n  if_else(x %in% c(997, 998, 999), NA, x)\n}\n\nWe’ve focused on examples that take a single vector because we think they’re the most common. But there’s no reason that your function can’t take multiple vector inputs.\n\n\n26.2.4 Summary functions\nAnother important family of vector functions is summary functions, functions that return a single value for use in summarize(). Sometimes this can just be a matter of setting a default argument or two:\n\ncommas &lt;- function(x) {\n  str_flatten(x, collapse = \", \", last = \" and \")\n}\n\ncommas(c(\"cat\", \"dog\", \"pigeon\"))\n#&gt; [1] \"cat, dog and pigeon\"\n\nOr you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:\n\ncv &lt;- function(x, na.rm = FALSE) {\n  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)\n}\n\ncv(runif(100, min = 0, max = 50))\n#&gt; [1] 0.5196276\ncv(runif(100, min = 0, max = 500))\n#&gt; [1] 0.5652554\n\nOr maybe you just want to make a common pattern easier to remember by giving it a memorable name:\n\n# https://twitter.com/gbganalyst/status/1571619641390252033\nn_missing &lt;- function(x) {\n  sum(is.na(x))\n} \n\nYou can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:\n\n# https://twitter.com/neilgcurrie/status/1571607727255834625\nmape &lt;- function(actual, predicted) {\n  sum(abs((actual - predicted) / actual)) / length(actual)\n}\n\n\n\n\n\n\n\nRStudio\n\n\n\nOnce you start writing functions, there are two RStudio shortcuts that are super useful:\n\nTo find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.\nTo quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.\n\n\n\n\n\n26.2.5 Exercises\n\nPractice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?\n\nmean(is.na(x))\nmean(is.na(y))\nmean(is.na(z))\n\nx / sum(x, na.rm = TRUE)\ny / sum(y, na.rm = TRUE)\nz / sum(z, na.rm = TRUE)\n\nround(x / sum(x, na.rm = TRUE) * 100, 1)\nround(y / sum(y, na.rm = TRUE) * 100, 1)\nround(z / sum(z, na.rm = TRUE) * 100, 1)\n\nIn the second variant of rescale01(), infinite values are left unchanged. Can you rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1?\nGiven a vector of birthdates, write a function to compute the age in years.\nWrite your own functions to compute the variance and skewness of a numeric vector. You can look up the definitions on Wikipedia or elsewhere.\nWrite both_na(), a summary function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.\nRead the documentation to figure out what the following functions do. Why are they useful even though they are so short?\n\nis_directory &lt;- function(x) {\n  file.info(x)$isdir\n}\nis_readable &lt;- function(x) {\n  file.access(x, 4) == 0\n}",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>26</span>  <span class='chapter-title'>Functions</span>"
    ]
  },
  {
    "objectID": "functions.html#data-frame-functions",
    "href": "functions.html#data-frame-functions",
    "title": "26  Functions",
    "section": "26.3 Data frame functions",
    "text": "26.3 Data frame functions\nVector functions are useful for pulling out code that’s repeated within a dplyr verb. But you’ll often also repeat the verbs themselves, particularly within a large pipeline. When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function. Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or a vector.\nTo let you write a function that uses dplyr verbs, we’ll first introduce you to the challenge of indirection and how you can overcome it with embracing, { }. With this theory under your belt, we’ll then show you a bunch of examples to illustrate what you might do with it.\n\n26.3.1 Indirection and tidy evaluation\nWhen you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: grouped_mean(). The goal of this function is to compute the mean of mean_var grouped by group_var:\n\ngrouped_mean &lt;- function(df, group_var, mean_var) {\n  df |&gt; \n    group_by(group_var) |&gt; \n    summarize(mean(mean_var))\n}\n\nIf we try and use it, we get an error:\n\ndiamonds |&gt; grouped_mean(cut, carat)\n#&gt; Error in `group_by()`:\n#&gt; ! Must group by variables found in `.data`.\n#&gt; ✖ Column `group_var` is not found.\n\nTo make the problem a bit more clear, we can use a made up data frame:\n\ndf &lt;- tibble(\n  mean_var = 1,\n  group_var = \"g\",\n  group = 1,\n  x = 10,\n  y = 100\n)\n\ndf |&gt; grouped_mean(group, x)\n#&gt; # A tibble: 1 × 2\n#&gt;   group_var `mean(mean_var)`\n#&gt;   &lt;chr&gt;                &lt;dbl&gt;\n#&gt; 1 g                        1\ndf |&gt; grouped_mean(group, y)\n#&gt; # A tibble: 1 × 2\n#&gt;   group_var `mean(mean_var)`\n#&gt;   &lt;chr&gt;                &lt;dbl&gt;\n#&gt; 1 g                        1\n\nRegardless of how we call grouped_mean() it always does df |&gt; group_by(group_var) |&gt; summarize(mean(mean_var)), instead of df |&gt; group_by(group) |&gt; summarize(mean(x)) or df |&gt; group_by(group) |&gt; summarize(mean(y)). This is a problem of indirection, and it arises because dplyr uses tidy evaluation to allow you to refer to the names of variables inside your data frame without any special treatment.\nTidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell group_by() and summarize() not to treat group_var and mean_var as the name of the variables, but instead look inside them for the variable we actually want to use.\nTidy evaluation includes a solution to this problem called embracing 🤗. Embracing a variable means to wrap it in braces so (e.g.) var becomes { var }. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of { } as looking down a tunnel — { var } will make a dplyr function look inside of var rather than looking for a variable called var.\nSo to make grouped_mean() work, we need to surround group_var and mean_var with { }:\n\ngrouped_mean &lt;- function(df, group_var, mean_var) {\n  df |&gt; \n    group_by({{ group_var }}) |&gt; \n    summarize(mean({{ mean_var }}))\n}\n\ndf |&gt; grouped_mean(group, x)\n#&gt; # A tibble: 1 × 2\n#&gt;   group `mean(x)`\n#&gt;   &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1     1        10\n\nSuccess!\n\n\n26.3.2 When to embrace?\nSo the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately, this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:\n\nData-masking: this is used in functions like arrange(), filter(), and summarize() that compute with variables.\nTidy-selection: this is used for functions like select(), relocate(), and rename() that select variables.\n\nYour intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g., x + 1) or select (e.g., a:x).\nIn the following sections, we’ll explore the sorts of handy functions you might write once you understand embracing.\n\n\n26.3.3 Common use cases\nIf you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:\n\nsummary6 &lt;- function(data, var) {\n  data |&gt; summarize(\n    min = min({{ var }}, na.rm = TRUE),\n    mean = mean({{ var }}, na.rm = TRUE),\n    median = median({{ var }}, na.rm = TRUE),\n    max = max({{ var }}, na.rm = TRUE),\n    n = n(),\n    n_miss = sum(is.na({{ var }})),\n    .groups = \"drop\"\n  )\n}\n\ndiamonds |&gt; summary6(carat)\n#&gt; # A tibble: 1 × 6\n#&gt;     min  mean median   max     n n_miss\n#&gt;   &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;  &lt;int&gt;\n#&gt; 1   0.2 0.798    0.7  5.01 53940      0\n\n(Whenever you wrap summarize() in a helper, we think it’s good practice to set .groups = \"drop\" to both avoid the message and leave the data in an ungrouped state.)\nThe nice thing about this function is, because it wraps summarize(), you can use it on grouped data:\n\ndiamonds |&gt; \n  group_by(cut) |&gt; \n  summary6(carat)\n#&gt; # A tibble: 5 × 7\n#&gt;   cut         min  mean median   max     n n_miss\n#&gt;   &lt;ord&gt;     &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;  &lt;int&gt;\n#&gt; 1 Fair       0.22 1.05    1     5.01  1610      0\n#&gt; 2 Good       0.23 0.849   0.82  3.01  4906      0\n#&gt; 3 Very Good  0.2  0.806   0.71  4    12082      0\n#&gt; 4 Premium    0.2  0.892   0.86  4.01 13791      0\n#&gt; 5 Ideal      0.2  0.703   0.54  3.5  21551      0\n\nFurthermore, since the arguments to summarize are data-masking also means that the var argument to summary6() is data-masking. That means you can also summarize computed variables:\n\ndiamonds |&gt; \n  group_by(cut) |&gt; \n  summary6(log10(carat))\n#&gt; # A tibble: 5 × 7\n#&gt;   cut          min    mean  median   max     n n_miss\n#&gt;   &lt;ord&gt;      &lt;dbl&gt;   &lt;dbl&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;  &lt;int&gt;\n#&gt; 1 Fair      -0.658 -0.0273  0      0.700  1610      0\n#&gt; 2 Good      -0.638 -0.133  -0.0862 0.479  4906      0\n#&gt; 3 Very Good -0.699 -0.164  -0.149  0.602 12082      0\n#&gt; 4 Premium   -0.699 -0.125  -0.0655 0.603 13791      0\n#&gt; 5 Ideal     -0.699 -0.225  -0.268  0.544 21551      0\n\nTo summarize multiple variables, you’ll need to wait until Section 27.2, where you’ll learn how to use across().\nAnother popular summarize() helper function is a version of count() that also computes proportions:\n\n# https://twitter.com/Diabb6/status/1571635146658402309\ncount_prop &lt;- function(df, var, sort = FALSE) {\n  df |&gt;\n    count({{ var }}, sort = sort) |&gt;\n    mutate(prop = n / sum(n))\n}\n\ndiamonds |&gt; count_prop(clarity)\n#&gt; # A tibble: 8 × 3\n#&gt;   clarity     n   prop\n#&gt;   &lt;ord&gt;   &lt;int&gt;  &lt;dbl&gt;\n#&gt; 1 I1        741 0.0137\n#&gt; 2 SI2      9194 0.170 \n#&gt; 3 SI1     13065 0.242 \n#&gt; 4 VS2     12258 0.227 \n#&gt; 5 VS1      8171 0.151 \n#&gt; 6 VVS2     5066 0.0939\n#&gt; # ℹ 2 more rows\n\nThis function has three arguments: df, var, and sort, and only var needs to be embraced because it’s passed to count() which uses data-masking for all variables. Note that we use a default value for sort so that if the user doesn’t supply their own value it will default to FALSE.\nOr maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we’ll allow the user to supply a condition:\n\nunique_where &lt;- function(df, condition, var) {\n  df |&gt; \n    filter({{ condition }}) |&gt; \n    distinct({{ var }}) |&gt; \n    arrange({{ var }})\n}\n\n# Find all the destinations in December\nflights |&gt; unique_where(month == 12, dest)\n#&gt; # A tibble: 96 × 1\n#&gt;   dest \n#&gt;   &lt;chr&gt;\n#&gt; 1 ABQ  \n#&gt; 2 ALB  \n#&gt; 3 ATL  \n#&gt; 4 AUS  \n#&gt; 5 AVL  \n#&gt; 6 BDL  \n#&gt; # ℹ 90 more rows\n\nHere we embrace condition because it’s passed to filter() and var because it’s passed to distinct() and arrange().\nWe’ve made all these examples to take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects time_hour, carrier, and flight since they form the compound primary key that allows you to identify a row.\n\nsubset_flights &lt;- function(rows, cols) {\n  flights |&gt; \n    filter({{ rows }}) |&gt; \n    select(time_hour, carrier, flight, {{ cols }})\n}\n\n\n\n26.3.4 Data-masking vs. tidy-selection\nSometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a count_missing() that counts the number of missing observations in rows. You might try writing something like:\n\ncount_missing &lt;- function(df, group_vars, x_var) {\n  df |&gt; \n    group_by({{ group_vars }}) |&gt; \n    summarize(\n      n_miss = sum(is.na({{ x_var }})),\n      .groups = \"drop\"\n    )\n}\n\nflights |&gt; \n  count_missing(c(year, month, day), dep_time)\n#&gt; Error in `group_by()`:\n#&gt; ℹ In argument: `c(year, month, day)`.\n#&gt; Caused by error:\n#&gt; ! `c(year, month, day)` must be size 336776 or 1, not 1010328.\n\nThis doesn’t work because group_by() uses data-masking, not tidy-selection. We can work around that problem by using the handy pick() function, which allows you to use tidy-selection inside data-masking functions:\n\ncount_missing &lt;- function(df, group_vars, x_var) {\n  df |&gt; \n    group_by(pick({{ group_vars }})) |&gt; \n    summarize(\n      n_miss = sum(is.na({{ x_var }})),\n      .groups = \"drop\"\n  )\n}\n\nflights |&gt; \n  count_missing(c(year, month, day), dep_time)\n#&gt; # A tibble: 365 × 4\n#&gt;    year month   day n_miss\n#&gt;   &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt;\n#&gt; 1  2013     1     1      4\n#&gt; 2  2013     1     2      8\n#&gt; 3  2013     1     3     10\n#&gt; 4  2013     1     4      6\n#&gt; 5  2013     1     5      3\n#&gt; 6  2013     1     6      1\n#&gt; # ℹ 359 more rows\n\nAnother convenient use of pick() is to make a 2d table of counts. Here we count using all the variables in the rows and columns, then use pivot_wider() to rearrange the counts into a grid:\n\n# https://twitter.com/pollicipes/status/1571606508944719876\ncount_wide &lt;- function(data, rows, cols) {\n  data |&gt; \n    count(pick(c({{ rows }}, {{ cols }}))) |&gt; \n    pivot_wider(\n      names_from = {{ cols }}, \n      values_from = n,\n      names_sort = TRUE,\n      values_fill = 0\n    )\n}\n\ndiamonds |&gt; count_wide(c(clarity, color), cut)\n#&gt; # A tibble: 56 × 7\n#&gt;   clarity color  Fair  Good `Very Good` Premium Ideal\n#&gt;   &lt;ord&gt;   &lt;ord&gt; &lt;int&gt; &lt;int&gt;       &lt;int&gt;   &lt;int&gt; &lt;int&gt;\n#&gt; 1 I1      D         4     8           5      12    13\n#&gt; 2 I1      E         9    23          22      30    18\n#&gt; 3 I1      F        35    19          13      34    42\n#&gt; 4 I1      G        53    19          16      46    16\n#&gt; 5 I1      H        52    14          12      46    38\n#&gt; 6 I1      I        34     9           8      24    17\n#&gt; # ℹ 50 more rows\n\nWhile our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the pivot_wider() docs you can see that names_from uses tidy-selection.\n\n\n26.3.5 Exercises\n\nUsing the datasets from nycflights13, write a function that:\n\nFinds all flights that were cancelled (i.e. is.na(arr_time)) or delayed by more than an hour.\n\nflights |&gt; filter_severe()\n\nCounts the number of cancelled flights and the number of flights delayed by more than an hour.\n\nflights |&gt; group_by(dest) |&gt; summarize_severe()\n\nFinds all flights that were cancelled or delayed by more than a user supplied number of hours:\n\nflights |&gt; filter_severe(hours = 2)\n\nSummarizes the weather to compute the minimum, mean, and maximum, of a user supplied variable:\n\nweather |&gt; summarize_weather(temp)\n\nConverts the user supplied variable that uses clock time (e.g., dep_time, arr_time, etc.) into a decimal time (i.e. hours + (minutes / 60)).\n\nflights |&gt; standardize_time(sched_dep_time)\n\n\nFor each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: distinct(), count(), group_by(), rename_with(), slice_min(), slice_sample().\nGeneralize the following function so that you can supply any number of variables to count.\n\ncount_prop &lt;- function(df, var, sort = FALSE) {\n  df |&gt;\n    count({{ var }}, sort = sort) |&gt;\n    mutate(prop = n / sum(n))\n}",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>26</span>  <span class='chapter-title'>Functions</span>"
    ]
  },
  {
    "objectID": "functions.html#plot-functions",
    "href": "functions.html#plot-functions",
    "title": "26  Functions",
    "section": "26.4 Plot functions",
    "text": "26.4 Plot functions\nInstead of returning a data frame, you might want to return a plot. Fortunately, you can use the same techniques with ggplot2, because aes() is a data-masking function. For example, imagine that you’re making a lot of histograms:\n\ndiamonds |&gt; \n  ggplot(aes(x = carat)) +\n  geom_histogram(binwidth = 0.1)\n\ndiamonds |&gt; \n  ggplot(aes(x = carat)) +\n  geom_histogram(binwidth = 0.05)\n\nWouldn’t it be nice if you could wrap this up into a histogram function? This is easy as pie once you know that aes() is a data-masking function and you need to embrace:\n\nhistogram &lt;- function(df, var, binwidth = NULL) {\n  df |&gt; \n    ggplot(aes(x = {{ var }})) + \n    geom_histogram(binwidth = binwidth)\n}\n\ndiamonds |&gt; histogram(carat, 0.1)\n\n\n\n\n\n\n\n\nNote that histogram() returns a ggplot2 plot, meaning you can still add on additional components if you want. Just remember to switch from |&gt; to +:\n\ndiamonds |&gt; \n  histogram(carat, 0.1) +\n  labs(x = \"Size (in carats)\", y = \"Number of diamonds\")\n\n\n26.4.1 More variables\nIt’s straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:\n\n# https://twitter.com/tyler_js_smith/status/1574377116988104704\nlinearity_check &lt;- function(df, x, y) {\n  df |&gt;\n    ggplot(aes(x = {{ x }}, y = {{ y }})) +\n    geom_point() +\n    geom_smooth(method = \"loess\", formula = y ~ x, color = \"red\", se = FALSE) +\n    geom_smooth(method = \"lm\", formula = y ~ x, color = \"blue\", se = FALSE) \n}\n\nstarwars |&gt; \n  filter(mass &lt; 1000) |&gt; \n  linearity_check(mass, height)\n\n\n\n\n\n\n\n\nOr maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:\n\n# https://twitter.com/ppaxisa/status/1574398423175921665\nhex_plot &lt;- function(df, x, y, z, bins = 20, fun = \"mean\") {\n  df |&gt; \n    ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) + \n    stat_summary_hex(\n      aes(color = after_scale(fill)), # make border same color as fill\n      bins = bins, \n      fun = fun,\n    )\n}\n\ndiamonds |&gt; hex_plot(carat, price, depth)\n#&gt; Warning: Computation failed in `stat_summary_hex()`.\n#&gt; Caused by error in `compute_group()`:\n#&gt; ! The package \"hexbin\" is required for `stat_summary_hex()`.\n\n\n\n\n\n\n\n\n\n\n26.4.2 Combining with other tidyverse\nSome of the most useful helpers combine a dash of data manipulation with ggplot2. For example, if you might want to do a vertical bar chart where you automatically sort the bars in frequency order using fct_infreq(). Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:\n\nsorted_bars &lt;- function(df, var) {\n  df |&gt; \n    mutate({{ var }} := fct_rev(fct_infreq({{ var }})))  |&gt;\n    ggplot(aes(y = {{ var }})) +\n    geom_bar()\n}\n\ndiamonds |&gt; sorted_bars(clarity)\n\n\n\n\n\n\n\n\nWe have to use a new operator here, := (commonly referred to as the “walrus operator”), because we are generating the variable name based on user-supplied data. Variable names go on the left hand side of =, but R’s syntax doesn’t allow anything to the left of = except for a single literal name. To work around this problem, we use the special operator := which tidy evaluation treats in exactly the same way as =.\nOr maybe you want to make it easy to draw a bar plot just for a subset of the data:\n\nconditional_bars &lt;- function(df, condition, var) {\n  df |&gt; \n    filter({{ condition }}) |&gt; \n    ggplot(aes(x = {{ var }})) + \n    geom_bar()\n}\n\ndiamonds |&gt; conditional_bars(cut == \"Good\", clarity)\n\n\n\n\n\n\n\n\nYou can also get creative and display data summaries in other ways. You can find a cool application at https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b; it uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.\nWe’ll finish with a more complicated case: labelling the plots you create.\n\n\n26.4.3 Labeling\nRemember the histogram function we showed you earlier?\n\nhistogram &lt;- function(df, var, binwidth = NULL) {\n  df |&gt; \n    ggplot(aes(x = {{ var }})) + \n    geom_histogram(binwidth = binwidth)\n}\n\nWouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from the package we haven’t talked about yet: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).\nTo solve the labeling problem we can use rlang::englue(). This works similarly to str_glue(), so any value wrapped in { } will be inserted into the string. But it also understands { }, which automatically inserts the appropriate variable name:\n\nhistogram &lt;- function(df, var, binwidth) {\n  label &lt;- rlang::englue(\"A histogram of {{var}} with binwidth {binwidth}\")\n  \n  df |&gt; \n    ggplot(aes(x = {{ var }})) + \n    geom_histogram(binwidth = binwidth) + \n    labs(title = label)\n}\n\ndiamonds |&gt; histogram(carat, 0.1)\n\n\n\n\n\n\n\n\nYou can use the same approach in any other place where you want to supply a string in a ggplot2 plot.\n\n\n26.4.4 Exercises\nBuild up a rich plotting function by incrementally implementing each of the steps below:\n\nDraw a scatterplot given dataset and x and y variables.\nAdd a line of best fit (i.e. a linear model with no standard errors).\nAdd a title.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>26</span>  <span class='chapter-title'>Functions</span>"
    ]
  },
  {
    "objectID": "functions.html#style",
    "href": "functions.html#style",
    "title": "26  Functions",
    "section": "26.5 Style",
    "text": "26.5 Style\nR doesn’t care what your function or arguments are called but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as RStudio’s autocomplete makes it easy to type long names.\nGenerally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. mean() is better than compute_mean()), or accessing some property of an object (i.e. coef() is better than get_coefficients()). Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.\n\n# Too short\nf()\n\n# Not a verb, or descriptive\nmy_awesome_function()\n\n# Long, but clear\nimpute_missing()\ncollapse_years()\n\nR also doesn’t care about how you use white space in your functions but future readers will. Continue to follow the rules from Chapter 5. Additionally, function() should always be followed by squiggly brackets ({}), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.\n\n# Missing extra two spaces\ndensity &lt;- function(color, facets, binwidth = 0.1) {\ndiamonds |&gt; \n  ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +\n  geom_freqpoly(binwidth = binwidth) +\n  facet_wrap(vars({{ facets }}))\n}\n\n# Pipe indented incorrectly\ndensity &lt;- function(color, facets, binwidth = 0.1) {\n  diamonds |&gt; \n  ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +\n  geom_freqpoly(binwidth = binwidth) +\n  facet_wrap(vars({{ facets }}))\n}\n\nAs you can see we recommend putting extra spaces inside of { }. This makes it very obvious that something unusual is happening.\n\n26.5.1 Exercises\n\nRead the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.\n\nf1 &lt;- function(string, prefix) {\n  str_sub(string, 1, str_length(prefix)) == prefix\n}\n\nf3 &lt;- function(x, y) {\n  rep(y, length.out = length(x))\n}\n\nTake a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.\nMake a case for why norm_r(), norm_d() etc. would be better than rnorm(), dnorm(). Make a case for the opposite. How could you make the names even clearer?",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>26</span>  <span class='chapter-title'>Functions</span>"
    ]
  },
  {
    "objectID": "functions.html#summary",
    "href": "functions.html#summary",
    "title": "26  Functions",
    "section": "26.6 Summary",
    "text": "26.6 Summary\nIn this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.\nWe have only shown you the bare minimum to get started with functions and there’s much more to learn. A few places to learn more are:\n\nTo learn more about programming with tidy evaluation, see useful recipes in programming with dplyr and programming with tidyr and learn more about the theory in What is data-masking and why do I need {{?.\nTo learn more about reducing duplication in your ggplot2 code, read the Programming with ggplot2 chapter of the ggplot2 book.\nFor more advice on function style, see the tidyverse style guide.\n\nIn the next chapter, we’ll dive into iteration which gives you further tools for reducing code duplication.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>26</span>  <span class='chapter-title'>Functions</span>"
    ]
  },
  {
    "objectID": "iteration.html",
    "href": "iteration.html",
    "title": "27  Iteration",
    "section": "",
    "text": "27.1 Introduction\nIn this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector x in R, you can just write 2 * x. In most other languages, you’d need to explicitly double each element of x using some sort of for loop.\nThis book has already given you a small but powerful number of tools that perform the same action for multiple “things”:\nNow it’s time to learn some more general tools, often called functional programming tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter we’ll keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>27</span>  <span class='chapter-title'>Iteration</span>"
    ]
  },
  {
    "objectID": "iteration.html#introduction",
    "href": "iteration.html#introduction",
    "title": "27  Iteration",
    "section": "",
    "text": "facet_wrap() and facet_grid() draws a plot for each subset.\ngroup_by() plus summarize() computes summary statistics for each subset.\nunnest_wider() and unnest_longer() create new rows and columns for each element of a list-column.\n\n\n\n27.1.1 Prerequisites\nIn this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but purrr is new. We’re just going to use a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.\n\nlibrary(tidyverse)",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>27</span>  <span class='chapter-title'>Iteration</span>"
    ]
  },
  {
    "objectID": "iteration.html#sec-across",
    "href": "iteration.html#sec-across",
    "title": "27  Iteration",
    "section": "27.2 Modifying multiple columns",
    "text": "27.2 Modifying multiple columns\nImagine you have this simple tibble and you want to count the number of observations and compute the median of every column.\n\ndf &lt;- tibble(\n  a = rnorm(10),\n  b = rnorm(10),\n  c = rnorm(10),\n  d = rnorm(10)\n)\n\nYou could do it with copy-and-paste:\n\ndf |&gt; summarize(\n  n = n(),\n  a = median(a),\n  b = median(b),\n  c = median(c),\n  d = median(d),\n)\n#&gt; # A tibble: 1 × 5\n#&gt;       n      a      b       c     d\n#&gt;   &lt;int&gt;  &lt;dbl&gt;  &lt;dbl&gt;   &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1    10 -0.246 -0.287 -0.0567 0.144\n\nThat breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead, you can use across():\n\ndf |&gt; summarize(\n  n = n(),\n  across(a:d, median),\n)\n#&gt; # A tibble: 1 × 5\n#&gt;       n      a      b       c     d\n#&gt;   &lt;int&gt;  &lt;dbl&gt;  &lt;dbl&gt;   &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1    10 -0.246 -0.287 -0.0567 0.144\n\nacross() has three particularly important arguments, which we’ll discuss in detail in the following sections. You’ll use the first two every time you use across(): the first argument, .cols, specifies which columns you want to iterate over, and the second argument, .fns, specifies what to do with each column. You can use the .names argument when you need additional control over the names of output columns, which is particularly important when you use across() with mutate(). We’ll also discuss two important variations, if_any() and if_all(), which work with filter().\n\n27.2.1 Selecting columns with .cols\nThe first argument to across(), .cols, selects the columns to transform. This uses the same specifications as select(), Section 4.3.2, so you can use functions like starts_with() and ends_with() to select columns based on their name.\nThere are two additional selection techniques that are particularly useful for across(): everything() and where(). everything() is straightforward: it selects every (non-grouping) column:\n\ndf &lt;- tibble(\n  grp = sample(2, 10, replace = TRUE),\n  a = rnorm(10),\n  b = rnorm(10),\n  c = rnorm(10),\n  d = rnorm(10)\n)\n\ndf |&gt; \n  group_by(grp) |&gt; \n  summarize(across(everything(), median))\n#&gt; # A tibble: 2 × 5\n#&gt;     grp       a       b     c     d\n#&gt;   &lt;int&gt;   &lt;dbl&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1 -0.0935 -0.0163 0.363 0.364\n#&gt; 2     2  0.312  -0.0576 0.208 0.565\n\nNote grouping columns (grp here) are not included in across(), because they’re automatically preserved by summarize().\nwhere() allows you to select columns based on their type:\n\nwhere(is.numeric) selects all numeric columns.\nwhere(is.character) selects all string columns.\nwhere(is.Date) selects all date columns.\nwhere(is.POSIXct) selects all date-time columns.\nwhere(is.logical) selects all logical columns.\n\nJust like other selectors, you can combine these with Boolean algebra. For example, !where(is.numeric) selects all non-numeric columns, and starts_with(\"a\") & where(is.logical) selects all logical columns whose name starts with “a”.\n\n\n27.2.2 Calling a single function\nThe second argument to across() defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (median, mean, str_flatten, …) to another function (across). This is one of the features that makes R a functional programming language.\nIt’s important to note that we’re passing this function to across(), so across() can call it; we’re not calling it ourselves. That means the function name should never be followed by (). If you forget, you’ll get an error:\n\ndf |&gt; \n  group_by(grp) |&gt; \n  summarize(across(everything(), median()))\n#&gt; Error in `summarize()`:\n#&gt; ℹ In argument: `across(everything(), median())`.\n#&gt; Caused by error in `median.default()`:\n#&gt; ! argument \"x\" is missing, with no default\n\nThis error arises because you’re calling the function with no input, e.g.:\n\nmedian()\n#&gt; Error in median.default(): argument \"x\" is missing, with no default\n\n\n\n27.2.3 Calling multiple functions\nIn more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? median() propagates those missing values, giving us a suboptimal output:\n\nrnorm_na &lt;- function(n, n_na, mean = 0, sd = 1) {\n  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))\n}\n\ndf_miss &lt;- tibble(\n  a = rnorm_na(5, 1),\n  b = rnorm_na(5, 1),\n  c = rnorm_na(5, 2),\n  d = rnorm(5)\n)\ndf_miss |&gt; \n  summarize(\n    across(a:d, median),\n    n = n()\n  )\n#&gt; # A tibble: 1 × 5\n#&gt;       a     b     c     d     n\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;\n#&gt; 1    NA    NA    NA  1.15     5\n\nIt would be nice if we could pass along na.rm = TRUE to median() to remove these missing values. To do so, instead of calling median() directly, we need to create a new function that calls median() with the desired arguments:\n\ndf_miss |&gt; \n  summarize(\n    across(a:d, function(x) median(x, na.rm = TRUE)),\n    n = n()\n  )\n#&gt; # A tibble: 1 × 5\n#&gt;       a     b      c     d     n\n#&gt;   &lt;dbl&gt; &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;\n#&gt; 1 0.139 -1.11 -0.387  1.15     5\n\nThis is a little verbose, so R comes with a handy shortcut: for this sort of throw away, or anonymous1, function you can replace function with \\2:\n\ndf_miss |&gt; \n  summarize(\n    across(a:d, \\(x) median(x, na.rm = TRUE)),\n    n = n()\n  )\n\nIn either case, across() effectively expands to the following code:\n\ndf_miss |&gt; \n  summarize(\n    a = median(a, na.rm = TRUE),\n    b = median(b, na.rm = TRUE),\n    c = median(c, na.rm = TRUE),\n    d = median(d, na.rm = TRUE),\n    n = n()\n  )\n\nWhen we remove the missing values from the median(), it would be nice to know just how many values were removed. We can find that out by supplying two functions to across(): one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to .fns:\n\ndf_miss |&gt; \n  summarize(\n    across(a:d, list(\n      median = \\(x) median(x, na.rm = TRUE),\n      n_miss = \\(x) sum(is.na(x))\n    )),\n    n = n()\n  )\n#&gt; # A tibble: 1 × 9\n#&gt;   a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss\n#&gt;      &lt;dbl&gt;    &lt;int&gt;    &lt;dbl&gt;    &lt;int&gt;    &lt;dbl&gt;    &lt;int&gt;    &lt;dbl&gt;    &lt;int&gt;\n#&gt; 1    0.139        1    -1.11        1   -0.387        2     1.15        0\n#&gt; # ℹ 1 more variable: n &lt;int&gt;\n\nIf you look carefully, you might intuit that the columns are named using a glue specification (Section 15.3.2) like {.col}_{.fn} where .col is the name of the original column and .fn is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use .names argument to supply your own glue spec.\n\n\n27.2.4 Column names\nThe result of across() is named according to the specification provided in the .names argument. We could specify our own if we wanted the name of the function to come first3:\n\ndf_miss |&gt; \n  summarize(\n    across(\n      a:d,\n      list(\n        median = \\(x) median(x, na.rm = TRUE),\n        n_miss = \\(x) sum(is.na(x))\n      ),\n      .names = \"{.fn}_{.col}\"\n    ),\n    n = n(),\n  )\n#&gt; # A tibble: 1 × 9\n#&gt;   median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d\n#&gt;      &lt;dbl&gt;    &lt;int&gt;    &lt;dbl&gt;    &lt;int&gt;    &lt;dbl&gt;    &lt;int&gt;    &lt;dbl&gt;    &lt;int&gt;\n#&gt; 1    0.139        1    -1.11        1   -0.387        2     1.15        0\n#&gt; # ℹ 1 more variable: n &lt;int&gt;\n\nThe .names argument is particularly important when you use across() with mutate(). By default, the output of across() is given the same names as the inputs. This means that across() inside of mutate() will replace existing columns. For example, here we use coalesce() to replace NAs with 0:\n\ndf_miss |&gt; \n  mutate(\n    across(a:d, \\(x) coalesce(x, 0))\n  )\n#&gt; # A tibble: 5 × 4\n#&gt;        a      b      c     d\n#&gt;    &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  0.434 -1.25   0     1.60 \n#&gt; 2  0     -1.43  -0.297 0.776\n#&gt; 3 -0.156 -0.980  0     1.15 \n#&gt; 4 -2.61  -0.683 -0.785 2.13 \n#&gt; 5  1.11   0     -0.387 0.704\n\nIf you’d like to instead create new columns, you can use the .names argument to give the output new names:\n\ndf_miss |&gt; \n  mutate(\n    across(a:d, \\(x) coalesce(x, 0), .names = \"{.col}_na_zero\")\n  )\n#&gt; # A tibble: 5 × 8\n#&gt;        a      b      c     d a_na_zero b_na_zero c_na_zero d_na_zero\n#&gt;    &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1  0.434 -1.25  NA     1.60      0.434    -1.25      0         1.60 \n#&gt; 2 NA     -1.43  -0.297 0.776     0        -1.43     -0.297     0.776\n#&gt; 3 -0.156 -0.980 NA     1.15     -0.156    -0.980     0         1.15 \n#&gt; 4 -2.61  -0.683 -0.785 2.13     -2.61     -0.683    -0.785     2.13 \n#&gt; 5  1.11  NA     -0.387 0.704     1.11      0        -0.387     0.704\n\n\n\n27.2.5 Filtering\nacross() is a great match for summarize() and mutate() but it’s more awkward to use with filter(), because you usually combine multiple conditions with either | or &. It’s clear that across() can help to create multiple logical columns, but then what? So dplyr provides two variants of across() called if_any() and if_all():\n\n# same as df_miss |&gt; filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))\ndf_miss |&gt; filter(if_any(a:d, is.na))\n#&gt; # A tibble: 4 × 4\n#&gt;        a      b      c     d\n#&gt;    &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  0.434 -1.25  NA     1.60 \n#&gt; 2 NA     -1.43  -0.297 0.776\n#&gt; 3 -0.156 -0.980 NA     1.15 \n#&gt; 4  1.11  NA     -0.387 0.704\n\n# same as df_miss |&gt; filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))\ndf_miss |&gt; filter(if_all(a:d, is.na))\n#&gt; # A tibble: 0 × 4\n#&gt; # ℹ 4 variables: a &lt;dbl&gt;, b &lt;dbl&gt;, c &lt;dbl&gt;, d &lt;dbl&gt;\n\n\n\n27.2.6 across() in functions\nacross() is particularly useful to program with because it allows you to operate on multiple columns. For example, Jacob Scott uses this little helper which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:\n\nexpand_dates &lt;- function(df) {\n  df |&gt; \n    mutate(\n      across(where(is.Date), list(year = year, month = month, day = mday))\n    )\n}\n\ndf_date &lt;- tibble(\n  name = c(\"Amy\", \"Bob\"),\n  date = ymd(c(\"2009-08-03\", \"2010-01-16\"))\n)\n\ndf_date |&gt; \n  expand_dates()\n#&gt; # A tibble: 2 × 5\n#&gt;   name  date       date_year date_month date_day\n#&gt;   &lt;chr&gt; &lt;date&gt;         &lt;dbl&gt;      &lt;dbl&gt;    &lt;int&gt;\n#&gt; 1 Amy   2009-08-03      2009          8        3\n#&gt; 2 Bob   2010-01-16      2010          1       16\n\nacross() also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in Section 26.3.2. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:\n\nsummarize_means &lt;- function(df, summary_vars = where(is.numeric)) {\n  df |&gt; \n    summarize(\n      across({{ summary_vars }}, \\(x) mean(x, na.rm = TRUE)),\n      n = n(),\n      .groups = \"drop\"\n    )\n}\ndiamonds |&gt; \n  group_by(cut) |&gt; \n  summarize_means()\n#&gt; # A tibble: 5 × 9\n#&gt;   cut       carat depth table price     x     y     z     n\n#&gt;   &lt;ord&gt;     &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;\n#&gt; 1 Fair      1.05   64.0  59.1 4359.  6.25  6.18  3.98  1610\n#&gt; 2 Good      0.849  62.4  58.7 3929.  5.84  5.85  3.64  4906\n#&gt; 3 Very Good 0.806  61.8  58.0 3982.  5.74  5.77  3.56 12082\n#&gt; 4 Premium   0.892  61.3  58.7 4584.  5.97  5.94  3.65 13791\n#&gt; 5 Ideal     0.703  61.7  56.0 3458.  5.51  5.52  3.40 21551\n\ndiamonds |&gt; \n  group_by(cut) |&gt; \n  summarize_means(c(carat, x:z))\n#&gt; # A tibble: 5 × 6\n#&gt;   cut       carat     x     y     z     n\n#&gt;   &lt;ord&gt;     &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;\n#&gt; 1 Fair      1.05   6.25  6.18  3.98  1610\n#&gt; 2 Good      0.849  5.84  5.85  3.64  4906\n#&gt; 3 Very Good 0.806  5.74  5.77  3.56 12082\n#&gt; 4 Premium   0.892  5.97  5.94  3.65 13791\n#&gt; 5 Ideal     0.703  5.51  5.52  3.40 21551\n\n\n\n27.2.7 Compare with pivot_longer()\nBefore we go on, it’s worth pointing out an interesting connection between across() and pivot_longer() (Section 6.3). In many cases, you perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:\n\ndf |&gt; \n  summarize(across(a:d, list(median = median, mean = mean)))\n#&gt; # A tibble: 1 × 8\n#&gt;   a_median a_mean b_median b_mean c_median c_mean d_median d_mean\n#&gt;      &lt;dbl&gt;  &lt;dbl&gt;    &lt;dbl&gt;  &lt;dbl&gt;    &lt;dbl&gt;  &lt;dbl&gt;    &lt;dbl&gt;  &lt;dbl&gt;\n#&gt; 1   0.0380  0.205  -0.0163 0.0910    0.260 0.0716    0.540  0.508\n\nWe could compute the same values by pivoting longer and then summarizing:\n\nlong &lt;- df |&gt; \n  pivot_longer(a:d) |&gt; \n  group_by(name) |&gt; \n  summarize(\n    median = median(value),\n    mean = mean(value)\n  )\nlong\n#&gt; # A tibble: 4 × 3\n#&gt;   name   median   mean\n#&gt;   &lt;chr&gt;   &lt;dbl&gt;  &lt;dbl&gt;\n#&gt; 1 a      0.0380 0.205 \n#&gt; 2 b     -0.0163 0.0910\n#&gt; 3 c      0.260  0.0716\n#&gt; 4 d      0.540  0.508\n\nAnd if you wanted the same structure as across() you could pivot again:\n\nlong |&gt; \n  pivot_wider(\n    names_from = name,\n    values_from = c(median, mean),\n    names_vary = \"slowest\",\n    names_glue = \"{name}_{.value}\"\n  )\n#&gt; # A tibble: 1 × 8\n#&gt;   a_median a_mean b_median b_mean c_median c_mean d_median d_mean\n#&gt;      &lt;dbl&gt;  &lt;dbl&gt;    &lt;dbl&gt;  &lt;dbl&gt;    &lt;dbl&gt;  &lt;dbl&gt;    &lt;dbl&gt;  &lt;dbl&gt;\n#&gt; 1   0.0380  0.205  -0.0163 0.0910    0.260 0.0716    0.540  0.508\n\nThis is a useful technique to know about because sometimes you’ll hit a problem that’s not currently possible to solve with across(): when you have groups of columns that you want to compute with simultaneously. For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:\n\ndf_paired &lt;- tibble(\n  a_val = rnorm(10),\n  a_wts = runif(10),\n  b_val = rnorm(10),\n  b_wts = runif(10),\n  c_val = rnorm(10),\n  c_wts = runif(10),\n  d_val = rnorm(10),\n  d_wts = runif(10)\n)\n\nThere’s currently no way to do this with across()4, but it’s relatively straightforward with pivot_longer():\n\ndf_long &lt;- df_paired |&gt; \n  pivot_longer(\n    everything(), \n    names_to = c(\"group\", \".value\"), \n    names_sep = \"_\"\n  )\ndf_long\n#&gt; # A tibble: 40 × 3\n#&gt;   group    val   wts\n#&gt;   &lt;chr&gt;  &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 a      0.715 0.518\n#&gt; 2 b     -0.709 0.691\n#&gt; 3 c      0.718 0.216\n#&gt; 4 d     -0.217 0.733\n#&gt; 5 a     -1.09  0.979\n#&gt; 6 b     -0.209 0.675\n#&gt; # ℹ 34 more rows\n\ndf_long |&gt; \n  group_by(group) |&gt; \n  summarize(mean = weighted.mean(val, wts))\n#&gt; # A tibble: 4 × 2\n#&gt;   group    mean\n#&gt;   &lt;chr&gt;   &lt;dbl&gt;\n#&gt; 1 a      0.126 \n#&gt; 2 b     -0.0704\n#&gt; 3 c     -0.360 \n#&gt; 4 d     -0.248\n\nIf needed, you could pivot_wider() this back to the original form.\n\n\n27.2.8 Exercises\n\nPractice your across() skills by:\n\nComputing the number of unique values in each column of palmerpenguins::penguins.\nComputing the mean of every column in mtcars.\nGrouping diamonds by cut, clarity, and color then counting the number of observations and computing the mean of each numeric column.\n\nWhat happens if you use a list of functions in across(), but don’t name them? How is the output named?\nAdjust expand_dates() to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?\nExplain what each step of the pipeline in this function does. What special feature of where() are we taking advantage of?\n\nshow_missing &lt;- function(df, group_vars, summary_vars = everything()) {\n  df |&gt; \n    group_by(pick({{ group_vars }})) |&gt; \n    summarize(\n      across({{ summary_vars }}, \\(x) sum(is.na(x))),\n      .groups = \"drop\"\n    ) |&gt;\n    select(where(\\(x) any(x &gt; 0)))\n}\nnycflights13::flights |&gt; show_missing(c(year, month, day))",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>27</span>  <span class='chapter-title'>Iteration</span>"
    ]
  },
  {
    "objectID": "iteration.html#reading-multiple-files",
    "href": "iteration.html#reading-multiple-files",
    "title": "27  Iteration",
    "section": "27.3 Reading multiple files",
    "text": "27.3 Reading multiple files\nIn the previous section, you learned how to use dplyr::across() to repeat a transformation on multiple columns. In this section, you’ll learn how to use purrr::map() to do something to every file in a directory. Let’s start with a little motivation: imagine you have a directory full of excel spreadsheets5 you want to read. You could do it with copy and paste:\n\ndata2019 &lt;- readxl::read_excel(\"data/y2019.xlsx\")\ndata2020 &lt;- readxl::read_excel(\"data/y2020.xlsx\")\ndata2021 &lt;- readxl::read_excel(\"data/y2021.xlsx\")\ndata2022 &lt;- readxl::read_excel(\"data/y2022.xlsx\")\n\nAnd then use dplyr::bind_rows() to combine them all together:\n\ndata &lt;- bind_rows(data2019, data2020, data2021, data2022)\n\nYou can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four. The following sections show you how to automate this sort of task. There are three basic steps: use list.files() to list all the files in a directory, then use purrr::map() to read each of them into a list, then use purrr::list_rbind() to combine them into a single data frame. We’ll then discuss how you can handle situations of increasing heterogeneity, where you can’t do exactly the same thing to every file.\n\n27.3.1 Listing files in a directory\nAs the name suggests, list.files() lists the files in a directory. You’ll almost always use three arguments:\n\nThe first argument, path, is the directory to look in.\npattern is a regular expression used to filter the file names. The most common pattern is something like [.]xlsx$ or [.]csv$ to find all files with a specified extension.\nfull.names determines whether or not the directory name should be included in the output. You almost always want this to be TRUE.\n\nTo make our motivating example concrete, this book contains a folder with 12 excel spreadsheets containing data from the gapminder package. Each file contains one year’s worth of data for 142 countries. We can list them all with the appropriate call to list.files():\n\npaths &lt;- list.files(\"data/gapminder\", pattern = \"[.]xlsx$\", full.names = TRUE)\npaths\n#&gt;  [1] \"data/gapminder/1952.xlsx\" \"data/gapminder/1957.xlsx\"\n#&gt;  [3] \"data/gapminder/1962.xlsx\" \"data/gapminder/1967.xlsx\"\n#&gt;  [5] \"data/gapminder/1972.xlsx\" \"data/gapminder/1977.xlsx\"\n#&gt;  [7] \"data/gapminder/1982.xlsx\" \"data/gapminder/1987.xlsx\"\n#&gt;  [9] \"data/gapminder/1992.xlsx\" \"data/gapminder/1997.xlsx\"\n#&gt; [11] \"data/gapminder/2002.xlsx\" \"data/gapminder/2007.xlsx\"\n\n\n\n27.3.2 Lists\nNow that we have these 12 paths, we could call read_excel() 12 times to get 12 data frames:\n\ngapminder_1952 &lt;- readxl::read_excel(\"data/gapminder/1952.xlsx\")\ngapminder_1957 &lt;- readxl::read_excel(\"data/gapminder/1957.xlsx\")\ngapminder_1962 &lt;- readxl::read_excel(\"data/gapminder/1962.xlsx\")\n ...,\ngapminder_2007 &lt;- readxl::read_excel(\"data/gapminder/2007.xlsx\")\n\nBut putting each sheet into its own variable is going to make it hard to work with them a few steps down the road. Instead, they’ll be easier to work with if we put them into a single object. A list is the perfect tool for this job:\n\nfiles &lt;- list(\n  readxl::read_excel(\"data/gapminder/1952.xlsx\"),\n  readxl::read_excel(\"data/gapminder/1957.xlsx\"),\n  readxl::read_excel(\"data/gapminder/1962.xlsx\"),\n  ...,\n  readxl::read_excel(\"data/gapminder/2007.xlsx\")\n)\n\nNow that you have these data frames in a list, how do you get one out? You can use files[[i]] to extract the ith element:\n\nfiles[[3]]\n#&gt; # A tibble: 142 × 5\n#&gt;   country     continent lifeExp      pop gdpPercap\n#&gt;   &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1 Afghanistan Asia         32.0 10267083      853.\n#&gt; 2 Albania     Europe       64.8  1728137     2313.\n#&gt; 3 Algeria     Africa       48.3 11000948     2551.\n#&gt; 4 Angola      Africa       34    4826015     4269.\n#&gt; 5 Argentina   Americas     65.1 21283783     7133.\n#&gt; 6 Australia   Oceania      70.9 10794968    12217.\n#&gt; # ℹ 136 more rows\n\nWe’ll come back to [[ in more detail in Section 28.3.\n\n\n27.3.3 purrr::map() and list_rbind()\nThe code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one-by-one. Happily, we can use purrr::map() to make even better use of our paths vector. map() is similar toacross(), but instead of doing something to each column in a data frame, it does something to each element of a vector.map(x, f) is shorthand for:\n\nlist(\n  f(x[[1]]),\n  f(x[[2]]),\n  ...,\n  f(x[[n]])\n)\n\nSo we can use map() to get a list of 12 data frames:\n\nfiles &lt;- map(paths, readxl::read_excel)\nlength(files)\n#&gt; [1] 12\n\nfiles[[1]]\n#&gt; # A tibble: 142 × 5\n#&gt;   country     continent lifeExp      pop gdpPercap\n#&gt;   &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1 Afghanistan Asia         28.8  8425333      779.\n#&gt; 2 Albania     Europe       55.2  1282697     1601.\n#&gt; 3 Algeria     Africa       43.1  9279525     2449.\n#&gt; 4 Angola      Africa       30.0  4232095     3521.\n#&gt; 5 Argentina   Americas     62.5 17876956     5911.\n#&gt; 6 Australia   Oceania      69.1  8691212    10040.\n#&gt; # ℹ 136 more rows\n\n(This is another data structure that doesn’t display particularly compactly with str() so you might want to load it into RStudio and inspect it with View()).\nNow we can use purrr::list_rbind() to combine that list of data frames into a single data frame:\n\nlist_rbind(files)\n#&gt; # A tibble: 1,704 × 5\n#&gt;   country     continent lifeExp      pop gdpPercap\n#&gt;   &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1 Afghanistan Asia         28.8  8425333      779.\n#&gt; 2 Albania     Europe       55.2  1282697     1601.\n#&gt; 3 Algeria     Africa       43.1  9279525     2449.\n#&gt; 4 Angola      Africa       30.0  4232095     3521.\n#&gt; 5 Argentina   Americas     62.5 17876956     5911.\n#&gt; 6 Australia   Oceania      69.1  8691212    10040.\n#&gt; # ℹ 1,698 more rows\n\nOr we could do both steps at once in a pipeline:\n\npaths |&gt; \n  map(readxl::read_excel) |&gt; \n  list_rbind()\n\nWhat if we want to pass in extra arguments to read_excel()? We use the same technique that we used with across(). For example, it’s often useful to peek at the first few rows of the data with n_max = 1:\n\npaths |&gt; \n  map(\\(path) readxl::read_excel(path, n_max = 1)) |&gt; \n  list_rbind()\n#&gt; # A tibble: 12 × 5\n#&gt;   country     continent lifeExp      pop gdpPercap\n#&gt;   &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1 Afghanistan Asia         28.8  8425333      779.\n#&gt; 2 Afghanistan Asia         30.3  9240934      821.\n#&gt; 3 Afghanistan Asia         32.0 10267083      853.\n#&gt; 4 Afghanistan Asia         34.0 11537966      836.\n#&gt; 5 Afghanistan Asia         36.1 13079460      740.\n#&gt; 6 Afghanistan Asia         38.4 14880372      786.\n#&gt; # ℹ 6 more rows\n\nThis makes it clear that something is missing: there’s no year column because that value is recorded in the path, not in the individual files. We’ll tackle that problem next.\n\n\n27.3.4 Data in the path\nSometimes the name of the file is data itself. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things:\nFirst, we name the vector of paths. The easiest way to do this is with the set_names() function, which can take a function. Here we use basename() to extract just the file name from the full path:\n\npaths |&gt; set_names(basename) \n#&gt;                  1952.xlsx                  1957.xlsx \n#&gt; \"data/gapminder/1952.xlsx\" \"data/gapminder/1957.xlsx\" \n#&gt;                  1962.xlsx                  1967.xlsx \n#&gt; \"data/gapminder/1962.xlsx\" \"data/gapminder/1967.xlsx\" \n#&gt;                  1972.xlsx                  1977.xlsx \n#&gt; \"data/gapminder/1972.xlsx\" \"data/gapminder/1977.xlsx\" \n#&gt;                  1982.xlsx                  1987.xlsx \n#&gt; \"data/gapminder/1982.xlsx\" \"data/gapminder/1987.xlsx\" \n#&gt;                  1992.xlsx                  1997.xlsx \n#&gt; \"data/gapminder/1992.xlsx\" \"data/gapminder/1997.xlsx\" \n#&gt;                  2002.xlsx                  2007.xlsx \n#&gt; \"data/gapminder/2002.xlsx\" \"data/gapminder/2007.xlsx\"\n\nThose names are automatically carried along by all the map functions, so the list of data frames will have those same names:\n\nfiles &lt;- paths |&gt; \n  set_names(basename) |&gt; \n  map(readxl::read_excel)\n\nThat makes this call to map() shorthand for:\n\nfiles &lt;- list(\n  \"1952.xlsx\" = readxl::read_excel(\"data/gapminder/1952.xlsx\"),\n  \"1957.xlsx\" = readxl::read_excel(\"data/gapminder/1957.xlsx\"),\n  \"1962.xlsx\" = readxl::read_excel(\"data/gapminder/1962.xlsx\"),\n  ...,\n  \"2007.xlsx\" = readxl::read_excel(\"data/gapminder/2007.xlsx\")\n)\n\nYou can also use [[ to extract elements by name:\n\nfiles[[\"1962.xlsx\"]]\n#&gt; # A tibble: 142 × 5\n#&gt;   country     continent lifeExp      pop gdpPercap\n#&gt;   &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1 Afghanistan Asia         32.0 10267083      853.\n#&gt; 2 Albania     Europe       64.8  1728137     2313.\n#&gt; 3 Algeria     Africa       48.3 11000948     2551.\n#&gt; 4 Angola      Africa       34    4826015     4269.\n#&gt; 5 Argentina   Americas     65.1 21283783     7133.\n#&gt; 6 Australia   Oceania      70.9 10794968    12217.\n#&gt; # ℹ 136 more rows\n\nThen we use the names_to argument to list_rbind() to tell it to save the names into a new column called year then use readr::parse_number() to extract the number from the string.\n\npaths |&gt; \n  set_names(basename) |&gt; \n  map(readxl::read_excel) |&gt; \n  list_rbind(names_to = \"year\") |&gt; \n  mutate(year = parse_number(year))\n#&gt; # A tibble: 1,704 × 6\n#&gt;    year country     continent lifeExp      pop gdpPercap\n#&gt;   &lt;dbl&gt; &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1  1952 Afghanistan Asia         28.8  8425333      779.\n#&gt; 2  1952 Albania     Europe       55.2  1282697     1601.\n#&gt; 3  1952 Algeria     Africa       43.1  9279525     2449.\n#&gt; 4  1952 Angola      Africa       30.0  4232095     3521.\n#&gt; 5  1952 Argentina   Americas     62.5 17876956     5911.\n#&gt; 6  1952 Australia   Oceania      69.1  8691212    10040.\n#&gt; # ℹ 1,698 more rows\n\nIn more complicated cases, there might be other variables stored in the directory name, or maybe the file name contains multiple bits of data. In that case, use set_names() (without any arguments) to record the full path, and then use tidyr::separate_wider_delim() and friends to turn them into useful columns.\n\npaths |&gt; \n  set_names() |&gt; \n  map(readxl::read_excel) |&gt; \n  list_rbind(names_to = \"year\") |&gt; \n  separate_wider_delim(year, delim = \"/\", names = c(NA, \"dir\", \"file\")) |&gt; \n  separate_wider_delim(file, delim = \".\", names = c(\"file\", \"ext\"))\n#&gt; # A tibble: 1,704 × 8\n#&gt;   dir       file  ext   country     continent lifeExp      pop gdpPercap\n#&gt;   &lt;chr&gt;     &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1 gapminder 1952  xlsx  Afghanistan Asia         28.8  8425333      779.\n#&gt; 2 gapminder 1952  xlsx  Albania     Europe       55.2  1282697     1601.\n#&gt; 3 gapminder 1952  xlsx  Algeria     Africa       43.1  9279525     2449.\n#&gt; 4 gapminder 1952  xlsx  Angola      Africa       30.0  4232095     3521.\n#&gt; 5 gapminder 1952  xlsx  Argentina   Americas     62.5 17876956     5911.\n#&gt; 6 gapminder 1952  xlsx  Australia   Oceania      69.1  8691212    10040.\n#&gt; # ℹ 1,698 more rows\n\n\n\n27.3.5 Save your work\nNow that you’ve done all this hard work to get to a nice tidy data frame, it’s a great time to save your work:\n\ngapminder &lt;- paths |&gt; \n  set_names(basename) |&gt; \n  map(readxl::read_excel) |&gt; \n  list_rbind(names_to = \"year\") |&gt; \n  mutate(year = parse_number(year))\n\nwrite_csv(gapminder, \"gapminder.csv\")\n\nNow when you come back to this problem in the future, you can read in a single csv file. For large and richer datasets, using parquet might be a better choice than .csv, as discussed in Section 23.4.\nIf you’re working in a project, we suggest calling the file that does this sort of data prep work something like 0-cleanup.R. The 0 in the file name suggests that this should be run before anything else.\nIf your input data files change over time, you might consider learning a tool like targets to set up your data cleaning code to automatically re-run whenever one of the input files is modified.\n\n\n27.3.6 Many simple iterations\nHere we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you’re often better by doing multiple simple iterations.\nFor example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps then call map() once:\n\nprocess_file &lt;- function(path) {\n  df &lt;- read_csv(path)\n  \n  df |&gt; \n    filter(!is.na(id)) |&gt; \n    mutate(id = tolower(id)) |&gt; \n    pivot_longer(jan:dec, names_to = \"month\")\n}\n\npaths |&gt; \n  map(process_file) |&gt; \n  list_rbind()\n\nAlternatively, you could perform each step of process_file() to every file:\n\npaths |&gt; \n  map(read_csv) |&gt; \n  map(\\(df) df |&gt; filter(!is.na(id))) |&gt; \n  map(\\(df) df |&gt; mutate(id = tolower(id))) |&gt; \n  map(\\(df) df |&gt; pivot_longer(jan:dec, names_to = \"month\")) |&gt; \n  list_rbind()\n\nWe recommend this approach because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.\nIn this particular example, there’s another optimization you could make, by binding all the data frames together earlier. Then you can rely on regular dplyr behavior:\n\npaths |&gt; \n  map(read_csv) |&gt; \n  list_rbind() |&gt; \n  filter(!is.na(id)) |&gt; \n  mutate(id = tolower(id)) |&gt; \n  pivot_longer(jan:dec, names_to = \"month\")\n\n\n\n27.3.7 Heterogeneous data\nUnfortunately, sometimes it’s not possible to go from map() straight to list_rbind() because the data frames are so heterogeneous that list_rbind() either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:\n\nfiles &lt;- paths |&gt; \n  map(readxl::read_excel) \n\nThen a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills. One way to do so is with this handy df_types function6 that returns a tibble with one row for each column:\n\ndf_types &lt;- function(df) {\n  tibble(\n    col_name = names(df), \n    col_type = map_chr(df, vctrs::vec_ptype_full),\n    n_miss = map_int(df, \\(x) sum(is.na(x)))\n  )\n}\n\ndf_types(gapminder)\n#&gt; # A tibble: 6 × 3\n#&gt;   col_name  col_type  n_miss\n#&gt;   &lt;chr&gt;     &lt;chr&gt;      &lt;int&gt;\n#&gt; 1 year      double         0\n#&gt; 2 country   character      0\n#&gt; 3 continent character      0\n#&gt; 4 lifeExp   double         0\n#&gt; 5 pop       double         0\n#&gt; 6 gdpPercap double         0\n\nYou can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:\n\nfiles |&gt; \n  map(df_types) |&gt; \n  list_rbind(names_to = \"file_name\") |&gt; \n  select(-n_miss) |&gt; \n  pivot_wider(names_from = col_name, values_from = col_type)\n#&gt; # A tibble: 12 × 6\n#&gt;   file_name country   continent lifeExp pop    gdpPercap\n#&gt;   &lt;chr&gt;     &lt;chr&gt;     &lt;chr&gt;     &lt;chr&gt;   &lt;chr&gt;  &lt;chr&gt;    \n#&gt; 1 1952.xlsx character character double  double double   \n#&gt; 2 1957.xlsx character character double  double double   \n#&gt; 3 1962.xlsx character character double  double double   \n#&gt; 4 1967.xlsx character character double  double double   \n#&gt; 5 1972.xlsx character character double  double double   \n#&gt; 6 1977.xlsx character character double  double double   \n#&gt; # ℹ 6 more rows\n\nIf the files have heterogeneous formats, you might need to do more processing before you can successfully merge them. Unfortunately, we’re now going to leave you to figure that out on your own, but you might want to read about map_if() and map_at(). map_if() allows you to selectively modify elements of a list based on their values; map_at() allows you to selectively modify elements based on their names.\n\n\n27.3.8 Handling failures\nSometimes the structure of your data might be sufficiently wild that you can’t even read all the files with a single command. And then you’ll encounter one of the downsides of map(): it succeeds or fails as a whole. map() will either successfully read all of the files in a directory or fail with an error, reading zero files. This is annoying: why does one failure prevent you from accessing all the other successes?\nLuckily, purrr comes with a helper to tackle this problem: possibly(). possibly() is what’s known as a function operator: it takes a function and returns a function with modified behavior. In particular, possibly() changes a function from erroring to returning a value that you specify:\n\nfiles &lt;- paths |&gt; \n  map(possibly(\\(path) readxl::read_excel(path), NULL))\n\ndata &lt;- files |&gt; list_rbind()\n\nThis works particularly well here because list_rbind(), like many tidyverse functions, automatically ignores NULLs.\nNow you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:\n\nfailed &lt;- map_vec(files, is.null)\npaths[failed]\n#&gt; character(0)\n\nThen call the import function again for each failure and figure out what went wrong.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>27</span>  <span class='chapter-title'>Iteration</span>"
    ]
  },
  {
    "objectID": "iteration.html#saving-multiple-outputs",
    "href": "iteration.html#saving-multiple-outputs",
    "title": "27  Iteration",
    "section": "27.4 Saving multiple outputs",
    "text": "27.4 Saving multiple outputs\nIn the last section, you learned about map(), which is useful for reading multiple files into a single object. In this section, we’ll now explore sort of the opposite problem: how can you take one or more R objects and save it to one or more files? We’ll explore this challenge using three examples:\n\nSaving multiple data frames into one database.\nSaving multiple data frames into multiple .csv files.\nSaving multiple plots to multiple .png files.\n\n\n27.4.1 Writing to a database\nSometimes when working with many files at once, it’s not possible to fit all your data into memory at once, and you can’t do map(files, read_csv). One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.\nIf you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s duckdb_read_csv():\n\ncon &lt;- DBI::dbConnect(duckdb::duckdb())\nduckdb::duckdb_read_csv(con, \"gapminder\", paths)\n\nThis would work well here, but we don’t have csv files, instead we have excel spreadsheets. So we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of csvs and the database that you’re working with doesn’t have one function that will load them all in.\nWe need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. For the gapminder data, we can make that template by reading a single file and adding the year to it:\n\ntemplate &lt;- readxl::read_excel(paths[[1]])\ntemplate$year &lt;- 1952\ntemplate\n#&gt; # A tibble: 142 × 6\n#&gt;   country     continent lifeExp      pop gdpPercap  year\n#&gt;   &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1 Afghanistan Asia         28.8  8425333      779.  1952\n#&gt; 2 Albania     Europe       55.2  1282697     1601.  1952\n#&gt; 3 Algeria     Africa       43.1  9279525     2449.  1952\n#&gt; 4 Angola      Africa       30.0  4232095     3521.  1952\n#&gt; 5 Argentina   Americas     62.5 17876956     5911.  1952\n#&gt; 6 Australia   Oceania      69.1  8691212    10040.  1952\n#&gt; # ℹ 136 more rows\n\nNow we can connect to the database, and use DBI::dbCreateTable() to turn our template into a database table:\n\ncon &lt;- DBI::dbConnect(duckdb::duckdb())\nDBI::dbCreateTable(con, \"gapminder\", template)\n\ndbCreateTable() doesn’t use the data in template, just the variable names and types. So if we inspect the gapminder table now you’ll see that it’s empty but it has the variables we need with the types we expect:\n\ncon |&gt; tbl(\"gapminder\")\n#&gt; # Source:   table&lt;gapminder&gt; [0 x 6]\n#&gt; # Database: DuckDB v1.1.2 [root@Darwin 23.4.0:R 4.4.1/:memory:]\n#&gt; # ℹ 6 variables: country &lt;chr&gt;, continent &lt;chr&gt;, lifeExp &lt;dbl&gt;, pop &lt;dbl&gt;,\n#&gt; #   gdpPercap &lt;dbl&gt;, year &lt;dbl&gt;\n\nNext, we need a function that takes a single file path, reads it into R, and adds the result to the gapminder table. We can do that by combining read_excel() with DBI::dbAppendTable():\n\nappend_file &lt;- function(path) {\n  df &lt;- readxl::read_excel(path)\n  df$year &lt;- parse_number(basename(path))\n  \n  DBI::dbAppendTable(con, \"gapminder\", df)\n}\n\nNow we need to call append_file() once for each element of paths. That’s certainly possible with map():\n\npaths |&gt; map(append_file)\n\nBut we don’t care about the output of append_file(), so instead of map() it’s slightly nicer to use walk(). walk() does exactly the same thing as map() but throws the output away:\n\npaths |&gt; walk(append_file)\n\nNow we can see if we have all the data in our table:\n\ncon |&gt; \n  tbl(\"gapminder\") |&gt; \n  count(year)\n#&gt; # Source:   SQL [?? x 2]\n#&gt; # Database: DuckDB v1.1.2 [root@Darwin 23.4.0:R 4.4.1/:memory:]\n#&gt;    year     n\n#&gt;   &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  1952   142\n#&gt; 2  1957   142\n#&gt; 3  1972   142\n#&gt; 4  1977   142\n#&gt; 5  1987   142\n#&gt; 6  2007   142\n#&gt; # ℹ more rows\n\n\n\n27.4.2 Writing csv files\nThe same basic principle applies if we want to write multiple csv files, one for each group. Let’s imagine that we want to take the ggplot2::diamonds data and save one csv file for each clarity. First we need to make those individual datasets. There are many ways you could do that, but there’s one way we particularly like: group_nest().\n\nby_clarity &lt;- diamonds |&gt; \n  group_nest(clarity)\n\nby_clarity\n#&gt; # A tibble: 8 × 2\n#&gt;   clarity               data\n#&gt;   &lt;ord&gt;   &lt;list&lt;tibble[,9]&gt;&gt;\n#&gt; 1 I1               [741 × 9]\n#&gt; 2 SI2            [9,194 × 9]\n#&gt; 3 SI1           [13,065 × 9]\n#&gt; 4 VS2           [12,258 × 9]\n#&gt; 5 VS1            [8,171 × 9]\n#&gt; 6 VVS2           [5,066 × 9]\n#&gt; # ℹ 2 more rows\n\nThis gives us a new tibble with eight rows and two columns. clarity is our grouping variable and data is a list-column containing one tibble for each unique value of clarity:\n\nby_clarity$data[[1]]\n#&gt; # A tibble: 741 × 9\n#&gt;   carat cut       color depth table price     x     y     z\n#&gt;   &lt;dbl&gt; &lt;ord&gt;     &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1  0.32 Premium   E      60.9    58   345  4.38  4.42  2.68\n#&gt; 2  1.17 Very Good J      60.2    61  2774  6.83  6.9   4.13\n#&gt; 3  1.01 Premium   F      61.8    60  2781  6.39  6.36  3.94\n#&gt; 4  1.01 Fair      E      64.5    58  2788  6.29  6.21  4.03\n#&gt; 5  0.96 Ideal     F      60.7    55  2801  6.37  6.41  3.88\n#&gt; 6  1.04 Premium   G      62.2    58  2801  6.46  6.41  4   \n#&gt; # ℹ 735 more rows\n\nWhile we’re here, let’s create a column that gives the name of output file, using mutate() and str_glue():\n\nby_clarity &lt;- by_clarity |&gt; \n  mutate(path = str_glue(\"diamonds-{clarity}.csv\"))\n\nby_clarity\n#&gt; # A tibble: 8 × 3\n#&gt;   clarity               data path             \n#&gt;   &lt;ord&gt;   &lt;list&lt;tibble[,9]&gt;&gt; &lt;glue&gt;           \n#&gt; 1 I1               [741 × 9] diamonds-I1.csv  \n#&gt; 2 SI2            [9,194 × 9] diamonds-SI2.csv \n#&gt; 3 SI1           [13,065 × 9] diamonds-SI1.csv \n#&gt; 4 VS2           [12,258 × 9] diamonds-VS2.csv \n#&gt; 5 VS1            [8,171 × 9] diamonds-VS1.csv \n#&gt; 6 VVS2           [5,066 × 9] diamonds-VVS2.csv\n#&gt; # ℹ 2 more rows\n\nSo if we were going to save these data frames by hand, we might write something like:\n\nwrite_csv(by_clarity$data[[1]], by_clarity$path[[1]])\nwrite_csv(by_clarity$data[[2]], by_clarity$path[[2]])\nwrite_csv(by_clarity$data[[3]], by_clarity$path[[3]])\n...\nwrite_csv(by_clarity$by_clarity[[8]], by_clarity$path[[8]])\n\nThis is a little different to our previous uses of map() because there are two arguments that are changing, not just one. That means we need a new function: map2(), which varies both the first and second arguments. And because we again don’t care about the output, we want walk2() rather than map2(). That gives us:\n\nwalk2(by_clarity$data, by_clarity$path, write_csv)\n\n\n\n27.4.3 Saving plots\nWe can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:\n\ncarat_histogram &lt;- function(df) {\n  ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1)  \n}\n\ncarat_histogram(by_clarity$data[[1]])\n\n\n\n\n\n\n\n\nNow we can use map() to create a list of many plots7 and their eventual file paths:\n\nby_clarity &lt;- by_clarity |&gt; \n  mutate(\n    plot = map(data, carat_histogram),\n    path = str_glue(\"clarity-{clarity}.png\")\n  )\n\nThen use walk2() with ggsave() to save each plot:\n\nwalk2(\n  by_clarity$path,\n  by_clarity$plot,\n  \\(path, plot) ggsave(path, plot, width = 6, height = 6)\n)\n\nThis is shorthand for:\n\nggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)\nggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)\nggsave(by_clarity$path[[3]], by_clarity$plot[[3]], width = 6, height = 6)\n...\nggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>27</span>  <span class='chapter-title'>Iteration</span>"
    ]
  },
  {
    "objectID": "iteration.html#summary",
    "href": "iteration.html#summary",
    "title": "27  Iteration",
    "section": "27.5 Summary",
    "text": "27.5 Summary\nIn this chapter, you’ve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the Functionals chapter of Advanced R and consulting the purrr website.\nIf you know much about iteration in other languages, you might be surprised that we didn’t discuss the for loop. That’s because R’s orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each columns or each group. And when you can’t, you can often use a functional programming tool like map() that does something to each element of a list. However, you will see for loops in wild-caught code, so you’ll learn about them in the next chapter where we’ll discuss some important base R tools.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>27</span>  <span class='chapter-title'>Iteration</span>"
    ]
  },
  {
    "objectID": "iteration.html#footnotes",
    "href": "iteration.html#footnotes",
    "title": "27  Iteration",
    "section": "",
    "text": "Anonymous, because we never explicitly gave it a name with &lt;-. Another term programmers use for this is “lambda function”.↩︎\nIn older code you might see syntax that looks like ~ .x + 1. This is another way to write anonymous functions but it only works inside tidyverse functions and always uses the variable name .x. We now recommend the base syntax, \\(x) x + 1.↩︎\nYou can’t currently change the order of the columns, but you could reorder them after the fact using relocate() or similar.↩︎\nMaybe there will be one day, but currently we don’t see how.↩︎\nIf you instead had a directory of csv files with the same format, you can use the technique from Section 8.4.↩︎\nWe’re not going to explain how it works, but if you look at the docs for the functions used, you should be able to puzzle it out.↩︎\nYou can print by_clarity$plot to get a crude animation — you’ll get one plot for each element of plots. NOTE: this didn’t happen for me.↩︎",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>27</span>  <span class='chapter-title'>Iteration</span>"
    ]
  },
  {
    "objectID": "base-R.html",
    "href": "base-R.html",
    "title": "28  A field guide to base R",
    "section": "",
    "text": "28.1 Introduction\nTo finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code you’ll encounter in the wild.\nThis is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, increasing the consistency across functions, and making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a lot of base R functions: from library() to load packages, to sum() and mean() for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like +, -, /, *, |, &, and !. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.\nAfter you read this book, you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll undoubtedly encounter these other approaches when you start reading R code written by others, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!\nIn this chapter, we’ll focus on four big topics: subsetting with [, subsetting with [[ and $, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two essential plotting functions.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "base-R.html#introduction",
    "href": "base-R.html#introduction",
    "title": "28  A field guide to base R",
    "section": "",
    "text": "28.1.1 Prerequisites\nThis package focuses on base R so doesn’t have any real prerequisites, but we’ll load the tidyverse in order to explain some of the differences.\n\nlibrary(tidyverse)",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "base-R.html#sec-subset-many",
    "href": "base-R.html#sec-subset-many",
    "title": "28  A field guide to base R",
    "section": "28.2 Selecting multiple elements with [",
    "text": "28.2 Selecting multiple elements with [\n[ is used to extract sub-components from vectors and data frames, and is called like x[i] or x[i, j]. In this section, we’ll introduce you to the power of [, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of [.\n\n28.2.1 Subsetting vectors\nThere are five main types of things that you can subset a vector with, i.e., that can be the i in x[i]:\n\nA vector of positive integers. Subsetting with positive integers keeps the elements at those positions:\n\nx &lt;- c(\"one\", \"two\", \"three\", \"four\", \"five\")\nx[c(3, 2, 5)]\n#&gt; [1] \"three\" \"two\"   \"five\"\n\nBy repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.\n\nx[c(1, 1, 5, 5, 5, 2)]\n#&gt; [1] \"one\"  \"one\"  \"five\" \"five\" \"five\" \"two\"\n\nA vector of negative integers. Negative values drop the elements at the specified positions:\n\nx[c(-1, -3, -5)]\n#&gt; [1] \"two\"  \"four\"\n\nA logical vector. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.\n\nx &lt;- c(10, 3, NA, 5, 8, 1, NA)\n\n# All non-missing values of x\nx[!is.na(x)]\n#&gt; [1] 10  3  5  8  1\n\n# All even (or missing!) values of x\nx[x %% 2 == 0]\n#&gt; [1] 10 NA  8 NA\n\nUnlike filter(), NA indices will be included in the output as NAs.\nA character vector. If you have a named vector, you can subset it with a character vector:\n\nx &lt;- c(abc = 1, def = 2, xyz = 5)\nx[c(\"xyz\", \"def\")]\n#&gt; xyz def \n#&gt;   5   2\n\nAs with subsetting with positive integers, you can use a character vector to duplicate individual entries.\nNothing. The final type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but as we’ll see shortly, it is useful when subsetting 2d structures like tibbles.\n\n\n\n28.2.2 Subsetting data frames\nThere are quite a few different ways1 that you can use [ with a data frame, but the most important way is to select rows and columns independently with df[rows, cols]. Here rows and cols are vectors as described above. For example, df[rows, ] and df[, cols] select just rows or just columns, using the empty subset to preserve the other dimension.\nHere are a couple of examples:\n\ndf &lt;- tibble(\n  x = 1:3, \n  y = c(\"a\", \"e\", \"f\"), \n  z = runif(3)\n)\n\n# Select first row and second column\ndf[1, 2]\n#&gt; # A tibble: 1 × 1\n#&gt;   y    \n#&gt;   &lt;chr&gt;\n#&gt; 1 a\n\n# Select all rows and columns x and y\ndf[, c(\"x\" , \"y\")]\n#&gt; # A tibble: 3 × 2\n#&gt;       x y    \n#&gt;   &lt;int&gt; &lt;chr&gt;\n#&gt; 1     1 a    \n#&gt; 2     2 e    \n#&gt; 3     3 f\n\n# Select rows where `x` is greater than 1 and all columns\ndf[df$x &gt; 1, ]\n#&gt; # A tibble: 2 × 3\n#&gt;       x y         z\n#&gt;   &lt;int&gt; &lt;chr&gt; &lt;dbl&gt;\n#&gt; 1     2 e     0.834\n#&gt; 2     3 f     0.601\n\nWe’ll come back to $ shortly, but you should be able to guess what df$x does from the context: it extracts the x variable from df. We need to use it here because [ doesn’t use tidy evaluation, so you need to be explicit about the source of the x variable.\nThere’s an important difference between tibbles and data frames when it comes to [. In this book, we’ve mainly used tibbles, which are data frames, but they tweak some behaviors to make your life a little easier. In most places, you can use “tibble” and “data frame” interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write data.frame. If df is a data.frame, then df[, cols] will return a vector if col selects a single column and a data frame if it selects more than one column. If df is a tibble, then [ will always return a tibble.\n\ndf1 &lt;- data.frame(x = 1:3)\ndf1[, \"x\"]\n#&gt; [1] 1 2 3\n\ndf2 &lt;- tibble(x = 1:3)\ndf2[, \"x\"]\n#&gt; # A tibble: 3 × 1\n#&gt;       x\n#&gt;   &lt;int&gt;\n#&gt; 1     1\n#&gt; 2     2\n#&gt; 3     3\n\nOne way to avoid this ambiguity with data.frames is to explicitly specify drop = FALSE:\n\ndf1[, \"x\" , drop = FALSE]\n#&gt;   x\n#&gt; 1 1\n#&gt; 2 2\n#&gt; 3 3\n\n\n\n28.2.3 dplyr equivalents\nSeveral dplyr verbs are special cases of [:\n\nfilter() is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:\n\ndf &lt;- tibble(\n  x = c(2, 3, 1, 1, NA), \n  y = letters[1:5], \n  z = runif(5)\n)\ndf |&gt; filter(x &gt; 1)\n\n# same as\ndf[!is.na(df$x) & df$x &gt; 1, ]\n\nAnother common technique in the wild is to use which() for its side-effect of dropping missing values: df[which(df$x &gt; 1), ].\narrange() is equivalent to subsetting the rows with an integer vector, usually created with order():\n\ndf |&gt; arrange(x, y)\n\n# same as\ndf[order(df$x, df$y), ]\n\nYou can use order(decreasing = TRUE) to sort all columns in descending order or -rank(col) to sort columns in decreasing order individually.\nBoth select() and relocate() are similar to subsetting the columns with a character vector:\n\ndf |&gt; select(x, z)\n\n# same as\ndf[, c(\"x\", \"z\")]\n\n\nBase R also provides a function that combines the features of filter() and select()2 called subset():\n\ndf |&gt; \n  filter(x &gt; 1) |&gt; \n  select(y, z)\n#&gt; # A tibble: 2 × 2\n#&gt;   y           z\n#&gt;   &lt;chr&gt;   &lt;dbl&gt;\n#&gt; 1 a     0.157  \n#&gt; 2 b     0.00740\n\n\n# same as\ndf |&gt; subset(x &gt; 1, c(y, z))\n\nThis function was the inspiration for much of dplyr’s syntax.\n\n\n28.2.4 Exercises\n\nCreate functions that take a vector as input and return:\n\nThe elements at even-numbered positions.\nEvery element except the last value.\nOnly even values (and no missing values).\n\nWhy is x[-which(x &gt; 0)] not the same as x[x &lt;= 0]? Read the documentation for which() and do some experiments to figure it out.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "base-R.html#sec-subset-one",
    "href": "base-R.html#sec-subset-one",
    "title": "28  A field guide to base R",
    "section": "28.3 Selecting a single element with $ and [[",
    "text": "28.3 Selecting a single element with $ and [[\n[, which selects many elements, is paired with [[ and $, which extract a single element. In this section, we’ll show you how to use [[ and $ to pull columns out of data frames, discuss a couple more differences between data.frames and tibbles, and emphasize some important differences between [ and [[ when used with lists.\n\n28.3.1 Data frames\n[[ and $ can be used to extract columns out of a data frame. [[ can access by position or by name, and $ is specialized for access by name:\n\ntb &lt;- tibble(\n  x = 1:4,\n  y = c(10, 4, 1, 21)\n)\n\n# by position\ntb[[1]]\n#&gt; [1] 1 2 3 4\n\n# by name\ntb[[\"x\"]]\n#&gt; [1] 1 2 3 4\ntb$x\n#&gt; [1] 1 2 3 4\n\nThey can also be used to create new columns, the base R equivalent of mutate():\n\ntb$z &lt;- tb$x + tb$y\ntb\n#&gt; # A tibble: 4 × 3\n#&gt;       x     y     z\n#&gt;   &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;\n#&gt; 1     1    10    11\n#&gt; 2     2     4     6\n#&gt; 3     3     1     4\n#&gt; 4     4    21    25\n\nThere are several other base R approaches to creating new columns including with transform(), with(), and within(). Hadley collected a few examples at https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf.\nUsing $ directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of cut, there’s no need to use summarize():\n\nmax(diamonds$carat)\n#&gt; [1] 5.01\n\nlevels(diamonds$cut)\n#&gt; [1] \"Fair\"      \"Good\"      \"Very Good\" \"Premium\"   \"Ideal\"\n\ndplyr also provides an equivalent to [[/$ that we didn’t mention in Chapter 4: pull(). pull() takes either a variable name or variable position and returns just that column. That means we could rewrite the above code to use the pipe:\n\ndiamonds |&gt; pull(carat) |&gt; max()\n#&gt; [1] 5.01\n\ndiamonds |&gt; pull(cut) |&gt; levels()\n#&gt; [1] \"Fair\"      \"Good\"      \"Very Good\" \"Premium\"   \"Ideal\"\n\n\n\n28.3.2 Tibbles\nThere are a couple of important differences between tibbles and base data.frames when it comes to $. Data frames match the prefix of any variable names (so-called partial matching) and don’t complain if a column doesn’t exist:\n\ndf &lt;- data.frame(x1 = 1)\ndf$x\n#&gt; [1] 1\ndf$z\n#&gt; NULL\n\nTibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:\n\ntb &lt;- tibble(x1 = 1)\n\ntb$x\n#&gt; Warning: Unknown or uninitialised column: `x`.\n#&gt; NULL\ntb$z\n#&gt; Warning: Unknown or uninitialised column: `z`.\n#&gt; NULL\n\nFor this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.\n\n\n28.3.3 Lists\n[[ and $ are also really important for working with lists, and it’s important to understand how they differ from [. Let’s illustrate the differences with a list named l:\n\nl &lt;- list(\n  a = 1:3, \n  b = \"a string\", \n  c = pi, \n  d = list(-1, -5)\n)\n\n\n[ extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.\n\nstr(l[1:2])\n#&gt; List of 2\n#&gt;  $ a: int [1:3] 1 2 3\n#&gt;  $ b: chr \"a string\"\n\nstr(l[1])\n#&gt; List of 1\n#&gt;  $ a: int [1:3] 1 2 3\n\nstr(l[4])\n#&gt; List of 1\n#&gt;  $ d:List of 2\n#&gt;   ..$ : num -1\n#&gt;   ..$ : num -5\n\nLike with vectors, you can subset with a logical, integer, or character vector.\n[[ and $ extract a single component from a list. They remove a level of hierarchy from the list.\n\nstr(l[[1]])\n#&gt;  int [1:3] 1 2 3\n\nstr(l[[4]])\n#&gt; List of 2\n#&gt;  $ : num -1\n#&gt;  $ : num -5\n\nstr(l$a)\n#&gt;  int [1:3] 1 2 3\n\n\nThe difference between [ and [[ is particularly important for lists because [[ drills down into the list while [ returns a new, smaller list. To help you remember the difference, take a look at the unusual pepper shaker shown in Figure 28.1. If this pepper shaker is your list pepper, then, pepper[1] is a pepper shaker containing a single pepper packet. pepper[2] would look the same, but would contain the second packet. pepper[1:2] would be a pepper shaker containing two pepper packets. pepper[[1]] would extract the pepper packet itself.\n\n\n\n\n\n\n\n\nFigure 28.1: (Left) A pepper shaker that Hadley once found in his hotel room. (Middle) pepper[1]. (Right) pepper[[1]]\n\n\n\n\n\nThis same principle applies when you use 1d [ with a data frame: df[\"x\"] returns a one-column data frame and df[[\"x\"]] returns a vector.\n\n\n28.3.4 Exercises\n\nWhat happens when you use [[ with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?\nWhat would pepper[[1]][1] be? What about pepper[[1]][[1]]?",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "base-R.html#apply-family",
    "href": "base-R.html#apply-family",
    "title": "28  A field guide to base R",
    "section": "28.4 Apply family",
    "text": "28.4 Apply family\nIn Chapter 27, you learned tidyverse techniques for iteration like dplyr::across() and the map family of functions. In this section, you’ll learn about their base equivalents, the apply family. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.\nThe most important member of this family is lapply(), which is very similar to purrr::map()3. In fact, because we haven’t used any of map()’s more advanced features, you can replace every map() call in Chapter 27 with lapply().\nThere’s no exact base R equivalent to across() but you can get close by using [ with lapply(). This works because under the hood, data frames are lists of columns, so calling lapply() on a data frame applies the function to each column.\n\ndf &lt;- tibble(a = 1, b = 2, c = \"a\", d = \"b\", e = 4)\n\n# First find numeric columns\nnum_cols &lt;- sapply(df, is.numeric)\nnum_cols\n#&gt;     a     b     c     d     e \n#&gt;  TRUE  TRUE FALSE FALSE  TRUE\n\n# Then transform each column with lapply() then replace the original values\ndf[, num_cols] &lt;- lapply(df[, num_cols, drop = FALSE], \\(x) x * 2)\ndf\n#&gt; # A tibble: 1 × 5\n#&gt;       a     b c     d         e\n#&gt;   &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;\n#&gt; 1     2     4 a     b         8\n\nThe code above uses a new function, sapply(). It’s similar to lapply() but it always tries to simplify the result, hence the s in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called map_vec() that we didn’t mention in Chapter 27.\nBase R provides a stricter version of sapply() called vapply(), short for vector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the sapply() call above with this vapply() where we specify that we expect is.numeric() to return a logical vector of length 1:\n\nvapply(df, is.numeric, logical(1))\n#&gt;     a     b     c     d     e \n#&gt;  TRUE  TRUE FALSE FALSE  TRUE\n\nThe distinction between sapply() and vapply() is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.\nAnother important member of the apply family is tapply() which computes a single grouped summary:\n\ndiamonds |&gt; \n  group_by(cut) |&gt; \n  summarize(price = mean(price))\n#&gt; # A tibble: 5 × 2\n#&gt;   cut       price\n#&gt;   &lt;ord&gt;     &lt;dbl&gt;\n#&gt; 1 Fair      4359.\n#&gt; 2 Good      3929.\n#&gt; 3 Very Good 3982.\n#&gt; 4 Premium   4584.\n#&gt; 5 Ideal     3458.\n\ntapply(diamonds$price, diamonds$cut, mean)\n#&gt;      Fair      Good Very Good   Premium     Ideal \n#&gt;  4358.758  3928.864  3981.760  4584.258  3457.542\n\nUnfortunately tapply() returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use tapply() or other base techniques to perform other grouped summaries, Hadley has collected a few techniques in a gist.\nThe final member of the apply family is the titular apply(), which works with matrices and arrays. In particular, watch out for apply(df, 2, something), which is a slow and potentially dangerous way of doing lapply(df, something). This rarely comes up in data science because we usually work with data frames and not matrices.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "base-R.html#for-loops",
    "href": "base-R.html#for-loops",
    "title": "28  A field guide to base R",
    "section": "28.5 for loops",
    "text": "28.5 for loops\nfor loops are the fundamental building block of iteration that both the apply and map families use under the hood. for loops are powerful and general tools that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:\n\nfor (element in vector) {\n  # do something with element\n}\n\nThe most straightforward use of for loops is to achieve the same effect as walk(): call some function with a side-effect on each element of a list. For example, in Section 27.4.1 instead of using walk():\n\npaths |&gt; walk(append_file)\n\nWe could have used a for loop:\n\nfor (path in paths) {\n  append_file(path)\n}\n\nThings get a little trickier if you want to save the output of the for loop, for example reading all of the excel files in a directory like we did in Chapter 27:\n\npaths &lt;- dir(\"data/gapminder\", pattern = \"\\\\.xlsx$\", full.names = TRUE)\nfiles &lt;- map(paths, readxl::read_excel)\n\nThere are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as paths, which we can create with vector():\n\nfiles &lt;- vector(\"list\", length(paths))\n\nThen instead of iterating over the elements of paths, we’ll iterate over their indices, using seq_along() to generate one index for each element of paths:\n\nseq_along(paths)\n#&gt;  [1]  1  2  3  4  5  6  7  8  9 10 11 12\n\nUsing the indices is important because it allows us to link to each position in the input with the corresponding position in the output:\n\nfor (i in seq_along(paths)) {\n  files[[i]] &lt;- readxl::read_excel(paths[[i]])\n}\n\nTo combine the list of tibbles into a single tibble you can use do.call() + rbind():\n\ndo.call(rbind, files)\n#&gt; # A tibble: 1,704 × 5\n#&gt;   country     continent lifeExp      pop gdpPercap\n#&gt;   &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;\n#&gt; 1 Afghanistan Asia         28.8  8425333      779.\n#&gt; 2 Albania     Europe       55.2  1282697     1601.\n#&gt; 3 Algeria     Africa       43.1  9279525     2449.\n#&gt; 4 Angola      Africa       30.0  4232095     3521.\n#&gt; 5 Argentina   Americas     62.5 17876956     5911.\n#&gt; 6 Australia   Oceania      69.1  8691212    10040.\n#&gt; # ℹ 1,698 more rows\n\nRather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:\n\nout &lt;- NULL\nfor (path in paths) {\n  out &lt;- rbind(out, readxl::read_excel(path))\n}\n\nWe recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that for loops are slow: they’re not, but iteratively growing a vector is.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "base-R.html#plots",
    "href": "base-R.html#plots",
    "title": "28  A field guide to base R",
    "section": "28.6 Plots",
    "text": "28.6 Plots\nMany R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise — it takes very little typing to do a basic exploratory plot.\nThere are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with plot() and hist() respectively. Here’s a quick example from the diamonds dataset:\n# Left\nhist(diamonds$carat)\n\n# Right\nplot(diamonds$carat, diamonds$price)\n\n\n\n\n\n\n\n\n\n\nNote that base plotting functions work with vectors, so you need to pull columns out of the data frame using $ or some other technique.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "base-R.html#summary",
    "href": "base-R.html#summary",
    "title": "28  A field guide to base R",
    "section": "28.7 Summary",
    "text": "28.7 Summary\nIn this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.\nThis chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can program in R. We hope these chapters have sparked your interest in programming and that you’re looking forward to learning more outside of this book.",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "base-R.html#footnotes",
    "href": "base-R.html#footnotes",
    "title": "28  A field guide to base R",
    "section": "",
    "text": "Read https://adv-r.hadley.nz/subsetting.html#subset-multiple to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.↩︎\nBut it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like starts_with().↩︎\nIt just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.↩︎",
    "crumbs": [
      "Program",
      "<span class='chapter-number'>28</span>  <span class='chapter-title'>A field guide to base R</span>"
    ]
  },
  {
    "objectID": "communicate.html",
    "href": "communicate.html",
    "title": "Communicate",
    "section": "",
    "text": "So far, you’ve learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation, and visualization. However, it doesn’t matter how great your analysis is unless you can explain it to others: you need to communicate your results.\n\n\n\n\n\n\n\n\nFigure 1: Communication is the final part of the data science process; if you can’t communicate your results to other humans, it doesn’t matter how great your analysis is.\n\n\n\n\n\nCommunication is the theme of the following two chapters:\n\nIn 29  Quarto, you will learn about Quarto, a tool for integrating prose, code, and results. You can use Quarto for analyst-to-analyst communication as well as analyst-to-decision-maker communication. Thanks to the power of Quarto formats, you can even use the same document for both purposes.\nIn 30  Quarto formats, you’ll learn a little about the many other varieties of outputs you can produce using Quarto, including dashboards, websites, and books.\n\nThese chapters focus mostly on the technical mechanics of communication, not the really hard problems of communicating your thoughts to other humans. However, there are lot of other great books about communication, which we’ll point you to at the end of each chapter.",
    "crumbs": [
      "Communicate"
    ]
  },
  {
    "objectID": "quarto.html",
    "href": "quarto.html",
    "title": "29  Quarto",
    "section": "",
    "text": "29.1 Introduction\nQuarto provides a unified authoring framework for data science, combining your code, its results, and your prose. Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.\nQuarto files are designed to be used in three ways:\nQuarto is a command line interface tool, not an R package. This means that help is, by-and-large, not available through ?. Instead, as you work through this chapter, and use Quarto in the future, you should refer to the Quarto documentation.\nIf you’re an R Markdown user, you might be thinking “Quarto sounds a lot like R Markdown”. You’re not wrong! Quarto unifies the functionality of many packages from the R Markdown ecosystem (rmarkdown, bookdown, distill, xaringan, etc.) into a single consistent system as well as extends it with native support for multiple programming languages like Python and Julia in addition to R. In a way, Quarto reflects everything that was learned from expanding and supporting the R Markdown ecosystem over a decade.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#introduction",
    "href": "quarto.html#introduction",
    "title": "29  Quarto",
    "section": "",
    "text": "For communicating to decision-makers, who want to focus on the conclusions, not the code behind the analysis.\nFor collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).\nAs an environment in which to do data science, as a modern-day lab notebook where you can capture not only what you did, but also what you were thinking.\n\n\n\n\n29.1.1 Prerequisites\nYou need the Quarto command line interface (Quarto CLI), but you don’t need to explicitly install it or load it, as RStudio automatically does both when needed.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#quarto-basics",
    "href": "quarto.html#quarto-basics",
    "title": "29  Quarto",
    "section": "29.2 Quarto basics",
    "text": "29.2 Quarto basics\nThis is a Quarto file – a plain text file that has the extension .qmd:\n\n---\ntitle: \"Diamond sizes\"\ndate: 2022-09-12\nformat: html\n---\n\n```{r}\n#| label: setup\n#| include: false\n\nlibrary(tidyverse)\n\nsmaller &lt;- diamonds |&gt; \n  filter(carat &lt;= 2.5)\n```\n\nWe have data about `r nrow(diamonds)` diamonds.\nOnly `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats.\nThe distribution of the remainder is shown below:\n\n```{r}\n#| label: plot-smaller-diamonds\n#| echo: false\n\nsmaller |&gt; \n  ggplot(aes(x = carat)) + \n  geom_freqpoly(binwidth = 0.01)\n```\n\nIt contains three important types of content:\n\nAn (optional) YAML header surrounded by ---s.\nChunks of R code surrounded by ```.\nText mixed with simple text formatting like # heading and _italics_.\n\nFigure 29.1 shows a .qmd document in RStudio with notebook interface where code and output are interleaved. You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter. RStudio executes the code and displays the results inline with the code.\n\n\n\n\n\n\n\n\nFigure 29.1: A Quarto document in RStudio. Code and output interleaved in the document, with the plot output appearing right underneath the code.\n\n\n\n\n\nIf you don’t like seeing your plots and output in your document and would rather make use of RStudio’s Console and Plot panes, you can click on the gear icon next to “Render” and switch to “Chunk Output in Console”, as shown in Figure 29.2.\n\n\n\n\n\n\n\n\nFigure 29.2: A Quarto document in RStudio with the plot output in the Plots pane.\n\n\n\n\n\nTo produce a complete report containing all text, code, and results, click “Render” or press Cmd/Ctrl + Shift + K. You can also do this programmatically with quarto::quarto_render(\"diamond-sizes.qmd\"). This will display the report in the viewer pane as shown in Figure 29.3 and create an HTML file.\n\n\n\n\n\n\n\n\nFigure 29.3: A Quarto document in RStudio with the rendered document in the Viewer pane.\n\n\n\n\n\nWhen you render the document, Quarto sends the .qmd file to knitr, https://yihui.org/knitr/, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, https://pandoc.org, which is responsible for creating the finished file. This process is shown in Figure 29.4. The advantage of this two step workflow is that you can create a very wide range of output formats, as you’ll learn about in Chapter 30.\n\n\n\n\n\n\n\n\nFigure 29.4: Diagram of Quarto workflow from qmd, to knitr, to md, to pandoc, to output in PDF, MS Word, or HTML formats.\n\n\n\n\n\nTo get started with your own .qmd file, select File &gt; New File &gt; Quarto Document… in the menu bar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of Quarto work.\nThe following sections dive into the three components of a Quarto document in more details: the markdown text, the code chunks, and the YAML header.\n\n29.2.1 Exercises\n\nCreate a new Quarto document using File &gt; New File &gt; Quarto Document. Read the instructions. Practice running the chunks individually. Then render the document by clicking the appropriate button and then by using the appropriate keyboard short cut. Verify that you can modify the code, re-run it, and see modified output.\nCreate one new Quarto document for each of the three built-in formats: HTML, PDF and Word. Render each of the three documents. How do the outputs differ? How do the inputs differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.)",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#visual-editor",
    "href": "quarto.html#visual-editor",
    "title": "29  Quarto",
    "section": "29.3 Visual editor",
    "text": "29.3 Visual editor\nThe Visual editor in RStudio provides a WYSIWYM interface for authoring Quarto documents. Under the hood, prose in Quarto documents (.qmd files) is written in Markdown, a lightweight set of conventions for formatting plain text files. In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown that Quarto understands), including tables, citations, cross-references, footnotes, divs/spans, definition lists, attributes, raw HTML/TeX, and more as well as support for executing code cells and viewing their output inline. While Markdown is designed to be easy to read and write, as you will see in Section 29.4, it still requires learning new syntax. Therefore, if you’re new to computational documents like .qmd files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor.\nIn the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all ⌘ + / or Ctrl + / shortcut to insert just about anything. If you are at the beginning of a line (as shown in Figure 29.5), you can also enter just / to invoke the shortcut.\n\n\n\n\n\n\n\n\nFigure 29.5: Quarto visual editor.\n\n\n\n\n\nInserting images and customizing how they are displayed is also facilitated with the visual editor. You can either paste an image from your clipboard directly into the visual editor (and RStudio will place a copy of that image in the project directory and link to it) or you can use the visual editor’s Insert &gt; Figure / Image menu to browse to the image you want to insert or paste it’s URL. In addition, using the same menu you can resize the image as well as add a caption, alternative text, and a link.\nThe visual editor has many more features that we haven’t enumerated here that you might find useful as you gain experience authoring with it.\nMost importantly, while the visual editor displays your content with formatting, under the hood, it saves your content in plain Markdown and you can switch back and forth between the visual and source editors to view and edit your content using either tool.\n\n29.3.1 Exercises\n\nRe-create the document in Figure 29.5 using the visual editor.\nUsing the visual editor, insert a code chunk using the Insert menu and then the insert anything tool.\nUsing the visual editor, figure out how to:\n\nAdd a footnote.\nAdd a horizontal rule.\nAdd a block quote.\n\nIn the visual editor, go to Insert &gt; Citation and insert a citation to the paper titled Welcome to the Tidyverse using its DOI (digital object identifier), which is 10.21105/joss.01686. Render the document and observe how the reference shows up in the document. What change do you observe in the YAML of your document?",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#sec-source-editor",
    "href": "quarto.html#sec-source-editor",
    "title": "29  Quarto",
    "section": "29.4 Source editor",
    "text": "29.4 Source editor\nYou can also edit Quarto documents using the Source editor in RStudio, without the assist of the Visual editor. While the Visual editor will feel familiar to those with experience writing in tools like Google docs, the Source editor will feel familiar to those with experience writing R scripts or R Markdown documents. The Source editor can also be useful for debugging any Quarto syntax errors since it’s often easier to catch these in plain text.\nThe guide below shows how to use Pandoc’s Markdown for authoring Quarto documents in the source editor.\n\n## Text formatting\n\n*italic* **bold** ~~strikeout~~ `code`\n\nsuperscript^2^ subscript~2~\n\n[underline]{.underline} [small caps]{.smallcaps}\n\n## Headings\n\n# 1st Level Header\n\n## 2nd Level Header\n\n### 3rd Level Header\n\n## Lists\n\n-   Bulleted list item 1\n\n-   Item 2\n\n    -   Item 2a\n\n    -   Item 2b\n\n1.  Numbered list item 1\n\n2.  Item 2.\n    The numbers are incremented automatically in the output.\n\n## Links and images\n\n&lt;http://example.com&gt;\n\n[linked phrase](http://example.com)\n\n![optional caption text](quarto.png){fig-alt=\"Quarto logo and the word quarto spelled in small case letters\"}\n\n## Tables\n\n| First Header | Second Header |\n|--------------|---------------|\n| Content Cell | Content Cell  |\n| Content Cell | Content Cell  |\n\nThe best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won’t need to think about them. If you forget, you can get to a handy reference sheet with Help &gt; Markdown Quick Reference.\n\n29.4.1 Exercises\n\nPractice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.\nUsing the source editor and the Markdown quick reference, figure out how to:\n\nAdd a footnote.\nAdd a horizontal rule.\nAdd a block quote.\n\nCopy and paste the contents of diamond-sizes.qmd from https://github.com/hadley/r4ds/tree/main/quarto in to a local R Quarto document. Check that you can run it, then add text after the frequency polygon that describes its most striking features.\nCreate a document in a Google doc or MS Word (or locate a document you have created previously) with some content in it such as headings, hyperlinks, formatted text, etc. Copy the contents of this document and paste it into a Quarto document in the visual editor. Then, switch over to the source editor and inspect the source code.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#code-chunks",
    "href": "quarto.html#code-chunks",
    "title": "29  Quarto",
    "section": "29.5 Code chunks",
    "text": "29.5 Code chunks\nTo run code inside a Quarto document, you need to insert a chunk. There are three ways to do so:\n\nThe keyboard shortcut Cmd + Option + I / Ctrl + Alt + I.\nThe “Insert” button icon in the editor toolbar.\nBy manually typing the chunk delimiters ```{r} and ```.\n\nWe’d recommend you learn the keyboard shortcut. It will save you a lot of time in the long run!\nYou can continue to run the code using the keyboard shortcut that by now (we hope!) you know and love: Cmd/Ctrl + Enter. However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk. Think of a chunk like a function. A chunk should be relatively self-contained, and focused around a single task.\nThe following sections describe the chunk header which consists of ```{r}, followed by an optional chunk label and various other chunk options, each on their own line, marked by #|.\n\n29.5.1 Chunk label\nChunks can be given an optional label, e.g.\n\n```{r}\n#| label: simple-addition\n\n1 + 1\n```\n#&gt; [1] 2\n\nThis has three advantages:\n\nYou can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor:\n\n\n\n\n\n\n\n\n\nGraphics produced by the chunks will have useful names that make them easier to use elsewhere. More on that in Section 29.6.\nYou can set up networks of cached chunks to avoid re-performing expensive computations on every run. More on that in Section 29.8.\n\nYour chunk labels should be short but evocative and should not contain spaces. We recommend using dashes (-) to separate words (instead of underscores, _) and avoiding other special characters in chunk labels.\nYou are generally free to label your chunk however you like, but there is one chunk name that imbues special behavior: setup. When you’re in a notebook mode, the chunk named setup will be run automatically once, before any other code is run.\nAdditionally, chunk labels cannot be duplicated. Each chunk label must be unique.\n\n\n29.5.2 Chunk options\nChunk output can be customized with options, fields supplied to chunk header. Knitr provides almost 60 options that you can use to customize your code chunks. Here we’ll cover the most important chunk options that you’ll use frequently. You can see the full list at https://yihui.org/knitr/options.\nThe most important set of options controls if your code block is executed and what results are inserted in the finished report:\n\neval: false prevents code from being evaluated. (And obviously if the code is not run, no results will be generated). This is useful for displaying example code, or for disabling a large block of code without commenting each line.\ninclude: false runs the code, but doesn’t show the code or results in the final document. Use this for setup code that you don’t want cluttering your report.\necho: false prevents code, but not the results from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code.\nmessage: false or warning: false prevents messages or warnings from appearing in the finished file.\nresults: hide hides printed output; fig-show: hide hides plots.\nerror: true causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your .qmd. It’s also useful if you’re teaching R and want to deliberately include an error. The default, error: false causes rendering to fail if there is a single error in the document.\n\nEach of these chunk options get added to the header of the chunk, following #|, e.g., in the following chunk the result is not printed since eval is set to false.\n\n```{r}\n#| label: simple-multiplication\n#| eval: false\n\n2 * 2\n```\n\nThe following table summarizes which types of output each option suppresses:\n\n\n\n\n\n\n\n\n\n\n\n\nOption\nRun code\nShow code\nOutput\nPlots\nMessages\nWarnings\n\n\n\n\neval: false\nX\n\nX\nX\nX\nX\n\n\ninclude: false\n\nX\nX\nX\nX\nX\n\n\necho: false\n\nX\n\n\n\n\n\n\nresults: hide\n\n\nX\n\n\n\n\n\nfig-show: hide\n\n\n\nX\n\n\n\n\nmessage: false\n\n\n\n\nX\n\n\n\nwarning: false\n\n\n\n\n\nX\n\n\n\n\n\n29.5.3 Global options\nAs you work more with knitr, you will discover that some of the default chunk options don’t fit your needs and you want to change them.\nYou can do this by adding the preferred options in the document YAML, under execute. For example, if you are preparing a report for an audience who does not need to see your code but only your results and narrative, you might set echo: false at the document level. That will hide the code by default, so only showing the chunks you deliberately choose to show (with echo: true). You might consider setting message: false and warning: false, but that would make it harder to debug problems because you wouldn’t see any messages in the final document.\ntitle: \"My report\"\nexecute:\n  echo: false\nSince Quarto is designed to be multi-lingual (works with R as well as other languages like Python, Julia, etc.), all of the knitr options are not available at the document execution level since some of them only work with knitr and not other engines Quarto uses for running code in other languages (e.g., Jupyter). You can, however, still set these as global options for your document under the knitr field, under opts_chunk. For example, when writing books and tutorials we set:\ntitle: \"Tutorial\"\nknitr:\n  opts_chunk:\n    comment: \"#&gt;\"\n    collapse: true\nThis uses our preferred comment formatting and ensures that the code and output are kept closely entwined.\n\n\n29.5.4 Inline code\nThere is one other way to embed R code into a Quarto document: directly into the text, with: `r `. This can be very useful if you mention properties of your data in the text. For example, the example document used at the start of the chapter had:\n\nWe have data about `r nrow(diamonds)` diamonds. Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats. The distribution of the remainder is shown below:\n\nWhen the report is rendered, the results of these computations are inserted into the text:\n\nWe have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:\n\nWhen inserting numbers into text, format() is your friend. It allows you to set the number of digits so you don’t print to a ridiculous degree of precision, and a big.mark to make numbers easier to read. You might combine these into a helper function:\n\ncomma &lt;- function(x) format(x, digits = 2, big.mark = \",\")\ncomma(3452345)\n#&gt; [1] \"3,452,345\"\ncomma(.12358124331)\n#&gt; [1] \"0.12\"\n\n\n\n29.5.5 Exercises\n\nAdd a section that explores how diamond sizes vary by cut, color, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting echo: false on each chunk, set a global option.\nDownload diamond-sizes.qmd from https://github.com/hadley/r4ds/tree/main/quarto. Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes.\nModify diamonds-sizes.qmd to use label_comma() to produce nicely formatted output. Also include the percentage of diamonds that are larger than 2.5 carats.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#sec-figures",
    "href": "quarto.html#sec-figures",
    "title": "29  Quarto",
    "section": "29.6 Figures",
    "text": "29.6 Figures\nThe figures in a Quarto document can be embedded (e.g., a PNG or JPEG file) or generated as a result of a code chunk.\nTo embed an image from an external file, you can use the Insert menu in the Visual Editor in RStudio and select Figure / Image. This will pop open a menu where you can browse to the image you want to insert as well as add alternative text or caption to it and adjust its size. In the visual editor you can also simply paste an image from your clipboard into your document and RStudio will place a copy of that image in your project folder.\nIf you include a code chunk that generates a figure (e.g., includes a ggplot() call), the resulting figure will be automatically included in your Quarto document.\n\n29.6.1 Figure sizing\nThe biggest challenge of graphics in Quarto is getting your figures the right size and shape. There are five main options that control figure sizing: fig-width, fig-height, fig-asp, out-width and out-height. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e. height, width, and aspect ratio: pick two of three).\nWe recommend three of the five options:\n\nPlots tend to be more aesthetically pleasing if they have consistent width. To enforce this, set fig-width: 6 (6”) and fig-asp: 0.618 (the golden ratio) in the defaults. Then in individual chunks, only adjust fig-asp.\nControl the output size with out-width and set it to a percentage of the body width of the output document. We suggest to out-width: \"70%\" and fig-align: center.\nThat gives plots room to breathe, without taking up too much space.\nTo put multiple plots in a single row, set the layout-ncol to 2 for two plots, 3 for three plots, etc. This effectively sets out-width to “50%” for each of your plots if layout-ncol is 2, “33%” if layout-ncol is 3, etc. Depending on what you’re trying to illustrate (e.g., show data or show plot variations), you might also tweak fig-width, as discussed below.\n\nIf you find that you’re having to squint to read the text in your plot, you need to tweak fig-width. If fig-width is larger than the size the figure is rendered in the final doc, the text will be too small; if fig-width is smaller, the text will be too big. You’ll often need to do a little experimentation to figure out the right ratio between the fig-width and the eventual width in your document. To illustrate the principle, the following three plots have fig-width of 4, 6, and 8 respectively:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIf you want to make sure the font size is consistent across all your figures, whenever you set out-width, you’ll also need to adjust fig-width to maintain the same ratio with your default out-width. For example, if your default fig-width is 6 and out-width is “70%”, when you set out-width: \"50%\" you’ll need to set fig-width to 4.3 (6 * 0.5 / 0.7).\nFigure sizing and scaling is an art and science and getting things right can require an iterative trial-and-error approach. You can learn more about figure sizing in the taking control of plot scaling blog post.\n\n\n29.6.2 Other important options\nWhen mingling code and text, like in this book, you can set fig-show: hold so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.\nTo add a caption to the plot, use fig-cap. In Quarto this will change the figure from inline to “floating”.\nIf you’re producing PDF output, the default graphics type is PDF. This is a good default because PDFs are high quality vector graphics. However, they can produce very large and slow plots if you are displaying thousands of points. In that case, set fig-format: \"png\" to force the use of PNGs. They are slightly lower quality, but will be much more compact.\nIt’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse in other circumstances (e.g., if you want to quickly drop a single plot into an email).\n\n\n29.6.3 Exercises\n\nOpen diamond-sizes.qmd in the visual editor, find an image of a diamond, copy it, and paste it into the document. Double click on the image and add a caption. Resize the image and render your document. Observe how the image is saved in your current working directory.\nEdit the label of the code chunk in diamond-sizes.qmd that generates a plot to start with the prefix fig- and add a caption to the figure with the chunk option fig-cap. Then, edit the text above the code chunk to add a cross-reference to the figure with Insert &gt; Cross Reference.\nChange the size of the figure with the following chunk options, one at a time, render your document, and describe how the figure changes.\n\nfig-width: 10\nfig-height: 3\nout-width: \"100%\"\nout-width: \"20%\"",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#tables",
    "href": "quarto.html#tables",
    "title": "29  Quarto",
    "section": "29.7 Tables",
    "text": "29.7 Tables\nSimilar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.\nBy default, Quarto prints data frames and matrices as you’d see them in the console:\n\nmtcars[1:5, ]\n#&gt;                    mpg cyl disp  hp drat    wt  qsec vs am gear carb\n#&gt; Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4\n#&gt; Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4\n#&gt; Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1\n#&gt; Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1\n#&gt; Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2\n\nIf you prefer that data be displayed with additional formatting you can use the knitr::kable() function. The code below generates Table 29.1.\n\nknitr::kable(mtcars[1:5, ], )\n\n\n\nTable 29.1: A knitr kable.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nmpg\ncyl\ndisp\nhp\ndrat\nwt\nqsec\nvs\nam\ngear\ncarb\n\n\n\n\nMazda RX4\n21.0\n6\n160\n110\n3.90\n2.620\n16.46\n0\n1\n4\n4\n\n\nMazda RX4 Wag\n21.0\n6\n160\n110\n3.90\n2.875\n17.02\n0\n1\n4\n4\n\n\nDatsun 710\n22.8\n4\n108\n93\n3.85\n2.320\n18.61\n1\n1\n4\n1\n\n\nHornet 4 Drive\n21.4\n6\n258\n110\n3.08\n3.215\n19.44\n1\n0\n3\n1\n\n\nHornet Sportabout\n18.7\n8\n360\n175\n3.15\n3.440\n17.02\n0\n0\n3\n2\n\n\n\n\n\n\n\n\nRead the documentation for ?knitr::kable to see the other ways in which you can customize the table. For even deeper customization, consider the gt, huxtable, reactable, kableExtra, xtable, stargazer, pander, tables, and ascii packages. Each provides a set of tools for returning formatted tables from R code.\n\n29.7.1 Exercises\n\nOpen diamond-sizes.qmd in the visual editor, insert a code chunk, and add a table with knitr::kable() that shows the first 5 rows of the diamonds data frame.\nDisplay the same table with gt::gt() instead.\nAdd a chunk label that starts with the prefix tbl- and add a caption to the table with the chunk option tbl-cap. Then, edit the text above the code chunk to add a cross-reference to the table with Insert &gt; Cross Reference.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#sec-caching",
    "href": "quarto.html#sec-caching",
    "title": "29  Quarto",
    "section": "29.8 Caching",
    "text": "29.8 Caching\nNormally, each render of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you’ve captured every important computation in code. However, it can be painful if you have some computations that take a long time. The solution is cache: true.\nYou can enable the Knitr cache at the document level for caching the results of all computations in a document using standard YAML options:\n---\ntitle: \"My Document\"\nexecute: \n  cache: true\n---\nYou can also enable caching at the chunk level for caching the results of computation in a specific chunk:\n\n```{r}\n#| cache: true\n\n# code for lengthy computation...\n```\n\nWhen set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.\nThe caching system must be used with care, because by default it is based on the code only, not its dependencies. For example, here the processed_data chunk depends on the raw-data chunk:\n```{r}\n#| label: raw-data\n#| cache: true\n\nrawdata &lt;- readr::read_csv(\"a_very_large_file.csv\")\n```\n```{r}\n#| label: processed_data\n#| cache: true\n\nprocessed_data &lt;- rawdata |&gt; \n  filter(!is.na(import_var)) |&gt; \n  mutate(new_variable = complicated_transformation(x, y, z))\n```\nCaching the processed_data chunk means that it will get re-run if the dplyr pipeline is changed, but it won’t get rerun if the read_csv() call changes. You can avoid that problem with the dependson chunk option:\n```{r}\n#| label: processed-data\n#| cache: true\n#| dependson: \"raw-data\"\n\nprocessed_data &lt;- rawdata |&gt; \n  filter(!is.na(import_var)) |&gt; \n  mutate(new_variable = complicated_transformation(x, y, z))\n```\ndependson should contain a character vector of every chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies have changed.\nNote that the chunks won’t update if a_very_large_file.csv changes, because knitr caching only tracks changes within the .qmd file. If you want to also track changes to that file you can use the cache.extra option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is file.mtime(): it returns when it was last modified. Then you can write:\n```{r}\n#| label: raw-data\n#| cache: true\n#| cache.extra: !expr file.mtime(\"a_very_large_file.csv\")\n\nrawdata &lt;- readr::read_csv(\"a_very_large_file.csv\")\n```\nWe’ve followed the advice of David Robinson to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the dependson specification.\nAs your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with knitr::clean_cache().\n\n29.8.1 Exercises\n\nSet up a network of chunks where d depends on c and b, and both b and c depend on a. Have each chunk print lubridate::now(), set cache: true, then verify your understanding of caching.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#troubleshooting",
    "href": "quarto.html#troubleshooting",
    "title": "29  Quarto",
    "section": "29.9 Troubleshooting",
    "text": "29.9 Troubleshooting\nTroubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.\nOne common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. To address this issue, all you need to do is to change one of your duplicated labels.\nIf the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks”, either from the Code menu, under Run region or with the keyboard shortcut Ctrl + Alt + R. If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.\nIf that doesn’t help, there must be something different between your interactive environment and the Quarto environment. You’re going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto is the directory in which it lives. Check the working directory is what you expect by including getwd() in a chunk.\nNext, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your Quarto session. The easiest way to do that is to set error: true on the chunk causing the problem, then use print() and str() to check that settings are as you expect.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#yaml-header",
    "href": "quarto.html#yaml-header",
    "title": "29  Quarto",
    "section": "29.10 YAML header",
    "text": "29.10 YAML header\nYou can control many other “whole document” settings by tweaking the parameters of the YAML header. You might wonder what YAML stands for: it’s “YAML Ain’t Markup Language”, which is designed for representing hierarchical data in a way that’s easy for humans to read and write. Quarto uses it to control many details of the output. Here we’ll discuss three: self-contained documents, document parameters, and bibliographies.\n\n29.10.1 Self-contained\nHTML documents typically have a number of external dependencies (e.g., images, CSS style sheets, JavaScript, etc.) and, by default, Quarto places these dependencies in a _files folder in the same directory as your .qmd file. If you publish the HTML file on a hosting platform (e.g., QuartoPub, https://quartopub.com/), the dependencies in this directory are published with your document and hence are available in the published report. However, if you want to email the report to a colleague, you might prefer to have a single, self-contained, HTML document that embeds all of its dependencies. You can do this by specifying the embed-resources option:\nformat:\n  html:\n    embed-resources: true\nThe resulting file will be self-contained, such that it will need no external files and no internet access to be displayed properly by a browser.\n\n\n29.10.2 Parameters\nQuarto documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. To declare one or more parameters, use the params field.\nThis example uses a my_class parameter to determine which class of cars to display:\n\n---\nformat: html\nparams:\n  my_class: \"suv\"\n---\n\n```{r}\n#| label: setup\n#| include: false\n\nlibrary(tidyverse)\n\nclass &lt;- mpg |&gt; filter(class == params$my_class)\n```\n\n# Fuel economy for `r params$my_class`s\n\n```{r}\n#| message: false\n\nggplot(class, aes(x = displ, y = hwy)) + \n  geom_point() + \n  geom_smooth(se = FALSE)\n```\n\nAs you can see, parameters are available within the code chunks as a read-only list named params.\nYou can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with !expr. This is a good way to specify date/time parameters.\nparams:\n  start: !expr lubridate::ymd(\"2015-01-01\")\n  snapshot: !expr lubridate::ymd_hms(\"2015-01-01 12:30:00\")\n\n\n29.10.3 Bibliographies and Citations\nQuarto can automatically generate citations and a bibliography in a number of styles. The most straightforward way of adding citations and bibliographies to a Quarto document is using the visual editor in RStudio.\nTo add a citation using the visual editor, go to Insert &gt; Citation. Citations can be inserted from a variety of sources:\n\nDOI (Document Object Identifier) references.\nZotero personal or group libraries.\nSearches of Crossref, DataCite, or PubMed.\nYour document bibliography (a .bib file in the directory of your document)\n\nUnder the hood, the visual mode uses the standard Pandoc markdown representation for citations (e.g., [@citation]).\nIf you add a citation using one of the first three methods, the visual editor will automatically create a bibliography.bib file for you and add the reference to it. It will also add a bibliography field to the document YAML. As you add more references, this file will get populated with their citations. You can also directly edit this file using many common bibliography formats including BibLaTeX, BibTeX, EndNote, Medline.\nTo create a citation within your .qmd file in the source editor, use a key composed of ‘@’ + the citation identifier from the bibliography file. Then place the citation in square brackets. Here are some examples:\nSeparate multiple citations with a `;`: Blah blah [@smith04; @doe99].\n\nYou can add arbitrary comments inside the square brackets: \nBlah blah [see @doe99, pp. 33-35; also @smith04, ch. 1].\n\nRemove the square brackets to create an in-text citation: @smith04 \nsays blah, or @smith04 [p. 33] says blah.\n\nAdd a `-` before the citation to suppress the author's name: \nSmith says blah [-@smith04].\nWhen Quarto renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as # References or # Bibliography.\nYou can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the csl field:\nbibliography: rmarkdown.bib\ncsl: apa.csl\nAs with the bibliography field, your csl file should contain a path to the file. Here we assume that the csl file is in the same directory as the .qmd file. A good place to find CSL style files for common bibliography styles is https://github.com/citation-style-language/styles.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#workflow",
    "href": "quarto.html#workflow",
    "title": "29  Quarto",
    "section": "29.11 Workflow",
    "text": "29.11 Workflow\nEarlier, we discussed a basic workflow for capturing your R code where you work interactively in the console, then capture what works in the script editor. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.\nQuarto is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:\n\nRecords what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!\nSupports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.\nHelps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.\n\nMuch of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (https://colinpurrington.com/tips/lab-notebooks) to come up with the following tips:\n\nEnsure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.\nUse the YAML header date field to record the date you started working on the notebook:\ndate: 2016-08-23\nUse ISO8601 YYYY-MM-DD format so that’s there no ambiguity. Use it even if you don’t normally write dates that way!\nIf you spend a lot of time on an analysis idea and it turns out to be a dead end, don’t delete it! Write up a brief note about why it failed and leave it in the notebook. That will help you avoid going down the same dead end when you come back to the analysis in the future.\nGenerally, you’re better off doing data entry outside of R. But if you do need to record a small snippet of data, clearly lay it out using tibble::tribble().\nIf you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.\nBefore you finish for the day, make sure you can render the notebook. If you’re using caching, make sure to clear the caches. That will let you fix any problems while the code is still fresh in your mind.\nIf you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you’ll need to track the versions of the packages that your code uses. A rigorous approach is to use renv, https://rstudio.github.io/renv/index.html, which stores packages in your project directory. A quick and dirty hack is to include a chunk that runs sessionInfo() — that won’t let you easily recreate your packages as they are today, but at least you’ll know what they were.\nYou are going to create many, many, many analysis notebooks over the course of your career. How are you going to organize them so you can find them again in the future? We recommend storing them in individual projects, and coming up with a good naming scheme.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto.html#summary",
    "href": "quarto.html#summary",
    "title": "29  Quarto",
    "section": "29.12 Summary",
    "text": "29.12 Summary\nIn this chapter we introduced you to Quarto for authoring and publishing reproducible computational documents that include your code and your prose in one place. You’ve learned about writing Quarto documents in RStudio with the visual or the source editor, how code chunks work and how to customize options for them, how to include figures and tables in your Quarto documents, and options for caching for computations. Additionally, you’ve learned about adjusting YAML header options for creating self-contained or parametrized documents as well as including citations and bibliography. We have also given you some troubleshooting and workflow tips.\nWhile this introduction should be sufficient to get you started with Quarto, there is still a lot more to learn. Quarto is still relatively young, and is still growing rapidly. The best place to stay on top of innovations is the official Quarto website: https://quarto.org.\nThere are two important topics that we haven’t covered here: collaboration and the details of accurately communicating your ideas to other humans. Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub. We recommend “Happy Git with R”, a user friendly introduction to Git and GitHub from R users, by Jenny Bryan. The book is freely available online: https://happygitwithr.com.\nWe have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, we highly recommend reading either Style: Lessons in Clarity and Grace by Joseph M. Williams & Joseph Bizup, or The Sense of Structure: Writing from the Reader’s Perspective by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they’re used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at https://www.georgegopen.com/litigation-articles.html. They are aimed at lawyers, but almost everything applies to data scientists too.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>29</span>  <span class='chapter-title'>Quarto</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html",
    "href": "quarto-formats.html",
    "title": "30  Quarto formats",
    "section": "",
    "text": "30.1 Introduction\nSo far, you’ve seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.\nThere are two ways to set the output of a document:",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html#introduction",
    "href": "quarto-formats.html#introduction",
    "title": "30  Quarto formats",
    "section": "",
    "text": "Permanently, by modifying the YAML header:\ntitle: \"Diamond sizes\"\nformat: html\nTransiently, by calling quarto::quarto_render() by hand:\n\nquarto::quarto_render(\"diamond-sizes.qmd\", output_format = \"docx\")\n\nThis is useful if you want to programmatically produce multiple types of output since the output_format argument can also take a list of values.\n\nquarto::quarto_render(\"diamond-sizes.qmd\", output_format = c(\"docx\", \"pdf\"))",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html#output-options",
    "href": "quarto-formats.html#output-options",
    "title": "30  Quarto formats",
    "section": "30.2 Output options",
    "text": "30.2 Output options\nQuarto offers a wide range of output formats. You can find the complete list at https://quarto.org/docs/output-formats/all-formats.html. Many formats share some output options (e.g., toc: true for including a table of contents), but others have options that are format specific (e.g., code-fold: true collapses code chunks into a &lt;details&gt; tag for HTML output so the user can display it on demand, it’s not applicable in a PDF or Word document).\nTo override the default options, you need to use an expanded format field. For example, if you wanted to render an html with a floating table of contents, you’d use:\nformat:\n  html:\n    toc: true\n    toc_float: true\nYou can even render to multiple outputs by supplying a list of formats:\nformat:\n  html:\n    toc: true\n    toc_float: true\n  pdf: default\n  docx: default\nNote the special syntax (pdf: default) if you don’t want to override any default options.\nTo render to all formats specified in the YAML of a document, you can use output_format = \"all\".\n\nquarto::quarto_render(\"diamond-sizes.qmd\", output_format = \"all\")",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html#documents",
    "href": "quarto-formats.html#documents",
    "title": "30  Quarto formats",
    "section": "30.3 Documents",
    "text": "30.3 Documents\nThe previous chapter focused on the default html output. There are several basic variations on that theme, generating different types of documents. For example:\n\npdf makes a PDF with LaTeX (an open-source document layout system), which you’ll need to install. RStudio will prompt you if you don’t already have it.\ndocx for Microsoft Word (.docx) documents.\nodt for OpenDocument Text (.odt) documents.\nrtf for Rich Text Format (.rtf) documents.\ngfm for a GitHub Flavored Markdown (.md) document.\nipynb for Jupyter Notebooks (.ipynb).\n\nRemember, when generating a document to share with decision-makers, you can turn off the default display of code by setting global options in the document YAML:\nexecute:\n  echo: false\nFor html documents another option is to make the code chunks hidden by default, but visible with a click:\nformat:\n  html:\n    code: true",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html#presentations",
    "href": "quarto-formats.html#presentations",
    "title": "30  Quarto formats",
    "section": "30.4 Presentations",
    "text": "30.4 Presentations\nYou can also use Quarto to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each second (##) level header. Additionally, first (#) level headers indicate the beginning of a new section with a section title slide that is, by default, centered in the middle.\nQuarto supports a variety of presentation formats, including:\n\nrevealjs - HTML presentation with revealjs\npptx - PowerPoint presentation\nbeamer - PDF presentation with LaTeX Beamer.\n\nYou can read more about creating presentations with Quarto at https://quarto.org/docs/presentations.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html#interactivity",
    "href": "quarto-formats.html#interactivity",
    "title": "30  Quarto formats",
    "section": "30.5 Interactivity",
    "text": "30.5 Interactivity\nJust like any HTML document, HTML documents created with Quarto can contain interactive components as well. Here we introduce two options for including interactivity in your Quarto documents: htmlwidgets and Shiny.\n\n30.5.1 htmlwidgets\nHTML is an interactive format, and you can take advantage of that interactivity with htmlwidgets, R functions that produce interactive HTML visualizations. For example, take the leaflet map below. If you’re viewing this page on the web, you can drag the map around, zoom in and out, etc. You obviously can’t do that in a book, so Quarto automatically inserts a static screenshot for you.\n\nlibrary(leaflet)\nleaflet() |&gt;\n  setView(174.764, -36.877, zoom = 16) |&gt; \n  addTiles() |&gt;\n  addMarkers(174.764, -36.877, popup = \"Maungawhau\") \n\n\n\n\n\nThe great thing about htmlwidgets is that you don’t need to know anything about HTML or JavaScript to use them. All the details are wrapped inside the package, so you don’t need to worry about it.\nThere are many packages that provide htmlwidgets, including:\n\ndygraphs for interactive time series visualizations.\nDT for interactive tables.\nthreejs for interactive 3d plots.\nDiagrammeR for diagrams (like flow charts and simple node-link diagrams).\n\nTo learn more about htmlwidgets and see a complete list of packages that provide them visit https://www.htmlwidgets.org.\n\n\n30.5.2 Shiny\nhtmlwidgets provide client-side interactivity — all the interactivity happens in the browser, independently of R. On the one hand, that’s great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use shiny, a package that allows you to create interactivity using R code, not JavaScript.\nTo call Shiny code from a Quarto document, add server: shiny to the YAML header:\ntitle: \"Shiny Web App\"\nformat: html\nserver: shiny\nThen you can use the “input” functions to add interactive components to the document:\n\nlibrary(shiny)\n\ntextInput(\"name\", \"What is your name?\")\nnumericInput(\"age\", \"How old are you?\", NA, min = 0, max = 150)\n\n\n\n\n\n\n\n\n\n\nAnd you also need a code chunk with chunk option context: server which contains the code that needs to run in a Shiny server.\nYou can then refer to the values with input$name and input$age, and the code that uses them will be automatically re-run whenever they change.\nWe can’t show you a live shiny app here because shiny interactions occur on the server-side. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public-facing Shiny server if you want to publish this sort of interactivity online. That’s the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.\nFor learning more about Shiny, we recommend reading Mastering Shiny by Hadley Wickham, https://mastering-shiny.org.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html#websites-and-books",
    "href": "quarto-formats.html#websites-and-books",
    "title": "30  Quarto formats",
    "section": "30.6 Websites and books",
    "text": "30.6 Websites and books\nWith a bit of additional infrastructure, you can use Quarto to generate a complete website or book:\n\nPut your .qmd files in a single directory. index.qmd will become the home page.\nAdd a YAML file named _quarto.yml that provides the navigation for the site. In this file, set the project type to either book or website, e.g.:\nproject:\n  type: book\n\nFor example, the following _quarto.yml file creates a website from three source files: index.qmd (the home page), viridis-colors.qmd, and terrain-colors.qmd.\n\nproject:\n  type: website\n\nwebsite:\n  title: \"A website on color scales\"\n  navbar:\n    left:\n      - href: index.qmd\n        text: Home\n      - href: viridis-colors.qmd\n        text: Viridis colors\n      - href: terrain-colors.qmd\n        text: Terrain colors\n\nThe _quarto.yml file you need for a book is very similarly structured. The following example shows how you can create a book with four chapters that renders to three different outputs (html, pdf, and epub). Once again, the source files are .qmd files.\n\nproject:\n  type: book\n\nbook:\n  title: \"A book on color scales\"\n  author: \"Jane Coloriste\"\n  chapters:\n    - index.qmd\n    - intro.qmd\n    - viridis-colors.qmd\n    - terrain-colors.qmd\n\nformat:\n  html:\n    theme: cosmo\n  pdf: default\n  epub: default\n\nWe recommend that you use an RStudio project for your websites and books. Based on the _quarto.yml file, RStudio will recognize the type of project you’re working on, and add a Build tab to the IDE that you can use to render and preview your websites and books. Both websites and books can also be rendered using quarto::quarto_render().\nRead more at https://quarto.org/docs/websites about Quarto websites and https://quarto.org/docs/books about books.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html#other-formats",
    "href": "quarto-formats.html#other-formats",
    "title": "30  Quarto formats",
    "section": "30.7 Other formats",
    "text": "30.7 Other formats\nQuarto offers even more output formats:\n\nYou can write journal articles using Quarto Journal Templates: https://quarto.org/docs/journals/templates.html.\nYou can output Quarto documents to Jupyter Notebooks with format: ipynb: https://quarto.org/docs/reference/formats/ipynb.html.\n\nSee https://quarto.org/docs/output-formats/all-formats.html for a list of even more formats.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  },
  {
    "objectID": "quarto-formats.html#summary",
    "href": "quarto-formats.html#summary",
    "title": "30  Quarto formats",
    "section": "30.8 Summary",
    "text": "30.8 Summary\nIn this chapter we presented you a variety of options for communicating your results with Quarto, from static and interactive documents to presentations to websites and books.\nTo learn more about effective communication in these different formats, we recommend the following resources:\n\nTo improve your presentation skills, try Presentation Patterns by Neal Ford, Matthew McCollough, and Nathaniel Schutta. It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.\nIf you give academic talks, you might like the Leek group guide to giving talks.\nWe haven’t taken it ourselves, but we’ve heard good things about Matt McGarrity’s online course on public speaking: https://www.coursera.org/learn/public-speaking.\nIf you are creating many dashboards, make sure to read Stephen Few’s Information Dashboard Design: The Effective Visual Communication of Data. It will help you create dashboards that are truly useful, not just pretty to look at.\nEffectively communicating your ideas often benefits from some knowledge of graphic design. Robin Williams’ The Non-Designer’s Design Book is a great place to start.",
    "crumbs": [
      "Communicate",
      "<span class='chapter-number'>30</span>  <span class='chapter-title'>Quarto formats</span>"
    ]
  }
]