Extending functionality #1576

felixschweigkofler · 2024-10-12T22:13:20Z

I would like to propose a feature that I have personally found useful. Something that does this may already exist and I just missed it, but if not, I think it would fit best in the tidyr package, and a possible name is extend().

It would duplicate and row-bind the dataframe, with the identifiers in selected columns overwritten by a placeholder (e.g. NA or 'all'). This is done for every combination of the selected columns, each combination contributing its own row-bound duplicate of the original dataframe. The altered dataframe can then be fed to mutate or summarise to calculate the mean or any other statistic. Unlike the usual approach, the result would contain not only rows with the means of the groups, but also a row with the mean of the entire dataframe (if a single column was extended) or multiple rows for the different combinations of column groupings.

E.g. after extend()-ing it, a dataframe with the height of individuals grouped by country and gender could be summarized not only by each country-gender combination but also by gender alone, by country alone, and over the entire dataset, all in a single output. In this simple example the workaround isn't too long, just grouping the dataframe differently and summarizing again, but I worked with a case where I had several of these combinations of columns, and grouping and summarizing each one separately would have taken a lot of space and effort.
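For comparison, the manual workaround described above might look roughly like the following sketch. The `heights` data frame and its values are made up for illustration and are not from the original post.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only.
heights <- data.frame(
  country = c("AT", "AT", "IT", "IT"),
  gender  = c("f", "m", "f", "m"),
  height  = c(166, 180, 164, 176)
)

# One summary per grouping level, each written out separately.
by_both <- heights %>%
  group_by(country, gender) %>%
  summarise(M = mean(height), .groups = "drop")
by_country <- heights %>%
  group_by(country) %>%
  summarise(M = mean(height), .groups = "drop")
by_gender <- heights %>%
  group_by(gender) %>%
  summarise(M = mean(height), .groups = "drop")
overall <- heights %>%
  summarise(M = mean(height))

# bind_rows() fills grouping columns that a summary lacks with NA.
manual <- bind_rows(by_both, by_country, by_gender, overall)
manual
```

With two columns this already takes four separate summaries. The number of summaries roughly doubles with each additional column, which is the repetition the proposal is meant to avoid.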

I have a fully functional package. It is not as elegant as it could be, but I think it implements all the necessary logic and handles special cases. Is there interest in such a functionality?

The core of my function is something like the code below. I added other things as well, such as allowing particular combinations of columns to be excluded, or not duplicating certain groups when the column to be extended has only a single unique entry in them, because extending by that column would not be useful there. (The columns to be extended cannot be columns used for grouping.)
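As a rough illustration of the single-unique-entry check mentioned above, a helper along these lines could filter the candidate columns before they are extended. The name `drop_constant_cols` is hypothetical and not taken from the package, and this sketch checks uniqueness over the whole dataframe rather than per group, which is a simplification.

```r
# Hypothetical helper (not from the package described in this post):
# keep only the candidate columns that hold more than one distinct value.
drop_constant_cols <- function(df, cols) {
  varies <- vapply(
    cols,
    function(col) length(unique(df[[col]])) > 1,
    logical(1)
  )
  cols[varies]
}

# Column B holds a single value, so only A is worth extending here.
df_const <- data.frame(A = c("a", "a", "b", "b"), B = "k", value = 1:4)
kept <- drop_constant_cols(df_const, c("A", "B"))
kept
```

A per-group version would need to run the same check within each group, which this sketch does not attempt.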

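library(dplyr)
library(rlang)

# Row-bind the original data with one copy per non-empty combination of the
# selected columns, overwriting those columns with `placeholder` in each copy.
extend <- function(df, ..., placeholder = NA) {
  # Capture the bare column names passed via ... as plain strings
  extend_cols <- unname(vapply(ensyms(...), as_string, character(1)))
  # All non-empty combinations of the selected columns
  extn <- unlist(
    lapply(seq_along(extend_cols), function(x) combn(extend_cols, x, simplify = FALSE)),
    recursive = FALSE
  )
  df0 <- list(df)
  for (i in extn) {
    df0[[paste(i, collapse = "")]] <- df %>%
      mutate(across(all_of(i), ~ placeholder))
  }
  bind_rows(df0)
}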

df <- data.frame(A = c('a','a','b','b'), B = c('k','l','m','n'), value = 1:4)

# The extended dataframe
df %>% 
  extend(A,B)

# The resulting summary
df %>% 
  extend(A,B) %>% 
  group_by(A,B) %>% 
  summarise(M = mean(value))

DavisVaughan · 2024-10-24T13:42:25Z

This seems nice for an extension package but is probably a little too niche for tidyr. Nice idea though!

DavisVaughan closed this as completed Oct 24, 2024
