Extending functionality #1576

Closed
felixschweigkofler opened this issue Oct 12, 2024 · 1 comment


felixschweigkofler commented Oct 12, 2024

I would like to propose a feature that I have found useful. It may already exist and I just missed it, but if not, I think it would fit best in the tidyr package, and a possible name is extend().

It would duplicate the data frame and row-bind the copies, with the identifiers in selected columns overwritten by a placeholder (e.g. NA or 'all'). One copy is added for every combination of the selected columns. The extended data frame can then be passed to mutate or summarise to calculate a mean or any other statistic. Unlike the usual approach, the result contains not only rows with the group means but also a row with the mean of the entire data frame (if a single column was extended), or several rows for the different column groupings (if more than one column was extended).
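
For a single column the idea can be done by hand in a few lines. This is only a rough sketch: the toy data and the names toy, stacked and result are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data (values invented for this example)
toy <- data.frame(group = c("a", "a", "b"), value = c(1, 3, 8))

# Stack the original rows on a copy whose identifier is replaced by a placeholder
stacked <- bind_rows(toy, mutate(toy, group = "all"))

# Grouping now gives one row per group plus one row for the whole data frame
result <- stacked %>%
  group_by(group) %>%
  summarise(M = mean(value))
```

The row labelled 'all' holds the mean over every original row, next to the per-group means.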

For example, after extend()-ing it, a data frame with the height of individuals grouped by country and gender could be summarized not only by each country-gender combination but also by gender alone, by country alone, and over the entire dataset, all in a single output. In this simple example the workaround isn't too long: group the data frame differently and summarize again. But I worked on a case with several such combinations of columns, where grouping and summarizing separately would have taken a lot of space and effort.
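
To make the workaround concrete, here is roughly what grouping and summarizing separately looks like for the country/gender case. The data frame heights and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical example data: heights by country and gender (values made up)
heights <- data.frame(
  country = c("AT", "AT", "IT", "IT"),
  gender  = c("f", "m", "f", "m"),
  height  = c(166, 180, 164, 176)
)

# One summary per grouping level
by_both    <- heights %>% group_by(country, gender) %>% summarise(M = mean(height), .groups = "drop")
by_country <- heights %>% group_by(country) %>% summarise(M = mean(height), .groups = "drop")
by_gender  <- heights %>% group_by(gender) %>% summarise(M = mean(height), .groups = "drop")
overall    <- heights %>% summarise(M = mean(height))

# Stacked by hand; grouping columns missing from a summary are filled with NA
manual <- bind_rows(by_both, by_country, by_gender, overall)
```

With two columns this already takes four separate summaries, and the number grows quickly with every additional column.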

I have a fully functional package that is not as elegant as it could be, but I think it implements all the necessary logic and handles the special cases. Is there interest in such a feature?

The core of my function is something like the code below. My package adds other things, such as allowing particular combinations of columns to be excluded, or not duplicating certain groups when they have only a single unique entry in the column to be extended, because extending by that column would not be very useful there. (The columns to be extended cannot be columns used for grouping.)

library(dplyr)
library(rlang)
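extend <- function(df, ..., placeholder = NA) {
  # Names of the columns to extend, as strings
  extend_cols <- sapply(ensyms(...), as_string)
  # Every non-empty combination of those columns
  extn <- unlist(lapply(seq_along(extend_cols), function(x) combn(extend_cols, x, simplify = FALSE)), recursive = FALSE)
  # Start with the original data, then add one copy per combination
  df0 <- list(df)
  for (i in extn) {
    # Overwrite the identifiers in this combination with the placeholder
    df0[[paste(i, collapse = '')]] <- df %>%
      mutate(across(all_of(i), ~ placeholder))
  }
  return(bind_rows(df0))
}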

df <- data.frame(A = c('a','a','b','b'), B = c('k','l','m','n'), value = 1:4)

# The extended dataframe
df %>% 
  extend(A,B)

# The resulting summary
df %>% 
  extend(A,B) %>% 
  group_by(A,B) %>% 
  summarise(M = mean(value))
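
Excluding particular combinations, as mentioned above, could look roughly like this. It is a minimal sketch and not the code from my package: the function name extend_excl and the arguments cols and exclude are hypothetical, and columns are passed as strings to keep it short.

```r
library(dplyr)

# Hypothetical variant of extend(): `cols` is a character vector of columns to
# extend, `exclude` a list of character vectors naming combinations to skip.
extend_excl <- function(df, cols, placeholder = NA, exclude = list()) {
  # Every non-empty combination of the columns to extend
  combos <- unlist(
    lapply(seq_along(cols), function(k) combn(cols, k, simplify = FALSE)),
    recursive = FALSE
  )
  # Keep only combinations that do not match an entry of `exclude`
  is_excluded <- function(combo) {
    any(vapply(exclude, function(e) setequal(e, combo), logical(1)))
  }
  combos <- Filter(Negate(is_excluded), combos)
  # One copy of the data per remaining combination, identifiers overwritten
  copies <- lapply(combos, function(combo) {
    copy <- df
    copy[combo] <- placeholder
    copy
  })
  bind_rows(c(list(df), copies))
}

df <- data.frame(A = c('a','a','b','b'), B = c('k','l','m','n'), value = 1:4)

# Skip the copy in which both A and B are overwritten (no overall row)
res <- extend_excl(df, c("A", "B"), exclude = list(c("A", "B")))
```

Here res holds the original rows plus one copy with A overwritten and one with B overwritten, so a later summary has no row for the entire dataset.
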
DavisVaughan (Member) commented

This seems nice for an extension package but is probably a little too niche for tidyr, nice idea though!
