For example, in bind_rows, if the first input is a data.table, the output table can have a corrupted secondary index because of how the underlying dplyr_reconstruct function handles the attributes of the two inputs.
Reprex
The example below shows that the index attribute can be incorrect for the output.
> a <- data.table::data.table(cola = c(5, 2:4), colb = runif(4), colc = runif(4), cold = "c") # Create a data.table
> attributes(a)$.internal.selfref <- new("externalptr") # Set pointer to nil. This is necessary for the subset error below to happen in data.table. But it is not necessary to re-produce the corrupted index.
> a[cola == 4] # Give data.table a secondary index ("cola" column) by auto-indexing
    cola      colb      colc   cold
   <num>     <num>     <num> <char>
1:     4 0.1495679 0.6097216      c
>
> attributes(a)$index # The secondary index is set correctly
integer(0)
attr(,"__cola")
[1] 2 3 4 1
>
> b <- data.table::data.table(cola = -1, colb = 2, colc = 3, cold = "d")
>
> combined <- dplyr::bind_rows(list(a, b))
>
> combined # combined is a data.table, with 5 rows
Index: <cola>
    cola      colb      colc   cold
   <num>     <num>     <num> <char>
1:     5 0.5535855 0.6024416      c
2:     2 0.3407051 0.9291365      c
3:     3 0.5007208 0.6823528      c
4:     4 0.1495679 0.6097216      c
5:    -1 2.0000000 3.0000000      d
>
> attributes(combined)$index # Wrong! length of secondary index is only 4
integer(0)
attr(,"__cola")
[1] 2 3 4 1
> combined[cola == -1]
Empty data.table (0 rows and 4 cols): cola,colb,colc,cold # Wrong! The last row of combined should be returned
> combined
Index: <cola>
    cola       colb      colc   cold
   <num>      <num>     <num> <char>
1:     1 0.83105427 0.4214379      c
2:     2 0.05702599 0.1354883      c
3:     3 0.63866251 0.1644736      c
4:     4 0.21441544 0.2198251      c
5:    -1 2.00000000 3.0000000      d
Cause
In the bind_rows function, dplyr_reconstruct is used to set attributes for the output dataframe.
In the case above, all attributes of first (which has four rows), including index, are given to out, which has five rows. The copied index therefore describes only four rows and no longer matches the data, which causes the problem.
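The attribute copying described here can be imitated in base R without dplyr or data.table. The sketch below is only an illustration: reconstruct_sketch is a hypothetical stand-in written for this example, not dplyr's implementation, and the "index" attribute is attached by hand to a plain data.frame to mimic the one data.table creates.

```r
# Hypothetical stand-in for the behaviour described above: copy every
# attribute except names and row.names from `template` onto `data`.
# This is NOT dplyr's actual code, only an illustration of the idea.
reconstruct_sketch <- function(data, template) {
  attrs <- attributes(template)
  attrs$names <- NULL
  attrs$row.names <- NULL
  for (nm in names(attrs)) {
    attr(data, nm) <- attrs[[nm]]
  }
  data
}

# A 4-row input carrying a hand-made "index" attribute (one entry per row).
first <- data.frame(cola = c(5, 2, 3, 4))
attr(first, "index") <- structure(integer(0), "__cola" = c(2L, 3L, 4L, 1L))

# A 5-row result, as if one extra row had been bound on.
out <- data.frame(cola = c(5, 2, 3, 4, -1))
out <- reconstruct_sketch(out, first)

# The data has five rows, but the copied index still describes four.
n_rows <- nrow(out)
n_index <- length(attr(attr(out, "index"), "__cola"))
```

Because row-dependent attributes are copied blindly, any attribute whose validity depends on the number or order of rows can end up stale in the same way.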
Impact
Because the data.table produced by bind_rows has a corrupted secondary index, data.table's filtering skips some rows when filtering by the indexed column.
Also, I found that this problem is not limited to bind_rows. Other dplyr functions that call dplyr_reconstruct can produce data.tables with a corrupted secondary index. For example, the full_join function can also produce unexpected results due to a corrupted secondary index.
> a <- data.table::data.table(cola = c(1:4), colb = runif(4), colc= runif(4), cold = "d")
>
> attributes(a)$.internal.selfref <- new("externalptr") # Set pointer to nil
> a[cola == 3]
cola colb colc cold
<int> <num> <num> <char>
1: 3 0.9968646 0.8137836 d
>
> b <- data.table::data.table(cola = -1, cole = "e")
>
> combined <- dplyr::full_join(a, b, by = "cola")
>
> combined[cola==-1]
Empty data.table (0 rows and 5 cols): cola,colb,colc,cold,cole
Just putting the same examples with clearer (IMO) formatting:
Example 1
# Create a data.table
a <- data.table::data.table(cola = c(5, 2:4), colb = runif(4), colc = runif(4), cold = "c")
# Set pointer to nil. This is necessary for the subset error below to happen
# in data.table. But it is not necessary to re-produce the corrupted index.
attributes(a)$.internal.selfref <- new("externalptr")
# Give data.table a secondary index ("cola" column) by auto-indexing
a[cola == 4]
#>     cola      colb       colc   cold
#>    <num>     <num>      <num> <char>
#> 1:     4 0.8401062 0.09284545      c
# The secondary index is set correctly
attributes(a)$index
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
b <- data.table::data.table(cola = -1, colb = 2, colc = 3, cold = "d")
combined <- dplyr::bind_rows(list(a, b))
# combined is a data.table, with 5 rows
combined
#> Index: <cola>
#>     cola      colb       colc   cold
#>    <num>     <num>      <num> <char>
#> 1:     5 0.4526811 0.38061661      c
#> 2:     2 0.6131192 0.28859921      c
#> 3:     3 0.7053851 0.85011065      c
#> 4:     4 0.8401062 0.09284545      c
#> 5:    -1 2.0000000 3.00000000      d
# Wrong! length of secondary index is only 4
attributes(combined)$index
#> integer(0)
#> attr(,"__cola")
#> [1] 2 3 4 1
combined[cola==-1]
#> Error: Internal error: index 'cola' exists but is invalid
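The symptom in this example, a secondary index whose length differs from the number of rows, can be checked with a small helper. This is a minimal sketch: check_index_lengths is a hypothetical name, not part of data.table or dplyr, and it only assumes the attribute layout printed above (an "index" attribute whose own attributes, such as "__cola", hold one ordering vector per index). A zero-length ordering is reported as NA because its length alone does not show whether it is still valid.

```r
# Hypothetical helper (not part of data.table or dplyr): for each secondary
# index stored on `x`, report whether its length equals nrow(x).
check_index_lengths <- function(x) {
  idx <- attr(x, "index")
  if (is.null(idx)) {
    return(logical(0))
  }
  vapply(attributes(idx), function(ord) {
    # A zero-length ordering cannot be judged by length alone.
    if (length(ord) == 0L) NA else length(ord) == nrow(x)
  }, logical(1))
}

# Stand-in for `combined`: five rows with a hand-made four-entry index,
# mirroring the attribute printed in the example above.
combined_like <- data.frame(cola = c(5, 2, 3, 4, -1))
attr(combined_like, "index") <- structure(integer(0), "__cola" = c(2L, 3L, 4L, 1L))

lengths_ok <- check_index_lengths(combined_like)
```

Here lengths_ok is a named logical vector with one entry per index; a FALSE entry flags an index that cannot belong to the current rows.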
Problem
Thanks @AMDraghici for your suggestions!
For example, in bind_rows, if the first input is a data.table, the output table can have a corrupted secondary index because of how the underlying dplyr_reconstruct function handles the attributes of the two inputs.
Reprex
The example below shows that the index attribute can be incorrect for the output.
Cause
In the bind_rows function, dplyr_reconstruct is used to set attributes for the output dataframe.
dplyr/R/bind-rows.R, line 79 in be36acf
Looking at the dplyr_reconstruct function, it essentially gives all attributes of template_, other than names and row.names, to data.
dplyr/src/reconstruct.cpp, line 36 in be36acf
In the case above, all attributes of first (which has four rows), including index, are given to out, which has five rows. This causes the problem.
Impact
Because the data.table produced by bind_rows has a corrupted secondary index, data.table's filtering skips some rows when filtering by the indexed column.
Also, I found that this problem is not limited to bind_rows. Other dplyr functions that call dplyr_reconstruct can produce data.tables with a corrupted secondary index. For example, the full_join function can also produce unexpected results due to a corrupted secondary index.