-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
df1Row[df2Column] == df1Row[df2Column.name]
#445
base: master
Are you sure you want to change the base?
Conversation
This behavior is defined by |
The only place I found it breaking in the library was here: core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/toList.kt. There, the column was taken from the row while the other way around was intended. I also looked into forbidding such behavior, as taking a column from another dataframe from a row seems odd, but that's still used a lot across the implementation. But purely from a user-standpoint, what would you expect |
Ah, ok, i think i finally understood. Somehow this part was hard for me to understand, that's why i was suspicious of this fix
Now i think we talk about the same thing, but differently. Yeah, indeed df1Row[df2Column] gives you values from df2Column if df2Column is pure DataColumn, unlike columns created by column accessors API:
So yeah, to me the reason why it works like this is that we need this behavior for data schema api:
But it backfires in |
@koperagen Thanks! you just gave me another test case that this change indeed broke. I'll try to find another solution which is a) more efficient and b) doesn't break |
That's interesting... val aColumn = columnOf(3, 4, 5) named "a"
dataFrameOf("a")(1, 2, 3)
.select { aColumn } This results in "1, 2, 3". So it takes the name from |
Right. Columns created by
|
@koperagen hmm, you're right val aColumn = dataFrameOf("a")(4, 5, 6)["a"]
dataFrameOf("a")(1, 2, 3)
.select { aColumn } results in "4, 5, 6" indeed. I don't know entirely how to rectify this anymore... but I do think they should all behave the same |
Okay, new implementation doesn't check all names anymore; it attempts to resolve both manners (from column by row index and from dataframe by name) and gives by name priority if it was able to resolve it and it was different from the other manner: val fromColumnByRow = column.getValue(this)
val fromDfByName = df
.let {
try {
it.getColumnOrNull(column.name())
} catch (e: IllegalStateException) {
return fromColumnByRow
}
}
.let { it ?: return fromColumnByRow }
.let {
try {
it[index]
} catch (e: IndexOutOfBoundsException) {
return fromColumnByRow
}
}
.let {
try {
it as R
} catch (e: ClassCastException) {
return fromColumnByRow
}
}
return when {
// Issue #442: df1Row[df2Column] should be df1Row[df2Column.name], not df2Column[df1Row(.index)]
// so, give fromDfByName priority if it's not the same as fromColumnByRow
fromDfByName != fromColumnByRow -> fromDfByName
else -> fromColumnByRow
} |
Implemented fix for #442.
It used to be that
df1Row[df2Column] == df2Column[df1Row.index]
, but IMO that's confusing. I can see it was done to makedf1Row[df2Column]
anddf2Column[df1Row]
return the same thing but across the library, this is only actually used once (in toList and was fixed by swapping row/cols).It's also very common for the library to call
df1Row[newColumn]
for reasons I don't fully understand. So my fix checks if the name ofnewColumn
already exists in the row, if so, return the value at that name. If not, do the magic you did before and let the column handle the resolving of the value.This seems to work with all tests and fixes the specific example of the issue.