Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Following the comment in discourse #162

Open
sl-solution opened this issue Mar 22, 2022 · 8 comments
Open

Following the comment in discourse #162

sl-solution opened this issue Mar 22, 2022 · 8 comments

Comments

@sl-solution
Copy link

Thank you for supporting the development of InMemoryDatasets.jl.

I added a comment to your post in discourse (here), however, I thought it might be better to discuss it here.

Currently printing a large view of a data set is very slow, I wondering if this can be improved?

@ronisbr
Copy link
Owner

ronisbr commented Mar 22, 2022

Hi @sl-solution !

Can you please test against master?

@sl-solution
Copy link
Author

For scenarios like view(ds, :, :) it is great, however, my problem is about the situation when a view has a non consecutive permutation of rows, this happens for groupby and gatherby,

using Random, InMemoryDatasets
ds = Dataset(rand(1:1000, 10^7,3), :auto);
sds = view(ds, shuffle(1:nrow(ds)), :);
@time show(ds); # this is fast
@time show(sds); # 2.419944 seconds (5.29 k allocations: 283.562 KiB)

@ronisbr
Copy link
Owner

ronisbr commented Mar 23, 2022

Hi @sl-solution !

Sorry, I just have time to take a look at this issue now. After profiling, I see that almost the entire execution is inside this function:

https://github.com/sl-solution/InMemoryDatasets.jl/blob/009b9084b9f5263d3f2f8080e423fb6c9c17aac4/src/other/tables.jl#L34

According to Tables.jl API, I need to access an element using this function:

Tables.getcolumn(ctable.table, column_name)[I]

In your case, it calls view(_columns(parent(sds))[i], rows(sds)) which seems very slow:

julia> @btime Tables.getcolumn(sds, :x1)[1]
  4.688 ms (1 allocation: 48 bytes)
314

julia> @btime Tables.getcolumn(ds, :x1)[1]
  147.847 ns (2 allocations: 272 bytes)
403

@sl-solution
Copy link
Author

If that's the case, does the following help?

julia> _tmp = Tables.getcolumn(sds, :x1) # this is expensive but we can do it once
julia> _tmp[1]

@ronisbr
Copy link
Owner

ronisbr commented Mar 24, 2022

Actually no, because this function is called to get the element at each cell. Thus, I cannot store it temporally. Since PrettyTables.jl must work with many other types of data, including those that do no support Tables.jl, it will not be good to create functions to store columns at each different case.

@sl-solution
Copy link
Author

but here _tmp isn't the whole column it is like a pointer to the actual values.

@ronisbr
Copy link
Owner

ronisbr commented Mar 24, 2022

Yes, but PrettyTables will call:

_tmp = Tables.getcolumn(sds, :x1)
_tmp[1]

for each table element, leading to the same result. The only way to improve inside PrettyTables is to store a column and then get all the elements, instead of asking for a column at every cell. However, this is not feasible since I do not render only structures that supports Tables.jl API.

@sl-solution
Copy link
Author

I see, let me think more about this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants