Nice example. I need to read through your proposals in more detail. However, my initial thought is that what you mention in point 2 is the key. Using tsv-summarize -H --group-by 1,2 --count produces "long, narrow" data. What you are asking for is a "wide" data format. Both forms are commonly used in tools for statistics, machine learning, etc. I need to put my own data in those forms quite commonly, so I definitely want to support them.
tsv-append was created for creating long, narrow data. tsv-join is good for creating wide data sets. However, it can be cumbersome to use in certain circumstances and I expect to create a version more tailored to the task. I haven't spent much time thinking about how tsv-summarize fits into this, but figuring this out might help keep the tools self-consistent.
A couple references on long/narrow and wide data formats:
Real quick before bed: after posting this, I discovered that datamash exists. Also that it has cross-tabulation as a separate tool, so there's at least one toolset in the same domain treating it as a separate tool.
Idly, I wonder if there's not room for both concepts to coexist?
Ahaha, it keeps happening! ;) This time, in a much longer form.
I'm not sure what to call this precisely, so I'll describe the problem and see what you think:
Say we have as input a list of actions and their outcome:
The goal is something like this:
The above was done with the following incantation:
This is pretty nasty: two subshells (with bonus bashisms), two scans of the input file, annoying and error-prone to edit... you can probably see why I'd like to improve this one.
Two ideas come to mind for how this might work:
A bucketing action. Something like this, perhaps: tsv-summarize --group-by 1 --bucket 2
For the above input, I'd expect this to output:
i.e. each unique value in the group is counted and given a bucket.
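To make the bucketing idea concrete, here is a minimal Python sketch of the behaviour described above: group by the first column, count each unique value of the second, and emit one wide row per group. It does not reflect how tsv-summarize works or would implement this. The function name `bucket_counts`, the `action` header, and the sample rows are all hypothetical, made up for illustration.

```python
from collections import Counter

def bucket_counts(rows):
    """Cross-tabulate (group, value) pairs into one wide row per group."""
    counts = Counter(rows)  # (group, value) -> number of occurrences
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    groups = list(dict.fromkeys(g for g, _ in rows))
    buckets = list(dict.fromkeys(v for _, v in rows))
    header = ["action"] + buckets  # "action" is a hypothetical column name
    # Counter returns 0 for (group, bucket) pairs that never occurred.
    table = [[g] + [counts[(g, b)] for b in buckets] for g in groups]
    return header, table

# Hypothetical sample input: one (action, outcome) pair per line.
rows = [
    ("build", "ok"),
    ("build", "fail"),
    ("test", "ok"),
    ("build", "ok"),
    ("test", "ok"),
]

header, table = bucket_counts(rows)
print("\t".join(header))
for row in table:
    print("\t".join(str(cell) for cell in row))
```

This needs only a single pass over the input to count, which is the main thing the shell incantation lacks. The catch is that the set of buckets is not known until the whole input has been read, so the header cannot be written until the end.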
Advantages:
Disadvantages:
In this case, there needs to be something to bridge the gap. Maybe something like this?
tsv-pivot --column 2 --fact sum:3
Output... probably the same as before.
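For comparison, here is a rough Python sketch of one possible reading of the pivot proposal: take long, narrow rows with a numeric third column, spread the values of column 2 across the header, and sum column 3 into each cell. No tsv-pivot tool exists in this thread beyond the command line above, so the semantics here are an assumption, and `pivot_sum`, its parameters, and the sample rows are hypothetical. Columns are 0-indexed in the code, so `pivot_col=1` and `value_col=2` stand in for `--column 2` and `sum:3`.

```python
def pivot_sum(rows, key_col=0, pivot_col=1, value_col=2):
    """Pivot long rows to wide form, summing value_col per (key, pivot)."""
    sums = {}
    keys, pivots = {}, {}  # dicts used as insertion-ordered sets
    for row in rows:
        key, pivot = row[key_col], row[pivot_col]
        keys[key] = None
        pivots[pivot] = None
        sums[(key, pivot)] = sums.get((key, pivot), 0) + int(row[value_col])
    # Cells with no matching input rows are filled with 0.
    wide = {k: [sums.get((k, p), 0) for p in pivots] for k in keys}
    return list(pivots), wide

# Hypothetical long-format input, e.g. the result of a group-and-count step.
long_rows = [
    ("build", "ok", "2"),
    ("build", "fail", "1"),
    ("test", "ok", "2"),
]

columns, wide = pivot_sum(long_rows)
print("\t".join(["action"] + columns))  # "action" is a hypothetical header
for key, cells in wide.items():
    print("\t".join([key] + [str(c) for c in cells]))
```

Under this reading the pivot step is independent of how the long data was produced, so it could sit after any tool that emits group, key, value rows.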
Advantages:
Disadvantages: