Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summary bucketing to columns #29

Open
Llammissar opened this issue Feb 9, 2017 · 2 comments
Open

Summary bucketing to columns #29

Llammissar opened this issue Feb 9, 2017 · 2 comments

Comments

@Llammissar
Copy link

Ahaha, it keeps happening! ;) This time, in a much longer form.

I'm not sure what to call this precisely, so I'll describe the problem and see what you think:

Say we have as input a list of actions and their outcome:

action                success
frob                  true
tweak                 true
frob                  false
frob                  true
twiddle               true
tweak                 false
frob                  true

The goal is something like this:

action  succ    fail
frob    3       1
tweak   1       1
twiddle 1       0

The above was done with the following incantation:

tsv-join -w 0 --filter-file \
    <(tsv-filter -H --str-eq 2:"false" input.tsv | tsv-summarize -H --group-by 1 --count-header fail) \
    -k 1 \
    -a 2 \
    <(tsv-filter -H --str-eq 2:"true" input.tsv | tsv-summarize -H --group-by 1 --count-header succ)

This is pretty nasty: two subshells (with bonus bashisms), two scans of the input file, annoying and error-prone to edit... you can probably see why I'd like to improve this one.

Two ideas come to mind for how this might work:

  1. A bucketing action. Something like this, perhaps:
    tsv-summarize --group-by 1 --bucket 2
    For the above input, I'd expect this to output:
action  success_true    success_false
frob    3               1
tweak   1               1
twiddle 1               0

i.e. each unique value in the group is counted and given a bucket.

Advantages:

  • It's relatively simple. Low overhead on the user and the column names can be cleaned up with little effort.
  • It actually works for any number of unique values in the bucketed column (e.g. if I had a third "baz" status, it'd give me a column for that too).

Disadvantages:

  • You do have to clean up the names. Given what we discussed before, I can't really think of any decent syntaxen for naming the columns in situ.
  • It may be too simple? This is just a hunch, though.
  • More feature-creep in tsv-summarize doesn't necessarily sit well with me.
  1. Some mechanism for pivoting (possibly as a new tool?). Using existing tools, we can get the following:
$ tsv-summarize -H --group-by 1,2 --count input.tsv | sort
action  success count
frob    false   1
frob    true    3
tweak   false   1
tweak   true    1
twiddle true    1

In this case, there needs to be something to bridge the gap. Maybe something like this?
tsv-pivot --column 2 --fact sum:3

Output... probably the same as before.

Advantages:

  • General tool, functional for this scenario and for any other pivot operation
  • Can probably reuse a lot code, including the aggregates from tsv-summarize
  • More flexibility WRT syntax because it's a whole new tool.

Disadvantages:

  • It's a new tool in the suite, so hooray for effort and bugs. Do I really need more disadvantages at this point?
  • ...Hahah, whoops, probably still have to clean up column names!
@jondegenhardt
Copy link
Contributor

Nice example. I need to read through your proposals in more detail. However, my initial thought is that what you mention in point 2 is the key. Using tsv-summarize -H --group-by 1,2 --count produces "long, narrow" data. What you are asking for is a "wide" data format. Both forms are commonly used in tools for statistics, machine learning, etc. I need to put my own data in those forms quite commonly, so I definitely want to support them.

tsv-append was created for creating long, narrow data. tsv-join is good for creating wide data sets. However, it can be cumbersome to use in certain circumstances and I expect to create a version more tailored to the task. I haven't spent much time thinking about how tsv-summarize fits into this, but figuring this out might help keep the tools self-consistent.

A couple references on long/narrow and wide data formats:

@Llammissar
Copy link
Author

Llammissar commented Feb 9, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants