params.json
{"name":"Cuts","tagline":"Unix 'cut' (and 'paste') on steroids: more flexible select columns from files","body":"cuts\r\n====\r\n\r\n***cuts***: Unix/POSIX `cut` (and `paste`) on (s)teroids.\r\n\r\n`cut` is a very useful Unix (and POSIX standard) utility designed to\r\nextract columns from files. Unfortunately, despite its usefulness\r\nand great popularity, it is pretty limited in power.\r\n\r\nMany <a href=\"http://stackoverflow.com/questions/tagged/cut\">questions on stackoverflow</a>\r\nsuggest that the same pain-points of the standard `cut` are felt by many users.\r\n\r\nThe following list demonstrates what is missing in `cut` and why\r\nI felt the need to write `cuts`:\r\n\r\n#### `cuts` automatically detects the file input column delimiter:\r\n```\r\n#\r\n# -- cut doesn't:\r\n#\r\n$ cut -f1 test.dat\r\n0,1,2\r\n0,1,2\r\n0,1,2\r\n\r\n#\r\n# -- cuts does:\r\n#\r\n$ cuts 0 test.dat\r\n0\r\n0\r\n0\r\n```\r\nAs you can see, I prefer zero-based indexing. `cuts` uses 0 for 1st column.\r\n\r\n#### `cuts` supports mixed input delimiters (e.g. 
both CSV and TSV)\r\n```\r\n#\r\n# -- cut doesn't \"cut it\":\r\n#\r\ncut -f2 t.mixed\r\n0,1,2\r\n0 1 2\r\n1\r\n#\r\n# -- cuts does:\r\n#\r\n$ cuts 1 t.mixed\r\n1\r\n1\r\n1\r\n```\r\n\r\n#### `cuts` does automatic side-by-side pasting\r\n\r\n```\r\n#\r\n# -- cut doesn't output columns side-by-side when reading from\r\n# multiple input files, even though this is the most useful\r\n# and expected thing to do.\r\n# (It requires a separate utility like \"paste\")\r\n#\r\n\r\n#\r\n# -- a simple example input\r\n#\r\n$ cat t.tsv\r\n0\t1\t2\r\na\tb\tc\r\nX\tY\tZ\r\n\r\n#\r\n# -- cut does one file at a time:\r\n#\r\n$ cut -f2 t.tsv t.tsv\r\n1\r\nb\r\nY\r\n1\r\nb\r\nY\r\n\r\n#\r\n# -- cuts does automatic side-by-side printing:\r\n#\r\n$ cuts 1 t.tsv t.tsv\r\n1\t1\r\nb\tb\r\nY\tY\r\n```\r\n\r\n#### `cuts` supports multi-char column delimiters\r\n\r\nIn particular, standard `cut` can't deal with the very\r\ncommon case of any white-space sequence:\r\n\r\n```\r\n#\r\n# -- a file with variable length space-delimiters\r\n#\r\n$ cat 012.txt\r\n0 1 2\r\n0 1 2\r\n0 1 2\r\n\r\n#\r\n# -- standard cut doesn't \"cut it\":\r\n#\r\n$ cut -d' ' -f2 012.txt\r\n\r\n\r\n\r\n#\r\n# -- cuts does what makes sense:\r\n#\r\n$ cuts 1 012.txt\r\n1\r\n1\r\n1\r\n```\r\n\r\n#### `cuts` supports powerful (perl style) regex delimiters\r\n\r\nWhen your delimiter is a bit more complex (say, any sequence of non-digits)\r\nand you have `cut`, you're out-of-luck. 
`cuts` fixes this by allowing you\r\nto specify any perl regular-expression (regexp) as the delimiter:\r\n\r\n```\r\n#\r\n# -- a file with numbers separated by mixed non-numeric chars\r\n#\r\n$ cat 012.regex\r\n0-----1-------2\r\n0 ## 1 #### 2\r\n0 aa 1 bbbbbbb 2\r\n\r\n#\r\n# -- cuts accepts perl regexes for delimiters\r\n# in this case, we set delimiter regex to any sequence of non-digits\r\n#\r\n$ cuts -d '[^0-9]+' 1 012.regex\r\n1\r\n1\r\n1\r\n```\r\n\r\n#### `cuts` supports negative (from end) column numbers\r\n\r\nThis is very useful when you have, say, 257 fields (but you haven't counted\r\nthem, so you don't really know), and you're interested in the last field,\r\nor the one before the last etc. `cuts` supports negative offsets\r\nfrom the end:\r\n\r\n```\r\n#\r\n# -- Ask cuts to print last field only, by using a negative offset\r\n#\r\n$ cuts -1 012.txt\r\n2\r\n2\r\n2\r\n\r\n```\r\n\r\n#### `cuts` supports changing order of columns\r\n\r\nUnlike `cut` which ignores the order requested by the user,\r\nand always force-prints the fields in order from low to high:\r\n\r\n```\r\n#\r\n# -- cut can't change the order of columns:\r\n#\r\n$ cut -f3,2,1 file.tsv\r\n0\t1\t2\r\n0\t1\t2\r\n0\t1\t2\r\n\r\n#\r\n# -- cuts does exactly what you ask it to:\r\n#\r\n$ cuts 2 1 0 file.tsv \r\n2\t1\t0\r\n2\t1\t0\r\n2\t1\t0\r\n```\r\n\r\n#### `cuts` is more powerful dealing with variable number of columns:\r\n\r\nThe ability to offset from the end of line, in combination with the\r\nability to specify perl regular expressions as delimiters makes some\r\njobs that would require writing specialized scripts,\r\nstraight-forward with `cuts`:\r\n\r\n```\r\n#\r\n# -- Example file, note that Mary doesn't have a midinitial\r\n#\r\n$ cat t.complex\r\nfirstname midinitial lastname phone-number Age\r\nJohn T. Public 555-5555 35\r\nMary Joe 444-5555 27\r\n\r\n#\r\n# -- Want the phone-number? 
It's easy with cuts\r\n#\r\n$ cuts t.complex -2\r\nphone-number\r\n555-5555\r\n444-5555\r\n```\r\n\r\n#### `cuts` is forgiving if you accidentally use `-t` (like `sort` does)\r\n\r\nIt is unfortunate that the Unix toolset is so inconsistent in the\r\nchoice of option-letters. `cuts` solves this by allowing 'any of\r\nthe above'. So if you accidentally use `-s` instead of `-d` because\r\nyou think \"separator\" instead of \"delimiter\" - it still works\r\n(and `-t`, which is used by `sort`, works just as well).\r\n\r\n#### `cuts` requires minimal typing for simple column extraction tasks\r\n\r\n`cut` is harder to use and less friendly because it doesn't support\r\nreasonable defaults. For example:\r\n\r\n```\r\n#\r\n# -- `cut` errors when arguments are missing:\r\n#\r\n$ cut -d, example.csv\r\ncut: you must specify a list of bytes, characters, or fields\r\n\r\n#\r\n# -- compare to cuts, where default is 1st field &\r\n# field-delimiters are auto-detected for most common cases:\r\n#\r\n$ cuts example.csv\r\n0\r\n0\r\n0\r\n```\r\n\r\n#### `cuts` supports multi-file & multi-column mixes\r\n\r\nFor example 2nd column from file1 and 3rd column from file2.\r\n\r\nObviously with the power of the `bash` shell you can do stuff like:\r\n```\r\n $ paste <(cut -d, -f1 file.csv) <(cut -d\"<TAB>\" -f2 file.tsv)\r\n```\r\n\r\nbut that requires too much typing (3 commands & shell-magic),\r\nwhile still not supporting regex-style delimiters and offsets from end.\r\n\r\nCompare the above to the much simpler, and more intuitive, `cuts` version,\r\nwhich works right out of the box, in any shell:\r\n\r\n```\r\n$ cat file.tsv\r\n0\t1\t2\r\na\tb\tc\r\n\r\n$ cat file.csv\r\n0,1,2\r\na,b,c\r\n\r\n$ cuts file.csv 0 file.tsv 1\r\n0\t1\r\na\tb\r\n```\r\n\r\n\r\nOther utilities, like `awk` or `perl` give you more power at the expense\r\nof having to learn a much more complex language to do what you want.\r\n\r\n`cuts` is designed to give you the power you need in almost all cases,\r\nwhile 
always being able to stay on the command line and keeping\r\nthe human interface _as simple and minimalist as possible_\r\n\r\n`cuts` arguments can be:\r\n\r\n - file-names\r\n - column-numbers (negative offsets from the end are supported too) or\r\n - any combo of the two using: `file:colno`\r\n\r\n`cuts` also supports `-` as a handy alias for `stdin`.\r\n\r\n\r\n## `cuts` design principles\r\n\r\nThe following are the principles which guide the design decisions of\r\ncuts.\r\n\r\n### Reasonable defaults for everything\r\n\r\nA file-name without a column-number will cause the *last* specified\r\ncolumn-number to be reused.\r\n\r\nA column-number without a file-name will cause the *last* specified\r\nfile-name to be reused.\r\n\r\nAn unspecified column-number will default to the 1st column (0).\r\n\r\nAn unspecified file-name will default to `/dev/stdin` so you can easily pipe\r\nany other command output into `cuts`.\r\n\r\nBy default, the input column delimiter is the most common case of\r\nany-sequence of white-space *or* a comma, optionally surrounded by\r\nwhite-space. As a result, in the vast majority of use cases, there's\r\nno need to specify an input column delimiter at all. If you have\r\na more complex case you may override `cuts` default\r\ninput-field-delimiter:\r\n\r\n```\r\n $ cuts -d '<some-perl-regex>' ...\r\n # see `man perlre` for documentation on perl regular expressions\r\n```\r\n\r\nSimilarly, the output column delimiter, which is tab by default, can be\r\noverridden using `-T <sep>` (or -S, or -D). 
This is chosen\r\nas a mnemonic: lowercase options are for input delimiters, while\r\nthe respective upper-case options are for output delimiters.\r\n\r\n### Require minimal typing from the user\r\n\r\nIn addition to having reasonable defaults, `cuts` doesn't force you\r\nto type more than needed, or enforce an order of arguments on you.\r\nIt tries to be as minimalist as possible in its requirements from the user.\r\nCompare one of the simplest and most straightforward examples of\r\nextracting 3 columns from a single file:\r\n\r\n```\r\n# -- the traditional, cut way:\r\n$ cut -d, -f 1,2,3 file.csv\r\n\r\n# -- the cuts way: shorter & sweeter:\r\n$ cuts file.csv 0 1 2\r\n```\r\n\r\nMinimal typing is also what guided the decision to include the\r\nfunctionality of `paste` in `cuts`.\r\n\r\n\r\n### Input flexibility & tolerance to missing data\r\n\r\nOne thing that `cuts` does is try and be completely tolerant\r\nand supportive to cases of missing data. If you try to paste two columns,\r\nside-by-side, from two files but one of the files is shorter,\r\n`cuts` will oblige and won't output a field where it is missing\r\nfrom the shorter file, until it reaches EOF on the longer file.\r\n\r\nSimilarly, requesting column 2 (3rd column) when there are only\r\n2 columns (0,1) in a line will result in an empty output for that\r\nfield rather than resulting in a fatal error. 
This is done by\r\ndesign and it conforms to the perl philosophy of silently converting\r\nundefined values to empty ones.\r\n\r\n## Examples\r\n\r\n```\r\n cuts 0 file1 file2 Extract 1st (0) column from both files\r\n\r\n cuts file1 file2 0 Same as above (flexible argument order)\r\n\r\n cuts file1 file2 Same as above (0 is default colno)\r\n\r\n cuts -1 f1 f2 f3 Last column from each of f1, f2, & f3\r\n\r\n cuts file1:0 file2:-1 1st (0) column from file1 & last column from file2\r\n\r\n cuts 0 2 3 Columns (0,2,3) from /dev/stdin\r\n\r\n cuts f1 0 -1 f2 1st & last columns from f1\r\n + last column (last colno seen) from f2\r\n\r\n```\r\n\r\n\r\n## Usage\r\n\r\nSimply call `cuts` without any argument to get a full usage message:\r\n\r\n```\r\n$ cuts\r\nUsage: cuts [Options] [Column_Specs]...\r\n Options:\r\n -v verbose (mostly for debugging)\r\n -0 Don't use the default 0-based indexing, use 1-based\r\n\r\n Input column delimiter options (lowercase):\r\n -d <sep> Use <sep> (perl regexp) as column delimiter\r\n -t <sep> Alias for -d\r\n -s <sep> Another alias for -d\r\n \r\n Output column delimiter options (uppercase of same):\r\n -D <sep>\r\n -T <sep>\r\n -S <sep>\r\n\r\n Column_Specs:\r\n filename:colno Extract colno from filename\r\n filename Use filename to extract columns from\r\n colno Use column colno to extract columns\r\n\r\n If there's an excess of colno args, will duplicate the last\r\n file arg. 
If there's an excess of file args, will duplicate\r\n the last colno.\r\n\r\n If omitted:\r\n Default file is /dev/stdin\r\n Default colno is 0 (or 1 if 1-based indexing is in effect)\r\n\r\n Examples:\r\n cuts 0 file1 file2 1st (0) column from both files\r\n\r\n cuts file1 file2 0 Same as above (flexible argument order)\r\n\r\n cuts file1 file2 Same as above (0 is default colno)\r\n\r\n cuts -1 f1 f2 f3 Last column from each of f1, f2, & f3\r\n\r\n cuts file1:0 file2:-1 1st column from file1 & last column from file2\r\n\r\n cuts 0 2 3 Columns (0,2,3) from /dev/stdin\r\n\r\n cuts f1 0 -1 f2 1st & last columns from f1\r\n + last column (last colno seen) from f2\r\n```\r\n\r\n## Further configuration & customization\r\n\r\nIf you don't like `cuts` defaults, you can override them in\r\nan optional personal configuration file, `~/.cuts.pl`.\r\n\r\nIf this file exists, cuts will read it during startup, allowing you to\r\noverride cuts default parameters, in particular the value of\r\nthe `$ICS` input-column separator regex. The syntax of this\r\nfile is perl:\r\n\r\n```\r\n # -- If you prefer 1-based indexing, by default, set this to 1\r\n # You may also set it from the command-line with the\r\n # -0 option.\r\n our $opt_0 = 0;\r\n\r\n # -- Alternative file:colno char separators\r\n our $FCsep = ':;,#';\r\n\r\n # -- Default input column separator (smart)\r\n our $ICS = '(?:\\s*,\\s*|\\s+)';\r\n\r\n # -- Default output column separator\r\n our $OCS = \"\\t\";\r\n\r\n # -- if you use a config file, you must end it with 1;\r\n # -- so executing it by cuts using perl 'do' succeeds.\r\n 1;\r\n```\r\n\r\n## TODO items (contributions welcome)\r\n\r\nI made no effort to make `cuts` fast. Although compared to the\r\nI/O overhead, there may not be much need for it. If you have ideas\r\non how to make the column extractions and joining more efficient,\r\nthat would be welcome.\r\n\r\nPer file column input delimiters. I haven't had the need so far so\r\nthat took a back-seat in priority. 
The most common case of\r\nintermixing TSV and CSV files as inputs is working thanks to\r\nthe current default multi-match pattern `$ICS` which simply\r\nmatches all of: multi-white-space, tabs, or (optionally space surrounded)\r\ncommas. Even an extreme case of a schizophrenic input like:\r\n\r\n```\r\n$ cat schizo.csv\r\n0,1 , 2\r\n0,1 ,2\r\n0,1 ,2\r\na b c\r\n```\r\n\r\nWorks correctly, and as designed/expected, with the present smart\r\ncolumn-delimiter trick:\r\n\r\n```\r\n$ cuts -1 schizo.csv\r\n2\r\n2\r\n2\r\nc\r\n```\r\n\r\nI consider it a blissful feature.\r\n\r\nImplement `cut` rarely used options? I haven't had the need for\r\nthem, and if I ever do, I can simply use `cut` itself, so I haven't\r\neven tried to implement stuff like fixed-width field support,\r\nbyte-offsets, `--complement`, `--characters`. The basic features\r\nthat `cut` is missing were much more critical for me when writing `cuts`.\r\nStill on top of this is implementing column-ranges like: 3-5 and mixed\r\nranges with lists like: 1,3-5,7\r\n\r\n## Other thoughts\r\n\r\nWhy do I support the `filename:colno` syntax? you may ask.\r\nIt seems redundant (since `filename colno` works just as well.)\r\nThe reason is that sometimes you may have files named `1`, `2` etc.\r\nThis introduces an ambiguity: are these arguments files or column numbers?\r\n`cuts` solves this ambiguity by:\r\n\r\n - Giving priority to files (it first checks arguments for file existence)\r\n - In case you want to force `1` to a column number, even in the\r\n presence of a file by the same name, you can use the `file:colno` syntax.\r\n - You may even use `#`, `,` or `;` (needs shell quoting), as the\r\n `file:colno` separator instead of `:` for somewhat greater control.\r\n\r\n\r\nResolving option ambiguity: negative column offsets and `-` for\r\n`stdin` don't play well with `getopts()`. `cuts` solves this by auto\r\ninjecting `--` (end of options marker) into `@ARGV` before calling\r\ngetopts when needed. 
This is so the user never has to worry about\r\nthe ambiguity. For example, (`-v` is `cuts` own debugging/verbose\r\noption, while `-3` is a column index specifier) this works as expected:\r\n\r\n```\r\n $ cuts -v -3 file.txt\r\n```\r\n\r\n","google":"","note":"Don't delete this file! It's used internally to help with page regeneration."}