-
Notifications
You must be signed in to change notification settings - Fork 7
/
index.html
588 lines (456 loc) · 20.2 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
<!DOCTYPE html>
<html>
<head>
<meta charset='utf-8' />
<meta http-equiv="X-UA-Compatible" content="chrome=1" />
<meta name="description" content="Cuts : Unix 'cut' (and 'paste') on steroids: more flexible select columns from files" />
<link rel="stylesheet" type="text/css" media="screen" href="stylesheets/stylesheet.css">
<title>Cuts</title>
</head>
<body>
<!-- HEADER -->
<div id="header_wrap" class="outer">
<header class="inner">
<a id="forkme_banner" href="https://github.com/arielf/cuts">View on GitHub</a>
<h1 id="project_title">Cuts</h1>
<h2 id="project_tagline">Unix 'cut' (and 'paste') on steroids: more flexible select columns from files</h2>
<section id="downloads">
<a class="zip_download_link" href="https://github.com/arielf/cuts/zipball/master">Download this project as a .zip file</a>
<a class="tar_download_link" href="https://github.com/arielf/cuts/tarball/master">Download this project as a tar.gz file</a>
</section>
</header>
</div>
<!-- MAIN CONTENT -->
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
<h1>
<a name="cuts" class="anchor" href="#cuts"><span class="octicon octicon-link"></span></a>cuts</h1>
<p><strong><em>cuts</em></strong>: Unix/POSIX <code>cut</code> (and <code>paste</code>) on (s)teroids.</p>
<p><code>cut</code> is a very useful Unix (and POSIX standard) utility designed to
extract columns from files. Unfortunately, despite its usefulness
and great popularity, it is pretty limited in power.</p>
<p>Many <a href="http://stackoverflow.com/questions/tagged/cut">questions on stackoverflow</a>
suggest that the same pain-points of the standard <code>cut</code> are felt by many users.</p>
<p>The following list demonstrates what is missing in <code>cut</code> and why
I felt the need to write <code>cuts</code>:</p>
<h4>
<a name="cuts-automatically-detects-the-file-input-column-delimiter" class="anchor" href="#cuts-automatically-detects-the-file-input-column-delimiter"><span class="octicon octicon-link"></span></a>
<code>cuts</code> automatically detects the file input column delimiter:</h4>
<pre><code>#
# -- cut doesn't:
#
$ cut -f1 test.dat
0,1,2
0,1,2
0,1,2
#
# -- cuts does:
#
$ cuts 0 test.dat
0
0
0
</code></pre>
<p>As you can see, I prefer zero-based indexing. <code>cuts</code> uses 0 for 1st column.</p>
<h4>
<a name="cuts-supports-mixed-input-delimiters-eg-both-csv-and-tsv" class="anchor" href="#cuts-supports-mixed-input-delimiters-eg-both-csv-and-tsv"><span class="octicon octicon-link"></span></a>
<code>cuts</code> supports mixed input delimiters (e.g. both CSV and TSV)</h4>
<pre><code>#
# -- cut doesn't "cut it":
#
cut -f2 t.mixed
0,1,2
0 1 2
1
#
# -- cuts does:
#
$ cuts 1 t.mixed
1
1
1
</code></pre>
<h4>
<a name="cuts-does-automatic-side-by-side-pasting" class="anchor" href="#cuts-does-automatic-side-by-side-pasting"><span class="octicon octicon-link"></span></a>
<code>cuts</code> does automatic side-by-side pasting</h4>
<pre><code>#
# -- cut doesn't output columns side-by-side when reading from
# multiple input files, even though this is the most useful
# and expected thing to do.
# (It requires a separate utility like "paste")
#
#
# -- a simple example input
#
$ cat t.tsv
0 1 2
a b c
X Y Z
#
# -- cut does one file at a time:
#
$ cut -f2 t.tsv t.tsv
1
b
Y
1
b
Y
#
# -- cuts does automatic side-by-side printing:
#
$ cuts 1 t.tsv t.tsv
1 1
b b
Y Y
</code></pre>
<h4>
<a name="cuts-supports-multi-char-column-delimiters" class="anchor" href="#cuts-supports-multi-char-column-delimiters"><span class="octicon octicon-link"></span></a>
<code>cuts</code> supports multi-char column delimiters</h4>
<p>In particular, standard <code>cut</code> can't deal with the very
common case of any white-space sequence:</p>
<pre><code>#
# -- a file with variable length space-delimiters
#
$ cat 012.txt
0 1 2
0 1 2
0 1 2
#
# -- standard cut doesn't "cut it":
#
$ cut -d' ' -f2 012.txt
#
# -- cuts does what makes sense:
#
$ cuts 1 012.txt
1
1
1
</code></pre>
<h4>
<a name="cuts-supports-powerful-perl-style-regex-delimiters" class="anchor" href="#cuts-supports-powerful-perl-style-regex-delimiters"><span class="octicon octicon-link"></span></a>
<code>cuts</code> supports powerful (perl style) regex delimiters</h4>
<p>When your delimiter is a bit more complex (say, any sequence of non-digits)
and you have <code>cut</code>, you're out-of-luck. <code>cuts</code> fixes this by allowing you
to specify any perl regular-expression (regexp) as the delimiter:</p>
<pre><code>#
# -- a file with numbers separated by mixed non-numeric chars
#
$ cat 012.regex
0-----1-------2
0 ## 1 #### 2
0 aa 1 bbbbbbb 2
#
# -- cuts accepts perl regexes for delimiters
# in this case, we set delimiter regex to any sequence of non-digits
#
$ cuts -d '[^0-9]+' 1 012.regex
1
1
1
</code></pre>
<h4>
<a name="cuts-supports-negative-from-end-column-numbers" class="anchor" href="#cuts-supports-negative-from-end-column-numbers"><span class="octicon octicon-link"></span></a>
<code>cuts</code> supports negative (from end) column numbers</h4>
<p>This is very useful when you have, say, 257 fields (but you haven't counted
them, so you don't really know), and you're interested in the last field,
or the one before the last etc. <code>cuts</code> supports negative offsets
from the end:</p>
<pre><code>#
# -- Ask cuts to print last field only, by using a negative offset
#
$ cuts -1 012.txt
2
2
2
</code></pre>
<h4>
<a name="cuts-supports-changing-order-of-columns" class="anchor" href="#cuts-supports-changing-order-of-columns"><span class="octicon octicon-link"></span></a>
<code>cuts</code> supports changing order of columns</h4>
<p>Unlike <code>cut</code> which ignores the order requested by the user,
and always force-prints the fields in order from low to high:</p>
<pre><code>#
# -- cut can't change the order of columns:
#
$ cut -f3,2,1 file.tsv
0 1 2
0 1 2
0 1 2
#
# -- cuts does exactly what you ask it to:
#
$ cuts 2 1 0 file.tsv
2 1 0
2 1 0
2 1 0
</code></pre>
<h4>
<a name="cuts-is-more-powerful-dealing-with-variable-number-of-columns" class="anchor" href="#cuts-is-more-powerful-dealing-with-variable-number-of-columns"><span class="octicon octicon-link"></span></a>
<code>cuts</code> is more powerful dealing with variable number of columns:</h4>
<p>The ability to offset from the end of line, in combination with the
ability to specify perl regular expressions as delimiters makes some
jobs that would require writing specialized scripts,
straight-forward with <code>cuts</code>:</p>
<pre><code>#
# -- Example file, not that Mary doesn't have a midinitial
#
$ cat t.complex
firstname midinitial lastname phone-number Age
John T. Public 555-5555 35
Mary Joe 444-5555 27
#
# -- Want the phone-number? It's easy with cuts
#
$ cuts t.complex -2
phone-number
555-5555
444-5555
</code></pre>
<h4>
<a name="cuts-is-forgiving-if-you-accidentally-use--t-like-sort-does" class="anchor" href="#cuts-is-forgiving-if-you-accidentally-use--t-like-sort-does"><span class="octicon octicon-link"></span></a>
<code>cuts</code> is forgiving if you accidentally use <code>-t</code> (like <code>sort</code> does)</h4>
<p>It is unfortunate that the Unix toolset is so inconsistent in the
choice of option-letters. <code>cuts</code> solves this by allowing 'any of
the above'. So if you accidentally use <code>-s</code> instead of <code>-d</code> because
you think "separator" instead of "delimiter" - it still works
(and <code>-t</code>, which is used by <code>sort</code>, works just as well).</p>
<h4>
<a name="cuts-requires-minimal-typing-for-simple-column-extraction-tasks" class="anchor" href="#cuts-requires-minimal-typing-for-simple-column-extraction-tasks"><span class="octicon octicon-link"></span></a>
<code>cuts</code> requires minimal typing for simple column extraction tasks</h4>
<p><code>cut</code> is hader to use and less friendly because it doesn't support
reasonable defaults. For example:</p>
<pre><code>#
# -- `cut` errors when arguments are missing:
#
$ cut -d, example.csv
cut: you must specify a list of bytes, characters, or fields
#
# -- compare to cuts, where default is 1st field &
# field-delimiters are auto-detected for most common cases:
#
$ cuts example.csv
0
0
0
</code></pre>
<h4>
<a name="cuts-supports-multi-file--multi-column-mixes" class="anchor" href="#cuts-supports-multi-file--multi-column-mixes"><span class="octicon octicon-link"></span></a>
<code>cuts</code> supports multi-file & multi-column mixes</h4>
<p>For example 2nd column from file1 and 3rd column from file2.</p>
<p>Obviously with the power of the <code>bash</code> shell you can do stuff like:</p>
<pre><code> $ paste <(cut -d, -f1 file.csv) <(cut -d"<TAB>" -f2 file.tsv)
</code></pre>
<p>but that requires too much typing (3 commands & shell-magic),
while still not supporting regex-style delimiters and offsets from end.</p>
<p>Compare the above to the much simpler, and more intuitive, <code>cuts</code> version,
which works right out of the box, in any shell:</p>
<pre><code>$ cat file.tsv
0 1 2
a b c
$ cat file.csv
0,1,2
a,b,c
$ cuts file.csv 0 file.tsv 1
0 1
a b
</code></pre>
<p>Other utilities, like <code>awk</code> or <code>perl</code> give you more power at the expense
of having to learn a much more complex language to do what you want.</p>
<p><code>cuts</code> is designed to give you the power you need in almost all cases,
while always being able to stay on the command line and keeping
the human interface <em>as simple and minimalist as possible</em></p>
<p><code>cuts</code> arguments can be:</p>
<pre><code>- file-names
- column-numbers (negative offsets from the end are supported too) or
- any combo of the two using: `file:colno`
</code></pre>
<p><code>cuts</code> also supports <code>-</code> as a handy alias for <code>stdin</code>.</p>
<h2>
<a name="cuts-design-principles" class="anchor" href="#cuts-design-principles"><span class="octicon octicon-link"></span></a>
<code>cuts</code> design principles</h2>
<p>The following are the principles which guide the design decisions of
cuts.</p>
<h3>
<a name="reasonable-defaults-for-everything" class="anchor" href="#reasonable-defaults-for-everything"><span class="octicon octicon-link"></span></a>Reasonable defaults for everything</h3>
<p>A file-name without a column-number will cause the <em>last</em> specified
column-number to be reused.</p>
<p>A column-number without a file-name will cause the <em>last</em> specified
file-name to be reused.</p>
<p>An unspecified column-number will default to the 1st column (0)</p>
<p>An unspecified file-name will default to <code>/dev/stdin</code>so you can easily pipe
any other command output into <code>cuts</code>.</p>
<p>By default, the input column delimiter is the most common case of
any-sequence of white-space <em>or</em> a comma, optionally surrounded by
white-space. As a result, in the vast majority of use cases, there's
no need to specify an input column delimiter at all. If you have
a more complex case you may overide <code>cuts</code> default
input-field-delimiter:</p>
<pre><code> $ cuts -d '<some-perl-regex>' ...
# see `man perlre` for documentation on perl regular expressions
</code></pre>
<p>Similarly, the output column delimiter which is tab by default, can be
overriden using <code>-T <sep></code> (or -S, or -D). This is chosen
as a mnemonic: lowercase options are for input delimiters, while
the respective upper-case options are for output delimiters.</p>
<h3>
<a name="require-minimal-typing-from-the-user" class="anchor" href="#require-minimal-typing-from-the-user"><span class="octicon octicon-link"></span></a>Require minimal typing from the user</h3>
<p>In addition to having reasonable defaults, <code>cuts</code> doesn't force you
to type more than needed, or enforce an order of arguments on you.
It tries to be as minimalist as possible in its requirements from the user.
Compare one of the simplest and most straightforward examples of
extracting 3 columns from a single file:</p>
<pre><code># -- the traditional, cut way:
$ cut -d, -f 1,2,3 file.csv
# -- the cuts way: shorter & sweeter:
$ cuts file.csv 0 1 2
</code></pre>
<p>Minimal typing is also what guided the decision to include the
functionality of <code>paste</code> in <code>cuts</code>.</p>
<h3>
<a name="input-flexibility--tolerance-to-missing-data" class="anchor" href="#input-flexibility--tolerance-to-missing-data"><span class="octicon octicon-link"></span></a>Input flexibility & tolerance to missing data</h3>
<p>One thing that <code>cuts</code> does is try and be completely tolerant
and supportive to cases of missing data. If you try to paste two columns,
side-by-side, from two files but one of the files is shorter,
<code>cuts</code> will oblige and won't output a field where it is missing
from the shorter file, until it reaches EOF on the longer file.</p>
<p>Similarly, requesting column 2 (3rd column) when there are only
2 columns (0,1) in a line will result in an empty output for that
field rather than resulting in a fatal error. This is done by
design and it conforms to the perl philosophy of silently converting
undefined values to empty ones.</p>
<h2>
<a name="examples" class="anchor" href="#examples"><span class="octicon octicon-link"></span></a>Examples</h2>
<pre><code> cuts 0 file1 file2 Extract 1st (0) column from both files
cuts file1 file2 0 Same as above (flexible argument order)
cuts file1 file2 Same as above (0 is default colno)
cuts -1 f1 f2 f3 Last column from each of f1, f2, & f3
cuts file1:0 file2:-1 1st (0) column from file1 & last column from file2
cuts 0 2 3 Columns (0,2,3) from /dev/stdin
cuts f1 0 -1 f2 1st & last columns from f1
+ last column (last colno seen) from f2
</code></pre>
<h2>
<a name="usage" class="anchor" href="#usage"><span class="octicon octicon-link"></span></a>Usage</h2>
<p>Simply call <code>cuts</code> without any argument to get a full usage message:</p>
<pre><code>$ cuts
Usage: cuts [Options] [Column_Specs]...
Options:
-v verbose (mostly for debugging)
-0 Don't use the default 0-based indexing, use 1-based
Input column delimiter options (lowercase):
-d <sep> Use <sep> (perl regexp) as column delimiter
-t <sep> Alias for -d
-s <sep> Another alias for -d
Output column delimiter options (uppercase of same):
-D <sep>
-T <sep>
-S <sep>
Column_Specs:
filename:colno Extract colno from filename
filename Use filename to extract columns from
colno Use column colno to extract columns
If there's an excess of colno args, will duplicate the last
file arg. If there's an excess of file args, will duplicate
the last colno.
If omitted:
Default file is /dev/stdin
Default colno is 0 (or 1 if 1-based indexing is in effect)
Examples:
cuts 0 file1 file2 1st (0) column from both files
cuts file1 file2 0 Same as above (flexible argument order)
cuts file1 file2 Same as above (0 is default colno)
cuts -1 f1 f2 f3 Last column from each of f1, f2, & f3
cuts file1:0 file2:-1 1st column from file1 & last column from file2
cuts 0 2 3 Columns (0,2,3) from /dev/stdin
cuts f1 0 -1 f2 1st & last columns from f1
+ last column (last colno seen) from f2
</code></pre>
<h2>
<a name="further-configuration--customization" class="anchor" href="#further-configuration--customization"><span class="octicon octicon-link"></span></a>Further configuration & customization</h2>
<p>If you don't like <code>cuts</code> defaults, you can override them in
an optional personal configuration ~/.cuts.pl</p>
<p>If this file exists, cuts will read it during startup allowing you
override cuts default parameters, in particular the value of
the <code>$ICS</code> input-column separator regex. The syntax of this
file is perl:</p>
<pre><code> # -- If you prefer 1-based indexing, by default, set this to 1
# You may also set it from the command-line with the
# -0 option.
our $opt_0 = 0;
# -- Alternative file:colno char separators
our $FCsep = ':;,#';
# -- Default input column separator (smart)
our $ICS = '(?:\s*,\s*|\s+)';
# -- Default output column separator
our $OCS = "\t";
# -- if you use a config file, you must end it with 1;
# -- so executing it by cuts using perl 'do' succeeds.
1;
</code></pre>
<h2>
<a name="todo-items-contributions-welcome" class="anchor" href="#todo-items-contributions-welcome"><span class="octicon octicon-link"></span></a>TODO items (contributions welcome)</h2>
<p>I made no effort to make <code>cuts</code> fast. Although compared to the
I/O overhead, there may be not much need for it. If you have ideas
on how to make the column extractions and joining more efficient,
that would be welcome.</p>
<p>Per file column input delimiters. I haven't had the need so far so
that took a back-seat in priority. The most common case of
intermixing TSV and CSV files as inputs is working thanks to
the current default multi-match pattern <code>$ICS</code> which simply
matches all of: multi-white-space, tabs, or (optionally space surrounded)
commas. Even an extreme case of a schizophrenic input like:</p>
<pre><code>$ cat schizo.csv
0,1 , 2
0,1 ,2
0,1 ,2
a b c
</code></pre>
<p>Works correctly, and as designed/expected, with the present smart
column-delimiter trick:</p>
<pre><code>$ cuts -1 schizo.csv
2
2
2
c
</code></pre>
<p>I consider it a blissful feature.</p>
<p>Implement <code>cut</code> rarely used options? I haven't had the need for
them, and if I ever do, I can simply use <code>cut</code> itself, so I haven't
even tried to implement stuff like fixed-width field support,
byte-offsets, <code>--complement</code>, <code>--characters</code>. The basic features
that <code>cut</code> is missing were much more critical for me when writing <code>cuts</code>.
Still on top of this is implementing column-ranges like: 3-5 and mixed
ranges with lists like: 1,3-5,7</p>
<h2>
<a name="other-thoughts" class="anchor" href="#other-thoughts"><span class="octicon octicon-link"></span></a>Other thoughts</h2>
<p>Why do I support the <code>filename:colno</code> syntax? you may ask.
It seems redundant (since <code>filename colno</code> works just as well.)
The reason is that sometimes you may have files named <code>1</code>, <code>2</code> etc.
This introduces an ambiguity: are these arguments files or column numbers?
<code>cuts</code> solves this ambiguity by:</p>
<ul>
<li>Giving priority to files (it first checks arguments for file existence)</li>
<li>In case you want to force <code>1</code> to a column number, even in the
presence of a file by the same name, you can use the <code>file:colno</code> syntax.</li>
<li>You may even use <code>#</code>, <code>,</code> or <code>;</code> (needs shell quoting), as the
<code>file:colno</code> separator instead of <code>:</code> for somewhat greater control.</li>
</ul><p>Resolving option ambiguity: negative column offsets and <code>-</code> for
<code>stdin</code> don't play well with <code>getopts()</code>. <code>cuts</code> solves this by auto
injecting <code>--</code> (end of options marker) into <code>@ARGV</code> before calling
getopts when needed. This is so the user never has to worry about
the ambiguity. For example, (<code>-v</code> is <code>cuts</code> own debugging/verbose
option, while <code>-3</code> is a column index specifier) this works as expected:</p>
<pre><code> $ cuts -v -3 file.txt
</code></pre>
</section>
</div>
<!-- FOOTER -->
<div id="footer_wrap" class="outer">
<footer class="inner">
<p class="copyright">Cuts maintained by <a href="https://github.com/arielf">arielf</a></p>
<p>Published with <a href="http://pages.github.com">GitHub Pages</a></p>
</footer>
</div>
</body>
</html>