I am trying to use csplit in Bash to split a file, using years in the 1500s and 1600s as delimiters.
When I run the command
csplit Shakespeare.txt '/1[56]../' '{36}'
it almost works, except for at least two issues:
- This outputs 38 files, not 36, numbered xx00 through xx37. (Also, xx00 is completely blank.) I don't understand how this is possible.
- One of the files (which, it seems, is why csplit returns 37 non-empty files instead of the 36 non-empty files I expected) doesn't begin with 15XX or 16XX; it begins with "ACT 4 SCENE 15\n" (where \n denotes a newline). I don't understand how csplit can match a newline with a number.
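On the file count: the GNU documentation describes '{N}' as repeating the previous pattern N additional times, so a pattern followed by '{36}' would be applied 37 times and could yield up to 38 files. The following is a minimal sketch of that arithmetic, assuming BSD csplit treats the repeat count the same way. The file sample.txt, its contents, and the prefix "part" are made up for illustration.

```shell
# Work in a throwaway directory so the generated files are easy to count.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: one intro line, then three year headings.
printf '%s\n' 'intro' '1601' 'a' '1602' 'b' '1603' 'c' > sample.txt

# '{2}' repeats the pattern 2 more times, so it is applied 3 times in total.
csplit -s -f part sample.txt '/^1[56][0-9][0-9]/' '{2}'

# Count the pieces: 3 applications of the pattern give 4 files (part00..part03).
file_count=$(find . -type f -name 'part*' | wc -l | tr -d ' ')
echo "$file_count"
```

If the same rule holds for Shakespeare.txt, a file with 36 headings would need '{35}' rather than '{36}'.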
When I run the command I actually want,
csplit Shakespeare.txt '/1[56][0-9][0-9]/' '{36}'
the terminal returns the error
csplit: 1[56][0-9][0-9]: no match
in addition to printing the same numbers it prints when the first command is executed.
This especially doesn't make sense to me, since grep says otherwise:
grep -c "1[56][0-9][0-9]" Shakespeare.txt
36
grep -c "1[56].." Shakespeare.txt
36
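One caveat when comparing these counts with what csplit does: grep -c counts matching lines, not individual matches, so a line containing two years still counts once. The sketch below uses a made-up sample (the file name and contents are hypothetical) to show the difference, and lists the matching line numbers so they can be compared with the places where csplit splits.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: the first line contains two years, the last line one.
printf '%s\n' '1601 to 1602' 'ACT 4 SCENE 15' '1599' > sample.txt

# -c counts lines that contain at least one match.
line_count=$(grep -c '1[56][0-9][0-9]' sample.txt)

# -o prints each match on its own line, so wc -l counts individual matches.
match_count=$(grep -o '1[56][0-9][0-9]' sample.txt | wc -l | tr -d ' ')

echo "matching lines: $line_count, matches: $match_count"

# -n prefixes each matching line with its line number.
grep -n '1[56][0-9][0-9]' sample.txt
```

On this sample the two counts differ (2 lines, 3 matches), and "ACT 4 SCENE 15" is not listed because "15" is not followed by two more digits.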
Note: man csplit indicates that I have the BSD version from January 26, 2005. man grep indicates that I have the BSD version from July 28, 2010.
Based on the answer given here by user 'DRL' on 06-20-2008, I decided to try adding the -k option to csplit:
csplit -k Shakespeare.txt '/^1[56][0-9][0-9]/' '{36}'
This returned an error:
csplit: ^1[56][0-9][0-9]: no match
However, it still gave (more or less) the desired output: files xx00.txt through xx36.txt (not xx37.txt), and each of the non-empty files, xx01.txt through xx36.txt, had the expected content. (In particular, no file began with "ACT 4 SCENE 15".)
The man page for csplit says the following about the -k flag:
Honestly, I don't quite understand what this means, but I still have the following conjecture about why this solution worked:
Conjecture: csplit expects the beginning of the file to match the regex. Thus, since the first line of the file did not match ^1[56][0-9][0-9], it threw a tantrum and quit without the -k flag.
Nevertheless, I still don't understand why 1[56][0-9][0-9] did not work; maybe it is for the same reason. And I definitely don't understand why 1[56].. did not work (i.e. why csplit produced a 37th file not beginning with the pattern).
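One way to probe the conjecture about -k is to force a "no match" error on a small file and compare what is left on disk with and without the flag. As I understand it, -k only tells csplit not to remove the output files when an error occurs. The sketch below assumes that behaviour; sample.txt and the prefixes "plain" and "kept" are hypothetical names chosen for the test.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample whose first line does NOT match the pattern.
printf '%s\n' 'intro' '1601' 'a' '1602' 'b' '1603' 'c' > sample.txt

# Ask for more repetitions than there are headings, so csplit must fail.
# Without -k, the partial output is removed when the error occurs.
plain_status=0
csplit -s -f plain sample.txt '/^1[56][0-9][0-9]/' '{5}' 2>/dev/null || plain_status=$?
plain_count=$(find . -type f -name 'plain*' | wc -l | tr -d ' ')

# Same command with -k: the error is still reported, but the files are kept.
kept_status=0
csplit -s -k -f kept sample.txt '/^1[56][0-9][0-9]/' '{5}' 2>/dev/null || kept_status=$?
kept_count=$(find . -type f -name 'kept*' | wc -l | tr -d ' ')

echo "without -k: status=$plain_status files=$plain_count"
echo "with -k:    status=$kept_status files=$kept_count"
```

Both runs should exit with a non-zero status; the first should leave no files and the second should leave four. If so, the error would come from asking for more repetitions than there are matches, not from the first line failing to match.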