I'm trying to read in a BZ2 file from the Reddit Politosphere dataset (specifically the "comments_2008-01.bz2" file). The dataset contains, among other things, the body of a Reddit comment.
If I read the file in using read.csv, it works well for the most part, except for a few lines where it incorrectly splits what should be one entry into multiple columns.
df <- read.csv(bzfile("comments_2008-01.bz2"), fill = T)
df[9, ]
What happens:
| body..deleted | body..cleaned |
|---|---|
| We ended it in 2004, but they stole it back. Google \\Ohio voting results, | 2004.\\ |
What I would like to happen:
| body..deleted |
|---|
| We ended it in 2004, but they stole it back. Google \\Ohio voting results, 2004.\\ |
When I use read_lines to explore:
"{\"author\":\"nOD1S\",\"body\\":\"We ended it in 2004, but they stole it back. Google \\\"Ohio voting results, 2004.\\\"\", ..... }"
What I think is happening is that the escaped quote in `\"Ohio voting results, 2004.\"` is prematurely signalling to the parser that the field has ended, so the next `,` pushes the rest of the text into a new column.
I can think of hacky ways to delete these rows altogether, but I don't really want to do that. Any ideas about how to get around this issue?
Your data, after decompression, is not a CSV file at all but newline-delimited JSON (sadly not formatted 100% correctly on every line). So instead of `read.csv`, we can use `readLines` and then parse each line as JSON with the jsonlite package.
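Here is a minimal sketch of that approach. The sample line below is hypothetical, mimicking the structure you showed from `read_lines`; for the real file you would parse every line, or let `jsonlite::stream_in()` consume the newline-delimited JSON connection directly.

```r
library(jsonlite)

# Hypothetical sample line, mimicking the structure shown in the question.
# The \\" sequences are JSON-escaped quotes inside the body field.
line <- '{"author":"nOD1S","body":"We ended it in 2004, but they stole it back. Google \\"Ohio voting results, 2004.\\""}'

rec <- fromJSON(line)
rec$body   # the embedded quotes and comma survive intact

# For the real file: stream_in() reads newline-delimited JSON from a
# connection and returns a data frame, skipping the CSV parser entirely.
# df <- stream_in(bzfile("comments_2008-01.bz2"))
```

Because each line is parsed as JSON rather than split on commas, the escaped quotes inside `body` no longer break the field into multiple columns.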