link to original dataset
I have downloaded this dataset: The TREC 2006 Public Corpus -- 75MB (trec06p.tgz). Here is the folder structure:
.
└── trec06p/
    ├── data
    ├── data-delay
    ├── full
    ├── full-delay
    ├── ham25
    ├── ham25-delay
    ├── ham50
    ├── ham50-delay
    ├── spam25
    ├── spam25-delay
    ├── spam50
    └── spam50-delay
Some questions:
- What is the delay for? (e.g. `data-delay`, `full-delay`)
- What does `full` mean in this case? (is it just the labels?)
- What is the difference between HAM and ham in the `full-delay` subfolder?
- Why is the `data-delay` folder empty?
- Is there any special way to parse the contents in the data folder?
Disclaimer
Before reading the answer, please note that since I did not participate in the TREC06 task and am not the data creator/provider, I can only make educated guesses about the questions you have on the dataset.
Educated Guessed Answers
First, reading the task paper helps https://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf =)
Next, the right download link for future readers would be https://plg.uwaterloo.ca/~gvcormac/treccorpus06/
And now, some summary:
Q: How are the above forms of feedback represented by the files in the dataset?
A: All the actual textual data is found in the `trec06p/data/**/*` files. The rest of the directories are just indices pointing to subsets of that data, to emulate the different forms of evaluation.
Q: What does full mean in this case? (is it just the labels?)
`trec06p/full/index`: The index of emails that points to all the data points in `trec06p/data/**/*`
Q: What is the delay for? (e.g. data-delay, full-delay)
- `trec06p/full-delay/index`: The index that points to the delayed feedback evaluation
- `trec06p/ham*-delay/index`: The indices that point to only the non-spam labelled emails in the delayed feedback evaluation
- `trec06p/spam*-delay/index`: The indices that point to only the spam labelled emails in the delayed feedback evaluation
So essentially, the unique entries of `trec06p/ham*-delay/index` + `trec06p/spam*-delay/index` = `trec06p/full-delay/index`
Q: Why is the data-delay folder empty?
For this, I don't have an answer... Got to ask the data provider/creator.
Q: Is there any special way to parse the contents in the data folder?
Now that's the fun coding part =)
Let's step back a little and think about what we essentially have:
- the email files in `trec06p/data/**/*`
- `spam`/`ham` labels of each email in `trec06p/full/index`
- `Spam`/`SPAM`/`Ham`/`HAM` labels of a subset of emails in `trec06p/full-delay/index`
So...
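A minimal sketch of loading those three pieces. It assumes each index line has the form `label ../data/NNN/MMM` (an assumption about the file layout that is worth checking against your copy) and that the archive was extracted to a `trec06p/` folder. The names `read_index` and `read_corpus` are illustrative, not part of the dataset or any library.

```python
from pathlib import Path, PurePosixPath


def read_index(index_path):
    """Map each data id (e.g. '000/001') to its label, keeping the original case."""
    labels = {}
    with open(index_path, encoding="utf-8") as fin:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            # Assumed line format: '<label> ../data/<dir>/<file>'
            label, rel_path = line.split()
            data_id = "/".join(PurePosixPath(rel_path).parts[-2:])
            labels[data_id] = label
    return labels


def read_corpus(root):
    """Return (data_id, label, raw_bytes) tuples for every email in full/index."""
    root = Path(root)
    full_labels = read_index(root / "full" / "index")
    full_delay_labels = read_index(root / "full-delay" / "index")
    rows = []
    for data_id, label in full_labels.items():
        # Keep the raw bytes; the emails are not in one single encoding.
        content = (root / "data" / data_id).read_bytes()
        # Sanity check: the delayed-feedback label should match up to case.
        if data_id in full_delay_labels:
            assert label.lower() == full_delay_labels[data_id].lower()
        rows.append((data_id, label.lower(), content))
    return rows
```

Calling `read_corpus("trec06p")` walks every entry of `full/index`; if the assertion never fires, the labels in the two index files agree once case is ignored.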
Q: What is the difference between HAM, Ham, SPAM and Spam labels in the `trec06p/*-delay/index`?
If we look carefully at the `if data_id in full_delay_labels: assert label.lower() == full_delay_labels[data_id].lower()` line, we see that all the caps and the non-caps labels are the same.
Q: So why is there a difference?
A: Not sure, best to ask the data provider/creator.
Q: Is there a difference between the labels from `trec06p/full-delay/index` and `trec06p/full/index`?
There doesn't seem to be any.
Q: How do I just read it into a pandas dataframe?
Given what we know above:
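One way this could look, as a sketch under the same assumption about the index line format (`label ../data/NNN/MMM`). The function names and the column names `id`, `label` and `content` are my own choices for illustration. The `content` column holds raw bytes because the emails are not in one single encoding.

```python
from pathlib import Path, PurePosixPath


def iter_emails(root):
    """Yield one record per line of full/index: id, lower-cased label, raw bytes."""
    root = Path(root)
    with open(root / "full" / "index", encoding="utf-8") as fin:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            # Assumed line format: '<label> ../data/<dir>/<file>'
            label, rel_path = line.split()
            data_id = "/".join(PurePosixPath(rel_path).parts[-2:])
            yield {
                "id": data_id,
                "label": label.lower(),
                "content": (root / "data" / data_id).read_bytes(),
            }


def to_dataframe(root):
    """Collect all records into a pandas DataFrame."""
    import pandas as pd  # imported here so iter_emails works without pandas

    return pd.DataFrame(list(iter_emails(root)), columns=["id", "label", "content"])
```

With the archive extracted to `trec06p/`, `df = to_dataframe("trec06p")` gives one row per email, and `df["label"].value_counts()` shows the spam/ham split.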
Q: But the input columns are still binary, can I somehow guess the encoding?
Not really, it's pretty hard / messy to guess the encoding of a binary file, but you can try this (though not all files specify `charset=...` in the content):
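A rough sketch of that idea: search the raw bytes for the first `charset=...` declaration and decode with it, falling back to a default when nothing is declared or the declared name is not a codec Python knows. The helper names and the `latin-1` fallback are assumptions made for illustration. A multipart email can declare several charsets, and this simplification applies only the first one to the whole message.

```python
import re

# Matches charset=utf-8, charset="utf-8", charset='utf-8' (case-insensitive).
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def guess_charset(raw, default="latin-1"):
    """Return the first declared charset in the raw bytes, or the default."""
    match = CHARSET_RE.search(raw)
    return match.group(1).decode("ascii") if match else default


def decode_email(raw, default="latin-1"):
    """Decode raw email bytes with the guessed charset, replacing bad bytes."""
    charset = guess_charset(raw, default)
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The declared charset is not a codec Python knows; use the default.
        return raw.decode(default, errors="replace")
```

For a DataFrame with a bytes column, something like `df["content"].apply(decode_email)` would produce a text column. Parsing each message with the standard library's `email.message_from_bytes` and reading the charset per part is likely more robust, at the cost of more code.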