Interpreting robots.txt vs. terms of use

I'm interested in scraping Craigslist solely for data analysis for a blog post (i.e., no commercial or financial gain, no posting/emailing, no personal data collection, no sharing of the scraped data). Their robots.txt file reads:

User-agent: *
Disallow: /reply
Disallow: /fb/
Disallow: /suggest
Disallow: /flag
Disallow: /mf
Disallow: /eaf

I intend to visit none of these paths; I only want to view posts and collect the text from the postbody. That does not appear to be disallowed by the robots.txt file. However, the Craigslist terms of use contain the following entry:

USE. You agree not to use or provide software (except for general purpose web browsers and email clients, or software expressly licensed by us) or services that interact or interoperate with CL, e.g. for downloading, uploading, posting, flagging, emailing, search, or mobile use. Robots, spiders, scripts, scrapers, crawlers, etc. are prohibited, as are misleading, unsolicited, unlawful, and/or spam postings/email. You agree not to collect users' personal and/or contact information ("PI").

So should I assume that my bot is forbidden across the entire site, or only in the directories disallowed in robots.txt? If it's the former, what am I misunderstanding about the robots.txt file? If it's the latter, may I assume that they will not ban my IP provided that I abide by robots.txt?
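To make the robots.txt reading concrete, here is a minimal sketch that feeds the rules quoted above to Python's standard urllib.robotparser and asks whether a given URL is permitted. The user agent name "my-analysis-bot", the helper names, and the /reply/123 URL are placeholders invented for illustration. The check covers robots.txt only; it says nothing about the terms of use.

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as quoted in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /reply
Disallow: /fb/
Disallow: /suggest
Disallow: /flag
Disallow: /mf
Disallow: /eaf
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str,
               user_agent: str = "my-analysis-bot") -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url.

    "my-analysis-bot" is a hypothetical name; the quoted rules apply to
    every agent because of the "User-agent: *" line.
    """
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A search page is not under any Disallow prefix.
print(is_allowed(parser, "https://losangeles.craigslist.org/search/sss"))

# Hypothetical URL under /reply, which is disallowed by prefix match.
print(is_allowed(parser, "https://losangeles.craigslist.org/reply/123"))
```

Disallow rules match by path prefix, so anything starting with /reply is excluded while /search/sss is not.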

There is 1 answer below

They provide data in RSS format. At the bottom right of the page there is an RSS link that takes you to the same URL with ?format=rss appended.

For example: https://losangeles.craigslist.org/search/sss?format=rss
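If the feed is served as described, its items could be read with the standard library alone. The following is only a sketch under assumptions: the function names are made up, the sample feed is invented for illustration rather than real Craigslist output, and the title, link, and description fields are assumed because they are common in RSS feeds, not confirmed for this one. Tags are matched by local name so the parser works whether or not the feed declares XML namespaces.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://...}item' -> 'item'."""
    return tag.rsplit("}", 1)[-1]


def parse_feed_items(xml_text: str) -> list:
    """Collect title, link, and description from every <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for elem in root.iter():
        if local_name(elem.tag) != "item":
            continue
        # Map each direct child's local tag name to its stripped text.
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in elem}
        items.append({
            "title": fields.get("title", ""),
            "link": fields.get("link", ""),
            "description": fields.get("description", ""),
        })
    return items


# Invented sample feed (not real data), using a default namespace to
# exercise the namespace handling.
SAMPLE_FEED = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel>
    <title>sample search</title>
  </channel>
  <item>
    <title>Example listing A</title>
    <link>https://example.org/post/1</link>
    <description>First example body</description>
  </item>
  <item>
    <title>Example listing B</title>
    <link>https://example.org/post/2</link>
    <description>Second example body</description>
  </item>
</rdf:RDF>
"""

for item in parse_feed_items(SAMPLE_FEED):
    print(item["title"], item["link"])
```

To use it on a real feed, download the XML text of the ?format=rss URL and pass it to parse_feed_items in place of SAMPLE_FEED.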

My guess is that scraping is really not allowed if you're redistributing the post content, collecting emails to spam, and so on; it probably depends on how you use the data. If you're only gathering statistical information, it may be acceptable, but I really don't know. This is probably a better question for a lawyer.