The StormCrawler Maven archetype doesn't seem to play nice with the WARC module in my project. Currently it only creates empty, 0-byte files with names like "crawl-20180802121925-00000.warc.gz". Am I missing something here?
I'm trying to enable WARC writing by creating a default project like so:
mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.10
And then adding the dependency on the WARC module in the pom.xml like so:
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-warc</artifactId>
    <version>1.10</version>
</dependency>
Then I add the WARCHdfsBolt to the topology, subscribed to the fetch bolt, and try to write to a local filesystem directory:
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FeedParserBolt;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater;
import com.digitalpebble.stormcrawler.spout.MemorySpout;
import com.digitalpebble.stormcrawler.warc.WARCFileNameFormat;
import com.digitalpebble.stormcrawler.warc.WARCHdfsBolt;

public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        String[] testURLs = new String[] { "http://www.lequipe.fr/",
                "http://www.lemonde.fr/", "http://www.bbc.co.uk/",
                "http://storm.apache.org/", "http://digitalpebble.com/" };

        builder.setSpout("spout", new MemorySpout(testURLs));

        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("spout");

        builder.setBolt("fetch", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("warc", getWarcBolt())
                .localOrShuffleGrouping("fetch");

        builder.setBolt("sitemap", new SiteMapParserBolt())
                .localOrShuffleGrouping("fetch");

        builder.setBolt("feeds", new FeedParserBolt())
                .localOrShuffleGrouping("sitemap");

        builder.setBolt("parse", new JSoupParserBolt())
                .localOrShuffleGrouping("feeds");

        builder.setBolt("index", new StdOutIndexer())
                .localOrShuffleGrouping("parse");

        Fields furl = new Fields("url");

        // can also use MemoryStatusUpdater for simple recursive crawls
        builder.setBolt("status", new StdOutStatusUpdater())
                .fieldsGrouping("fetch", Constants.StatusStreamName, furl)
                .fieldsGrouping("sitemap", Constants.StatusStreamName, furl)
                .fieldsGrouping("feeds", Constants.StatusStreamName, furl)
                .fieldsGrouping("parse", Constants.StatusStreamName, furl)
                .fieldsGrouping("index", Constants.StatusStreamName, furl);

        return submit("crawl", conf, builder);
    }

    private WARCHdfsBolt getWarcBolt() {
        String warcFilePath = "/Users/user/Documents/workspace/test/warc";

        FileNameFormat fileNameFormat = new WARCFileNameFormat()
                .withPath(warcFilePath);

        Map<String, String> fields = new HashMap<>();
        fields.put("software:", "StormCrawler 1.0 http://stormcrawler.net/");
        fields.put("conformsTo:",
                "http://www.archive.org/documents/WarcFileFormat-1.0.html");

        WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt()
                .withFileNameFormat(fileNameFormat);
        warcbolt.withHeader(fields);

        // can specify the filesystem - will use the local FS by default
        // String fsURL = "hdfs://localhost:9000";
        // warcbolt.withFsUrl(fsURL);

        // a custom max length can be specified - 1 GB will be used as a default
        FileSizeRotationPolicy rotpol = new FileSizeRotationPolicy(50.0f,
                FileSizeRotationPolicy.Units.MB);
        warcbolt.withRotationPolicy(rotpol);

        return warcbolt;
    }
}
Whether I run it locally with or without Flux doesn't seem to make a difference. You can have a look at the demo repo here: https://github.com/keyboardsamurai/storm-test-warc
Thanks for asking this. In theory, content gets written to the WARC files when

1. the sync policy is triggered, i.e. a given number of tuples has gone through the bolt,
2. the sync is forced explicitly, or
3. the file is rotated, e.g. because it has reached the maximum size set by the rotation policy.
Since the topology you are using as a starting point is not recursive and does not process more than 5 URLs, conditions 1 and 3 are never met.
You can change that by using MemoryStatusUpdater instead of StdOutStatusUpdater for the status bolt. This way new URLs will be processed continuously. Alternatively, you can set the sync count to 1 on the WARC bolt so that the synchronization is triggered after every tuple. In practice, you wouldn't need to do that on a real crawl where URLs come in constantly.
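As a sketch, the two variants could look like the fragment below. MemoryStatusUpdater matches the comment already in the topology above; withSyncPolicy and CountSyncPolicy follow the storm-hdfs bolt convention and are an assumption about the API exposed by WARCHdfsBolt in this release, so verify them against your version before relying on this.

```java
// Variant 1: make the crawl recursive so tuples keep flowing and the
// sync count is eventually reached (MemoryStatusUpdater ships with
// StormCrawler, as noted in the topology comment above):
builder.setBolt("status", new MemoryStatusUpdater())
        .fieldsGrouping("fetch", Constants.StatusStreamName, furl);

// Variant 2 (assumed API): force a sync after every tuple.
// CountSyncPolicy comes from storm-hdfs (org.apache.storm.hdfs.bolt.sync);
// check that your WARCHdfsBolt version exposes withSyncPolicy.
warcbolt.withSyncPolicy(new CountSyncPolicy(1));
```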
Now the weird thing is that, regardless of whether the sync is triggered by condition 1 or 2, I can't see any change to the file at all: it remains at 0 bytes. This was not the case with version 1.8, so it could be due to a change in the code since then.
I know that some users have been relying on FileTimeSizeRotationPolicy, which can trigger condition 3 above based on time.
Feel free to open an issue on GitHub and I'll have a closer look at it (when I am back next month).
EDIT: there was a bug with the compression of the entries, which has now been fixed and will be part of the next StormCrawler release. See the comments on the issue kindly opened by the OP.
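For anyone debugging this, here is a small, self-contained check (plain JDK, file names purely illustrative) for whether a produced .warc.gz file actually contains a valid gzip member rather than being empty:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class WarcGzCheck {

    // Returns true if the file is non-empty and starts with the gzip magic
    // bytes 0x1f 0x8b, i.e. it begins with at least one gzip member.
    static boolean looksLikeGzip(Path p) throws IOException {
        byte[] head = new byte[2];
        try (InputStream in = Files.newInputStream(p)) {
            if (in.read(head) != 2) {
                return false; // empty or truncated file
            }
        }
        return (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) throws IOException {
        Path empty = Files.createTempFile("crawl-empty-", ".warc.gz");
        Path valid = Files.createTempFile("crawl-valid-", ".warc.gz");

        // Write a single gzip member; the GZIPOutputStream constructor
        // emits the gzip header immediately.
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(valid))) {
            out.write("WARC/1.0\r\n".getBytes(StandardCharsets.UTF_8));
        }

        System.out.println("empty looks like gzip: " + looksLikeGzip(empty)); // false
        System.out.println("valid looks like gzip: " + looksLikeGzip(valid)); // true
    }
}
```

A 0-byte file fails the check straight away, which distinguishes "nothing was ever flushed" from "flushed but corrupt" when diagnosing the bolt.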