I'm trying to launch a crawl via the rest api. A crawl starts with injecting urls. Using a chrome developer tool "Advanced Rest Client" I'm trying to build this POST payload up but the response I get is a 400 Bad Request.
POST - http://localhost:8081/job/create
Payload
{
"crawl-id":"crawl-01",
"type":"INJECT",
"config-id":"default",
"args":{ "path/to/seedlist/directory"}
}
My problem is in the args, I think more is needed but I'm not sure. In the NutchRESTAPI page this is the sample it gives for creating a job.
POST /job/create
{
"crawlId":"crawl-01",
"type":"FETCH",
"confId":"default",
"args":{"someParam":"someValue"}
}
POST /job/create
{
"crawlId":"crawl-01",
"jobClassName":"org.apache.nutch.fetcher.FetcherJob"
"confId":"default",
"args":{"someParam":"someValue"}
}
I'm not sure what param or value to give each of the commands to complete a job. (eg. Inject, Generate, Fetch, Parse, and UpdateDb) Can someone clear this up? How do I tell the api where to look for the seedlist at?
UPDATE
When trying to complete the Generate command I came into a classException error where the value for the topN key is to be of type long but the api reads it as either a string or an int. I found a fix that is supposed to included in the 2.3.1 release (release date: TBA) and applied it and recompiled my code. It can now work.
At the time of this posting, the REST API is not yet complete. A much more detailed document exists, though it's still not comprehensive. It is linked to in the following email from the user mailing list (which you might want to consider joining):
http://www.mail-archive.com/user%40nutch.apache.org/msg13652.html
But to answer your question about the seedlist, you can create the seedlist through REST, or you can use the argument "seedDir"