I have set up Nutch 2.x to crawl a few multilingual domains. I can restrict Nutch to inlinks only, but not to subfolders. For example, for the following seed,
I want to crawl only the URLs under /urdu, because this website also contains pages in other languages. How can I configure or customize Nutch to handle such cases?
Nutch does not have any default configuration that achieves this.
There are several places you can hook into: you can change the code of the plugins that parse the HTML and extract links (parse-html, parse-tika, etc.), or change the Mapper code of the parse phase.
Alternatively, you can add the following regex to regex-urlfilter.txt (note that you should disable the URL filter in the injection phase, because the input seed might not have language information in its URL path).
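To illustrate how such a rule behaves, here is a minimal Java sketch. It is not the exact regex from this answer: the two rules in the comment, the `/urdu` folder and the `example.com` host are assumptions for illustration only. It also assumes the filter checks rules in order, lets the first matching rule decide, and rejects a URL when nothing matches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for two regex-urlfilter.txt rules (assumed, not from the answer):
//   +^https?://[^/]+/urdu(/|$)
//   -.
class UrduRuleSketch {
    // One rule: a sign (accept or reject) and the pattern it applies to.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    static final List<Rule> RULES = Arrays.asList(
        new Rule(true, "^https?://[^/]+/urdu(/|$)"), // keep anything under /urdu
        new Rule(false, "."));                       // drop everything else

    // Returns the URL if the first matching rule accepts it, otherwise null.
    static String filter(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // no rule matched: reject
    }

    public static void main(String[] args) {
        System.out.println(filter("https://example.com/urdu/news/1"));
        System.out.println(filter("https://example.com/english/news/1"));
    }
}
```

The `(/|$)` at the end keeps a folder such as `/urdubazaar` from being accepted by accident.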
But I would prefer the following way.
In Nutch 1.16, you can customize the code of ParseOutputFormat, which provides the RecordWriter used in the reduce phase of the ParseSegment job.
What happens in ParseOutputFormat?
If you check this particular line of code inside getRecordWriter
you can write a custom filter method there that filters out all pages which do not have the corresponding langValue in their path.
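A minimal sketch of such a filter method could look like the following. It works on plain URL strings rather than on Nutch's own outlink objects so that it stays self-contained. The class name `LangOutlinkFilter`, the method names and the `example.com` URLs are hypothetical, and how you wire it into getRecordWriter is up to you.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: keeps only URLs whose path
// contains langValue as a whole path segment.
class LangOutlinkFilter {
    // True if some path segment equals langValue, e.g. "urdu" in /urdu/news.
    static boolean hasLangInPath(String url, String langValue) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treat it as not matching
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        return Arrays.asList(path.split("/")).contains(langValue);
    }

    // Returns the URLs that pass the language check, in their original order.
    static List<String> filter(List<String> outlinks, String langValue) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (hasLangInPath(url, langValue)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Comparing whole path segments instead of using a plain substring check avoids keeping unrelated URLs that merely contain the language name somewhere in the path.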
langValue: you can hard-code this value, or define a property (like allowed.lang.per.page) in nutch-site.xml, read it in the getConf method, and use it inside the filter method.
If you want to allow multiple langValues, pass them as comma-separated values, split them when you read the property, and adapt your filter method accordingly.
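As a rough sketch of the multi-value case, assuming the property value arrives as a raw string such as `urdu, ur`, the splitting and the check could be done like this. The class `AllowedLangs` and its methods are hypothetical names, not Nutch code.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Nutch: handles a comma-separated
// list of allowed language folders.
class AllowedLangs {
    // Parses a raw value such as "urdu, ur" into trimmed, non-empty entries.
    static Set<String> parse(String raw) {
        Set<String> langs = new LinkedHashSet<>();
        if (raw == null) {
            return langs; // property not set: nothing is allowed
        }
        for (String part : raw.split(",")) {
            String lang = part.trim();
            if (!lang.isEmpty()) {
                langs.add(lang);
            }
        }
        return langs;
    }

    // True if any path segment of the URL is one of the allowed languages.
    static boolean isAllowed(String url, Set<String> langs) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (path == null) {
            return false; // opaque URI without a path
        }
        for (String segment : path.split("/")) {
            if (langs.contains(segment)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside Nutch you may not need the manual split at all, since Hadoop's `Configuration.getTrimmedStrings` already returns a comma-separated property as an array.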