I am crawling a government website that is protected by reCAPTCHA. Is that legal or illegal? In the site's source code I also found some commented-out links, in addition to the links I mention below, that are not used on the visible site, and I am crawling the data through such a link. Is it acceptable to crawl the data through that link, or might the site owners block my IP address if I do? This is the code I use to crawl the data:
// Runs inside an async method. Namespaces (AngleSharp 0.9.x): AngleSharp,
// AngleSharp.Dom.Html, AngleSharp.Extensions, AngleSharp.Network.Default.
try
{
    var requester = new HttpRequester();
    requester.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36";
    var configuration = Configuration.Default.WithDefaultLoader(requesters: new[] { requester }).WithCookies();
    string url = "http://www.mca.gov.in/mcafoportal/viewSignatoryDetails.do";
    var context = BrowsingContext.New(configuration);
    await context.OpenAsync(url);
    try
    {
        // Submit the search form; the anonymous object's properties become form fields.
        await context.Active.QuerySelector<IHtmlFormElement>("form[name='signatoryForm']").SubmitAsync(new
        {
            companyID = "U30009KA2001PTC029692",
            displayCaptcha = "false"
        });
    }
    catch (Exception ex)
    {
        // InnerException can be null, so fall back to the outer message.
        Console.WriteLine(ex.InnerException?.Message ?? ex.Message);
    }
    if (context.Active != null)
    {
        var sdTable = context.Active.QuerySelector<IHtmlTableElement>("table[id='signatoryDetails']");
        if (sdTable != null)
        {
            // Children[1] is indexed below, so at least two child elements are required.
            if (sdTable.ChildElementCount > 1)
            {
                foreach (var row in sdTable.Children[1].Children)
                {
                    // Print the first eight cells of each row.
                    for (int c = 0; c < 8 && c < row.ChildElementCount; c++)
                    {
                        Console.WriteLine(row.Children[c].TextContent);
                    }
                    Console.WriteLine("------------------------------");
                }
            }
        }
        else
        {
            Console.WriteLine("No result found");
        }
    }
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
}
Crawling works with this URL (Index Charges), but when I switch to this URL (Signatory), I get an error, or it does not work the way it does for the first URL. Please help me understand what I am missing.
I am not 100% sure I understand your question. Nevertheless, hopefully the following answer will help you a bit...
reCAPTCHA usually requires JavaScript (as far as I know there is a fallback variant, but I am not sure whether it is used on your sites). Therefore, even though your form submission may be valid in general, you will never get a valid captcha token unless the captcha scripts actually run.
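To check whether a loaded page expects a captcha token at all, you could look for the widget markup. The following is only a minimal sketch: `HasRecaptcha` is a hypothetical helper, the inline HTML merely stands in for the real response, and the selectors assume the standard reCAPTCHA v2 embed (a `g-recaptcha` container or a script whose URL contains `recaptcha`), which the site in question may or may not use.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Dom;

static class CaptchaProbe
{
    // Hypothetical helper: true if the document contains the usual
    // reCAPTCHA v2 container or references a reCAPTCHA script.
    public static bool HasRecaptcha(IDocument document)
    {
        return document.QuerySelector(".g-recaptcha") != null
            || document.QuerySelector("script[src*='recaptcha']") != null;
    }

    static async Task Main()
    {
        var context = BrowsingContext.New(Configuration.Default);

        // Assumption: inline markup used in place of the real page.
        var document = await context.OpenAsync(req => req.Content(
            "<form name='signatoryForm'><div class='g-recaptcha'></div></form>"));

        Console.WriteLine(HasRecaptcha(document)
            ? "Page expects a reCAPTCHA token."
            : "No reCAPTCHA widget found.");
    }
}
```

In your code you would pass `context.Active` after `OpenAsync(url)` instead of the inline document. If the probe reports a widget, a plain form post without executing the scripts is unlikely to be accepted.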
There is AngleSharp.Scripting.JavaScript for enabling JavaScript, but keep in mind that it is only experimental and only works for simple scripts. The scripts in question may be too much for it.
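If you want to try the scripting package, the setup might look roughly like this. Treat it as a sketch under assumptions: I am assuming the package exposes a `WithJavaScript()` configuration extension (please verify against the version you install), `CreateConfiguration` is a hypothetical helper, and the inline page is only a stand-in to see whether scripts run at all.

```csharp
using System;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Scripting.JavaScript;

static class ScriptingSketch
{
    // Hypothetical helper that builds a configuration with cookies and scripting.
    public static IConfiguration CreateConfiguration()
    {
        // Assumption: WithJavaScript() comes from the
        // AngleSharp.Scripting.JavaScript package.
        return Configuration.Default
            .WithDefaultLoader()
            .WithCookies()
            .WithJavaScript();
    }

    static async Task Main()
    {
        var context = BrowsingContext.New(CreateConfiguration());

        // A trivial inline script; real captcha scripts are far more complex.
        var document = await context.OpenAsync(req => req.Content(
            "<title>before</title><script>document.title = 'after';</script>"));

        // If the engine ran the inline script, the title should no longer be "before".
        Console.WriteLine(document.Title);
    }
}
```

Even if this small test works, that says little about the captcha scripts themselves, so do not be surprised if the real page still fails.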