I have a long CSS selector that works perfectly fine when used in CSS, jQuery, etc. But the very same selector does not work on a Mechanize::Page object: it simply returns an empty array.
The selector targets a paragraph and, in my other case, an h1 header. I also converted the page result to a string with page.body, and the element is definitely there, but the search (or at) method will not return anything.
What could be the cause of this?
My code looks like this:
require 'mechanize'

agent = Mechanize.new
page = agent.get 'http://example.com'
page.search(source.read_more_selector).each do |read_more|
  inner_page = agent.get(read_more['href'])
  # displaying inner_page.body gives me a few valid HTML pages, but...
  inner_page.search(source.inner_title_selector).each do |inner_content|
    # ...nothing is yielded here: the search comes back empty even though the selector should definitely match something
  end
end
The CSS selector that normally works (source.inner_content_selector):
div#main-container-body > div#body-container > table > tbody > tr > td > span#ajaxprochoice > table > tbody > tr > td > table > tbody > tr > td > table > tbody > tr > td > div > h1.h1productHead
Output of inner_page.body (one of the many loop results; it can't be added here because it exceeds the character limit):
So the above selector should definitely match the paragraph inside that HTML (queried with inner_page.search on the Mechanize::Page object, of course, not on the string), but it doesn't.
I went to the actual page online and opened up my console and ran this simple jQuery command to try that out:
$('div#main-container-body > div#body-container > table > tbody > tr > td > span#ajaxprochoice > table > tbody > tr > td > table > tbody > tr > td > table > tbody > tr > td > div > h1.h1productHead').hide();
And it worked! Which pretty much means the selector is valid here.
Edit
When I added this piece of code:
inner_page.at('.h1productHead').to_s
This returned me a result. But when I use the full selector, it doesn't return anything. Why is Mechanize being inflexible with selectors in this case?
The page you are searching doesn't contain any tbody tags. When your browser parses the page, it adds the missing tbody elements to the DOM that it creates. This means that when you examine the page through the browser's inspector and console, it behaves as if the tbody tags exist.
Mechanize parses pages with Nokogiri, and Nokogiri doesn't add this tag when parsing. When you search with your query (which contains tbody), Nokogiri looks for an explicit tbody tag, and so returns no matches when it fails to find one. This is also why the short selector .h1productHead works while the full one does not.
The simplest fix is to remove all the tbody steps from your query (along with the extra > combinators).
You could also look into Nokogumbo, which extends Nokogiri with Google's Gumbo HTML5 parser, and which does add the tbody elements to the parsed document.
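As a minimal sketch of the first fix, the tbody steps could be stripped from the selector in code rather than by hand. This assumes the selector uses only the child combinator (>), as the one in the question does. strip_tbody is a hypothetical helper written for illustration; it is not part of Mechanize or Nokogiri.

```ruby
# Hypothetical helper: drops every bare `tbody` step from a selector
# made up solely of child combinators ('>').
# Assumption: no '>' appears inside attribute values or strings.
def strip_tbody(selector)
  selector
    .split('>')                        # one entry per step in the chain
    .map(&:strip)                      # trim whitespace around each step
    .reject { |step| step == 'tbody' } # drop the browser-inserted element
    .join(' > ')                       # rebuild the child-combinator chain
end

cleaned = strip_tbody('table > tbody > tr > td > div > h1.h1productHead')
puts cleaned
```

With the code from the question, the cleaned selector would then be passed to the search, for example inner_page.search(strip_tbody(source.inner_title_selector)). Note that if a page does contain explicit tbody tags in its markup, a selector cleaned this way would stop matching there, because the child combinator requires direct parent and child elements.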