I'm trying to collect data with x-ray from a page that is structured like this:
<h1>Page title</h1>
<article>
<h2 id="first">Title 1</h2>
<h3>Subtitle 1</h3>
<ul>
<li>Element 1
<li>Element 2
<li>Element 3
</ul>
<h2 id="second">Title 2</h2>
<h3>Subtitle 2</h3>
<h2 id="third">Title 3</h2>
<h3>Subtitle 3</h3>
<ul>
<li>Element 1
<li>Element 2
<li>Element 3
</ul>
</article>
The article is split into sections by <h2> headings. Each section contains a subtitle and may contain a list of items. My goal is to get an object with this structure:
type Result = {
pageTitle: string,
sections: { subtitle?: string, elements?: string[] }[],
}
From the example above, I expect this output:
{
pageTitle: "Page title",
sections: [
{
subtitle: "Subtitle 1",
elements: ["Element 1", "Element 2", "Element 3"]
},
{
subtitle: "Subtitle 2",
elements: [] //or any falsy value
},
{
subtitle: "Subtitle 3",
elements: ["Element 1", "Element 2", "Element 3"]
}
]
}
I've tried:
xray(url, {
pageTitle: "h1 | trim", //where trim is defined filter
sections: xray("article", [{
subtitle: "h3",
elements: ['h3 ~ ul li']
}])
})
But I've figured out that this doesn't work as expected, because there is only one article tag on the page and [] makes x-ray iterate over whatever the selector (article in my case) returns.
I've also tried:
xray(url, {
pageTitle: "h1 | trim", //where trim is defined filter
sections: xray("h2", [{
subtitle: "h3",
elements: ['h3 ~ ul li']
}])
})
This returns 0 results, probably because xray("h2", /* other code */) scopes the selection to each h2 and nothing else, and my h2 elements don't contain nested elements.
So is there a way to get an array of objects from a non-nested HTML structure?
The x-ray library doesn't provide an easy way to capture sibling elements within its selector syntax. It primarily works with parent-child relationships, and the structure you're trying to scrape doesn't conform to that pattern.
Ideally, a library like Puppeteer would be more suitable, but x-ray combined with jsdom can handle it too.
The solution is to pre-process the HTML to encapsulate each section within a separate container, then scrape that new structure with x-ray.
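Steps:
1. Load the page HTML into jsdom.
2. For each <h2>, create a wrapper element and move the <h2> and every following sibling up to the next <h2> into it.
3. Serialize the modified document and pass it to x-ray, using the wrapper element as the scope for sections.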
Code:
This approach makes extensive use of DOM manipulation, which may not be ideal.
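If the DOM manipulation is a concern, the same wrapping idea can be sketched as plain string pre-processing. This is a swapped-in, string-based variant rather than the jsdom approach described above, and it is only a minimal sketch. `wrapSections` is a hypothetical helper, not part of x-ray. It assumes the page has a single <article>, that every <h2> is a direct child of it, and that the markup is regular enough for a regex split to be safe.

```javascript
// Hypothetical helper (not part of x-ray or jsdom).
// Wraps each <h2>-led run of markup inside <article> in its own <section>,
// so that "section" can later be used as an x-ray scope.
// Assumption: one <article>, with all <h2> elements as its direct children.
function wrapSections(html) {
  return html.replace(
    /(<article\b[^>]*>)([\s\S]*?)(<\/article>)/i,
    (match, open, body, close) => {
      // Split right before every "<h2" without consuming it (lookahead).
      const chunks = body.split(/(?=<h2\b)/i);
      // Only chunks that start with an <h2> are sections; anything before
      // the first <h2> (for example leading whitespace) is left untouched.
      const wrapped = chunks.map((chunk) =>
        /^<h2\b/i.test(chunk) ? `<section>${chunk}</section>` : chunk
      );
      return open + wrapped.join("") + close;
    }
  );
}
```

The resulting string could then be handed to x-ray in place of the URL, with the nested call changed to something like xray("section", [{ subtitle: "h3", elements: ["li"] }]). That relies on x-ray accepting raw HTML as its source, which is worth verifying against the version you use.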