I'd like to gather some knowledge about web scraping. I am currently building a hobby project in which I want to create something like "pricerunner", which is a price comparison website. My approach would be to scrape similar products from different sources.
For simplicity, let's say that I want to compare iPhone prices:
Webshop X has the following product: Title: iPhone 14 Pro 128GB Red Price: $1299,99
Webshop Y has the following product: Title: Green 128GB iPhone 14 Pro Price: $1249,99
Webshop Z has the following product: Title: Blue iPhone 14 Pro 128GB Price: $1199,99
After scraping this data, I'd need some way to standardize the above three products into one product:
{
  "title": "iPhone 14 Pro",
  "storage": 128,
  "vendors": {
    "webshop-x": {"link": "webshop-x.com/iphone", "price": 1299},
    "webshop-y": {"link": "webshop-y.com/iphone", "price": 1249},
    "webshop-z": {"link": "webshop-z.com/iphone", "price": 1199}
  }
}
Or something along the lines of the above object. I hope it makes sense.
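To make the idea more concrete, here is a rough Python sketch of the standardization step as I imagine it. The color list, the function names and the tuple layout of the scraped offers are all assumptions made up for this example, not an existing library or a finished design. It only handles the three titles above.

```python
import re
from collections import defaultdict

# Hypothetical color vocabulary; a real catalog would need a much larger one.
COLORS = {"red", "green", "blue"}


def parse_price(text):
    """Turn a scraped string such as '$1299,99' into a float (1299.99)."""
    cleaned = text.strip().lstrip("$").replace(",", ".")
    return float(cleaned)


def normalize_title(title):
    """Split a raw title into (model name, storage in GB).

    Storage tokens like '128GB' are pulled out, known colors are dropped,
    and the remaining words are kept in their original order.
    """
    storage = None
    words = []
    for token in title.split():
        match = re.fullmatch(r"(\d+)GB", token, flags=re.IGNORECASE)
        if match:
            storage = int(match.group(1))
        elif token.lower() not in COLORS:
            words.append(token)
    return " ".join(words), storage


def group_offers(offers):
    """Group (vendor, title, price, link) tuples by their normalized title."""
    grouped = defaultdict(dict)
    for vendor, title, price, link in offers:
        key = normalize_title(title)
        grouped[key][vendor] = {"link": link, "price": parse_price(price)}
    return dict(grouped)


offers = [
    ("webshop-x", "iPhone 14 Pro 128GB Red", "$1299,99", "webshop-x.com/iphone"),
    ("webshop-y", "Green 128GB iPhone 14 Pro", "$1249,99", "webshop-y.com/iphone"),
    ("webshop-z", "Blue iPhone 14 Pro 128GB", "$1199,99", "webshop-z.com/iphone"),
]
products = group_offers(offers)
```

With this input, all three offers end up under the single key ("iPhone 14 Pro", 128). The sketch relies on the model words appearing in the same order in every title, which will not hold in general.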
In short, the objective would be to gather data from different sources for products that are similar, and to standardize them into one product, so that when a user searches for iPhone 14 Pro on my site, all three would be returned.
I reckon that with this exact product, it would be quite easy to just return every product whose title contains the words "iPhone", "14", "Pro" and "128", in no particular order; but more complex products would exist.
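The simple keyword approach could look something like the sketch below. The helper names are hypothetical, and the "Pro Max" title is an invented example I added to show where this kind of matching starts to break down.

```python
import re


def tokens(text):
    """Lowercase the text and split it into runs of letters or digits,
    so that '128GB' becomes the two tokens '128' and 'gb'."""
    return set(re.findall(r"[a-z]+|\d+", text.lower()))


def matches(query, title):
    """True if every token of the query appears somewhere in the title."""
    return tokens(query) <= tokens(title)


query = "iPhone 14 Pro 128"
titles = [
    "iPhone 14 Pro 128GB Red",
    "Green 128GB iPhone 14 Pro",
    "Blue iPhone 14 Pro 128GB",
    "iPhone 14 Pro Max 128GB",  # invented title: a different product
    "iPhone 14 128GB",          # invented title: no "Pro"
]
hits = [title for title in titles if matches(query, title)]
```

Here the three original titles match regardless of word order, and the plain "iPhone 14" is excluded, but the "Pro Max" title is also returned even though it is a different product. That is the kind of case I am worried about.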
What is your take on this? Am I missing something?
Have a nice day!
Best regards