** Updated to reflect the suggestion by @copser. I still have no luck getting the output they got, and I can't find what I'm doing wrong. I have tried plugging the list of URLs straight into the process_pages() function in the shell, like so:
KgbScrape.process_pages([
  "https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/page1/?filter=#link",
  "https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/page2/?filter=#link",
  "https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/page3/?filter=#link",
  "https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/page4/?filter=#link",
  "https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/page5/?filter=#link"
])
but get this error in return:
** (UndefinedFunctionError) function Floki.parse_document/2 is undefined or private. Did you mean one of:
* parse/1
I have verified that the build_urls() and fetch_pages() functions are working correctly:
defmodule KgbScrape do
  use HTTPoison.Base

  @endpoint "https://www.dealerrater.com/dealer/McKaig-Chevrolet-Buick-A-Dealer-For-The-People-dealer-reviews-23685/page"

  def build_urls() do
    page_num = ["1", "2", "3", "4", "5"]
    tail_url = ["/?filter=#link"]

    for page <- page_num, tail <- tail_url do
      urls_list = @endpoint <> page <> tail
    end
  end

  def fetch_pages(url) do
    url
    |> HTTPoison.get()
    |> response()
  end

  def process_pages(urls) when is_list(urls) do
    resp =
      urls
      |> Task.async_stream(fn url -> fetch_pages(url) end)
      |> Enum.map(fn {_, resp} -> resp end)

    Enum.map(resp, fn r ->
      r
      |> Floki.parse_document!()
      |> Floki.find(".review-entry")
      |> Map.new(fn entry ->
        [{"div", _, [date]}] = Floki.find(entry, "div.italic")
        [{"p", _, [content]}] = Floki.find(entry, "p.review-content")
        {date, content}
      end)
    end)
  end

  def response({:ok, %{body: {:ok, %{"error" => error}}}}) do
    {:error, error}
  end

  def response({:ok, %{body: body}}), do: body
  def response({:error, error}), do: {:error, error}
end
I'll do my best to explain what is happening with the error you have. You are passing a list to the get_urls({_, urls}) function, which pattern matches on a two-element tuple, so the match against a list fails. Even if you pass the list properly and enumerate over the URLs, you will still get an error when the response hits |> Map.get(:body), because you want to fetch a single body but you will get a list of bodies, so you still need to enumerate over that, and so on. I would do something like this.
With fetch_pages(url) you can test a single URL and see what the response looks like. It can also be reused later for other pages, and I'm using it in process_pages(urls). process_pages(urls) processes the list of URLs that you are trying to parse with Floki. I'm using the Task module here to fetch those pages concurrently. The result will be a list with one map per page, each mapping a review date to its content. The response functions are helpers that handle the HTTPoison response. Happy coding.
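One thing worth checking in the concurrent step: Task.async_stream wraps each result as {:ok, value} or {:exit, reason}, and fetch_pages/1 can itself return {:error, error}, so an error tuple can reach the Floki call if the first tuple element is discarded. Below is a minimal sketch of one way to keep successes and failures apart before parsing. The module name PageResults, the function fetch_all/2, and the fake fetcher are hypothetical names used only for illustration; the fetcher is passed in so the sketch runs without HTTPoison.

```elixir
defmodule PageResults do
  # Hypothetical helper, not part of the original KgbScrape module.
  # `fetch` is any one-argument function that returns either a body (binary)
  # or {:error, reason}, which is the shape fetch_pages/1 returns above.
  def fetch_all(urls, fetch) when is_list(urls) and is_function(fetch, 1) do
    urls
    # on_timeout: :kill_task turns a timeout into {:exit, :timeout}
    # instead of exiting the calling process.
    |> Task.async_stream(fetch, on_timeout: :kill_task)
    # Results come back in input order by default, so they can be zipped
    # with the URLs they belong to.
    |> Enum.zip(urls)
    |> Enum.map(fn
      {{:ok, body}, url} when is_binary(body) -> {:ok, url, body}
      {{:ok, {:error, reason}}, url} -> {:error, url, reason}
      {{:exit, reason}, url} -> {:error, url, reason}
    end)
  end
end

# Stand-in for fetch_pages/1 so the example needs no network access.
fake_fetch = fn
  "good" -> "<html></html>"
  "bad" -> {:error, :nxdomain}
end

results = PageResults.fetch_all(["good", "bad"], fake_fetch)
# results == [{:ok, "good", "<html></html>"}, {:error, "bad", :nxdomain}]
```

With this shape, only the {:ok, url, body} entries would be handed to Floki, and the {:error, url, reason} entries can be logged or retried.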