I already use Scrapy to crawl websites successfully and extract specific data from each page with CSS selectors. However, that is time-consuming to set up and error-prone. Instead, I want to pass the raw HTML to ChatGPT and ask a question like
"Give me in a JSON object format the price, array of photos, description, key features, street address, and zipcode of the object"
Desired output below. I truncated description, key features and photos for legibility.
{
  "price": "$945,000",
  "photos": [
    "https://media-cloud.corcoranlabs.com/filters:format(webp)/fit-in/1500x1500/ListingFullAPI/NewTaxi/7625191/mediarouting.vestahub.com/Media/134542874?w=3840&q=75",
    "https://media-cloud.corcoranlabs.com/filters:format(webp)/fit-in/1500x1500/ListingFullAPI/NewTaxi/7625191/mediarouting.vestahub.com/Media/134542875?w=3840&q=75",
    "https://media-cloud.corcoranlabs.com/filters:format(webp)/fit-in/1500x1500/ListingFullAPI/NewTaxi/7625191/mediarouting.vestahub.com/Media/134542876?w=3840&q=75"
  ],
  "description": "<div>This spacious 2 bedroom 1 bath home easily converts to 3 bedrooms. Featuring a BRIGHT and quiet southern exposure, the expansive great room (with 9ft ceilings) is what sets (...)",
  "key features": "Center island;Central air;Dining in living room;Dishwasher",
  "street address": "170 West 89th Street, 2D",
  "zipcode": "NY 10024"
}
Right now I run into the maximum chat length of 4,096 characters, so I decided to send the page in chunks. However, even with a simple question like "What is the price of this object?", where I'd expect the answer to be "$945,000", I just get back a wall of unrelated text. I'm wondering what I'm doing wrong. I've heard that AutoGPT offers a new layer of flexibility, so I was also wondering whether that could be a solution here.
My code:
import requests
from bs4 import BeautifulSoup, Comment
import openai
import json

# Set up your OpenAI API key
openai.api_key = "MYKEY"

# Fetch the HTML from the page
url = "https://www.corcoran.com/listing/for-sale/170-west-89th-street-2d-manhattan-ny-10024/22053660/regionId/1"
response = requests.get(url)

# Parse and clean the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Remove unnecessary tags, comments, and scripts
for script in soup(["script", "style"]):
    script.extract()

# for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
#     comment.extract()

text = soup.get_text(strip=True)

# Divide the cleaned text into chunks of 4096 characters
def chunk_text(text, chunk_size=4096):
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append(text[i:i + chunk_size])
    return chunks

print(text)
text_chunks = chunk_text(text)

# Send text chunks to ChatGPT API and ask for the price
def get_price_from_gpt(text_chunks, question):
    for chunk in text_chunks:
        prompt = f"{question}\n\n{chunk}"
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=prompt,
            max_tokens=50,
            n=1,
            stop=None,
            temperature=0.5,
        )
        answer = response.choices[0].text.strip()
        if answer.lower() != "unknown" and len(answer) > 0:
            return answer
    return "Price not found"

question = "What is the price of this object?"
price = get_price_from_gpt(text_chunks, question)
print(price)
UPDATED ANSWER 06.28.2023
Your question was very interesting, so I wanted to improve my previous answer, which you have already accepted.

I noted that my previous answer cost around $0.05 per query to the OpenAI API. That cost was directly related to the text-chunking function and to asking the questions in a for loop. I have removed the chunking function and the for loop, because I was able to reduce the prompt to a condensed size. One of the core steps required to reduce the cost is text cleaning, which is a standard NLP and data-science problem. I added some more code to remove additional unneeded text from the soup object. There is a performance hit when doing this, but not enough to lose sleep over.

Refining the query prompt was also needed so that everything can be submitted in a single request, which further reduces the query cost.
The code below can be refined further. Currently it costs about $0.02 per query using text-davinci-003. The prompt would need to be reworked to use text-davinci-002, which is a little cheaper than text-davinci-003. The API query time can exceed 15 seconds; there are numerous discussions on the OpenAI community forums about query performance, and from my research there is no solid technique for improving it.
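The updated code itself is missing from this copy of the answer, but the approach it describes (aggressive text cleaning plus one condensed request) can be sketched roughly like this. The class and helper names below are illustrative, not the original code, and the cleaning rules use only the standard library:

```python
import re
from html.parser import HTMLParser


class ListingTextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style>/<noscript> contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Join with spaces and collapse whitespace runs to keep tokens down.
        return re.sub(r"\s+", " ", " ".join(self._parts)).strip()


def build_prompt(cleaned_text, fields):
    # One condensed prompt, sent in a single request instead of a chunked loop.
    return (
        "From the real-estate listing text below, return a JSON object "
        f"with these fields: {', '.join(fields)}.\n\n{cleaned_text}"
    )


extractor = ListingTextExtractor()
extractor.feed("<body><p>Price: $945,000</p><script>var x=1;</script></body>")
prompt = build_prompt(extractor.text(), ["price", "street address", "zipcode"])

# The single API call would then look something like:
# response = openai.Completion.create(
#     engine="text-davinci-003", prompt=prompt, max_tokens=256, temperature=0)
```

Whatever the original cleaning code did exactly, the idea is the same: shrink the page to just its visible text before building one prompt, so the whole request fits the model's context window.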
UPDATED ANSWER 06.26.2023
I'm trying to refine this answer. I decided to clean the data slightly before sending it to the API, which got me cleaner answers. I removed the code from my previous answer, but I left my notes, which I consider important for anyone trying to do something similar.
I found that text-davinci-003 gives more precise answers to the questions than text-davinci-002, but text-davinci-003 costs more to use.
ORIGINAL ANSWER 06.24.2023 (code removed for readability)
I noted that one of the core issues in your code was in this line:

text = soup.get_text(strip=True)

strip=True removes some of the spaces that the OpenAI model needs for processing. Joining the strings yourself preserves the spaces:

text = ' '.join(soup.find_all(string=True))

Also, the OpenAI API deals with tokens, not characters, so your chunking code needs to be replaced with one that handles tokenization.

I'm unsure of the scalability of this answer, because you will definitely need to think through all the questions applicable to your data source.
For instance:
**What is the address of this property?** The address of this property is 170 West 89th Street #2D, New York, NY 10024.
**What year was the property built?** The property was built in 1910.
**What are the maintenance fees for this property?** The maintenance fees for this property are $1,680.
**Are there any property amenities?** There are a few property amenities, including a storage unit, stroller parking, and bicycle storage.
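On the tokenization point above: the character-based chunker from the question can be adapted to count tokens instead. Here is a minimal sketch, parameterized on an encode/decode pair; in practice you would pass a real tokenizer such as tiktoken's, and the whitespace tokenizer below is only a stand-in for illustration:

```python
def chunk_by_tokens(text, encode, decode, max_tokens=3000):
    """Split text so each chunk stays under a model token budget."""
    tokens = encode(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunks.append(decode(tokens[i:i + max_tokens]))
    return chunks


# Stand-in tokenizer for illustration; with tiktoken you would use
#   enc = tiktoken.encoding_for_model("text-davinci-003")
#   chunk_by_tokens(text, enc.encode, enc.decode)
encode = lambda s: s.split()        # "tokens" = whitespace-separated words
decode = lambda toks: " ".join(toks)

chunks = chunk_by_tokens("one two three four five", encode, decode, max_tokens=2)
# chunks == ["one two", "three four", "five"]
```

Counting tokens rather than characters keeps each request safely under the model's limit regardless of how the text tokenizes.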
One of the core issues with using the OpenAI API here is extracting a clear description from the text provided. The line

text = ' '.join(soup.find_all(string=True))

produces a large, noisy text dump, so the description that comes back can vary from one run to the next. Getting a clear and concise description will require lots of testing. It might require narrowing the scraped text to a specific container, for example:

details = ''.join([element.text for element in soup.select('div[class*="DetailsAndAgents"]')])

Doing this, however, still creates issues with obtaining a clean description.