
AI-Assisted Web Scraping with GPT-4: A Practical Guide
October 8, 2024
tl;dr: show me the demo and source code!

The recent release of structured outputs by OpenAI and Mistral has opened up exciting possibilities for AI-assisted tasks. In this post, I'll share my experience building an AI-assisted web scraper on GPT-4 and Mistral, and discuss the insights gained from the experiment.
The Initial Approach: Direct Data Extraction
My first attempt involved asking GPT-4/Mistral to extract data directly from HTML strings. I used a Pydantic model to structure the output:
```python
from pydantic import BaseModel, Field


class DayTemperature(BaseModel):
    """Define your desired data structure."""

    date: str = Field(description="date of measurements")
    day: float = Field(description="day temperature")
    night: float = Field(description="night temperature")
    weather_conditions: str = Field(description="weather conditions for day and night")
    chance_of_rain: str = Field(description="chance of rain (day and night)")
    # Combined text values ("10 mph NW / 5 mph N"), so str rather than float:
    wind: str = Field(description="wind speed and direction (day and night)")
    humidity: str = Field(description="humidity (day and night)")
    UV: str = Field(description="UV index in a format like '0 of 11' or '1 of 11'")
    sunrise_sunset: str = Field(description="sunrise and sunset times")
```
With a system prompt designating the AI as an "expert web scraper," I set out to parse various tables.
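As a concrete sketch, this is roughly how such a setup can be wired together with the openai SDK's structured-output helper. The model name, prompt wording, and the trimmed `Forecast` model here are my own placeholders, not the exact ones from the experiment:

```python
from pydantic import BaseModel, Field


class Forecast(BaseModel):
    # Trimmed placeholder model; the real one has more fields.
    date: str = Field(description="date of measurements")
    day: float = Field(description="day temperature")
    night: float = Field(description="night temperature")


SYSTEM_PROMPT = "You are an expert web scraper. Extract the requested fields from the HTML."


def build_messages(html: str) -> list[dict]:
    # Pair the scraping persona with the raw HTML to parse.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Extract the forecast from this HTML:\n\n{html}"},
    ]


def extract(html: str, model: str = "gpt-4o-2024-08-06") -> Forecast:
    # Imported here so the sketch loads without the SDK installed;
    # calling this requires the openai package and an OPENAI_API_KEY.
    from openai import OpenAI

    client = OpenAI()
    # parse() validates the response against the Pydantic model
    # (structured outputs), so the result is already typed.
    completion = client.beta.chat.completions.parse(
        model=model,
        messages=build_messages(html),
        response_format=Forecast,
    )
    return completion.choices[0].message.parsed
```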
Key Findings
Handling Complex Tables

GPT-4/Mistral impressed me with its ability to parse complex tables, such as the 10-day weather forecast on Weather.com. It correctly interpreted the layout, including day/night forecasts and hidden condition data.
Limitations with Merged Rows
I discovered a limitation when parsing Wikipedia tables with merged rows. The model struggled to keep column counts consistent across rows, which made it hard to represent the output as a table.
Generating JSON
To reduce API calls and costs, I experimented with asking GPT-4/Mistral to return raw JSON instead of structured parsed data. While this approach had potential, it often produced invalid or incorrect JSON.
Cost Considerations
Using GPT-4 for web scraping can be expensive. To mitigate costs, I implemented an HTML cleanup step that strips unnecessary data from the page before passing it to the model.
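The cleanup step can be as simple as dropping tags and attributes that add tokens but carry no scrapeable data. A rough sketch using BeautifulSoup (the exact tags and attributes worth stripping will vary by site, and this is not the cleanup code from the experiment):

```python
from bs4 import BeautifulSoup


def clean_html(html: str) -> str:
    """Strip markup that inflates token count without adding data."""
    soup = BeautifulSoup(html, "html.parser")
    # These tags never contain the tabular data we want.
    for tag in soup(["script", "style", "noscript", "svg", "iframe"]):
        tag.decompose()
    # Attribute values (classes, inline styles, tracking ids) are
    # usually noise for extraction -- drop them all.
    for tag in soup.find_all(True):
        tag.attrs = {}
    return str(soup)
```

On a typical modern page, where markup and scripts dwarf the visible text, a pass like this can shrink the prompt considerably before the model ever sees it.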
Conclusions and Future Directions
Despite the cost, GPT-4's extraction quality shows promise for AI-assisted web scraping tools. I've created a demo using Streamlit, with the source code on GitHub.
Future improvements could include:
- Capturing browser events for better user experience
- Exploring more complex extraction methods for intricate table structures
- Further optimizing HTML cleanup to reduce API costs
This experiment highlights both the potential and the challenges of AI-assisted web scraping. As AI models continue to evolve, we can expect even more powerful and efficient tools in this domain.