Web Scraping for Product Analysis and Price Comparison
December 22, 2024

This is a submission for the Bright Data Web Scraping Challenge: Most Creative Use of Web Data for AI Models

Product research plays an important role in market research, SEO and, for me personally, finding the best prices for the products I want to buy. For some time, I have been researching E-Katalog LKPP, a government-controlled online marketplace. The marketplace is said to provide government agencies, schools, and institutions with access to a variety of products, including stationery, laptops, and more.

One of my family members owns a laptop purchased from this market and man, it sucks. It’s a laptop from an obscure brand, and it’s ridiculously expensive compared to other brands in its price range.

Therefore, I turned my interest to comparing product prices between LKPP and other online marketplaces to see if there were any significant differences.

In this article, I will show how I used the Bright Data platform for this: Scraping Browser to crawl the LKPP website, and the Web Scraping API to collect product data from other online marketplaces for comparison.

Let’s dig deeper!


What I built

I built a dashboard where you can explore and compare product statistics across multiple marketplaces (LKPP, Tokopedia, Lazada). Additionally, with the power of an open-source LLM, we can cluster products to discover interesting relationships.

Overall, we can break this process into several steps, as described below.

First, I used Scraping Browser to collect data from E-Katalog LKPP, and then used that data to extract popular product keywords for searching the two other marketplaces, namely Tokopedia and Lazada. For those two, I used the Web Scraping API as a convenient way to collect product data.
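
As a rough illustration of the keyword step, here is a minimal sketch of n-gram counting over product names. The function names, the tokenization rule, and the sample product names are assumptions made for this example, not taken from the project's code.

```python
from collections import Counter
import re


def ngrams(tokens, n):
    """Return the n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_keywords(product_names, n=2, k=5):
    """Count n-grams across all product names and return the k most common."""
    counts = Counter()
    for name in product_names:
        # Lowercase and keep only alphanumeric runs as tokens.
        tokens = re.findall(r"[a-z0-9]+", name.lower())
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Hypothetical product names, for illustration only.
names = [
    "Laptop Core i5 8GB RAM 512GB SSD",
    "Laptop Core i7 16GB RAM 1TB SSD",
    "Notebook Core i5 8GB RAM 256GB SSD",
]
print(top_keywords(names, n=2, k=4))
```

The most frequent n-grams can then be used as search terms on the other marketplaces.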

After obtaining data from the three sources, I used Ollama with the Llama 3.1 model and DSPy to extract structured data (processor, memory, and storage) from the scraped product descriptions. I also used an embedding model to build text embeddings and then clustered the products to explore similar items across the marketplaces.
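
To make the clustering idea concrete, here is a dependency-free sketch of k-means, the algorithm the Product Cloud view is based on. It is not the project's code: the tiny 2-D vectors stand in for real text embeddings, and the centroids are seeded with the first k points purely to keep the example deterministic. In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length float vectors.

    Centroids start at the first k points so the result is deterministic;
    real implementations usually use random or k-means++ initialization.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-D "embeddings"; real text embeddings have far more dimensions.
embeddings = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 5.0)]
labels, centroids = kmeans(embeddings, k=2)
print(labels)
```

Products that share a label sit close together in embedding space, which is what makes the cluster view useful for spotting similar listings.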

Finally, I used Streamlit to deploy the application.


Demonstration

You can visit the web app here.

The Streamlit app is divided into four parts:

Dashboard: This section shows the product price distribution and the most popular brands, GPUs, and storage options.

Keyword browser: This section contains a basic keyword research tool based on n-gram frequency.

Product Cloud: This section shows a 3D clustering of product names based on K-Means. The points are pre-computed using t-SNE dimensionality reduction, and the text embeddings are produced by the Nomic Embed Text model.

Compare prices: In this section you can enter a product name to see a comparison of matching products across the three marketplaces, along with a statistical test (t-test).
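
For readers unfamiliar with the t-test used in the price comparison, here is a small sketch of the calculation. The article does not say which variant the app uses, so Welch's unequal-variance version is an assumption here, and the prices are made-up numbers. Only the t statistic and degrees of freedom are computed; a p-value needs the t distribution, which `scipy.stats.ttest_ind(a, b, equal_var=False)` would provide.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical prices (same currency) for one product on two marketplaces.
prices_a = [12.0, 14.0, 16.0, 18.0]
prices_b = [9.0, 10.0, 11.0, 12.0]
t, df = welch_t(prices_a, prices_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute t value suggests the average prices on the two marketplaces really differ, rather than the gap being noise from a handful of listings.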


How I use Bright Data

As mentioned in the previous sections, I primarily used Bright Data’s Scraping Browser and Web Scraping API services.

Bright Data's Scraping Browser, with its powerful unblocking and proxy features, excels at getting access to any website. Although the LKPP website is protected by Cloudflare, the scraping process still ran smoothly through Scraping Browser. I used Playwright for scraping, and the integration was as easy as changing one line:

# from this
browser = await p.chromium.launch(headless=False, slow_mo=50)

# to this
browser = await p.chromium.connect_over_cdp("wss://AUTH_HERE@brd.superproxy.io:9222", slow_mo=50)
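
For context, here is how that one-line change might sit inside a complete script. This is only a sketch: `cdp_endpoint` and `fetch_title` are hypothetical helper names, the target URL is a placeholder, and `AUTH_HERE` stands for whatever credential string your Bright Data account provides.

```python
import asyncio


def cdp_endpoint(auth: str) -> str:
    """Build the Scraping Browser WebSocket URL from the credential string."""
    return f"wss://{auth}@brd.superproxy.io:9222"


async def fetch_title(url: str, auth: str) -> str:
    """Open `url` in the remote browser and return the page title."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        # Connect to the remote browser instead of launching a local one.
        browser = await p.chromium.connect_over_cdp(cdp_endpoint(auth))
        try:
            page = await browser.new_page()
            await page.goto(url, timeout=120_000)  # timeout in milliseconds
            return await page.title()
        finally:
            await browser.close()


# Example call (placeholders, needs real credentials and network access):
# print(asyncio.run(fetch_title("https://example.com", "AUTH_HERE")))
```

Everything after the connection line is ordinary Playwright code, which is why switching from a local browser is so painless.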

For public marketplace data, namely Tokopedia and Lazada, Bright Data's Web Scraping API provides an intuitive and convenient way to capture data without requiring us to write custom scraping scripts. This saved me a lot of time, allowing me to focus on analyzing the data and building the Streamlit application.


Award categories

Although I submitted this project under the third prompt, I believe it could fall into any of the categories.


Final thoughts

It’s been an interesting journey, especially seeing how web scraping and GenAI can be combined to extract structured information from the web. Bright Data’s powerful Scraping Browser and convenient Web Scraping API allowed me to collect large amounts of data in a short period of time. They made the web scraping process a breeze, which let me shift my focus to drawing insights from the scraped data. No more CAPTCHAs, and no more writing custom scripts for popular websites.

