Our Goal

Finding affordable apartments is hard. We may not be able to influence rental prices, but at least we can bring some transparency into the market. Our goal: we want to collect apartment listings together with metadata such as rental price, size in square meters, and location. To achieve this we will:

  1. Crawl wg-gesucht.de
  2. Write the results to a SQLite database
  3. Write a small Django app to serve our scraped data as a dashboard

You can find the source code on GitHub. The final result will look like this and can be accessed under this link:

This part of my series deals with the actual scraping.

WG Gesucht?

The German abbreviation “WG” (Wohngemeinschaft) translates to “shared flat”, and if you’re looking for one you can try your luck on wg-gesucht.de. Despite its name, wg-gesucht offers not only shared flats but also entire apartments for rent:

  • WG-Zimmer = Shared flat
  • 1-Zimmer Wohnung = One-room apartment
  • Wohnung = Apartment
  • Haus = House

Let’s Inspect the HTML…

The network tab in Chrome DevTools doesn’t reveal any API we could directly query, but looking at the HTML of a search result for apartments in Berlin, we see that the listing data is embedded as JSON-LD:
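
A minimal sketch of pulling these JSON-LD blocks out of the page with Beautiful Soup (JSON-LD conventionally lives in script tags of type application/ld+json; the function name here is mine):

import json
from bs4 import BeautifulSoup

def extract_json_ld(html: str) -> list[dict]:
    """Collect every JSON-LD block embedded in the page."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string or ""))
        except json.JSONDecodeError:
            continue  # skip empty or malformed blocks
    return blocks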

Interestingly, the square meter information is not part of this JSON but only appears inside a div:

This div has the id liste-details-ad-<LISTING-ID>, and the listing id is also part of the url field in the JSON above, so we can easily link the two.
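
A sketch of that linking step; the markup inside the div and the exact position of the id inside the URL are assumptions on my side, so the regexes are stand-ins rather than the definitive parsing logic:

import re
from bs4 import BeautifulSoup

def extract_square_meters(html: str) -> dict[str, str]:
    """Map listing id -> square meters, keyed by the numeric suffix of the div id."""
    sizes = {}
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", id=re.compile(r"^liste-details-ad-\d+$")):
        listing_id = div["id"].rsplit("-", 1)[-1]
        # Assumed: the size appears somewhere in the div's text, e.g. "42 m²".
        match = re.search(r"(\d+)\s*m²", div.get_text())
        if match:
            sizes[listing_id] = match.group(1)
    return sizes

def listing_id_from_url(url: str) -> str | None:
    """Assumed: the numeric listing id sits right before ".html" in the url field."""
    match = re.search(r"\.(\d+)\.html", url)
    return match.group(1) if match else None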

Last but not least, we will need to loop over the result pages. At the bottom of the page we find a div with the id assets-list-pagination, which tells us how many pages of results there are to scrape:
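
The inner markup of that div is an assumption here; simply taking the largest number appearing in it is one robust way to read the page count:

from bs4 import BeautifulSoup

def extract_page_count(html: str) -> int:
    """Take the highest number in the pagination div as the number of result pages."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.find("div", id="assets-list-pagination")
    if pagination is None:
        return 1  # no pagination rendered, so assume a single result page
    numbers = [int(text) for text in pagination.stripped_strings if text.isdigit()]
    return max(numbers, default=1)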

That’s enough for now; we can work with that.

…And the URL Too

We can parameterize the search result URL with these parameters:

  • city
  • city_id
  • page

In my code I simply use a templated string:

URL_TEMPLATE = (
    "https://www.wg-gesucht.de/wg-zimmer-und-1-zimmer-wohnungen-und-wohnungen-und-haeuser"
    "-in-{city}.{city_id}.0+1+2+3.1.{page}.html?offer_filter=1"
    "&city_id={city_id}&sort_order=0&noDeact=1&categories%5B%5D=0"
    "&categories%5B%5D=1&categories%5B%5D=2&categories%5B%5D=3&pagination=4&pu="
)

We are going to query all listings without further filters.
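
Filling in the template is then a single format call ({city_id} appears twice and is filled in both places). The values below are purely illustrative; the real city_id has to be looked up from a wg-gesucht URL:

# Illustrative values only; take the real city_id from a wg-gesucht URL.
url = URL_TEMPLATE.format(city="Berlin", city_id=8, page=1)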

Let’s Scrape

The scraper itself is quite simple: I am using Zendriver to drive Chrome and Beautiful Soup to parse the scraped HTML (see the sketch after the list below).

Running the scraper will:

  • open multiple Chrome instances and scrape the site for you
  • load & parse the HTML
  • write the extracted data into a local SQLite database
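
Here is a condensed sketch of that loop, assuming Zendriver exposes the nodriver-style API (zd.start, browser.get, page.get_content) and reusing the helpers sketched above; the table schema, the JSON-LD keys and the zero-based page index are assumptions:

import asyncio
import sqlite3

import zendriver as zd  # assumed: Zendriver exposes the nodriver-style API

async def scrape_city(city: str, city_id: int) -> None:
    conn = sqlite3.connect("listings.db")
    conn.execute("CREATE TABLE IF NOT EXISTS listings (url TEXT PRIMARY KEY, name TEXT)")
    browser = await zd.start()
    # Fetch the first page once to learn how many result pages there are.
    page = await browser.get(URL_TEMPLATE.format(city=city, city_id=city_id, page=0))
    total_pages = extract_page_count(await page.get_content())
    for page_no in range(total_pages):  # page index assumed to be zero-based
        page = await browser.get(URL_TEMPLATE.format(city=city, city_id=city_id, page=page_no))
        for block in extract_json_ld(await page.get_content()):
            conn.execute(
                "INSERT OR IGNORE INTO listings (url, name) VALUES (?, ?)",
                (block.get("url"), block.get("name")),
            )
        conn.commit()
    await browser.stop()
    conn.close()

asyncio.run(scrape_city("Berlin", 8))  # illustrative city/city_id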

This is how it looks:

After a while you will run into Captchas:

My script currently detects when a Captcha page pops up and pauses scraping until the user has solved the Captcha manually; after that, scraping continues with the corresponding page. It might be worth checking out SeleniumBase, which may be able to bypass Captchas.
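
A hypothetical version of that pause-and-resume logic; how a Captcha page is actually recognized is site-specific, so the marker check below is only a stand-in:

import asyncio

async def wait_for_captcha(page) -> None:
    """Poll until the user has solved the Captcha in the visible browser window."""
    # Assumed: the Captcha page can be recognized by a marker in its content.
    while "captcha" in (await page.get_content()).lower():
        print("Captcha detected, please solve it in the browser window...")
        await asyncio.sleep(5)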

Result: A Database of Apartment Listings

After our scraper has finished its job, we will see that our SQLite table has filled up with listing data:
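
A quick sanity check against the database (table and column names follow the sketch above):

import sqlite3

conn = sqlite3.connect("listings.db")
(count,) = conn.execute("SELECT COUNT(*) FROM listings").fetchone()
print(f"Scraped {count} listings")
conn.close()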

Part II: Building a Dashboard

That’s it for this part! Now that we’ve collected our data, we’d like to get some insights into rental prices. In the upcoming Part II, I will show you how I’ve built a small UI using Django, daisyUI & Alpine.js to visualize our scraped data.