Scraping German Rental Price Data – Part I: Whole Lotta Captchas

Our Goal
Finding affordable apartments is hard. We may not be able to influence rental prices, but at least we can bring some transparency into the market. Our goal: we want to collect apartment listings together with meta information like rental prices, square meters, and location. To achieve this, we will:

- Crawl wg-gesucht.de
- Write the results to a SQLite database
- Write a small Django app to serve the scraped data as a dashboard

You can find the source code on GitHub....
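The scraped listings could be stored in a schema along these lines ‒ a minimal sketch using Python's built-in sqlite3 module; the table and column names here are illustrative, not taken from the post:

```python
import sqlite3

# Illustrative schema for scraped apartment listings
# (names are assumptions, not from the original post).
conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS listings (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT UNIQUE,            -- deduplicate re-crawled listings
        rent_eur REAL,              -- monthly rent
        size_sqm REAL,              -- living space in square meters
        city TEXT,
        scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)
conn.execute(
    "INSERT INTO listings (url, rent_eur, size_sqm, city) VALUES (?, ?, ?, ?)",
    ("https://www.wg-gesucht.de/example", 550.0, 20.5, "Berlin"),
)
row = conn.execute("SELECT rent_eur, size_sqm FROM listings").fetchone()
print(row)
```

The `UNIQUE` constraint on `url` lets repeated crawls skip listings that were already stored.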

July 28, 2025 · 3 min · 527 words · Andreas Lay

Real-time Data Streaming with Kafka Connect

Why Kafka Connect?
While you can always write your own Kafka connector to move data from Kafka to S3 or a database, using for example confluent-kafka-python, this can be hard to maintain and error-prone. Kafka Connect simplifies this task. In this post we will:

- set up a local Kafka cluster, S3 storage & Kafka Connect with Docker Compose
- create a Kafka topic and publish messages to it
- use Kafka Connect to create an S3 sink connector and write the messages to S3

Run the Example
You can run the full example described in this post by executing the following script. It will download the necessary files, start the containers, create a Kafka topic, publish messages, and create an S3 sink connector to write the data to S3:...
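For reference, an S3 sink connector is configured declaratively rather than in code. A sketch of such a configuration, posted to the Kafka Connect REST API ‒ topic and bucket names here are placeholders, not the values used in the post:

```json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.storage.format.json.JsonFormat",
    "flush.size": "10"
  }
}
```

`flush.size` controls how many records are batched into a single S3 object before it is written.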

July 21, 2025 · 4 min · 689 words · Andreas Lay

ARIMA Models in Python: All Just Statsmodels Under The Hood?

What’s ARIMA and Why Should You Care?
If you’re working with time series and need to produce forecasts, autoregressive (integrated) moving-average models (AR(I)MA) are still a good place to start. But which Python implementation should you use (if you don’t want to use R)? Recently I’ve again been looking into what the Python ecosystem has to offer with regard to time series analysis in general and ARIMA models in particular. There are quite a few options; however, you should have a rough understanding of what’s happening under the hood: are we dealing with a framework that wraps existing libraries, or with native implementations?...

May 11, 2025 · 3 min · 615 words · Andreas Lay

Ibis: Build your SQL Queries via Python ‒ One API for nearly 20 Backends

What is Ibis and Why Would You Use It?
Ibis is a backend-agnostic query builder / dataframe API for Python. Think of it as an interface for dynamically generating & executing SQL queries via Python. Ibis provides a unified API for a variety of backends like Snowflake, BigQuery, DuckDB, Polars, or PySpark. But why would this be useful? I mainly use it for:

- Generating complex dynamic queries where (Jinja-)templating would become too messy
- Writing SQL generation as self-contained, easily testable Python functions
- Swapping out a production OLAP database (like Snowflake) at test time for a local & version-controlled DuckDB instance containing test data

If you know SQL, you already know Ibis....

March 30, 2025 · 7 min · 1394 words · Andreas Lay

Distributed Locking Using Google Cloud Storage (or S3)

The Need for Distributed Locks
When you run into situations where you want to prevent two pieces of code ‒ possibly running on different machines ‒ from running concurrently, you need a distributed lock. An easy way to implement such a lock is to leverage a cloud storage service like Google Cloud Storage (GCS) or Amazon S3. Here you can find the full code example implementing this kind of lock....

March 27, 2025 · 4 min · 785 words · Andreas Lay