ARIMA Models in Python: All Just Statsmodels Under The Hood?

What’s ARIMA and Why Should You Care? If you’re working with time series and you need to produce forecasts, autoregressive moving-average models (AR(I)MA) are still a good place to start. But which Python implementation should you use (if you don’t want to use R)? Recently I’ve been again looking into what the Python ecosystem has to offer in regards to time series analysis in general and ARIMA models in particular. There are quite a few options, however you should have a rough understanding what’s happening under the hood: Are we dealing with a framework that wraps existing libraries or native implementations?...

May 11, 2025 · 3 min · 615 words · Andreas Lay

Ibis: Build your SQL Queries via Python ‒ One API for nearly 20 Backends

What is Ibis and Why Would You Use It? Ibis is a backend agnostic query builder / dataframe API for Python. Think of it as an interface for dynamically generating & executing SQL queries via Python. Ibis provides a unified API for a variety of backends like Snowflake, BigQuery, DuckDB, polars or PySpark. But why would this be useful? I mainly use it for: Generating complex dynamic queries where (Jinja-)templating would becomes too messy Writing SQL generation as self-contained, easily testable Python functions Switching out a production OLAP database (like Snowflake) at test time with a local & version controlled DuckDB instance containing test data If you know SQL you already know Ibis....

March 30, 2025 · 7 min · 1394 words · Andreas Lay

Distributed locking using Google Cloud Storage (or S3)

The Need for Distributed Locks When you run into situations where you want to prevent two pieces of code ‒ possibly running on differen machines ‒ from running concurrently, you need a distributed lock. An easy solution to implement such a lock is to leverage a cloud storage service like Google Cloud Storage (GCS) or Amazon S3. Here you can find the full code example implementing this kind of lock....

March 27, 2025 · 4 min · 785 words · Andreas Lay

dbt: Programmatic Invocation via dbtRunner

Introduction dbt is a great tool for building & organising your ELT data pipelines. When deploying dbt yourself you can invoke dbt either through dbt core cli or through Python via dbtRunner. I will give you an example template on how to use the latter. You can find the full example in the Full Example section. Note: This example was build on dbt-core==1.8.3. dbtRunner may be subject to breaking changes so there’s no guarantee the provided code works as is with other dbt versions....

August 6, 2024 · 9 min · 1823 words · Andreas Lay

A Primer on SARIMAX

A while ago I created a notebook with an introduction to time series analysis. Here is this notebook as a Gist: Generate a synthetic time series with cycles, trend (random walk) and noise components Look at some descriptive statistics (e.g. autocorrelations) Model the synthetic data with a SARIMA model Working with synthetic data first forces you to be explicit about your assumptions and is great for debugging: Unlike real data, as you know the true process the synthetic data follows you can validate your estimates easily against the “true” values....

November 21, 2023 · 1 min · 109 words · Andreas Lay