data-platform

CI Coverage Python 3.12 dbt Dagster PostgreSQL 16 MLflow License: MIT

A personal data platform for ingesting, transforming and analysing external data sources. Built with Dagster, dbt and PostgreSQL, managed with uv and deployed via Docker Compose.

Stack

| Layer | Tool |
| --- | --- |
| Orchestration | Dagster (webserver + daemon) |
| Transformation | dbt-core + dbt-postgres |
| ML | LightGBM, MLflow, scikit-learn |
| Storage | PostgreSQL 16 |
| Notifications | Discord webhooks |
| Observability | Elementary (report served via nginx) |
| CI | GitHub Actions (Ruff, SQLFluff, Prettier, pytest) |
| Package / venv | uv |
| Infrastructure | Docker Compose |

Data pipeline

Ingestion

Python-based Dagster assets fetch data from external APIs and load it into a raw PostgreSQL schema using upsert semantics. Assets are prefixed with raw_ to clearly separate unmodelled data from transformed outputs. They are chained via explicit dependencies and run on a recurring cron schedule.
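The upsert pattern can be sketched as follows. This is an illustration only: the table and column names are hypothetical, and SQLite is used so the example is self-contained, since PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form.

```python
import sqlite3

# Hypothetical raw-layer upsert: one row per listing id, the newest fetch wins.
UPSERT = """
INSERT INTO raw_listings (listing_id, price, fetched_at)
VALUES (?, ?, ?)
ON CONFLICT (listing_id) DO UPDATE SET
    price = excluded.price,
    fetched_at = excluded.fetched_at
"""

def load_raw_listings(conn, rows):
    """Idempotently load API rows into the raw schema."""
    with conn:
        conn.executemany(UPSERT, rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_listings ("
    "listing_id INTEGER PRIMARY KEY, price REAL, fetched_at TEXT)"
)
load_raw_listings(conn, [(1, 100.0, "2026-03-01"), (2, 250.0, "2026-03-01")])
load_raw_listings(conn, [(1, 120.0, "2026-03-02")])  # re-fetch updates row 1
```

Because the load is an upsert keyed on the primary key, re-running an ingestion asset over overlapping data is safe: existing rows are updated in place rather than duplicated.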

Transformation

dbt models follow the staging → intermediate → marts layering pattern:

| Layer | Materialisation | Purpose |
| --- | --- | --- |
| Staging | view | 1:1 cleaning of each raw table with enforced contracts |
| Intermediate | view | Business-logic joins and enrichment across staging models |
| Marts | incremental | Analysis-ready tables with derived metrics, loaded incrementally |

Mart models use dbt's incremental materialisation — on each run only rows with a newer ingestion timestamp are merged into the target table. On a full-refresh the entire table is rebuilt.
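A mart's incremental logic might look like the following sketch. The model and column names here are hypothetical; the `is_incremental()` filter and merge strategy are standard dbt.

```sql
-- Hypothetical mart model: models/marts/fct_listings.sql
{{ config(
    materialized='incremental',
    unique_key='listing_id',
    incremental_strategy='merge'
) }}

select listing_id, price, ingested_at
from {{ ref('int_listings_enriched') }}

{% if is_incremental() %}
  -- only rows newer than what is already in the target table
  where ingested_at > (select max(ingested_at) from {{ this }})
{% endif %}
```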

All staging and intermediate models have dbt contracts enforced, locking column names and data types to prevent silent schema drift.
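An enforced contract is declared in the model's YAML; the sketch below uses hypothetical model and column names. With `contract.enforced: true`, dbt fails the build if the model's output columns or types deviate from what is declared.

```yaml
# Hypothetical models/staging/_stg_listings.yml
models:
  - name: stg_listings
    config:
      contract:
        enforced: true
    columns:
      - name: listing_id
        data_type: integer
        constraints:
          - type: not_null
      - name: status
        data_type: text
```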

Data quality

  • dbt tests: uniqueness, not-null, accepted values, referential integrity and expression-based tests run as part of every dbt build.
  • Source freshness: a scheduled job verifies raw tables haven't gone stale.
  • Elementary: collects test results and generates an HTML observability report served via nginx.

Machine learning

An ELO rating system lets you rank listings via pairwise comparisons. An ML pipeline then learns to predict ELO scores for unseen listings. The ranking UI and model logic originate from house-elo-ranking.
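A single pairwise comparison moves both ratings by the standard Elo update. The sketch below uses the usual formula with a hypothetical K-factor; it is an illustration of the technique, not the repository's exact implementation.

```python
def elo_update(r_a, r_b, a_won, k=32):
    """Standard Elo update for one pairwise comparison.

    r_a, r_b : current ratings of listings A and B.
    a_won    : True if listing A was preferred.
    k        : (hypothetical) step size; larger k moves ratings faster.
    """
    # Expected score of A given the rating gap.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = r_a + k * (score_a - expected_a)
    # B's update mirrors A's, so total rating is conserved.
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated listings: the preferred one gains exactly k/2 points.
a, b = elo_update(1500, 1500, a_won=True)
```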

| Asset | Description |
| --- | --- |
| elo_prediction_model | Trains a LightGBM regressor on listing features → ELO rating. Logs to MLflow. |
| elo_inference | Loads the best model from MLflow, scores all unscored listings, writes to Postgres. |
| listing_alert | Sends a Discord notification for listings with a predicted ELO above a threshold. |

All three are tagged "manual" — they run only when triggered explicitly.

Notifications

The listing_alert asset posts rich embeds to a Discord channel via webhook when newly scored listings exceed a configurable ELO threshold. Notifications are deduplicated using the elo.notified table.
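The dedup pattern can be sketched like this. The table name and schema are hypothetical stand-ins for elo.notified, and SQLite replaces Postgres so the example is self-contained; the idea is that inserting the listing id first, and only alerting when the insert actually lands, makes repeated runs idempotent.

```python
import sqlite3

def notify_new(conn, listing_ids, send):
    """Send an alert once per listing, recording each id in `notified`.

    `send` is the delivery callback (e.g. a Discord webhook post).
    """
    sent = []
    with conn:
        for lid in listing_ids:
            cur = conn.execute(
                "INSERT OR IGNORE INTO notified (listing_id) VALUES (?)", (lid,)
            )
            if cur.rowcount:  # rowcount is 0 when the id was already recorded
                send(lid)
                sent.append(lid)
    return sent

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notified (listing_id INTEGER PRIMARY KEY)")
first = notify_new(conn, [1, 2], send=lambda lid: None)      # both alerted
second = notify_new(conn, [1, 2, 3], send=lambda lid: None)  # only 3 alerts
```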

Scheduling & automation

Ingestion assets run on cron schedules managed by the Dagster daemon. Downstream dbt models use eager auto-materialisation: whenever an upstream raw asset completes, Dagster automatically triggers the dbt build for all dependent staging, intermediate and mart models.

Assets tagged "manual" are excluded from auto-materialisation and only run when triggered explicitly. A guard condition (~any_deps_in_progress) prevents duplicate runs while upstream assets are still materialising.

Getting started

```shell
# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and enter the project
cd ~/git/data-platform

# 3. Create your credentials file
cp .env.example .env        # edit .env with your passwords

# 4. Install dependencies into a local venv
uv sync

# 5. Generate the dbt manifest (needed before first run)
uv run dbt deps  --project-dir dbt --profiles-dir dbt
uv run dbt parse --project-dir dbt --profiles-dir dbt

# 6. Start all services
docker compose up -d --build

# 7. Open the Dagster UI
#    http://localhost:3000
```

Local development

```shell
uv sync
source .venv/bin/activate

# Run the Dagster UI locally
DAGSTER_HOME=$PWD/dagster_home dagster dev

# Useful Make targets
make validate          # Check Dagster definitions load
make lint              # Ruff + SQLFluff + Prettier
make lint-fix          # Auto-fix all linters
make test              # pytest
make reload-code       # Rebuild + restart user-code container
```

Services

| Service | URL |
| --- | --- |
| Dagster UI | http://localhost:3000 |
| MLflow UI | http://localhost:5000 |
| pgAdmin | http://localhost:5050 |
| Elementary report | http://localhost:8080 |