# data-platform
A personal data platform for ingesting, transforming and analysing external data sources. Built with Dagster, dbt and PostgreSQL, managed with uv and deployed via Docker Compose.
## Stack
| Layer | Tool |
|---|---|
| Orchestration | Dagster (webserver + daemon) |
| Transformation | dbt-core + dbt-postgres |
| ML | LightGBM, MLflow, scikit-learn |
| Storage | PostgreSQL 16 |
| Notifications | Discord webhooks |
| Observability | Elementary (report served via nginx) |
| CI | GitHub Actions (Ruff, SQLFluff, Prettier, pytest) |
| Package / venv | uv |
| Infrastructure | Docker Compose |
## Data pipeline

### Ingestion
Python-based Dagster assets fetch data from external APIs and load it into a raw PostgreSQL schema
using upsert semantics. Assets are prefixed with raw_ to clearly separate unmodelled data from
transformed outputs. They are chained via explicit dependencies and run on a recurring cron
schedule.
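The upsert step can be sketched in plain Python as SQL rendering (table, column and schema names here are hypothetical examples, not the project's actual assets, which live under `data_platform/assets/ingestion/`):

```python
def build_upsert_sql(table: str, columns: list[str], key: str) -> str:
    """Render a Postgres INSERT ... ON CONFLICT upsert for one raw table,
    using psycopg-style named placeholders."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    # on conflict, overwrite every non-key column with the incoming value
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO raw.{table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )

# hypothetical raw table with a natural key
sql = build_upsert_sql("listings", ["listing_id", "price", "ingested_at"], "listing_id")
```

Re-running ingestion is then idempotent: existing rows are updated in place rather than duplicated.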
### Transformation
dbt models follow the staging → intermediate → marts layering pattern:
| Layer | Materialisation | Purpose |
|---|---|---|
| Staging | view | 1:1 cleaning of each raw table with enforced contracts |
| Intermediate | view | Business-logic joins and enrichment across staging models |
| Marts | incremental | Analysis-ready tables with derived metrics, loaded incrementally |
Mart models use dbt's incremental materialisation — on each run only rows with a newer ingestion timestamp are merged into the target table. On a full-refresh the entire table is rebuilt.
All staging and intermediate models have dbt contracts enforced, locking column names and data types to prevent silent schema drift.
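A mart model following this pattern might look like the sketch below (model and column names are illustrative, not the project's actual models):

```sql
{{ config(materialized="incremental", unique_key="listing_id") }}

select
    listing_id,
    price,
    price / nullif(area_m2, 0) as price_per_m2,
    ingested_at
from {{ ref("int_listings_enriched") }}
{% if is_incremental() %}
  -- only merge rows newer than what the target table already holds
  where ingested_at > (select max(ingested_at) from {{ this }})
{% endif %}
```

The `is_incremental()` branch is skipped on a full-refresh, so the same model covers both the merge and the rebuild path.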
## Data quality

- **dbt tests**: uniqueness, not-null, accepted-values, referential-integrity and expression-based tests run as part of every `dbt build`.
- **Source freshness**: a scheduled job verifies raw tables haven't gone stale.
- **Elementary**: collects test results and generates an HTML observability report served via nginx.
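As an illustration, a staging model's contract and tests might be declared like this (model and column names are hypothetical; the exact YAML keys vary slightly between dbt versions):

```yaml
models:
  - name: stg_listings
    config:
      contract:
        enforced: true
    columns:
      - name: listing_id
        data_type: bigint
        tests:
          - unique
          - not_null
      - name: status
        data_type: text
        tests:
          - accepted_values:
              values: ["active", "sold", "withdrawn"]
```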
## Machine learning
An ELO rating system lets you rank listings via pairwise comparisons. An ML pipeline then learns to predict ELO scores for unseen listings:
| Asset | Description |
|---|---|
| `elo_prediction_model` | Trains a LightGBM regressor on listing features → ELO rating. Logs to MLflow. |
| `elo_inference` | Loads the best model from MLflow, scores all unscored listings, writes to Postgres. |
| `listing_alert` | Sends a Discord notification for listings with a predicted ELO above a threshold. |
All three are tagged "manual" — they run only when triggered explicitly.
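The pairwise update behind the ratings can be sketched as follows (the K-factor and starting rating are generic ELO defaults, not values taken from this project):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic ELO model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """Return both listings' new ratings after one pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return (
        r_a + k * (s_a - e_a),
        r_b + k * ((1.0 - s_a) - (1.0 - e_a)),
    )

# equal ratings: the winner gains exactly K/2
new_a, new_b = update_elo(1500.0, 1500.0, a_wins=True)
```

Each comparison in the UI feeds one such update; the resulting ratings become the regression target for `elo_prediction_model`.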
## Notifications
The `listing_alert` asset posts rich embeds to a Discord channel via webhook when newly scored
listings exceed a configurable ELO threshold. Notifications are deduplicated using the
`elo.notified` table.
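The dedup-and-alert logic can be sketched roughly like this (the function names and payload shape are illustrative; the real asset reads the already-notified set from `elo.notified` in Postgres):

```python
def select_alerts(scored: list[dict], notified_ids: set[int], threshold: float) -> list[dict]:
    """Keep listings above the ELO threshold that haven't been notified yet."""
    return [l for l in scored if l["elo"] >= threshold and l["id"] not in notified_ids]

def build_embed(listing: dict) -> dict:
    """Shape a minimal Discord webhook payload with one embed."""
    return {
        "embeds": [
            {
                "title": f"Listing {listing['id']}",
                "description": f"Predicted ELO: {listing['elo']:.0f}",
            }
        ]
    }

scored = [{"id": 1, "elo": 1720.0}, {"id": 2, "elo": 1400.0}, {"id": 3, "elo": 1800.0}]
# id 2 is below threshold, id 3 was already notified, so only id 1 qualifies
alerts = select_alerts(scored, notified_ids={3}, threshold=1600.0)
```

After a successful POST to the webhook, the listing's id would be inserted into `elo.notified` so subsequent runs skip it.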
## Scheduling & automation
Ingestion assets run on cron schedules managed by the Dagster daemon. Downstream dbt models use eager auto-materialisation: whenever an upstream raw asset completes, Dagster automatically triggers the dbt build for all dependent staging, intermediate and mart models.
Assets tagged `manual` are excluded from auto-materialisation and only run when triggered
explicitly. A guard condition (`~any_deps_in_progress`) prevents duplicate runs while upstream
assets are still materialising.
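The combined effect of the policy can be modelled with a toy decision function (a simplified simulation of the behaviour described above, not Dagster's actual API):

```python
def should_trigger(upstream_updated: bool, deps_in_progress: bool, is_manual: bool) -> bool:
    """Eager auto-materialisation, minus manual assets and in-flight upstreams:
    trigger only when an upstream has newly completed, no dependency is still
    materialising, and the asset is not tagged manual."""
    return upstream_updated and not deps_in_progress and not is_manual
```

So a dbt asset fires right after its raw upstream lands, but a second raw run starting mid-build suppresses a duplicate trigger, and `manual` ML assets never fire on their own.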
## Project layout

```
data_platform/            # Dagster Python package
  assets/
    dbt.py                # @dbt_assets definition
    elo/                  # ELO schema/table management assets
    ingestion/            # Raw ingestion assets + SQL templates
    ml/                   # ML assets (training, inference, alerts)
  helpers/                # Shared utilities (SQL rendering, formatting, automation)
  jobs/                   # Job definitions
  schedules/              # Schedule definitions
  resources/              # Dagster resources (Postgres, MLflow, Discord, Funda)
  definitions.py          # Main Definitions entry point
dbt/                      # dbt project
  models/
    staging/              # 1:1 views on raw tables + source definitions
    intermediate/         # Enrichment joins
    marts/                # Incremental analysis-ready tables
  macros/                 # Custom schema generation, Elementary compat
  profiles.yml            # Reads credentials from env vars
dagster_home/             # dagster.yaml + workspace.yaml
tests/                    # pytest test suite
nginx/                    # Elementary report nginx config
docker-compose.yaml       # All services
Dockerfile                # Multi-stage: usercode + dagster-infra
Makefile                  # Developer shortcuts
```
## Getting started

```bash
# 1. Install uv (if not already)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and enter the project
cd ~/git/data-platform

# 3. Create your credentials file
cp .env.example .env   # edit .env with your passwords

# 4. Install dependencies into a local venv
uv sync

# 5. Generate the dbt manifest (needed before first run)
uv run dbt deps --project-dir dbt --profiles-dir dbt
uv run dbt parse --project-dir dbt --profiles-dir dbt

# 6. Start all services
docker compose up -d --build

# 7. Open the Dagster UI
# http://localhost:3000
```
## Local development

```bash
uv sync
source .venv/bin/activate

# Run the Dagster UI locally
DAGSTER_HOME=$PWD/dagster_home dagster dev

# Useful Make targets
make validate     # Check Dagster definitions load
make lint         # Ruff + SQLFluff + Prettier
make lint-fix     # Auto-fix all linters
make test         # pytest
make reload-code  # Rebuild + restart user-code container
```
## Services
| Service | URL |
|---|---|
| Dagster UI | http://localhost:3000 |
| MLflow UI | http://localhost:5000 |
| pgAdmin | http://localhost:5050 |
| Elementary report | http://localhost:8080 |