# data-platform
[![CI](https://github.com/Stijnvandenbroek/data-platform/actions/workflows/ci.yml/badge.svg)](https://github.com/Stijnvandenbroek/data-platform/actions/workflows/ci.yml)
![Coverage](https://raw.githubusercontent.com/Stijnvandenbroek/data-platform/coverage-badge/coverage-badge.svg)
![Python 3.12](https://img.shields.io/badge/python-3.12-blue)
![dbt](https://img.shields.io/badge/dbt-core-orange)
![Dagster](https://img.shields.io/badge/dagster-orchestration-4F43DD)
![PostgreSQL 16](https://img.shields.io/badge/postgres-16-336791)
![MLflow](https://img.shields.io/badge/mlflow-tracking-0194E2)
![License: MIT](https://img.shields.io/badge/license-MIT-green)
A personal data platform for ingesting, transforming and analysing external data sources. Built with
[Dagster](https://dagster.io/), [dbt](https://www.getdbt.com/) and
[PostgreSQL](https://www.postgresql.org/), managed with [uv](https://github.com/astral-sh/uv) and
deployed via Docker Compose.
## Stack
| Layer | Tool |
| -------------- | ------------------------------------------------- |
| Orchestration | Dagster (webserver + daemon) |
| Transformation | dbt-core + dbt-postgres |
| ML | LightGBM, MLflow, scikit-learn |
| Storage | PostgreSQL 16 |
| Notifications | Discord webhooks |
| Observability | Elementary (report served via nginx) |
| CI | GitHub Actions (Ruff, SQLFluff, Prettier, pytest) |
| Package / venv | uv |
| Infrastructure | Docker Compose |
## Data pipeline
### Ingestion
Python-based Dagster assets fetch data from external APIs and load it into a raw PostgreSQL schema
using upsert semantics. Assets are prefixed with `raw_` to clearly separate unmodelled data from
transformed outputs. They are chained via explicit dependencies and run on a recurring cron
schedule.
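The upsert pattern the raw loaders rely on can be sketched with SQLite's `INSERT ... ON CONFLICT` clause, standing in for Postgres (which supports the same syntax); the table and column names here are illustrative, not taken from the project:

```python
import sqlite3

# Illustrative raw table: listing_id is the conflict key, ingested_at tracks freshness.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_listings (
        listing_id  INTEGER PRIMARY KEY,
        price       INTEGER,
        ingested_at TEXT
    )
""")

def upsert(rows):
    # Insert new rows; on a key collision, overwrite with the fresher payload.
    conn.executemany(
        """
        INSERT INTO raw_listings (listing_id, price, ingested_at)
        VALUES (?, ?, ?)
        ON CONFLICT (listing_id) DO UPDATE SET
            price = excluded.price,
            ingested_at = excluded.ingested_at
        """,
        rows,
    )

upsert([(1, 250_000, "2026-03-10"), (2, 300_000, "2026-03-10")])
upsert([(1, 240_000, "2026-03-11")])  # re-ingest listing 1 with a new price

rows = conn.execute(
    "SELECT listing_id, price FROM raw_listings ORDER BY listing_id"
).fetchall()
```

Re-running the same ingestion is safe: existing keys are updated in place rather than duplicated.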
### Transformation
dbt models follow the **staging → intermediate → marts** layering pattern:

| Layer | Materialisation | Purpose |
| ---------------- | --------------- | ---------------------------------------------------------------- |
| **Staging** | view | 1:1 cleaning of each raw table with enforced contracts |
| **Intermediate** | view | Business-logic joins and enrichment across staging models |
| **Marts** | incremental | Analysis-ready tables with derived metrics, loaded incrementally |
Mart models use dbt's **incremental materialisation** — on each run only rows with a newer ingestion
timestamp are merged into the target table. On a full-refresh the entire table is rebuilt.
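The incremental merge reduces to a high-water-mark filter, sketched here in plain Python (a simplified stand-in for dbt's `is_incremental()` predicate; the field names are illustrative):

```python
def incremental_merge(target: dict, source_rows: list[dict]) -> dict:
    """Merge only source rows newer than the target's max ingested_at."""
    # High-water mark: the newest ingestion timestamp already in the target table.
    high_water = max((r["ingested_at"] for r in target.values()), default="")
    for row in source_rows:
        if row["ingested_at"] > high_water:
            target[row["id"]] = row  # merge on the unique key
    return target

target = {1: {"id": 1, "price": 250, "ingested_at": "2026-03-10"}}
source = [
    {"id": 1, "price": 240, "ingested_at": "2026-03-11"},  # newer: merged
    {"id": 2, "price": 300, "ingested_at": "2026-03-09"},  # older: skipped this run
]
merged = incremental_merge(target, source)
```

A full-refresh corresponds to dropping the high-water mark and rebuilding `target` from scratch.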
All staging and intermediate models have **dbt contracts enforced**, locking column names and data
types to prevent silent schema drift.
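The contract itself is declared in each model's YAML, but conceptually it amounts to the following check, sketched with an illustrative contract dict rather than the project's real schema:

```python
def enforce_contract(contract: dict[str, str], actual: dict[str, str]) -> None:
    """Fail loudly if column names or data types drift from the declared contract."""
    missing = contract.keys() - actual.keys()
    extra = actual.keys() - contract.keys()
    wrong = {c: (contract[c], actual[c])
             for c in contract.keys() & actual.keys()
             if contract[c] != actual[c]}
    if missing or extra or wrong:
        raise ValueError(
            f"schema drift: missing={missing} extra={extra} type_mismatch={wrong}"
        )

contract = {"listing_id": "integer", "price": "integer", "ingested_at": "timestamp"}
enforce_contract(contract, dict(contract))  # identical schema passes silently

try:
    # A column silently changing type is exactly the drift contracts catch.
    enforce_contract(contract, {"listing_id": "integer", "price": "text",
                                "ingested_at": "timestamp"})
    drift_caught = False
except ValueError:
    drift_caught = True
```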
### Data quality
- **dbt tests**: uniqueness, not-null, accepted values, referential integrity and expression-based
tests run as part of every `dbt build`.
- **Source freshness**: a scheduled job verifies raw tables haven't gone stale.
- **Elementary**: collects test results and generates an HTML observability report served via nginx.
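The freshness check boils down to comparing the newest ingestion timestamp against a staleness threshold; a sketch with an assumed 24-hour limit (the project's actual thresholds live in dbt's `sources` YAML and are not shown here):

```python
from datetime import datetime, timedelta

def is_stale(max_ingested_at: datetime, now: datetime,
             threshold: timedelta = timedelta(hours=24)) -> bool:
    """True when the newest raw row is older than the allowed staleness window."""
    return now - max_ingested_at > threshold

now = datetime(2026, 3, 11, 12, 0)
fresh = is_stale(datetime(2026, 3, 11, 6, 0), now)  # 6 hours old
stale = is_stale(datetime(2026, 3, 9, 12, 0), now)  # 48 hours old
```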
### Machine learning
An **ELO rating system** lets you rank listings via pairwise comparisons. An ML pipeline then learns
to predict ELO scores for unseen listings. The ranking UI and model logic originate from
[house-elo-ranking](https://github.com/Stijnvandenbroek/house-elo-ranking).

| Asset | Description |
| ---------------------- | ----------------------------------------------------------------------------------- |
| `elo_prediction_model` | Trains a LightGBM regressor on listing features → ELO rating. Logs to MLflow. |
| `elo_inference` | Loads the best model from MLflow, scores all unscored listings, writes to Postgres. |
| `listing_alert` | Sends a Discord notification for listings with a predicted ELO above a threshold. |
All three are tagged `"manual"` — they run only when triggered explicitly.
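Each pairwise comparison feeds a standard Elo update; the K-factor and 400-point scale below are the conventional defaults, not necessarily what the project uses:

```python
def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo: the more surprising the result, the more points change hands."""
    expected_win = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return winner + delta, loser - delta

# An upset (the lower-rated listing wins) moves more points than an expected result.
a, b = elo_update(1400.0, 1600.0)
```

Because a single `delta` is added to one side and subtracted from the other, total rating points are conserved across comparisons.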
### Notifications
The `listing_alert` asset posts rich embeds to a Discord channel via webhook when newly scored
listings exceed a configurable ELO threshold. Notifications are deduplicated using the
`elo.notified` table.
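The deduplication logic reduces to: alert only listings above the threshold that have not been recorded yet, then record them. A sketch with an in-memory set standing in for the `elo.notified` table; the threshold value is illustrative:

```python
def alert_new_listings(scores: dict[int, float], notified: set[int],
                       threshold: float = 1500.0) -> list[int]:
    """Return listing ids to alert on, recording them so repeat runs stay silent."""
    to_alert = [lid for lid, elo in scores.items()
                if elo >= threshold and lid not in notified]
    notified.update(to_alert)  # the real pipeline inserts into elo.notified
    return to_alert

notified: set[int] = set()
scores = {1: 1620.0, 2: 1480.0, 3: 1555.0}
first = alert_new_listings(scores, notified)   # listings 1 and 3 cross the threshold
second = alert_new_listings(scores, notified)  # re-run: already notified, no repeats
```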
## Scheduling & automation
Ingestion assets run on cron schedules managed by the Dagster daemon. Downstream dbt models use
**eager auto-materialisation**: whenever an upstream raw asset completes, Dagster automatically
triggers the dbt build for all dependent staging, intermediate and mart models.
Assets tagged `"manual"` are excluded from auto-materialisation and only run when triggered
explicitly. A guard condition (`~any_deps_in_progress`) prevents duplicate runs while upstream
assets are still materialising.
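The eager policy plus guard amounts to a simple predicate, sketched in plain Python (Dagster expresses this declaratively as an automation condition; the asset names here are illustrative):

```python
def should_materialize(deps_updated: set[str], deps_in_progress: set[str]) -> bool:
    """Fire when something upstream has updated and nothing upstream is still running."""
    return bool(deps_updated) and not deps_in_progress

fires = should_materialize({"raw_listings"}, set())            # upstream done: build
blocked = should_materialize({"raw_listings"}, {"raw_rates"})  # upstream running: wait
```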
## Getting started
```bash
# 1. Install uv (if not already)
curl -Lsf https://astral.sh/uv/install.sh | sh
# 2. Clone and enter the project
git clone https://github.com/Stijnvandenbroek/data-platform.git ~/git/data-platform
cd ~/git/data-platform
# 3. Create your credentials file
cp .env.example .env # edit .env with your passwords
# 4. Install dependencies into a local venv
uv sync
# 5. Generate the dbt manifest (needed before first run)
uv run dbt deps --project-dir dbt --profiles-dir dbt
uv run dbt parse --project-dir dbt --profiles-dir dbt
# 6. Start all services
docker compose up -d --build
# 7. Open the Dagster UI
# http://localhost:3000
```
## Local development
```bash
uv sync
source .venv/bin/activate
# Run the Dagster UI locally
DAGSTER_HOME=$PWD/dagster_home dagster dev
# Useful Make targets
make validate # Check Dagster definitions load
make lint # Ruff + SQLFluff + Prettier
make lint-fix # Auto-fix all linters
make test # pytest
make reload-code # Rebuild + restart user-code container
```
## Services
| Service | URL |
| ----------------- | --------------------- |
| Dagster UI | http://localhost:3000 |
| MLflow UI | http://localhost:5000 |
| pgAdmin | http://localhost:5050 |
| Elementary report | http://localhost:8080 |