data-platform

CI Coverage Python 3.12 dbt Dagster PostgreSQL 16 MLflow License: MIT

A personal data platform for ingesting, transforming and analysing external data sources. Built with Dagster, dbt and PostgreSQL, managed with uv and deployed via Docker Compose.

Stack

| Layer | Tool |
| --- | --- |
| Orchestration | Dagster (webserver + daemon) |
| Transformation | dbt-core + dbt-postgres |
| ML | LightGBM, MLflow, scikit-learn |
| Storage | PostgreSQL 16 |
| Notifications | Discord webhooks |
| Observability | Elementary (report served via nginx) |
| CI | GitHub Actions (Ruff, SQLFluff, Prettier, pytest) |
| Package / venv | uv |
| Infrastructure | Docker Compose |

Data pipeline

Ingestion

Python-based Dagster assets fetch data from external APIs and load it into a raw PostgreSQL schema using upsert semantics. Assets are prefixed with raw_ to clearly separate unmodelled data from transformed outputs. They are chained via explicit dependencies and run on a recurring cron schedule.
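The upsert pattern can be sketched as follows. This is an illustration only: the table and column names are hypothetical, and SQLite is used so the example is self-contained, since PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form.

```python
import sqlite3

# Hypothetical raw-layer upsert: one row per listing id, the newest fetch wins.
UPSERT = """
INSERT INTO raw_listings (listing_id, price, fetched_at)
VALUES (?, ?, ?)
ON CONFLICT (listing_id) DO UPDATE SET
    price = excluded.price,
    fetched_at = excluded.fetched_at
"""

def load_raw_listings(conn, rows):
    """Idempotently load API rows into the raw schema."""
    with conn:
        conn.executemany(UPSERT, rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_listings ("
    "listing_id INTEGER PRIMARY KEY, price REAL, fetched_at TEXT)"
)
load_raw_listings(conn, [(1, 100.0, "2026-03-01"), (2, 250.0, "2026-03-01")])
load_raw_listings(conn, [(1, 120.0, "2026-03-02")])  # re-fetch updates row 1
```

Because the load is an upsert keyed on the primary key, re-running an ingestion asset over overlapping data is safe: existing rows are updated in place rather than duplicated.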

Transformation

dbt models follow the staging → intermediate → marts layering pattern:

| Layer | Materialisation | Purpose |
| --- | --- | --- |
| Staging | view | 1:1 cleaning of each raw table with enforced contracts |
| Intermediate | view | Business-logic joins and enrichment across staging models |
| Marts | incremental | Analysis-ready tables with derived metrics, loaded incrementally |

Mart models use dbt's incremental materialisation — on each run only rows with a newer ingestion timestamp are merged into the target table. On a full-refresh the entire table is rebuilt.
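A mart's incremental logic might look like the following sketch. The model and column names here are hypothetical; the `is_incremental()` filter and merge strategy are standard dbt.

```sql
-- Hypothetical mart model: models/marts/fct_listings.sql
{{ config(
    materialized='incremental',
    unique_key='listing_id',
    incremental_strategy='merge'
) }}

select listing_id, price, ingested_at
from {{ ref('int_listings_enriched') }}

{% if is_incremental() %}
  -- only rows newer than what is already in the target table
  where ingested_at > (select max(ingested_at) from {{ this }})
{% endif %}
```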

All staging and intermediate models have dbt contracts enforced, locking column names and data types to prevent silent schema drift.
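An enforced contract is declared in the model's YAML; the sketch below uses hypothetical model and column names. With `contract.enforced: true`, dbt fails the build if the model's output columns or types deviate from what is declared.

```yaml
# Hypothetical models/staging/_stg_listings.yml
models:
  - name: stg_listings
    config:
      contract:
        enforced: true
    columns:
      - name: listing_id
        data_type: integer
        constraints:
          - type: not_null
      - name: status
        data_type: text
```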

Data quality

  • dbt tests: uniqueness, not-null, accepted values, referential integrity and expression-based tests run as part of every dbt build.
  • Source freshness: a scheduled job verifies raw tables haven't gone stale.
  • Elementary: collects test results and generates an HTML observability report served via nginx.

Machine learning

An ELO rating system lets you rank listings via pairwise comparisons. An ML pipeline then learns to predict ELO scores for unseen listings. The ranking UI and model logic originate from house-elo-ranking.
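A single pairwise comparison moves both ratings by the standard Elo update. The sketch below uses the usual formula with a hypothetical K-factor; it is an illustration of the technique, not the repository's exact implementation.

```python
def elo_update(r_a, r_b, a_won, k=32):
    """Standard Elo update for one pairwise comparison.

    r_a, r_b : current ratings of listings A and B.
    a_won    : True if listing A was preferred.
    k        : (hypothetical) step size; larger k moves ratings faster.
    """
    # Expected score of A given the rating gap.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    new_a = r_a + k * (score_a - expected_a)
    # B's update mirrors A's, so total rating is conserved.
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated listings: the preferred one gains exactly k/2 points.
a, b = elo_update(1500, 1500, a_won=True)
```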

| Asset | Description |
| --- | --- |
| elo_prediction_model | Trains a LightGBM regressor on listing features → ELO rating. Logs to MLflow. |
| elo_inference | Loads the best model from MLflow, scores all unscored listings, writes to Postgres. |
| listing_alert | Sends a Discord notification for listings with a predicted ELO above a threshold. |

All three are tagged "manual" — they run only when triggered explicitly.

Notifications

The listing_alert asset posts rich embeds to a Discord channel via webhook when newly scored listings exceed a configurable ELO threshold. Notifications are deduplicated using the elo.notified table.
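The dedup pattern can be sketched like this. The table name and schema are hypothetical stand-ins for elo.notified, and SQLite replaces Postgres so the example is self-contained; the idea is that inserting the listing id first, and only alerting when the insert actually lands, makes repeated runs idempotent.

```python
import sqlite3

def notify_new(conn, listing_ids, send):
    """Send an alert once per listing, recording each id in `notified`.

    `send` is the delivery callback (e.g. a Discord webhook post).
    """
    sent = []
    with conn:
        for lid in listing_ids:
            cur = conn.execute(
                "INSERT OR IGNORE INTO notified (listing_id) VALUES (?)", (lid,)
            )
            if cur.rowcount:  # rowcount is 0 when the id was already recorded
                send(lid)
                sent.append(lid)
    return sent

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notified (listing_id INTEGER PRIMARY KEY)")
first = notify_new(conn, [1, 2], send=lambda lid: None)      # both alerted
second = notify_new(conn, [1, 2, 3], send=lambda lid: None)  # only 3 alerts
```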

Scheduling & automation

Ingestion assets run on cron schedules managed by the Dagster daemon. Downstream dbt models use eager auto-materialisation: whenever an upstream raw asset completes, Dagster automatically triggers the dbt build for all dependent staging, intermediate and mart models.

Assets tagged "manual" are excluded from auto-materialisation and only run when triggered explicitly. A guard condition (~any_deps_in_progress) prevents duplicate runs while upstream assets are still materialising.

Getting started

```shell
# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone and enter the project
cd ~/git/data-platform

# 3. Create your credentials file
cp .env.example .env        # edit .env with your passwords

# 4. Install dependencies into a local venv
uv sync

# 5. Generate the dbt manifest (needed before first run)
uv run dbt deps  --project-dir dbt --profiles-dir dbt
uv run dbt parse --project-dir dbt --profiles-dir dbt

# 6. Start all services
docker compose up -d --build

# 7. Open the Dagster UI
#    http://localhost:3000
```

Local development

```shell
uv sync
source .venv/bin/activate

# Run the Dagster UI locally
DAGSTER_HOME=$PWD/dagster_home dagster dev

# Useful Make targets
make validate          # Check Dagster definitions load
make lint              # Ruff + SQLFluff + Prettier
make lint-fix          # Auto-fix all linters
make test              # pytest
make reload-code       # Rebuild + restart user-code container
```

Services

| Service | URL |
| --- | --- |
| Dagster UI | http://localhost:3000 |
| MLflow UI | http://localhost:5000 |
| pgAdmin | http://localhost:5050 |
| Elementary report | http://localhost:8080 |