Local CAVE Stack — Setup & Demo
A self-hosted, single-user, $0 deployment of the real CAVEconnectome backend running on one Windows box — queryable by the official caveclient, indistinguishable from production. This page is the guide to standing it up and walking a team through it.
1. What this is
CAVE — the Connectome Annotation Versioning Engine — is the backend that large EM
connectomics projects (FlyWire, MICrONS) use to store annotations tied to a segmentation, resolve
supervoxels to root IDs through a dynamic proofreading graph, and take immutable, versioned snapshots
you can time-travel query. In production it is a fleet of microservices behind
global.daf-apis.com.
This project runs that entire fleet locally, for $0, with no cloud account, on a single
Windows machine. These are the genuine production container images — InfoService, SchemaService,
AnnotationEngine, PyChunkedGraph, and MaterializationEngine (plus its Celery worker) — behind an
nginx gateway, backed by Postgres/PostGIS, Redis, and an in-memory BigTable emulator. The official Python
client connects to http://127.0.0.1:8080 and cannot tell the difference from real CAVE.
The dummy fixtures it serves:
| Fixture | Value |
|---|---|
| Datastack | fake_datastack → aligned volume fake_volume → PCG table test |
| Annotation table | my_cave_table, schema bound_tag (a point + a string tag), voxel resolution [8, 8, 40] |
| Dummy graph | two supervoxels 72127962782105600 / ...601, both resolving to root 288230376151711745 |
| Demo annotations | points [600,100,10] and [900,100,10], landing on the two dummy supervoxels |
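As a worked example of what the voxel resolution means for the two demo points, the sketch below scales voxel coordinates into physical units. It assumes the resolution [8, 8, 40] is nanometers per voxel along x, y, z, which is the usual CAVE convention but is not stated on this page; the function name is made up for illustration.

```python
# Assumption: resolution values are nanometers per voxel along x, y, z.
VOXEL_RESOLUTION = (8, 8, 40)


def voxel_to_nm(point, resolution=VOXEL_RESOLUTION):
    """Scale a voxel-space point by the per-axis resolution."""
    return [p * r for p, r in zip(point, resolution)]


# The two demo annotation points from the fixture table.
for pt in ([600, 100, 10], [900, 100, 10]):
    print(pt, "->", voxel_to_nm(pt))
```

Under that assumption the two points sit at [4800, 800, 400] and [7200, 800, 400].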
Why it matters
It lets us demo and train against a fully faithful CAVE — discovery, annotation, live supervoxel→root resolution, materialization, and time-travel queries — with no cloud costs and no credentials. Anyone learning the client, or testing tooling, gets the real API surface on a laptop. It also proves out the hardest-to-host piece, MaterializationEngine, entirely offline.
Where it lives. The working stack is on this box at Desktop\_scratch\cave-local\realcave. Everything below is run from that realcave directory.
2. Windows setup, step by step
Prerequisites: Docker Desktop with the WSL2 backend enabled. Confirm
docker version and docker compose version both work.
Run from Git Bash driving docker.exe — not from inside WSL.
Docker Desktop's WSL integration registers Ubuntu but never injects /usr/bin/docker or the socket, and Windows-path bind mounts do not resolve over a raw TCP DOCKER_HOST from WSL. Native Windows-path bind mounts do work through docker.exe. Git Bash + docker.exe is the combination that works.
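The prerequisite check can also be scripted. The following is a minimal sketch, not part of the stack: it runs the two confirmation commands named above and reports which ones succeed. The function name docker_preflight is hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import subprocess

# The two confirmation commands from the prerequisites.
DEFAULT_CHECKS = (("docker", "version"), ("docker", "compose", "version"))


def docker_preflight(commands=DEFAULT_CHECKS):
    """Return {command string: True/False}; a missing binary counts as False."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            # Executable is not on PATH at all.
            results[key] = False
            continue
        try:
            proc = subprocess.run(cmd, capture_output=True, timeout=30)
            results[key] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            results[key] = False
    return results
```

Calling docker_preflight() from the same shell that will run bootstrap.sh shows whether that shell can reach docker.exe; any False entry means the stack will not come up from there.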
1. Open Git Bash and go to the stack directory:

       cd realcave

2. Bring the whole stack up. MSYS_NO_PATHCONV=1 stops Git Bash from mangling in-container path arguments like /cfg/...:

       MSYS_NO_PATHCONV=1 bash ./bootstrap.sh

   bootstrap.sh runs 8 idempotent-ish steps, in order:

   1. Backing services + gateway + schema: postgres redis bigtable gateway schema.
   2. Wait for Postgres to report healthy.
   3. Create databases (datasets, annotation, materialize, + PostGIS). The per-volume DB fake_volume is auto-created later by AnnotationEngine.
   4. InfoService + DB migrations (flask db upgrade).
   5. Register the datastack: inserts the aligned_volume and datastack rows the client discovers (there is no POST API for this).
   6. AnnotationEngine + PyChunkedGraph.
   7. Build the dummy graph in the emulator: two supervoxels joined by one edge, both resolving to one root.
   8. MaterializationEngine web + worker.

   If step 3 fails once on a fresh PostGIS init race (database system is shutting down), just re-run bootstrap.sh.

3. Create a host virtualenv (Windows Python 3.11) with the official client and its deps:

       python -m venv .venv-host
       ./.venv-host/Scripts/python.exe -m pip install caveclient pandas numpy

   cloudvolume from requirements-host.txt has no Windows wheel and the demo scripts do not need it; caveclient + pandas + numpy is enough.

4. Create the annotation table and its two annotations:

       ./.venv-host/Scripts/python.exe scripts/setup_demo_data.py

5. Run the materialization workflow to create version 1 (annotations joined to root IDs in an immutable, queryable snapshot):

       curl -s -X POST "http://localhost:8080/materialize/api/v2/materialize/run/complete_workflow/datastack/fake_datastack?days_to_expire=5&merge_tables=true"

   Watch it with docker logs -f realcave-mat-worker-1 and look for supervoxel + root lookups and set_version_status ... AVAILABLE. It creates DB fake_datastack__mat1 and an AnalysisVersion row.

6. Open the live dashboard in a browser:

       http://localhost:8080/
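The materialization request above can also be sent from the host Python instead of curl. This is a minimal sketch using only the standard library; the endpoint path and query parameters are copied from the curl command, while the helper names workflow_url and trigger_materialization are hypothetical. It targets 127.0.0.1 rather than localhost, in line with the gotcha about IPv6 resolution on Windows.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

GATEWAY = "http://127.0.0.1:8080"  # IPv4 literal rather than localhost


def workflow_url(datastack, days_to_expire=5, merge_tables=True, base=GATEWAY):
    """Build the complete_workflow URL used by the setup step."""
    query = urlencode({
        "days_to_expire": days_to_expire,
        "merge_tables": str(merge_tables).lower(),  # True -> "true"
    })
    path = f"/materialize/api/v2/materialize/run/complete_workflow/datastack/{datastack}"
    return f"{base}{path}?{query}"


def trigger_materialization(datastack="fake_datastack", timeout=30):
    """POST with an empty body, like the curl call; returns the HTTP status."""
    req = Request(workflow_url(datastack), method="POST")
    with urlopen(req, timeout=timeout) as resp:
        return resp.status
```

trigger_materialization() only submits the workflow; progress still has to be followed in the worker log as described above.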
Daily use
Once it is set up, you do not repeat the whole bootstrap. Two scripts cover day-to-day demoing, both run from realcave/ in Git Bash:
Get demo-ready and run the proof with ./run_demo.sh: it ensures the stack is up, rebuilds the in-memory dummy graph (needed after any reboot), and runs the 5-part narrated proof.
Reset before a fresh demo with ./reset_demo.sh: it wipes the live table back to 2 clean annotations with empty edit history, then materializes a fresh version.
Cold boot (Docker Desktop not running): launch Docker Desktop, wait for the engine, then ./run_demo.sh. It handles the rebooted-box case where the in-memory graph was lost.
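After a cold boot it can help to wait until the gateway port accepts connections before pointing a client at it. The sketch below is one way to do that from the host; wait_for_port is a hypothetical helper, and an open port only shows that the gateway is listening, not that every service behind it is healthy.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8080, timeout=120.0, interval=2.0):
    """Poll a TCP port until it accepts a connection or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A demo script could call wait_for_port() once and abort with a clear message when it returns False, instead of failing later inside the client.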
3. Gotchas
These are the hard-won, non-obvious things — the reasons a naive docker compose up of the real images does not just work. Don't re-discover them.
Read before you demo
- Use 127.0.0.1, never localhost, in any host-side client config. On Windows, Python requests resolves localhost to IPv6 ::1 first and eats a ~21s SYN timeout per new connection before falling back to IPv4. This turned a ~3s demo into ~150s.
- Rebuild the dummy graph after any BigTable restart or reboot. The emulator is in-memory, so the graph is wiped and live queries break. run_demo.sh / reset_demo.sh do this for you; manually: docker exec realcave-pcg-1 python /cfg/create_dummy_graph.py. It must run inside the pcg container so the written graph version matches the server.
- The three MaterializationEngine compose fixes (in docker-compose.cave.yml) that got the workflow to AVAILABLE: (a) worker pool prefork --concurrency=4, not solo (solo deadlocks on the workflow's Celery chords); (b) LOCAL_SERVER_URL / GLOBAL_SERVER_URL = http://gateway as real env vars (the worker reads LOCAL_SERVER_URL from os.environ, not flask config, else defaults to a bogus host); (c) mount cfg/cave-secret.json at /root/.cloudvolume/secrets/ (CloudVolume's graphene driver demands a token file even with auth disabled).
- AVX is required. Polars (a MaterializationEngine dependency) needs AVX instructions. Fine on this Intel i7 (native AVX2); it breaks on Apple-Silicon Macs because the x86 images run under Rosetta, which has no AVX. Fallback on a no-AVX host: build the worker with the polars-lts-cpu wheel and set POLARS_SKIP_CPU_CHECK=1.
- Annotation POST returns 500 but the row still persists. On insert, AnnotationEngine fires a "supervoxel notify" to local_server, which is unreachable from inside the container, so it 500s, but the write already happened, and materialization does its own supervoxel lookup anyway. Optional clean fix: add 127.0.0.1 host.docker.internal to the hosts file and point local_server there.
- AUTH_DISABLED=true but the client still needs a token. Auth is off on info/annotation/pcg/materialize, but caveclient still insists on some token, so pass auth_token="dummy".
- The materialized version counter only resets on a full down -v. Deleting AnalysisVersion rows or dropping mat DBs does not reset it; old versions accumulate but stay valid and queryable. (This is why team_demo.py's versions == [1] assertion may report a cosmetic "CHECK FAILED" once more than one version exists; the data checks still pass.)
One more intentional-looking oddity: segmentation_source host is gateway (container-reachable over the docker network) while local_server is http://127.0.0.1:8080 (host caveclient reaches it). Different fields, split on purpose.
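The localhost gotcha is easy to guard against in host-side tooling. This is a small sketch, not something the stack ships: force_ipv4_loopback is a hypothetical helper that rewrites a localhost URL to the IPv4 literal and leaves every other host (such as gateway) untouched. It drops any user:password part of the URL, which the local stack does not use.

```python
from urllib.parse import urlsplit, urlunsplit


def force_ipv4_loopback(url):
    """Rewrite http://localhost[:port]/... to http://127.0.0.1[:port]/..."""
    parts = urlsplit(url)
    if parts.hostname != "localhost":
        return url  # container names and IP literals pass through unchanged
    netloc = "127.0.0.1"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(force_ipv4_loopback("http://localhost:8080"))
```

Passing every configured server address through such a function before constructing a client avoids the per-connection IPv6 timeout described in the first gotcha.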
4. Demo walkthrough — the caveclient command list
This mirrors the click-to-copy cheat-sheet on the dashboard. Start a Python shell on the host
(./.venv-host/Scripts/python.exe) and paste these in order. Every command is the
same caveclient API the team uses against production; only the
server_address differs.
Connect (run these first)
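A minimal connection sketch, assuming the fixture names from section 1 and the dummy token described in the gotchas. It is an illustration, not a copy of the dashboard cheat-sheet; the connect wrapper is hypothetical, and caveclient is imported inside it only so the snippet loads even where the package is missing.

```python
SERVER_ADDRESS = "http://127.0.0.1:8080"  # IPv4 literal, never localhost
DATASTACK = "fake_datastack"


def connect(server_address=SERVER_ADDRESS, datastack=DATASTACK):
    """Return a CAVEclient pointed at the local gateway."""
    from caveclient import CAVEclient  # lazy import; requires pip install caveclient

    # Auth is disabled server-side, but the client still wants some token.
    return CAVEclient(datastack, server_address=server_address, auth_token="dummy")
```

In an interactive shell, client = connect() gives the object that the later snippets in this section operate on.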
Discover what's on the server
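A sketch of the discovery calls, assuming client is a CAVEclient already connected to the local stack. The four methods are standard caveclient calls; the discover wrapper and its dictionary keys are hypothetical and exist only to group the results.

```python
def discover(client):
    """Collect what the server advertises, using standard caveclient calls."""
    return {
        # Datastacks registered in InfoService.
        "datastacks": client.info.get_datastacks(),
        # Full record for the client's own datastack.
        "datastack_info": client.info.get_datastack_info(),
        # Annotation tables known to AnnotationEngine.
        "annotation_tables": client.annotation.get_tables(),
        # Schema names served by SchemaService.
        "schemas": client.schema.get_schemas(),
    }
```

Against the fixtures in section 1, the datastack list would be expected to include fake_datastack and the table list my_cave_table.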
Read the annotation table
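A sketch of reading the live table, again assuming client is a connected CAVEclient. The annotation IDs 1 and 2 are an assumption about how the two demo annotations are numbered, and read_table is a hypothetical wrapper around the standard get_table_metadata and get_annotation calls.

```python
TABLE = "my_cave_table"


def read_table(client, annotation_ids=(1, 2), table=TABLE):
    """Fetch table metadata plus the requested annotation rows."""
    meta = client.annotation.get_table_metadata(table)
    # Assumption: the demo annotations have IDs 1 and 2.
    rows = client.annotation.get_annotation(table, list(annotation_ids))
    return meta, rows
```

The metadata should show the bound_tag schema and the voxel resolution listed in section 1.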
Materialized data — versions + time travel
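A sketch of the two query styles, assuming client is a connected CAVEclient and that version 1 exists. query_table with materialization_version reads a frozen snapshot, while live_query reconstructs state at a timestamp; the wrapper name snapshot_and_live is hypothetical.

```python
from datetime import datetime, timezone


def snapshot_and_live(client, table="my_cave_table", version=1, timestamp=None):
    """Return (available versions, frozen snapshot, state at a timestamp)."""
    versions = client.materialize.get_versions()
    # Immutable snapshot: identical result every time for a given version.
    frozen = client.materialize.query_table(table, materialization_version=version)
    # Live view: defaults to "now" in UTC when no timestamp is given.
    when = timestamp or datetime.now(timezone.utc)
    live = client.materialize.live_query(table, timestamp=when)
    return versions, frozen, live
```

Comparing the frozen and live results after an edit shows the difference between a snapshot and the current state.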
Segmentation — live supervoxel → root
Both dummy supervoxels resolve to root 288230376151711745.
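A sketch of the live lookup, assuming client is a connected CAVEclient. get_roots is the standard chunkedgraph call; resolve_roots is a hypothetical wrapper that pairs each supervoxel with its root. Only the first supervoxel ID is written out here because the second is abbreviated in the fixture table.

```python
FIRST_SUPERVOXEL = 72127962782105600  # from the fixture table


def resolve_roots(client, supervoxel_ids):
    """Map each supervoxel ID to its current root via PyChunkedGraph."""
    roots = client.chunkedgraph.get_roots(supervoxel_ids)
    # Convert to plain ints so the mapping prints cleanly.
    return {int(sv): int(root) for sv, root in zip(supervoxel_ids, roots)}
```

resolve_roots(client, [FIRST_SUPERVOXEL]) would be expected to map that supervoxel to the dummy root from section 1, provided the dummy graph has been rebuilt since the last emulator restart.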
Edit — every change is logged with full history
Note the schema-native body {"id":N,"pt":{"position":[...]},"tag":...} — not a flattened pt_position.
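A sketch of building that schema-native body and sending it as an update, assuming client is a connected CAVEclient. update_annotation is the standard caveclient call; build_update and rename are hypothetical helpers, and the annotation ID and tag values a caller passes in are illustrative.

```python
def build_update(annotation_id, position, tag):
    """Nested body for the bound_tag schema: pt.position, not pt_position."""
    return {"id": annotation_id, "pt": {"position": list(position)}, "tag": tag}


def rename(client, annotation_id, position, tag, table="my_cave_table"):
    """Send one update; the server keeps the old row and writes a new one."""
    body = build_update(annotation_id, position, tag)
    return client.annotation.update_annotation(table, [body])
```

For example, rename(client, 1, [600, 100, 10], "cell A v2") would supersede annotation 1, assuming the first demo annotation has ID 1.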
5. The versioning / edit-history story
This is the "V" in CAVE — Versioning — and the most compelling thing to show a team.
Editing an annotation never overwrites it. The edit_history_demo.py script tells the story end to end:
- CAVE logs every change. When you update an annotation, CAVE keeps the old row, stamps it deleted, marks it valid=false, and links it to its successor via superceded_id. A brand-new row holds the new value.
- An audit log of old values. Every prior value is still on disk, with the exact timestamp it changed; you can read the whole lineage of a single annotation, superseded rows and all.
- Materialize a new version. Snapshotting the current state produces a fresh immutable version N.
- Time-travel across versions. query_table("my_cave_table", materialization_version=N) returns the data exactly as it was at version N, and live_query(timestamp=...) reconstructs state as of any moment.
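The supersede-on-edit behaviour described in the first two bullets can be modelled in a few lines. This is an in-memory toy for explanation only, not CAVE's implementation or schema: the field names valid, deleted, and superceded_id follow the text above, and everything else (Row, ToyTable, the ID scheme) is made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Row:
    id: int
    tag: str
    created: datetime
    valid: bool = True
    deleted: Optional[datetime] = None
    superceded_id: Optional[int] = None


class ToyTable:
    def __init__(self):
        self.rows = []

    def insert(self, tag):
        row = Row(id=len(self.rows) + 1, tag=tag, created=datetime.now(timezone.utc))
        self.rows.append(row)
        return row

    def update(self, row_id, new_tag):
        """Never overwrite: retire the old row and link it to a new one."""
        old = self.rows[row_id - 1]
        if not old.valid:
            raise ValueError(f"row {row_id} is already superseded")
        new = self.insert(new_tag)
        old.valid = False
        old.deleted = new.created
        old.superceded_id = new.id
        return new

    def active(self):
        return [r for r in self.rows if r.valid]

    def lineage(self, row_id):
        """Follow superceded_id links from a row to its latest successor."""
        chain, row = [], self.rows[row_id - 1]
        while row is not None:
            chain.append(row)
            row = self.rows[row.superceded_id - 1] if row.superceded_id else None
        return chain
```

Inserting "cell A" and updating it once leaves two rows on record: one active row with the new tag and one superseded row that still holds the old value and points at its replacement.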
Run the whole story with edit_history_demo.py (re-runnable: each run adds one more edit, so the history visibly grows).
The demo prints the active "cell A" annotation, renames it, then shows the full audit log where every prior value is still present and timestamped, marked superseded and linked to the row that replaced it. It then materializes a new version and prints what "cell A" was at each version — the essence of a versioned annotation store.
6. The dashboard
The live dashboard (web/index.html) is served by the gateway from the ./web
bind mount at http://localhost:8080/ — no rebuild to edit it, but it needs the running
stack (it polls the services same-origin every 5 seconds). It is the visual companion to the command list, with these panels:
- Services health: live status of the gateway and each CAVE service, polled every 5s.
- Command cheat-sheet: the same click-to-copy caveclient snippets grouped in section 4 above, ready to paste into a Python shell.
- Datastack discovery: the raw datastack JSON exactly as caveclient reads it.
- Annotation table: the current (active) rows of my_cave_table from the real AnnotationEngine.
- Edit history / audit log: every row ever written, active and superseded, with a ✎ Make a live edit button that renames "cell A" straight from the browser (a PUT to the annotation API) so the audit trail grows live in front of the audience.
- Materialized versions: the immutable snapshots with their timestamps and status.
- Live query: a ▶ Run live root query button that resolves the two supervoxels to their root through the real PyChunkedGraph on demand.
- Service Swagger / API docs: links to InfoService, SchemaService, AnnotationEngine, and PyChunkedGraph's own UIs, as proof these are genuine CAVE binaries.