Automation cadence

OmicIDX uses Dagster’s declarative automation to drive the daily refresh. Most assets cascade automatically when their upstream lands. A few are deliberately gated to prevent high-frequency upstreams from triggering full-graph rebuilds.

This page documents the pattern. If you’re adding a new asset, this is the rubric for choosing its automation condition.

Most consolidation and Postgres-load assets use:

automation_condition=dg.AutomationCondition.eager()

Translation: “materialize me whenever any upstream updates.” For an asset whose upstream is daily, that means once per day. The cascade is automatic; no schedule needed.

PubMed lands hourly via a file sensor. If pubmed_parquet, pubmed_postgres, and omicidx_duckdb were eager, every new PubMed file would trigger:

  1. A full PubMed Parquet rebuild (gigabytes of data, every hour).
  2. A full PubMed Postgres A/B reload (every hour).
  3. A full DuckDB rebuild + R2 upload (every hour).

That’s wasteful and expensive. The fix is to gate those three assets to once-daily, only firing when an upstream actually has new content:

automation_condition=(
    dg.AutomationCondition.on_cron("0 3 * * *")
    & dg.AutomationCondition.any_deps_updated()
)

Translation: “fire at 3:00 UTC, but only if any upstream has been updated since the last cron tick.”
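To make the difference concrete, here is a toy model in plain Python that counts requested runs under each condition for an hourly upstream. It is not Dagster's evaluator; it only mirrors the two translations given on this page. The one-week window and the names HOURLY_UPDATES, eager_runs, and gated_runs are assumptions made for the illustration.

```python
from datetime import datetime, timedelta, timezone

START = datetime(2024, 1, 1, tzinfo=timezone.utc)
DAYS = 7
# Hypothetical upstream: one new file per hour for a week.
HOURLY_UPDATES = [START + timedelta(hours=h) for h in range(24 * DAYS)]


def eager_runs(updates):
    # Toy model of eager(): each upstream update requests one run.
    return len(updates)


def gated_runs(updates, cron_hour=3, days=DAYS):
    # Toy model of on_cron("0 3 * * *") & any_deps_updated(): at each
    # daily tick, request a run only if an update landed since the last tick.
    runs = 0
    previous_tick = None
    for day in range(days):
        tick = START + timedelta(days=day, hours=cron_hour)
        landed = [
            u for u in updates
            if u <= tick and (previous_tick is None or u > previous_tick)
        ]
        if landed:
            runs += 1
        previous_tick = tick
    return runs


print(eager_runs(HOURLY_UPDATES))       # 168
print(gated_runs(HOURLY_UPDATES))       # 7
print(gated_runs(HOURLY_UPDATES[:24]))  # 2: later ticks see nothing new
```

Under this model an expensive asset goes from 168 runs per week to 7, and ticks with no new upstream content are skipped entirely.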

| Use eager() | Use on_cron(...) & any_deps_updated() |
| --- | --- |
| Upstream is daily or slower | Upstream is hourly or faster |
| Asset is cheap to rebuild | Asset is expensive to rebuild |
| You want immediate propagation | You want a fixed daily wall-clock cadence |
| The asset has few upstreams | The asset has many upstreams that update at varying times |

Three assets currently use the cron-paced pattern. Their times are staggered so that each downstream asset runs after its upstream has settled:

| Asset | Cron | Why this slot |
| --- | --- | --- |
| pubmed_parquet | 0 3 * * * | After PubMed sensor’s overnight runs land. |
| pubmed_postgres | 0 4 * * * | One hour after pubmed_parquet. |
| omicidx_duckdb | 0 5 * * * | After the entire consolidation cascade has had time to settle. |

If you add an asset that depends on one of these, give it a slot after its upstream’s slot.
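One way to keep this rule from drifting is a small check that compares each downstream slot with its upstream's. The sketch below is plain Python and is not part of the OmicIDX codebase. The cron strings come from the table above; the dependency edges in DEPENDS_ON are assumed for illustration, and the helper names are hypothetical.

```python
# Cron slots from the table above.
CRON_SLOTS = {
    "pubmed_parquet": "0 3 * * *",
    "pubmed_postgres": "0 4 * * *",
    "omicidx_duckdb": "0 5 * * *",
}

# Assumed dependency edges (downstream -> upstreams); illustrative only.
DEPENDS_ON = {
    "pubmed_postgres": ["pubmed_parquet"],
    "omicidx_duckdb": ["pubmed_parquet"],
}


def minute_of_day(cron: str) -> int:
    # Only handles plain daily expressions of the form "M H * * *".
    minute, hour, *rest = cron.split()
    if rest != ["*", "*", "*"]:
        raise ValueError(f"not a simple daily cron: {cron!r}")
    return int(hour) * 60 + int(minute)


def misordered_slots(slots, depends_on):
    # Return (downstream, upstream) pairs whose downstream slot is not later.
    return [
        (down, up)
        for down, ups in depends_on.items()
        for up in ups
        if minute_of_day(slots[down]) <= minute_of_day(slots[up])
    ]


print(misordered_slots(CRON_SLOTS, DEPENDS_ON))  # []
```

The check compares times within a single day only, so a downstream slot that wraps past midnight would need extra handling.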

The cascade is driven by an AutomationConditionSensorDefinition registered with default_status=DefaultSensorStatus.RUNNING so it’s active immediately on deployment without manual enable steps. See packages/omicidx-dagster/src/omicidx/dagster/definitions.py.

The cron+deps_updated pattern was adopted in PR #70 after a Copilot review caught the eager-cascade-from-hourly-upstream issue. The follow-up issue #73 tracks concurrent-run safety for omicidx_duckdb’s fixed output path.