Skip to content

Data sources

OmicIDX pulls from six upstream sources. Each is independently refreshed and consolidated into a Parquet snapshot before being loaded into the API’s PostgreSQL backing store.

  • What: Sequence Read Archive metadata — studies, samples, experiments, runs.
  • Source: NCBI FTP mirror of Metadata/ XML files, plus the SRA_Run_Members and SRA_Accessions accession tables.
  • Cadence: Daily.
  • Format upstream: XML (samples, studies, experiments, runs); TSV (accessions/run members).
  • Format here: Parquet, one file per entity (sra_studies.parquet, sra_samples.parquet, etc.).
  • What: Gene Expression Omnibus — series (GSE), samples (GSM), platforms (GPL), and the RNA-seq counts overlay.
  • Source: NCBI GEO FTP, monthly-partitioned mirrors.
  • Cadence: Monthly fetch; daily for the RNA-seq counts overlay.
  • Format upstream: SOFT (text) and per-record XML.
  • Format here: Parquet per entity, deduplicated to the latest version per accession.
  • What: NCBI’s descriptive sample metadata.
  • Source: https://ftp.ncbi.nlm.nih.gov/biosample/biosample_set.xml.gz.
  • Cadence: Daily.
  • Format upstream: Single gzipped XML dump.
  • Format here: JSONL.gz raw, then Parquet.
  • What: NCBI’s project-level metadata (the umbrella above samples and experiments).
  • Source: https://ftp.ncbi.nlm.nih.gov/bioproject/bioproject.xml.
  • Cadence: Daily.
  • Format upstream: Single XML dump.
  • Format here: JSONL.gz raw, then Parquet.
  • What: Citation records (titles, authors, journals, MeSH terms, abstracts).
  • Source: PubMed baseline + daily update files via NCBI FTP.
  • Cadence: Hourly file sensor → daily consolidation. New baseline ingest every January when NCBI re-publishes the corpus.
  • Format upstream: Per-day XML files.
  • Format here: Parquet, partitioned by file then consolidated daily.
  • What: The European complement to NCBI BioSample. Many samples appear here that aren’t in NCBI BioSample.
  • Source: https://www.ebi.ac.uk/biosamples/samples (REST API with dt:update filter).
  • Cadence: Daily-partitioned fetch starting 2021-01-01.
  • Format upstream: JSON via paginated REST.
  • Format here: NDJSON.gz per partition, consolidated to a single Parquet.
  • Raw + Parquet: Cloudflare R2, S3-compatible.
  • Analytical snapshot: omicidx.duckdb — a single DuckDB file containing all the views; rebuilt daily, served from R2.
  • API backing store: PostgreSQL, loaded via DuckDB’s postgres_scanner extension with a zero-downtime A/B view swap.

See Architecture for how this all fits together.