Data sources
OmicIDX pulls from six upstream sources. Each is independently refreshed and consolidated into a Parquet snapshot before being loaded into the API’s PostgreSQL backing store.
Sources
Section titled “Sources”NCBI SRA
Section titled “NCBI SRA”- What: Sequence Read Archive metadata — studies, samples, experiments, runs.
- Source: NCBI FTP mirror of
Metadata/XML files, plus theSRA_Run_MembersandSRA_Accessionsaccession tables. - Cadence: Daily.
- Format upstream: XML (samples, studies, experiments, runs); TSV (accessions/run members).
- Format here: Parquet, one file per entity (
sra_studies.parquet,sra_samples.parquet, etc.).
NCBI GEO
Section titled “NCBI GEO”- What: Gene Expression Omnibus — series (GSE), samples (GSM), platforms (GPL), and the RNA-seq counts overlay.
- Source: NCBI GEO FTP, monthly-partitioned mirrors.
- Cadence: Monthly fetch; daily for the RNA-seq counts overlay.
- Format upstream: SOFT (text) and per-record XML.
- Format here: Parquet per entity, deduplicated to the latest version per accession.
NCBI BioSample
Section titled “NCBI BioSample”- What: NCBI’s descriptive sample metadata.
- Source:
https://ftp.ncbi.nlm.nih.gov/biosample/biosample_set.xml.gz. - Cadence: Daily.
- Format upstream: Single gzipped XML dump.
- Format here: JSONL.gz raw, then Parquet.
NCBI BioProject
Section titled “NCBI BioProject”- What: NCBI’s project-level metadata (the umbrella above samples and experiments).
- Source:
https://ftp.ncbi.nlm.nih.gov/bioproject/bioproject.xml. - Cadence: Daily.
- Format upstream: Single XML dump.
- Format here: JSONL.gz raw, then Parquet.
NCBI PubMed
Section titled “NCBI PubMed”- What: Citation records (titles, authors, journals, MeSH terms, abstracts).
- Source: PubMed baseline + daily update files via NCBI FTP.
- Cadence: Hourly file sensor → daily consolidation. New baseline ingest every January when NCBI re-publishes the corpus.
- Format upstream: Per-day XML files.
- Format here: Parquet, partitioned by file then consolidated daily.
EBI BioSamples
Section titled “EBI BioSamples”- What: The European complement to NCBI BioSample. Many samples appear here that aren’t in NCBI BioSample.
- Source:
https://www.ebi.ac.uk/biosamples/samples(REST API withdt:updatefilter). - Cadence: Daily-partitioned fetch starting 2021-01-01.
- Format upstream: JSON via paginated REST.
- Format here: NDJSON.gz per partition, consolidated to a single Parquet.
Where the data lives
Section titled “Where the data lives”- Raw + Parquet: Cloudflare R2, S3-compatible.
- Analytical snapshot:
omicidx.duckdb— a single DuckDB file containing all the views; rebuilt daily, served from R2. - API backing store: PostgreSQL, loaded via DuckDB’s
postgres_scannerextension with a zero-downtime A/B view swap.
See Architecture for how this all fits together.