BIDS-SQL

Database infrastructure for neuroimaging data

Solo Developer · 2026–Present · Open Source

Key Achievement: Queryable neuroimaging data · Powers MR.Flow dataset management · Full processing provenance tracking

BIDS-SQL is a database schema and Python library for storing, querying, and managing BIDS-formatted neuroimaging datasets. It transforms the file-based BIDS structure into queryable relational tables—enabling complex queries across subjects, sessions, modalities, and processing pipelines.

The schema captures everything from raw acquisition metadata to processing provenance: which files were processed, by which pipeline stages, with what QC metrics. It's the data persistence layer that enables MR.Flow's dataset management and will power future neuroimaging tools.

View on GitLab →

Infrastructure, not end-user facing. BIDS-SQL is the foundational data layer that other tools build on—the architecture that determines what queries are possible and how efficiently workflows can run.

BIDS directory structure to SQL database transformation — BIDS file structure → queryable SQL tables with modular extensions

Schema Architecture

Core BIDS Tables

datasets subjects sessions files metadata scans
Dataset descriptions, participant info, file entities (task, run, acquisition), JSON sidecar storage

EEG Extensions

eeg_metadata eeg_channels eeg_events eeg_electrodes eeg_coordsystem
50+ recording parameter fields, channel info, event markers, electrode positions

MRI Extensions

mri_metadata mri_intended_for fmri_events physio_metadata dwi_metadata
Acquisition parameters, fieldmap relationships, physiological recordings

Processing Provenance

processing_runs processing_stages qc_metrics ica_components bad_channels rejected_epochs
Full pipeline tracking with timing, status, errors, QC thresholds, component classifications

Modular Design: Mix schemas for your modality (EEG + provenance, MRI + provenance, or all)

What Problem It Solves

BIDS (Brain Imaging Data Structure) organizes neuroimaging data as files in directories. This works for storage, but falls apart when you need to:

Query across 200 subjects: "Which subjects completed Task A with good QC?"
Track processing: "Which files have been preprocessed? Which failed?"
Join metadata: "Get all EEG files where sampling rate > 500 Hz and impedance < 10 kΩ"
Maintain provenance: "What pipeline version processed this file? What QC metrics resulted?"

BIDS-SQL makes BIDS queryable. Instead of walking directories and parsing JSON, you write SQL.

How It Works

Dual Backend: SQLite for Local, PostgreSQL for Shared

SQLite — Single-file database for local projects, no server needed
PostgreSQL — Multi-user access for shared lab databases
Identical Python API — Same code works with either backend
Schema versioning — Modular SQL files applied in order

Import/Export: BIDS Directory ↔ Database Roundtrip

BIDSImporter — Parse BIDS directories with PyBIDS, populate database
BIDSExporter — Export database contents back to BIDS-formatted directories
Entity normalization — Auto-strip 'sub-', 'ses-' prefixes for clean queries
Metadata extraction — JSON sidecars stored as queryable key-value pairs

Query Helpers for Common Operations

File queries — Filter by datatype, task, run, session, subject, acquisition
QC queries — Get metrics with threshold evaluation, summary statistics
Processing queries — Track pipeline stages, find failed jobs, check completion
Metadata search — Search across sidecar fields with SQL patterns

Processing Provenance: Full Pipeline Tracking

Pipeline runs — Track inputs, outputs, execution status, timing, errors
Multi-stage support — Record each processing stage independently
QC metrics — Store metrics with categories (signal_quality, artifacts, motion) and thresholds
ICA tracking — ICLabel classifications with probabilities and removal decisions
Bad channel history — Detection method, interpolation records
Epoch rejection — Reasons, peak amplitudes, timestamps

Example Queries

-- Get all EEG files for subjects who completed TaskA
SELECT f.filepath, s.subject_id
FROM files f
JOIN subjects s ON f.subject_id = s.id
WHERE f.datatype = 'eeg' AND f.task = 'TaskA';

-- Find files that failed preprocessing
SELECT f.filepath, pr.error_message
FROM processing_runs pr
JOIN files f ON pr.input_file_id = f.id
WHERE pr.status = 'failed';

-- QC summary: average metrics by subject
SELECT s.subject_id, AVG(qc.value) as avg_snr
FROM qc_metrics qc
JOIN files f ON qc.file_id = f.id
JOIN subjects s ON f.subject_id = s.id
WHERE qc.metric_name = 'snr'
GROUP BY s.subject_id;

Architecture Decisions

Modular schemas — Core + extensions (EEG, MRI, provenance) can be mixed
Foreign key constraints — Data integrity with ON DELETE CASCADE
Strategic indexing — Indexes on commonly filtered fields (dataset_id, datatype, task)
BIDS compliance — Entity naming matches the spec exactly
File lineage — Derivatives track source files via source_file_id

Impact

Powers MR.Flow's dataset management and job tracking
Enables complex queries that would require custom scripts with file-based BIDS
Full reproducibility through processing provenance
Foundation for future neuroimaging tools in the Neurotrophy ecosystem

Related Tools

How does this connect? This is the data layer that enables tools like MR.Flow to query and coordinate neuroimaging workflows. For context on how infrastructure architecture shapes what's possible, see Architecture → Processing → Behavior.