ehrQL output formats
Supported output formats
The following output formats are supported:
Recommended
.arrow: Apache Arrow format
.csv.gz: compressed CSV format
Not recommended
.csv: uncompressed CSV format
The uncompressed CSV format is not recommended because it produces much larger files than the alternative formats.
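As a rough illustration of the size difference, the sketch below writes the same rows as CSV and compares the byte count with and without gzip compression. The rows are made up for the example and the helper name `csv_sizes` is hypothetical; the ratio you see with real extracts will depend on the data.

```python
import csv
import gzip
import io


def csv_sizes(rows):
    """Return (uncompressed_bytes, gzip_bytes) for rows written as CSV."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    raw = buffer.getvalue().encode("utf-8")
    compressed = gzip.compress(raw)
    return len(raw), len(compressed)


# Made-up, repetitive rows standing in for a dataset extract.
rows = [("patient_id", "age", "sex")]
rows += [(i, 20 + i % 60, "female" if i % 2 else "male") for i in range(10_000)]

raw_size, gz_size = csv_sizes(rows)
print(f"uncompressed: {raw_size} bytes, gzip: {gz_size} bytes")
```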
Unsupported output formats
These formats were supported in cohort-extractor, but are not supported by ehrQL:
.dta and .dta.gz: Stata formats
arrowload for Stata users
Stata itself does not directly support .arrow.
However, OpenSAFELY's Stata Docker image contains the arrowload library that can load .arrow files in Stata.
Use arrowload as follows:
. arrowload /path/to/arrow/file
To see the full documentation, start command-line Stata via OpenSAFELY:
opensafely exec stata-mp stata
and then run:
. help arrowload
Selecting an output format
You select an output format when you use the --output option to specify an output filename for ehrQL.
The filename extension that you provide (for example, .arrow) determines the output file format.
If you specify a filename extension that is not supported, you will get an error telling you so.
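The extension rule can be pictured with a small lookup. This is only a sketch of the idea, not ehrQL's own code: the function name `output_format`, the format labels, and the error message are all hypothetical, and only the three extensions listed on this page are included.

```python
from pathlib import Path

# Extensions come from the lists on this page; the labels are illustrative.
SUPPORTED = {
    ".arrow": "Apache Arrow",
    ".csv.gz": "compressed CSV",
    ".csv": "uncompressed CSV",
}


def output_format(filename):
    """Return the format implied by the filename extension, or raise ValueError."""
    name = Path(filename).name
    # Match on the end of the name so that multi-part extensions
    # such as ".csv.gz" are recognised as a whole.
    for extension, label in SUPPORTED.items():
        if name.endswith(extension):
            return label
    raise ValueError(f"unsupported output file extension: {filename!r}")


print(output_format("./outputs/data_extract.arrow"))
print(output_format("./outputs/data_extract.csv.gz"))
```

Passing a name such as "data_extract.dta" to this helper raises ValueError, mirroring the error described above.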
Examples with opensafely exec
.arrow
opensafely exec ehrql:v0 generate-dataset "./dataset-definition.py" --dummy-tables "example-data/" --output "./outputs/data_extract.arrow"
.csv.gz
opensafely exec ehrql:v0 generate-dataset "./dataset-definition.py" --dummy-tables "example-data/" --output "./outputs/data_extract.csv.gz"
Example project.yaml
version: "3.0"
expectations:
  population_size: 1000
actions:
  extract_data:
    run: ehrql:v0 generate-dataset "./dataset_definition.py" --output "outputs/data_extract.arrow"
    outputs:
      highly_sensitive:
        population: outputs/data_extract.arrow
The population filename must be identical to the output filename specified by --output.
Otherwise, you will see the following error when you use opensafely run to run the project actions:
$ opensafely run run_all
=> ProjectValidationError
   Invalid project:
   1 validation error for Pipeline
   __root__
     --output in run command and outputs must match (type=value_error)
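To catch such a mismatch before running the project, one could compare the two paths by hand. The sketch below is a hypothetical pre-check, not the validation that opensafely itself performs: it represents the extract_data action from the example project.yaml as a Python dict, assumes the path is given as a separate token after --output, and assumes an exact string comparison is what matters.

```python
import shlex


def output_path_from_run(run_command):
    """Return the token that follows --output in a run command, or None."""
    tokens = shlex.split(run_command)
    for index, token in enumerate(tokens[:-1]):
        if token == "--output":
            return tokens[index + 1]
    return None


def outputs_match(action):
    """Check that the --output path appears among the action's declared outputs."""
    declared = {
        path
        for level in action["outputs"].values()
        for path in level.values()
    }
    return output_path_from_run(action["run"]) in declared


# The extract_data action from the example project.yaml, written as a dict.
action = {
    "run": 'ehrql:v0 generate-dataset "./dataset_definition.py" '
           '--output "outputs/data_extract.arrow"',
    "outputs": {"highly_sensitive": {"population": "outputs/data_extract.arrow"}},
}
print(outputs_match(action))
```

Because the comparison is a plain string match, "./outputs/data_extract.arrow" and "outputs/data_extract.arrow" count as different paths in this sketch.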