hiopy

analysis-ready hierarchical model output

Tobias Kölling

Observation

Analysis scripts are forced to load way too much data.

Idea

optimize output for analysis

(not write throughput)

single dataset

(from xarray docs)

  • n-dimensional variables
  • shared dimensions
  • coordinates
  • attributes for metadata
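These four ingredients can be illustrated with a tiny, made-up example (the variable names `tas`/`pr` and all values here are invented for illustration):

```python
import numpy as np
import xarray as xr

# toy Dataset showing the four ingredients from the xarray docs:
ds = xr.Dataset(
    data_vars={
        "tas": (("time", "cell"), np.random.rand(2, 12)),  # n-dimensional variable
        "pr": (("time", "cell"), np.random.rand(2, 12)),   # shares the same dimensions
    },
    coords={  # coordinates label the shared dimensions
        "time": np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]"),
    },
    attrs={"title": "toy dataset"},  # attributes carry metadata
)
```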

Why a (single) Dataset

  • provides an easy-to-understand overview
  • forces consistency across output
  • cutting things is easier than gluing things
  • no complex search queries required

Datasets are not

  • a single file
  • a storage format
  • shaped by storage & handling

chunking & hierarchy

Grid Cells
  1° by 1°      0.06M
  10 km          5.1M
  5 km            20M
  1 km           510M
  200 m        12750M

Screen Pixels

  VGA            0.3M
  Full HD        2.1M
  MacBook 13″    4.1M
  4K             8.8M
  8K            35.4M

It’s impossible to look at the entire globe in full resolution.
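The cell counts above can be reproduced from Earth's surface area of roughly 510 million km²; a back-of-the-envelope sketch (the helper name is made up):

```python
# Earth's surface area in km², the basis of the cell counts in the table
EARTH_SURFACE_KM2 = 510e6

def n_cells(dx_km):
    """Approximate global cell count for an equal-area grid with spacing dx_km."""
    return EARTH_SURFACE_KM2 / dx_km**2

for dx_km in (10, 5, 1, 0.2):
    print(f"{dx_km:>4} km -> {n_cells(dx_km) / 1e6:,.1f} M cells")
```

Even an 8K screen (~35.4M pixels) falls far short of the ~510M cells of a 1 km grid, which is why a resolution hierarchy is needed.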

scalable implementation

should work well below km-scale

Dataset Implementations

Format choice

(for now)

  • Zarr because single PiB-scale datasets are manageable
  • Zarr because concurrent writes are trivial
  • plain-filesystem backend to keep support for netCDF
  • blosc-zstd compression (much faster than zlib)

Zarr

hiopy implementation

sketch of hiopy worker

# set up the YAC coupler and this worker's HEALPix subgrid
yac = YAC(xml, xsd)
sgd = HealPixSubgridDefinition(2**10, nchunks, ichunk)

# open the (pre-created) Zarr dataset for concurrent writing
dataset = zarr.open_consolidated(output_folder, mode="r+")

# register the worker's grid and the requested fields with YAC
comp_id = yac.def_comp("healpix_io")
point_id, grid = make_yac_grid(sgd)
fields = {varname: Field.create(varname, comp_id, point_id)
          for varname in varnames}
put_fields = ...  # fields forwarding coarsened data to the next level
yac.search()

steps = compute_nsteps(yac, fields)
for i in range(steps):
    for varname, field in fields.items():
        buffer = field.get()                        # receive data from the model
        put_fields[varname].put(coarsen(buffer))    # pass coarsened data onwards
        dataset[varname][i, sgd.cell_chunk_slice] = buffer  # write own chunk

(real code at GWDG gitlab)

process setup

does it work

ngc3028 ICON run

  • coupled (atmosphere and ocean)
  • 5km resolution
  • output intervals: 30min, 3h and daily
  • 5 simulated years

It’s about 1.6 PiB.

(if it weren’t compressed)

nice benefit: monitoring

  • all resolutions available immediately
  • Zarr can be read while written

import xarray as xr
ds = xr.open_dataset("coarse_resolution_output.zarr", engine="zarr")
ds.tas.mean("cell").plot()

dropsonde data comparison

This code selects ICON model output at all dropsonde locations during EUREC4A.

sonde_pix = healpy.ang2pix(
    icon.crs.healpix_nside, joanne.flight_lon, joanne.flight_lat, lonlat=True, nest=True
)

icon_sondes = (
    icon[["ua", "va", "ta", "hus"]]
    .sel(time=joanne.launch_time, method="nearest")
    .isel(cell=sonde_pix)
    .compute()
)

(55 sec, 1GB, single thread, full code at easy.gems)

interactive crosssections

(code on GWDG gitlab)

nextGEMS Hackathon Madrid

  • remarkably few issues raised
  • very positive feedback from users

things to explore

transparent tape offload

  • hackathon showed: only 3% of data was accessed
  • tar-ing chunks (cleverly) + reference fs + slkspec might work
  • (HTTP-) proxy might be another option

sharded store

ZEP0002 proposes a standard way to pack multiple Zarr chunks into one object.

formalized hierarchies

ZEP0005 proposes a standard way to represent aggregated hierarchies.