hiopy

analysis-ready hierarchical model output

Tobias Kölling

Observation

Analysis scripts are forced to load way too much data.

Idea

optimize output for analysis

(not write throughput)

single dataset

(from xarray docs)

  • n-dimensional variables
  • shared dimensions
  • coordinates
  • attributes for metadata
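These four ingredients can be illustrated with a tiny, made-up example (the variable names `tas`/`pr` and all values here are invented for illustration):

```python
import numpy as np
import xarray as xr

# toy Dataset showing the four ingredients from the xarray docs:
ds = xr.Dataset(
    data_vars={
        "tas": (("time", "cell"), np.random.rand(2, 12)),  # n-dimensional variable
        "pr": (("time", "cell"), np.random.rand(2, 12)),   # shares the same dimensions
    },
    coords={  # coordinates label the shared dimensions
        "time": np.array(["2020-01-01", "2020-01-02"], dtype="datetime64[ns]"),
    },
    attrs={"title": "toy dataset"},  # attributes carry metadata
)
```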

Why a (single) Dataset

  • provides an easy-to-understand overview
  • forces consistency across output
  • cutting things is easier than gluing things
  • no complex search queries required

Datasets are not

  • a single file
  • a storage format
  • shaped by storage & handling

chunking & hierarchy

Grid Cells
  1° by 1°      0.06M
  10 km          5.1M
  5 km            20M
  1 km           510M
  200 m        12750M

Screen Pixels

  VGA            0.3M
  Full HD        2.1M
  MacBook 13″    4.1M
  4K             8.8M
  8K            35.4M

It’s impossible to look at the entire globe in full resolution.
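The cell counts above can be reproduced from Earth's surface area of roughly 510 million km²; a back-of-the-envelope sketch (the helper name is made up):

```python
# Earth's surface area in km², the basis of the cell counts in the table
EARTH_SURFACE_KM2 = 510e6

def n_cells(dx_km):
    """Approximate global cell count for an equal-area grid with spacing dx_km."""
    return EARTH_SURFACE_KM2 / dx_km**2

for dx_km in (10, 5, 1, 0.2):
    print(f"{dx_km:>4} km -> {n_cells(dx_km) / 1e6:,.1f} M cells")
```

Even an 8K screen (~35.4M pixels) falls far short of the ~510M cells of a 1 km grid, which is why a resolution hierarchy is needed.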

scalable implementation

should work well below km-scale

Dataset Implementations

Format choice

(for now)

  • Zarr because single PiB-scale datasets are manageable
  • Zarr because concurrent writes are trivial
  • plain-filesystem backend to keep support for netCDF
  • blosc-zstd compression (much faster than zlib)

Zarr

hiopy implementation

sketch of hiopy worker

# set up the YAC coupler and this worker's HEALPix subgrid
yac = YAC(xml, xsd)
sgd = HealPixSubgridDefinition(2**10, nchunks, ichunk)

# open the (pre-created) Zarr dataset for concurrent writing
dataset = zarr.open_consolidated(output_folder, mode="r+")

# register the worker's grid and the requested fields with YAC
comp_id = yac.def_comp("healpix_io")
point_id, grid = make_yac_grid(sgd)
fields = {varname: Field.create(varname, comp_id, point_id)
          for varname in varnames}
put_fields = ...  # fields forwarding coarsened data to the next level
yac.search()

steps = compute_nsteps(yac, fields)
for i in range(steps):
    for varname, field in fields.items():
        buffer = field.get()                        # receive data from the model
        put_fields[varname].put(coarsen(buffer))    # pass coarsened data onwards
        dataset[varname][i, sgd.cell_chunk_slice] = buffer  # write own chunk

(real code at GWDG gitlab)

process setup

does it work

ngc3028 ICON run

  • coupled (atmosphere and ocean)
  • 5km resolution
  • output intervals: 30min, 3h and daily
  • 5 simulated years

It’s about 1.6 PiB.

(if it weren’t compressed)

nice benefit: monitoring

  • all resolutions available immediately
  • Zarr can be read while written

import xarray as xr
ds = xr.open_dataset("coarse_resolution_output.zarr", engine="zarr")
ds.tas.mean("cell").plot()

dropsonde data comparison

This code selects ICON model output at all dropsonde locations during EUREC4A.

sonde_pix = healpy.ang2pix(
    icon.crs.healpix_nside, joanne.flight_lon, joanne.flight_lat, lonlat=True, nest=True
)

icon_sondes = (
    icon[["ua", "va", "ta", "hus"]]
    .sel(time=joanne.launch_time, method="nearest")
    .isel(cell=sonde_pix)
    .compute()
)

(55 sec, 1GB, single thread, full code at easy.gems)

interactive crosssections

(code on GWDG gitlab)

nextGEMS Hackathon Madrid

  • remarkably few issues raised
  • very positive feedback from users

things to explore

transparent tape offload

  • hackathon showed: only 3% of data was accessed
  • tar-ing chunks (cleverly) + reference fs + slkspec might work
  • (HTTP-) proxy might be another option

sharded store

ZEP0002 proposes a standard way to pack multiple Zarr chunks into one object.

formalized hierarchies

ZEP0005 proposes a standard way to represent aggregated hierarchies.