IPFS @ ORCESTRA

Björn Brötz

Deutsches Zentrum für Luft- und Raumfahrt

2025-03-21

EUREC4A (2020)

a reason

EUREC4A platforms

producers / consumers / payers

want to share, use, publish, distribute, host and find data together


are part of orthogonal groups, e.g.: aircraft, institution, country, campaign

We share public data

ORCESTRA (2024)

an opportunity

Hardware @ ORCESTRA

cargo with servers was 3 weeks delayed

we’ve got this instead, so let’s try

let’s try IPFS

  • can store files
  • content addressed
  • peer to peer
  • public

  • we knew about the data sharing issues from EUREC4A
  • and had some small-scale post-EUREC4A experience before
  • now: a good opportunity to try IPFS at larger scale

IPFS

content addressed & distributed

IPLD data model

  • like JSON
  • plus bytes and link types
  • serialized into CBOR, protocol buffers, JSON and more
  • packed in a data block, identified by a hash (CID)
  • forms a Merkle-DAG
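
The idea behind the bullets above can be sketched in a few lines of Python. This is a toy model, not how IPFS encodes data: JSON stands in for the IPLD codecs, a hex SHA-256 digest stands in for a real CID, and `put` and `store` are hypothetical names.

```python
import hashlib
import json

def put(store, node):
    """Serialize a node, store it under the hash of its bytes, return that hash."""
    data = json.dumps(node, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()  # stand-in for a CID
    store[key] = data
    return key

store = {}

# Leaves hold data, the root only holds links (hashes) to its children.
leaf_a = put(store, {"data": "temperature"})
leaf_b = put(store, {"data": "humidity"})
root = put(store, {"links": {"a": leaf_a, "b": leaf_b}})

# Changing one leaf changes its hash and therefore the hash of the root,
# while the untouched leaf keeps its hash and is stored only once.
leaf_b2 = put(store, {"data": "humidity v2"})
root2 = put(store, {"links": {"a": leaf_a, "b": leaf_b2}})

print(root != root2)
print(len(store))
```

Identical content always maps to the same key, which is what makes deduplication and verification possible without trusting the peer that served the block.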

files on IPFS

naming things

IPFS CIDs refer to immutable content.
A naming system is required for updates, e.g.:

  • public key fingerprints
  • DNS records

flowchart LR
  IPNS -- name --> IPFS -- CID --> content

requires trust
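
For the DNS option, one common convention is DNSLink: a TXT record whose value has the form `dnslink=/ipfs/<CID>`. The sketch below only parses such record values; it does no DNS lookup. The function name `parse_dnslink` and the example record list are hypothetical, and the CID is the one shown in the name resolution output of this talk.

```python
def parse_dnslink(txt_records):
    """Return the first content path found in DNSLink-style TXT values, else None."""
    prefix = "dnslink="
    for record in txt_records:
        value = record.strip().strip('"')
        if value.startswith(prefix):
            path = value[len(prefix):]
            # Only accept paths that point into the IPFS or IPNS namespace.
            if path.startswith(("/ipfs/", "/ipns/")):
                return path
    return None

# Hypothetical TXT values for a domain; the first one is unrelated to IPFS.
records = [
    "v=spf1 -all",
    "dnslink=/ipfs/QmSgY99MScFdqhwroLg7QLSGhkLhtuaMijjrX9yxqGLxcG",
]
print(parse_dnslink(records))
```

Whoever controls the DNS zone decides which CID the name points to, which is where the trust requirement comes from.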

IPFS @ ORCESTRA

setup

  • latest.orcestra-campaign.org
  • Raspberry Pi as on-site node
  • individual laptops run nodes
  • nightly sync to node at DKRZ
  • additional nodes at CIMH
  • data accessible from everywhere
  • name & pinning updates via GitHub

actual server

name resolution

!ipfs resolve /ipns/latest.orcestra-campaign.org
/ipfs/QmSgY99MScFdqhwroLg7QLSGhkLhtuaMijjrX9yxqGLxcG

ls

!ipfs ls /ipfs/QmfDxNnrmxX1pjSy2LK9frQkB8NXbiytdrsEKREjmWYNsD
QmPmvSb767cgHXKtTC2X9im7ftfZDT4NJigxveqt6yf9PW - meta/
QmT1cpAaBAppRo1EAnpXRonQbRyPeAQx4yM7yjib6cRFa7 - products/
QmUsJPneXEdDKHP75JrnqoABWQxcbPUemsSBZWLdc3x6UH - raw/
!ipfs ls /ipns/latest.orcestra-campaign.org
QmPmvSb767cgHXKtTC2X9im7ftfZDT4NJigxveqt6yf9PW - meta/
QmezVxJ68tn2H87ydux5U5ecMMNKWgapaTYk9JmASTJfbi - products/
QmPS5PwZBpphauFjTqL32xWp4jkptogbDHPcGiyjuGKsWp - raw/
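
The two listings above belong to different versions of the tree. A small sketch, assuming the `<CID> - <name>` line layout shown above (`parse_ls` is a hypothetical helper), makes visible which subtrees changed between them:

```python
def parse_ls(output):
    """Map entry name -> CID for lines shaped like '<CID> - <name>'."""
    entries = {}
    for line in output.strip().splitlines():
        cid, _, name = line.split(maxsplit=2)
        entries[name] = cid
    return entries

old = parse_ls("""
QmPmvSb767cgHXKtTC2X9im7ftfZDT4NJigxveqt6yf9PW - meta/
QmT1cpAaBAppRo1EAnpXRonQbRyPeAQx4yM7yjib6cRFa7 - products/
QmUsJPneXEdDKHP75JrnqoABWQxcbPUemsSBZWLdc3x6UH - raw/
""")
new = parse_ls("""
QmPmvSb767cgHXKtTC2X9im7ftfZDT4NJigxveqt6yf9PW - meta/
QmezVxJ68tn2H87ydux5U5ecMMNKWgapaTYk9JmASTJfbi - products/
QmPS5PwZBpphauFjTqL32xWp4jkptogbDHPcGiyjuGKsWp - raw/
""")

# A subtree whose CID is unchanged has identical content and needs no re-transfer.
changed = sorted(name for name in new if old.get(name) != new[name])
print(changed)  # ['products/', 'raw/']
```

Here `meta/` keeps its CID, so a node that already holds the older tree only has to fetch what changed below `products/` and `raw/`.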

use data with e.g. Python

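import xarray as xr
import matplotlib.pyplot as plt

# ipfs:// and ipns:// URLs need an fsspec backend for IPFS (e.g. the ipfsspec package)
# root = "ipns://latest.orcestra-campaign.org"  # mutable name, follows updates
root = "ipfs://QmenSJd5QnrikC92MFDaFFjTvkBSTjvD1dBggvzvKLh1DT"  # immutable snapshot
ds = xr.open_dataset(f"{root}/products/HALO/dropsondes/Level_3/PERCUSION_Level_3.zarr", engine="zarr")

# aircraft positions, coloured by iwv (integrated water vapour)
plt.scatter(ds.aircraft_longitude, ds.aircraft_latitude, c=ds.iwv, cmap="viridis_r", vmin=45, vmax=70, s=2)
plt.colorbar()
plt.show()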

Browser

browser.orcestra-campaign.org

backup

data flow

flowchart LR
  data -- ipfs add --> sc[subtree CID] -- pull request --> tree[tree.yaml]
  subgraph "data flow"
  tree -- MFS --> root[root CID]
  root -- pin -->  pins[pinning service]
  root -- publish --> dns["DNS (latest...)"]
  end
  subgraph "index flow"
  root -- scan --> dcid["dataset CID(s)"] -- extract metadata --> stac_item[STAC item]
  stac_item -- collect --> stac_index[STAC index]
  end
  stac_item & stac_index --> browser
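
The "extract metadata" step of the index flow could look roughly like the sketch below. It is not the ORCESTRA implementation: `make_stac_item` is a hypothetical helper, and the id, CID, bounding box and timestamp are placeholder values. Only the top-level field names follow the STAC item layout.

```python
import json

def make_stac_item(item_id, cid, bbox, datetime_utc):
    """Build a minimal STAC-item-like dict whose asset points at a dataset CID."""
    west, south, east, north = bbox
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "bbox": [west, south, east, north],
        "geometry": {
            "type": "Polygon",
            # closed ring: the first and last coordinate are the same
            "coordinates": [[
                [west, south], [east, south], [east, north],
                [west, north], [west, south],
            ]],
        },
        "properties": {"datetime": datetime_utc},
        "links": [],
        "assets": {"data": {"href": f"ipfs://{cid}"}},
    }

# Placeholder values for illustration only.
item = make_stac_item(
    item_id="example-dataset",
    cid="QmExampleDatasetCid",
    bbox=(-60.0, 5.0, -20.0, 20.0),
    datetime_utc="2024-09-01T00:00:00Z",
)
print(json.dumps(item, indent=2))
```

Because the asset link is a CID, an item always describes exactly one version of a dataset. A new root CID therefore means a new scan and new items.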

nice datasets

We still need to gather more datasets.

IPFS ❤️ nice datasets

  • large datasets are easier to understand
  • well chunked datasets can be accessed remotely
  • it’s a task for dataset creators
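
What "well chunked" means depends on the access pattern, but a rough estimate of chunk count and chunk size already helps. The sketch below assumes a layout in which every chunk is fetched as a separate object (as in plain Zarr stores) and ignores compression. `chunk_stats` and the array shape are hypothetical.

```python
import math

def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, uncompressed bytes per chunk)."""
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_bytes = math.prod(chunks) * itemsize
    return n_chunks, chunk_bytes

# Hypothetical array: 10,000 time steps x 1,000 levels of float32 (4 bytes each).
print(chunk_stats((10_000, 1_000), (1, 1_000), 4))      # (10000, 4000)
print(chunk_stats((10_000, 1_000), (1_000, 1_000), 4))  # (10, 4000000)
```

The first layout needs 10,000 tiny requests to read the whole array. The second needs 10 requests of about 4 MB each, which is much friendlier for remote access.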

see also: EGU2024 slides and poster

inside-out database

  • index structures on IPFS
    • STAC items
    • version information
  • updated on DNS name update

HaloDB

After a long planning period, a project to implement a distributed HaloDB is now taking shape.

  • will be based on IPFS
  • will add ease-of-use features
  • will add more storage capacity
  • based on EUREC4A and ORCESTRA experience