An Introduction to Pangeo

LP / ORNL DAAC UWG 2019

Andrew Pawloski
andrew@element84.com

Talk Agenda

  • What is Pangeo?
  • Why Pangeo?
  • How do I use it?
  • How can I contribute to it?

"How do I as a scientist use these technologies to advance my data workflows?"

- Aaron Friesz

What is Pangeo?

  • Not a software distro, package, or library
  • Community of data users and software developers
  • Frequently used packages (Pangeo "stack")
  • Curated set of tools, architectures, and methodologies

Image Source: Jake VanderPlas, "The State of the Stack," SciPy Keynote (SciPy 2017).

Image Source: Theo McCaie, UK Met Office, ESA Φ-week Event.

Goal:

Foster collaboration around the open source scientific Python ecosystem for ocean, atmosphere, and climate science

Goal:

Build out the stack with domain-specific packages (e.g. thermodynamics, regridding, vector calculus)

Goal:

Scale to handle multi-petabyte data missions (e.g. SWOT, NISAR)

EOSDIS Data Ingest Rates

Image Source: https://earthdata.nasa.gov/about/cloud-evolution

Image Source: Pangeo Technical Architecture (https://pangeo.io/architecture.html)

  • No source data downloaded to workstation
  • Well suited to cloud-native data stores (e.g. S3, GCS) and formats (e.g. COGs, Zarr)
  • Notebooks can be persisted, shared
  • Compute is ephemeral

Xarray

http://xarray.pydata.org/en/stable/api.html
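As a minimal sketch of the labeled-array model Xarray provides, the example below builds a small synthetic DataArray and works with it by dimension name and coordinate value instead of integer position. The variable names, coordinate values, and data are invented for illustration; only `xr.DataArray`, `.sel`, and `.mean` come from the Xarray API.

```python
import numpy as np
import xarray as xr

# Synthetic field: 4 time steps on a 3 x 5 lat/lon grid (values are made up).
data = np.arange(4 * 3 * 5, dtype="float64").reshape(4, 3, 5)
temperature = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={
        "time": np.arange(4),
        "lat": [10.0, 20.0, 30.0],
        "lon": [100.0, 110.0, 120.0, 130.0, 140.0],
    },
    name="temperature",
)

# Select by coordinate label rather than by array index.
point_series = temperature.sel(lat=20.0, lon=120.0)

# Reduce over a named dimension; the result keeps its lat/lon labels.
time_mean = temperature.mean(dim="time")

print(point_series.values)
print(time_mean.dims)
```

Working with names such as `time` and `lat` means the same analysis code keeps working if the underlying array is stored with a different axis order.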

Xarray

Dask

Image Source: Multidimensional Arrays, Geohackweek 2016

Dask

Image Source: Multidimensional Arrays, Geohackweek 2016
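The chunked-array idea pictured on the Dask slides can be sketched with `dask.array`. This is an illustrative example, not code from the talk: the array size and chunk size are arbitrary assumptions chosen to keep it small.

```python
import dask.array as da

# A 1000 x 1000 array of ones, split into sixteen 250 x 250 chunks.
x = da.ones((1000, 1000), chunks=(250, 250))

# Building the expression is lazy: this only records a task graph.
column_means = (x + x.T).mean(axis=0)

# compute() executes the graph chunk by chunk and returns a NumPy array.
result = column_means.compute()

print(x.numblocks)
print(result.shape)
```

Because each chunk is an ordinary NumPy array, the same expression can run on a laptop or be spread across the workers of a cluster without changes to the analysis code.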

Cloud-Native Data Formats

Linear Reads

Image Source: James Norton (Element 84)

Cloud-Native Data Formats

Tiled Reads

Image Source: James Norton (Element 84)
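The difference between linear and tiled reads can be estimated with a little arithmetic. The sketch below is a simplified model under stated assumptions: the raster size, tile size, and window are hypothetical, and the linear case assumes a reader fetches every full-width row that the window touches.

```python
import math


def linear_pixels_read(width, row_start, row_stop):
    """Pixels fetched when rows are stored end to end and whole rows are read."""
    return (row_stop - row_start) * width


def tiles_for_window(row_start, row_stop, col_start, col_stop, tile_size):
    """(tile_row, tile_col) index of every tile overlapping a half-open window."""
    tile_rows = range(row_start // tile_size, math.ceil(row_stop / tile_size))
    tile_cols = range(col_start // tile_size, math.ceil(col_stop / tile_size))
    return [(r, c) for r in tile_rows for c in tile_cols]


# Hypothetical 10,000-pixel-wide raster, 256-pixel tiles, 500 x 500 window.
width = 10_000
tile_size = 256

linear = linear_pixels_read(width, row_start=1000, row_stop=1500)
tiles = tiles_for_window(1000, 1500, 4000, 4500, tile_size)
tiled = len(tiles) * tile_size**2

print(linear)      # 5000000
print(len(tiles))  # 9
print(tiled)       # 589824
```

In this model the tiled layout reads roughly an eighth of the pixels for the same window, which is the motivation for tiling data that will be subset remotely.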

Image Source: Scott Henderson (University of Washington)

Demo?

https://nasa.pangeo.io

Deploying Pangeo

  1. Deploy Kubernetes Cluster
  2. Configure with Pangeo-provided settings file
  3. Deploy Pangeo-provided Helm chart

http://pangeo.io/setup_guides/index.html

https://github.com/pangeo-data/pangeo-cloud-federation

Further Resources

Thank you.

(Come see Pangeo talks at AWS PSS, ESIP!)

andrew@element84.com

apawl.com/talks/pangeo-lp-ornl.html

Backup Slides

Using Your Deployment: Dask


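from dask_jobqueue import PBSCluster
from dask.distributed import Client

# Describe the resources of a single PBS job
cluster = PBSCluster(cores=36,
                     memory="108GB")
cluster.scale(10)         # scale the cluster to 10 workers
client = Client(cluster)  # connect this session to the cluster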

Using Your Deployment: Dask


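from dask_kubernetes import KubeCluster
from dask.distributed import Client

# Start a Dask cluster whose workers run as Kubernetes pods
cluster = KubeCluster(n_workers=10)
cluster.scale(10)  # workers can be rescaled at any time

client = Client(cluster)  # connect this session to the cluster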
						
Cloud Optimized GeoTIFFs (COGs)

  • Regular GeoTIFFs
  • Tiled
  • Support HTTP GET Range Requests
  • End users download only the byte range of the GeoTIFF they need
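A range request is an ordinary HTTP GET with a `Range` header. The sketch below builds such a request using only the standard library, without sending it. The URL, byte offset, and tile length are hypothetical; a real client would first read the offsets from the GeoTIFF header.

```python
from urllib.request import Request


def range_request(url, offset, length):
    """Build (but do not send) a GET request for `length` bytes from `offset`."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    return Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


# Hypothetical file and tile location, for illustration only.
req = range_request("https://example.com/scene.tif", offset=1_048_576, length=65_536)

print(req.get_method())         # GET
print(req.get_header("Range"))  # bytes=1048576-1114111
```

A server that supports range requests answers with status 206 (Partial Content) and only the requested bytes, so the client never transfers the rest of the file.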

Typical Raster

Image credit: James Norton (Element 84)

Tiled GeoTIFF

Image credit: James Norton (Element 84)

Zarr

  • Multi-dimensional arrays saved in discrete chunks
  • Each chunk is a file
  • Clients can pull only the chunks they need
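To illustrate how a client can work out which chunks it needs, the sketch below maps a requested index range onto chunk keys. It assumes the dot-separated key naming used by Zarr version 2 stores (for example `0.1.0`); the array shape, chunk shape, and selection are hypothetical, and `chunk_keys` is an illustrative helper, not part of the Zarr library.

```python
import math
from itertools import product


def chunk_keys(selection, chunk_shape):
    """Keys of the chunks overlapped by half-open (start, stop) ranges, one per dimension."""
    ranges = [
        range(start // size, math.ceil(stop / size))
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return [".".join(str(i) for i in index) for index in product(*ranges)]


# Hypothetical (time, lat, lon) array and chunking.
shape = (365, 720, 1440)
chunk_shape = (30, 180, 360)

# Request the first 60 time steps over a small lat/lon box.
needed = chunk_keys([(0, 60), (100, 200), (0, 360)], chunk_shape)
total_chunks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunk_shape))

print(needed)        # ['0.0.0', '0.1.0', '1.0.0', '1.1.0']
print(total_chunks)  # 208
```

Here the request touches 4 of 208 chunks, so only those 4 objects need to be fetched from the store.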