Climate Data Science Lab
Motivation and Goals
Oceanographers and climate scientists have access to a wealth of data (from instruments, satellite observations, and numerical simulations) to help confront the intellectual and societal challenges posed by climate change. But our ability to observe and simulate the climate system has outstripped our ability to actually understand the resulting data. In terms of software, as a field we have focused nearly all our efforts on the computational problem of simulation itself (ocean and climate models), and comparatively little effort on the data analysis side. Currently, we lack software and infrastructure commensurate with the scale of the data and the complexity of the scientific questions. Consequently, many students and postdocs are toiling over low-level data processing challenges, rather than thinking about big-picture scientific questions.
The goal of the Climate Data Science Lab is to leverage our group’s expertise at the intersection of climate science and open source software to develop a new generation of powerful, scalable, and sustainable computational tools for climate data science (a new term we are embracing henceforth). We hope this effort will drive dramatic leaps in productivity across the entire field, leading to exciting new discoveries in oceanography and climate science.
The Climate Data Science Lab builds on the work of the Pangeo Project. Pangeo was founded to improve coordination and community within the open source geoscience world. The Pangeo collaboration has led to dramatic improvements in the integration and scalability of netCDF, Xarray, Dask, Zarr, and related libraries, on both traditional High Performance Computing systems and cloud platforms. The focus of the Climate Data Science Lab is the development of a flexible yet performant layer of high-level software tools for oceanography and climate-specific operations that interoperate with the rest of the Pangeo tools. We are also working on improving the integration between these tools and machine learning libraries such as TensorFlow and PyTorch.
Some of the scientific areas we plan to explore within the Climate Data Science Lab are:
Cross-scale energy transfers in high-resolution models and observations
Air-sea exchange in high-resolution models and observations
Machine learning for ocean remote sensing and sub-grid parameterization
Education and outreach are also important aspects of the lab’s activities. The tools developed within the lab, and within Pangeo more generally, will only have a broad impact if the next generation (and the current generation!) of researchers can easily learn to use them. Lab members will contribute to the development of tutorials and workshops, hold open office hours, and participate in online forums to help spread emerging best practices through the broader climate science community.
The Lab follows a structure that has proved successful so far for the Pangeo project: bringing together software-literate scientists with science-literate software developers. The scientists will define scientific problems which push the boundaries of the currently available software tools. The developers will identify and resolve the technological roadblocks, making contributions wherever deemed necessary.
With the Climate Data Science Lab, we are hoping to pioneer a new sort of scientific working environment focused on collaboration, rather than the traditional model of individual researchers working mostly in isolation. We plan to cultivate the following principles:
Tough scientific challenges are best tackled by a team with diverse backgrounds, skills, and experiences.
Developing reusable tools (e.g. software) is a valuable contribution to scientific research.
Open data, open source, and computational reproducibility are integral to every scientific project.
When it comes to publications, quality is more important than quantity.