5.1. Xarray tutorial#

Xarray is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

5.1.1. Install xarray#

If you run code on your own computer, you need to install xarray. Open the console and enter conda install xarray or execute !pip install xarray in this jupyter notebook.

try:
    import xarray as xr  # check whether xarray installed
except ModuleNotFoundError:
    !pip install xarray==0.19.0  # install xarray
    import xarray as xr

print(xr.__version__)
0.19.0
import pandas as pd
import numpy as np

5.1.2. Introduction to xarray Data Structures#

Like pandas, xarray has two core data structures:

  • DataArray

  • Dataset

The DataArray is designed to hold a single multi-dimensional variable and its coordinates, and the Dataset is designed to hold multiple variables that potentially share the same coordinates.

5.1.3. DataArray#

DataArray has four important attributes:

  • values is a Numpy array and stores the data.

  • dims is a tuple that includes the names of all dimensions of values and its length is the number of dimensions.

  • coords is a dict-like built-in data structure of xarray and includes coordinate list or 1D-array of each dimension.

  • attrs is a dict and you can use attrs to define some custom attributes.

5.1.3.1. Defining a DataArray#

You can use the DataArray() function to create a DataArray. There are 4 arguments corresponding to the four attributes above. The data argument is corresponding to values attribute.

da = xr.DataArray(data=[[1, 2, 3], [2, 3, 4]],
                  dims=['x', 'y'],
                  coords={'x': [10, 20],
                          'y': [10, 20, 30]},
                  attrs={'summary': 'This is a custom DataArray for tutorial',
                         'license': 'CC BY-NC-ND 4.0'})
da
<xarray.DataArray (x: 2, y: 3)>
array([[1, 2, 3],
       [2, 3, 4]])
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
Attributes:
    summary:  This is a custom DataArray for tutorial
    license:  CC BY-NC-ND 4.0
da.values
array([[1, 2, 3],
       [2, 3, 4]])
da.dims
('x', 'y')
da.coords
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
da.attrs
{'summary': 'This is a custom DataArray for tutorial',
 'license': 'CC BY-NC-ND 4.0'}

You can specify the data argument to create a DataArray. By default, dims of DataArray is ('dim_0', 'dim_1', ...) and attrs of DataArray is a empty dict.

da1 = xr.DataArray(data=[[1, 2, 3], 
                         [2, 3, 4]])
da1
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[1, 2, 3],
       [2, 3, 4]])
Dimensions without coordinates: dim_0, dim_1
da1.dims
('dim_0', 'dim_1')
da1.attrs
{}
da1.coords
Coordinates:
    *empty*

5.1.3.2. Change attribute#

da1 = da.copy(deep=True)
# change values
da1.values = np.random.rand(2, 3)
# change dims (input dict)
da1 = da1.rename({'x': 'm', 'y': 'n'})

# change name (input str)
da1 = da1.rename('Spatial Rainfall')

# add attrs
da1.attrs['gen'] = 'generated by numpy randomly'
# delete attrs
del da1.attrs['summary']
da1
<xarray.DataArray 'Spatial Rainfall' (m: 2, n: 3)>
array([[0.47856475, 0.79661643, 0.38884127],
       [0.44622596, 0.17973351, 0.15285284]])
Coordinates:
  * m        (m) int32 10 20
  * n        (n) int32 10 20 30
Attributes:
    license:  CC BY-NC-ND 4.0
    gen:      generated by numpy randomly

5.1.4. Dataset#

Dataset has three important attributes: data_vars, coords and attrs.

  1. data_vars is a dict with each key being the name of the variable or the dimension and each value being one of:

  • 1D-array or list

  • Series of pandas

  • DataFrame of pandas

  • tuple with two elements

  • DataArray

1D-array or list: When value of the dict is a 1D-array or list, the key of the dict will become the name of the dimension and corresponding 1D-array or list will become coordinates of the dimension.

import pandas as pd
ds = xr.Dataset(data_vars={'v1': [1, 2, 3],  # list
                           'v2': np.array([1, 2, 3])})  #1D-array
ds
<xarray.Dataset>
Dimensions:  (v1: 3, v2: 3)
Coordinates:
  * v1       (v1) int32 1 2 3
  * v2       (v2) int32 1 2 3
Data variables:
    *empty*

Series of pandas: When value of the dict is a Series of pandas, the key of the dict will become the name of the variable and values of the Series will become values of the variable. The index of the Series will become values of coordinates of the variable.

ds = xr.Dataset(data_vars={'v1': pd.Series([1, 2, 3], index=['a', 'b', 'c'])})
ds
<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  * dim_0    (dim_0) object 'a' 'b' 'c'
Data variables:
    v1       (dim_0) int64 1 2 3
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s.index.name = 'str'
ds = xr.Dataset(data_vars={'v1': s})
ds
<xarray.Dataset>
Dimensions:  (str: 3)
Coordinates:
  * str      (str) object 'a' 'b' 'c'
Data variables:
    v1       (str) int64 1 2 3

DataFrame of pandas: When value of the dict is a DataFrame, the key of the dict will become the name of the variable and values of the DataFrame will become values of the variable. The index and columns of the DataFrame will become values of coordinates of the variable. Therefore, the variable constructed by DataFrame has two dimensions.

df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'])
df.index.name = 'str'
ds = xr.Dataset(data_vars={'v1': df})
ds
<xarray.Dataset>
Dimensions:  (str: 3, dim_1: 1)
Coordinates:
  * str      (str) object 'a' 'b' 'c'
  * dim_1    (dim_1) int64 0
Data variables:
    v1       (str, dim_1) int64 1 2 3
df = pd.DataFrame([[1, 2],
                   [2, 5], 
                   [3, 6]], index=['a', 'b', 'c'], columns=['c1', 'c2'])
df.index.name = 'str'
ds = xr.Dataset(data_vars={'v1': df})
ds
<xarray.Dataset>
Dimensions:  (str: 3, dim_1: 2)
Coordinates:
  * str      (str) object 'a' 'b' 'c'
  * dim_1    (dim_1) object 'c1' 'c2'
Data variables:
    v1       (str, dim_1) int64 1 2 2 5 3 6

tuple with two elements: When value of the dict is a tuple, the key of the dict will become the name of the variable. The first elements of the tuple will become dimension names of the variable and the second elements of the tuple will become values of the variable.

tp = (('d1', 'd2'), [[1, 2],
                     [2, 5],
                     [3, 6]])
ds = xr.Dataset(data_vars={'v1': tp})
ds
<xarray.Dataset>
Dimensions:  (d1: 3, d2: 2)
Dimensions without coordinates: d1, d2
Data variables:
    v1       (d1, d2) int32 1 2 2 5 3 6

DataArray: When value of the dict is a DataArray, the key of the dict will become the name of the variable. The dimensions and coordinates of the DataArray will be added to the Dataset created.

ds = xr.Dataset(data_vars={'v1': da})
ds
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
Data variables:
    v1       (x, y) int32 1 2 3 2 3 4
  1. coords is also a dict. The key of the dict will be the name of the dimension and the values of the dict will be the coordinate of the dimension.

ds = xr.Dataset(data_vars={'v1': tp},
                coords={'d1': ['a', 'b', 'c'], 'd2': ['c1', 'c2']})
ds
<xarray.Dataset>
Dimensions:  (d1: 3, d2: 2)
Coordinates:
  * d1       (d1) <U1 'a' 'b' 'c'
  * d2       (d2) <U2 'c1' 'c2'
Data variables:
    v1       (d1, d2) int32 1 2 2 5 3 6
  1. attrs is also a dict that is the same as the attrs of DataArray.

5.1.5. Operations and Mathematical Functions#

Data variables can be modified through Mathematical Functions which is the same as Numpy arrays.

da = xr.DataArray(data=[[1, 2, 3],
                        [2, 3, 4]],
                  dims=['x', 'y'],
                  coords={'x': [10, 20],
                          'y': [10, 20, 30]})
print(da * 10)
print(np.log(da))
<xarray.DataArray (x: 2, y: 3)>
array([[10, 20, 30],
       [20, 30, 40]])
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
<xarray.DataArray (x: 2, y: 3)>
array([[0.        , 0.69314718, 1.09861229],
       [0.69314718, 1.09861229, 1.38629436]])
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
ds = xr.Dataset(data_vars={'v1': da})
print(ds * 10)
print(np.log(ds))
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
Data variables:
    v1       (x, y) int32 10 20 30 20 30 40
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
Data variables:
    v1       (x, y) float64 0.0 0.6931 1.099 0.6931 1.099 1.386

5.1.6. Loading Data from netCDF Files#

NetCDF (Network Common Data Format) is a widely used format for distributing geoscience data. For more details about netCDF please access netCDF website.

Call open_dataset function to open a netCDF file. For more details about reading and writing netCDF files please access xarray netCDF docs. Below we load some ERA5 dataset which we have downloaded from online websites.

# data source: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=form
ds = xr.open_dataset('../../assets/data/era5_singapore_2021.9.1.nc')
ds
<xarray.Dataset>
Dimensions:    (longitude: 3, latitude: 3, time: 24)
Coordinates:
  * longitude  (longitude) float32 103.5 103.8 104.0
  * latitude   (latitude) float32 1.5 1.25 1.0
  * time       (time) datetime64[ns] 2021-09-01 ... 2021-09-01T23:00:00
Data variables:
    t2m        (time, latitude, longitude) float32 298.8 298.6 ... 299.7 300.0
    tp         (time, latitude, longitude) float32 0.0004115 ... 0.0002184
Attributes:
    Conventions:  CF-1.6
    history:      2021-10-01 12:00:45 GMT by grib_to_netcdf-2.20.0: /opt/ecmw...

5.1.7. Selecting the Internal Data#

You can directly access Coordinates and Data variables by their name.

ds.t2m
<xarray.DataArray 't2m' (time: 24, latitude: 3, longitude: 3)>
array([[[298.80975, 298.64566, 298.9777 ],
        [299.67108, 300.09094, 300.39368],
        [299.69064, 300.13788, 300.4328 ]],

       [[299.03745, 299.0257 , 299.40656],
        [299.81476, 299.89877, 300.39294],
        [299.74252, 300.01593, 300.48468]],

       ...,

       [[297.7106 , 297.41568, 297.7399 ],
        [299.78677, 299.86887, 300.029  ],
        [299.7594 , 299.7887 , 300.0837 ]],

       [[297.81107, 297.36383, 297.7017 ],
        [299.93607, 299.76227, 299.9537 ],
        [299.7818 , 299.71927, 300.03568]]], dtype=float32)
Coordinates:
  * longitude  (longitude) float32 103.5 103.8 104.0
  * latitude   (latitude) float32 1.5 1.25 1.0
  * time       (time) datetime64[ns] 2021-09-01 ... 2021-09-01T23:00:00
Attributes:
    units:      K
    long_name:  2 metre temperature
ds.time
<xarray.DataArray 'time' (time: 24)>
array(['2021-09-01T00:00:00.000000000', '2021-09-01T01:00:00.000000000',
       '2021-09-01T02:00:00.000000000', '2021-09-01T03:00:00.000000000',
       '2021-09-01T04:00:00.000000000', '2021-09-01T05:00:00.000000000',
       '2021-09-01T06:00:00.000000000', '2021-09-01T07:00:00.000000000',
       '2021-09-01T08:00:00.000000000', '2021-09-01T09:00:00.000000000',
       '2021-09-01T10:00:00.000000000', '2021-09-01T11:00:00.000000000',
       '2021-09-01T12:00:00.000000000', '2021-09-01T13:00:00.000000000',
       '2021-09-01T14:00:00.000000000', '2021-09-01T15:00:00.000000000',
       '2021-09-01T16:00:00.000000000', '2021-09-01T17:00:00.000000000',
       '2021-09-01T18:00:00.000000000', '2021-09-01T19:00:00.000000000',
       '2021-09-01T20:00:00.000000000', '2021-09-01T21:00:00.000000000',
       '2021-09-01T22:00:00.000000000', '2021-09-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2021-09-01 ... 2021-09-01T23:00:00
Attributes:
    long_name:  time
ds.longitude
<xarray.DataArray 'longitude' (longitude: 3)>
array([103.5 , 103.75, 104.  ], dtype=float32)
Coordinates:
  * longitude  (longitude) float32 103.5 103.8 104.0
Attributes:
    units:      degrees_east
    long_name:  longitude

Below you can access the specific value by the index or slices.

# first value of time dimension
ds.time[0]
<xarray.DataArray 'time' ()>
array('2021-09-01T00:00:00.000000000', dtype='datetime64[ns]')
Coordinates:
    time     datetime64[ns] 2021-09-01
Attributes:
    long_name:  time
# get a 2-demension data with first time index 
ds.t2m[0]
<xarray.DataArray 't2m' (latitude: 3, longitude: 3)>
array([[298.80975, 298.64566, 298.9777 ],
       [299.67108, 300.09094, 300.39368],
       [299.69064, 300.13788, 300.4328 ]], dtype=float32)
Coordinates:
  * longitude  (longitude) float32 103.5 103.8 104.0
  * latitude   (latitude) float32 1.5 1.25 1.0
    time       datetime64[ns] 2021-09-01
Attributes:
    units:      K
    long_name:  2 metre temperature

You can also use sel() function to conduct label-based indexing.

ds.t2m.sel(time='2021-09-01 00:00:00')  # select all data of 2021-09-01 00:00:00
<xarray.DataArray 't2m' (latitude: 3, longitude: 3)>
array([[298.80975, 298.64566, 298.9777 ],
       [299.67108, 300.09094, 300.39368],
       [299.69064, 300.13788, 300.4328 ]], dtype=float32)
Coordinates:
  * longitude  (longitude) float32 103.5 103.8 104.0
  * latitude   (latitude) float32 1.5 1.25 1.0
    time       datetime64[ns] 2021-09-01
Attributes:
    units:      K
    long_name:  2 metre temperature
# select data locate in (103.5, 1.5) of 2021-09-01 00:00:00
ds.t2m.sel(longitude=103.5, latitude=1.5, time='2021-09-01 00:00:00')  
<xarray.DataArray 't2m' ()>
array(298.80975, dtype=float32)
Coordinates:
    longitude  float32 103.5
    latitude   float32 1.5
    time       datetime64[ns] 2021-09-01
Attributes:
    units:      K
    long_name:  2 metre temperature
# compute the average value in 2021-09-01 00:00:00
ds.t2m[0].values.mean()
299.65002

5.1.8. Read From URL#

Data used for the CE3201 final project are simulated temperature (unit: K) from the latest climate model outputs, which have been divided into two parts: historical (1850.01.01-2014.12.31) simulations and future (2015.01.01-2100.12.31) projections. You can find these data here.

from urllib.request import urlretrieve
import ssl

# Ignore certificate validation
ssl._create_default_https_context = ssl._create_unverified_context 

# Everyone can access unique data by specifying your student ID and label (future or historical). 
# For example, if your student ID is A0188677A and you will get future data
# you can using following codes to access your data.

student_ID = 'A0188677A'
label = 'future'
filename = '%s_%s.nc'%(student_ID, label)
url = 'https://hydrology.princeton.edu/data/hexg/CE3201_final_project_data/%s'%(filename)
# download file from remote server
urlretrieve(url, filename)
ds = xr.open_dataset(filename)
ds
<xarray.Dataset>
Dimensions:  (model: 10, ssp: 2, time: 31411)
Coordinates:
  * model    (model) object 'MIROC6' 'MPI-ESM1-2-LR' ... 'AWI-CM-1-1-MR'
  * ssp      (ssp) object 'ssp126' 'ssp585'
  * time     (time) datetime64[ns] 2015-01-01T12:00:00 ... 2100-12-31T12:00:00
Data variables:
    tas      (model, ssp, time) float64 ...
Attributes:
    latitude:   3.5896654
    longitude:  98.6738261
    city:       Medan
    StudentID:  A0188677A
# For convenience, the process above is rewritten using function
def read_by_student_ID(student_ID, label):
    from urllib.request import urlretrieve
    import xarray as xr
    import os
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    filename = '%s_%s.nc'%(student_ID, label)
    
    if os.path.exists(filename):
        return xr.open_dataset(filename)
    
    url = 'https://hydrology.princeton.edu/data/hexg/CE3201_final_project_data/%s'%(filename)
    urlretrieve(url, filename)
    return xr.open_dataset(filename)

student_ID = 'A0188677A'
# label = 'future'
label = 'historical'
ds = read_by_student_ID(student_ID, label)
ds
<xarray.Dataset>
Dimensions:  (model: 10, time: 60265)
Coordinates:
  * model    (model) object 'MIROC6' 'MPI-ESM1-2-LR' ... 'AWI-CM-1-1-MR'
  * time     (time) datetime64[ns] 1850-01-01T12:00:00 ... 2014-12-31T12:00:00
Data variables:
    tas      (model, time) float64 ...
Attributes:
    latitude:   3.5896654
    longitude:  98.6738261
    city:       Medan
    StudentID:  A0188677A

5.1.9. References#