# Xarray tutorial

`Xarray` is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

## Install xarray
If you run the code on your own computer, you need to install `xarray` first. Open a console and enter ```conda install xarray```, or execute ```!pip install xarray``` in a cell of this Jupyter notebook.

In [1]:
try:
    import xarray as xr  # check whether xarray installed
except ModuleNotFoundError:
    !pip install xarray==0.19.0  # install xarray
    import xarray as xr

print(xr.__version__)

0.19.0


In [2]:
import pandas as pd
import numpy as np

## Introduction to xarray Data Structures
Like `pandas`, `xarray` has two core data structures:
- `DataArray`
- `Dataset`

The `DataArray` is designed to hold a single multi-dimensional variable and its coordinates, and the `Dataset` is designed to hold multiple variables that potentially share the same coordinates.

## DataArray
A `DataArray` has four important attributes:
* `values` is a `NumPy` array that stores the data.
* `dims` is a `tuple` containing the names of all dimensions of `values`; its length equals the number of dimensions.
* `coords` is a dict-like container provided by xarray that holds the coordinate labels (a list or 1D array) of each dimension.
* `attrs` is a `dict` in which you can store custom attributes.

### Defining a DataArray

You can create a `DataArray` with the `xr.DataArray()` constructor. Its `data`, `dims`, `coords` and `attrs` arguments correspond to the four attributes above, with `data` supplying the `values` attribute.

In [3]:
da = xr.DataArray(data=[[1, 2, 3], [2, 3, 4]],
                  dims=['x', 'y'],
                  coords={'x': [10, 20],
                          'y': [10, 20, 30]},
                  attrs={'summary': 'This is a custom DataArray for tutorial',
                         'license': 'CC BY-NC-ND 4.0'})
da

In [4]:
da.values

array([[1, 2, 3],
       [2, 3, 4]])

In [5]:
da.dims

('x', 'y')

In [6]:
da.coords

Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30

In [7]:
da.attrs

{'summary': 'This is a custom DataArray for tutorial',
 'license': 'CC BY-NC-ND 4.0'}

You can also create a `DataArray` by specifying only the `data` argument. By default, `dims` is `('dim_0', 'dim_1', ...)`, `coords` is empty, and `attrs` is an empty `dict`.

In [8]:
da1 = xr.DataArray(data=[[1, 2, 3], 
                         [2, 3, 4]])
da1

In [9]:
da1.dims

('dim_0', 'dim_1')

In [10]:
da1.attrs

{}

In [11]:
da1.coords

Coordinates:
    *empty*

### Changing attributes

In [12]:
da1 = da.copy(deep=True)
# change values
da1.values = np.random.rand(2, 3)
# change dims (input dict)
da1 = da1.rename({'x': 'm', 'y': 'n'})

# change name (input str)
da1 = da1.rename('Spatial Rainfall')

# add attrs
da1.attrs['gen'] = 'generated by numpy randomly'
# delete attrs
del da1.attrs['summary']
da1
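The same kind of changes can also be written as a chain of method calls that each return a new object. The following is a minimal sketch rather than the only way to do it; the attribute text and the zero-filled values are placeholders chosen for this example.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    data=[[1, 2, 3], [2, 3, 4]],
    dims=["x", "y"],
    coords={"x": [10, 20], "y": [10, 20, 30]},
    attrs={"summary": "This is a custom DataArray for tutorial"},
)

# Each call returns a new object, so `da` itself is left untouched.
da2 = (
    da.copy(data=np.zeros((2, 3)))                    # replace the values, keep the labels
    .rename({"x": "m", "y": "n"})                     # rename the dimensions
    .rename("Spatial Rainfall")                       # set the name of the array
    .assign_attrs(gen="generated for illustration")   # add an attribute (placeholder text)
)

print(da2.dims)   # dimensions of the new array
print(da2.name)   # name of the new array
print(da.dims)    # the original array keeps its dimensions
```

Because nothing is modified in place, you do not need the explicit `copy(deep=True)` step to protect the original array.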

## Dataset
A `Dataset` has three important attributes: `data_vars`, `coords` and `attrs`.
1. `data_vars` is a `dict` in which each key is the name of a variable or a dimension and each value is one of:
  - `1D-array` or `list`
  - `Series` of `pandas`
  - `DataFrame` of `pandas`
  - `tuple`  with two elements
  - `DataArray`

`1D-array` or `list`: When a value of the `dict` is a 1D array or a list, the key becomes the name of a dimension and the 1D array or list becomes the coordinate of that dimension.

In [13]:
import pandas as pd
ds = xr.Dataset(data_vars={'v1': [1, 2, 3],  # list
                           'v2': np.array([1, 2, 3])})  #1D-array
ds

`Series` of `pandas`: When a value of the `dict` is a `pandas` `Series`, the key becomes the name of the variable and the values of the `Series` become the values of the variable. The `index` of the `Series` becomes the coordinate of the variable, and the name of the index, if set, becomes the dimension name.

In [14]:
ds = xr.Dataset(data_vars={'v1': pd.Series([1, 2, 3], index=['a', 'b', 'c'])})
ds

In [15]:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s.index.name = 'str'
ds = xr.Dataset(data_vars={'v1': s})
ds

`DataFrame` of `pandas`: When a value of the `dict` is a `DataFrame`, the key becomes the name of the variable and the values of the `DataFrame` become the values of the variable. The `index` and `columns` of the `DataFrame` become the coordinates of the variable. A variable constructed from a `DataFrame` therefore has two dimensions.

In [16]:
df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'])
df.index.name = 'str'
ds = xr.Dataset(data_vars={'v1': df})
ds

In [17]:
df = pd.DataFrame([[1, 2],
                   [2, 5], 
                   [3, 6]], index=['a', 'b', 'c'], columns=['c1', 'c2'])
df.index.name = 'str'
ds = xr.Dataset(data_vars={'v1': df})
ds
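To check how the pandas index names map to dimension names, you can print `dims` for each case. This is a small sketch of the behaviour described above; the column name `'col'` is a hypothetical name introduced only for this example.

```python
import pandas as pd
import xarray as xr

# Series without an index name: the dimension gets a default name.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])
ds_unnamed = xr.Dataset(data_vars={"v1": s})
print(ds_unnamed["v1"].dims)

# Series with a named index: the index name becomes the dimension name.
s.index.name = "str"
ds_named = xr.Dataset(data_vars={"v1": s})
print(ds_named["v1"].dims)

# DataFrame: the index and the columns each become one dimension.
df = pd.DataFrame([[1, 2], [2, 5], [3, 6]],
                  index=["a", "b", "c"], columns=["c1", "c2"])
df.index.name = "str"
df.columns.name = "col"  # hypothetical name chosen for this example
ds_df = xr.Dataset(data_vars={"v1": df})
print(ds_df["v1"].dims)
print(ds_df["v1"].shape)
```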

`tuple` with two elements: When a value of the `dict` is a `tuple`, the key becomes the name of the variable. The first element of the `tuple` gives the dimension names of the variable and the second element gives the values of the variable.

In [18]:
tp = (('d1', 'd2'), [[1, 2],
                     [2, 5],
                     [3, 6]])
ds = xr.Dataset(data_vars={'v1': tp})
ds

`DataArray`: When a value of the `dict` is a `DataArray`, the key becomes the name of the variable. The dimensions and coordinates of the `DataArray` are added to the new `Dataset`.

In [19]:
ds = xr.Dataset(data_vars={'v1': da})
ds

2. `coords` is also a `dict`. Each key is the name of a dimension and each value is the coordinate of that dimension.

In [20]:
ds = xr.Dataset(data_vars={'v1': tp},
                coords={'d1': ['a', 'b', 'c'], 'd2': ['c1', 'c2']})
ds

3. `attrs` is also a `dict` and works the same way as the `attrs` of a `DataArray`.
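As a short illustration, the sketch below passes all three arguments at once. The attribute text is a placeholder chosen for this example.

```python
import xarray as xr

ds = xr.Dataset(
    data_vars={"v1": (("d1", "d2"), [[1, 2], [2, 5], [3, 6]])},
    coords={"d1": ["a", "b", "c"], "d2": ["c1", "c2"]},
    attrs={"summary": "This is a custom Dataset for tutorial"},  # placeholder text
)

print(ds.attrs)            # the custom attributes
print(list(ds.data_vars))  # names of the data variables
print(list(ds.coords))     # names of the coordinates
```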

## Operations and Mathematical Functions

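Arithmetic operations and `NumPy` mathematical functions can be applied to a `DataArray` or a `Dataset` in the same way as to `NumPy` arrays. The result is a new object that keeps the dimensions and coordinates.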

In [21]:
da = xr.DataArray(data=[[1, 2, 3],
                        [2, 3, 4]],
                  dims=['x', 'y'],
                  coords={'x': [10, 20],
                          'y': [10, 20, 30]})
print(da * 10)
print(np.log(da))

<xarray.DataArray (x: 2, y: 3)>
array([[10, 20, 30],
       [20, 30, 40]])
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
<xarray.DataArray (x: 2, y: 3)>
array([[0.        , 0.69314718, 1.09861229],
       [0.69314718, 1.09861229, 1.38629436]])
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30


In [22]:
ds = xr.Dataset(data_vars={'v1': da})
print(ds * 10)
print(np.log(ds))

<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
Data variables:
    v1       (x, y) int32 10 20 30 20 30 40
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int32 10 20
  * y        (y) int32 10 20 30
Data variables:
    v1       (x, y) float64 0.0 0.6931 1.099 0.6931 1.099 1.386
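Because dimensions have names, reductions and broadcasting can refer to those names instead of positional axes. The sketch below reuses the array from the cells above; the `weights` array is a hypothetical example, not part of the original data.

```python
import xarray as xr

da = xr.DataArray(
    data=[[1, 2, 3], [2, 3, 4]],
    dims=["x", "y"],
    coords={"x": [10, 20], "y": [10, 20, 30]},
)

# Reduce along a named dimension instead of a positional axis.
col_sum = da.sum(dim="x")    # one value per y label
row_mean = da.mean(dim="y")  # one value per x label

# Broadcasting by dimension name: a 1D array along "y" is applied to every row.
weights = xr.DataArray([1, 0, 1], dims=["y"], coords={"y": [10, 20, 30]})
weighted = da * weights

print(col_sum.values)   # [3 5 7]
print(row_mean.values)  # [2. 3.]
print(weighted.values)
```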


## Loading Data from netCDF Files

NetCDF (Network Common Data Form) is a widely used format for distributing geoscience data. For more details about netCDF, see the [netCDF website](https://www.unidata.ucar.edu/software/netcdf/docs/faq.html#whatisit).

Call the `open_dataset` function to open a netCDF file. For more details about reading and writing netCDF files, see the [xarray netCDF docs](http://xarray.pydata.org/en/latest/user-guide/io.html#netcdf). Below we load an ERA5 dataset that was downloaded from the Copernicus Climate Data Store.

In [23]:
# data source: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=form
ds = xr.open_dataset('../../assets/data/era5_singapore_2021.9.1.nc')
ds

## Selecting the Internal Data

You can access `Coordinates` and `Data variables` directly by name, as attributes of the `Dataset`.

In [24]:
ds.t2m

In [25]:
ds.time

In [26]:
ds.longitude

You can then access specific values by integer index or slice.

In [27]:
# first value of time dimension
ds.time[0]

In [28]:
# get the 2-dimensional data at the first time index
ds.t2m[0]

You can also use the `sel()` method for label-based indexing.

In [29]:
ds.t2m.sel(time='2021-09-01 00:00:00')  # select all data of 2021-09-01 00:00:00

In [30]:
# select the value located at (103.5, 1.5) at 2021-09-01 00:00:00
ds.t2m.sel(longitude=103.5, latitude=1.5, time='2021-09-01 00:00:00')  

In [31]:
# compute the average value at 2021-09-01 00:00:00
ds.t2m[0].values.mean()

299.65002
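If you do not have the ERA5 file at hand, you can try the same selections on a small synthetic dataset. The sketch below is only a stand-in: the variable and coordinate names follow the cells above, but the grid and the values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the ERA5 file; the grid and values are illustrative only.
time = pd.date_range("2021-09-01", periods=3, freq="D")
lat = [1.0, 1.5, 2.0]
lon = [103.0, 103.5, 104.0]
data = 299.0 + np.arange(27, dtype="float64").reshape(3, 3, 3)

ds = xr.Dataset(
    data_vars={"t2m": (("time", "latitude", "longitude"), data)},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# Exact label match on all three dimensions gives a single value.
point = ds.t2m.sel(time="2021-09-01 00:00:00", latitude=1.5, longitude=103.5)

# method="nearest" picks the closest grid point when the label is not on the grid.
nearest = ds.t2m.sel(latitude=1.4, longitude=103.6, method="nearest")

# A slice of labels selects a range; both end points are included.
subset = ds.t2m.sel(latitude=slice(1.0, 1.5))

# Position-based selection by dimension name, followed by a reduction.
first_mean = ds.t2m.isel(time=0).mean()

print(float(point))       # 303.0
print(nearest.values)     # one value per time step
print(subset.shape)       # (3, 2, 3)
print(float(first_mean))  # 303.0
```

Note that a label slice follows the order of the coordinate, so for a coordinate stored in descending order the slice bounds must be given in descending order as well.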

## Read From URL

Data used for the CE3201 final project are simulated temperature (unit: K) from the latest climate model outputs, which have been divided into two parts: historical (1850.01.01-2014.12.31) simulations and future (2015.01.01-2100.12.31) projections. You can find these data [here](https://hydrology.princeton.edu/data/hexg/CE3201_final_project_data/).

In [32]:
from urllib.request import urlretrieve
import ssl

# Disable certificate validation for HTTPS requests (this is insecure; use it only for a server you trust)
ssl._create_default_https_context = ssl._create_unverified_context 

# Each student can access their own data by specifying a student ID and a label (future or historical).
# For example, if your student ID is A0188677A and you want the future data,
# you can use the following code to access your data.

student_ID = 'A0188677A'
label = 'future'
filename = '%s_%s.nc'%(student_ID, label)
url = 'https://hydrology.princeton.edu/data/hexg/CE3201_final_project_data/%s'%(filename)
# download file from remote server
urlretrieve(url, filename)
ds = xr.open_dataset(filename)
ds

In [33]:
# For convenience, the process above is wrapped in a function
def read_by_student_ID(student_ID, label):
    from urllib.request import urlretrieve
    import xarray as xr
    import os
    import ssl
    
    ssl._create_default_https_context = ssl._create_unverified_context
    
    filename = '%s_%s.nc'%(student_ID, label)
    
    if os.path.exists(filename):
        return xr.open_dataset(filename)
    
    url = 'https://hydrology.princeton.edu/data/hexg/CE3201_final_project_data/%s'%(filename)
    urlretrieve(url, filename)
    return xr.open_dataset(filename)

student_ID = 'A0188677A'
# label = 'future'
label = 'historical'
ds = read_by_student_ID(student_ID, label)
ds
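The file naming and caching logic above can also be separated from the download itself, which makes it easier to check. The following is a sketch under stated assumptions: the helper names `build_filename` and `download_if_missing` are hypothetical, the two accepted labels are taken from the description above, and certificate verification is left at Python's default, which assumes the server presents a valid certificate.

```python
import os
from urllib.request import urlretrieve

BASE_URL = "https://hydrology.princeton.edu/data/hexg/CE3201_final_project_data/"
VALID_LABELS = ("historical", "future")


def build_filename(student_id, label):
    """Return the file name '<student_id>_<label>.nc' after checking the label."""
    if label not in VALID_LABELS:
        raise ValueError(f"label must be one of {VALID_LABELS}, got {label!r}")
    return f"{student_id}_{label}.nc"


def download_if_missing(student_id, label, directory="."):
    """Download the file only if no local copy exists and return the local path."""
    filename = build_filename(student_id, label)
    path = os.path.join(directory, filename)
    if not os.path.exists(path):
        # Certificate verification stays enabled here (an assumption of this sketch).
        urlretrieve(BASE_URL + filename, path)
    return path


print(build_filename("A0188677A", "historical"))  # A0188677A_historical.nc
```

You could then open the result with `xr.open_dataset(download_if_missing(student_ID, label))`. Checking the label first turns a typo into an immediate error instead of a failed download.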

## References
+ [Xarray Fundamentals of Earth and Environmental Data Science](https://earth-env-data-science.github.io/lectures/xarray/xarray.html)
+ [Xarray documentation](http://xarray.pydata.org/en/latest/index.html)