HW#1: Extreme Rainfall Deficit in Singapore

HW#1: Extreme Rainfall Deficit in Singapore#

Objectives

This homework will help you gain a better understanding in terms of the ways how to:

Fit Generalized Extreme Value (GEV) distribution
Estimate the return level of extreme rainfall deficit

Happy coding!

Submission Guide

Deadline: Sunday 11:59 pm, 2nd November 2025 (Note: Late submissions will not be accepted).

Please upload your solutions to Canvas in a Jupyter Notebook format with the name “Homework1_StudentID.ipynb”. Make sure to write down your student ID and full name in the cell below.

For any questions, feel free to contact Prof. Xiaogang HE (hexg@nus.edu.sg), Kewei ZHANG (kewei_zhang@u.nus.edu) or Yifan LU (yifan_lu@u.nus.edu).

## Fill your student ID and full name below.

# Student ID:
# Full name:

Data: You will need to use the historical (1981-2020) daily total rainfall at Singapore’s Changi station for this homework. You can create a DataFrame using Pandas by reading file “../../assets/data/Changi_daily_rainfall.csv”.

Q1: Calculate daily rainfall statistics (10 marks)#

Calculate the following statistics for daily rainfall during DJF (December-January-February): (i) mean, (ii) variance, (iii) skewness, and (iv) kurtosis.

Hint:

You can filter the daily rainfall time series for DJF using Pandas’ boolean filtering method. Details on filtering values can be found in the Pandas tutorial.
DJF spans across two calendar years. Make sure you only include complete DJF seasons. For the period 1891 to 2020, this results in 39 complete DJF seasons, from DJF 1981-1982 to DJF 2019-2020.

# Your solutions go here.
# using the + icon in the toolbar to add a cell.

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy.optimize import fsolve
from scipy.stats import genextreme as gev

rain = pd.read_csv('../assets/data/Changi_daily_rainfall.csv', index_col = 0, header = 0, parse_dates = True)
display(rain.head(10))

	Daily Rainfall Total (mm)
Date
1981-01-01	0.0
1981-01-02	0.0
1981-01-03	0.0
1981-01-04	0.0
1981-01-05	0.0
1981-01-06	0.3
1981-01-07	51.4
1981-01-08	0.0
1981-01-09	4.4
1981-01-10	0.0

# Filter daily rainfall for December-January-February (DJF)
DJF_rain = rain.loc[(rain.index.month < 3) | (rain.index.month > 11)]
# Select the complete DJF periods
DJF_rain = DJF_rain.loc['19811201':'20200301']
# Calculate the following statistics for daily rainfall during DJF: (i) mean; (ii) variance; (iii) skewness; and (iv) kurtosis.
DJF_rain_stats = pd.Series([
                            DJF_rain.mean().values[0], 
                            DJF_rain.var().values[0], 
                            DJF_rain.skew().values[0], 
                            DJF_rain.kurtosis().values[0]],
                            index=['Mean', 'Variance', 'Skewness', 'Kurtosis']
                            )
display(DJF_rain_stats)

Mean          7.203409
Variance    340.698022
Skewness      4.909512
Kurtosis     33.412606
dtype: float64

# Add 'Year' column to denote the year of the days
DJF_rain['Year'] = np.where(DJF_rain.index.month==12, DJF_rain.index.year.values, DJF_rain.index.year.values-1)
# Group by DJF season year
DJF_rain_grouped = DJF_rain.groupby('Year')['Daily Rainfall Total (mm)']

# Compute each statistic explicitly
DJF_stats = pd.DataFrame({
    'mean': DJF_rain_grouped.mean(),
    'variance': DJF_rain_grouped.var(),
    'skew': DJF_rain_grouped.skew(),
    'fisher_kurtosis': DJF_rain_grouped.apply(pd.Series.kurt),
    'pearson_kurtosis': DJF_rain_grouped.apply(pd.Series.kurt) + 3
})

DJF_stats

	mean	variance	skew	fisher_kurtosis	pearson_kurtosis
Year
1981	5.803333	177.008865	3.031608	9.601951	12.601951
1982	8.285556	323.570688	3.226329	12.567868	15.567868
1983	12.002198	782.667106	3.819583	16.588514	19.588514
1984	5.197778	133.260220	3.113894	10.226615	13.226615
1985	7.500000	298.638427	3.266719	11.947807	14.947807
1986	11.363333	590.704596	3.260409	12.065640	15.065640
1987	5.147253	137.969409	3.144627	11.282675	14.282675
1988	3.388889	97.620325	3.522151	12.658900	15.658900
1989	3.902222	116.119995	4.447478	22.850849	25.850849
1990	4.116667	70.843652	2.353044	4.673089	7.673089
1991	7.017582	262.673687	2.872709	7.943237	10.943237
1992	8.314444	249.255856	3.230577	15.437086	18.437086
1993	5.543333	137.374393	3.096425	11.243004	14.243004
1994	12.320000	756.400944	3.631821	14.380162	17.380162
1995	7.207692	182.731607	2.104027	3.620828	6.620828
1996	3.403333	98.217854	4.753180	27.742527	30.742527
1997	7.257778	237.065613	3.003575	9.452063	12.452063
1998	8.052222	247.986793	2.780940	8.540774	11.540774
1999	9.379121	418.785226	3.387947	13.793709	16.793709
2000	8.315556	347.016834	4.303823	22.304759	25.304759
2001	9.788889	766.960774	5.049193	32.020930	35.020930
2002	9.690000	611.534618	5.520379	36.711347	39.711347
2003	9.953846	871.404957	4.428337	20.977939	23.977939
2004	3.460000	137.696135	6.225783	46.015389	49.015389
2005	8.176667	401.361809	3.874872	18.147534	21.147534
2006	14.683333	1150.105225	3.395738	12.801657	15.801657
2007	9.454945	741.790059	4.322671	19.666771	22.666771
2008	5.381111	176.916605	3.143806	10.462753	13.462753
2009	2.951111	51.758482	3.334967	12.384754	15.384754
2010	7.973333	701.418607	6.137320	44.396494	47.396494
2011	5.212088	151.237297	4.219555	22.839396	25.839396
2012	11.340000	431.479056	2.199296	4.277025	7.277025
2013	4.708889	256.581718	5.420304	33.760213	36.760213
2014	3.822222	93.278602	2.811096	7.500877	10.500877
2015	6.758242	209.424459	2.769551	8.057443	11.057443
2016	7.217778	162.940804	2.828160	9.981833	12.981833
2017	7.477778	212.649613	2.526075	6.237181	9.237181
2018	2.975556	73.031755	3.572999	12.654739	15.654739
2019	6.317582	240.277243	2.933400	8.244971	11.244971

Q2: Preprocess daily rainfall data (20 marks)#

Find the seasonal maximum rainfall deficit for DJF, based on the 15-day centered moving average rainfall deficit.

To do this:

Compute the 15-day centered moving average of daily rainfall. (10 marks)
Calculate the daily rainfall deficit by subtracting the 15-day moving average rainfall from the mean rainfall calculated in Q1. This will be used in Q6. (5 marks)
For each DJF season, take the maximum deficit (one value per season). This yields 39 seasonal maxima that are used for Q3–Q5 (block-maxima approach). (5 marks)

# Calculate the 15-day moving average for each year
DJF_rain_a = DJF_rain.groupby(['Year']).rolling(15, center=True).mean()
DJF_rain_a = DJF_rain_a.dropna(subset='Daily Rainfall Total (mm)')
DJF_rain_a.rename(columns={'Daily Rainfall Total (mm)': '15-day Moving Average Rainfall (mm)'}, inplace=True)

# Method 1 — Single mean across all DJF data
DJF_rain_def1 = DJF_rain['Daily Rainfall Total (mm)'].mean() - DJF_rain_a.iloc[:, 0]
DJF_rain_def1 = DJF_rain_def1.to_frame('15-day Moving Average Rainfall Deficit (mm)')

# Method 2 — Year-specific mean from DJF_stats
DJF_rain_def2 = DJF_stats['mean'].rsub(DJF_rain_a.iloc[:, 0], level='Year')
DJF_rain_def2 = DJF_rain_def2.to_frame('15-day Moving Average Rainfall Deficit (mm)')

# Change method here
DJF_rain_def = DJF_rain_def1
display(DJF_rain_def)

		15-day Moving Average Rainfall Deficit (mm)
Year	Date
1981	1981-12-08	-7.509924
	1981-12-09	-11.056591
	1981-12-10	-12.843258
	1981-12-11	-14.963258
	1981-12-12	-12.329924
...	...	...
2019	2020-02-18	3.843409
	2020-02-19	5.430076
	2020-02-20	5.430076
	2020-02-21	5.430076
	2020-02-22	5.430076

2974 rows × 1 columns

# Create a figure with appropriate size
plt.figure(figsize=(15, 8))

# Plot the original time series
plt.plot(DJF_rain_a.index.get_level_values('Date'), 
         DJF_rain_a['15-day Moving Average Rainfall (mm)'], 
         'b-', alpha=0.5, label='15-day Moving Average')

# Calculate and plot annual mean
annual_mean = DJF_rain_a.groupby(level='Year')['15-day Moving Average Rainfall (mm)'].mean()
for year in annual_mean.index:
    year_data = DJF_rain_a.xs(year, level='Year')
    start_time = year_data.index[0]
    end_time = year_data.index[-1]
    plt.hlines(y=annual_mean[year], 
               xmin=start_time,
               xmax=end_time,
               color='r', linestyle='--', alpha=0.7, linewidth=4)

# Calculate the overall mean
overall_mean = DJF_rain_a['15-day Moving Average Rainfall (mm)'].mean()
plt.axhline(y=overall_mean, color='g', linestyle='-', label='Overall Mean')

plt.title('15-day Moving Average Rainfall with Seasonal Means')
plt.xlabel('Date')
plt.ylabel('Rainfall (mm)')

# Create custom legend with correct line styles
from matplotlib.lines import Line2D
custom_lines = [
    Line2D([0], [0], color='b', linestyle='-', alpha=0.5),
    Line2D([0], [0], color='g', linestyle='-'),
    Line2D([0], [0], color='r', linestyle='--', alpha=0.7)
]
plt.legend(custom_lines, ['15-day Moving Average', 'Overall Mean', 'Seasonal Mean'])

plt.grid(True, alpha=0.3)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

../../_images/2d2be7981eeeabd5816bc9fc5afff62b3a3ee0cad6c6845a95158944a52fbe79.png

# Calculate the maximum deficit for each year
DJF_max_def = pd.DataFrame(DJF_rain_def['15-day Moving Average Rainfall Deficit (mm)'].groupby('Year').max())
DJF_max_def.columns = ['DJF Maximum Rainfall Deficit (mm)']
DJF_max_def

	DJF Maximum Rainfall Deficit (mm)
Year
1981	7.203409
1982	7.170076
1983	5.850076
1984	7.203409
1985	7.203409
1986	7.203409
1987	7.050076
1988	7.203409
1989	7.103409
1990	6.850076
1991	6.870076
1992	7.170076
1993	7.203409
1994	6.983409
1995	6.130076
1996	7.156742
1997	7.130076
1998	6.663409
1999	6.223409
2000	5.783409
2001	7.203409
2002	2.523409
2003	7.203409
2004	7.203409
2005	6.630076
2006	7.196742
2007	7.203409
2008	7.203409
2009	7.070076
2010	7.163409
2011	6.416742
2012	3.923409
2013	7.203409
2014	7.203409
2015	6.163409
2016	6.443409
2017	7.190076
2018	7.190076
2019	7.016742

Q3: Fit the GEV distribution (20 marks)#

Fit a GEV distribution to the time series of seasonal maximum rainfall deficits. To do this, estimate the GEV parameters using (i) Maximum Likelihood and (ii) L-Moments, respectively. (Details on fitting a GEV distribution can be found in the Scipy tutorial)

# Fit the distribution using MLE
cMLE, locMLE, scaleMLE = gev.fit(DJF_max_def, method = "MLE")
MLEGEV = gev(cMLE, loc = locMLE, scale = scaleMLE) 

# Calculate L-moments based on samples
def samlmom3(sample):
    """
    samlmom3 returns the first three L-moments of samples
    sample is the 1-d array
    n is the total number of the samples, j is the j_th sample
    """
    n = len(sample)
    sample = np.sort(sample.reshape(n))[::-1]
    b0 = np.mean(sample)
    b1 = np.array([(n - j - 1) * sample[j] / n / (n - 1)
                   for j in range(n)]).sum()
    b2 = np.array([(n - j - 1) * (n - j - 2) * sample[j] / n / (n - 1) / (n - 2)
                   for j in range(n - 1)]).sum()
    lmom1 = b0
    lmom2 = 2 * b1 - b0
    lmom3 = 6 * (b2 - b1) + b0

    return lmom1, lmom2, lmom3

def pargev_fsolve(lmom):
    """
    pargev_fsolve estimates the parameters of the Generalized Extreme Value 
    distribution given the L-moments of samples
    """
    lmom_ratios = [lmom[0], lmom[1], lmom[2] / lmom[1]]
    f = lambda x, t: 2 * (1 - 3**(-x)) / (1 - 2**(-x)) - 3 - t
    G = fsolve(f, 0.01, lmom_ratios[2])[0]
    para3 = G
    GAM = math.gamma(1 + G)
    para2 = lmom_ratios[1] * G / (GAM * (1 - 2**-G))
    para1 = lmom_ratios[0] - para2 * (1 - GAM) / G
    return para1, para2, para3

# Fit the distribution using LMM
LMM = samlmom3(DJF_max_def.values)
locLMM, scaleLMM, cLMM = pargev_fsolve(LMM)
LMMGEV = gev(cLMM, loc=locLMM, scale=scaleLMM) 

print('Parameters fitted by Maximum Likelihood: shape={:.3f}, loc={:.3f}, scale={:.3f}\n'
      .format(cMLE, locMLE, scaleMLE,))
print('Parameters fitted by L-Moments: shape={:.3f}, loc={:.3f}, scale={:.3f}\n'
      .format(cLMM, locLMM, scaleLMM))

Parameters fitted by Maximum Likelihood: shape=1.700, loc=6.651, scale=0.940

Parameters fitted by L-Moments: shape=2.065, loc=6.993, scale=0.472

Q4: Estimate the return level of the extreme events (20 marks)#

Using the GEV parameters estimated with L-Moments in Q2, estimate the rainfall deficit for events with return periods of 10 years, 50 years, 100 years, and 1000 years.

# Your solutions go here.
# using the + icon in the toolbar to add a cell.

T = np.arange(2, 1001)
rainDef_lmm = LMMGEV.ppf(1 - 1.0 / T)
rainDef_lmm = pd.DataFrame(rainDef_lmm, index = T, 
                      columns = ["DJF Maximum Rainfall Deficit (mm)"])
rainDef_lmm.index.name = "Return period"

display(rainDef_lmm.loc[[10, 50, 100, 1000]])

	DJF Maximum Rainfall Deficit (mm)
Return period
10	7.219571
50	7.221692
100	7.221747
1000	7.221764

Q5: Test the goodness-of-fit (20 marks)#

In this task, you will compare how different distributions fit the same dataset and interpret the results using both statistical analyses.

Repeat the distribution fitting as in Q2, but this time using a normal distribution and the Maximum Likelihood method. (5 marks)
Use the Kolmogorov-Smirnov (KS) test to evaluate the goodness-of-fit for both the normal distribution and the GEV distribution you obtained in Q2. (Details on the KS test can be found in the Scipy tutorial) (10 marks)
Based on the KS test results, discuss how well each distribution (Normal and GEV) fits the data. (5 marks)

Bonus (5 marks):

Plot the CDF (Cumulative Distribution Function) to visually compare the fitted normal distribution, the GEV distribution from Q2, and the empirical distribution derived from the data. Compare the behavior of the two distributions at different return periods. Are the KS statistic results consistent with your observations from the CDF plot?

Hint: You can reuse the empirical distribution estimation and CDF plotting code from the Scipy tutorial.

# Your solutions go here.
# using the + icon in the toolbar to add a cell.

# Fit the normal distribution using MLE
from scipy.stats import norm
locMLE, scaleMLE = norm.fit(DJF_max_def, method = "MLE")
MLEnorm = norm(loc = locMLE, scale = scaleMLE) 

from scipy.stats import kstest
GEVKS = kstest(DJF_max_def["DJF Maximum Rainfall Deficit (mm)"], MLEGEV.cdf)
print('Goodness-of-fit for GEV distribution: statisic is {:.3f}, pvalue is {:.3f}'.format(GEVKS.statistic, GEVKS.pvalue))
NORMKS = kstest(DJF_max_def["DJF Maximum Rainfall Deficit (mm)"], MLEnorm.cdf)
print('Goodness-of-fit for Normal dirstribution: statisic is {:.3f}, pvalue is {:.3f}'.format(NORMKS.statistic, NORMKS.pvalue))

Goodness-of-fit for GEV distribution: statisic is 0.333, pvalue is 0.000
Goodness-of-fit for Normal dirstribution: statisic is 0.307, pvalue is 0.001

# Plot empirical CDF
plt.scatter(DJF_max_def.sort_values(by = ["DJF Maximum Rainfall Deficit (mm)"]),
            np.arange(1, DJF_max_def.size+1, dtype=float)/DJF_max_def.size,
            color = 'k', facecolors='none', label='Empirical CDF')

# Plot estimated PDF based on maximum likelihood method
bins = np.linspace(MLEGEV.ppf(0.01), MLEGEV.ppf(0.99), 100)
plt.plot(bins, MLEGEV.cdf(bins), 
        lw=2, label='Estimated CDF (GEV)')
# Plot estimated PDF based on the method of L-moments
bins = np.linspace(MLEnorm.ppf(0.01), MLEnorm.ppf(0.99), 100)
plt.plot(bins, MLEnorm.cdf(bins), 
         lw=2, label='Estimated CDF (Normal)')

plt.xlabel("DJF Maximum Rainfall Deficit (mm)")
plt.legend(loc='best', frameon=False)
plt.show()

../../_images/4c81406ad9f4df3dc4e5950cae920f42acdfd002930a71afe850e3089638b225.png

# Both the GEV and normal distributions fail to accurately represent the data, as indicated by the extremely low KS test p-values.
# Nevertheless, the CDF comparison shows that the GEV distribution reproduces the observed shape more closely than the normal distribution.
# The statistical discrepancy for the GEV fit may arise because the MLE-based parameter estimation is insufficient for this dataset—potentially due to the limited sample size, temporal non-stationarity, or deviations from the GEV assumptions.
# In contrast, the normal distribution is clearly unsuitable, as it substantially overestimates the probability of large rainfall deficits (>7 mm).
# Overall, the GEV distribution provides a relatively better fit, but alternative estimation methods or distribution families should be explored to improve model performance.

Q6: Simple peaks over threshold & histogram (10 marks)#

Using the same 15-day centered moving-average deficit obtained in Q2:

Compute the 95th percentile of the pooled DJF moving-average deficit and set this as the threshold 𝑢. Report the numeric value of 𝑢. (5 marks)
Plot a histogram of the seasonal exceedance counts \(𝑘_s\) in 39 years, with a 1-2 line caption describing the distribution. Day \(i\) is considered as an exceedance if the deficit is larger than the threshold (\(x_i-u>0\)). (5 marks)