3.1. Pandas tutorial#

3.1.1. Introduction to Pandas Data Structures#

The two most common data structures of pandas are

  • Series

  • DataFrame

The Series is designed to accommodate one-dimensional ndarray data with axis labels (e.g., time series data), and the DataFrame is designed for two-dimensional tabular data with labeled axes (rows and columns).
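
For a first taste, here is a minimal sketch of both structures (the values are made up for illustration; each structure is covered in detail below).

import pandas as pd

# A Series: one-dimensional values with an axis of labels
s = pd.Series([30.2, 0.0], index=["2021-01-01", "2021-01-02"])

# A DataFrame: two-dimensional tabular data with labeled rows and columns
df = pd.DataFrame({"rainfall": [30.2, 0.0], "station": ["Changi", "Changi"]})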

3.1.2. Series#

As shown below, the structure of the Series object is simple: it consists of two columns of the same length - Index and Value. The Value column holds the data (of any NumPy type), and each value is associated with a label in the Index column.

Index  Value
0         -1
1          3
2          8

3.1.2.1. Defining a Series#

We can call the pd.Series() function to create a Series object and specify the argument data to be a NumPy array, a list, a dictionary, or a scalar value.

import numpy as np
import pandas as pd

# Since 'data' is the first positional parameter, you can omit 'data=' and pass the value directly
s1 = pd.Series(data=np.array([-1, 3, 8]))    # Numpy array
s2 = pd.Series(data=[-1, 3, 8])              # List
s3 = pd.Series(data={'a':-1, 'b':3, 'c':8})  # Dictionary
s4 = pd.Series(data=-1)                      # Scalar
print(s1)
print(s2)
print(s3)
print(s4)
0   -1
1    3
2    8
dtype: int64
0   -1
1    3
2    8
dtype: int64
a   -1
b    3
c    8
dtype: int64
0   -1
dtype: int64

As you can see from the output of the Series object, the left column is the Index, which is a series of labels, and the right column contains the corresponding values.

Note

If you do not specify the value of index when creating the Series, by default, pandas will assign sequential integers starting from 0 as the index. If the data is a dictionary, its keys will be used as the index of the Series.
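
One related subtlety, worth a quick sketch (this is standard pandas behavior, though not shown above): if data is a dict and you also pass index, the Series is built by looking up each index label among the dict keys, and labels without a matching key become NaN.

s = pd.Series({'a': -1, 'b': 3, 'c': 8}, index=['b', 'c', 'd'])
print(s)
# b    3.0
# c    8.0
# d    NaN
# dtype: float64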

In general, it is better to create a Series object with meaningful labels to distinguish and identify each value. When constructing the Series, the labels are added by specifying the value of the index parameter or by passing the data as a dict object, as mentioned above.

s5 = pd.Series([-1, 3, 8], index=['x', 'y', 'z'])  # specify the index by 'index' option
print(s5)
x   -1
y    3
z    8
dtype: int64

Note

Always keep in mind that when specifying the argument data as a NumPy array, the values contained in the array are not copied, but are passed by reference. That is, the array is referenced dynamically by the new Series object: if the values of the NumPy array change, those changes will also be present in the Series object.

arr = np.array([100, 20, -3])

s6 = pd.Series(arr, index=['x', 'y', 'z'] )
print(arr)
print(s6)

arr[2] = -40
print('\n')
print(arr)
print("After reassigning value, the value of series \nalso will be updated automatically:")
print(s6)
[100  20  -3]
x    100
y     20
z     -3
dtype: int64


[100  20 -40]
After reassigning value, the value of series 
also will be updated automatically:
x    100
y     20
z    -40
dtype: int64

As you can see in this example, by changing the third element of the arr array, we also modified the corresponding element in the s6 series.
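
If you want the Series to own an independent copy of the data instead, a minimal remedy is to pass copy=True to the constructor:

arr = np.array([100, 20, -3])
s7 = pd.Series(arr, index=['x', 'y', 'z'], copy=True)  # copy the data instead of referencing it
arr[2] = -40
print(s7['z'])  # still -3; s7 is decoupled from arr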

You can access the index and values attributes of the Series to see them separately.

print(s6.values)
print(s6.index)
[100  20 -40]
Index(['x', 'y', 'z'], dtype='object')

3.1.2.2. Selecting the Elements#

You can select any individual element, in the same way as with NumPy arrays, by specifying its position with the .iloc attribute of the Series.

s = pd.Series(data={'a':-1, 'b':3, 'c':8})
s.iloc[1]
3

Or you can specify the value of the index.

s['a']
-1

Similar to NumPy arrays, you can also select multiple items using position-based or label-based slices.

print(s[0:2])
print(s['a':'c'])
a   -1
b    3
dtype: int64
a   -1
b    3
c    8
dtype: int64

Note

Notice that a position-based slice does not include the last element of the slice (like NumPy arrays), whereas a label-based slice does include the last element.
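
To make the intent explicit, you can use .iloc for position-based slices and .loc for label-based slices; a small sketch equivalent to the bracket slices above:

print(s.iloc[0:2])     # position-based: position 2 is excluded
print(s.loc['a':'c'])  # label-based: label 'c' is included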

3.1.2.3. Assigning Values to the Elements#

s['a'] = 100
s.iloc[2] = 80
s
a    100
b      3
c     80
dtype: int64

3.1.2.4. Filtering Values#

If you need to get the indexes and values of the elements in the series that are greater than 4, you can write the following:

s = pd.Series([1, 3, 5, 2, 10])
idx_greater = s > 4
# Filter according to the Boolean mask
print(idx_greater)
print(s[idx_greater])  # greater than 4
0    False
1    False
2     True
3    False
4     True
dtype: bool
2     5
4    10
dtype: int64

Or you can get the elements that satisfy multiple conditions by writing the following

# You cannot use the keywords `or`, `and`, `not`; use the corresponding operators (|, & and ~) instead
filter_condition = (s.isin([2, 5]) | (s<4)) & ~(s>2) 
print(filter_condition)
print(s[filter_condition])
0     True
1    False
2    False
3     True
4    False
dtype: bool
0    1
3    2
dtype: int64

3.1.2.5. Operations and Mathematical Functions#

The Series object can participate in common mathematical operations in a similar way to Numpy arrays.

s * 2.5
0     2.5
1     7.5
2    12.5
3     5.0
4    25.0
dtype: float64
np.exp(s)
0        2.718282
1       20.085537
2      148.413159
3        7.389056
4    22026.465795
dtype: float64

3.1.2.6. NaN Value#

NaN stands for Not a Number and generally indicates a missing value. NaN values need to be addressed before data analysis.

# Declaring a Series including NaN values
s = pd.Series([1, np.nan, 10, 9, -2, np.nan])
s
0     1.0
1     NaN
2    10.0
3     9.0
4    -2.0
5     NaN
dtype: float64

Call the isnull() or notnull() methods to generate Boolean values and get the indexes corresponding to the NaN values.

print(s.isnull())
print(s.notnull())
0    False
1     True
2    False
3    False
4    False
5     True
dtype: bool
0     True
1    False
2     True
3     True
4     True
5    False
dtype: bool

Based on the generated Boolean values, a Series containing only the NaN values, or one without them, can be obtained using the filtering methods mentioned above.

print(s[s.isnull()])
print(s[s.notnull()])
1   NaN
5   NaN
dtype: float64
0     1.0
2    10.0
3     9.0
4    -2.0
dtype: float64

3.1.2.7. Operations of Multiple Series#

s = pd.Series({"Singapore": 30, "Malaysia": 23, "Vietnam": 36, "Cambodia": 41})
s1 = pd.Series({"China": 51, "Japan": 73, "Vietnam": 36, "Laos": 31})
s * s1
Cambodia        NaN
China           NaN
Japan           NaN
Laos            NaN
Malaysia        NaN
Singapore       NaN
Vietnam      1296.0
dtype: float64

As you can see, the operation is only carried out on elements whose labels appear in both Series; labels present in only one of them yield NaN.
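
If you want labels that appear in only one Series to survive instead of becoming NaN, the method form of each operator accepts a fill_value; a minimal sketch with mul(), the method form of *:

# Missing labels are treated as 1 before multiplying
s.mul(s1, fill_value=1)
# Cambodia       41.0
# China          51.0
# Japan          73.0
# Laos           31.0
# Malaysia       23.0
# Singapore      30.0
# Vietnam      1296.0
# dtype: float64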

3.1.3. DataFrame#

Compared with the Series, the DataFrame can contain two-dimensional data. Its rows and columns are labeled by its index and columns, respectively (for a DataFrame without multiple indexes), as sketched below.

index  numeric     string  boolean
0           -1  Singapore     True
1            3      China     True
2            8      Japan    False

3.1.3.1. Defining a DataFrame#

Call the DataFrame() function to create a DataFrame. A NumPy array, a list, or a dict can all be passed as the data argument.

df = pd.DataFrame([["Malaysia", "Kuala Lumpur", 'daodao', False],
                   ["Singapore", "Singapore", 5850342, True],
                   ["Vietnam", "Hanoi", 97338579, True]],
                  columns=["Country", "Capital", "Population", "Isdeveloped"],
                  index=["a", "b", "c"])
print(df)
     Country       Capital Population  Isdeveloped
a   Malaysia  Kuala Lumpur     daodao        False
b  Singapore     Singapore    5850342         True
c    Vietnam         Hanoi   97338579         True
# Array
df = pd.DataFrame(
    np.array([[14, 35, 35, 35], [19, 34, 57, 34], [42, 74, 49, 59]]))
print(df)

# List: use the 'columns' and 'index' parameters to specify the columns and index of the generated dataframe
df = pd.DataFrame([["Malaysia", "Kuala Lumpur", 32365999, False],
                   ["Singapore", "Singapore", 5850342, True],
                   ["Vietnam", "Hanoi", 97338579, True]],
                  columns=["Country", "Capital", "Population", "Isdeveloped"],
                  index=["a", "b", "c"])
print(df)

# dict
df = pd.DataFrame({"Country": ["Malaysia", "Singapore", "Vietnam"],
                   "Capital": ["Kuala Lumpur", "Singapore", "Hanoi"],
                   "Population": [32365999, 5850342, 97338579],
                   "Isdeveloped": [False, True, True]},
                  index=["a", "b", "c"])
print(df)
    0   1   2   3
0  14  35  35  35
1  19  34  57  34
2  42  74  49  59
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True

3.1.3.2. Selecting the Internal Elements#

Similar to Series, two approaches can be used to select elements from a DataFrame: call iloc[] and loc[] to select elements by position and by label, respectively.

# use ':' to select everything along an axis
df.iloc[:, 0:2]
     Country       Capital
a   Malaysia  Kuala Lumpur
b  Singapore     Singapore
c    Vietnam         Hanoi
df.loc[:, "Country":"Population"]
     Country       Capital  Population
a   Malaysia  Kuala Lumpur    32365999
b  Singapore     Singapore     5850342
c    Vietnam         Hanoi    97338579
df.loc["a", ["Country", "Population"]]
Country       Malaysia
Population    32365999
Name: a, dtype: object
df.iloc[[0, 1]]  # If you omit the column indexer, all columns are selected
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True

Use the columns, index, and values attributes to obtain the corresponding objects.

df.index
Index(['a', 'b', 'c'], dtype='object')
df.columns
Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')
df.values
array([['Malaysia', 'Kuala Lumpur', 32365999, False],
       ['Singapore', 'Singapore', 5850342, True],
       ['Vietnam', 'Hanoi', 97338579, True]], dtype=object)

Select the corresponding column(s) according to the labels or indexes of columns.

df["Country"]
a     Malaysia
b    Singapore
c      Vietnam
Name: Country, dtype: object
df[["Country", "Population"]] # Use list to select multiple columns  
Country Population
a Malaysia 32365999
b Singapore 5850342
c Vietnam 97338579
df.Country  # columns can also be selected as attributes
a     Malaysia
b    Singapore
c      Vietnam
Name: Country, dtype: object
df["a":"b"] # using label-based slice to select multiple rows
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
df[0:2]  # using a position-based slice to select multiple rows
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True

3.1.3.3. Assigning value#

df1 = df.copy(True)
df1.loc["c", "Country"] = "Japan"
df1.loc["c", "Capital"] = "Tokyo"
df1.loc["c", "Population"] = 126476461
df1.loc["c", "Isdeveloped"] = True
df1
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c      Japan         Tokyo   126476461         True
df1.loc["c"] = ["Japan", "Tokyo", 126476461, True]
df
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True

3.1.3.4. Assigning index, columns, and name of index and columns#

df1.index = ["e", "f", "g"]
df1.index.name = "label"
df1.columns.name = "attributes"
df1.columns = ["Coun", "Cap", "Pop", "ID"]  # assigning a new list replaces the whole Index, discarding the name set above
df1
            Coun           Cap        Pop     ID
label
e       Malaysia  Kuala Lumpur   32365999  False
f      Singapore     Singapore    5850342   True
g          Japan         Tokyo  126476461   True
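
If you only want to change a few labels rather than reassigning whole lists, DataFrame.rename() with a mapping is a handy alternative (a sketch; it returns a modified copy by default):

df1.rename(columns={"Pop": "Population"}, index={"e": "E"})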

3.1.3.5. Delete columns from dataframe#

del df1["ID"]
# axis=1 or 'columns' deletes columns
df1.drop(["Coun"], axis='columns', inplace=True)

# axis=0 or 'rows' deletes rows
df1.drop(["e"], axis='rows', inplace=True)

df1
             Cap        Pop
label
f      Singapore    5850342
g          Tokyo  126476461

When inplace=True is passed, the original DataFrame is modified in place and the operation itself returns None. When inplace=False is passed (this is the default value, so it isn't necessary to specify), the function returns a copy of the modified DataFrame.

# inplace=True
df2 = df.copy(True)
print(df2)
print("----")
df2_return = df2.drop(["Country"], axis='columns', inplace=True)
print(df2)
print("----")
print(df2_return)
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True
----
        Capital  Population  Isdeveloped
a  Kuala Lumpur    32365999        False
b     Singapore     5850342         True
c         Hanoi    97338579         True
----
None
# inplace=False
df3 = df.copy(True)
print(df3)
print("----")
df3_return = df3.drop(["Country"], axis='columns', inplace=False)
print(df3)
print("----")
print(df3_return)
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True
----
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True
----
        Capital  Population  Isdeveloped
a  Kuala Lumpur    32365999        False
b     Singapore     5850342         True
c         Hanoi    97338579         True

3.1.3.6. Filtering#

Filtering works the same way as for Series, as mentioned above.

df2 = pd.DataFrame(np.array([[14, 35, 35, 35],
                             [19, 34, 57, 34],
                             [42, 74, 49, 59]]))
# filter values less than 30
df2[df2 < 30]
      0   1   2   3
0  14.0 NaN NaN NaN
1  19.0 NaN NaN NaN
2   NaN NaN NaN NaN
# Filtering according to a condition on one column
df[df["Population"] < 50000000]
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True

You can filter the DataFrame according to conditions of multiple columns like the following:

df[(df["Population"] < 50000000) & (df["Isdeveloped"] == True)]
     Country    Capital  Population  Isdeveloped
b  Singapore  Singapore     5850342         True
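
As a more readable alternative for multi-condition filters, DataFrame.query() accepts a Boolean expression string, in which the keywords and/or are allowed (a sketch equivalent to the mask above):

df.query("Population < 50000000 and Isdeveloped")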

3.1.3.7. Transposition of a Dataframe#

Similar to NumPy arrays, a DataFrame can be transposed: the columns are interchanged with the index.

df1 = df.T
df1
                        a          b         c
Country          Malaysia  Singapore   Vietnam
Capital      Kuala Lumpur  Singapore     Hanoi
Population       32365999    5850342  97338579
Isdeveloped         False       True      True
df1.index
Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')
df1.columns
Index(['a', 'b', 'c'], dtype='object')

3.1.3.8. Merge of Dataframe#

Call the concat() function to merge multiple dataframes. (The older DataFrame.append() method has been deprecated and removed in recent versions of pandas in favor of concat().)

df1 = pd.DataFrame(np.random.rand(3,4))
df2 = pd.DataFrame(np.random.rand(3,4))
df3 = pd.DataFrame(np.random.rand(6,4))
df4 = pd.DataFrame(np.random.rand(3,6))
# Vertical merging by default.
pd.concat([df1, df2, df3, df4])
          0         1         2         3         4         5
0  0.711833  0.210372  0.623283  0.227076       NaN       NaN
1  0.187854  0.003110  0.579188  0.457127       NaN       NaN
2  0.577679  0.240997  0.406679  0.252861       NaN       NaN
0  0.154052  0.299850  0.697615  0.432677       NaN       NaN
1  0.808326  0.812322  0.530895  0.383379       NaN       NaN
2  0.253567  0.132441  0.423534  0.081832       NaN       NaN
0  0.001646  0.384060  0.703795  0.574561       NaN       NaN
1  0.395984  0.427264  0.151421  0.405376       NaN       NaN
2  0.764523  0.582966  0.418316  0.405842       NaN       NaN
3  0.004663  0.723450  0.494632  0.038501       NaN       NaN
4  0.281056  0.957538  0.921892  0.813628       NaN       NaN
5  0.320597  0.733772  0.371999  0.385703       NaN       NaN
0  0.154081  0.741327  0.332735  0.630257  0.277065  0.771409
1  0.034194  0.445515  0.499797  0.853322  0.877335  0.729658
2  0.381277  0.008222  0.563647  0.729580  0.993268  0.242828

Note

As you can see, when the shapes of the dataframes don't match, the missing positions are filled with NaN values.
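
If you would rather keep only the columns shared by all inputs than pad with NaN, pass join='inner' (a sketch; the default is join='outer'):

# Only columns 0-3 are common to all four dataframes; df4's columns 4 and 5 are dropped
pd.concat([df1, df2, df3, df4], join='inner')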

Conduct horizontal merging by specifying the axis argument as 'columns' or 1.

pd.concat([df1, df2, df3, df4], axis='columns')
          0         1         2         3         0         1         2         3         0         1         2         3         0         1         2         3         4         5
0  0.711833  0.210372  0.623283  0.227076  0.154052  0.299850  0.697615  0.432677  0.001646  0.384060  0.703795  0.574561  0.154081  0.741327  0.332735  0.630257  0.277065  0.771409
1  0.187854  0.003110  0.579188  0.457127  0.808326  0.812322  0.530895  0.383379  0.395984  0.427264  0.151421  0.405376  0.034194  0.445515  0.499797  0.853322  0.877335  0.729658
2  0.577679  0.240997  0.406679  0.252861  0.253567  0.132441  0.423534  0.081832  0.764523  0.582966  0.418316  0.405842  0.381277  0.008222  0.563647  0.729580  0.993268  0.242828
3       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.004663  0.723450  0.494632  0.038501       NaN       NaN       NaN       NaN       NaN       NaN
4       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.281056  0.957538  0.921892  0.813628       NaN       NaN       NaN       NaN       NaN       NaN
5       NaN       NaN       NaN       NaN       NaN       NaN       NaN       NaN  0.320597  0.733772  0.371999  0.385703       NaN       NaN       NaN       NaN       NaN       NaN

3.1.3.9. View data#

df1 = pd.DataFrame(np.random.rand(100,4))
df1.head(2)
          0         1         2         3
0  0.577690  0.626418  0.165206  0.650919
1  0.104072  0.214700  0.576407  0.095773
df1.tail(3)
           0         1         2         3
97  0.224167  0.586648  0.824226  0.704023
98  0.668144  0.138919  0.423611  0.079042
99  0.832823  0.959519  0.641955  0.541513
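
Besides the first and last rows, you can also inspect a random subset with sample() (a sketch; which rows come back varies from run to run):

df1.sample(3)  # three randomly selected rows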

3.1.3.10. Computational tools#

Compute covariance and Pearson correlation coefficients between different columns.

df1 = pd.DataFrame(np.random.rand(5, 5), index=['i1', 'i2', 'i3', 'i4', 'i5'],
                   columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df1.cov()
          c1        c2        c3        c4        c5
c1  0.025785 -0.005152  0.003585 -0.025674  0.034285
c2 -0.005152  0.028833  0.011865  0.033059 -0.018142
c3  0.003585  0.011865  0.027056  0.026296  0.022413
c4 -0.025674  0.033059  0.026296  0.083990 -0.012309
c5  0.034285 -0.018142  0.022413 -0.012309  0.087803
df1.corr() # method = pearson (default), optional: kendall, spearman
          c1        c2        c3        c4        c5
c1  1.000000 -0.188938  0.135723 -0.551697  0.720551
c2 -0.188938  1.000000  0.424791  0.671776 -0.360566
c3  0.135723  0.424791  1.000000  0.551615  0.459855
c4 -0.551697  0.671776  0.551615  1.000000 -0.143334
c5  0.720551 -0.360566  0.459855 -0.143334  1.000000
df1.corr(method='kendall')
     c1   c2   c3   c4   c5
c1  1.0 -0.2 -0.2 -0.4  0.6
c2 -0.2  1.0  0.2  0.4 -0.2
c3 -0.2  0.2  1.0  0.4  0.2
c4 -0.4  0.4  0.4  1.0  0.0
c5  0.6 -0.2  0.2  0.0  1.0

Compute the average value, maximum, minimum and sum of different columns or rows.

# compute average value of each column by default.
df1.mean()
c1    0.626051
c2    0.241862
c3    0.363783
c4    0.667357
c5    0.685979
dtype: float64

Compute the sum of each row by specifying the axis argument as ‘columns’ or 1.

df1.sum(axis=1)
i1    2.505459
i2    3.600801
i3    2.429741
i4    2.517477
i5    1.871680
dtype: float64

Display a summary of the characteristics of the dataframe:

df1.describe()
             c1        c2        c3        c4        c5
count  5.000000  5.000000  5.000000  5.000000  5.000000
mean   0.626051  0.241862  0.363783  0.667357  0.685979
std    0.160576  0.169804  0.164487  0.289810  0.296316
min    0.454519  0.084028  0.215912  0.346318  0.319171
25%    0.479238  0.089982  0.270709  0.372827  0.433839
50%    0.625463  0.195351  0.299169  0.789868  0.763618
75%    0.759993  0.390465  0.400726  0.844656  0.939979
max    0.811040  0.449487  0.632401  0.983113  0.973286

3.1.3.11. Data ranking#

df1 = df.copy(deep=True)
df1.sort_values(by=['Population', 'Country'], ascending=False, na_position='first')
     Country       Capital  Population  Isdeveloped
c    Vietnam         Hanoi    97338579         True
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
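
Although this section sorts with sort_values(), pandas also provides a rank() method if you need the rank numbers themselves rather than a reordered frame (a sketch):

df1['Population'].rank(ascending=False)
# a    2.0
# b    3.0
# c    1.0
# Name: Population, dtype: float64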

3.1.4. NaN value#

df1 = pd.DataFrame(np.random.rand(5, 5), index=['i1', 'i2', 'i3', 'i4', 'i5'],
                   columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df1.iloc[0, 1] = np.nan
df1.iloc[2, 2] = np.nan
df1.iloc[3, 1] = np.nan
df1.iloc[3, 3] = np.nan
df1
          c1        c2        c3        c4        c5
i1  0.994022       NaN  0.896517  0.415380  0.943970
i2  0.152851  0.114647  0.288784  0.386462  0.741946
i3  0.296382  0.914272       NaN  0.955293  0.578226
i4  0.313830       NaN  0.993494       NaN  0.135778
i5  0.449799  0.524310  0.967581  0.419476  0.208853
# detecting NaN values
print(df1.isnull())
print(df1.notnull())
print(df1.isna())
       c1     c2     c3     c4     c5
i1  False   True  False  False  False
i2  False  False  False  False  False
i3  False  False   True  False  False
i4  False   True  False   True  False
i5  False  False  False  False  False
      c1     c2     c3     c4    c5
i1  True  False   True   True  True
i2  True   True   True   True  True
i3  True   True  False   True  True
i4  True  False   True  False  True
i5  True   True   True   True  True
       c1     c2     c3     c4     c5
i1  False   True  False  False  False
i2  False  False  False  False  False
i3  False  False   True  False  False
i4  False   True  False   True  False
i5  False  False  False  False  False

After detecting NaN values with Boolean masks, you can easily count them:

# False:0, True:1
df1.isnull().sum(axis=1)
i1    1
i2    0
i3    1
i4    2
i5    0
dtype: int64

Fill NaN values with a specific value or with values generated by some rule:

# fill NaN value using a specific value
df1.fillna(value=0)
          c1        c2        c3        c4        c5
i1  0.994022  0.000000  0.896517  0.415380  0.943970
i2  0.152851  0.114647  0.288784  0.386462  0.741946
i3  0.296382  0.914272  0.000000  0.955293  0.578226
i4  0.313830  0.000000  0.993494  0.000000  0.135778
i5  0.449799  0.524310  0.967581  0.419476  0.208853
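
A value "generated by some rule" can, for example, be each column's mean, or the previous valid observation; a sketch of both (fillna() accepts a Series of per-column fill values, and ffill() propagates values downward):

df1.fillna(value=df1.mean())  # fill each column's NaN with that column's mean
df1.ffill()                   # or carry the previous valid value forward
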
# delete NaN value
# ‘any’ : If any NA values are present, drop that row or column.
# ‘all’ : If all values are NA, drop that row or column.

# 0, or ‘index’ : Drop rows which contain missing values.
# 1, or ‘columns’ : Drop columns which contain missing values.
df1.dropna(axis="index", how="any")
          c1        c2        c3        c4        c5
i2  0.152851  0.114647  0.288784  0.386462  0.741946
i5  0.449799  0.524310  0.967581  0.419476  0.208853

3.1.5. Date index#

A date index is very useful for dealing with time series. You can create one with the date_range function. The date index discussed here is usually discrete with equal intervals, and it has three defining features: a start date, an end date, and a frequency (or interval).

dti = pd.date_range("2018-01-01", periods=3, freq="H")
print(dti)
dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
print(dti)
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
               '2018-01-01 02:00:00'],
              dtype='datetime64[ns]', freq='H')
DatetimeIndex(['2021-09-28 00:00:00', '2021-09-28 10:00:00',
               '2021-09-28 20:00:00', '2021-09-29 06:00:00',
               '2021-09-29 16:00:00'],
              dtype='datetime64[ns]', freq='10H')

Datetimes can be manipulated and converted with timezone information; tz_localize() attaches a timezone to a naive DatetimeIndex.

dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
dti = dti.tz_localize("UTC")
dti
DatetimeIndex(['2021-09-28 00:00:00+00:00', '2021-09-28 10:00:00+00:00',
               '2021-09-28 20:00:00+00:00', '2021-09-29 06:00:00+00:00',
               '2021-09-29 16:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='10H')
dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
dti = dti.tz_localize("Asia/Singapore")
dti
DatetimeIndex(['2021-09-28 00:00:00+08:00', '2021-09-28 10:00:00+08:00',
               '2021-09-28 20:00:00+08:00', '2021-09-29 06:00:00+08:00',
               '2021-09-29 16:00:00+08:00'],
              dtype='datetime64[ns, Asia/Singapore]', freq=None)

Using the origin option, one can specify an alternative starting point for the creation of a DatetimeIndex. For example, to use 1900-01-01 00:00:00 as the starting time and hour as the unit period length:

pd.to_datetime([100, 101, 102], unit="h", origin=pd.Timestamp("1900-01-01 00:00:00"))
DatetimeIndex(['1900-01-05 04:00:00', '1900-01-05 05:00:00',
               '1900-01-05 06:00:00'],
              dtype='datetime64[ns]', freq=None)

Supported units are D: day, h: hour, m: minute, and s: second.

Tip

The time labels of climate products are usually given as a start time point plus a discrete time interval, and then represented by a column of integers. In this case, we can use the approach above to construct the time labels.

3.1.6. Upsampling and Downsampling#

  • Upsampling: Increase the frequency of the samples by interpolation, such as from minutes to seconds.

  • Downsampling: Decrease the frequency of the samples by aggregation, such as from months to years.

# prepare the data; reading files is introduced in the Input/Output section below
# Data Source: http://www.weather.gov.sg/climate-historical-daily/
data = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0, parse_dates=True)
data.head()
            Daily Rainfall Total (mm)
Date
1981-01-01                        0.0
1981-01-02                        0.0
1981-01-03                        0.0
1981-01-04                        0.0
1981-01-05                        0.0
# Downsampling: Convert daily data to yearly data by sum or max
df = data.copy(deep=True)
dfsum = df.resample("Y").sum()
dfsum.columns = ["Yearly Rainfall Total (mm)"]
dfsum.head()
            Yearly Rainfall Total (mm)
Date
1981-12-31                      1336.3
1982-12-31                      1581.7
1983-12-31                      1866.5
1984-12-31                      2686.7
1985-12-31                      1483.9
dfmax = df.resample("Y").max()
dfmax.columns = ["Yearly Rainfall Maximum (mm)"]
dfmax.head()
            Yearly Rainfall Maximum (mm)
Date
1981-12-31                          71.5
1982-12-31                         109.0
1983-12-31                         181.8
1984-12-31                         154.4
1985-12-31                          86.8
# Upsampling: Convert yearly data to 10-day data,
# either taking values as-is (asfreq) or forward filling (pad/ffill)
dfmax.resample('10D').asfreq()[0:5]
            Yearly Rainfall Maximum (mm)
Date
1981-12-31                          71.5
1982-01-10                           NaN
1982-01-20                           NaN
1982-01-30                           NaN
1982-02-09                           NaN
dfmax.resample('10D').ffill()[0:5]
            Yearly Rainfall Maximum (mm)
Date
1981-12-31                          71.5
1982-01-10                          71.5
1982-01-20                          71.5
1982-01-30                          71.5
1982-02-09                          71.5
dfmax.resample('D').ffill(limit=2)[0:5]
            Yearly Rainfall Maximum (mm)
Date
1981-12-31                          71.5
1982-01-01                          71.5
1982-01-02                          71.5
1982-01-03                           NaN
1982-01-04                           NaN
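
Upsampling was defined above as increasing the frequency "by interpolation", and the resampler exposes that directly; a sketch using the default linear interpolation between the yearly maxima:

dfmax.resample('10D').interpolate()[0:5]  # values between 71.5 and the next yearly maximum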

3.1.7. Group DataFrame#

You can group a DataFrame using a mapper or by a Series of columns via the groupby function.

# Calculate average and maximum wind speed of each station
df = pd.DataFrame({'Station': ['Changi', 'Changi',
                              'Climenti', 'Climenti'],
                   'Wind Speed': [10., 5., 6., 16.]})
print(df.groupby(['Station']).mean())
print(df.groupby(['Station']).max())
          Wind Speed
Station             
Changi           7.5
Climenti        11.0
          Wind Speed
Station             
Changi          10.0
Climenti        16.0
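
To compute several statistics in one pass instead of calling mean() and max() separately, you can hand agg() a list of function names (a sketch):

df.groupby(['Station'])['Wind Speed'].agg(['mean', 'max'])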

You can also apply a custom function to each group via the .apply() method.

# custom function
df = pd.DataFrame({'Station': ['Changi', 'Changi', 'Changi',
                              'Climenti', 'Climenti', 'Climenti'],
                   'Wind Speed': [10., 5., 12, 6., 16., 20]})
def groupNormalization(w):
    return (w-w.min())/(w.max()-w.min())
df.groupby(['Station']).apply(groupNormalization)
            Wind Speed
Station
Changi   0    0.714286
         1    0.000000
         2    1.000000
Climenti 3    0.000000
         4    0.714286
         5    1.000000

3.1.8. Input/Output of data#

Typically, you can read data from files ending with ".xlsx" and ".csv" using the read_excel() and read_csv() functions, respectively. The index_col and header arguments specify which column and row to use as the index and column labels of the dataframe. You can also set parse_dates to True to parse the index as dates. If your date format is uncommon, you can specify the date_format argument with a format string (e.g., '%Y-%m-%d') for converting the string columns to datetime instances.

df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0, 
                 parse_dates=True)
df.head()
            Daily Rainfall Total (mm)
Date
1981-01-01                        0.0
1981-01-02                        0.0
1981-01-03                        0.0
1981-01-04                        0.0
1981-01-05                        0.0
df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0, 
                 parse_dates=True, date_format='%Y-%m-%d')
df.head()
            Daily Rainfall Total (mm)
Date
1981-01-01                        0.0
1981-01-02                        0.0
1981-01-03                        0.0
1981-01-04                        0.0
1981-01-05                        0.0

You can also specify the parse_dates argument as a list of column positions (or names) to parse those columns as dates.

df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0, 
                 parse_dates=[0])
df.head()
            Daily Rainfall Total (mm)
Date
1981-01-01                        0.0
1981-01-02                        0.0
1981-01-03                        0.0
1981-01-04                        0.0
1981-01-05                        0.0
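
For the output half of input/output, the mirror-image writers are to_csv() and to_excel(); a sketch (the file paths are placeholders, and to_excel() requires an Excel engine such as openpyxl to be installed):

df.to_csv('rainfall_out.csv')     # hypothetical output path
df.to_excel('rainfall_out.xlsx')  # needs openpyxl (or a similar engine)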

The usage of the read_excel() function is the same as that of read_csv().

Tip

  1. A Series can be regarded as a special one-column DataFrame. In fact, when you extract one column from a DataFrame, it is converted to a Series.

  2. The DataFrame object shares many usage patterns with the NumPy array, so you can learn it by analogy.
