3.1. Pandas tutorial
3.1.1. Introduction to Pandas Data Structures
The two most common data structures of pandas are:
- Series
- DataFrame
The Series is designed to accommodate one-dimensional ndarray data with axis labels (e.g., time series data), and the DataFrame is designed for two-dimensional tabular data with labeled axes (rows and columns).
3.1.2. Series
As shown below, the structure of a Series object is simple: it consists of two columns of the same length, Index and Value. The Value column holds the data (of any NumPy type), and each value is associated with a label in the Index column.
| Index | Value |
|---|---|
| 0 | -1 |
| 1 | 3 |
| 2 | 8 |
3.1.2.1. Defining a Series
We can call the pd.Series() function to create a Series object and specify the data argument to be a NumPy array, a list, a dictionary, or a scalar value.
import numpy as np
import pandas as pd
# Since 'data' is the first positional parameter, you can omit 'data=' and pass the value directly
s1 = pd.Series(data=np.array([-1, 3, 8])) # NumPy array
s2 = pd.Series(data=[-1, 3, 8]) # List
s3 = pd.Series(data={'a':-1, 'b':3, 'c':8}) # Dictionary
s4 = pd.Series(data=-1) # Scalar
print(s1)
print(s2)
print(s3)
print(s4)
0 -1
1 3
2 8
dtype: int64
0 -1
1 3
2 8
dtype: int64
a -1
b 3
c 8
dtype: int64
0 -1
dtype: int64
As you can see from the output of the Series object, the left column is the Index, which is a series of labels, and the right column contains the corresponding values.
Note
If you do not specify an index when creating the Series, pandas assigns sequence numbers starting from 0 as the index by default. If the data is a dictionary, its keys are used as the index of the Series.
In general, it is better to create a Series object with meaningful labels to distinguish and identify each value. When constructing the Series, the labels are added either by specifying the value of the index parameter or by specifying the data as a dict object, as mentioned above.
s5 = pd.Series([-1, 3, 8], index=['x', 'y', 'z']) # specify the index by 'index' option
print(s5)
x -1
y 3
z 8
dtype: int64
Note
Always keep in mind that when specifying the data argument as a NumPy array, the values contained in the array are not copied but passed by reference. That is, the array is inserted dynamically within the new Series object; if the values of the NumPy array change, those changes will also be present in the Series object.
arr = np.array([100, 20, -3])
s6 = pd.Series(arr, index=['x', 'y', 'z'] )
print(arr)
print(s6)
arr[2] = -40
print('\n')
print(arr)
print("After reassigning value, the value of series \nalso will be updated automatically:")
print(s6)
[100 20 -3]
x 100
y 20
z -3
dtype: int64
[100 20 -40]
After reassigning value, the value of series
also will be updated automatically:
x 100
y 20
z -40
dtype: int64
As you can see in this example, by changing the third element of the arr array, we also modified the corresponding element in the s6 series.
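If you want the Series to own an independent copy of the data instead, the pd.Series constructor accepts a copy argument. A minimal sketch (s7 is an illustrative name):
arr = np.array([100, 20, -3])
s7 = pd.Series(arr, index=['x', 'y', 'z'], copy=True)  # copy the input array
arr[2] = -40
print(s7)  # 'z' still holds -3: the Series no longer aliases arr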
You can access the index and values attributes of the Series to view them separately.
print(s6.values)
print(s6.index)
[100 20 -40]
Index(['x', 'y', 'z'], dtype='object')
3.1.2.2. Selecting the Elements
You can select any individual element in the same way as with NumPy arrays, by specifying its position with the .iloc attribute of the Series.
s = pd.Series(data={'a':-1, 'b':3, 'c':8})
s.iloc[1]
3
Or you can select an element by specifying its index label.
s['a']
-1
Similar to NumPy arrays, you can also select multiple items by specifying position-based or label-based slices.
print(s[0:2])
print(s['a':'c'])
a -1
b 3
dtype: int64
a -1
b 3
c 8
dtype: int64
Note
Notice that the position-based slice does not contain the last element of the slice (like NumPy arrays), but the label-based slice does contain the last element.
3.1.2.3. Assigning Values to the Elements
s['a'] = 100
s.iloc[2] = 80
s
a 100
b 3
c 80
dtype: int64
3.1.2.4. Filtering Values
If you need to get the indexes and values of the elements in the Series that are greater than 4, you can write the following:
s = pd.Series([1, 3, 5, 2, 10])
idx_greater = s > 4
# According to Boolean value to filter
print(idx_greater)
print(s[idx_greater]) # greater than 4
0 False
1 False
2 True
3 False
4 True
dtype: bool
2 5
4 10
dtype: int64
Or you can get the elements that satisfy multiple conditions by writing the following:
# You cannot use the keywords `or`, `and` and `not`; use the corresponding element-wise operators (|, & and ~) instead
filter_condition = (s.isin([2, 5]) | (s<4)) & ~(s>2)
print(filter_condition)
print(s[filter_condition])
0 True
1 False
2 False
3 True
4 False
dtype: bool
0 1
3 2
dtype: int64
3.1.2.5. Operations and Mathematical Functions
The Series object can participate in common mathematical operations in a similar way to NumPy arrays.
s * 2.5
0 2.5
1 7.5
2 12.5
3 5.0
4 25.0
dtype: float64
np.exp(s)
0 2.718282
1 20.085537
2 148.413159
3 7.389056
4 22026.465795
dtype: float64
3.1.2.6. NaN Values
NaN stands for Not a Number, and it is generally caused by missing values. The NaN values need to be addressed before data analysis.
# Declare a Series including NaN values
s = pd.Series([1, np.nan, 10, 9, -2, np.nan])
s
0 1.0
1 NaN
2 10.0
3 9.0
4 -2.0
5 NaN
dtype: float64
Call the isnull() or notnull() functions to generate boolean values and get the indexes corresponding to the NaN values.
print(s.isnull())
print(s.notnull())
0 False
1 True
2 False
3 False
4 False
5 True
dtype: bool
0 True
1 False
2 True
3 True
4 True
5 False
dtype: bool
Based on the generated boolean values, the Series with only the NaN values and the Series without NaN values can be obtained by the filtering methods mentioned above.
print(s[s.isnull()])
print(s[s.notnull()])
1 NaN
5 NaN
dtype: float64
0 1.0
2 10.0
3 9.0
4 -2.0
dtype: float64
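The Series object also provides the dropna() and fillna() methods to discard or replace missing values directly; a minimal sketch:
print(s.dropna())   # keep only the non-NaN elements
print(s.fillna(0))  # replace each NaN with 0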
3.1.2.7. Operations on Multiple Series
s = pd.Series({"Singapore": 30, "Malaysia": 23, "Vietnam": 36, "Cambodia": 41})
s1 = pd.Series({"China": 51, "Japan": 73, "Vietnam": 36, "Laos": 31})
s * s1
Cambodia NaN
China NaN
Japan NaN
Laos NaN
Malaysia NaN
Singapore NaN
Vietnam 1296.0
dtype: float64
As you can see, the operation is applied only to the elements whose index appears in both Series; all other labels produce NaN.
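If you would rather keep the labels that appear in only one of the two Series, the binary operator methods such as mul() accept a fill_value argument; a minimal sketch that treats a missing operand as 1:
s.mul(s1, fill_value=1)  # labels missing in one Series are multiplied by 1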
3.1.3. DataFrame
Compared with the Series, the DataFrame can contain two-dimensional data. Its first column and first row are the index and the columns, respectively (for a DataFrame without multiple indexes).
| index | numeric | string | boolean |
|---|---|---|---|
| 0 | -1 | Singapore | True |
| 1 | 3 | China | True |
| 2 | 8 | Japan | False |
3.1.3.1. Defining a DataFrame
Call the DataFrame() function to create a DataFrame. An array, a list, and a dict can all be passed as the data argument.
df = pd.DataFrame([["Malaysia", "Kuala Lumpur", 'daodao', False],
["Singapore", "Singapore", 5850342, True],
["Vietnam", "Hanoi", 97338579, True]],
columns=["Country", "Capital", "Population", "Isdeveloped"],
index=["a", "b", "c"])
print(df)
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur daodao False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
# Array
df = pd.DataFrame(
np.array([[14, 35, 35, 35], [19, 34, 57, 34], [42, 74, 49, 59]]))
print(df)
# List, use the 'columns' and 'index' parameters to specify the columns and index of the generated dataframe
df = pd.DataFrame([["Malaysia", "Kuala Lumpur", 32365999, False],
["Singapore", "Singapore", 5850342, True],
["Vietnam", "Hanoi", 97338579, True]],
columns=["Country", "Capital", "Population", "Isdeveloped"],
index=["a", "b", "c"])
print(df)
# dict
df = pd.DataFrame({"Country": ["Malaysia", "Singapore", "Vietnam"],
"Capital": ["Kuala Lumpur", "Singapore", "Hanoi"],
"Population": [32365999, 5850342, 97338579],
"Isdeveloped": [False, True, True]},
index=["a", "b", "c"])
print(df)
0 1 2 3
0 14 35 35 35
1 19 34 57 34
2 42 74 49 59
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
3.1.3.2. Selecting the Internal Elements
Similar to Series, two ways can be used to select elements from a DataFrame: call iloc[] and loc[] to select the elements by position and by label, respectively.
# use ':' to select all rows
df.iloc[:, 0:2]
|   | Country | Capital |
|---|---|---|
| a | Malaysia | Kuala Lumpur |
| b | Singapore | Singapore |
| c | Vietnam | Hanoi |
df.loc[:, "Country":"Population"]
|   | Country | Capital | Population |
|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 |
| b | Singapore | Singapore | 5850342 |
| c | Vietnam | Hanoi | 97338579 |
df.loc["a", ["Country", "Population"]]
Country Malaysia
Population 32365999
Name: a, dtype: object
df.iloc[[0, 1]] # If you omit the columns, all columns are selected
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
Use the columns, index and values attributes to obtain the corresponding object values.
df.index
Index(['a', 'b', 'c'], dtype='object')
df.columns
Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')
df.values
array([['Malaysia', 'Kuala Lumpur', 32365999, False],
['Singapore', 'Singapore', 5850342, True],
['Vietnam', 'Hanoi', 97338579, True]], dtype=object)
Select the corresponding column(s) according to the column labels.
df["Country"]
a Malaysia
b Singapore
c Vietnam
Name: Country, dtype: object
df[["Country", "Population"]] # Use list to select multiple columns
|   | Country | Population |
|---|---|---|
| a | Malaysia | 32365999 |
| b | Singapore | 5850342 |
| c | Vietnam | 97338579 |
df.Country # Columns can also be selected as attributes
a Malaysia
b Singapore
c Vietnam
Name: Country, dtype: object
df["a":"b"] # using label-based slice to select multiple rows
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
df[0:2] # using a position-based slice to select multiple rows
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
3.1.3.3. Assigning Values
df1 = df.copy(True)
df1.loc["c", "Country"] = "Japan"
df1.loc["c", "Capital"] = "Tokyo"
df1.loc["c", "Population"] = 126476461
df1.loc["c", "Isdeveloped"] = True
df1
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
| c | Japan | Tokyo | 126476461 | True |
df1.loc["c"] = ["Japan", "Tokyo", 126476461, True]
df
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
| c | Vietnam | Hanoi | 97338579 | True |
3.1.3.4. Assigning the index, columns, and names of the index and columns
df1.index = ["e", "f", "g"]
df1.index.name = "label"
df1.columns.name = "attributes"
df1.columns = ["Coun", "Cap", "Pop", "ID"]
df1
|   | Coun | Cap | Pop | ID |
|---|---|---|---|---|
| label |  |  |  |  |
| e | Malaysia | Kuala Lumpur | 32365999 | False |
| f | Singapore | Singapore | 5850342 | True |
| g | Japan | Tokyo | 126476461 | True |
3.1.3.5. Deleting Columns from a DataFrame
del df1["ID"]
# axis=1 or 'columns' means delete columns
df1.drop(["Coun"], axis='columns', inplace=True)
# axis=0 or 'rows' means delete rows
df1.drop(["e"], axis='rows', inplace=True)
df1
|   | Cap | Pop |
|---|---|---|
| label |  |  |
| f | Singapore | 5850342 |
| g | Tokyo | 126476461 |
When inplace=True is passed, the original DataFrame is modified in place and the operation returns None. When inplace=False is passed (this is the default value, so it isn't necessary to specify), the function leaves the original untouched and returns a copy of the modified DataFrame.
# inplace=True
df2 = df.copy(True)
print(df2)
print("----")
df2_return = df2.drop(["Country"], axis='columns', inplace=True)
print(df2)
print("----")
print(df2_return)
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
----
Capital Population Isdeveloped
a Kuala Lumpur 32365999 False
b Singapore 5850342 True
c Hanoi 97338579 True
----
None
# inplace=False
df3 = df.copy(True)
print(df3)
print("----")
df3_return = df3.drop(["Country"], axis='columns', inplace=False)
print(df3)
print("----")
print(df3_return)
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
----
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
----
Capital Population Isdeveloped
a Kuala Lumpur 32365999 False
b Singapore 5850342 True
c Hanoi 97338579 True
3.1.3.6. Filtering
Filtering works the same way as for the Series mentioned above.
df2 = pd.DataFrame(np.array([[14, 35, 35, 35],
[19, 34, 57, 34],
[42, 74, 49, 59]]))
# filter values less than 30
df2[df2 < 30]
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 14.0 | NaN | NaN | NaN |
| 1 | 19.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
# Filtering according to conditions on one column
df[df["Population"] < 50000000]
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
You can filter the DataFrame according to conditions on multiple columns like the following:
df[(df["Population"] < 50000000) & (df["Isdeveloped"] == True)]
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| b | Singapore | Singapore | 5850342 | True |
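The same multi-column filter can also be written as a string expression via the query() method; a minimal sketch:
df.query("Population < 50000000 and Isdeveloped == True")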
3.1.3.7. Transposition of a DataFrame
Similar to NumPy arrays, a DataFrame can be transposed: the columns are interchanged with the index.
df1 = df.T
df1
|   | a | b | c |
|---|---|---|---|
| Country | Malaysia | Singapore | Vietnam |
| Capital | Kuala Lumpur | Singapore | Hanoi |
| Population | 32365999 | 5850342 | 97338579 |
| Isdeveloped | False | True | True |
df1.index
Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')
df1.columns
Index(['a', 'b', 'c'], dtype='object')
3.1.3.8. Merging DataFrames
Call the concat() function to merge multiple DataFrames (the older append() method served a similar purpose but was removed in pandas 2.0).
df1 = pd.DataFrame(np.random.rand(3,4))
df2 = pd.DataFrame(np.random.rand(3,4))
df3 = pd.DataFrame(np.random.rand(6,4))
df4 = pd.DataFrame(np.random.rand(3,6))
# Vertical merging by default.
pd.concat([df1, df2, df3, df4])
|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.711833 | 0.210372 | 0.623283 | 0.227076 | NaN | NaN |
| 1 | 0.187854 | 0.003110 | 0.579188 | 0.457127 | NaN | NaN |
| 2 | 0.577679 | 0.240997 | 0.406679 | 0.252861 | NaN | NaN |
| 0 | 0.154052 | 0.299850 | 0.697615 | 0.432677 | NaN | NaN |
| 1 | 0.808326 | 0.812322 | 0.530895 | 0.383379 | NaN | NaN |
| 2 | 0.253567 | 0.132441 | 0.423534 | 0.081832 | NaN | NaN |
| 0 | 0.001646 | 0.384060 | 0.703795 | 0.574561 | NaN | NaN |
| 1 | 0.395984 | 0.427264 | 0.151421 | 0.405376 | NaN | NaN |
| 2 | 0.764523 | 0.582966 | 0.418316 | 0.405842 | NaN | NaN |
| 3 | 0.004663 | 0.723450 | 0.494632 | 0.038501 | NaN | NaN |
| 4 | 0.281056 | 0.957538 | 0.921892 | 0.813628 | NaN | NaN |
| 5 | 0.320597 | 0.733772 | 0.371999 | 0.385703 | NaN | NaN |
| 0 | 0.154081 | 0.741327 | 0.332735 | 0.630257 | 0.277065 | 0.771409 |
| 1 | 0.034194 | 0.445515 | 0.499797 | 0.853322 | 0.877335 | 0.729658 |
| 2 | 0.381277 | 0.008222 | 0.563647 | 0.729580 | 0.993268 | 0.242828 |
Note
As you can see, when the shapes of the DataFrames don't match, the blank positions are filled with NaN values.
Conduct horizontal merging by specifying the axis argument as 'columns' or 1.
pd.concat([df1, df2, df3, df4], axis='columns')
|   | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.711833 | 0.210372 | 0.623283 | 0.227076 | 0.154052 | 0.299850 | 0.697615 | 0.432677 | 0.001646 | 0.384060 | 0.703795 | 0.574561 | 0.154081 | 0.741327 | 0.332735 | 0.630257 | 0.277065 | 0.771409 |
| 1 | 0.187854 | 0.003110 | 0.579188 | 0.457127 | 0.808326 | 0.812322 | 0.530895 | 0.383379 | 0.395984 | 0.427264 | 0.151421 | 0.405376 | 0.034194 | 0.445515 | 0.499797 | 0.853322 | 0.877335 | 0.729658 |
| 2 | 0.577679 | 0.240997 | 0.406679 | 0.252861 | 0.253567 | 0.132441 | 0.423534 | 0.081832 | 0.764523 | 0.582966 | 0.418316 | 0.405842 | 0.381277 | 0.008222 | 0.563647 | 0.729580 | 0.993268 | 0.242828 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.004663 | 0.723450 | 0.494632 | 0.038501 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.281056 | 0.957538 | 0.921892 | 0.813628 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.320597 | 0.733772 | 0.371999 | 0.385703 | NaN | NaN | NaN | NaN | NaN | NaN |
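Besides concatenation, pandas also supports database-style joins on key columns through the merge() function; a minimal sketch with a small hypothetical pair of tables:
left = pd.DataFrame({"Country": ["Singapore", "Vietnam"], "Population": [5850342, 97338579]})
right = pd.DataFrame({"Country": ["Singapore", "Vietnam"], "Capital": ["Singapore", "Hanoi"]})
pd.merge(left, right, on="Country")  # one row per key present in both tables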
3.1.3.9. View data
df1 = pd.DataFrame(np.random.rand(100,4))
df1.head(2)
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.577690 | 0.626418 | 0.165206 | 0.650919 |
| 1 | 0.104072 | 0.214700 | 0.576407 | 0.095773 |
df1.tail(3)
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 97 | 0.224167 | 0.586648 | 0.824226 | 0.704023 |
| 98 | 0.668144 | 0.138919 | 0.423611 | 0.079042 |
| 99 | 0.832823 | 0.959519 | 0.641955 | 0.541513 |
3.1.3.10. Computational tools
Compute the covariance and Pearson correlation coefficients between different columns.
df1 = pd.DataFrame(np.random.rand(5, 5), index=['i1', 'i2', 'i3', 'i4', 'i5'],
columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df1.cov()
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| c1 | 0.025785 | -0.005152 | 0.003585 | -0.025674 | 0.034285 |
| c2 | -0.005152 | 0.028833 | 0.011865 | 0.033059 | -0.018142 |
| c3 | 0.003585 | 0.011865 | 0.027056 | 0.026296 | 0.022413 |
| c4 | -0.025674 | 0.033059 | 0.026296 | 0.083990 | -0.012309 |
| c5 | 0.034285 | -0.018142 | 0.022413 | -0.012309 | 0.087803 |
df1.corr() # method = pearson (default), optional: kendall, spearman
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| c1 | 1.000000 | -0.188938 | 0.135723 | -0.551697 | 0.720551 |
| c2 | -0.188938 | 1.000000 | 0.424791 | 0.671776 | -0.360566 |
| c3 | 0.135723 | 0.424791 | 1.000000 | 0.551615 | 0.459855 |
| c4 | -0.551697 | 0.671776 | 0.551615 | 1.000000 | -0.143334 |
| c5 | 0.720551 | -0.360566 | 0.459855 | -0.143334 | 1.000000 |
df1.corr(method='kendall')
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| c1 | 1.0 | -0.2 | -0.2 | -0.4 | 0.6 |
| c2 | -0.2 | 1.0 | 0.2 | 0.4 | -0.2 |
| c3 | -0.2 | 0.2 | 1.0 | 0.4 | 0.2 |
| c4 | -0.4 | 0.4 | 0.4 | 1.0 | 0.0 |
| c5 | 0.6 | -0.2 | 0.2 | 0.0 | 1.0 |
Compute the average value, maximum, minimum and sum of different columns or rows.
# compute average value of each column by default.
df1.mean()
c1 0.626051
c2 0.241862
c3 0.363783
c4 0.667357
c5 0.685979
dtype: float64
Compute the sum of each row by specifying the axis argument as 'columns' or 1.
df1.sum(axis=1)
i1 2.505459
i2 3.600801
i3 2.429741
i4 2.517477
i5 1.871680
dtype: float64
Display a summary of the characteristics of the DataFrame:
df1.describe()
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| count | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
| mean | 0.626051 | 0.241862 | 0.363783 | 0.667357 | 0.685979 |
| std | 0.160576 | 0.169804 | 0.164487 | 0.289810 | 0.296316 |
| min | 0.454519 | 0.084028 | 0.215912 | 0.346318 | 0.319171 |
| 25% | 0.479238 | 0.089982 | 0.270709 | 0.372827 | 0.433839 |
| 50% | 0.625463 | 0.195351 | 0.299169 | 0.789868 | 0.763618 |
| 75% | 0.759993 | 0.390465 | 0.400726 | 0.844656 | 0.939979 |
| max | 0.811040 | 0.449487 | 0.632401 | 0.983113 | 0.973286 |
3.1.3.11. Data ranking
df1 = df.copy(deep=True)
df1.sort_values(by=['Population', 'Country'], ascending=False, na_position='first')
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| c | Vietnam | Hanoi | 97338579 | True |
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
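If you need the rank of each value rather than a reordered table, the rank() method assigns each element its position in the sorted order; a minimal sketch:
df1["Population"].rank(ascending=False)  # 1.0 marks the largest population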
3.1.4. NaN value
df1 = pd.DataFrame(np.random.rand(5, 5), index=['i1', 'i2', 'i3', 'i4', 'i5'],
columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df1.iloc[0, 1] = np.nan
df1.iloc[2, 2] = np.nan
df1.iloc[3, 1] = np.nan
df1.iloc[3, 3] = np.nan
df1
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| i1 | 0.994022 | NaN | 0.896517 | 0.415380 | 0.943970 |
| i2 | 0.152851 | 0.114647 | 0.288784 | 0.386462 | 0.741946 |
| i3 | 0.296382 | 0.914272 | NaN | 0.955293 | 0.578226 |
| i4 | 0.313830 | NaN | 0.993494 | NaN | 0.135778 |
| i5 | 0.449799 | 0.524310 | 0.967581 | 0.419476 | 0.208853 |
# detecting NaN values
print(df1.isnull())
print(df1.notnull())
print(df1.isna())
c1 c2 c3 c4 c5
i1 False True False False False
i2 False False False False False
i3 False False True False False
i4 False True False True False
i5 False False False False False
c1 c2 c3 c4 c5
i1 True False True True True
i2 True True True True True
i3 True True False True True
i4 True False True False True
i5 True True True True True
c1 c2 c3 c4 c5
i1 False True False False False
i2 False False False False False
i3 False False True False False
i4 False True False True False
i5 False False False False False
After detecting the NaN values with boolean masks, you can easily get the number of NaN values.
# False counts as 0, True counts as 1
df1.isnull().sum(axis=1)
i1 1
i2 0
i3 1
i4 2
i5 0
dtype: int64
Fill the NaN values with a specific value or a value generated by some rule.
# fill NaN values using a specific value
df1.fillna(value=0)
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| i1 | 0.994022 | 0.000000 | 0.896517 | 0.415380 | 0.943970 |
| i2 | 0.152851 | 0.114647 | 0.288784 | 0.386462 | 0.741946 |
| i3 | 0.296382 | 0.914272 | 0.000000 | 0.955293 | 0.578226 |
| i4 | 0.313830 | 0.000000 | 0.993494 | 0.000000 | 0.135778 |
| i5 | 0.449799 | 0.524310 | 0.967581 | 0.419476 | 0.208853 |
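The filling value can also be generated by a rule, for example each column's mean, or linear interpolation between neighbouring values; a minimal sketch:
print(df1.fillna(value=df1.mean()))  # fill each NaN with the mean of its column
print(df1.interpolate())             # linear interpolation along each column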
# drop NaN values
# 'any': if any NA values are present, drop that row or column.
# 'all': if all values are NA, drop that row or column.
# axis=0, or 'index': drop rows which contain missing values.
# axis=1, or 'columns': drop columns which contain missing values.
df1.dropna(axis="index", how="any")
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| i2 | 0.152851 | 0.114647 | 0.288784 | 0.386462 | 0.741946 |
| i5 | 0.449799 | 0.524310 | 0.967581 | 0.419476 | 0.208853 |
3.1.5. Date index
A date index is very useful for dealing with time series. You can create a date index with the date_range function. The date index mentioned here is usually discrete with equal intervals, and it usually has three features: a starting date, an end date, and a frequency (or interval).
dti = pd.date_range("2018-01-01", periods=3, freq="H")
print(dti)
dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
print(dti)
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
'2018-01-01 02:00:00'],
dtype='datetime64[ns]', freq='H')
DatetimeIndex(['2021-09-28 00:00:00', '2021-09-28 10:00:00',
'2021-09-28 20:00:00', '2021-09-29 06:00:00',
'2021-09-29 16:00:00'],
dtype='datetime64[ns]', freq='10H')
Manipulating and converting datetimes with timezone information:
dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
dti = dti.tz_localize("UTC")
dti
DatetimeIndex(['2021-09-28 00:00:00+00:00', '2021-09-28 10:00:00+00:00',
'2021-09-28 20:00:00+00:00', '2021-09-29 06:00:00+00:00',
'2021-09-29 16:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='10H')
dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
dti = dti.tz_localize("Asia/Singapore")
dti
DatetimeIndex(['2021-09-28 00:00:00+08:00', '2021-09-28 10:00:00+08:00',
'2021-09-28 20:00:00+08:00', '2021-09-29 06:00:00+08:00',
'2021-09-29 16:00:00+08:00'],
dtype='datetime64[ns, Asia/Singapore]', freq=None)
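Once the index is timezone-aware, the tz_convert() method translates the timestamps into another timezone; a minimal sketch:
dti.tz_convert("UTC")  # same instants, expressed in UTC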
Using the origin option, one can specify an alternative starting point for the creation of a DatetimeIndex. For example, to use 1900-01-01 00:00:00 as the starting time and hour as the unit period length:
pd.to_datetime([100, 101, 102], unit="h", origin=pd.Timestamp("1900-01-01 00:00:00"))
DatetimeIndex(['1900-01-05 04:00:00', '1900-01-05 05:00:00',
'1900-01-05 06:00:00'],
dtype='datetime64[ns]', freq=None)
Supported units are D: day, h: hour, m: minute, and s: second.
Tip
The time labels of climate products are usually given as a start time point and a discrete time interval, and are then represented by a column of integers. In this case, we can use the above approach to construct the time labels.
3.1.6. Upsampling and Downsampling
Upsampling: Increase the frequency of the samples by interpolation, such as from minutes to seconds.
Downsampling: Decrease the frequency of the samples by aggregation, such as from months to years.
# prepare the data; reading files with read_csv is introduced in the Input/Output section below
# Data Source: http://www.weather.gov.sg/climate-historical-daily/
data = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0, parse_dates=True)
data.head()
|   | Daily Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-01-01 | 0.0 |
| 1981-01-02 | 0.0 |
| 1981-01-03 | 0.0 |
| 1981-01-04 | 0.0 |
| 1981-01-05 | 0.0 |
# Downsampling: Convert daily data to yearly data by sum or max
df = data.copy(deep=True)
dfsum = df.resample("Y").sum()
dfsum.columns = ["Yearly Rainfall Total (mm)"]
dfsum.head()
|   | Yearly Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 1336.3 |
| 1982-12-31 | 1581.7 |
| 1983-12-31 | 1866.5 |
| 1984-12-31 | 2686.7 |
| 1985-12-31 | 1483.9 |
dfmax = df.resample("Y").max()
dfmax.columns = ["Yearly Rainfall Maximum (mm)"]
dfmax.head()
|   | Yearly Rainfall Maximum (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 71.5 |
| 1982-12-31 | 109.0 |
| 1983-12-31 | 181.8 |
| 1984-12-31 | 154.4 |
| 1985-12-31 | 86.8 |
# Upsampling: Convert yearly data to 10-day data
# by taking values directly (asfreq) or by forward filling (pad/ffill)
dfmax.resample('10D').asfreq()[0:5]
|   | Yearly Rainfall Maximum (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 71.5 |
| 1982-01-10 | NaN |
| 1982-01-20 | NaN |
| 1982-01-30 | NaN |
| 1982-02-09 | NaN |
dfmax.resample('10D').ffill()[0:5]
|   | Yearly Rainfall Maximum (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 71.5 |
| 1982-01-10 | 71.5 |
| 1982-01-20 | 71.5 |
| 1982-01-30 | 71.5 |
| 1982-02-09 | 71.5 |
dfmax.resample('D').ffill(limit=2)[0:5]
|   | Yearly Rainfall Maximum (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 71.5 |
| 1982-01-01 | 71.5 |
| 1982-01-02 | 71.5 |
| 1982-01-03 | NaN |
| 1982-01-04 | NaN |
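As mentioned above, upsampling can also fill the new time steps by interpolation instead of forward filling; a minimal sketch:
dfmax.resample('10D').interpolate()[0:5]  # linearly interpolate between the yearly values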
3.1.7. Grouping a DataFrame
You can group a DataFrame using a mapper or by a Series of columns via the groupby function.
# Calculate average and maximum wind speed of each station
df = pd.DataFrame({'Station': ['Changi', 'Changi',
'Climenti', 'Climenti'],
'Wind Speed': [10., 5., 6., 16.]})
print(df.groupby(['Station']).mean())
print(df.groupby(['Station']).max())
Wind Speed
Station
Changi 7.5
Climenti 11.0
Wind Speed
Station
Changi 10.0
Climenti 16.0
You can also apply a custom function to manipulate the data of each group via the .apply() function.
# custom function
df = pd.DataFrame({'Station': ['Changi', 'Changi', 'Changi',
'Climenti', 'Climenti', 'Climenti'],
'Wind Speed': [10., 5., 12, 6., 16., 20]})
def groupNormalization(w):
return (w-w.min())/(w.max()-w.min())
df.groupby(['Station']).apply(groupNormalization)
|   |   | Wind Speed |
|---|---|---|
| Station |   |   |
| Changi | 0 | 0.714286 |
|   | 1 | 0.000000 |
|   | 2 | 1.000000 |
| Climenti | 3 | 0.000000 |
|   | 4 | 0.714286 |
|   | 5 | 1.000000 |
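You can also compute several statistics per group in one call with the agg() function; a minimal sketch:
df.groupby(['Station'])['Wind Speed'].agg(['mean', 'max', 'min'])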
3.1.8. Input/Output of data
Typically, you can read data from files ending with ".xlsx" and ".csv" using the read_excel() and read_csv() functions, respectively. The index_col and header arguments are used to specify the column and row to be used as the index and columns of the dataframe. You can also set parse_dates to True to parse the index in the date format. If your date format is uncommon, you can specify it as a strftime string via the date_format argument (in older pandas versions, the date_parser argument takes a function for converting a sequence of string columns to an array of datetime instances).
df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0,
parse_dates=True)
df.head()
|   | Daily Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-01-01 | 0.0 |
| 1981-01-02 | 0.0 |
| 1981-01-03 | 0.0 |
| 1981-01-04 | 0.0 |
| 1981-01-05 | 0.0 |
df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0,
parse_dates=True, date_format='%Y-%m-%d')
df.head()
|   | Daily Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-01-01 | 0.0 |
| 1981-01-02 | 0.0 |
| 1981-01-03 | 0.0 |
| 1981-01-04 | 0.0 |
| 1981-01-05 | 0.0 |
You can also specify the parse_dates argument as a list to parse specific column(s) in the date format.
df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0,
parse_dates=[0])
df.head()
|   | Daily Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-01-01 | 0.0 |
| 1981-01-02 | 0.0 |
| 1981-01-03 | 0.0 |
| 1981-01-04 | 0.0 |
| 1981-01-05 | 0.0 |
The usage of the read_excel() function is the same as that of read_csv().
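Output works the other way around: the to_csv() and to_excel() methods write a DataFrame to file; a minimal sketch with a hypothetical output path (exporting to ".xlsx" additionally requires the openpyxl package):
df.to_csv('rainfall_copy.csv')  # hypothetical path; the index is written by default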
Tip
- Series objects can be regarded as special DataFrame objects. Actually, when you extract one column from a DataFrame, it will be converted to a Series.
- The DataFrame object has many of the same usages as the NumPy array, and you can learn it by analogy.