3.1. Pandas tutorial
3.1.1. Introduction to Pandas Data Structures
The two most common data structures of pandas are:
- Series
- DataFrame
The Series is designed to accommodate one-dimensional ndarray data with axis labels (e.g., time series data), and the DataFrame is designed for two-dimensional tabular data with labeled axes (rows and columns).
3.1.2. Series
As shown below, the structure of a Series object is simple: it consists of two columns of the same length, Index and Value. The Value column holds the data (of any NumPy type), and each value is associated with a label in the Index column.
| Index | Value |
|---|---|
| 0 | -1 |
| 1 | 3 |
| 2 | 8 |
3.1.2.1. Defining a Series
We can call the pd.Series() function to create a Series object and specify the data argument to be a NumPy array, a list, a dictionary, or a scalar value.
import numpy as np
import pandas as pd
# Since 'data' is the first positional parameter, you can omit 'data=' and pass the value directly
s1 = pd.Series(data=np.array([-1, 3, 8])) # NumPy array
s2 = pd.Series(data=[-1, 3, 8]) # List
s3 = pd.Series(data={'a':-1, 'b':3, 'c':8}) # Dictionary
s4 = pd.Series(data=-1) # Scalar
print(s1)
print(s2)
print(s3)
print(s4)
0 -1
1 3
2 8
dtype: int64
0 -1
1 3
2 8
dtype: int64
a -1
b 3
c 8
dtype: int64
0 -1
dtype: int64
As you can see from the output of the Series object, the left column is the Index, which is a series of labels, and the right column contains the corresponding values.
Note
If you do not specify an index when creating the Series, pandas assigns sequence numbers starting from 0 as the index by default. If the data is a dictionary, its keys are used as the index of the Series.
In general, it is better to create a Series object with meaningful labels to distinguish and identify each value. When constructing the Series, the labels are added either by specifying the value of the index parameter or by specifying the data as a dict object, as mentioned above.
s5 = pd.Series([-1, 3, 8], index=['x', 'y', 'z']) # specify the index by 'index' option
print(s5)
x -1
y 3
z 8
dtype: int64
Note
Always keep in mind that when specifying the data argument as a NumPy array, the values contained in the array are not copied but passed by reference. That is, the array is inserted dynamically within the new Series object; if the values of the NumPy array change, those changes will also be present in the Series object.
arr = np.array([100, 20, -3])
s6 = pd.Series(arr, index=['x', 'y', 'z'] )
print(arr)
print(s6)
arr[2] = -40
print('\n')
print(arr)
print("After reassigning value, the value of series \nalso will be updated automatically:")
print(s6)
[100 20 -3]
x 100
y 20
z -3
dtype: int64
[100 20 -40]
After reassigning value, the value of series
also will be updated automatically:
x 100
y 20
z -40
dtype: int64
As you can see in this example, by changing the third element of the arr array, we also modified the corresponding element in the s6 series.
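If you want the Series to own an independent copy of the data instead, the pd.Series constructor accepts a copy argument. A minimal sketch (s7 is an illustrative name):
arr = np.array([100, 20, -3])
s7 = pd.Series(arr, index=['x', 'y', 'z'], copy=True)  # copy the input array
arr[2] = -40
print(s7)  # 'z' still holds -3: the Series no longer aliases arr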
You can access the index and values attributes of the Series to view them separately.
print(s6.values)
print(s6.index)
[100 20 -40]
Index(['x', 'y', 'z'], dtype='object')
3.1.2.2. Selecting the Elements
You can select any individual element in the same way as with NumPy arrays, by specifying its position with the .iloc attribute of the Series.
s = pd.Series(data={'a':-1, 'b':3, 'c':8})
s.iloc[1]
3
Or you can select an element by specifying its index label.
s['a']
-1
Similar to NumPy arrays, you can also select multiple items by specifying position-based or label-based slices.
print(s[0:2])
print(s['a':'c'])
a -1
b 3
dtype: int64
a -1
b 3
c 8
dtype: int64
Note
Notice that the position-based slice does not contain the last element of the slice (like NumPy arrays), but the label-based slice does contain the last element.
3.1.2.3. Assigning Values to the Elements
s['a'] = 100
s.iloc[2] = 80
s
a 100
b 3
c 80
dtype: int64
3.1.2.4. Filtering Values
If you need to get the indexes and values of the elements in the Series that are greater than 4, you can write the following:
s = pd.Series([1, 3, 5, 2, 10])
idx_greater = s > 4
# According to Boolean value to filter
print(idx_greater)
print(s[idx_greater]) # greater than 4
0 False
1 False
2 True
3 False
4 True
dtype: bool
2 5
4 10
dtype: int64
Or you can get the elements that satisfy multiple conditions by writing the following:
# You cannot use the keywords `or`, `and` and `not`; use the corresponding element-wise operators (|, & and ~) instead
filter_condition = (s.isin([2, 5]) | (s<4)) & ~(s>2)
print(filter_condition)
print(s[filter_condition])
0 True
1 False
2 False
3 True
4 False
dtype: bool
0 1
3 2
dtype: int64
3.1.2.5. Operations and Mathematical Functions
The Series object can participate in common mathematical operations in a similar way to NumPy arrays.
s * 2.5
0 2.5
1 7.5
2 12.5
3 5.0
4 25.0
dtype: float64
np.exp(s)
0 2.718282
1 20.085537
2 148.413159
3 7.389056
4 22026.465795
dtype: float64
3.1.2.6. NaN Values
NaN stands for Not a Number, and it is generally caused by missing values. The NaN values need to be addressed before data analysis.
# Declare a Series including NaN values
s = pd.Series([1, np.nan, 10, 9, -2, np.nan])
s
0 1.0
1 NaN
2 10.0
3 9.0
4 -2.0
5 NaN
dtype: float64
Call the isnull() or notnull() functions to generate boolean values and get the indexes corresponding to the NaN values.
print(s.isnull())
print(s.notnull())
0 False
1 True
2 False
3 False
4 False
5 True
dtype: bool
0 True
1 False
2 True
3 True
4 True
5 False
dtype: bool
Based on the generated boolean values, the Series with only the NaN values and the Series without NaN values can be obtained by the filtering methods mentioned above.
print(s[s.isnull()])
print(s[s.notnull()])
1 NaN
5 NaN
dtype: float64
0 1.0
2 10.0
3 9.0
4 -2.0
dtype: float64
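The Series object also provides the dropna() and fillna() methods to discard or replace missing values directly; a minimal sketch:
print(s.dropna())   # keep only the non-NaN elements
print(s.fillna(0))  # replace each NaN with 0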
3.1.2.7. Operations on Multiple Series
s = pd.Series({"Singapore": 30, "Malaysia": 23, "Vietnam": 36, "Cambodia": 41})
s1 = pd.Series({"China": 51, "Japan": 73, "Vietnam": 36, "Laos": 31})
s * s1
Cambodia NaN
China NaN
Japan NaN
Laos NaN
Malaysia NaN
Singapore NaN
Vietnam 1296.0
dtype: float64
As you can see, the operation is applied only to the elements whose index appears in both Series; all other labels produce NaN.
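If you would rather keep the labels that appear in only one of the two Series, the binary operator methods such as mul() accept a fill_value argument; a minimal sketch that treats a missing operand as 1:
s.mul(s1, fill_value=1)  # labels missing in one Series are multiplied by 1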
3.1.3. DataFrame
Compared with the Series, the DataFrame can contain two-dimensional data. Its first column and first row are the index and the columns, respectively (for a DataFrame without multiple indexes).
| index | numeric | string | boolean |
|---|---|---|---|
| 0 | -1 | Singapore | True |
| 1 | 3 | China | True |
| 2 | 8 | Japan | False |
3.1.3.1. Defining a DataFrame
Call the DataFrame() function to create a DataFrame. An array, a list, and a dict can all be passed as the data argument.
df = pd.DataFrame([["Malaysia", "Kuala Lumpur", 'daodao', False],
["Singapore", "Singapore", 5850342, True],
["Vietnam", "Hanoi", 97338579, True]],
columns=["Country", "Capital", "Population", "Isdeveloped"],
index=["a", "b", "c"])
print(df)
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur daodao False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
# Array
df = pd.DataFrame(
np.array([[14, 35, 35, 35], [19, 34, 57, 34], [42, 74, 49, 59]]))
print(df)
# List, use the 'columns' and 'index' parameters to specify the columns and index of the generated dataframe
df = pd.DataFrame([["Malaysia", "Kuala Lumpur", 32365999, False],
["Singapore", "Singapore", 5850342, True],
["Vietnam", "Hanoi", 97338579, True]],
columns=["Country", "Capital", "Population", "Isdeveloped"],
index=["a", "b", "c"])
print(df)
# dict
df = pd.DataFrame({"Country": ["Malaysia", "Singapore", "Vietnam"],
"Capital": ["Kuala Lumpur", "Singapore", "Hanoi"],
"Population": [32365999, 5850342, 97338579],
"Isdeveloped": [False, True, True]},
index=["a", "b", "c"])
print(df)
0 1 2 3
0 14 35 35 35
1 19 34 57 34
2 42 74 49 59
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
3.1.3.2. Selecting the Internal Elements
Similar to Series, two ways can be used to select elements from a DataFrame: call iloc[] and loc[] to select the elements by position and by label, respectively.
# use ':' to select all rows
df.iloc[:, 0:2]
|   | Country | Capital |
|---|---|---|
| a | Malaysia | Kuala Lumpur |
| b | Singapore | Singapore |
| c | Vietnam | Hanoi |
df.loc[:, "Country":"Population"]
|   | Country | Capital | Population |
|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 |
| b | Singapore | Singapore | 5850342 |
| c | Vietnam | Hanoi | 97338579 |
df.loc["a", ["Country", "Population"]]
Country Malaysia
Population 32365999
Name: a, dtype: object
df.iloc[[0, 1]] # If you omit the columns, all columns are selected
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
Use the columns, index and values attributes to obtain the corresponding object values.
df.index
Index(['a', 'b', 'c'], dtype='object')
df.columns
Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')
df.values
array([['Malaysia', 'Kuala Lumpur', 32365999, False],
['Singapore', 'Singapore', 5850342, True],
['Vietnam', 'Hanoi', 97338579, True]], dtype=object)
Select the corresponding column(s) according to the column labels.
df["Country"]
a Malaysia
b Singapore
c Vietnam
Name: Country, dtype: object
df[["Country", "Population"]] # Use list to select multiple columns
|   | Country | Population |
|---|---|---|
| a | Malaysia | 32365999 |
| b | Singapore | 5850342 |
| c | Vietnam | 97338579 |
df.Country # Columns can also be selected as attributes
a Malaysia
b Singapore
c Vietnam
Name: Country, dtype: object
df["a":"b"] # using label-based slice to select multiple rows
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
df[0:2] # using a position-based slice to select multiple rows
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
3.1.3.3. Assigning Values
df1 = df.copy(True)
df1.loc["c", "Country"] = "Japan"
df1.loc["c", "Capital"] = "Tokyo"
df1.loc["c", "Population"] = 126476461
df1.loc["c", "Isdeveloped"] = True
df1
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
| c | Japan | Tokyo | 126476461 | True |
df1.loc["c"] = ["Japan", "Tokyo", 126476461, True]
df
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
| c | Vietnam | Hanoi | 97338579 | True |
3.1.3.4. Assigning the index, columns, and names of the index and columns
df1.index = ["e", "f", "g"]
df1.index.name = "label"
df1.columns.name = "attributes"
df1.columns = ["Coun", "Cap", "Pop", "ID"]
df1
|   | Coun | Cap | Pop | ID |
|---|---|---|---|---|
| label |  |  |  |  |
| e | Malaysia | Kuala Lumpur | 32365999 | False |
| f | Singapore | Singapore | 5850342 | True |
| g | Japan | Tokyo | 126476461 | True |
3.1.3.5. Deleting Columns from a DataFrame
del df1["ID"]
# axis=1 or 'columns' means delete columns
df1.drop(["Coun"], axis='columns', inplace=True)
# axis=0 or 'rows' means delete rows
df1.drop(["e"], axis='rows', inplace=True)
df1
|   | Cap | Pop |
|---|---|---|
| label |  |  |
| f | Singapore | 5850342 |
| g | Tokyo | 126476461 |
When inplace=True is passed, the original DataFrame is modified in place and the operation returns None. When inplace=False is passed (this is the default value, so it isn't necessary to specify), the function leaves the original untouched and returns a copy of the modified DataFrame.
# inplace=True
df2 = df.copy(True)
print(df2)
print("----")
df2_return = df2.drop(["Country"], axis='columns', inplace=True)
print(df2)
print("----")
print(df2_return)
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
----
Capital Population Isdeveloped
a Kuala Lumpur 32365999 False
b Singapore 5850342 True
c Hanoi 97338579 True
----
None
# inplace=False
df3 = df.copy(True)
print(df3)
print("----")
df3_return = df3.drop(["Country"], axis='columns', inplace=False)
print(df3)
print("----")
print(df3_return)
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
----
Country Capital Population Isdeveloped
a Malaysia Kuala Lumpur 32365999 False
b Singapore Singapore 5850342 True
c Vietnam Hanoi 97338579 True
----
Capital Population Isdeveloped
a Kuala Lumpur 32365999 False
b Singapore 5850342 True
c Hanoi 97338579 True
3.1.3.6. Filtering
Filtering works the same way as for the Series mentioned above.
df2 = pd.DataFrame(np.array([[14, 35, 35, 35],
[19, 34, 57, 34],
[42, 74, 49, 59]]))
# filter values less than 30
df2[df2 < 30]
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 14.0 | NaN | NaN | NaN |
| 1 | 19.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
# Filtering according to conditions on one column
df[df["Population"] < 50000000]
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
You can filter the DataFrame according to conditions on multiple columns like the following:
df[(df["Population"] < 50000000) & (df["Isdeveloped"] == True)]
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| b | Singapore | Singapore | 5850342 | True |
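The same multi-column filter can also be written as a string expression via the query() method; a minimal sketch:
df.query("Population < 50000000 and Isdeveloped == True")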
3.1.3.7. Transposition of a DataFrame
Similar to NumPy arrays, a DataFrame can be transposed: the columns are interchanged with the index.
df1 = df.T
df1
|   | a | b | c |
|---|---|---|---|
| Country | Malaysia | Singapore | Vietnam |
| Capital | Kuala Lumpur | Singapore | Hanoi |
| Population | 32365999 | 5850342 | 97338579 |
| Isdeveloped | False | True | True |
df1.index
Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')
df1.columns
Index(['a', 'b', 'c'], dtype='object')
3.1.3.8. Merging DataFrames
Call the concat() function to merge multiple DataFrames (the older append() method served a similar purpose but was removed in pandas 2.0).
df1 = pd.DataFrame(np.random.rand(3,4))
df2 = pd.DataFrame(np.random.rand(3,4))
df3 = pd.DataFrame(np.random.rand(6,4))
df4 = pd.DataFrame(np.random.rand(3,6))
# Vertical merging by default.
pd.concat([df1, df2, df3, df4])
|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.711833 | 0.210372 | 0.623283 | 0.227076 | NaN | NaN |
| 1 | 0.187854 | 0.003110 | 0.579188 | 0.457127 | NaN | NaN |
| 2 | 0.577679 | 0.240997 | 0.406679 | 0.252861 | NaN | NaN |
| 0 | 0.154052 | 0.299850 | 0.697615 | 0.432677 | NaN | NaN |
| 1 | 0.808326 | 0.812322 | 0.530895 | 0.383379 | NaN | NaN |
| 2 | 0.253567 | 0.132441 | 0.423534 | 0.081832 | NaN | NaN |
| 0 | 0.001646 | 0.384060 | 0.703795 | 0.574561 | NaN | NaN |
| 1 | 0.395984 | 0.427264 | 0.151421 | 0.405376 | NaN | NaN |
| 2 | 0.764523 | 0.582966 | 0.418316 | 0.405842 | NaN | NaN |
| 3 | 0.004663 | 0.723450 | 0.494632 | 0.038501 | NaN | NaN |
| 4 | 0.281056 | 0.957538 | 0.921892 | 0.813628 | NaN | NaN |
| 5 | 0.320597 | 0.733772 | 0.371999 | 0.385703 | NaN | NaN |
| 0 | 0.154081 | 0.741327 | 0.332735 | 0.630257 | 0.277065 | 0.771409 |
| 1 | 0.034194 | 0.445515 | 0.499797 | 0.853322 | 0.877335 | 0.729658 |
| 2 | 0.381277 | 0.008222 | 0.563647 | 0.729580 | 0.993268 | 0.242828 |
Note
As you can see, when the shapes of the DataFrames don't match, the blank positions are filled with NaN values.
Conduct horizontal merging by specifying the axis argument as 'columns' or 1.
pd.concat([df1, df2, df3, df4], axis='columns')
|   | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.711833 | 0.210372 | 0.623283 | 0.227076 | 0.154052 | 0.299850 | 0.697615 | 0.432677 | 0.001646 | 0.384060 | 0.703795 | 0.574561 | 0.154081 | 0.741327 | 0.332735 | 0.630257 | 0.277065 | 0.771409 |
| 1 | 0.187854 | 0.003110 | 0.579188 | 0.457127 | 0.808326 | 0.812322 | 0.530895 | 0.383379 | 0.395984 | 0.427264 | 0.151421 | 0.405376 | 0.034194 | 0.445515 | 0.499797 | 0.853322 | 0.877335 | 0.729658 |
| 2 | 0.577679 | 0.240997 | 0.406679 | 0.252861 | 0.253567 | 0.132441 | 0.423534 | 0.081832 | 0.764523 | 0.582966 | 0.418316 | 0.405842 | 0.381277 | 0.008222 | 0.563647 | 0.729580 | 0.993268 | 0.242828 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.004663 | 0.723450 | 0.494632 | 0.038501 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.281056 | 0.957538 | 0.921892 | 0.813628 | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.320597 | 0.733772 | 0.371999 | 0.385703 | NaN | NaN | NaN | NaN | NaN | NaN |
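Besides concatenation, pandas also supports database-style joins on key columns through the merge() function; a minimal sketch with a small hypothetical pair of tables:
left = pd.DataFrame({"Country": ["Singapore", "Vietnam"], "Population": [5850342, 97338579]})
right = pd.DataFrame({"Country": ["Singapore", "Vietnam"], "Capital": ["Singapore", "Hanoi"]})
pd.merge(left, right, on="Country")  # one row per key present in both tables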
3.1.3.9. View data
df1 = pd.DataFrame(np.random.rand(100,4))
df1.head(2)
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.577690 | 0.626418 | 0.165206 | 0.650919 |
| 1 | 0.104072 | 0.214700 | 0.576407 | 0.095773 |
df1.tail(3)
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 97 | 0.224167 | 0.586648 | 0.824226 | 0.704023 |
| 98 | 0.668144 | 0.138919 | 0.423611 | 0.079042 |
| 99 | 0.832823 | 0.959519 | 0.641955 | 0.541513 |
3.1.3.10. Computational tools
Compute the covariance and Pearson correlation coefficients between different columns.
df1 = pd.DataFrame(np.random.rand(5, 5), index=['i1', 'i2', 'i3', 'i4', 'i5'],
columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df1.cov()
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| c1 | 0.025785 | -0.005152 | 0.003585 | -0.025674 | 0.034285 |
| c2 | -0.005152 | 0.028833 | 0.011865 | 0.033059 | -0.018142 |
| c3 | 0.003585 | 0.011865 | 0.027056 | 0.026296 | 0.022413 |
| c4 | -0.025674 | 0.033059 | 0.026296 | 0.083990 | -0.012309 |
| c5 | 0.034285 | -0.018142 | 0.022413 | -0.012309 | 0.087803 |
df1.corr() # method = pearson (default), optional: kendall, spearman
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| c1 | 1.000000 | -0.188938 | 0.135723 | -0.551697 | 0.720551 |
| c2 | -0.188938 | 1.000000 | 0.424791 | 0.671776 | -0.360566 |
| c3 | 0.135723 | 0.424791 | 1.000000 | 0.551615 | 0.459855 |
| c4 | -0.551697 | 0.671776 | 0.551615 | 1.000000 | -0.143334 |
| c5 | 0.720551 | -0.360566 | 0.459855 | -0.143334 | 1.000000 |
df1.corr(method='kendall')
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| c1 | 1.0 | -0.2 | -0.2 | -0.4 | 0.6 |
| c2 | -0.2 | 1.0 | 0.2 | 0.4 | -0.2 |
| c3 | -0.2 | 0.2 | 1.0 | 0.4 | 0.2 |
| c4 | -0.4 | 0.4 | 0.4 | 1.0 | 0.0 |
| c5 | 0.6 | -0.2 | 0.2 | 0.0 | 1.0 |
Compute the average value, maximum, minimum and sum of different columns or rows.
# compute average value of each column by default.
df1.mean()
c1 0.626051
c2 0.241862
c3 0.363783
c4 0.667357
c5 0.685979
dtype: float64
Compute the sum of each row by specifying the axis argument as 'columns' or 1.
df1.sum(axis=1)
i1 2.505459
i2 3.600801
i3 2.429741
i4 2.517477
i5 1.871680
dtype: float64
Display a summary of the characteristics of the DataFrame:
df1.describe()
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| count | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 5.000000 |
| mean | 0.626051 | 0.241862 | 0.363783 | 0.667357 | 0.685979 |
| std | 0.160576 | 0.169804 | 0.164487 | 0.289810 | 0.296316 |
| min | 0.454519 | 0.084028 | 0.215912 | 0.346318 | 0.319171 |
| 25% | 0.479238 | 0.089982 | 0.270709 | 0.372827 | 0.433839 |
| 50% | 0.625463 | 0.195351 | 0.299169 | 0.789868 | 0.763618 |
| 75% | 0.759993 | 0.390465 | 0.400726 | 0.844656 | 0.939979 |
| max | 0.811040 | 0.449487 | 0.632401 | 0.983113 | 0.973286 |
3.1.3.11. Data ranking
df1 = df.copy(deep=True)
df1.sort_values(by=['Population', 'Country'], ascending=False, na_position='first')
|   | Country | Capital | Population | Isdeveloped |
|---|---|---|---|---|
| c | Vietnam | Hanoi | 97338579 | True |
| a | Malaysia | Kuala Lumpur | 32365999 | False |
| b | Singapore | Singapore | 5850342 | True |
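If you need the rank of each value rather than a reordered table, the rank() method assigns each element its position in the sorted order; a minimal sketch:
df1["Population"].rank(ascending=False)  # 1.0 marks the largest population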
3.1.4. NaN value
df1 = pd.DataFrame(np.random.rand(5, 5), index=['i1', 'i2', 'i3', 'i4', 'i5'],
columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df1.iloc[0, 1] = np.nan
df1.iloc[2, 2] = np.nan
df1.iloc[3, 1] = np.nan
df1.iloc[3, 3] = np.nan
df1
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| i1 | 0.994022 | NaN | 0.896517 | 0.415380 | 0.943970 |
| i2 | 0.152851 | 0.114647 | 0.288784 | 0.386462 | 0.741946 |
| i3 | 0.296382 | 0.914272 | NaN | 0.955293 | 0.578226 |
| i4 | 0.313830 | NaN | 0.993494 | NaN | 0.135778 |
| i5 | 0.449799 | 0.524310 | 0.967581 | 0.419476 | 0.208853 |
# detecting NaN values
print(df1.isnull())
print(df1.notnull())
print(df1.isna())
c1 c2 c3 c4 c5
i1 False True False False False
i2 False False False False False
i3 False False True False False
i4 False True False True False
i5 False False False False False
c1 c2 c3 c4 c5
i1 True False True True True
i2 True True True True True
i3 True True False True True
i4 True False True False True
i5 True True True True True
c1 c2 c3 c4 c5
i1 False True False False False
i2 False False False False False
i3 False False True False False
i4 False True False True False
i5 False False False False False
After detecting the NaN values with boolean masks, you can easily get the number of NaN values.
# False counts as 0, True counts as 1
df1.isnull().sum(axis=1)
i1 1
i2 0
i3 1
i4 2
i5 0
dtype: int64
Fill the NaN values with a specific value or a value generated by some rule.
# fill NaN values using a specific value
df1.fillna(value=0)
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| i1 | 0.994022 | 0.000000 | 0.896517 | 0.415380 | 0.943970 |
| i2 | 0.152851 | 0.114647 | 0.288784 | 0.386462 | 0.741946 |
| i3 | 0.296382 | 0.914272 | 0.000000 | 0.955293 | 0.578226 |
| i4 | 0.313830 | 0.000000 | 0.993494 | 0.000000 | 0.135778 |
| i5 | 0.449799 | 0.524310 | 0.967581 | 0.419476 | 0.208853 |
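The filling value can also be generated by a rule, for example each column's mean, or linear interpolation between neighbouring values; a minimal sketch:
print(df1.fillna(value=df1.mean()))  # fill each NaN with the mean of its column
print(df1.interpolate())             # linear interpolation along each column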
# drop NaN values
# 'any': if any NA values are present, drop that row or column.
# 'all': if all values are NA, drop that row or column.
# axis=0, or 'index': drop rows which contain missing values.
# axis=1, or 'columns': drop columns which contain missing values.
df1.dropna(axis="index", how="any")
|   | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| i2 | 0.152851 | 0.114647 | 0.288784 | 0.386462 | 0.741946 |
| i5 | 0.449799 | 0.524310 | 0.967581 | 0.419476 | 0.208853 |
3.1.5. Date index
A date index is very useful for dealing with time series. You can create a date index with the date_range function. The date index mentioned here is usually discrete with equal intervals, and it usually has three features: a starting date, an end date, and a frequency (or interval).
dti = pd.date_range("2018-01-01", periods=3, freq="H")
print(dti)
dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
print(dti)
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
'2018-01-01 02:00:00'],
dtype='datetime64[ns]', freq='H')
DatetimeIndex(['2021-09-28 00:00:00', '2021-09-28 10:00:00',
'2021-09-28 20:00:00', '2021-09-29 06:00:00',
'2021-09-29 16:00:00'],
dtype='datetime64[ns]', freq='10H')
Manipulating and converting datetimes with timezone information:
dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
dti = dti.tz_localize("UTC")
dti
DatetimeIndex(['2021-09-28 00:00:00+00:00', '2021-09-28 10:00:00+00:00',
'2021-09-28 20:00:00+00:00', '2021-09-29 06:00:00+00:00',
'2021-09-29 16:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='10H')
dti = pd.date_range(start="2021-09-28", end="2021-09-30", freq="10H")
dti = dti.tz_localize("Asia/Singapore")
dti
DatetimeIndex(['2021-09-28 00:00:00+08:00', '2021-09-28 10:00:00+08:00',
'2021-09-28 20:00:00+08:00', '2021-09-29 06:00:00+08:00',
'2021-09-29 16:00:00+08:00'],
dtype='datetime64[ns, Asia/Singapore]', freq=None)
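Once the index is timezone-aware, the tz_convert() method translates the timestamps into another timezone; a minimal sketch:
dti.tz_convert("UTC")  # same instants, expressed in UTC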
Using the origin option, one can specify an alternative starting point for the creation of a DatetimeIndex. For example, to use 1900-01-01 00:00:00 as the starting time and hour as the unit period length:
pd.to_datetime([100, 101, 102], unit="h", origin=pd.Timestamp("1900-01-01 00:00:00"))
DatetimeIndex(['1900-01-05 04:00:00', '1900-01-05 05:00:00',
'1900-01-05 06:00:00'],
dtype='datetime64[ns]', freq=None)
Supported units are D: day, h: hour, m: minute, and s: second.
Tip
The time labels of climate products are usually given as a start time point and a discrete time interval, and are then represented by a column of integers. In this case, we can use the above approach to construct the time labels.
3.1.6. Upsampling and Downsampling
Upsampling: Increase the frequency of the samples by interpolation, such as from minutes to seconds.
Downsampling: Decrease the frequency of the samples by aggregation, such as from months to years.
# prepare the data; reading files with read_csv is introduced in the Input/Output section below
# Data Source: http://www.weather.gov.sg/climate-historical-daily/
data = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0, parse_dates=True)
data.head()
|   | Daily Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-01-01 | 0.0 |
| 1981-01-02 | 0.0 |
| 1981-01-03 | 0.0 |
| 1981-01-04 | 0.0 |
| 1981-01-05 | 0.0 |
# Downsampling: Convert daily data to yearly data by sum or max
df = data.copy(deep=True)
dfsum = df.resample("Y").sum()
dfsum.columns = ["Yearly Rainfall Total (mm)"]
dfsum.head()
|   | Yearly Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 1336.3 |
| 1982-12-31 | 1581.7 |
| 1983-12-31 | 1866.5 |
| 1984-12-31 | 2686.7 |
| 1985-12-31 | 1483.9 |
dfmax = df.resample("Y").max()
dfmax.columns = ["Yearly Rainfall Maximum (mm)"]
dfmax.head()
|   | Yearly Rainfall Maximum (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 71.5 |
| 1982-12-31 | 109.0 |
| 1983-12-31 | 181.8 |
| 1984-12-31 | 154.4 |
| 1985-12-31 | 86.8 |
# Upsampling: Convert yearly data to 10-day data
# by taking values directly (asfreq) or by forward filling (pad/ffill)
dfmax.resample('10D').asfreq()[0:5]
|   | Yearly Rainfall Maximum (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 71.5 |
| 1982-01-10 | NaN |
| 1982-01-20 | NaN |
| 1982-01-30 | NaN |
| 1982-02-09 | NaN |
dfmax.resample('10D').ffill()[0:5]
|   | Yearly Rainfall Maximum (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 71.5 |
| 1982-01-10 | 71.5 |
| 1982-01-20 | 71.5 |
| 1982-01-30 | 71.5 |
| 1982-02-09 | 71.5 |
dfmax.resample('D').ffill(limit=2)[0:5]
|   | Yearly Rainfall Maximum (mm) |
|---|---|
| Date |   |
| 1981-12-31 | 71.5 |
| 1982-01-01 | 71.5 |
| 1982-01-02 | 71.5 |
| 1982-01-03 | NaN |
| 1982-01-04 | NaN |
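As mentioned above, upsampling can also fill the new time steps by interpolation instead of forward filling; a minimal sketch:
dfmax.resample('10D').interpolate()[0:5]  # linearly interpolate between the yearly values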
3.1.7. Grouping a DataFrame
You can group a DataFrame using a mapper or by a Series of columns via the groupby function.
# Calculate average and maximum wind speed of each station
df = pd.DataFrame({'Station': ['Changi', 'Changi',
'Climenti', 'Climenti'],
'Wind Speed': [10., 5., 6., 16.]})
print(df.groupby(['Station']).mean())
print(df.groupby(['Station']).max())
Wind Speed
Station
Changi 7.5
Climenti 11.0
Wind Speed
Station
Changi 10.0
Climenti 16.0
You can also apply a custom function to manipulate the data of each group via the .apply() function.
# custom function
df = pd.DataFrame({'Station': ['Changi', 'Changi', 'Changi',
'Climenti', 'Climenti', 'Climenti'],
'Wind Speed': [10., 5., 12, 6., 16., 20]})
def groupNormalization(w):
return (w-w.min())/(w.max()-w.min())
df.groupby(['Station']).apply(groupNormalization)
|   |   | Wind Speed |
|---|---|---|
| Station |   |   |
| Changi | 0 | 0.714286 |
|   | 1 | 0.000000 |
|   | 2 | 1.000000 |
| Climenti | 3 | 0.000000 |
|   | 4 | 0.714286 |
|   | 5 | 1.000000 |
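You can also compute several statistics per group in one call with the agg() function; a minimal sketch:
df.groupby(['Station'])['Wind Speed'].agg(['mean', 'max', 'min'])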
3.1.8. Input/Output of data
Typically, you can read data from files ending with ".xlsx" and ".csv" using the read_excel() and read_csv() functions, respectively. The index_col and header arguments are used to specify the column and row to be used as the index and columns of the dataframe. You can also set parse_dates to True to parse the index in the date format. If your date format is uncommon, you can specify it as a strftime string via the date_format argument (in older pandas versions, the date_parser argument takes a function for converting a sequence of string columns to an array of datetime instances).
df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0,
parse_dates=True)
df.head()
|   | Daily Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-01-01 | 0.0 |
| 1981-01-02 | 0.0 |
| 1981-01-03 | 0.0 |
| 1981-01-04 | 0.0 |
| 1981-01-05 | 0.0 |
df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0,
parse_dates=True, date_format='%Y-%m-%d')
df.head()
|   | Daily Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-01-01 | 0.0 |
| 1981-01-02 | 0.0 |
| 1981-01-03 | 0.0 |
| 1981-01-04 | 0.0 |
| 1981-01-05 | 0.0 |
You can also specify the parse_dates argument as a list to parse specific column(s) in the date format.
df = pd.read_csv('../../assets/data/Changi_daily_rainfall.csv', index_col=0, header=0,
parse_dates=[0])
df.head()
|   | Daily Rainfall Total (mm) |
|---|---|
| Date |   |
| 1981-01-01 | 0.0 |
| 1981-01-02 | 0.0 |
| 1981-01-03 | 0.0 |
| 1981-01-04 | 0.0 |
| 1981-01-05 | 0.0 |
The usage of the read_excel() function is the same as that of read_csv().
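Output works the other way around: the to_csv() and to_excel() methods write a DataFrame to file; a minimal sketch with a hypothetical output path (exporting to ".xlsx" additionally requires the openpyxl package):
df.to_csv('rainfall_copy.csv')  # hypothetical path; the index is written by default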
Tip
- Series objects can be regarded as special DataFrame objects. Actually, when you extract one column from a DataFrame, it will be converted to a Series.
- The DataFrame object has many of the same usages as the NumPy array, and you can learn it by analogy.