2. Getting Started (official)

import pandas as pd
import matplotlib.pyplot as plt

2.1. Pandas 處理哪種 data?

  • 處理 tabular data ,如下:

2.1.1. DataFrame

  • 建立 DataFrame 的方式,由 dictionary 來處理:

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

print(df)
print(type(df))
                       Name  Age     Sex
0   Braund, Mr. Owen Harris   22    male
1  Allen, Mr. William Henry   35    male
2  Bonnell, Miss. Elizabeth   58  female
<class 'pandas.core.frame.DataFrame'>
  • 這其實就和 excel 一樣:

2.1.2. Series

  • 每一個 column,都是一個 Series

  • 要注意,series還是帶有 row index。例如:

df["Age"]
0    22
1    35
2    58
Name: Age, dtype: int64
  • 如果,你想自己建立一個 series,可以這樣做:

ages = pd.Series([22, 35, 58], name="Age")
ages
0    22
1    35
2    58
Name: Age, dtype: int64

2.1.3. 一些基礎 methods

  • 對 Series,我們可以進行 numpy 的那些常見 function,例如:

df["Age"].max()
58
  • 對 DataFrame,我們可以看一下連續型欄位的 basic statistics

df.describe()
Age
count 3.000000
mean 38.333333
std 18.230012
min 22.000000
25% 28.500000
50% 35.000000
75% 46.500000
max 58.000000

2.2. 如何 讀寫 tabular data?

  • 如上圖, read_* 就讀進來,用 to_* 就寫出去

2.2.1. csv

titanic = pd.read_csv("data/titanic.csv")
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
titanic.tail()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

2.2.2. excel

  • 記得要先安裝 openpyxl ,才能順利讀寫

  • 我們可以把剛剛的 titanic DataFrame 先寫成 excel

titanic.to_excel("data/titanic.xlsx", sheet_name="passengers", index=False)
  • 然後,我們把他讀進來看看

titanic = pd.read_excel("data/titanic.xlsx", sheet_name="passengers")
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
  • 看一下這張 table 的欄位摘要:

titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
  • 第一列告訴你他的型別是 DataFrame.

  • 第二列告訴你他的 row index 從 0 ~ 890 (共 891 個 row)

  • 第三列告訴你有 12 個 column.

  • 第四列開始,摘要每個欄位的資訊.

    • Non-Null Count 可以讓你看到大部分的 column都沒有 missing (891 non-null),但 Age, Cabin, Embarked 有 missing

    • Dtype 可以讓你看到每個 column 的 type。object 的意思,就是文字型/類別型資料; int64 是整數型資料,float64 是real number型資料

  • 倒數第二列,幫你摘要變數的 type,對統計分析來說,就知道數值型資料有 7 個 (float64 + int64),類別型有 5 個 (object)

  • 最後一列告訴你 memory usage 是 84 kb 左右

2.3. 如何 select/filter

2.3.1. select 特定 column

age_sex = titanic[["Age", "Sex"]]
age_sex.head()
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male

2.3.2. filter 特定 row

above_35 = titanic[titanic["Age"] > 35]
above_35.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
age_no_na = titanic[titanic["Age"].notna()]
age_no_na.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

2.3.3. select + filter

adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
adult_names.head()
1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                             Bonnell, Miss. Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object
titanic.iloc[9:25, 2:5]
Pclass Name Sex
9 2 Nasser, Mrs. Nicholas (Adele Achem) female
10 3 Sandstrom, Miss. Marguerite Rut female
11 1 Bonnell, Miss. Elizabeth female
12 3 Saundercock, Mr. William Henry male
13 3 Andersson, Mr. Anders Johan male
14 3 Vestrom, Miss. Hulda Amanda Adolfina female
15 2 Hewlett, Mrs. (Mary D Kingcome) female
16 3 Rice, Master. Eugene male
17 2 Williams, Mr. Charles Eugene male
18 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female
19 3 Masselmani, Mrs. Fatima female
20 2 Fynney, Mr. Joseph J male
21 2 Beesley, Mr. Lawrence male
22 3 McGowan, Miss. Anna "Annie" female
23 1 Sloper, Mr. William Thompson male
24 3 Palsson, Miss. Torborg Danira female

2.4. 如何畫圖

  • 這一章,我們拿 air quality 的資料集來舉例

air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)
air_quality.head()
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN

  • 在 pandas 中,只要做 DataFrame.plot.* 就可以畫圖,這個星號包括以下幾種:

    • df.plot(): 對每個 column 畫 line plot.

    • series.plot(): 對這個serieis 畫 line plot

    • df.plot.scatter(x,y): x-y 散布圖

    • df.plot.box()

    • df.plot.hist()

    • df.plot.bar().

    • df.plot.line()

    • df.plot.kde()

    • df.plot.density()

  • 如果想知道到底可以畫哪些圖,可以用 df.plot.<tab> 就可以知道有哪些 method 可以用

  • 這邊條列所有可用的 method 如下:

[method_name for method_name in dir(air_quality.plot) if not method_name.startswith("_")]
['area',
 'bar',
 'barh',
 'box',
 'density',
 'hexbin',
 'hist',
 'kde',
 'line',
 'pie',
 'scatter']

2.4.1. 對整個 df 畫圖 (各欄位的line plot)

  • 我如果直接用 air_quality.plot(),那預設的作法是:對每一個 column 都去畫 line plot

  • 所以以這個資料集為例,就會畫出3條 time-series plot

air_quality.plot();
../../_images/getting_started_58_0.png

2.4.2. 對某個 series 畫圖 (該series line plot)

air_quality["station_paris"].plot();
../../_images/getting_started_60_0.png

2.4.3. Scatter plot

air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5);
../../_images/getting_started_62_0.png

2.4.4. Box plot

air_quality.plot.box(); # 對每個 column 畫圖
../../_images/getting_started_64_0.png

2.4.5. Area plot

air_quality.plot.area();
../../_images/getting_started_66_0.png

2.4.6. subplot

  • 如果我想做成 subplot,可以這樣做:

air_quality.plot.area(subplots = True);
../../_images/getting_started_69_0.png

2.4.7. 更多客製化

  • 如果要做更多客製化,那就要用 matplotlib 的 oop 的寫法

fig, axs = plt.subplots(figsize=(12, 4));
air_quality.plot.area(ax=axs); # 建立連結,pandas畫完的圖,本來就是 matplotlib 物件,現在告訴他我要把這物件更新到外面的 axs
axs.set_ylabel("NO$_2$ concentration");
# fig.savefig("no2_concentrations.png")
../../_images/getting_started_72_0.png

2.4.8. 更多細節

2.5. 如何新增 column

2.5.1. 某個 column 乘上一個常數

  • 如上圖,我想新增 column,我可以這樣做:

air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
air_quality.head()
station_antwerp station_paris station_london london_mg_per_cubic
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286
2019-05-07 03:00:00 50.5 25.0 19.0 35.758
2019-05-07 04:00:00 45.0 27.7 19.0 35.758
2019-05-07 05:00:00 NaN 50.4 16.0 30.112
2019-05-07 06:00:00 NaN 61.9 NaN NaN

2.5.2. 多個 column 間的運算

  • 那如果是像上圖,我要用兩個欄位來計算出新欄位,我可以這樣做:

air_quality["ratio_paris_antwerp"] = (
    air_quality["station_paris"] / air_quality["station_antwerp"]
)

air_quality.head()
station_antwerp station_paris station_london london_mg_per_cubic ratio_paris_antwerp
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN

2.5.3. rename

  • 要對 column name 做 rename 的話,可以這樣做

air_quality_renamed = air_quality.rename(
    columns={
        "station_antwerp": "BETR801",
        "station_paris": "FR04014",
        "station_london": "London Westminster",
    }
)

air_quality_renamed.head()
BETR801 FR04014 London Westminster london_mg_per_cubic ratio_paris_antwerp
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN
  • 我也可以用 function,把 column name 都轉小寫:

air_quality_renamed = air_quality_renamed.rename(columns=str.lower)
air_quality_renamed.head()
betr801 fr04014 london westminster london_mg_per_cubic ratio_paris_antwerp
datetime
2019-05-07 02:00:00 NaN NaN 23.0 43.286 NaN
2019-05-07 03:00:00 50.5 25.0 19.0 35.758 0.495050
2019-05-07 04:00:00 45.0 27.7 19.0 35.758 0.615556
2019-05-07 05:00:00 NaN 50.4 16.0 30.112 NaN
2019-05-07 06:00:00 NaN 61.9 NaN NaN NaN

2.6. 如何做 summary statistics

2.6.1. aggregating statistics

titanic["Age"].mean()
29.69911764705882

titanic[["Age", "Fare"]].median()
Age     28.0000
Fare    14.4542
dtype: float64
titanic[["Age", "Fare"]].describe()
Age Fare
count 714.000000 891.000000
mean 29.699118 32.204208
std 14.526497 49.693429
min 0.420000 0.000000
25% 20.125000 7.910400
50% 28.000000 14.454200
75% 38.000000 31.000000
max 80.000000 512.329200
titanic.agg(
    {
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    }
)
Age Fare
min 0.420000 0.000000
max 80.000000 512.329200
median 28.000000 14.454200
skew 0.389108 NaN
mean NaN 32.204208

2.6.2. by group

(
    titanic
    .groupby("Sex")
    ["Age"]
    .mean()
)
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

titanic.groupby("Sex").mean() # 對所有 numeric column 取 mean
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64

2.6.3. count number

  • 如果我只是想看某個類別變數 (e.g. Pclass) 的次數分配,那可以這樣做:

titanic["Pclass"].value_counts()
3    491
1    216
2    184
Name: Pclass, dtype: int64
  • 那其實我也可以這樣做:

titanic.groupby("Pclass")["Pclass"].count()
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64
  • 上面的過程,就如下圖:

2.7. 如何 reshape

2.7.1. Long to wide (pivot)(R的pivot wider)

  • 我們來看一下 long data 的範例 (就是 stack data 啦)

air_quality = pd.read_csv(
    "data/air_quality_long.csv", parse_dates=True
)
air_quality = air_quality[air_quality["parameter"]=="no2"]
air_quality.sort_values(["country","city","date.utc"])
city country date.utc location parameter value unit
3663 Antwerpen BE 2019-04-09 01:00:00+00:00 BETR801 no2 22.5 µg/m³
3662 Antwerpen BE 2019-04-09 02:00:00+00:00 BETR801 no2 53.5 µg/m³
3661 Antwerpen BE 2019-04-09 03:00:00+00:00 BETR801 no2 54.5 µg/m³
3660 Antwerpen BE 2019-04-09 04:00:00+00:00 BETR801 no2 34.5 µg/m³
3659 Antwerpen BE 2019-04-09 05:00:00+00:00 BETR801 no2 46.5 µg/m³
... ... ... ... ... ... ... ...
3668 London GB 2019-06-17 07:00:00+00:00 London Westminster no2 13.0 µg/m³
3667 London GB 2019-06-17 08:00:00+00:00 London Westminster no2 13.0 µg/m³
3666 London GB 2019-06-17 09:00:00+00:00 London Westminster no2 11.0 µg/m³
3665 London GB 2019-06-17 10:00:00+00:00 London Westminster no2 11.0 µg/m³
3664 London GB 2019-06-17 11:00:00+00:00 London Westminster no2 11.0 µg/m³

3447 rows × 7 columns

  • 從上表可以看到,每一列的 key 是 country + city + date.utc + location,表示該城市在該時間點的該測站,所測到的數值

  • 那 location 就是被我堆疊起來的變數,我想把 location 來成 column,我可以這樣做

air_quality_wide = air_quality.pivot(
    index = ["city", "country", "date.utc"],
    columns = "location",
    values = "value"
).reset_index() # 如果不 reset_index 的話, city, country, date.utc 會被放在 index
air_quality_wide
location city country date.utc BETR801 FR04014 London Westminster
0 Antwerpen BE 2019-04-09 01:00:00+00:00 22.5 NaN NaN
1 Antwerpen BE 2019-04-09 02:00:00+00:00 53.5 NaN NaN
2 Antwerpen BE 2019-04-09 03:00:00+00:00 54.5 NaN NaN
3 Antwerpen BE 2019-04-09 04:00:00+00:00 34.5 NaN NaN
4 Antwerpen BE 2019-04-09 05:00:00+00:00 46.5 NaN NaN
... ... ... ... ... ... ...
3442 Paris FR 2019-06-20 20:00:00+00:00 NaN 21.4 NaN
3443 Paris FR 2019-06-20 21:00:00+00:00 NaN 24.9 NaN
3444 Paris FR 2019-06-20 22:00:00+00:00 NaN 26.5 NaN
3445 Paris FR 2019-06-20 23:00:00+00:00 NaN 21.8 NaN
3446 Paris FR 2019-06-21 00:00:00+00:00 NaN 20.0 NaN

3447 rows × 6 columns

  • 看說明文件,可以看到更多例子:

air_quality.pivot?
Signature: air_quality.pivot(index=None, columns=None, values=None) -> 'DataFrame'
Docstring:
Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a "pivot" table) based on column values. Uses
unique values from specified `index` / `columns` to form axes of the
resulting DataFrame. This function does not support data
aggregation, multiple values will result in a MultiIndex in the
columns. See the :ref:`User Guide <reshaping>` for more on reshaping.

Parameters
----------
index : str or object or a list of str, optional
    Column to use to make new frame's index. If None, uses
    existing index.

    .. versionchanged:: 1.1.0
       Also accept list of index names.

columns : str or object or a list of str
    Column to use to make new frame's columns.

    .. versionchanged:: 1.1.0
       Also accept list of columns names.

values : str, object or a list of the previous, optional
    Column(s) to use for populating new frame's values. If not
    specified, all remaining columns will be used and the result will
    have hierarchically indexed columns.

Returns
-------
DataFrame
    Returns reshaped DataFrame.

Raises
------
ValueError:
    When there are any `index`, `columns` combinations with multiple
    values. `DataFrame.pivot_table` when you need to aggregate.

See Also
--------
DataFrame.pivot_table : Generalization of pivot that can handle
    duplicate values for one index/column pair.
DataFrame.unstack : Pivot based on the index values instead of a
    column.
wide_to_long : Wide panel to long format. Less flexible but more
    user-friendly than melt.

Notes
-----
For finer-tuned control, see hierarchical indexing documentation along
with the related stack/unstack methods.

Examples
--------
>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
    foo   bar  baz  zoo
0   one   A    1    x
1   one   B    2    y
2   one   C    3    z
3   two   A    4    q
4   two   B    5    w
5   two   C    6    t

>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A   B   C
foo
one  1   2   3
two  4   5   6

>>> df.pivot(index='foo', columns='bar')['baz']
bar  A   B   C
foo
one  1   2   3
two  4   5   6

>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
      baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t

You could also assign a list of column names or a list of index names.

>>> df = pd.DataFrame({
...        "lev1": [1, 1, 1, 2, 2, 2],
...        "lev2": [1, 1, 2, 1, 1, 2],
...        "lev3": [1, 2, 1, 2, 1, 2],
...        "lev4": [1, 2, 3, 4, 5, 6],
...        "values": [0, 1, 2, 3, 4, 5]})
>>> df
    lev1 lev2 lev3 lev4 values
0   1    1    1    1    0
1   1    1    2    2    1
2   1    2    1    3    2
3   2    1    2    4    3
4   2    1    1    5    4
5   2    2    2    6    5

>>> df.pivot(index="lev1", columns=["lev2", "lev3"],values="values")
lev2    1         2
lev3    1    2    1    2
lev1
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0

>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"],values="values")
      lev3    1    2
lev1  lev2
   1     1  0.0  1.0
         2  2.0  NaN
   2     1  4.0  3.0
         2  NaN  5.0

A ValueError is raised if there are any duplicates.

>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]})
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4

Notice that the first two rows are the same for our `index`
and `columns` arguments.

>>> df.pivot(index='foo', columns='bar', values='baz')
Traceback (most recent call last):
   ...
ValueError: Index contains duplicate entries, cannot reshape
File:      /Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/pandas/core/frame.py
Type:      method

2.7.2. wide to long (melt)(R的pivot_longer)

  • 回顧剛剛的 wide data:

air_quality_wide
location city country date.utc BETR801 FR04014 London Westminster
0 Antwerpen BE 2019-04-09 01:00:00+00:00 22.5 NaN NaN
1 Antwerpen BE 2019-04-09 02:00:00+00:00 53.5 NaN NaN
2 Antwerpen BE 2019-04-09 03:00:00+00:00 54.5 NaN NaN
3 Antwerpen BE 2019-04-09 04:00:00+00:00 34.5 NaN NaN
4 Antwerpen BE 2019-04-09 05:00:00+00:00 46.5 NaN NaN
... ... ... ... ... ... ...
3442 Paris FR 2019-06-20 20:00:00+00:00 NaN 21.4 NaN
3443 Paris FR 2019-06-20 21:00:00+00:00 NaN 24.9 NaN
3444 Paris FR 2019-06-20 22:00:00+00:00 NaN 26.5 NaN
3445 Paris FR 2019-06-20 23:00:00+00:00 NaN 21.8 NaN
3446 Paris FR 2019-06-21 00:00:00+00:00 NaN 20.0 NaN

3447 rows × 6 columns

  • 我現在想倒過來,把 BETR801~London 這幾個 column,折下來,那我可以這樣做:

air_quality_long = air_quality_wide.melt(
    id_vars=["city","country", "date.utc"],
    value_vars=["BETR801", "FR04014", "London Westminster"],
    var_name="location", # 轉成新column後的 column name
    value_name="NO_2", # 轉成新 column 後的 value name
)
air_quality_long
city country date.utc location NO_2
0 Antwerpen BE 2019-04-09 01:00:00+00:00 BETR801 22.5
1 Antwerpen BE 2019-04-09 02:00:00+00:00 BETR801 53.5
2 Antwerpen BE 2019-04-09 03:00:00+00:00 BETR801 54.5
3 Antwerpen BE 2019-04-09 04:00:00+00:00 BETR801 34.5
4 Antwerpen BE 2019-04-09 05:00:00+00:00 BETR801 46.5
... ... ... ... ... ...
10336 Paris FR 2019-06-20 20:00:00+00:00 London Westminster NaN
10337 Paris FR 2019-06-20 21:00:00+00:00 London Westminster NaN
10338 Paris FR 2019-06-20 22:00:00+00:00 London Westminster NaN
10339 Paris FR 2019-06-20 23:00:00+00:00 London Westminster NaN
10340 Paris FR 2019-06-21 00:00:00+00:00 London Westminster NaN

10341 rows × 5 columns

2.8. 如何 concat (R 的 bind_rows, bind_cols)

2.8.1. concat (axis = 0) (bind_rows)

df1 = pd.DataFrame({
    "a": [1,2,3],
    "b": [4,5,6]
})
df2 = pd.DataFrame({
    "a": [7,8,9],
    "b": [10,11,12]
})

print(df1)
print(df2)
   a  b
0  1  4
1  2  5
2  3  6
   a   b
0  7  10
1  8  11
2  9  12
pd.concat([df1, df2], axis = 0)
a b
0 1 4
1 2 5
2 3 6
0 7 10
1 8 11
2 9 12

2.9. 如何 merge (R 的 join)

df1 = pd.DataFrame({
    "a": ["A", "A", "B"],
    "b": [4,5,6]
})
df2 = pd.DataFrame({
    "a": ["A","B","C"],
    "c": ["AA","BB","CC"]
})

print(df1)
print(df2)
   a  b
0  A  4
1  A  5
2  B  6
   a   c
0  A  AA
1  B  BB
2  C  CC
df1.merge(df2, how = "left", on = "a")
a b c
0 A 4 AA
1 A 5 AA
2 B 6 BB
  • 如果要merge的名稱不同,例如這樣

df1 = pd.DataFrame({
    "a1": ["A", "A", "B"],
    "b": [4,5,6]
})
df2 = pd.DataFrame({
    "a2": ["A","B","C"],
    "c": ["AA","BB","CC"]
})

print(df1)
print(df2)
  a1  b
0  A  4
1  A  5
2  B  6
  a2   c
0  A  AA
1  B  BB
2  C  CC
df1.merge(df2, how = "left", left_on = "a1", right_on = "a2")
a1 b a2 c
0 A 4 A AA
1 A 5 A AA
2 B 6 B BB

2.10. 如何處理 time-series data

air_quality = pd.read_csv("data/air_quality_no2_long.csv")
air_quality = air_quality.rename(columns={"date.utc": "datetime"})
air_quality.head()
city country datetime location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³
  • 我們首先看一下這筆資料,他的 datetime 欄位,是哪種 type:

air_quality.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   city       2068 non-null   object 
 1   country    2068 non-null   object 
 2   datetime   2068 non-null   object 
 3   location   2068 non-null   object 
 4   parameter  2068 non-null   object 
 5   value      2068 non-null   float64
 6   unit       2068 non-null   object 
dtypes: float64(1), object(6)
memory usage: 113.2+ KB
  • 可以發現, datetime 是 “object”,就是文字/類別的意思,所以我先把他轉為 datetime 格式

air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
air_quality["datetime"]
0      2019-06-21 00:00:00+00:00
1      2019-06-20 23:00:00+00:00
2      2019-06-20 22:00:00+00:00
3      2019-06-20 21:00:00+00:00
4      2019-06-20 20:00:00+00:00
                  ...           
2063   2019-05-07 06:00:00+00:00
2064   2019-05-07 04:00:00+00:00
2065   2019-05-07 03:00:00+00:00
2066   2019-05-07 02:00:00+00:00
2067   2019-05-07 01:00:00+00:00
Name: datetime, Length: 2068, dtype: datetime64[ns, UTC]
  • 可以看到,現在 dtype 是 datetime64 了

  • 那為啥轉格式重要?因為有很多好用的 method 可以用

2.10.1. 最大最小值

  • 就很直觀的,想看時間資料的最大值和最小值:

print(air_quality["datetime"].min())
print(air_quality["datetime"].max())
2019-05-07 01:00:00+00:00
2019-06-21 00:00:00+00:00
  • 可以看到,時間最早是 5/7,最晚是 6/21.

  • 那我還可以看一下時間距離多久?

air_quality["datetime"].max() - air_quality["datetime"].min()
Timedelta('44 days 23:00:00')

2.10.2. 從時間資料中,擷取 年/月/日/星期幾…

  • 要擷取的這些資訊,都是 datetime 這個 series 的 attribute,我們可以這樣取

print("datetime: ", air_quality["datetime"][0])
print("date: ", air_quality["datetime"].dt.date[0])
print("year: ", air_quality["datetime"].dt.year[0])
print("month: ", air_quality["datetime"].dt.month[0])
print("day: ", air_quality["datetime"].dt.day[0])
print("hour: ", air_quality["datetime"].dt.hour[0])
print("minute: ", air_quality["datetime"].dt.minute[0])
print("second: ", air_quality["datetime"].dt.second[0])
print("weekday: ", air_quality["datetime"].dt.weekday[0])
datetime:  2019-06-21 00:00:00+00:00
date:  2019-06-21
year:  2019
month:  6
day:  21
hour:  0
minute:  0
second:  0
weekday:  4
  • 可以看到,我取出 series 後,我還得用 dt 這個 accesor,他才知道我要調用 datetime 的 method,然後後面就直接用 date/year/month…等 attribute

  • 來練習一下吧,我如果想新增一個欄位,是只取出月份,那我可以這樣做:

air_quality["month"] = air_quality["datetime"].dt.month # .dt 是調用datetime的 method/attribute,month看起來是attribute
air_quality.head()
city country datetime location parameter value unit month
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³ 6
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³ 6
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³ 6
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³ 6
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³ 6
  • 如果想取出星期幾(weekday),我可以這樣做:

air_quality["weekday"] = air_quality["datetime"].dt.weekday # .dt 是調用datetime的 method/attribute,month看起來是attribute
air_quality.head()
city country datetime location parameter value unit month weekday
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³ 6 4
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³ 6 3
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³ 6 3
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³ 6 3
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³ 6 3
  • 我如果想看每個location,每個weekday,平均的 NO2 濃度,我就可以這樣做:

air_quality.groupby(["location", "weekday"])["value"].mean()
location            weekday
BETR801             0          27.875000
                    1          22.214286
                    2          21.125000
                    3          27.500000
                    4          28.400000
                    5          33.500000
                    6          21.896552
FR04014             0          24.856250
                    1          30.999359
                    2          29.165753
                    3          28.600690
                    4          31.617986
                    5          25.266154
                    6          23.274306
London Westminster  0          23.969697
                    1          24.885714
                    2          23.460432
                    3          24.780142
                    4          26.446809
                    5          24.977612
                    6          24.859155
Name: value, dtype: float64
  • 我想畫每個小時的平均 NO2 濃度

fig, axs = plt.subplots(figsize=(12, 4))

(
    air_quality
        .assign(hour = lambda df: df.datetime.dt.hour)
        .groupby("hour")["value"]
        .mean()
        .plot(kind='bar', rot=0, ax=axs)
)
<AxesSubplot:xlabel='hour'>
../../_images/getting_started_156_1.png

2.10.3. slicing datetime

  • 我們這邊先把資料做成 wide 的形式:

no_2 = air_quality.pivot(
    index = "datetime",
    columns = "location",
    values = "value"
).reset_index()
no_2.head()
location datetime BETR801 FR04014 London Westminster
0 2019-05-07 01:00:00+00:00 50.5 25.0 23.0
1 2019-05-07 02:00:00+00:00 45.0 27.7 19.0
2 2019-05-07 03:00:00+00:00 NaN 50.4 19.0
3 2019-05-07 04:00:00+00:00 NaN 61.9 16.0
4 2019-05-07 05:00:00+00:00 NaN 72.4 NaN
  • 可以看到,現在有三條時間序列

  • 我如果想取出 “2019-05-20” ~ “2019-05-21” 的資料,我可以這樣做:

no_2[(no_2.datetime >= "2019-05-20") & (no_2.datetime <= "2019-05-21")].head()
location datetime BETR801 FR04014 London Westminster
311 2019-05-20 00:00:00+00:00 26.0 16.4 28.0
312 2019-05-20 01:00:00+00:00 17.0 12.8 28.0
313 2019-05-20 02:00:00+00:00 10.5 12.1 32.0
314 2019-05-20 03:00:00+00:00 9.0 12.6 32.0
315 2019-05-20 04:00:00+00:00 14.0 14.9 26.0
  • 但我還有另外一招,我可以把 datetime 挪去 index,然後直接篩選:

no_2_with_datetime_index = no_2.set_index("datetime")
no_2_with_datetime_index["2019-05-20":"2019-05-21"].head()
location BETR801 FR04014 London Westminster
datetime
2019-05-20 00:00:00+00:00 26.0 16.4 28.0
2019-05-20 01:00:00+00:00 17.0 12.8 28.0
2019-05-20 02:00:00+00:00 10.5 12.1 32.0
2019-05-20 03:00:00+00:00 9.0 12.6 32.0
2019-05-20 04:00:00+00:00 14.0 14.9 26.0
  • 帥吧!接下來,我就可以畫出這三條時間序列:

no_2_with_datetime_index["2019-05-20":"2019-05-21"].plot()
<AxesSubplot:xlabel='datetime'>
../../_images/getting_started_165_1.png

2.10.4. Resample 成不同的 frequency

  • 其實 resample 的語法,就是 groupby 再 aggregate 的 shortcut.

  • 舉個例子就懂了。

  • 我如果想看每天各個location的平均NO2的值 (所以把 ymd_hms 的 frequency 改成 ymd 而已),那我得這樣做:

(
    no_2
        .assign(Date = lambda df: df.datetime.dt.date)
        .groupby("Date")
        .mean()
        .plot(style="-o", figsize=(10, 5))
);
../../_images/getting_started_168_0.png
  • 那我現在可以用這個 resample 的語法,很快做到這件事

(
    no_2
        .set_index("datetime")
        .resample("D")
        .mean()
        .plot(style="-o", figsize=(10, 5))
);
../../_images/getting_started_170_0.png

2.11. 如何 manipulate textual data

  • 這一節要用的 data 是 Titanic

titanic = pd.read_csv("data/titanic.csv")
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

2.11.1. 把某個欄位全轉小寫

  • 很快就可以聯想到 str.lower() 這個 method,所以作法就是:

titanic["Name"].str.lower()
0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

2.11.2. 把某個欄位依照pattern切開

  • 舉例來說,Name 這個欄位,就用 , 來分隔出 first name 和 last name.

  • 所以,我想把它切開來後,分別叫他 first name 和 last name

titanic["split_res"] = titanic["Name"].str.split(",")
titanic[["Name", "split_res"]]
Name split_res
0 Braund, Mr. Owen Harris [Braund, Mr. Owen Harris]
1 Cumings, Mrs. John Bradley (Florence Briggs Th... [Cumings, Mrs. John Bradley (Florence Briggs ...
2 Heikkinen, Miss. Laina [Heikkinen, Miss. Laina]
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 Allen, Mr. William Henry [Allen, Mr. William Henry]
... ... ...
886 Montvila, Rev. Juozas [Montvila, Rev. Juozas]
887 Graham, Miss. Margaret Edith [Graham, Miss. Margaret Edith]
888 Johnston, Miss. Catherine Helen "Carrie" [Johnston, Miss. Catherine Helen "Carrie"]
889 Behr, Mr. Karl Howell [Behr, Mr. Karl Howell]
890 Dooley, Mr. Patrick [Dooley, Mr. Patrick]

891 rows × 2 columns

  • 那如果要再分成 first_name 和 last_name,就得這樣:

titanic["first_name"] = titanic["split_res"].str.get(0) # 取第0個element
titanic["last_name"] = titanic["split_res"].str.get(1) # 取第0個element
titanic[["Name", "split_res", "first_name", "last_name"]]
Name split_res first_name last_name
0 Braund, Mr. Owen Harris [Braund, Mr. Owen Harris] Braund Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th... [Cumings, Mrs. John Bradley (Florence Briggs ... Cumings Mrs. John Bradley (Florence Briggs Thayer)
2 Heikkinen, Miss. Laina [Heikkinen, Miss. Laina] Heikkinen Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) [Futrelle, Mrs. Jacques Heath (Lily May Peel)] Futrelle Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry [Allen, Mr. William Henry] Allen Mr. William Henry
... ... ... ... ...
886 Montvila, Rev. Juozas [Montvila, Rev. Juozas] Montvila Rev. Juozas
887 Graham, Miss. Margaret Edith [Graham, Miss. Margaret Edith] Graham Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie" [Johnston, Miss. Catherine Helen "Carrie"] Johnston Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell [Behr, Mr. Karl Howell] Behr Mr. Karl Howell
890 Dooley, Mr. Patrick [Dooley, Mr. Patrick] Dooley Mr. Patrick

891 rows × 4 columns

2.11.3. 是否包含某字串

  • 如果我想找找看,名字裡面,有出現 Countess 的人,那我可以這樣做:

titanic[titanic["Name"].str.contains("Countess")]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked split_res first_name last_name 中文可以嗎
759 760 1 1 Rothes, the Countess. of (Lucy Noel Martha Dye... female 33.0 0 0 110152 86.5 B77 S [Rothes, the Countess. of (Lucy Noel Martha D... Rothes the Countess. of (Lucy Noel Martha Dyer-Edwards) Rothes

2.11.4. 字串長度

  • 我可以這樣,來造出字長:

titanic["Name"].str.len()
0      23
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name, Length: 891, dtype: int64
  • 所以,我如果想找出名字最長的人,我可以這樣做:

(
    titanic
        .assign(name_length = lambda df: df.Name.str.len())
        .sort_values("name_length", ascending=False)
        [["Name", "name_length"]]
        .head(3) # 前三名
)
Name name_length
307 Penasco y Castellana, Mrs. Victor de Satode (M... 82
427 Phillips, Miss. Kate Florence ("Mrs Kate Louis... 67
556 Duff Gordon, Lady. (Lucille Christiana Sutherl... 65

2.11.5. 字串取代

  • 我如果想把 Sex 這個欄位的 “female”,取代為 “F”,我可以這樣做:

titanic["Sex"].str.replace("female", "F")
0      male
1         F
2         F
3         F
4      male
       ... 
886    male
887       F
888       F
889    male
890    male
Name: Sex, Length: 891, dtype: object
  • 所以,我如果想把 Sex 這個欄位的 “female” 改成 “F”, “male” 改成 “M”,那我可以這樣做:

titanic["Sex_short1"] = titanic["Sex"].str.replace("female", "F")
titanic["Sex_short1"] = titanic["Sex_short1"].str.replace("male", "M")
titanic[["Sex", "Sex_short1"]].head()
Sex Sex_short1
0 male M
1 female F
2 female F
3 female F
4 male M
  • 事實上,如果你是要做這種取代的話,更好的做法是這樣:

titanic["Sex_short2"] = titanic["Sex"].replace({"male": "M", "female": "F"})
titanic[["Sex","Sex_short1","Sex_short2"]].head()
Sex Sex_short1 Sex_short2
0 male M M
1 female F F
2 female F F
3 female F F
4 male M M