11. Linear regression (d2l)¶
Adapted from d2l sections 3.1 ~ 3.3 (linear regression implementation from scratch, and the concise implementation) (link)
In this chapter we will:
First, implement linear regression by hand (from scratch).
Then do the same thing with PyTorch's built-in NN modules.
Compare the hand-written version with the built-in one.
%matplotlib inline
import random
import torch
import numpy as np
import matplotlib.pyplot as plt
# from d2l import torch as d2l
11.1. Generate data¶
The linear model is defined as \(\mathbf{y} = \mathbf{X}\mathbf{w} + b + \boldsymbol{\epsilon}\),
where y is an n-dimensional vector, w is a p-dimensional vector (p is the number of features), X is an n×p matrix, and b is a scalar.
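Written out element-wise with the true parameters used below (w = [2, -3.4], b = 4.2, noise standard deviation 0.01), each observation is \(y_i = 2\,x_{i1} - 3.4\,x_{i2} + 4.2 + \epsilon_i\), with \(\epsilon_i \sim \mathcal{N}(0, 0.01^2)\).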
First, write a function that generates simulated data:
def synthetic_data(w, b, num_examples):  #@save
    p = len(w)  # number of features
    X = torch.normal(0, 1, (num_examples, p))  # matrix of shape (n, p)
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)  # add Gaussian noise with std 0.01
    return X, y.reshape((-1, 1))
Now generate the simulated data, with:
n = 1000, p = 2
w = [2, -3.4], b = 4.2
true_w = torch.tensor([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)
features
tensor([[-0.8164, -0.9453],
[-1.2195, 0.1272],
[ 0.5601, 1.0297],
...,
[ 0.6262, -1.1752],
[ 0.6752, -1.0719],
[ 0.6124, 0.5205]])
labels[:5]
tensor([[5.8100],
[1.3357],
[1.8175],
[5.5317],
[4.9031]])
Plot y against x1 and y against x2 as scatter plots:
plt.scatter(features[:, 0].detach().numpy(), labels.detach().numpy());
plt.axis("off");
plt.scatter(features[:, 1].detach().numpy(), labels.detach().numpy());
plt.axis("off");
11.2. from scratch¶
11.2.1. data iter¶
We want to build an iterator that yields one batch of data at a time when reading the dataset.
A quick refresher on range(0, 10, 3): it runs from 0 up to (but not including) 10 in steps of 3, so the result is [0, 3, 6, 9]; the final step that would go past the end is simply not taken.
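A quick check in code (the 25-example case below is a made-up number, just to illustrate an uneven last batch):
list(range(0, 10, 3))    # [0, 3, 6, 9]
list(range(0, 25, 10))   # [0, 10, 20] -- with batch_size = 10, the last batch would hold only 5 examples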
def my_data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))  # [0, 1, ..., n-1]
    random.shuffle(indices)              # e.g. [4, 1, 25, 0, ...]
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(
            indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]
Let's verify: with batch size = 10, look at what the first batch looks like.
batch_size = 10
custom_data_iter = my_data_iter(batch_size, features, labels)
for X, y in custom_data_iter:
    print(X, '\n', y)
    break
tensor([[ 0.0324, -0.3804],
[-0.2023, 0.1426],
[ 0.0364, 1.5655],
[-1.3299, -0.7282],
[-0.1377, -0.3729],
[ 1.6993, 0.4536],
[-0.4657, -1.1417],
[-1.0266, -0.1638],
[-0.3879, 1.4113],
[ 0.2974, 0.6664]])
tensor([[ 5.5652],
[ 3.3013],
[-1.0554],
[ 4.0234],
[ 5.1900],
[ 6.0698],
[ 7.1442],
[ 2.7238],
[-1.3799],
[ 2.5420]])
11.2.2. Define the model¶
def model(X, w, b):  #@save
    """Linear regression model."""
    return torch.matmul(X, w) + b
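As a quick shape sanity check (a minimal sketch that reuses the features generated above, with zero placeholders for w and b):
# X is (n, p) and w is (p, 1), so torch.matmul(X, w) is (n, 1); the scalar b is broadcast over all n rows.
print(model(features, torch.zeros((2, 1)), torch.zeros(1)).shape)  # torch.Size([1000, 1])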
11.2.3. loss function¶
class MyMSE:
    def __init__(self, reduction="mean"):
        self.reduction = reduction
    def __call__(self, y_hat, y):
        # element-wise squared error; reshape y so the shapes always match
        loss = (y_hat - y.reshape(y_hat.shape)) ** 2
        if self.reduction == "mean":
            return loss.mean()
        elif self.reduction == "sum":
            return loss.sum()
        else:  # "none": return the per-element loss
            return loss
# instance
loss = MyMSE(reduction = "mean")
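A tiny illustrative check of the three reduction modes (the toy tensors below are made up for this example):
toy_y_hat = torch.tensor([[1.0], [2.0]])
toy_y = torch.tensor([[0.0], [4.0]])
print(MyMSE(reduction="none")(toy_y_hat, toy_y))   # tensor([[1.], [4.]])
print(MyMSE(reduction="sum")(toy_y_hat, toy_y))    # tensor(5.)
print(MyMSE(reduction="mean")(toy_y_hat, toy_y))   # tensor(2.5000)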
11.2.4. optimizer¶
def optimizer(params, lr=0.03):  #@save
    """Stochastic gradient descent (SGD) update."""
    with torch.no_grad():  # keep the update step itself out of the autograd graph
        for param in params:
            param -= lr * param.grad
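A minimal usage sketch with a made-up toy parameter: the update happens in place, and torch.no_grad() keeps the update step out of the autograd graph.
toy_w = torch.tensor([1.0], requires_grad=True)
(toy_w * 3).sum().backward()   # d(3w)/dw = 3, so toy_w.grad is tensor([3.])
optimizer([toy_w], lr=0.1)     # in-place update: 1.0 - 0.1 * 3.0 = 0.7
print(toy_w)                   # tensor([0.7000], requires_grad=True)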
11.2.5. training¶
11.2.5.1. Walk through one batch to see what happens¶
# hyper-parameters
lr = 0.03
# data
custom_data_iter = my_data_iter(10, features, labels)
X, y = next(iter(custom_data_iter))
print(X)
print(y)
tensor([[-1.6985, 0.0291],
[-0.0222, 0.0066],
[ 1.1131, -1.0533],
[ 0.7180, 0.5768],
[ 0.0799, -0.5924],
[ 0.1805, 0.9742],
[ 0.3105, 1.2170],
[ 0.3885, 0.6850],
[ 0.8224, -0.2691],
[-0.2953, -1.0634]])
tensor([[0.6720],
[4.1316],
[9.9998],
[3.6875],
[6.3652],
[1.2642],
[0.6865],
[2.6544],
[6.7504],
[7.2257]])
# initialize the parameters
w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
Compare the initialized parameters with the true values:
print(f"estimate w: {w}")
print(f"true w: {true_w}")
print(f"mse of parameter estimation: {((w.reshape(-1)-true_w)**2).sum()}")
estimate w: tensor([[-0.0079],
[-0.0013]], requires_grad=True)
true w: tensor([ 2.0000, -3.4000])
mse of parameter estimation: 15.583215713500977
print(f"estimate b: {b}")
print(f"true b: {true_b}")
print(f"mse of parameter estimation: {((b.item()-true_b)**2)}")
estimate b: tensor([0.], requires_grad=True)
true b: 4.2
mse of parameter estimation: 17.64
As you can see, they are still far apart.
Each parameter has a current gradient attached to it, which you can inspect like this:
print(w.grad)
print(b.grad)
None
None
As you can see, w and b have no gradients yet.
Next, do the first forward pass:
y_hat = model(X, w, b)
batch_cost = loss(y_hat, y)
print(batch_cost)
tensor(25.1611, grad_fn=<MeanBackward0>)
As you can see, the initial cost is quite high: an MSE of about 25.
Next, we do the backward step:
obtain the gradients;
use the gradients to update the parameters.
batch_cost.backward()
print(w.grad)
print(b.grad)
tensor([[-2.3259],
[ 5.6211]])
tensor([-6.3185])
As you can see, after calling backward on the cost, autograd automatically finds the tensors that need gradients (w and b) and fills in the corresponding gradients.
Then we can use the sgd function we wrote to update the parameters:
optimizer([w, b], lr = 0.03)
print(f"estimate w: {w}")
print(f"true w: {true_w}")
print(f"mse of parameter estimation: {((w.reshape(-1)-true_w)**2).sum()}")
estimate w: tensor([[ 0.0619],
[-0.1699]], requires_grad=True)
true w: tensor([ 2.0000, -3.4000])
mse of parameter estimation: 14.190025329589844
print(f"estimate b: {b}")
print(f"true b: {true_b}")
print(f"mse of parameter estimation: {((b.item()-true_b)**2)}")
estimate b: tensor([0.1896], requires_grad=True)
true b: 4.2
mse of parameter estimation: 16.083666673612385
As you can see, after one batch the estimates of w and b have both moved closer to the true values.
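To see exactly what the update did, plug the numbers printed above into param -= lr * grad (values rounded to 4 decimals):
\(w_1: -0.0079 - 0.03 \times (-2.3259) = 0.0619\)
\(w_2: -0.0013 - 0.03 \times 5.6211 = -0.1699\)
\(b: 0 - 0.03 \times (-6.3185) = 0.1896\)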
Finally, one thing to pay attention to:
After every parameter update, the gradients attached to the parameters must be reset to zero, by calling w.grad.zero_() and b.grad.zero_().
The reason is that when we call batch_cost.backward(), what happens under the hood is: first, the gradient of the current cost is computed for every trainable variable; then that gradient is "accumulated" (added) onto the variable's existing gradient.
So if you do not zero the gradients, the new gradient becomes new + old, and the updates go completely wrong.
The PyTorch convention, however, is not to clear the gradients right after updating the weights, but to clear them right before computing backward.
The two approaches look equivalent, but if you continue training (retrain) from an existing model's weights, you cannot be sure the gradients start out at zero. Always zeroing first and then calling backward has therefore become the PyTorch convention.
You will see this in the 3-epoch example shortly, and in the small demonstration right below.
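A small stand-alone demonstration of this accumulation behaviour (the toy variable is made up for this example):
toy = torch.tensor([2.0], requires_grad=True)
(toy ** 2).sum().backward()   # d(x^2)/dx = 2x = 4
print(toy.grad)               # tensor([4.])
(toy ** 2).sum().backward()   # without zeroing first, the new gradient is ADDED
print(toy.grad)               # tensor([8.]), i.e. 4 + 4
toy.grad.zero_()              # reset before the next backward pass
print(toy.grad)               # tensor([0.])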
11.2.5.2. Run 3 epochs¶
# hyper-parameters
lr = 0.03
num_epochs = 3
# initialize the parameters
w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)
params = (w, b)
for epoch in range(num_epochs):
    custom_data_iter = my_data_iter(10, features, labels)
    for X, y in custom_data_iter:
        # forward step
        y_hat = model(X, w, b)
        batch_cost = loss(y_hat, y)
        # zero the gradients (skip parameters that have no gradient yet)
        for param in params:
            if param.grad is not None:
                param.grad.zero_()
        # backward step
        batch_cost.backward()
        optimizer([w, b], lr=0.03)  # update the parameters with their gradients

    # epoch-level training loss on the full dataset
    with torch.no_grad():
        y_hat_all = model(features, w, b)
        epoch_cost = loss(y_hat_all, labels)
        print(f'epoch {epoch + 1}, loss {epoch_cost}')
        # print(f'epoch {epoch + 1}, loss {float(epoch_cost):f}')
epoch 1, loss 0.00040435069240629673
epoch 2, loss 0.00010177450167248026
epoch 3, loss 0.00010079665662487969
As you can see, the parameter estimates are now very close to the true values.
print(f"estimate w: {w}")
print(f"true w: {true_w}")
print(f"mse of parameter estimation: {((w.reshape(-1)-true_w)**2).sum()}")
estimate w: tensor([[ 2.0004],
[-3.3999]], requires_grad=True)
true w: tensor([ 2.0000, -3.4000])
mse of parameter estimation: 1.8516522004574654e-07
print(f"estimate b: {b}")
print(f"true b: {true_b}")
print(f"mse of parameter estimation: {((b.item()-true_b)**2)}")
estimate b: tensor([4.1996], requires_grad=True)
true b: 4.2
mse of parameter estimation: 1.8969775737802604e-07
11.3. Built-in functions¶
11.3.1. data iter¶
from torch.utils.data import TensorDataset, DataLoader
dataset = TensorDataset(features, labels)
data_iter = DataLoader(dataset, batch_size = 10, shuffle = True)
Verify:
for X, y in data_iter:
    print(X, '\n', y)
    break
tensor([[ 0.8684, 1.8096],
[-0.4814, -0.4445],
[-0.3560, -1.1624],
[-1.6850, 0.9945],
[ 2.2541, -0.8971],
[-0.5724, -0.4106],
[-4.1179, -0.5625],
[-0.4651, -1.3560],
[-0.3070, 0.7921],
[ 0.4441, -0.7416]])
tensor([[-0.2113],
[ 4.7432],
[ 7.4426],
[-2.5434],
[11.7608],
[ 4.4673],
[-2.1368],
[ 7.8829],
[ 0.8922],
[ 7.6340]])
11.3.2. Define the model¶
from torch import nn
model = nn.Sequential(
    nn.Linear(2, 1)  # input dimension 2, output dimension 1
)
We can inspect the model's architecture:
model.state_dict
<bound method Module.state_dict of Sequential(
(0): Linear(in_features=2, out_features=1, bias=True)
)>
As you can see, it has just one layer, and that layer is a linear layer.
So we can pull that layer out and look at it:
model[0]
Linear(in_features=2, out_features=1, bias=True)
Take a look at its weight and bias:
model[0].weight
Parameter containing:
tensor([[ 0.1620, -0.4444]], requires_grad=True)
model[0].bias
Parameter containing:
tensor([-0.5005], requires_grad=True)
To get just the numeric values:
model[0].weight.data
tensor([[ 0.1620, -0.4444]])
model[0].bias.data
tensor([-0.5005])
Get all the parameters at once:
model.parameters()
<generator object Module.parameters at 0x132cbb190>
It's a generator, so loop over it with for to see what it contains:
for param in model.parameters():
    print(param)
Parameter containing:
tensor([[ 0.1620, -0.4444]], requires_grad=True)
Parameter containing:
tensor([-0.5005], requires_grad=True)
They turn out to be exactly w and b.
So the parameters we will feed to the optimizer later are simply model.parameters().
If you want to check the current estimates of w and b, use model[0].weight.data and model[0].bias.data.
If you want to check the gradients of w and b, use model[0].weight.grad and model[0].bias.grad.
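As a small aside (not part of the original d2l text), model.named_parameters() yields (name, parameter) pairs, which is handy once a model has more than one layer:
for name, param in model.named_parameters():
    print(name, param.shape)
# 0.weight torch.Size([1, 2])
# 0.bias torch.Size([1])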
11.3.3. loss function¶
loss = nn.MSELoss()
11.3.4. optimizer¶
optimizer = torch.optim.SGD(model.parameters(), lr = 0.03)
11.3.5. training¶
11.3.5.1. Walk through one batch to see what happens¶
# data
X, y = next(iter(data_iter))
print(X)
print(y)
tensor([[-1.7370e+00, 2.2797e+00],
[-1.3517e+00, 6.1767e-01],
[-8.5405e-02, -4.3337e-01],
[-4.4443e-01, -2.2241e-01],
[ 4.1111e-01, -7.4746e-01],
[ 1.6160e+00, 2.0642e+00],
[-2.3694e+00, -4.2574e-01],
[-1.6769e-03, -1.1355e+00],
[ 7.4750e-01, 3.2721e-01],
[ 7.4800e-01, 9.3969e-01]])
tensor([[-7.0367],
[-0.6083],
[ 5.5124],
[ 4.0628],
[ 7.5595],
[ 0.3955],
[ 0.9199],
[ 8.0442],
[ 4.5495],
[ 2.5009]])
Compare the initial parameters with the true values:
w_est = model[0].weight.data
b_est = model[0].bias.data
print(f"estimate w: {w_est}")
print(f"true w: {true_w}")
print(f"mse of parameter estimation: {((w_est.reshape(-1)-true_w)**2).sum()}")
estimate w: tensor([[ 0.1620, -0.4444]])
true w: tensor([ 2.0000, -3.4000])
mse of parameter estimation: 12.113733291625977
print(f"estimate b: {b_est}")
print(f"true b: {true_b}")
print(f"mse of parameter estimation: {((b_est.item()-true_b)**2)}")
estimate b: tensor([-0.5005])
true b: 4.2
mse of parameter estimation: 22.09457495294072
At this point the parameters' gradients are still None:
print(model[0].weight.grad)
print(model[0].bias.grad)
None
None
Do the first forward pass:
y_hat = model(X)
batch_cost = loss(y_hat, y)
print(batch_cost)
tensor(24.7222, grad_fn=<MseLossBackward>)
Run the backward pass:
batch_cost.backward()
Now the gradients have been computed:
print(model[0].weight.grad)
print(model[0].bias.grad)
tensor([[-2.8292, 4.5661]])
tensor([-6.5510])
Update the parameters with the optimizer:
optimizer.step()
Check whether the updated parameters are a bit closer to the true values:
w_est = model[0].weight.data
b_est = model[0].bias.data
print(f"estimate w: {w_est}")
print(f"true w: {true_w}")
print(f"mse of parameter estimation: {((w_est.reshape(-1)-true_w)**2).sum()}")
estimate w: tensor([[ 0.2468, -0.5814]])
true w: tensor([ 2.0000, -3.4000])
mse of parameter estimation: 11.017970085144043
print(f"estimate b: {b_est}")
print(f"true b: {true_b}")
print(f"mse of parameter estimation: {((b_est.item()-true_b)**2)}")
estimate b: tensor([-0.3040])
true b: 4.2
mse of parameter estimation: 20.28563309076185
Indeed, the estimates have improved.
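As a quick sanity check that step() really performed the plain SGD update param -= lr * grad, plug in the numbers printed above (small differences come from the rounded printouts):
\(w_1: 0.1620 - 0.03 \times (-2.8292) = 0.2469\)
\(w_2: -0.4444 - 0.03 \times 4.5661 = -0.5814\)
\(b: -0.5005 - 0.03 \times (-6.5510) = -0.3040\)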
11.3.5.2. Run 3 epochs¶
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        # forward
        y_hat = model(X)
        batch_cost = loss(y_hat, y)
        # zero the gradients
        optimizer.zero_grad()
        # backward
        batch_cost.backward()
        optimizer.step()

    # epoch-level training loss on the full dataset
    epoch_cost = loss(model(features), labels)
    print(f'epoch {epoch + 1}, loss {epoch_cost}')
epoch 1, loss 0.000374063674826175
epoch 2, loss 0.00010071097494801506
epoch 3, loss 0.0001010773194138892
Again, the parameter estimates are very close to the true values.
w_est = model[0].weight.data
b_est = model[0].bias.data
print(f"estimate w: {w_est}")
print(f"true w: {true_w}")
print(f"mse of parameter estimation: {((w_est.reshape(-1)-true_w)**2).sum()}")
estimate w: tensor([[ 2.0005, -3.4002]])
true w: tensor([ 2.0000, -3.4000])
mse of parameter estimation: 3.2639093205943936e-07
print(f"estimate b: {b_est}")
print(f"true b: {true_b}")
print(f"mse of parameter estimation: {((b_est.item()-true_b)**2)}")
estimate b: tensor([4.1997])
true b: 4.2
mse of parameter estimation: 1.216326290888321e-07
11.4. Built-in functions vs. hand-written¶
The biggest difference is probably the data iterator.
The hand-written data iterator is exhausted once you run through it, so the training loop has to be written like this:
for epoch in range(num_epochs):
    # a fresh iterator has to be created for every epoch
    custom_data_iter = my_data_iter(10, features, labels)
    for X, y in custom_data_iter:
        ...
But with the built-in DataLoader, you only create it once, because it automatically builds a fresh iterator for you each time you loop over it:
dataset = TensorDataset(features, labels)
data_iter = DataLoader(dataset, batch_size = 10, shuffle = True)  # create it just this once

for epoch in range(num_epochs):
    for X, y in data_iter:
        ...
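A minimal illustration of this difference, reusing the objects defined above (with 1000 examples and batch_size = 10, each full pass yields 100 batches):
gen = my_data_iter(10, features, labels)
print(sum(1 for _ in gen))        # 100 batches on the first pass
print(sum(1 for _ in gen))        # 0 -- the generator is exhausted
print(sum(1 for _ in data_iter))  # 100
print(sum(1 for _ in data_iter))  # 100 -- DataLoader builds a fresh iterator each time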