Data Processing with PyTorch

source code: NJU-ymhui/DataOperations: Use pytorch for data operations (github.com)

use git to clone: https://github.com/NJU-ymhui/DataOperations.git

start.py dataprepare.py

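Getting Started

In PyTorch, arrays are called tensors (Tensor). They resemble NumPy's ndarray, but an ndarray only supports CPU computation, whereas a Tensor also supports GPU-accelerated computation, and the Tensor class supports automatic differentiation.

Import the PyTorch library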

import torch

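Creating tensors

torch.arange(...): creates a row vector (a 1-D tensor)

var.shape: the shape of the tensor

var.numel(): the total number of elements in the tensor

var.reshape(...): changes the shape of a tensor without changing the number of elements or their values

torch.zeros(...): creates a tensor filled with 0; the shape is given by the arguments

torch.ones(...): creates a tensor filled with 1; the shape is given by the arguments

torch.randn(...): creates a tensor whose elements are sampled at random from the standard normal distribution; the shape is given by the arguments

torch.tensor(...): creates a tensor from explicitly given values

torch.zeros_like(...): creates a tensor with the same shape as the tensor passed in, filled with 0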

code

import torch
def torch_tensor():
    x = torch.arange(12)
    print(x)
    print(x.shape)
    print(x.numel())
    print(x.reshape(3, 4))
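    print(torch.zeros((2, 3, 4)))  # a single tensor holding two 3 * 4 zero matrices
    print(torch.ones((2, 3, 4)))  # same as above, but filled with 1
    print(torch.randn(3, 4))  # 3 * 4 matrix of random numbers sampled from the standard normal distribution
    print(torch.tensor([[1, 2, 3, 4], [6, 5, 7, 8], [2, 3, 4, 1]]))  # initialize a matrix by hand
    print(torch.zeros_like(x))  # same shape as x, filled with 0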

output

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
torch.Size([12])
12
tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])
tensor([[ 1.6022, -0.8597,  0.0841,  0.7659],
        [-1.3949, -0.0424, -0.3197, -0.6832],
        [ 1.1753, -1.5020,  0.5873, -0.2480]])
tensor([[1, 2, 3, 4],
        [6, 5, 7, 8],
        [2, 3, 4, 1]])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
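
As a small complement to the functions listed above, the following sketch shows two details the listing does not print: reshape accepts -1 for one dimension and infers it, and each constructor has a default dtype. The variable names are illustrative, and the dtypes assume PyTorch's default settings have not been changed.

```python
import torch

x = torch.arange(12)            # 1-D tensor holding 0..11, dtype int64 by default
m = x.reshape(3, -1)            # -1 asks PyTorch to infer this dimension (12 / 3 = 4)
z = torch.zeros((2, 3, 4))      # floating-point constructor, float32 by default
like = torch.zeros_like(m)      # same shape and dtype as m, filled with 0

print(m.shape)                  # torch.Size([3, 4])
print(x.dtype, z.dtype)         # torch.int64 torch.float32
print(like.shape, like.dtype)   # torch.Size([3, 4]) torch.int64
```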

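Tensor operations

Tensors of the same shape support elementwise arithmetic (similar to MATLAB's elementwise operators), written with the same operators as native Python

torch.exp(...): elementwise exponential (e raised to each element)

var.sum(): sums all elements of the tensor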

code

def tensor_operation():
    x = torch.tensor([1, 2, 3, 3.6])
    y = torch.tensor([0.5, 4, 2, 0.1])
    print(x * y)
    print(x / y)
    print(x + y)
    print(x - y)
    print(x ** y)
    print(torch.exp(x))
    print(x == y)
    print(x.sum())

output

tensor([0.5000, 8.0000, 6.0000, 0.3600])
tensor([ 2.0000,  0.5000,  1.5000, 36.0000])
tensor([1.5000, 6.0000, 5.0000, 3.7000])
tensor([ 0.5000, -2.0000,  1.0000,  3.5000])
tensor([ 1.0000, 16.0000,  9.0000,  1.1367])
tensor([ 2.7183,  7.3891, 20.0855, 36.5982])
tensor([False, False, False, False])
tensor(9.6000)
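
sum() with no arguments reduces the whole tensor to one value. A minimal sketch, assuming you want per-row or per-column totals instead, passes the dim argument; the tensor below is made up for illustration.

```python
import torch

x = torch.arange(6, dtype=torch.float32).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

total = x.sum()                       # sum of every element
col_sums = x.sum(dim=0)               # collapse axis 0: one sum per column
row_sums = x.sum(dim=1, keepdim=True) # collapse axis 1 but keep it as a length-1 axis

print(total)      # tensor(15.)
print(col_sums)   # tensor([3., 5., 7.])
print(row_sums)   # tensor([[ 3.], [12.]]) with shape (2, 1)
```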

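Concatenating tensors

torch.cat(...): concatenates multiple tensors; the first argument gives the tensors to join, and the second argument, dim, selects the dimension (axis) to join along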

code

def tensor_concat():
    x = (torch.arange(12, dtype=torch.float32)).reshape(3, 4)
    y = (torch.arange(12)).reshape(3, 4)
    # concatenate along rows (axis 0)
    print(torch.cat((x, y), dim=0))
    # concatenate along columns (axis 1)
    print(torch.cat((x, y), dim=1))

output

tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
tensor([[ 0.,  1.,  2.,  3.,  0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.,  8.,  9., 10., 11.]])

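Note that concatenating along rows (axis 0) keeps the number of columns and stacks the rows, while concatenating along columns (axis 1) keeps the number of rows and joins the columns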
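
The shape rule can be checked directly. The sketch below uses two made-up tensors whose row counts differ: they can be joined along axis 0 because the column counts match, but not along axis 1.

```python
import torch

x = torch.zeros(3, 4)
y = torch.ones(2, 4)

stacked = torch.cat((x, y), dim=0)   # rows add up: 3 + 2, columns stay at 4
print(stacked.shape)                 # torch.Size([5, 4])

# Along axis 1 every other dimension must match; 3 rows vs 2 rows does not.
try:
    torch.cat((x, y), dim=1)
    failed = False
except RuntimeError as err:
    failed = True
    print("cannot concatenate:", err)
```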

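Broadcasting

What happens if we operate on tensors of different shapes? Provided the shapes are compatible, both tensors are expanded by copying elements as needed until their shapes match, and the operation is then applied elementwise (MATLAB behaves the same way)

In most cases, broadcasting happens along axes of length 1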

demo

a = tensor([[0],
            [1],
            [2]])
b = tensor([[0, 1]])
to compute a + b:
first a => a' = tensor([[0, 0],
                        [1, 1],
                        [2, 2]])
      b => b' = tensor([[0, 1],
                        [0, 1],
                        [0, 1]])
then a' + b'

code

def tensor_broadcast():
    x = torch.arange(3).reshape(3, 1)
    y = torch.arange(2).reshape(1, 2)
    print(x)
    print(y)
    # add x and y
    print(x + y)

output

tensor([[0],
        [1],
        [2]])
tensor([[0, 1]])
tensor([[0, 1],
        [1, 2],
        [2, 3]])
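
To make the implicit expansion visible, the sketch below performs it by hand with expand and checks that the result matches the broadcast sum. The names x_b and y_b are illustrative.

```python
import torch

x = torch.arange(3).reshape(3, 1)
y = torch.arange(2).reshape(1, 2)

# Expand each length-1 axis to the common shape (3, 2) explicitly.
x_b = x.expand(3, 2)   # [[0, 0], [1, 1], [2, 2]]
y_b = y.expand(3, 2)   # [[0, 1], [0, 1], [0, 1]]

same = torch.equal(x + y, x_b + y_b)
print(same)            # True
```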

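Indexing and slicing

Indexing and slicing follow the same conventions as native Python: the first element has index 0 and the last has index -1, and a range includes its first element and stops before its last. An element at a given position can be accessed either as arr[i, j] (NumPy/MATLAB style) or as arr[i][j] (C style)

i:j: the half-open interval [i, j), closed on the left and open on the right

[:, i]: all rows (axis 0) at column i, i.e. every element of column i

[i, :]: all columns (axis 1) at row i, i.e. every element of row i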

code

def index_slice():
    x = torch.arange(12).reshape(3, 4)
    print("initial:\n", x)
    print("last:\n", x[-1])  # last row
    print("row 1 ~ 2:\n", x[1:3])  # rows 1 to 2
    print("row 0 ~ 1 and column 1 ~ 2:\n", x[0:2, 1:3])  # rows 0 to 1, columns 1 to 2
    print("column 2:\n", x[:, 2])  # column 2
    print("row 1:\n", x[1, :])  # row 1
    print("row 1 and col 2:\n", x[1, 2], x[1][2])  # row 1, column 2
    x[1, 2] = 114  # modify in place
    print("modify:\n", x)

output

initial:
 tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
last:
 tensor([ 8,  9, 10, 11])
row 1 ~ 2:
 tensor([[ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
row 0 ~ 1 and column 1 ~ 2:
 tensor([[1, 2],
        [5, 6]])
column 2:
 tensor([ 2,  6, 10])
row 1:
 tensor([4, 5, 6, 7])
row 1 and col 2:
 tensor(6) tensor(6)
modify:
 tensor([[  0,   1,   2,   3],
        [  4,   5, 114,   7],
        [  8,   9,  10,  11]])
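
Slices can also appear on the left-hand side of an assignment, which writes one value to every selected element. A minimal sketch on the same 3 * 4 layout:

```python
import torch

x = torch.arange(12).reshape(3, 4)
x[0:2, :] = 12   # overwrite every element of rows 0 and 1
x[:, -1] = 0     # overwrite the last column in all rows

print(x)
# tensor([[12, 12, 12,  0],
#         [12, 12, 12,  0],
#         [ 8,  9, 10,  0]])
```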


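Saving memory

Running an operation can allocate new memory for its result. For example, if we write Y = X + Y, the name Y stops referring to the tensor it used to point to and instead points to a tensor in newly allocated memory.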

code

def tensor_memory():
    x = torch.arange(4)
    y = torch.arange(4)
    old = id(y)
    y = x + y
    print(old == id(y))

output

False

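***This is a serious problem!*** In deep learning we may have hundreds of megabytes of data, and repeatedly allocating new memory is extremely wasteful

Fortunately, reusing memory is fairly simple: assign through a slice, or use an augmented assignment operator (op=), so that the values are overwritten in place and the old memory stays in use. That is, replace y = x + y with y[:] = x + y or y += x. (With the slice form, x + y is still computed into a temporary before being copied into y.)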

code

def tensor_memory():
    x = torch.arange(4)
    y = torch.arange(4)
    old = id(y)
    y = x + y
    print(old == id(y))
    print("slice operation for reusing memory")
    # slice operation for reusing memory
    old = id(x)
    x[:] = x + y
    print(old == id(x))
    x += y
    print(old == id(x))

output

False
slice operation for reusing memory
True
True
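
id() compares Python objects. As an additional check, not part of the original code, the sketch below compares the address of the underlying storage with data_ptr(), and also writes a result into a preallocated tensor through the out= argument of torch.add. The name z is illustrative.

```python
import torch

x = torch.arange(4)
y = torch.arange(4)

ptr_y = y.data_ptr()        # address of the first element of y's storage
y += x                      # in-place update, no new storage
reused_y = (y.data_ptr() == ptr_y)

z = torch.zeros_like(y)     # preallocated result buffer
ptr_z = z.data_ptr()
torch.add(x, y, out=z)      # write x + y directly into z
reused_z = (z.data_ptr() == ptr_z)

print(reused_y, reused_z)   # True True
print(z)                    # tensor([0, 3, 6, 9])
```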

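Converting to Python objects

A tensor can be converted to a NumPy array, and a tensor of size 1 can be converted to a Python scalar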

code

def tensor_transform():
    x = torch.arange(4)
    print(x)
    y = x.numpy()
    print(y)
    print(type(x), type(y))
    # a tensor of size 1 can be converted to a Python scalar
    z = torch.tensor([1.14])
    print(z, z.item(), float(z), int(z))  # item() yields a scalar; the Python built-ins float() and int() also work

output

tensor([0, 1, 2, 3])
[0 1 2 3]
<class 'torch.Tensor'> <class 'numpy.ndarray'>
tensor([1.1400]) 1.1399999856948853 1.1399999856948853 1
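
One detail worth knowing about these conversions: for a CPU tensor, numpy() returns an array that shares memory with the tensor, while torch.tensor(...) copies its input. The sketch below demonstrates both; the variable names are illustrative.

```python
import torch

x = torch.arange(4)
a = x.numpy()          # shares memory with x (CPU tensors only)
b = torch.tensor(a)    # makes an independent copy of the data

x[0] = 100             # visible through a, because the memory is shared
a[1] = -5              # visible through x for the same reason; b is unaffected

print(a)               # [100  -5   2   3]
print(x)               # tensor([100,  -5,   2,   3])
print(b)               # tensor([0, 1, 2, 3])
```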

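Data preprocessing

To solve real-world problems with deep learning, the first step is to preprocess raw data rather than to start from data that is already in tensor format. This section shows how to preprocess raw data with the pandas library and convert it to tensor format. pandas works well together with tensors.

Import the pandas library

import pandas as pd

Reading data

pandas.read_csv(...): reads data in .csv format

pandas.read_excel(...): reads data in .xls or .xlsx format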

code

import os
import pandas as pd

def create_data_read():
    os.makedirs(os.path.join('.', 'data'), exist_ok=True)
    data_file = os.path.join('.', 'data', 'house_tiny.csv')
    # create data
    with open(data_file, 'w') as f:
        #        rooms    alley type  price
        f.write('NumRooms,Alley,Price\n')  # column names
        f.write('NA,Pave,127500\n')  # each row is one data sample
        f.write('2,NA,106000\n')  # NA marks a missing value
        f.write('4,NA,178100\n')
        f.write('NA,NA,140000\n')
    # use pandas read data
    df = pd.read_csv(data_file)
    print(df)
    return df

output

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
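
Before choosing how to handle missing values, it helps to count them per column. The sketch below reads the same sample data from an in-memory string (an assumption made so that no file is needed) and uses isna().sum().

```python
import io
import pandas as pd

csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)
df = pd.read_csv(io.StringIO(csv_text))  # "NA" is parsed as a missing value by default

missing = df.isna().sum()                # number of missing values in each column
print(missing)                           # NumRooms: 2, Alley: 3, Price: 0
```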

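Handling missing values

There are two approaches: imputation and deletion. This section focuses on imputation

Imputation

Fill each missing value with a substitute value

Continuous values: missing entries in a column/row can be replaced by the mean of that column/row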

code

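def insert_missing(data):
    # data comes from create_data_read
    inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]  # inputs takes the first two columns; outputs takes the third column, Price (not needed yet)
    print(inputs)

    # impute missing values of continuous variables
    # numeric_only=True skips the string column Alley (required on pandas >= 2.0)
    inputs = inputs.fillna(inputs.mean(numeric_only=True))  # replace NaN with the column mean
    print("after inserting for continuous variable:")
    print(inputs)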

output

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
   NumRooms Alley
0       NaN  Pave
1       2.0   NaN
2       4.0   NaN
3       NaN   NaN
after inserting for continuous variable:
   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
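
An alternative that does not depend on how mean() treats non-numeric columns is to impute one column at a time by passing a dict to fillna. This is a sketch with the sample values typed in directly, not the original code.

```python
import numpy as np
import pandas as pd

inputs = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

room_mean = inputs["NumRooms"].mean()             # NaN is skipped: (2 + 4) / 2 = 3.0
filled = inputs.fillna({"NumRooms": room_mean})   # only NumRooms is imputed

print(filled["NumRooms"].tolist())                # [3.0, 2.0, 4.0, 3.0]
print(int(filled["Alley"].isna().sum()))          # 3, Alley is left untouched
```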

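Discrete values: NaN can be treated as a category of its own. In this example, "Alley" takes only two values, Pave and NaN, so pandas can automatically convert the Alley column into two columns, Alley_Pave and Alley_nan. A row whose Alley is Pave gets Alley_Pave=1 and Alley_nan=0; a row whose Alley is NaN gets the opposite. This is done with the get_dummies function in pandas

pandas.get_dummies(...): converts the categorical variables in the input into dummy variables (one-hot encoding); the dummy_na parameter controls whether an extra dummy column is created for missing values (True creates it).

code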

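def insert_missing(data):
    # data comes from create_data_read
    inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]  # inputs takes the first two columns; outputs takes the third column, Price (not needed yet)
    print(inputs)

    # impute missing values of continuous variables
    # numeric_only=True skips the string column Alley (required on pandas >= 2.0)
    inputs = inputs.fillna(inputs.mean(numeric_only=True))  # replace NaN with the column mean
    print("after inserting for continuous variable:")
    print(inputs)

    # handle missing values of discrete variables
    # dummy_na=True creates a separate feature for missing values;
    # dtype=int keeps the 0/1 output shown below (pandas >= 2.0 defaults to bool)
    inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
    print("after inserting for discrete variable:")
    print(inputs)
    return inputs, outputs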

output

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
   NumRooms Alley
0       NaN  Pave
1       2.0   NaN
2       4.0   NaN
3       NaN   NaN
after inserting for continuous variable:
   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
after inserting for discrete variable:
   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
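
The same encoding extends to more than one category. In the sketch below, "Grvl" is a hypothetical second alley type that does not appear in the sample data. Each row ends up with exactly one 1 across the dummy columns.

```python
import numpy as np
import pandas as pd

# "Grvl" is a made-up category, added only to show a three-way encoding.
alley = pd.DataFrame({"Alley": ["Pave", "Grvl", np.nan, "Pave"]})

encoded = pd.get_dummies(alley, dummy_na=True, dtype=int)
row_totals = encoded.sum(axis=1).tolist()   # one-hot: every row sums to 1

print(encoded)
print(row_totals)                           # [1, 1, 1, 1]
```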

Deletion

Simply discard the missing values, i.e. drop the rows or columns that contain them
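
The notes do not give code for deletion. One possible sketch uses DataFrame.dropna on the sample data; it also shows why deletion has to be applied with care here, since every row contains at least one missing value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

rows_any = df.dropna()                        # drop rows with any NaN: nothing is left
rows_rooms = df.dropna(subset=["NumRooms"])   # drop rows only where NumRooms is NaN
cols_clean = df.dropna(axis=1)                # drop columns that contain any NaN

print(len(rows_any))                 # 0
print(rows_rooms.index.tolist())     # [1, 2]
print(list(cols_clean.columns))      # ['Price']
```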

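Converting to tensor format

After reading the data and handling the missing values, every entry is numeric, so the data can be converted to tensor format and processed conveniently with PyTorch's tensor functions.

pandas.read_csv() returns a DataFrame. First call its to_numpy method to obtain a NumPy array, then pass that array to torch.tensor() to obtain a tensor.

DataFrame.to_numpy(dtype=...): converts the DataFrame to a NumPy array; dtype sets the element type of the array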

code

def transfer_tensor(data):
    # data comes from insert_missing
    inputs, outputs = data
    x = torch.tensor(inputs.to_numpy(dtype=float))
    y = torch.tensor(outputs.to_numpy(dtype=float))
    print(x)
    print(y)

output

tensor([[3., 1., 0.],
        [2., 0., 1.],
        [4., 0., 1.],
        [3., 0., 1.]], dtype=torch.float64)
tensor([127500., 106000., 178100., 140000.], dtype=torch.float64)
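
The tensors above are float64 because Python's float maps to NumPy's 64-bit float. If float32 is wanted instead (the default dtype of constructors such as torch.zeros), one option is to pass dtype to torch.tensor. A sketch with the preprocessed values typed in directly:

```python
import pandas as pd
import torch

inputs = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1, 0, 0, 0],
    "Alley_nan": [0, 1, 1, 1],
})

x64 = torch.tensor(inputs.to_numpy(dtype=float))                       # float64, as in the output above
x32 = torch.tensor(inputs.to_numpy(dtype=float), dtype=torch.float32)  # converted to float32

print(x64.dtype, x32.dtype)   # torch.float64 torch.float32
print(x32.shape)              # torch.Size([4, 3])
```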

(•‿•)