从头开始重新创建 PyTorch

数据科学工厂

简介

多年来，我们一直在使用 PyTorch 来构建和训练深度学习模型。尽管对其语法和规则已经了如指掌，但内心深处总有一些疑问：这些操作背后究竟发生了什么？它们是如何运作的？如果询问你如何在 PyTorch 中创建和训练一个模型，你可能会想到类似以下的代码：

import torch
import torch.nn as nn
import torch.optim as optim

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(1, 10)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(10, 1)

    def forward(self, x):
        out = self.fc1(x)
        out = self.sigmoid(out)
        out = self.fc2(out)

        return out

...

model = MyModel().to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(epochs):
    for x, y in ...

        x = x.to(device)
        y = y.to(device)

        outputs = model(x)
        loss = criterion(outputs, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

如果询问你这个回溯步骤是如何运作的，你会如何回答？再比如，当你改变一个张量的形状时，背后发生了什么？数据是否在内部被重新组织？这个过程是如何进行的？PyTorch为何能够运行得如此迅速？它是如何处理GPU运算的？这些问题一直让我充满好奇，我相信它们同样也引起了你的兴趣。因此，为了更深入地掌握这些概念，有什么方法能比自己动手从零开始构建一个张量库更有效呢？这就是你在本文中将要学习的内容！

Tensor

要创建一个张量库，你首先必须掌握的概念当然是：张量是什么？

你可能已经有一个直观的理解，即张量是一个包含数值的多维数据结构的数学概念。但在这儿，我们得从计算的角度来理解如何构建这种数据结构。我们可以将张量想象成由两部分组成：一部分是数据本身，另一部分是描述张量特征的元数据，比如它的维度形状，或者它存储在哪种设备上（比如CPU内存或GPU内存）。

还有一个你可能不太熟悉的元数据概念，称为步长（stride）。理解这个概念对于深入掌握张量数据重排的内部工作原理至关重要，因此我们有必要对其进行更深入的探讨。

设想一个形状为 [4, 8] 的二维张量，如下图所示。

张量的数据（即浮点数）实际上在内存中存储为一维数组：

所以，为了将这个一维数组表示为多维张量，我们利用了步长（strides）的概念。简单来说，思路是这样的：

我们有一个 4 行 8 列的矩阵。假设所有元素都是按行顺序存储在一维数组中，如果我们想访问位于 [2, 3] 的元素值，我们需要跳过 2 行（每行有 8 个元素）再加上 3 个位置。用数学语言描述，我们需要在一维数组中跳过 3 + 2 * 8 个元素来访问它：

这个“8”代表的是第二维度的步长（stride）。简单来说，它告诉我们在数组中需要跳过多少个元素才能“移动”到第二维度上的其他位置。因此，要访问形状为 [shape_0, shape_1] 的二维张量中的元素 [i, j]，我们实际上需要定位到 j + i * shape_1 的位置。

接下来，让我们设想一个三维张量的情况：

你可以将这个三维张量视作矩阵的序列。比如，你可以把这个形状为 [5, 4, 8] 的张量看作是 5 个形状为 [4, 8] 的矩阵。现在，要获取位于 [1, 2, 7] 的元素，你需要跳过 1 个完整的 [4,8] 形状的矩阵，2 行 [8] 形状的行，以及 7 列 [1] 形状的列。所以，你需要在一维数组中跳过 (1 4 8) + (2 8) + (7 1) 个位置。

因此，要访问 1-D 数据数组上具有 [shape_0, shape_1, shape_2] 的 3-D 张量的元素 [i][j][k]，您可以执行以下操作：

shape_1 * shape_2 是第一维度的步幅，shape_2 是第二维度的步幅，1 是第三维度的步幅。然后，为了概括：

其中每个维度的步幅可以使用下一维度张量形状的乘积来计算：

然后我们设置 stride[n-1] = 1。在形状为 [5, 4, 8] 的张量示例中，步长 = [4*8, 8, 1] = [32, 8, 1] 您可以自行测试：

import torch

torch.rand([5, 4, 8]).stride()
#(32, 8, 1)

好的，但我们为什么要使用形状和步长呢？这个概念不仅可以用于访问存储为一维数组形式的N维张量的元素，还可以非常方便地用来调整张量的布局。

比如，当你想要改变一个张量的形状时，只需指定新的形状，并据此计算出新的步长即可！（这样做是因为新的形状确保了元素的总数保持不变）

import torch

t = torch.rand([5, 4, 8])

print(t.shape)
# [5, 4, 8]

print(t.stride())
# [32, 8, 1]

new_t = t.reshape([4, 5, 2, 2, 2])

print(new_t.shape)
# [4, 5, 2, 2, 2]

print(new_t.stride())
# [40, 8, 4, 2, 1]

在内部，张量仍然以一维数组的形式存储。重塑操作并没有改变数组中元素的排列顺序！这真是令人惊叹，对吧？你可以通过下面的函数来亲自验证这一点，该函数能够访问 PyTorch 内部的一维数组：

import ctypes

def print_internal(t: torch.Tensor):
    print(
        torch.frombuffer(
            ctypes.string_at(t.data_ptr(), t.storage().nbytes()), dtype=t.dtype
        )
    )

print_internal(t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...

print_internal(new_t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...

或者例如，您想要转置两个轴。在内部，您只需交换各自的步幅即可！

t = torch.arange(0, 24).reshape(2, 3, 4)
print(t)
# [[[ 0,  1,  2,  3],
#   [ 4,  5,  6,  7],
#   [ 8,  9, 10, 11]],

#  [[12, 13, 14, 15],
#   [16, 17, 18, 19],
#   [20, 21, 22, 23]]]

print(t.shape)
# [2, 3, 4]

print(t.stride())
# [12, 4, 1]

new_t = t.transpose(0, 1)
print(new_t)
# [[[ 0,  1,  2,  3],
#   [12, 13, 14, 15]],

#  [[ 4,  5,  6,  7],
#   [16, 17, 18, 19]],

#  [[ 8,  9, 10, 11],
#   [20, 21, 22, 23]]]

print(new_t.shape)
# [3, 2, 4]

print(new_t.stride())
# [4, 12, 1]

如果打印内部数组，两者具有相同的值：

print_internal(t)
# [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

print_internal(new_t)
# [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

但是，新张量 new_t 的步长现在与我之前展示的公式不一致。这是因为张量现在不再连续。也就是说，尽管内部数组本身没有变化，它在内存中的存储顺序与张量的实际排列顺序并不一致。

t.is_contiguous()
# True

new_t.is_contiguous()
# False

这意味着，如果我们要按顺序访问那些不连续的元素，效率会比较低（因为这些元素在内存中的存储并不是连续的）。为了提高效率，我们可以通过以下方式来解决：

new_t_contiguous = new_t.contiguous()

print(new_t_contiguous.is_contiguous())
# True

如果我们分析内部数组，它的顺序现在与实际张量顺序匹配，这可以提供更好的内存访问效率：

print(new_t)
# [[[ 0,  1,  2,  3],
#   [12, 13, 14, 15]],

#  [[ 4,  5,  6,  7],
#   [16, 17, 18, 19]],

#  [[ 8,  9, 10, 11],
#   [20, 21, 22, 23]]]

print_internal(new_t)
# [ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

print_internal(new_t_contiguous)
# [ 0,  1,  2,  3, 12, 13, 14, 15,  4,  5,  6,  7, 16, 17, 18, 19,  8,  9, 10, 11, 20, 21, 22, 23]

构建自己的张量库（Norch）

现在我们已经明白了张量是如何构建的，接下来就让我们着手开发我们的库。我打算给它命名为Norch，这个名字既表示“非PyTorch”，也隐含了我自己的姓氏Nogueira 😁首先需要了解的是，尽管PyTorch是通过Python接口使用的，但其核心实际上是用C/C++编写的。因此，我们首先需要开发我们自己的C/C++底层功能。我们可以首先定义一个结构体来存储张量的数据和元数据，并编写一个函数来创建这个结构体的实例。

//norch/csrc/tensor.cpp

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

typedef struct {
    float* data;
    int* strides;
    int* shape;
    int ndim;
    int size;
    char* device;
} Tensor;

Tensor* create_tensor(float* data, int* shape, int ndim) {

    Tensor* tensor = (Tensor*)malloc(sizeof(Tensor));
    if (tensor == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    tensor->data = data;
    tensor->shape = shape;
    tensor->ndim = ndim;

    tensor->size = 1;
    for (int i = 0; i < ndim; i++) {
        tensor->size *= shape[i];
    }

    tensor->strides = (int*)malloc(ndim * sizeof(int));
    if (tensor->strides == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    int stride = 1;
    for (int i = ndim - 1; i >= 0; i--) {
        tensor->strides[i] = stride;
        stride *= shape[i];
    }

    return tensor;
}

为了访问某些元素，我们可以利用步幅，正如我们之前学到的：

//norch/csrc/tensor.cpp

float get_item(Tensor* tensor, int* indices) {
    int index = 0;
    for (int i = 0; i < tensor->ndim; i++) {
        index += indices[i] * tensor->strides[i];
    }

    float result;
    result = tensor->data[index];

    return result;
}

现在，我们可以创建张量运算。我将展示一些示例，您可以在本文末尾链接的存储库中找到完整版本。

//norch/csrc/cpu.cpp

void add_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

    for (int i = 0; i < tensor1->size; i++) {
        result_data[i] = tensor1->data[i] + tensor2->data[i];
    }
}

void sub_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

    for (int i = 0; i < tensor1->size; i++) {
        result_data[i] = tensor1->data[i] - tensor2->data[i];
    }
}

void elementwise_mul_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

    for (int i = 0; i < tensor1->size; i++) {
        result_data[i] = tensor1->data[i] * tensor2->data[i];
    }
}

void assign_tensor_cpu(Tensor* tensor, float* result_data) {

    for (int i = 0; i < tensor->size; i++) {
        result_data[i] = tensor->data[i];
    }
}

...

之后我们就可以创建其他张量函数来调用这些操作：

//norch/csrc/tensor.cpp

Tensor* add_tensor(Tensor* tensor1, Tensor* tensor2) {
    if (tensor1->ndim != tensor2->ndim) {
        fprintf(stderr, "Tensors must have the same number of dimensions %d and %d for addition\n", tensor1->ndim, tensor2->ndim);
        exit(1);
    }

    int ndim = tensor1->ndim;
    int* shape = (int*)malloc(ndim * sizeof(int));
    if (shape == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }

    for (int i = 0; i < ndim; i++) {
        if (tensor1->shape[i] != tensor2->shape[i]) {
            fprintf(stderr, "Tensors must have the same shape %d and %d at index %d for addition\n", tensor1->shape[i], tensor2->shape[i], i);
            exit(1);
        }
        shape[i] = tensor1->shape[i];
    }      
    float* result_data = (float*)malloc(tensor1->size * sizeof(float));
    if (result_data == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    add_tensor_cpu(tensor1, tensor2, result_data);

    return create_tensor(result_data, shape, ndim, device);
}

如前所述，张量重塑不会修改内部数据数组：

//norch/csrc/tensor.cpp

Tensor* reshape_tensor(Tensor* tensor, int* new_shape, int new_ndim) {

    int ndim = new_ndim;
    int* shape = (int*)malloc(ndim * sizeof(int));
    if (shape == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }

    for (int i = 0; i < ndim; i++) {
        shape[i] = new_shape[i];
    }

    // Calculate the total number of elements in the new shape
    int size = 1;
    for (int i = 0; i < new_ndim; i++) {
        size *= shape[i];
    }

    // Check if the total number of elements matches the current tensor's size
    if (size != tensor->size) {
        fprintf(stderr, "Cannot reshape tensor. Total number of elements in new shape does not match the current size of the tensor.\n");
        exit(1);
    }

    float* result_data = (float*)malloc(tensor->size * sizeof(float));
    if (result_data == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        exit(1);
    }
    assign_tensor_cpu(tensor, result_data);
    return create_tensor(result_data, shape, ndim, device);
}

虽然我们已经能够执行一些张量操作，但不应该有人需要直接使用C/C++来运行这些操作，对吗？现在，让我们着手开发Python接口！在Python中执行C/C++代码有很多选择，例如Pybind11和Cython。在我们的示例中，我选择使用ctypes。下面是ctypes的基本结构：

//C code
#include <stdio.h>

float add_floats(float a, float b) {
    return a + b;
}

# Compile
gcc -shared -o add_floats.so -fPIC add_floats.c

# Python code
import ctypes

# Load the shared library
lib = ctypes.CDLL('./add_floats.so')

# Define the argument and return types for the function
lib.add_floats.argtypes = [ctypes.c_float, ctypes.c_float]
lib.add_floats.restype = ctypes.c_float

# Convert python float to c_float type 
a = ctypes.c_float(3.5)
b = ctypes.c_float(2.2)

# Call the C function
result = lib.add_floats(a, b)
print(result)
# 5.7

正如您所看到的，它非常直观。编译 C/C++ 代码后，您可以非常轻松地通过 ctypes 在 Python 上使用它。您只需要定义函数的参数和返回 c_types，将变量转换为其各自的 c_types 并调用该函数。对于更复杂的类型，例如数组（浮点列表），您可以使用指针。

data = [1.0, 2.0, 3.0]
data_ctype = (ctypes.c_float * len(data))(*data)

lib.some_array_func.argstypes = [ctypes.POINTER(ctypes.c_float)]

...

lib.some_array_func(data)

对于结构类型，我们可以创建自己的 c_type：

class CustomType(ctypes.Structure):
    _fields_ = [
        ('field1', ctypes.POINTER(ctypes.c_float)),
        ('field2', ctypes.POINTER(ctypes.c_int)),
        ('field3', ctypes.c_int),
    ]

# Can be used as ctypes.POINTER(CustomType)

经过简短的解释后，让我们为张量 C/C++ 库构建 Python 包装器！

# norch/tensor.py

import ctypes

class CTensor(ctypes.Structure):
    _fields_ = [
        ('data', ctypes.POINTER(ctypes.c_float)),
        ('strides', ctypes.POINTER(ctypes.c_int)),
        ('shape', ctypes.POINTER(ctypes.c_int)),
        ('ndim', ctypes.c_int),
        ('size', ctypes.c_int),
    ]

class Tensor:
    os.path.abspath(os.curdir)
    _C = ctypes.CDLL("COMPILED_LIB.so"))

    def __init__(self):

        data, shape = self.flatten(data)
        self.data_ctype = (ctypes.c_float * len(data))(*data)
        self.shape_ctype = (ctypes.c_int * len(shape))(*shape)
        self.ndim_ctype = ctypes.c_int(len(shape))

        self.shape = shape
        self.ndim = len(shape)

        Tensor._C.create_tensor.argtypes = [ctypes.POINTER(ctypes.c_float), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
        Tensor._C.create_tensor.restype = ctypes.POINTER(CTensor)

        self.tensor = Tensor._C.create_tensor(
            self.data_ctype,
            self.shape_ctype,
            self.ndim_ctype,
        )

    def flatten(self, nested_list):
        """
        This method simply convert a list type tensor to a flatten tensor with its shape

        Example:

        Arguments:  
            nested_list: [[1, 2, 3], [-5, 2, 0]]
        Return:
            flat_data: [1, 2, 3, -5, 2, 0]
            shape: [2, 3]
        """
        def flatten_recursively(nested_list):
            flat_data = []
            shape = []
            if isinstance(nested_list, list):
                for sublist in nested_list:
                    inner_data, inner_shape = flatten_recursively(sublist)
                    flat_data.extend(inner_data)
                shape.append(len(nested_list))
                shape.extend(inner_shape)
            else:
                flat_data.append(nested_list)
            return flat_data, shape

        flat_data, shape = flatten_recursively(nested_list)
        return flat_data, shape

现在我们包含 Python 张量操作来调用 C/C++ 操作。

# norch/tensor.py

def __getitem__(self, indices):
    """
    Access tensor by index tensor[i, j, k...]
    """

    if len(indices) != self.ndim:
        raise ValueError("Number of indices must match the number of dimensions")

    Tensor._C.get_item.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int)]
    Tensor._C.get_item.restype = ctypes.c_float

    indices = (ctypes.c_int * len(indices))(*indices)
    value = Tensor._C.get_item(self.tensor, indices)  

    return value

def reshape(self, new_shape):
    """
    Reshape tensor
    result = tensor.reshape([1,2])
    """
    new_shape_ctype = (ctypes.c_int * len(new_shape))(*new_shape)
    new_ndim_ctype = ctypes.c_int(len(new_shape))

    Tensor._C.reshape_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(ctypes.c_int), ctypes.c_int]
    Tensor._C.reshape_tensor.restype = ctypes.POINTER(CTensor)
    result_tensor_ptr = Tensor._C.reshape_tensor(self.tensor, new_shape_ctype, new_ndim_ctype)   

    result_data = Tensor()
    result_data.tensor = result_tensor_ptr
    result_data.shape = new_shape.copy()
    result_data.ndim = len(new_shape)
    result_data.device = self.device

    return result_data

def __add__(self, other):
    """
    Add tensors
    result = tensor1 + tensor2
    """

    if self.shape != other.shape:
        raise ValueError("Tensors must have the same shape for addition")

    Tensor._C.add_tensor.argtypes = [ctypes.POINTER(CTensor), ctypes.POINTER(CTensor)]
    Tensor._C.add_tensor.restype = ctypes.POINTER(CTensor)

    result_tensor_ptr = Tensor._C.add_tensor(self.tensor, other.tensor)

    result_data = Tensor()
    result_data.tensor = result_tensor_ptr
    result_data.shape = self.shape.copy()
    result_data.ndim = self.ndim
    result_data.device = self.device

    return result_data

# Include the other operations:
# __str__
# __sub__ (-)
# __mul__ (*)
# __matmul__ (@)
# __pow__ (**)
# __truediv__ (/)
# log
# ...

您现在就可以运行代码并开始执行一些张量运算！

import norch

tensor1 = norch.Tensor([[1, 2, 3], [3, 2, 1]])
tensor2 = norch.Tensor([[3, 2, 1], [1, 2, 3]])

result = tensor1 + tensor2
print(result[0, 0])
# 4

微信扫一扫分享文章

从头开始重新创建 PyTorch

简介

Tensor

构建自己的张量库（Norch）

相关帖子

为什么以及如何在多重假设检验中调整 P 值

机器学习算法：随机森林

为什么深度学习模型在GPU上运行更快？

Linux｜创建和使用 Bash 别名

PyTorch | 加速模型训练的妙招

如何搭建深度学习的多 GPU 服务器

解码语言：命名实体识别（NER）技术

数据科学工厂 LV5