从头开始重新创建 PyTorch

数据科学工厂 发表于 2024-9-3 21:46:46

![](https://s2.loli.net/2024/09/03/EYFj3ALZ9N2nfed.png)

## 简介

多年来，我们一直在使用 PyTorch 来构建和训练深度学习模型。尽管对其语法和规则已经了如指掌，但内心深处总有一些疑问：这些操作背后究竟发生了什么？它们是如何运作的？如果询问你如何在 PyTorch 中创建和训练一个模型，你可能会想到类似以下的代码：

```python
import torch
import torch.nn as nn
import torch.optim as optim

class MyModel(nn.Module):
def __init__(self):
   super(MyModel, self).__init__()
   self.fc1 = nn.Linear(1, 10)
   self.sigmoid = nn.Sigmoid()
   self.fc2 = nn.Linear(10, 1)

def forward(self, x):
   out = self.fc1(x)
   out = self.sigmoid(out)
   out = self.fc2(out)

   return out

...

model = MyModel().to(device)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(epochs):
for x, y in ...

   x = x.to(device)
   y = y.to(device)

   outputs = model(x)
   loss = criterion(outputs, y)

   optimizer.zero_grad()
   loss.backward()
   optimizer.step()
```

如果询问你这个回溯步骤是如何运作的，你会如何回答？再比如，当你改变一个张量的形状时，背后发生了什么？数据是否在内部被重新组织？这个过程是如何进行的？PyTorch为何能够运行得如此迅速？它是如何处理GPU运算的？这些问题一直让我充满好奇，我相信它们同样也引起了你的兴趣。因此，为了更深入地掌握这些概念，有什么方法能比自己动手从零开始构建一个张量库更有效呢？这就是你在[本文](https://towardsdatascience.com/recreating-pytorch-from-scratch-with-gpu-support-and-automatic-differentiation-8f565122a3cc "Source")中将要学习的内容！

## Tensor

要创建一个张量库，你首先必须掌握的概念当然是：张量是什么？

你可能已经有一个直观的理解，即张量是一个包含数值的多维数据结构的数学概念。但在这儿，我们得从计算的角度来理解如何构建这种数据结构。我们可以将张量想象成由两部分组成：一部分是数据本身，另一部分是描述张量特征的元数据，比如它的维度形状，或者它存储在哪种设备上（比如CPU内存或GPU内存）。

![](https://s2.loli.net/2024/09/03/UGZ4pcmOwtCbWrE.png)

还有一个你可能不太熟悉的元数据概念，称为步长（stride）。理解这个概念对于深入掌握张量数据重排的内部工作原理至关重要，因此我们有必要对其进行更深入的探讨。

设想一个形状为的二维张量，如下图所示。

![](https://s2.loli.net/2024/09/03/jsElzLI4U8BGJQa.png)

张量的数据（即浮点数）实际上在内存中存储为一维数组：

![](https://miro.medium.com/v2/resize:fit:1156/1*maqNuiifynJlAbBdWhFOgw.png)

所以，为了将这个一维数组表示为多维张量，我们利用了步长（strides）的概念。简单来说，思路是这样的：

我们有一个 4 行 8 列的矩阵。假设所有元素都是按行顺序存储在一维数组中，如果我们想访问位于的元素值，我们需要跳过 2 行（每行有 8 个元素）再加上 3 个位置。用数学语言描述，我们需要在一维数组中跳过 3 + 2 * 8 个元素来访问它：

![](https://s2.loli.net/2024/09/03/PLNR98WGUYrc2uo.png)

这个“8”代表的是第二维度的步长（stride）。简单来说，它告诉我们在数组中需要跳过多少个元素才能“移动”到第二维度上的其他位置。因此，要访问形状为的二维张量中的元素，我们实际上需要定位到 j + i * shape_1 的位置。

接下来，让我们设想一个三维张量的情况：

![](https://s2.loli.net/2024/09/03/lv1GmxTouASUKbM.png)

你可以将这个三维张量视作矩阵的序列。比如，你可以把这个形状为的张量看作是 5 个形状为的矩阵。现在，要获取位于的元素，你需要跳过 1 个完整的形状的矩阵，2 行形状的行，以及 7 列形状的列。所以，你需要在一维数组中跳过 (1 * 4 * 8) + (2 * 8) + (7 * 1) 个位置。

![](https://s2.loli.net/2024/09/03/yZ8UTWakuqfhGXj.png)

因此，要访问 1-D 数据数组上具有的 3-D 张量的元素，您可以执行以下操作：

![](https://s2.loli.net/2024/09/03/3UA4WCYfpi9tMwH.png)

shape_1 * shape_2 是第一维度的步幅，shape_2 是第二维度的步幅，1 是第三维度的步幅。然后，为了概括：

![](https://s2.loli.net/2024/09/03/ZzDlGy72svcTVie.png)

其中每个维度的步幅可以使用下一维度张量形状的乘积来计算：

![](https://s2.loli.net/2024/09/03/qgVWpifyX8ZlNsJ.png)

然后我们设置 stride = 1。在形状为的张量示例中，步长 = = 您可以自行测试：

```python
import torch

torch.rand().stride()
#(32, 8, 1)
```

好的，但我们为什么要使用形状和步长呢？这个概念不仅可以用于访问存储为一维数组形式的N维张量的元素，还可以非常方便地用来调整张量的布局。

比如，当你想要改变一个张量的形状时，只需指定新的形状，并据此计算出新的步长即可！（这样做是因为新的形状确保了元素的总数保持不变）

```python
import torch

t = torch.rand()

print(t.shape)
#

print(t.stride())
#

new_t = t.reshape()

print(new_t.shape)
#

print(new_t.stride())
#
```

在内部，张量仍然以一维数组的形式存储。重塑操作并没有改变数组中元素的排列顺序！这真是令人惊叹，对吧？你可以通过下面的函数来亲自验证这一点，该函数能够访问 PyTorch 内部的一维数组：

```python
import ctypes

def print_internal(t: torch.Tensor):
print(
   torch.frombuffer(
         ctypes.string_at(t.data_ptr(), t.storage().nbytes()), dtype=t.dtype
   )
)

print_internal(t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...

print_internal(new_t)
# [0.0752, 0.5898, 0.3930, 0.9577, 0.2276, 0.9786, 0.1009, 0.138, ...
```

或者例如，您想要转置两个轴。在内部，您只需交换各自的步幅即可！

```python
t = torch.arange(0, 24).reshape(2, 3, 4)
print(t)
# [[[ 0,1,2,3],
# [ 4,5,6,7],
# [ 8,9, 10, 11]],

#[,
# ,
# ]]

print(t.shape)
#

print(t.stride())
#

new_t = t.transpose(0, 1)
print(new_t)
# [[[ 0,1,2,3],
# ],

#[[ 4,5,6,7],
# ],

#[[ 8,9, 10, 11],
# ]]

print(new_t.shape)
#

print(new_t.stride())
#
```

如果打印内部数组，两者具有相同的值：

```python
print_internal(t)
# [ 0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

print_internal(new_t)
# [ 0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
```

但是，新张量 `new_t` 的步长现在与我之前展示的公式不一致。这是因为张量现在不再连续。也就是说，尽管内部数组本身没有变化，它在内存中的存储顺序与张量的实际排列顺序并不一致。

```python
t.is_contiguous()
# True

new_t.is_contiguous()
# False
```

这意味着，如果我们要按顺序访问那些不连续的元素，效率会比较低（因为这些元素在内存中的存储并不是连续的）。为了提高效率，我们可以通过以下方式来解决：

```python
new_t_contiguous = new_t.contiguous()

print(new_t_contiguous.is_contiguous())
# True
```

如果我们分析内部数组，它的顺序现在与实际张量顺序匹配，这可以提供更好的内存访问效率：

```python
print(new_t)
# [[[ 0,1,2,3],
# ],

#[[ 4,5,6,7],
# ],

#[[ 8,9, 10, 11],
# ]]

print_internal(new_t)
# [ 0,1,2,3,4,5,6,7,8,9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

print_internal(new_t_contiguous)
# [ 0,1,2,3, 12, 13, 14, 15,4,5,6,7, 16, 17, 18, 19,8,9, 10, 11, 20, 21, 22, 23]
```

## 构建自己的张量库（Norch）

现在我们已经明白了张量是如何构建的，接下来就让我们着手开发我们的库。我打算给它命名为Norch，这个名字既表示“非PyTorch”，也隐含了我自己的姓氏Nogueira 😁首先需要了解的是，尽管PyTorch是通过Python接口使用的，但其核心实际上是用C/C++编写的。因此，我们首先需要开发我们自己的C/C++底层功能。我们可以首先定义一个结构体来存储张量的数据和元数据，并编写一个函数来创建这个结构体的实例。

```c
//norch/csrc/tensor.cpp

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

typedef struct {
float* data;
int* strides;
int* shape;
int ndim;
int size;
char* device;
} Tensor;

Tensor* create_tensor(float* data, int* shape, int ndim) {

Tensor* tensor = (Tensor*)malloc(sizeof(Tensor));
if (tensor == NULL) {
   fprintf(stderr, "Memory allocation failed\n");
   exit(1);
}
tensor->data = data;
tensor->shape = shape;
tensor->ndim = ndim;

tensor->size = 1;
for (int i = 0; i < ndim; i++) {
   tensor->size *= shape;
}

tensor->strides = (int*)malloc(ndim * sizeof(int));
if (tensor->strides == NULL) {
   fprintf(stderr, "Memory allocation failed\n");
   exit(1);
}
int stride = 1;
for (int i = ndim - 1; i >= 0; i--) {
   tensor->strides = stride;
   stride *= shape;
}

return tensor;
}
```

为了访问某些元素，我们可以利用步幅，正如我们之前学到的：

```c
//norch/csrc/tensor.cpp

float get_item(Tensor* tensor, int* indices) {
int index = 0;
for (int i = 0; i < tensor->ndim; i++) {
   index += indices * tensor->strides;
}

float result;
result = tensor->data;

return result;
}
```

现在，我们可以创建张量运算。我将展示一些示例，您可以在本文末尾链接的存储库中找到完整版本。

```c
//norch/csrc/cpu.cpp

void add_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

for (int i = 0; i < tensor1->size; i++) {
   result_data = tensor1->data + tensor2->data;
}
}

void sub_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

for (int i = 0; i < tensor1->size; i++) {
   result_data = tensor1->data - tensor2->data;
}
}

void elementwise_mul_tensor_cpu(Tensor* tensor1, Tensor* tensor2, float* result_data) {

for (int i = 0; i < tensor1->size; i++) {
   result_data = tensor1->data * tensor2->data;
}
}

void assign_tensor_cpu(Tensor* tensor, float* result_data) {

for (int i = 0; i < tensor->size; i++) {
   result_data = tensor->data;
}
}

...
```

之后我们就可以创建其他张量函数来调用这些操作：

```c
//norch/csrc/tensor.cpp

Tensor* add_tensor(Tensor* tensor1, Tensor* tensor2) {
if (tensor1->ndim != tensor2->ndim) {
   fprintf(stderr, "Tensors must have the same number of dimensions %d and %d for addition\n", tensor1->ndim, tensor2->ndim);
   exit(1);
}

int ndim = tensor1->ndim;
int* shape = (int*)malloc(ndim * sizeof(int));
if (shape == NULL) {
   fprintf(stderr, "Memory allocation failed\n");
   exit(1);
}

for (int i = 0; i < ndim; i++) {
   if (tensor1->shape != tensor2->shape) {
         fprintf(stderr, "Tensors must have the same shape %d and %d at index %d for addition\n", tensor1->shape, tensor2->shape, i);
         exit(1);
   }
   shape = tensor1->shape;
}
float* result_data = (float*)malloc(tensor1->size * sizeof(float));
if (result_data == NULL) {
   fprintf(stderr, "Memory allocation failed\n");
   exit(1);
}
add_tensor_cpu(tensor1, tensor2, result_data);

return create_tensor(result_data, shape, ndim, device);
}
```

如前所述，张量重塑不会修改内部数据数组：

```c
//norch/csrc/tensor.cpp

Tensor* reshape_tensor(Tensor* tensor, int* new_shape, int new_ndim) {

int ndim = new_ndim;
int* shape = (int*)malloc(ndim * sizeof(int));
if (shape == NULL) {
   fprintf(stderr, "Memory allocation failed\n");
   exit(1);
}

for (int i = 0; i < ndim; i++) {
   shape = new_shape;
}

// Calculate the total number of elements in the new shape
int size = 1;
for (int i = 0; i < new_ndim; i++) {
   size *= shape;
}

// Check if the total number of elements matches the current tensor's size
if (size != tensor->size) {
   fprintf(stderr, "Cannot reshape tensor. Total number of elements in new shape does not match the current size of the tensor.\n");
   exit(1);
}

float* result_data = (float*)malloc(tensor->size * sizeof(float));
if (result_data == NULL) {
   fprintf(stderr, "Memory allocation failed\n");
   exit(1);
}
assign_tensor_cpu(tensor, result_data);
return create_tensor(result_data, shape, ndim, device);
}
```

虽然我们已经能够执行一些张量操作，但不应该有人需要直接使用C/C++来运行这些操作，对吗？现在，让我们着手开发Python接口！在Python中执行C/C++代码有很多选择，例如Pybind11和Cython。在我们的示例中，我选择使用ctypes。下面是ctypes的基本结构：

```c
//C code
#include <stdio.h>

float add_floats(float a, float b) {
return a + b;
}

# Compile
gcc -shared -o add_floats.so -fPIC add_floats.c
```

```python
# Python code
import ctypes

# Load the shared library
lib = ctypes.CDLL('./add_floats.so')

# Define the argument and return types for the function
lib.add_floats.argtypes =
lib.add_floats.restype = ctypes.c_float

# Convert python float to c_float type
a = ctypes.c_float(3.5)
b = ctypes.c_float(2.2)

# Call the C function
result = lib.add_floats(a, b)
print(result)
# 5.7
```

正如您所看到的，它非常直观。编译 C/C++ 代码后，您可以非常轻松地通过 ctypes 在 Python 上使用它。您只需要定义函数的参数和返回 c_types，将变量转换为其各自的 c_types 并调用该函数。对于更复杂的类型，例如数组（浮点列表），您可以使用指针。

```python
data =
data_ctype = (ctypes.c_float * len(data))(*data)

lib.some_array_func.argstypes =

...

lib.some_array_func(data)
```

对于结构类型，我们可以创建自己的 c_type：

```C
class CustomType(ctypes.Structure):
_fields_ = [
   ('field1', ctypes.POINTER(ctypes.c_float)),
   ('field2', ctypes.POINTER(ctypes.c_int)),
   ('field3', ctypes.c_int),
]

# Can be used as ctypes.POINTER(CustomType)
```

经过简短的解释后，让我们为张量 C/C++ 库构建 Python 包装器！

```python
# norch/tensor.py

import ctypes

class CTensor(ctypes.Structure):
_fields_ = [
   ('data', ctypes.POINTER(ctypes.c_float)),
   ('strides', ctypes.POINTER(ctypes.c_int)),
   ('shape', ctypes.POINTER(ctypes.c_int)),
   ('ndim', ctypes.c_int),
   ('size', ctypes.c_int),
]

class Tensor:
os.path.abspath(os.curdir)
_C = ctypes.CDLL("COMPILED_LIB.so"))

def __init__(self):

   data, shape = self.flatten(data)
   self.data_ctype = (ctypes.c_float * len(data))(*data)
   self.shape_ctype = (ctypes.c_int * len(shape))(*shape)
   self.ndim_ctype = ctypes.c_int(len(shape))

   self.shape = shape
   self.ndim = len(shape)

   Tensor._C.create_tensor.argtypes =
   Tensor._C.create_tensor.restype = ctypes.POINTER(CTensor)

   self.tensor = Tensor._C.create_tensor(
         self.data_ctype,
         self.shape_ctype,
         self.ndim_ctype,
   )

def flatten(self, nested_list):
   """
   This method simply convert a list type tensor to a flatten tensor with its shape

   Example:

   Arguments:
         nested_list: [, [-5, 2, 0]]
   Return:
         flat_data:
         shape:
   """
   def flatten_recursively(nested_list):
         flat_data = []
         shape = []
         if isinstance(nested_list, list):
            for sublist in nested_list:
               inner_data, inner_shape = flatten_recursively(sublist)
               flat_data.extend(inner_data)
            shape.append(len(nested_list))
            shape.extend(inner_shape)
         else:
            flat_data.append(nested_list)
         return flat_data, shape

   flat_data, shape = flatten_recursively(nested_list)
   return flat_data, shape
```

现在我们包含 Python 张量操作来调用 C/C++ 操作。

```python
# norch/tensor.py

def __getitem__(self, indices):
"""
Access tensor by index tensor
"""

if len(indices) != self.ndim:
   raise ValueError("Number of indices must match the number of dimensions")

Tensor._C.get_item.argtypes =
Tensor._C.get_item.restype = ctypes.c_float

indices = (ctypes.c_int * len(indices))(*indices)
value = Tensor._C.get_item(self.tensor, indices)

return value

def reshape(self, new_shape):
"""
Reshape tensor
result = tensor.reshape()
"""
new_shape_ctype = (ctypes.c_int * len(new_shape))(*new_shape)
new_ndim_ctype = ctypes.c_int(len(new_shape))

Tensor._C.reshape_tensor.argtypes =
Tensor._C.reshape_tensor.restype = ctypes.POINTER(CTensor)
result_tensor_ptr = Tensor._C.reshape_tensor(self.tensor, new_shape_ctype, new_ndim_ctype)

result_data = Tensor()
result_data.tensor = result_tensor_ptr
result_data.shape = new_shape.copy()
result_data.ndim = len(new_shape)
result_data.device = self.device

return result_data

def __add__(self, other):
"""
Add tensors
result = tensor1 + tensor2
"""

if self.shape != other.shape:
   raise ValueError("Tensors must have the same shape for addition")

Tensor._C.add_tensor.argtypes =
Tensor._C.add_tensor.restype = ctypes.POINTER(CTensor)

result_tensor_ptr = Tensor._C.add_tensor(self.tensor, other.tensor)

result_data = Tensor()
result_data.tensor = result_tensor_ptr
result_data.shape = self.shape.copy()
result_data.ndim = self.ndim
result_data.device = self.device

return result_data

# Include the other operations:
# __str__
# __sub__ (-)
# __mul__ (*)
# __matmul__ (@)
# __pow__ (**)
# __truediv__ (/)
# log
# ...
```

您现在就可以运行代码并开始执行一些张量运算！

```python
import norch

tensor1 = norch.Tensor([, ])
tensor2 = norch.Tensor([, ])

result = tensor1 + tensor2
print(result)
# 4
```

页: [1]

ResBang生信交流网's Archiver

从头开始重新创建 PyTorch