S1: Introduction to the components of Deep Learning models

It is widely accepted that a Deep Learning (DL) model is essentially a piece of code (in PyTorch, it is a Python class that inherits from torch.nn.Module). A complete DL model test case that can be executed by a DL framework (e.g., PyTorch) consists of two key components (here we only discuss the simplest case, because real-world models are much larger, especially Large Models):

  1. A Python class that inherits from torch.nn.Module
  2. Tensor inputs

Here is a specific model corresponding to these two components.

import torch


# A Python class that inherits from `torch.nn.Module`
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv1d(in_channels=1, out_channels=3, kernel_size=1)
        self.linear = torch.nn.Linear(3, 1)

    def forward(self, x):
        x = x.unsqueeze(1)  # tensor shape: (1, 3) -> (1, 1, 3)
        x = self.conv(x)    # tensor shape: (1, 1, 3) -> (1, 3, 3)
        x = x.mean(dim=-1)  # tensor shape: (1, 3, 3) -> (1, 3)
        return self.linear(x)  # tensor shape: (1, 3) -> (1, 1)

x = torch.randn(1, 3)  # Tensor inputs

m = Model()  # Model initialization

output = m(x)
print(output)
"""
tensor([[-0.0931]], grad_fn=<AddmmBackward0>)
"""

OK, now let me explain the code in more detail.

  • The Model class is a Python class that inherits from torch.nn.Module: __init__ declares the layers (a Conv1d and a Linear), and forward defines how an input tensor flows through them.
  • The tensor input x = torch.randn(1, 3) supplies concrete data whose shape matches what forward expects.
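To make the shape annotations in the comments above observable at run time, one can register forward hooks on the submodules via torch.nn.Module.register_forward_hook. This is a small sketch using the same Model class as above; it only prints shapes and does not change the model's behavior.

```python
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv1d(in_channels=1, out_channels=3, kernel_size=1)
        self.linear = torch.nn.Linear(3, 1)

    def forward(self, x):
        x = x.unsqueeze(1)   # (1, 3) -> (1, 1, 3)
        x = self.conv(x)     # (1, 1, 3) -> (1, 3, 3)
        x = x.mean(dim=-1)   # (1, 3, 3) -> (1, 3)
        return self.linear(x)  # (1, 3) -> (1, 1)

m = Model()
# Register a hook on each submodule to print its output shape
for name, module in m.named_modules():
    if name:  # skip the root module itself
        module.register_forward_hook(
            lambda mod, args, out, name=name: print(name, tuple(out.shape))
        )

x = torch.randn(1, 3)
out = m(x)
# prints:
#   conv (1, 3, 3)
#   linear (1, 1)
```

Note that the hook fires on each submodule's raw output, so the conv hook sees the (1, 3, 3) tensor before the mean reduction.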

S2: Debugging PyTorch models

Sometimes we may encounter errors when running the code. It may of course be a genuine bug in PyTorch, but most of the time the code itself is buggy (usually generated by DL fuzzers, or caused by typos). For example, look at the code below:

class MatrixModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mm_layer = torch.nn.Linear(10, 10)

    def forward(self, x1, inp):
        # Constraint 1: row-column alignment
        v1 = torch.mm(x1, inp)
        # Constraint 2: same shape or broadcastable
        v2 = v1 + inp
        return v2

# The tensor shape definitions do not satisfy the constraints
x1 = torch.randn(2, 10)
inp = torch.randn(2, 10)

The unsatisfied constraint causes a RuntimeError. Let's check the code in the forward function: torch.mm requires that x1 and inp be row-column aligned ([a, b] x [b, c]). However, the shapes of x1 and inp are both [2, 10], so PyTorch throws a RuntimeError. How should we fix this model? Note that the other operator (v2 = v1 + inp) applies an additional constraint: same shape or broadcastable. This means that after the first operation, v1 should have the same shape as inp, i.e., [2, 10]. Consequently, the fix is simple: change the shape of x1 from [2, 10] to [2, 2], so that torch.mm maps [2, 2] x [2, 10] to [2, 10]. Then the model can be executed successfully.
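Applying this fix, the corrected test case looks like this (same model as above; only the shape of x1 has changed):

```python
import torch

class MatrixModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mm_layer = torch.nn.Linear(10, 10)

    def forward(self, x1, inp):
        v1 = torch.mm(x1, inp)  # Constraint 1: [a, b] x [b, c]
        v2 = v1 + inp           # Constraint 2: same shape or broadcastable
        return v2

m = MatrixModel()
x1 = torch.randn(2, 2)    # fixed: [2, 2] x [2, 10] -> [2, 10]
inp = torch.randn(2, 10)
out = m(x1, inp)
print(out.shape)  # torch.Size([2, 10])
```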

I give the above example because I want to show that before we conclude that a DL model triggers a bug, we should check whether it is an invalid model causing a false alert. Such invalid models (whether manually crafted or fuzzer-generated) are meaningless in AI Infra testing.
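One simple way to filter out such false alerts is to run the candidate model once in eager mode and treat exceptions like shape mismatches as invalid test cases rather than framework bugs. The helper below, is_valid_test_case, is a hypothetical sketch for illustration, not part of PyTorch or any specific fuzzer:

```python
import torch

def is_valid_test_case(model, *inputs):
    """Hypothetical validity check: True if the model runs in eager
    mode without raising. Eager-mode failures (e.g. shape mismatches)
    usually indicate an invalid test case, not a PyTorch bug."""
    try:
        model(*inputs)
        return True
    except (RuntimeError, TypeError, ValueError):
        return False

class MatrixModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mm_layer = torch.nn.Linear(10, 10)

    def forward(self, x1, inp):
        return torch.mm(x1, inp) + inp

m = MatrixModel()
invalid = is_valid_test_case(m, torch.randn(2, 10), torch.randn(2, 10))
valid = is_valid_test_case(m, torch.randn(2, 2), torch.randn(2, 10))
print(invalid, valid)  # False True
```

A real fuzzing pipeline would of course need to distinguish "model is invalid" from "PyTorch crashed on a valid model", but this shows the basic triage step.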

S3: Correctness of torch.compile()

torch.compile() is the most important feature in PyTorch 2.x.