🛠️AI Infra Testing②: How to draft a good bug report for the open-source community?

Note: This blog is still under construction.

About one year ago, I read one of Jiawei Liu’s blogs about how to draft a bug report for the open-source community, which inspired me a lot. Sadly, that blog seems to be lost now. Based on my experience of reporting bugs to the open-source community over the past year, I want to write my own blog about drafting bug reports, incorporating some of my personal understanding. As my research focuses on AI Infra testing, I take reporting DL system bugs as the running example. As I mentioned in my previous blog, I assume that you have completed local debugging and found a bug (in PyTorch). (One more word: there are many ways to find bugs, from applying testing techniques to simply running into them in daily use.) If the bug involves debugging torch.compile(), I strongly recommend carefully reading this document first.

S1: Why do we report bugs to the open source community?

[Motivation]: Before we start, we should figure out why we need to report bugs to the open-source community. For many people (researchers), it is an important “KPI” for their papers. There is no denying that the number of reported bugs is an important indicator of a testing technique's effectiveness. Forgive me! Forgive me! Forgive me! I must say that ⚠️reporting outdated or false bugs to the community just to make the data in a paper look better is meaningless and only adds to the community's burden. We should reject and resist this kind of behaviour.

From my perspective, going from discovering a bug to reporting it and then watching the issue get fixed is very satisfying. It also gives me enough research motivation and convinces me that what I am doing is meaningful. Additionally, to be honest, I also need these bugs to support my research (papers) by demonstrating the effectiveness of my approach. I’m also very willing to make my bug list public to facilitate subsequent research. I believe this contributes to the research field!

The community is not a single person; it is built by all of our efforts together.

S2: How to draft a good bug report?

[Code Formatting]: Now, let’s get to the point. When you find a PyTorch bug, you should first provide an executable script (usually a Python code snippet). This script should follow the principal criteria below.

Minimal reproduction: Before we submit a bug report, we should first do some ablations. Let me take PyTorch#151522 below as an example. At the beginning, my fuzzer generated a model with torch.relu()-.to_sparse()-.to_dense(). This model executes successfully on eager but fails on Inductor. I then did some ablations and found that torch.relu() (Line 15) is not needed to trigger the error, so we can remove this operator. I did the same for .to_sparse() and .to_dense() and found that the bug is triggered if and only if .to_sparse() and .to_dense() are used together. An intuitive next step is therefore to isolate the optimization of these two operators from each other with a graph break. Indeed, the crash disappears once torch._dynamo.graph_break() is inserted between them. I recorded this as a comment on Line 17.
Well-organized and highly readable: In addition to being minimal, the script should be well organized and highly readable, which reduces the cost of understanding for developers. First, the code snippet should not exceed 100 lines. Long, redundant code is really annoying and discourages people from reading on. Second, try to encapsulate repeated logic in functions to reduce the number of lines. In the code below, I wrote a function (namely run_test) to reuse some repeated code (e.g., torch.compile, model(*inputs)). Moreover, encapsulated functions shield readers from details, and a function’s role can be identified from its signature.

				
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch._inductor import config

config.fallback_random = True
torch.set_grad_enabled(False)


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # x = torch.relu(x)  # This is not the necessary factor to trigger crash
        x_sparse = x.to_sparse()
        # torch._dynamo.graph_break()  # using `graph_break` can eliminate crash
        x_dense = x_sparse.to_dense()
        return x_dense


model = Model()


x = torch.tensor([[1.0]])

inputs = [x]


def run_test(model, inputs, backend):
    torch.manual_seed(0)
    if backend != "eager":
        model = torch.compile(model, backend=backend)
    try:
        output = model(*inputs)
        print(f"succeed on {backend}")
    except Exception as e:
        print(e)


run_test(model, inputs, 'eager')
run_test(model, inputs, 'inductor')

Eager: succeed on eager

Compiler: LoweringException: NotImplementedError: could not find kernel for aten._to_dense.default at dispatch key DispatchKey.CPU
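
The ablation described above can be partly automated. The following is only a minimal sketch, not my fuzzer: it assumes the model is a linear chain of operator names, and `triggers_bug` is a hypothetical predicate that tells us whether a candidate chain still fails. It greedily drops every operator whose removal keeps the failure, until no single operator can be removed.

```python
from typing import Callable, List, Sequence


def minimize_chain(ops: Sequence[str],
                   triggers_bug: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop operators that are not needed to trigger the bug.

    `ops` is a chain of operator names; `triggers_bug` is a hypothetical
    predicate that rebuilds and runs the model for a candidate chain.
    """
    current = list(ops)
    if not triggers_bug(current):
        raise ValueError("the full chain does not trigger the bug")
    changed = True
    while changed:  # repeat until no single operator can be removed
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if candidate and triggers_bug(candidate):
                current = candidate
                changed = True
                break
    return current


def toy_predicate(chain: List[str]) -> bool:
    """Stand-in oracle (assumption): fails when to_sparse directly precedes to_dense."""
    return any(a == "to_sparse" and b == "to_dense"
               for a, b in zip(chain, chain[1:]))


minimal = minimize_chain(["relu", "to_sparse", "to_dense"], toy_predicate)
print(minimal)
```

In a real setting, the predicate would build the module from the chain and compare eager with the compiled model, much like run_test does.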

[Report Formatting]: I devoted a separate part to the code because of its importance. However, the other components of a bug report matter too. Let me introduce how to write them well.

Bug report title: Usually, it is enough to summarize the problem you encountered in one sentence. A small tip: you can put tags (usually wrapped in [xxx]) in front of the title to attract the attention of the developers of the relevant modules (for example, I put [inductor] or [dynamo] in front of the title of almost every bug report of mine).
Symptom description: In this section, we should clarify what problem we ran into (e.g., in PyTorch#151522, I wrote: “.to_sparse()-.to_dense() throws errors while eager can execute successfully.”). You can also add supplementary information to make things clearer for the developers. For example, in my bug reports, I always state which code generation backend (i.e., CPP or Triton) the bug occurs on. The more detail you provide, the faster developers can understand and fix the bug.
Error logs: This section should contain the results of running the script on your machine. In this case, I provided the results on eager and on the compiler, respectively (as you can see above). A small tip: the error log should also be concise, clear, and short. For example, error messages sometimes contain file paths that make them too long. We can remove these paths because they are usually useless for debugging (from the developers’ side).
Environment and version: Environment and version information is key to letting developers reproduce your bug exactly. Some popular open-source projects ship their own environment collection scripts. Take PyTorch as an example: their bug report template provides the commands below to download and run the environment collection script. After you run the script, you will obtain an extremely long piece of environment information! Simply pasting it into the bug report makes the report annoying to read. The tip here: you can use HTML tags (<details> and <summary>) to fold the environment information. For more details, you can refer to PyTorch#144536, where I explained my thoughts in detail.

				
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
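
The tips on titles, error logs, and environment information can be combined in a small helper. This is a hedged sketch rather than any official tooling: `strip_paths`, `wrap_in_details`, and `build_report` are hypothetical names, the regular expression only handles POSIX-style absolute paths, and the title and traceback line in the usage example are made up for illustration.

```python
import re


def strip_paths(log: str) -> str:
    """Replace absolute POSIX-style directories with '...' but keep the file name."""
    return re.sub(r"(?:/[\w.\-]+)+/([\w.\-]+)", r".../\1", log)


def wrap_in_details(summary: str, body: str) -> str:
    """Fold long text behind a collapsible <details> element."""
    return (
        "<details>\n"
        f"<summary>{summary}</summary>\n\n"
        f"{body.strip()}\n\n"
        "</details>"
    )


def build_report(tag: str, title: str, symptom: str, log: str, env: str) -> str:
    """Assemble title, symptom, shortened error log, and folded environment info."""
    return "\n\n".join([
        f"[{tag}] {title}",
        symptom,
        "Error logs:\n" + strip_paths(log),
        wrap_in_details("Environment", env),
    ])


report = build_report(
    tag="inductor",
    # Illustrative title, not the real title of any issue.
    title=".to_sparse()-.to_dense() fails on Inductor but succeeds on eager",
    symptom=".to_sparse()-.to_dense() throws errors while eager can execute successfully.",
    # The first log line is a hypothetical traceback line with a made-up path.
    log='File "/home/alice/work/repro.py", line 57, in run_test\n'
        "NotImplementedError: could not find kernel for aten._to_dense.default",
    env="(paste the output of collect_env.py here)",
)
print(report)
```

The folded block keeps the report short while the full environment dump stays one click away. Adjust the path pattern if your logs contain Windows paths or URLs, which this sketch does not handle.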

OK, that’s all for the format of a (PyTorch) bug report (from my personal understanding).

S3: Can we try to fix this bug by ourselves?

The last part of this blog introduces how we can fix some simple bugs ourselves. It’s a more advanced skill, but please believe me, sometimes it’s not as difficult as you think! I list the steps in chronological order.

Get familiar with the code base: I must make it clear that “familiar” doesn’t mean that you should know every file or even every line of code. But you should roughly know the directory structure of the code base and what each part does. In PyTorch Inductor, you should know which code files correspond to which optimizations (e.g., partition, decomp, lowering). I am using some specific terms here. If you don’t know them at all, maybe you should learn PyTorch Inductor first.
Figure out the issue: Fixing edge case bugs is often a good start. For semantic errors, I suggest you don’t even touch them; please leave them to professional maintainers. So-called edge case bugs are mostly caused by the lack of some simple boundary checks. Look at this issue (#143779): eager throws an error, but Inductor passes, which means Inductor may lack some checks on the bool dtype. Adding a simple check may be enough to make Inductor throw the same error.
Locate the fault and fix it locally: After figuring out the issue, we can try to locate where to add the checks. In #143779, I found that we can add dtype checks in _meta_registration. So I added an edge case check in _meta_registration for torch.signbit when it processes the bool dtype. With this change, I ran the original script again, and Inductor threw the same error as eager. After this, don’t forget the Unit Test (UT)! Usually, you can write the UT based on the original bug report to check whether the bug still occurs. More details about the bug fix and the UT can be found in this PR (#147666), which is a good example (I think).
PR submission: After completing all the steps above, you can submit your PR to the community. Different communities have different contribution guidelines. PyTorch contribution guidelines can be found here. After you submit the PR, code reviewers will usually come to review it very soon. Please take further actions based on their comments until your PR is merged. For example, they may think your fix is not a good one and propose a better solution. You then need to modify your PR (add new commits) to address their comments. After all, they are the professionals and we are the amateurs. 🙂
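
The “missing boundary check plus regression test” pattern from the steps above can be illustrated without PyTorch. Everything in this sketch is a hypothetical stand-in: `reference_op`, `compiled_op_buggy`, and `compiled_op_fixed` are not real PyTorch functions, and rejecting Python bool values is just an assumption that mimics an eager operator refusing an input the compiled path silently accepts.

```python
import unittest


def reference_op(x):
    """Stand-in for the eager operator: rejects bool inputs."""
    if isinstance(x, bool):
        raise TypeError("bool input is not supported")
    return x < 0


def compiled_op_buggy(x):
    """Stand-in for the compiled path before the fix: the check is missing."""
    return x < 0


def compiled_op_fixed(x):
    """Stand-in for the compiled path after the fix: same check as the reference."""
    if isinstance(x, bool):
        raise TypeError("bool input is not supported")
    return x < 0


class RegressionTest(unittest.TestCase):
    def test_bool_is_rejected_like_reference(self):
        # Written from the original bug report: both paths must raise.
        with self.assertRaises(TypeError):
            reference_op(True)
        with self.assertRaises(TypeError):
            compiled_op_fixed(True)

    def test_valid_input_is_unchanged(self):
        # The new check must not change results for supported inputs.
        for value in (-1.5, 0.0, 2.0):
            self.assertEqual(compiled_op_fixed(value), reference_op(value))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Pointing the first test at `compiled_op_buggy` instead would make it fail, which is exactly how the UT guards against the bug coming back.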

OK. Now I believe your PR is going to be merged. It’s so cool that you have contributed to the open-source community. Thank you for your contribution!

[Conclusion]: That’s all, thank you for reading this far. Finishing a blog is really tiring for me (I will keep improving). Feel free to contact me (shaoyuyoung@gmail.com) at any time! Let’s fight for a better tomorrow!

May 21, 2025 shaoyuyoung

Shaoyu Yang

杨少宇