Triton 내부 심층 분석 (파트 2)

이전 글(https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/)에서는 트리톤 컴파일러 내부와 코드 생성 파이프라인을 살펴보았다. 이번 글에서는 프런트엔드에서 triton.compile이 호출될 때 내부적으로 어떤 일이 일어나는지 더 깊이 파고든다. 먼저 파이썬 디버거로 파이썬 컴파일러 레이어를 들여다보고, 이후 C/C++ 네이티브 레이어를 살펴본다.

이 글은 심층 분석 시리즈의 파트 2이다. 같은 시리즈의 다른 글:

Triton 내부 심층 분석 (파트 1)(https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals).
Triton 내부 심층 분석 (파트 3)(https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals-3).

Git 해시

Triton 내부 구현은 앞으로 바뀔 수 있다. 이 글에서 사용한 버전의 git 해시는 다음과 같다:
1
2
git rev-parse origin/master # 이 글을 작성할 당시
# 14025786d108596cfd99700caa4f438938c2ceba

위 해시에서 공식 triton 저장소를 내 포크 저장소(https://github.com/kapilsh/triton)로 포크했다. 모든 코드 포인터 링크에서 해당 해시를 확인할 수 있다.

triton 저장소(https://github.com/triton-lang/triton)의 안내를 따라 포크하고 소스에서 triton을 설치하자.

시작하기

이전 글에서 triton 컴파일러를 직접 사용해 커널의 코드를 생성할 수 있음을 살펴보았다.

튜토리얼 예제 중 하나를 컴파일하는 명령은 다음과 같다:

1
2
3
4
5
$ python3 python/triton/tools/compile.py \
  --kernel-name add_kernel \
  --signature "*fp32,*fp32,*fp32,i32,64" \
  --grid=1024,1024,1024 \
  python/tutorials/01-vector-add.py

이 명령은 add_kernel.9969bdda_0123.c, add_kernel.9969bdda_0123.h 두 파일을 생성한다! 파일이 어떻게 생성되는지 살펴보자:

AST 소스 코드 생성

AST 소스 코드 생성(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/tools/compile.py#L110)은 Triton 컴파일러 파이프라인의 첫 단계다.
커맨드라인 인자로 다음을 지정할 수 있다:
- 커널이 들어 있는 Python 파일의 경로
- 컴파일할 커널의 이름
- 커널에 사용할 warp/stage 수
- 컴파일된 커널의 출력 파일명 및 경로
- 타입과 constexpr 값을 포함한 커널 시그니처
- 커널의 실행 그리드(launch grid)
대부분의 커맨드라인 인자는 ASTSource 생성자에 전달된다:

1
src = triton.compiler.ASTSource(fn=kernel, constants=constants, signature=signature, attrs=attrs)

브레이크포인트를 걸어 ASTSource 내부를 들여다보자:

1
2
3
4
5
6
7
8
9
10
11
12
13
$ git --no-pager diff
diff --git a/python/triton/tools/compile.py b/python/triton/tools/compile.py
index 872332b0..b2fe5abf 100644
--- a/python/triton/tools/compile.py
+++ b/python/triton/tools/compile.py
@@ -108,6 +108,7 @@ if __name__ == "__main__":
     for i in equal_to_1:
         constants.update({i: 1})
     src = triton.compiler.ASTSource(fn=kernel, constants=constants, signature=signature, attrs=attrs)
+    import ipdb; ipdb.set_trace()
     opts = {"num_warps": args.num_warps, "num_stages": args.num_stages}
     ccinfo = triton.compile(src, options=opts)
     arg_names = []

1
2
3
4
5
6
7
8
9
10
11
12
13
-> % python3 python/triton/tools/compile.py \
  --kernel-name add_kernel \
  --signature "*fp32,*fp32,*fp32,i32,64" \
  --grid=1024,1024,1024 \
  python/tutorials/01-vector-add.py
...
> /home/ksharma/dev/git/triton/python/triton/tools/compile.py(112)<module>()
    111     import ipdb; ipdb.set_trace()
--> 112     opts = {"num_warps": args.num_warps, "num_stages": args.num_stages}
    113     ccinfo = triton.compile(src, options=opts)

ipdb> src
<triton.compiler.compiler.ASTSource object at 0x7ecefad51af0>

참고로, ASTSource(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/compiler/compiler.py#L88-L116)는 triton/compiler/compiler.py에 정의되어 있다:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
ipdb> dir(src)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'attrs', 'constants', 'ext', 'fn', 'hash', 'make_ir', 'name', 'parse_options', 'signature']
ipdb> src.signature
{0: '*fp32', 1: '*fp32', 2: '*fp32', 3: 'i32'}
ipdb> src .attrs
AttrsDescriptor(divisible_by_16=[], equal_to_1=[])
ipdb> src.fn
JITFunction(01-vector-add:add_kernel)
ipdb> src.fn.arg_names
['x_ptr', 'y_ptr', 'output_ptr', 'n_elements', 'BLOCK_SIZE']
ipdb> src.fn.params[0]
<triton.runtime.jit.KernelParam object at 0x7ecff9464e90>
ipdb> src.fn.params[0].name
'x_ptr'
ipdb> print(src.fn.src)
def add_kernel(x_ptr,  # *Pointer* to first input vector.
               y_ptr,  # *Pointer* to second input vector.
               output_ptr,  # *Pointer* to output vector.
               n_elements,  # Size of the vector.
               BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process.
               # NOTE: `constexpr` so it can be used as a shape value.
               ):
    ...truncated...
    output = x + y                                                                                                                      
    # Write x + y back to DRAM.                                                                                                         
    tl.store(output_ptr + offsets, output, mask=mask)

AST Source에는 커널 정보와 코드, 시그니처, 속성, 소스 코드 등 모든 정보가 들어 있음을 확인할 수 있다.

몇 단계 더 진행해 보자:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/tools/compile.py(113)<module>()
    112     opts = {"num_warps": args.num_warps, "num_stages": args.num_stages}
--> 113     ccinfo = triton.compile(src, options=opts)
    114     arg_names = []

ipdb> n
> /home/ksharma/dev/git/triton/python/triton/tools/compile.py(114)<module>()
    113     ccinfo = triton.compile(src, options=opts)
--> 114     arg_names = []
    115     arg_types = []

ipdb> ccinfo
<triton.compiler.compiler.CompiledKernel object at 0x7ecef81fdc40>

ccinfo는 본질적으로 이전 글(https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/#ttir-triton-ir)에서 봤던 것과 동일하다. LLVM IR, PTX, CUBIN 등을 포함하는 컴파일된 커널 정보를 담고 있다.

1
2
3
4
5
6
7
8
9
10
ipdb> ccinfo.asm.keys()
dict_keys(['ttir', 'ttgir', 'llir', 'ptx', 'cubin'])
ipdb> print(ccinfo.asm["ttir"])
#loc = loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)
module {
  tt.func public @add_kernel(%arg0: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg1: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg2: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg3: i32 loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)) attributes {noinline = false} {
    %c64_i32 = arith.constant 64 : i32 loc(#loc1)
    %0 = tt.get_program_id x : i32 loc(#loc2)
    %1 = arith.muli %0, %c64_i32 : i32 loc(#loc3)
    ...truncated...

이 시점에서 컴파일을 위한 대부분의 작업이 완료된다. 이어지는 몇 줄(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/tools/compile.py#L124-L145)에서, 이전 글에서 보았듯 커널이 .c/.h 파일로 덤프된다. 모든 정보는 커널 소스/헤더 템플릿에 전달되고, 파이썬 포매팅(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/tools/compile.py#L142-L145)으로 삽입된다.

C 소스 템플릿(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/tools/compile.c)
헤더 템플릿(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/tools/compile.h)

`triton.compile` 들어가기

이제 triton.compile 내부로 들어가 컴파일러 안에서 무슨 일이 일어나는지 보자. 바로 여기서 마법이 일어난다!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/tools/compile.py(113)<module>()
    112     opts = {"num_warps": args.num_warps, "num_stages": args.num_stages}
--> 113     ccinfo = triton.compile(src, options=opts)
    114     arg_names = []

ipdb> s
--Call--
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(226)compile()
    225 
--> 226 def compile(src, target=None, options=None):
    227     if target is None:

ipdb> options
{'num_warps': 1, 'num_stages': 3}
ipdb> src
<triton.compiler.compiler.ASTSource object at 0x7fe3cf7c1dc0>

이제 triton/compiler/compiler.py(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/compiler/compiler.py#L227)로 들어왔다.

몇 번 더 진행하면:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(231)compile()
    230     backend = make_backend(target)
--> 231     ir_source = not isinstance(src, ASTSource)
    232     # create backend

ipdb> backend
<nvidia.CUDABackend object at 0x7fe4cb7dcef0>

...step forward...

ipdb> options
CUDAOptions(num_warps=1, num_ctas=1, num_stages=3, maxnreg=None, cluster_dims=(1, 1, 1), ptx_version=None, enable_fp_fusion=True, allow_fp8e4nv=True, allow_fp8e4b15=True, default_dot_input_precision='tf32', allowed_dot_input_precisions=('tf32', 'tf32x3', 'ieee'), max_num_imprecise_acc_default=0, extern_libs=(('libdevice', '/home/ksharma/dev/git/triton/python/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda')

이 부분에서 백엔드 옵션이 생성/파싱된다. 이 코드는 일반화되어 있어 AMD 백엔드도 처리할 수 있을 것으로 보인다. 계속 진행해 보자:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
...step forward...
ipdb> s
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(277)compile()
    276     import ipdb; ipdb.set_trace()
--> 277     context = ir.context()
    278     ir.load_dialects(context)

ipdb> import pprint
ipdb> pprint.pprint(metadata)
{'allow_fp8e4b15': True,
 'allow_fp8e4nv': True,
 'allowed_dot_input_precisions': ('tf32', 'tf32x3', 'ieee'),
 'backend_name': 'cuda',
 'cluster_dims': (1, 1, 1),
 'debug': False,
 'default_dot_input_precision': 'tf32',
 'enable_fp_fusion': True,
 'extern_libs': (('libdevice',
                  '/home/ksharma/dev/git/triton/python/triton/backends/nvidia/lib/libdevice.10.bc'),),
 'hash': 'c8abb49242c7120a41c83f2e04bf352aac3f33813783a2ccf837a9f62e0f66d7',
 'max_num_imprecise_acc_default': 0,
 'maxnreg': None,
 'num_ctas': 1,
 'num_stages': 3,
 'num_warps': 1,
 'ptx_version': None,
 'target': GPUTarget(backend='cuda', arch=89, warp_size=32)}

이제 타깃 백엔드가 설정되었음을 볼 수 있다 — NVIDIA CUDABackend!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(279)compile()
    278     ir.load_dialects(context)
--> 279     backend.load_dialects(context)
    280     codegen_fns = backend.get_codegen_implementation()

ipdb> ir
<module 'triton._C.libtriton.ir'>

ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(280)compile()
    279     backend.load_dialects(context)
--> 280     codegen_fns = backend.get_codegen_implementation()
    281     try:

ipdb> backend
<nvidia.CUDABackend object at 0x73a501b3e0f0>
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(281)compile()
    280     codegen_fns = backend.get_codegen_implementation()
--> 281     try:
    282         module = src.make_ir(options, codegen_fns, context)

ipdb> codegen_fns
{'convert_custom_types': <function convert_custom_float8_sm80 at 0x73a6c4c73100>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x73a5b7f45080>}

이제 IR 생성 단계로 진입하기 시작했고, 이 작업의 대부분은 C/C++ 레이어에서 수행된다.

context: 코드(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/src/ir.cc#L206-L215)
load_dialects: 코드(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/python/src/ir.cc#L221-L231)
module: 코드(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/python/src/ir.cc#L464)

i.e. 파이썬 프런트엔드를 구동하는 모든 C++ 코드는 python/src(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/src)에서 찾을 수 있다.

시험 삼아 확인해 보자. C++ 코드에 stdout 출력문을 넣어 실제로 출력되는지 보겠다.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ git --no-pager diff
diff --git a/python/src/ir.cc b/python/src/ir.cc
index 46095dcc..ec78a7fd 100644
--- a/python/src/ir.cc
+++ b/python/src/ir.cc
@@ -1,6 +1,7 @@
 #include <pybind11/functional.h>
 #include <pybind11/pybind11.h>
 #include <pybind11/stl.h>
+#include <iostream>
 
 #include "mlir/Bytecode/BytecodeWriter.h"
 #include "mlir/Dialect/ControlFlow/IR/ControlFlow.h"
@@ -219,6 +220,9 @@ void init_triton_ir(py::module &&m) {
       .def(py::init<llvm::SourceMgr &, MLIRContext *>());
 
   m.def("load_dialects", [](MLIRContext &context) {
+    std::cout << "==========================================" << std::endl;
+    std::cout << "Loading dialects" << std::endl;
+    std::cout << "==========================================" << std::endl;
     DialectRegistry registry;
     registry.insert<TritonDialect, ::mlir::triton::gpu::TritonGPUDialect,
                     math::MathDialect, arith::ArithDialect, index::IndexDialect,

참고: C++ 코드를 변경하면 triton 파이썬 패키지를 다시 빌드해야 한다: pip install -e python # triton 저장소 내부. 추가 정보(https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source)

짜잔, 터미널에 출력문이 보인다:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(278)compile()
    277     context = ir.context()
--> 278     ir.load_dialects(context)
    279     backend.load_dialects(context)

ipdb> n
==========================================
Loading dialects
==========================================
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(279)compile()
    278     ir.load_dialects(context)
--> 279     backend.load_dialects(context)
    280     codegen_fns = backend.get_codegen_implementation()

컴파일러 백엔드

이제 백엔드 단계에 도달했다. 백엔드는 타깃 하드웨어용 코드를 생성하는 역할을 한다. 여기서는 CUDA 백엔드를 사용한다.

소스 코드에서 백엔드 코드는 third_party/nvidia(https://github.com/kapilsh/triton/tree/14025786d108596cfd99700caa4f438938c2ceba/third_party/nvidia)에 있다. 이는 빌드 과정에서 python/triton/backends/nvidia로 심볼릭 링크된다. 같은 디렉터리에 AMD 백엔드도 있는 것을 아래에서 확인할 수 있다.

내 머신에서 빌드 후 디렉터리 구조는 다음과 같다:

1
2
3
4
5
6
7
8
9
10
$ ll python/triton/backends 
total 16K
lrwxrwxrwx 1 ksharma ksharma   52 Aug 12 22:55 amd -> /home/ksharma/dev/git/triton/third_party/amd/backend
-rw-rw-r-- 1 ksharma ksharma 2.7K Aug  9 16:27 compiler.py
-rw-rw-r-- 1 ksharma ksharma  977 Aug  9 16:27 driver.py
-rw-rw-r-- 1 ksharma ksharma 1.6K Aug  9 16:27 __init__.py
lrwxrwxrwx 1 ksharma ksharma   55 Aug 12 22:55 nvidia -> /home/ksharma/dev/git/triton/third_party/nvidia/backend
drwxrwxr-x 2 ksharma ksharma 4.0K Aug  9 17:42 __pycache__

다음 두 줄을 보자:

1
2
backend.load_dialects(context)
codegen_fns = backend.get_codegen_implementation()

load_dialects는 NVIDIA 백엔드의 여기(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/third_party/nvidia/backend/compiler.py#L158-L159)로 추적할 수 있다.
get_codegen_implementation은 NVIDIA 백엔드의 여기(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/third_party/nvidia/backend/compiler.py#L149-L156)로 추적할 수 있다.

NVIDIA 백엔드 컴파일러 코드를 더 깊이 보면, CUDA 백엔드의 실제 코드 생성 포인터를 찾을 수 있다.

CUDA 백엔드의 backend.add_stages는 서로 다른 컴파일러 단계를 추가한다. 코드(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/third_party/nvidia/backend/compiler.py#L346-L351)

1
2
3
4
5
6
7
# NVIDIA 백엔드 compiler.py 중
def add_stages(self, stages, options):
    stages["ttir"] = lambda src, metadata: self.make_ttir(src, metadata, options)
    stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, self.capability)
    stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options, self.capability)
    stages["ptx"] = lambda src, metadata: self.make_ptx(src, metadata, options, self.capability)
    stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)

양파 껍질을 한 겹 더 벗겨 make_ttir 함수를 보자:

1
2
3
4
5
6
7
8
9
10
11
12
13
def make_ttir(mod, metadata, opt):
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    passes.common.add_inliner(pm)
    passes.ttir.add_rewrite_tensor_pointer(pm)
    passes.ttir.add_combine(pm)
    passes.common.add_canonicalizer(pm)
    passes.ttir.add_reorder_broadcast(pm)
    passes.common.add_cse(pm)
    passes.common.add_licm(pm)
    passes.common.add_symbol_dce(pm)
    pm.run(mod)
    return mod

ir.pass_manager(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/python/src/ir.cc#L1579)를 보면 MLIR PassManager를 반환한다. 여기서부터 MLIR 레이어로 들어가기 시작한다. 이 코드를 수정해 MLIR IR 덤프(환경 변수 MLIR_ENABLE_DUMP로 제어됨)를 항상 출력하도록 해 보자. 추가로 진단 메시지도 항상 출력하도록 하자.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ git --no-pager diff
diff --git a/python/src/ir.cc b/python/src/ir.cc
index 46095dcc..b4f1aa22 100644
--- a/python/src/ir.cc
+++ b/python/src/ir.cc
@@ -1584,6 +1584,8 @@ void init_triton_ir(py::module &&m) {
              bool haveDiagnostics =
                  ::triton::tools::getBoolEnv("MLIR_ENABLE_DIAGNOSTICS");
              bool haveDump = ::triton::tools::getBoolEnv("MLIR_ENABLE_DUMP");
+             haveDiagnostics = true;
+             haveDump = true;
              std::string funcToDump;
              if (!haveDump) {
                funcToDump = triton::tools::getStrEnv("MLIR_ENABLE_DUMP");

다시 빌드해야 한다: pip install -e python. 출력이 매우 많지만, 여기서는 SymbolDCE와 ConvertTritonToTritonGPU 단계에서 MLIR이 출력되는 작은 일부만 발췌한다:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
...
// -----// IR Dump Before SymbolDCE (symbol-dce) ('builtin.module' operation) //----- //
#loc = loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)
module {
  tt.func public @add_kernel(%arg0: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg1: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg2: !tt.ptr<f32> loc("/home/ksharma/dev/git/trit
on/python/tutorials/01-vector-add.py":28:0), %arg3: i32 loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)) attributes {noinline = false} {
    %c64_i32 = arith.constant 64 : i32 loc(#loc1)
    %0 = tt.get_program_id x : i32 loc(#loc2)

...truncated...

// -----// IR Dump Before ConvertTritonToTritonGPU (convert-triton-to-tritongpu) ('builtin.module' operation) //----- //
#loc = loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)
...

참고: 디버그 정보, IR 등을 출력/덤프하는 것은 공식 triton 가이드를 따르자. 추가 정보(https://github.com/kapilsh/triton/tree/14025786d108596cfd99700caa4f438938c2ceba?tab=readme-ov-file#tips-for-hacking)

Triton C++ -> Python 바인딩

make_ttir 함수를 더 들여다보자. 실제 MLIR 패스가 추가되는 곳이다:

1
passes.common.add_inliner(pm)

passes.common은 init_triton_passes_common(https://github.com/triton-lang/triton/blob/9e955f1454095725dbb7bed96c8112092c02929e/python/src/passes.cc#L26-L34)에 매핑된다.

다른 패스도 유사하게 추가된다.

1
2
3
4
5
6
7
passes.ttir.add_rewrite_tensor_pointer(pm)
passes.ttir.add_combine(pm)
passes.common.add_canonicalizer(pm)
passes.ttir.add_reorder_broadcast(pm)
passes.common.add_cse(pm)
passes.common.add_licm(pm)
passes.common.add_symbol_dce(pm)

사실 모든 패스는 pythpn/src/passes.cc(https://github.com/triton-lang/triton/blob/9e955f1454095725dbb7bed96c8112092c02929e/python/src/passes.cc) 파일에 정의되어 있다. 예를 몇 가지 들면:

init_triton_passes_ttir
init_triton_passes_ttgpuir

나머지 triton 컴파일러 패스는 pythpn/src/passes.cc(https://github.com/triton-lang/triton/blob/9e955f1454095725dbb7bed96c8112092c02929e/python/src/passes.cc#L83-L90)를 참고하라.

마지막으로, C++ 컴파일러 백엔드는 pybind11 바인딩을 통해 파이썬에 노출된다. 바인딩은 python/src/main.cc(https://github.com/triton-lang/triton/blob/9e955f1454095725dbb7bed96c8112092c02929e/python/src/main.cc#L46-L55)에 정의되어 있다.

컴파일러 패스

커널을 컴파일하기 위해 적용되는 컴파일러 패스 몇 가지를 개별적으로 살펴보자.

인라이너 패스

1
passes.common.add_inliner(pm)

그리고

1
2
3
4
5
void init_triton_passes_common(py::module &&m) {
  ...
  ADD_PASS_WRAPPER_0("add_inliner", createInlinerPass);
  ...
}

결국 mlir::createInlinerPass에 매핑된다. 이는 CallGraph에 정의된 대로 호출과 callable 연산을 인라인하는 패스를 생성한다. MLIR 코드 포인터(https://mlir.llvm.org/doxygen/InlinerPass_8cpp_source.html)

Tensor Pointer 재작성 패스

1
passes.ttir.add_rewrite_tensor_pointer(pm)

은 다음에 매핑된다

1
2
3
4
5
6
7
8
9
void init_triton_passes_ttir(py::module &&m) {
  using namespace mlir::triton;
  ...
  ADD_PASS_WRAPPER_0("add_rewrite_tensor_pointer",
                     createRewriteTensorPointerPass);
                     
  ...
}

여기서 createRewriteTensorPointerPass(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp#L570)는 Triton MLIR Dialect의 일부이며 RewriteTensorPointerPass(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp#L199)를 반환한다.

1
2
3
std::unique_ptr<Pass> triton::createRewriteTensorPointerPass() {
  return std::make_unique<RewriteTensorPointerPass>();
}

이 패스가 정확히 무엇을 하는지는 명확하지 않다. 다만, 이 패스의 mlir-tblgen 정의를 찾아보니 include/triton/Dialect/Triton/Transforms/Passes.td(https://github.com/triton-lang/triton/blob/9e955f1454095725dbb7bed96c8112092c02929e/include/triton/Dialect/Triton/Transforms/Passes.td#L31-L42)에 정의되어 있다.

요약과 설명에 따르면, 이 패스는 tensor 포인터를 “레거시” 포인터로 다시 쓰는 것처럼 보인다. 하지만 레거시 포인터가 무엇인지는 잘 모르겠다! let summary = "Rewrite load/stores with tensor pointers into legacy load/stores";

다른 패스들

적용되는 다른 패스가 여럿 있으며, 독자가 직접 더 탐구해 보기를 권한다.

마무리 생각

점점 더 MLIR 레이어로 깊이 들어가고 있으며, 이는 또 다른 판도라의 상자를 여는 일일 수 있다. 글이 이미 꽤 길어졌으므로 그 탐구는 다음 기회로 미루겠다. Triton 컴파일러 프런트엔드-백엔드 통합에 관심 있는 이들에게 좋은 출발점이 되었길 바란다.

이 글은 심층 분석 시리즈의 파트 2이다. 같은 시리즈의 다른 글:

Triton 내부 심층 분석 (파트 1)(https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals).
Triton 내부 심층 분석 (파트 3)(https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals-3).

Git 해시

Triton 내부 구현은 앞으로 바뀔 수 있다. 이 글에서 사용한 버전의 git 해시는 다음과 같다:
1
2
git rev-parse origin/master # 이 글을 작성할 당시
# 14025786d108596cfd99700caa4f438938c2ceba

위 해시에서 공식 triton 저장소를 내 포크 저장소(https://github.com/kapilsh/triton)로 포크했다. 모든 코드 포인터 링크에서 해당 해시를 확인할 수 있다.

triton 저장소(https://github.com/triton-lang/triton)의 안내를 따라 포크하고 소스에서 triton을 설치하자.

시작하기

이전 글에서 triton 컴파일러를 직접 사용해 커널의 코드를 생성할 수 있음을 살펴보았다.

튜토리얼 예제 중 하나를 컴파일하는 명령은 다음과 같다:

1
2
3
4
5
$ python3 python/triton/tools/compile.py \
  --kernel-name add_kernel \
  --signature "*fp32,*fp32,*fp32,i32,64" \
  --grid=1024,1024,1024 \
  python/tutorials/01-vector-add.py

이 명령은 add_kernel.9969bdda_0123.c, add_kernel.9969bdda_0123.h 두 파일을 생성한다! 파일이 어떻게 생성되는지 살펴보자:

AST 소스 코드 생성

AST 소스 코드 생성(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/tools/compile.py#L110)은 Triton 컴파일러 파이프라인의 첫 단계다.
커맨드라인 인자로 다음을 지정할 수 있다:
- 커널이 들어 있는 Python 파일의 경로
- 컴파일할 커널의 이름
- 커널에 사용할 warp/stage 수
- 컴파일된 커널의 출력 파일명 및 경로
- 타입과 constexpr 값을 포함한 커널 시그니처
- 커널의 실행 그리드(launch grid)
대부분의 커맨드라인 인자는 ASTSource 생성자에 전달된다:

1
src = triton.compiler.ASTSource(fn=kernel, constants=constants, signature=signature, attrs=attrs)

브레이크포인트를 걸어 ASTSource 내부를 들여다보자:

1
2
3
4
5
6
7
8
9
10
11
12
13
$ git --no-pager diff
diff --git a/python/triton/tools/compile.py b/python/triton/tools/compile.py
index 872332b0..b2fe5abf 100644
--- a/python/triton/tools/compile.py
+++ b/python/triton/tools/compile.py
@@ -108,6 +108,7 @@ if __name__ == "__main__":
     for i in equal_to_1:
         constants.update({i: 1})
     src = triton.compiler.ASTSource(fn=kernel, constants=constants, signature=signature, attrs=attrs)
+    import ipdb; ipdb.set_trace()
     opts = {"num_warps": args.num_warps, "num_stages": args.num_stages}
     ccinfo = triton.compile(src, options=opts)
     arg_names = []

1
2
3
4
5
6
7
8
9
10
11
12
13
-> % python3 python/triton/tools/compile.py \
  --kernel-name add_kernel \
  --signature "*fp32,*fp32,*fp32,i32,64" \
  --grid=1024,1024,1024 \
  python/tutorials/01-vector-add.py
...
> /home/ksharma/dev/git/triton/python/triton/tools/compile.py(112)<module>()
    111     import ipdb; ipdb.set_trace()
--> 112     opts = {"num_warps": args.num_warps, "num_stages": args.num_stages}
    113     ccinfo = triton.compile(src, options=opts)

ipdb> src
<triton.compiler.compiler.ASTSource object at 0x7ecefad51af0>

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
ipdb> dir(src)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'attrs', 'constants', 'ext', 'fn', 'hash', 'make_ir', 'name', 'parse_options', 'signature']
ipdb> src.signature
{0: '*fp32', 1: '*fp32', 2: '*fp32', 3: 'i32'}
ipdb> src .attrs
AttrsDescriptor(divisible_by_16=[], equal_to_1=[])
ipdb> src.fn
JITFunction(01-vector-add:add_kernel)
ipdb> src.fn.arg_names
['x_ptr', 'y_ptr', 'output_ptr', 'n_elements', 'BLOCK_SIZE']
ipdb> src.fn.params[0]
<triton.runtime.jit.KernelParam object at 0x7ecff9464e90>
ipdb> src.fn.params[0].name
'x_ptr'
ipdb> print(src.fn.src)
def add_kernel(x_ptr,  # *Pointer* to first input vector.
               y_ptr,  # *Pointer* to second input vector.
               output_ptr,  # *Pointer* to output vector.
               n_elements,  # Size of the vector.
               BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process.
               # NOTE: `constexpr` so it can be used as a shape value.
               ):
    ...truncated...
    output = x + y                                                                                                                      
    # Write x + y back to DRAM.                                                                                                         
    tl.store(output_ptr + offsets, output, mask=mask)

AST Source에는 커널 정보와 코드, 시그니처, 속성, 소스 코드 등 모든 정보가 들어 있음을 확인할 수 있다.

몇 단계 더 진행해 보자:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/tools/compile.py(113)<module>()
    112     opts = {"num_warps": args.num_warps, "num_stages": args.num_stages}
--> 113     ccinfo = triton.compile(src, options=opts)
    114     arg_names = []

ipdb> n
> /home/ksharma/dev/git/triton/python/triton/tools/compile.py(114)<module>()
    113     ccinfo = triton.compile(src, options=opts)
--> 114     arg_names = []
    115     arg_types = []

ipdb> ccinfo
<triton.compiler.compiler.CompiledKernel object at 0x7ecef81fdc40>

ccinfo는 본질적으로 이전 글(https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/#ttir-triton-ir)에서 봤던 것과 동일하다. LLVM IR, PTX, CUBIN 등을 포함하는 컴파일된 커널 정보를 담고 있다.

1
2
3
4
5
6
7
8
9
10
ipdb> ccinfo.asm.keys()
dict_keys(['ttir', 'ttgir', 'llir', 'ptx', 'cubin'])
ipdb> print(ccinfo.asm["ttir"])
#loc = loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)
module {
  tt.func public @add_kernel(%arg0: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg1: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg2: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg3: i32 loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)) attributes {noinline = false} {
    %c64_i32 = arith.constant 64 : i32 loc(#loc1)
    %0 = tt.get_program_id x : i32 loc(#loc2)
    %1 = arith.muli %0, %c64_i32 : i32 loc(#loc3)
    ...truncated...

C 소스 템플릿(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/tools/compile.c)
헤더 템플릿(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/tools/compile.h)

`triton.compile` 들어가기

이제 triton.compile 내부로 들어가 컴파일러 안에서 무슨 일이 일어나는지 보자. 바로 여기서 마법이 일어난다!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/tools/compile.py(113)<module>()
    112     opts = {"num_warps": args.num_warps, "num_stages": args.num_stages}
--> 113     ccinfo = triton.compile(src, options=opts)
    114     arg_names = []

ipdb> s
--Call--
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(226)compile()
    225 
--> 226 def compile(src, target=None, options=None):
    227     if target is None:

ipdb> options
{'num_warps': 1, 'num_stages': 3}
ipdb> src
<triton.compiler.compiler.ASTSource object at 0x7fe3cf7c1dc0>

이제 triton/compiler/compiler.py(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/triton/compiler/compiler.py#L227)로 들어왔다.

몇 번 더 진행하면:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(231)compile()
    230     backend = make_backend(target)
--> 231     ir_source = not isinstance(src, ASTSource)
    232     # create backend

ipdb> backend
<nvidia.CUDABackend object at 0x7fe4cb7dcef0>

...step forward...

ipdb> options
CUDAOptions(num_warps=1, num_ctas=1, num_stages=3, maxnreg=None, cluster_dims=(1, 1, 1), ptx_version=None, enable_fp_fusion=True, allow_fp8e4nv=True, allow_fp8e4b15=True, default_dot_input_precision='tf32', allowed_dot_input_precisions=('tf32', 'tf32x3', 'ieee'), max_num_imprecise_acc_default=0, extern_libs=(('libdevice', '/home/ksharma/dev/git/triton/python/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda')

이 부분에서 백엔드 옵션이 생성/파싱된다. 이 코드는 일반화되어 있어 AMD 백엔드도 처리할 수 있을 것으로 보인다. 계속 진행해 보자:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
...step forward...
ipdb> s
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(277)compile()
    276     import ipdb; ipdb.set_trace()
--> 277     context = ir.context()
    278     ir.load_dialects(context)

ipdb> import pprint
ipdb> pprint.pprint(metadata)
{'allow_fp8e4b15': True,
 'allow_fp8e4nv': True,
 'allowed_dot_input_precisions': ('tf32', 'tf32x3', 'ieee'),
 'backend_name': 'cuda',
 'cluster_dims': (1, 1, 1),
 'debug': False,
 'default_dot_input_precision': 'tf32',
 'enable_fp_fusion': True,
 'extern_libs': (('libdevice',
                  '/home/ksharma/dev/git/triton/python/triton/backends/nvidia/lib/libdevice.10.bc'),),
 'hash': 'c8abb49242c7120a41c83f2e04bf352aac3f33813783a2ccf837a9f62e0f66d7',
 'max_num_imprecise_acc_default': 0,
 'maxnreg': None,
 'num_ctas': 1,
 'num_stages': 3,
 'num_warps': 1,
 'ptx_version': None,
 'target': GPUTarget(backend='cuda', arch=89, warp_size=32)}

이제 타깃 백엔드가 설정되었음을 볼 수 있다 — NVIDIA CUDABackend!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(279)compile()
    278     ir.load_dialects(context)
--> 279     backend.load_dialects(context)
    280     codegen_fns = backend.get_codegen_implementation()

ipdb> ir
<module 'triton._C.libtriton.ir'>

ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(280)compile()
    279     backend.load_dialects(context)
--> 280     codegen_fns = backend.get_codegen_implementation()
    281     try:

ipdb> backend
<nvidia.CUDABackend object at 0x73a501b3e0f0>
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(281)compile()
    280     codegen_fns = backend.get_codegen_implementation()
--> 281     try:
    282         module = src.make_ir(options, codegen_fns, context)

ipdb> codegen_fns
{'convert_custom_types': <function convert_custom_float8_sm80 at 0x73a6c4c73100>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x73a5b7f45080>}

이제 IR 생성 단계로 진입하기 시작했고, 이 작업의 대부분은 C/C++ 레이어에서 수행된다.

context: 코드(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/src/ir.cc#L206-L215)
load_dialects: 코드(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/python/src/ir.cc#L221-L231)
module: 코드(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/python/src/ir.cc#L464)

i.e. 파이썬 프런트엔드를 구동하는 모든 C++ 코드는 python/src(https://github.com/kapilsh/triton/blob/14025786d108596cfd99700caa4f438938c2ceba/python/src)에서 찾을 수 있다.

시험 삼아 확인해 보자. C++ 코드에 stdout 출력문을 넣어 실제로 출력되는지 보겠다.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ git --no-pager diff
diff --git a/python/src/ir.cc b/python/src/ir.cc
index 46095dcc..ec78a7fd 100644
--- a/python/src/ir.cc
+++ b/python/src/ir.cc
@@ -1,6 +1,7 @@
 #include <pybind11/functional.h>
 #include <pybind11/pybind11.h>
 #include <pybind11/stl.h>
+#include <iostream>
 
 #include "mlir/Bytecode/BytecodeWriter.h"
 #include "mlir/Dialect/ControlFlow/IR/ControlFlow.h"
@@ -219,6 +220,9 @@ void init_triton_ir(py::module &&m) {
       .def(py::init<llvm::SourceMgr &, MLIRContext *>());
 
   m.def("load_dialects", [](MLIRContext &context) {
+    std::cout << "==========================================" << std::endl;
+    std::cout << "Loading dialects" << std::endl;
+    std::cout << "==========================================" << std::endl;
     DialectRegistry registry;
     registry.insert<TritonDialect, ::mlir::triton::gpu::TritonGPUDialect,
                     math::MathDialect, arith::ArithDialect, index::IndexDialect,

참고: C++ 코드를 변경하면 triton 파이썬 패키지를 다시 빌드해야 한다: pip install -e python # triton 저장소 내부. 추가 정보(https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source)

짜잔, 터미널에 출력문이 보인다:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
ipdb> n
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(278)compile()
    277     context = ir.context()
--> 278     ir.load_dialects(context)
    279     backend.load_dialects(context)

ipdb> n
==========================================
Loading dialects
==========================================
> /home/ksharma/dev/git/triton/python/triton/compiler/compiler.py(279)compile()
    278     ir.load_dialects(context)
--> 279     backend.load_dialects(context)
    280     codegen_fns = backend.get_codegen_implementation()

컴파일러 백엔드

이제 백엔드 단계에 도달했다. 백엔드는 타깃 하드웨어용 코드를 생성하는 역할을 한다. 여기서는 CUDA 백엔드를 사용한다.

내 머신에서 빌드 후 디렉터리 구조는 다음과 같다:

1
2
3
4
5
6
7
8
9
10
$ ll python/triton/backends 
total 16K
lrwxrwxrwx 1 ksharma ksharma   52 Aug 12 22:55 amd -> /home/ksharma/dev/git/triton/third_party/amd/backend
-rw-rw-r-- 1 ksharma ksharma 2.7K Aug  9 16:27 compiler.py
-rw-rw-r-- 1 ksharma ksharma  977 Aug  9 16:27 driver.py
-rw-rw-r-- 1 ksharma ksharma 1.6K Aug  9 16:27 __init__.py
lrwxrwxrwx 1 ksharma ksharma   55 Aug 12 22:55 nvidia -> /home/ksharma/dev/git/triton/third_party/nvidia/backend
drwxrwxr-x 2 ksharma ksharma 4.0K Aug  9 17:42 __pycache__

다음 두 줄을 보자:

1
2
backend.load_dialects(context)
codegen_fns = backend.get_codegen_implementation()

load_dialects는 NVIDIA 백엔드의 여기(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/third_party/nvidia/backend/compiler.py#L158-L159)로 추적할 수 있다.
get_codegen_implementation은 NVIDIA 백엔드의 여기(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/third_party/nvidia/backend/compiler.py#L149-L156)로 추적할 수 있다.

NVIDIA 백엔드 컴파일러 코드를 더 깊이 보면, CUDA 백엔드의 실제 코드 생성 포인터를 찾을 수 있다.

CUDA 백엔드의 backend.add_stages는 서로 다른 컴파일러 단계를 추가한다. 코드(https://github.com/triton-lang/triton/blob/6a9a0a6474afa20498f3b8ae9a8bbb872cad458b/third_party/nvidia/backend/compiler.py#L346-L351)

1
2
3
4
5
6
7
# NVIDIA 백엔드 compiler.py 중
def add_stages(self, stages, options):
    stages["ttir"] = lambda src, metadata: self.make_ttir(src, metadata, options)
    stages["ttgir"] = lambda src, metadata: self.make_ttgir(src, metadata, options, self.capability)
    stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options, self.capability)
    stages["ptx"] = lambda src, metadata: self.make_ptx(src, metadata, options, self.capability)
    stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)

양파 껍질을 한 겹 더 벗겨 make_ttir 함수를 보자:

1
2
3
4
5
6
7
8
9
10
11
12
13
def make_ttir(mod, metadata, opt):
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    passes.common.add_inliner(pm)
    passes.ttir.add_rewrite_tensor_pointer(pm)
    passes.ttir.add_combine(pm)
    passes.common.add_canonicalizer(pm)
    passes.ttir.add_reorder_broadcast(pm)
    passes.common.add_cse(pm)
    passes.common.add_licm(pm)
    passes.common.add_symbol_dce(pm)
    pm.run(mod)
    return mod

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ git --no-pager diff
diff --git a/python/src/ir.cc b/python/src/ir.cc
index 46095dcc..b4f1aa22 100644
--- a/python/src/ir.cc
+++ b/python/src/ir.cc
@@ -1584,6 +1584,8 @@ void init_triton_ir(py::module &&m) {
              bool haveDiagnostics =
                  ::triton::tools::getBoolEnv("MLIR_ENABLE_DIAGNOSTICS");
              bool haveDump = ::triton::tools::getBoolEnv("MLIR_ENABLE_DUMP");
+             haveDiagnostics = true;
+             haveDump = true;
              std::string funcToDump;
              if (!haveDump) {
                funcToDump = triton::tools::getStrEnv("MLIR_ENABLE_DUMP");

1
2
3
4
5
6
7
8
9
10
11
12
13
14
...
// -----// IR Dump Before SymbolDCE (symbol-dce) ('builtin.module' operation) //----- //
#loc = loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)
module {
  tt.func public @add_kernel(%arg0: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg1: !tt.ptr<f32> loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0), %arg2: !tt.ptr<f32> loc("/home/ksharma/dev/git/trit
on/python/tutorials/01-vector-add.py":28:0), %arg3: i32 loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)) attributes {noinline = false} {
    %c64_i32 = arith.constant 64 : i32 loc(#loc1)
    %0 = tt.get_program_id x : i32 loc(#loc2)

...truncated...

// -----// IR Dump Before ConvertTritonToTritonGPU (convert-triton-to-tritongpu) ('builtin.module' operation) //----- //
#loc = loc("/home/ksharma/dev/git/triton/python/tutorials/01-vector-add.py":28:0)
...

참고: 디버그 정보, IR 등을 출력/덤프하는 것은 공식 triton 가이드를 따르자. 추가 정보(https://github.com/kapilsh/triton/tree/14025786d108596cfd99700caa4f438938c2ceba?tab=readme-ov-file#tips-for-hacking)

Triton C++ -> Python 바인딩

make_ttir 함수를 더 들여다보자. 실제 MLIR 패스가 추가되는 곳이다:

1
passes.common.add_inliner(pm)

passes.common은 init_triton_passes_common(https://github.com/triton-lang/triton/blob/9e955f1454095725dbb7bed96c8112092c02929e/python/src/passes.cc#L26-L34)에 매핑된다.

다른 패스도 유사하게 추가된다.

1
2
3
4
5
6
7
passes.ttir.add_rewrite_tensor_pointer(pm)
passes.ttir.add_combine(pm)
passes.common.add_canonicalizer(pm)
passes.ttir.add_reorder_broadcast(pm)
passes.common.add_cse(pm)
passes.common.add_licm(pm)
passes.common.add_symbol_dce(pm)

init_triton_passes_ttir
init_triton_passes_ttgpuir

나머지 triton 컴파일러 패스는 pythpn/src/passes.cc(https://github.com/triton-lang/triton/blob/9e955f1454095725dbb7bed96c8112092c02929e/python/src/passes.cc#L83-L90)를 참고하라.

컴파일러 패스

커널을 컴파일하기 위해 적용되는 컴파일러 패스 몇 가지를 개별적으로 살펴보자.

인라이너 패스

1
passes.common.add_inliner(pm)

그리고

1
2
3
4
5
void init_triton_passes_common(py::module &&m) {
  ...
  ADD_PASS_WRAPPER_0("add_inliner", createInlinerPass);
  ...
}

Tensor Pointer 재작성 패스

1
passes.ttir.add_rewrite_tensor_pointer(pm)

은 다음에 매핑된다

1
2
3
4
5
6
7
8
9
void init_triton_passes_ttir(py::module &&m) {
  using namespace mlir::triton;
  ...
  ADD_PASS_WRAPPER_0("add_rewrite_tensor_pointer",
                     createRewriteTensorPointerPass);
                     
  ...
}

1
2
3
std::unique_ptr<Pass> triton::createRewriteTensorPointerPass() {
  return std::make_unique<RewriteTensorPointerPass>();
}

요약과 설명에 따르면, 이 패스는 tensor 포인터를 “레거시” 포인터로 다시 쓰는 것처럼 보인다. 하지만 레거시 포인터가 무엇인지는 잘 모르겠다! let summary = "Rewrite load/stores with tensor pointers into legacy load/stores";

다른 패스들

적용되는 다른 패스가 여럿 있으며, 독자가 직접 더 탐구해 보기를 권한다.

Triton 내부 심층 분석 (파트 2)

Git 해시

시작하기

AST 소스 코드 생성

triton.compile 들어가기

컴파일러 백엔드

Triton C++ -> Python 바인딩

컴파일러 패스

인라이너 패스

Tensor Pointer 재작성 패스

다른 패스들

마무리 생각

Git 해시

시작하기

AST 소스 코드 생성

triton.compile 들어가기

컴파일러 백엔드

Triton C++ -> Python 바인딩

컴파일러 패스

인라이너 패스

Tensor Pointer 재작성 패스

다른 패스들

마무리 생각

`triton.compile` 들어가기

`triton.compile` 들어가기