MLIR 파트 3 - Affine 방언과 OpenMP

Image 1: mlir-egglog

MLIR의 affine 방언은 병렬 실행에 적합한 형태로 루프와 배열 접근을 표현하기 위한 추상화를 제공한다. 앞서 보았던 scf 방언과 달리, affine 방언은 고급 루프 변환과 자동 병렬화를 가능하게 한다. 이는 본질적으로 "일급 루프(first class loops)"이며, 우리가 늘 LLVM 위에 계층으로 얹어오던 것들이다.

affine 방언은 자동 병렬화 기회, 내장 벡터화 기능, 정교한 루프 최적화, 배열 계산을 자연스럽게 표현하는 능력, 그리고 OpenMP와의 통합 등 많은 장점을 가진다.

Polyhedral Optimization Model

폴리헤드럴(polyhedral) 모델은 루프 중첩과 그 안의 계산을 다차원 공간에서의 기하학적 객체로 표현하는 수학적 프레임워크다.

핵심적으로, 폴리헤드럴 모델은 루프 중첩 내부의 문장(statement) 실행을 다차원 공간의 점으로 나타낸다. 각 점은 반복 좌표(iteration coordinates)로 정의되는 실행에서의 특정 문장 인스턴스에 대응한다. 이 표현은 복잡한 루프 최적화 문제를 폴리헤드라(polyhedra) 위의 기하 연산으로 바꿈으로써 프로그램 분석 및 변환 기법을 가능하게 한다.

폴리헤드럴 모델의 주요 구성 요소로는 반복 영역(iteration domains), 접근 관계(access relations), 스케줄링 함수(scheduling functions)가 있다. 반복 영역은 루프 중첩이 실행하는 모든 반복의 집합을 정의하며, 아핀 제약으로 경계 지어진 정수 점들의 집합으로 표현된다. 접근 관계는 반복 점을 각 문장이 접근하는 메모리 위치로 매핑한다. 스케줄링 함수는 반복 점들의 실행 순서를 결정하여, 프로그램 의미를 보존하면서 성능 특성을 개선하는 변환을 가능하게 한다.

MLIR은 폴리헤드럴 개념을 일급 시민으로 통합한 affine을 제공한다. 이 하이브리드 설계는 고성능 데이터 병렬 시스템을 위한 고수준 데이터플로 그래프와 타깃 특화 코드의 표현, 분석, 변환을 최적화한다.

우리가 이해해야 할 핵심 개념은 세 가지다:

Affine maps: 차원과 심볼을 결과로 매핑하는 다차원 준선형(quasi-linear) 함수. 예를 들어 (d0, d1, d2, s0) → (d0 + d1, s0 * d2)는 차원 인자 d0, d1, d2와 심볼 인자 s0를 두 개의 결과로 매핑하는 2차원 아핀 맵을 나타낸다.
Integer sets: 아핀 부등식과 등식으로 표현되는 차원과 심볼에 대한 제약. 예를 들어 (i)[N, M] : (i >= 0, -i + N >= 0, N - 5 == 0, -i + M + 1 >= 0)는 0 <= i < N, N = 5, 그리고 i <= M + 1을 만족하는 i 값들의 집합을 나타낸다.
Affine operations: affine.for, affine.if, affine.parallel처럼 아핀 맵과 정수 집합을 활용해 아핀 제약을 가진 루프 중첩과 조건문을 표현하는 연산.

이론을 먼저 잔뜩 깔기보다는, 그냥 예제를 몇 개 보자.

Matrix Multiplication

두 행렬 A i j 와 B j k 의 고전적인 행렬곱은 다음과 같다:

C i k=(A B)i k=∑j=1 N A i j B j k

이를 MLIR에서 명시적인 루프 중첩으로 표현하면 다음과 같다:

func.func @matmul(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index

  %M = memref.dim %A, %c0 : memref<?x?xf32>
  %N = memref.dim %B, %c1 : memref<?x?xf32>
  %K = memref.dim %A, %c1 : memref<?x?xf32>

  // Sequential implementation
  affine.for %i = 0 to %M {
    affine.for %j = 0 to %N {
      affine.for %k = 0 to %K {
        %a = affine.load %A[%i, %k] : memref<?x?xf32>
        %b = affine.load %B[%k, %j] : memref<?x?xf32>
        %c = affine.load %C[%i, %j] : memref<?x?xf32>
        %prod = arith.mulf %a, %b : f32
        %sum = arith.addf %c, %prod : f32
        affine.store %sum, %C[%i, %j] : memref<?x?xf32>
      }
    }
  }
  return
}

%i와 %j에 대한 두 개의 루프는 서로 다른 반복들 사이에 의존성이 없다는 점에 주목하라. -affine-parallelize 패스를 사용하면 다음과 같이 변환할 수 있다:

func.func @matmul_parallel(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index

  %M = memref.dim %A, %c0 : memref<?x?xf32>
  %N = memref.dim %B, %c1 : memref<?x?xf32>
  %K = memref.dim %A, %c1 : memref<?x?xf32>

  // Parallel implementation
  affine.parallel (%i, %j) = (0, 0) to (%M, %N) {
    affine.for %k = 0 to %K {
      %a = affine.load %A[%i, %k] : memref<?x?xf32>
      %b = affine.load %B[%k, %j] : memref<?x?xf32>
      %c = affine.load %C[%i, %j] : memref<?x?xf32>
      %prod = arith.mulf %a, %b : f32
      %sum = arith.addf %c, %prod : f32
      affine.store %sum, %C[%i, %j] : memref<?x?xf32>
    }
  }
  return
}

affine.parallel 연산은 루프의 여러 반복이 서로 독립적으로 실행될 수 있으며 잠재적으로 동시에 실행될 수 있음을 표현하는 방법을 제공한다. 이는 "이 계산들은 어떤 순서로든, 혹은 한꺼번에 수행될 수 있다"고 말하는 것과 비슷하다. 서로 소통할 필요가 없는 여러 작업자에게 일을 나누는 것과 유사하다.

좀 더 정확히 말하면 affine.parallel은 다음을 만족하는 중첩 루프들의 집합을 정의한다:

각 반복은 다른 모든 반복과 독립적이다
반복들은 어떤 순서로든 또는 동시에 실행될 수 있다
연산은 리덕션(값을 합산하는 것 등)을 통해 결과를 생성하며, 결과를 결합하는 순서가 중요하지 않다

기술적으로는 "초직사각형 병렬 밴드(hyper-rectangular parallel band)"를 생성하는데, 이는 여러 개의 중첩 병렬 루프를 표현할 수 있음을 뜻한다. 각 루프는 자신만의 유도 변수(루프 카운터), 하한/상한, 스텝 크기(각 반복에서 얼마나 증가하는지), 그리고 리덕션 연산(서로 다른 반복의 결과를 어떻게 결합할지)을 가진다.

병렬 영역의 본문은 affine.yield로 끝나야 하며, 이는 덧셈이나 곱셈 같은 연산을 사용해 서로 다른 반복의 결과를 어떻게 결합할지 지정한다. 0번 실행되는 루프의 경우 결과는 해당 연산의 항등값이 된다(덧셈은 0, 곱셈은 1).

Affine Maps (`affine.apply`)

아핀 맵은 affine 방언의 핵심 구성 요소로, 다차원 아핀 변환을 나타낸다. 이는 루프 인덱스를 메모리 접근 패턴으로 매핑하는 방식을 기술할 수 있게 해 주며, 고성능 컴퓨팅에서 메모리 레이아웃과 접근 패턴을 최적화하는 데 필수적이다.

affine.apply의 다양한 사용법을 보여주는 포괄적인 예제를 보자:

// Define some reusable affine maps
#tile_map = affine_map<(d0) -> (d0 floordiv 32)>
#offset_map = affine_map<(d0)[s0] -> (d0 + s0)>
#complex_map = affine_map<(d0, d1)[s0] -> (d0 * 2 + d1 floordiv 4 + s0)>

func.func @affine_apply_examples() {
  // Create some test indices
  %c0 = arith.constant 0 : index
  %c42 = arith.constant 42 : index
  %c128 = arith.constant 128 : index
  
  // Example 1: Simple tiling calculation
  // Calculate which tile a given index belongs to (tile size = 32)
  %tile_idx = affine.apply #tile_map(%c42)
  
  // Example 2: Inline affine map for offset calculation
  %offset = affine.apply affine_map<(i) -> (i + 10)>(%c42)
  
  // Example 3: Using a symbol parameter
  %shifted = affine.apply #offset_map(%c42)[%c128]
  
  // Example 4: Multiple dimensions and a symbol
  %complex = affine.apply #complex_map(%c42, %c128)[%c0]
  
  // Example 5: Composition of affine applies
  %temp = affine.apply affine_map<(i) -> (i * 2)>(%c42)
  %final = affine.apply affine_map<(i) -> (i + 5)>(%temp)
  
  return
}

// Example showing how affine.apply can be used in a practical loop context
func.func @tiled_loop(%arg0: memref<256xf32>) {
  %c0 = arith.constant 0 : index
  %c256 = arith.constant 256 : index
  %c32 = arith.constant 32 : index

  // Outer loop iterates over tiles
  affine.for %i = 0 to 256 step 32 {
    // Inner loop processes elements within the tile
    affine.for %j = 0 to 32 {
      %idx = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%i, %j)
      %val = memref.load %arg0[%idx] : memref<256xf32>
      // Process value...
      memref.store %val, %arg0[%idx] : memref<256xf32>
    }
  }
  return
}

tiled_loop 함수는 affine.apply를 실제 상황에서 사용해 배열의 타일드 처리(tiled processing)를 구현하는 방법을 보여준다. 바깥쪽 루프는 타일 크기만큼 스텝을 증가시키며, affine.apply는 각 타일을 처리하는 안쪽 루프의 경계를 계산하는 데 사용된다.

아핀 맵에 대한 몇 가지 핵심 포인트는 다음과 같다. floordiv는 내림으로 반올림하는 정수 나눗셈을 수행하며, (d0)[s0] 문법은 차원 파라미터 d0 하나와 심볼 파라미터 s0 하나를 나타낸다. 또한 아핀 맵은 덧셈, 뺄셈, 상수 곱셈, 상수 나눗셈, 상수 모듈로, 그리고 상수에 대한 floordiv/ceildiv만 포함할 수 있다. 결과 타입은 항상 index다.

Affine Optimization Passes

MLIR 옵티마이저는 affine 방언을 위해 특별히 설계된 풍부한 패스 집합을 제공한다. 이 패스들은 다양한 최적화 기법을 통해 성능을 개선하도록 코드를 자동으로 변환할 수 있다. 가장 중요한 것들을 살펴보자:

루프 불변 코드 이동(Loop invariant code motion): 루프 인덱스에 의존하지 않는(루프 불변) 코드를 루프 밖으로 이동시키는 변환이다. 이는 루프 인덱스에 의존하지 않는 코드를 바깥 루프로 옮김으로써 병렬성을 드러내는 데 도움이 될 수 있다.

// Before
func.func @before(%A: memref<10xf32>, %B: memref<10xf32>) {
  %c0 = arith.constant 0 : index
  %c10 = arith.constant 10 : index
  %c1 = arith.constant 1 : index
  
  affine.for %i = 0 to 10 {
    %x = arith.constant 42.0 : f32  // This is loop invariant!
    %v = memref.load %A[%i] : memref<10xf32>
    %sum = arith.addf %v, %x : f32
    memref.store %sum, %B[%i] : memref<10xf32>
  }
  return
}

// After
func.func @after(%A: memref<10xf32>, %B: memref<10xf32>) {
  %c0 = arith.constant 0 : index
  %c10 = arith.constant 10 : index
  %c1 = arith.constant 1 : index
  
  %x = arith.constant 42.0 : f32  // Moved outside the loop
  affine.for %i = 0 to 10 {
    %v = memref.load %A[%i] : memref<10xf32>
    %sum = arith.addf %v, %x : f32
    memref.store %sum, %B[%i] : memref<10xf32>
  }
  return
}

루프 스큐잉(Loop skewing): 다른 루프 인덱스의 값에 따라 반복들을 이동시키며 반복 공간(iteration space)을 바꾸는 변환이다. 의존성을 변환하여 병렬성을 드러내는 데 도움이 될 수 있다.

// Before - Dependencies prevent parallelization
func.func @before(%A: memref<10x10xf32>) {
  affine.for %i = 1 to 10 {
    affine.for %j = 1 to 10 {
      %v1 = affine.load %A[%i-1, %j] : memref<10x10xf32>
      %v2 = affine.load %A[%i, %j-1] : memref<10x10xf32>
      %sum = arith.addf %v1, %v2 : f32
      affine.store %sum, %A[%i, %j] : memref<10x10xf32>
    }
  }
}

// After skewing - Inner loop can now be parallelized
func.func @after(%A: memref<10x10xf32>) {
  affine.for %i = 1 to 10 {
    affine.for %j = %i to %i + 9 {
      %i_new = affine.apply affine_map<(d0, d1) -> (d0)>(%i, %j)
      %j_new = affine.apply affine_map<(d0, d1) -> (d1 - d0)>(%i, %j)
      %v1 = affine.load %A[%i_new-1, %j_new] : memref<10x10xf32>
      %v2 = affine.load %A[%i_new, %j_new-1] : memref<10x10xf32>
      %sum = arith.addf %v1, %v2 : f32
      affine.store %sum, %A[%i_new, %j_new] : memref<10x10xf32>
    }
  }
}

루프 교환(Loop interchange): 루프의 중첩 순서를 바꾸는 변환으로, 캐시 지역성을 개선하고 병렬성을 드러낼 수 있다.

// Before - Column-major access pattern
func.func @before(%A: memref<16x16xf32>, %B: memref<16x16xf32>) {
  affine.for %i = 0 to 16 {
    affine.for %j = 0 to 16 {
      %v = affine.load %A[%j, %i] : memref<16x16xf32>
      affine.store %v, %B[%j, %i] : memref<16x16xf32>
    }
  }
}

// After - Row-major access pattern (better locality)
func.func @after(%A: memref<16x16xf32>, %B: memref<16x16xf32>) {
  affine.for %j = 0 to 16 {
    affine.for %i = 0 to 16 {
      %v = affine.load %A[%j, %i] : memref<16x16xf32>
      affine.store %v, %B[%j, %i] : memref<16x16xf32>
    }
  }
}

루프 퓨전(Loop fusion): 인접한 루프들을 결합해 지역성을 개선하고 루프 오버헤드를 줄이는 변환이다.

// Before fusion
func.func @before(%A: memref<10xf32>, %B: memref<10xf32>, %C: memref<10xf32>) {
  affine.for %i = 0 to 10 {
    %v1 = affine.load %A[%i] : memref<10xf32>
    %v2 = arith.mulf %v1, %v1 : f32
    affine.store %v2, %B[%i] : memref<10xf32>
  }
  
  affine.for %i = 0 to 10 {
    %v3 = affine.load %B[%i] : memref<10xf32>
    %v4 = arith.addf %v3, %v3 : f32
    affine.store %v4, %C[%i] : memref<10xf32>
  }
}

// After fusion
func.func @after(%A: memref<10xf32>, %B: memref<10xf32>, %C: memref<10xf32>) {
  affine.for %i = 0 to 10 {
    %v1 = affine.load %A[%i] : memref<10xf32>
    %v2 = arith.mulf %v1, %v1 : f32
    affine.store %v2, %B[%i] : memref<10xf32>
    %v3 = arith.addf %v2, %v2 : f32
    affine.store %v3, %C[%i] : memref<10xf32>
  }
}

루프 타일링(Loop tiling): 루프를 더 작은 덩어리로 나누어 캐시 지역성을 개선하는 변환이다.

// Before tiling
func.func @before(%A: memref<32x32xf32>) {
  affine.for %i = 0 to 32 {
    affine.for %j = 0 to 32 {
      %v = affine.load %A[%i, %j] : memref<32x32xf32>
      // ... computation ...
      affine.store %v, %A[%i, %j] : memref<32x32xf32>
    }
  }
}

// After tiling (tile size 8x8)
func.func @after(%A: memref<32x32xf32>) {
  affine.for %ti = 0 to 32 step 8 {
    affine.for %tj = 0 to 32 step 8 {
      affine.for %i = #map(%ti) to #map(%ti + 8) {
        affine.for %j = #map(%tj) to #map(%tj + 8) {
          %v = affine.load %A[%i, %j] : memref<32x32xf32>
          // ... computation ...
          affine.store %v, %A[%i, %j] : memref<32x32xf32>
        }
      }
    }
  }
}

이러한 최적화는 mlir-opt 도구에서 사용할 수 있다:

-affine-loop-coalescing - 연속된 루프들을 하나의 루프로 병합한다.
-affine-loop-fusion - 동일한 경계와 스텝을 갖는 루프들을 병합한다.
-affine-loop-invariant-code-motion - 루프 불변 코드를 루프 밖으로 이동한다.
-affine-loop-normalize - 루프 경계와 스텝을 정규화한다.
-affine-loop-tile - 루프를 타일링한다.
-affine-loop-unroll - 루프를 언롤한다.
-affine-loop-unroll-jam - 루프를 언롤하고 서로 합쳐(jam) 배치한다.
-affine-parallelize - 루프를 병렬화한다.
-affine-pipeline-data-transfer - 메모리 계층의 명시적으로 관리되는 레벨들 사이에서 비블로킹 데이터 전송을 파이프라이닝한다
-affine-scalrep - 스토어를 로드로 전달하고 중복 로드를 제거함으로써 affine memref 접근을 스칼라로 대체한다
-affine-simplify-structures - 맵/셋의 아핀 표현식을 단순화하고 memref를 정규화한다
-affine-super-vectorize - 타깃 독립적인 n-D 벡터 추상화로 벡터화한다

또한 나중에 사용할 -convert-affine-for-to-gpu도 있는데, 이는 affine 루프를 여러 GPU 아키텍처를 대상으로 할 수 있는 GPU 커널로 변환한다.

Convolution Kernels

이제 affine 방언에서 사용할 수 있는 기본 최적화를 이해했으니, 더 실용적인 예제로 2D 컨볼루션을 구현해 보자. 이 연산은 이미지 처리와 딥러닝에서 기본이며, 특징 추출과 패턴 인식에 사용된다.

2D 컨볼루션의 수학적 공식은 다음과 같다:

(A∗K)i j=∑u∑v K u v A(i+u)(j+v)

module {
  func.func @conv_2d(%input: memref<128x128xf32>, %filter: memref<16x16xf32>, %output: memref<113x113xf32>) {
    // Loop over the output matrix dimensions (113x113)
    affine.for %i = 0 to 113 {
      affine.for %j = 0 to 113 {
        // Use affine.parallel to accumulate values into %acc using iter_args
        %zero = arith.constant 0.0 : f32
        %acc = affine.for %fi = 0 to 16 iter_args(%acc = %zero) -> (f32) {
          %acc_inner = affine.for %fj = 0 to 16 iter_args(%acc_inner = %acc) -> (f32) {
            // Load filter value
            %filter_val = affine.load %filter[%fi, %fj] : memref<16x16xf32>

            // Load corresponding input value from the input matrix
            %input_val = affine.load %input[%i + %fi, %j + %fj] : memref<128x128xf32>

            // Multiply input value with filter value
            %prod = arith.mulf %input_val, %filter_val : f32

            // Add product to the accumulator
            %new_acc = arith.addf %acc_inner, %prod : f32
            affine.yield %new_acc : f32
          }
          affine.yield %acc_inner : f32
        }

        // Store the accumulated result in the output matrix
        affine.store %acc, %output[%i, %j] : memref<113x113xf32>
      }
    }
    return
  }
}

affine 연산을 제대로 최적화하고 병렬화하려면 특정한 패스 순서가 필요하다:

mlir-opt conv_2d.mlir \
  --affine-loop-normalize \
  --affine-parallelize \
  -lower-affine \
  --convert-scf-to-cf \
  --convert-cf-to-llvm \
  --convert-math-to-llvm \
  --convert-arith-to-llvm \
  --finalize-memref-to-llvm \
  --convert-func-to-llvm \
  --reconcile-unrealized-casts \
  -o conv_2d_opt.mlir

# Translate to LLVM IR
mlir-translate conv_2d_opt.mlir -mlir-to-llvmir -o conv_2d.ll

# Compile to object file
llc -filetype=obj --relocation-model=pic conv_2d.ll -o conv_2d.o

# Compile to shared library
clang -shared -fopenmp -o libconv2d.dylib conv_2d.o

여기서 -lower-affine 패스는 affine 방언을 scf 및 arith 방언의 조합으로 로워링한다.

Python에서 병렬화된 코드를 사용하기 위해, NumPy와 MLIR MemRef 디스크립터 사이의 2D 배열 변환을 처리하는 헬퍼 모듈을 만들었다. 이 모듈은 NumPy 배열을 변환하고 컴파일된 MLIR 코드를 실행하기 위한 유틸리티를 제공한다:

import ctypes
import numpy as np
from ctypes import c_void_p, c_longlong, Structure

class MemRef2DDescriptor(Structure):
    """Structure matching MLIR's 2D MemRef descriptor"""

    _fields_ = [
        ("allocated", c_void_p),  # Allocated pointer
        ("aligned", c_void_p),  # Aligned pointer
        ("offset", c_longlong),  # Offset in elements
        ("shape", c_longlong * 2),  # Array shape (2D)
        ("stride", c_longlong * 2),  # Strides in elements
    ]

def numpy_to_memref2d(arr):
    """Convert a 2D NumPy array to a MemRef descriptor"""
    if not arr.flags["C_CONTIGUOUS"]:
        arr = np.ascontiguousarray(arr)

    desc = MemRef2DDescriptor()
    desc.allocated = arr.ctypes.data_as(c_void_p)
    desc.aligned = desc.allocated
    desc.offset = 0
    desc.shape[0] = arr.shape[0]
    desc.shape[1] = arr.shape[1]
    desc.stride[0] = arr.strides[0] // arr.itemsize
    desc.stride[1] = arr.strides[1] // arr.itemsize

    return desc

def run_conv2d():
    """Run 2D convolution using MLIR compiled module"""
    # Create input arrays
    input_matrix = np.ones((10, 10), dtype=np.float32)
    conv_filter = np.arange(9, dtype=np.float32).reshape(3, 3)
    result = np.zeros((8, 8), dtype=np.float32)

    # Load compiled module
    module = ctypes.CDLL("./libconv2d.dylib")

    # Prepare MemRef descriptors
    input_memref = numpy_to_memref2d(input_matrix)
    filter_memref = numpy_to_memref2d(conv_filter)
    result_memref = numpy_to_memref2d(result)

    # Set function argument types
    module.conv_2d.argtypes = [
        *[type(x) for x in input_memref._fields_],
        *[type(x) for x in filter_memref._fields_],
        *[type(x) for x in result_memref._fields_],
    ]

    # Call the function
    module.conv_2d(
        *[getattr(input_memref, field[0]) for field in input_memref._fields_],
        *[getattr(filter_memref, field[0]) for field in filter_memref._fields_],
        *[getattr(result_memref, field[0]) for field in result_memref._fields_]
    )

    return result

if __name__ == "__main__":
    result = run_conv2d()
    print(result)

OpenMP

affine 방언은 병렬 실행을 위한 추상화를 제공하지만, 때로는 병렬화에 대해 더 직접적인 제어를 원한다. 이때 OpenMP가 등장한다. OpenMP는 지시문(directive) 기반 모델로, 병렬 실행을 명시적으로 제어할 수 있게 해 주며, 고성능 컴퓨팅에서 널리 쓰인다.

OpenMP 병렬화를 보여주는 C의 간단한 예제를 보자:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int kernel(float *input, float *output) {
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 10; i++) {
            output[i] = input[i] * 2.0f;
        }
    }
}

int main() {
    float input[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    float output[10];
    kernel(input, output);
    for (int i = 0; i < 10; i++) {
        printf("%f ", output[i]);
    }
    return 0;
}

OpenMP 지원(clang -fopenmp)으로 컴파일하면 이 프로그램은 여러 스레드에 걸쳐 루프를 병렬화한다. 동일한 계산은 MLIR의 OpenMP 방언으로도 표현할 수 있다.

병렬 영역(omp.parallel)

OpenMP는 병렬 영역(parallel regions)이라는 개념으로 병렬 실행을 조직한다. MLIR에서는 omp.parallel 구성으로 이를 표현하며, 동시에 코드를 실행할 수 있는 스레드 팀(team of threads)을 만든다. 이는 affine 방언에서 보았던 자동 병렬화에 비해 병렬화를 더 명시적으로 제어하는 방법을 제공한다.

이 영역은 모든 OpenMP 영역에 필요한 omp.terminator 연산으로 종료된다.

작업 공유 루프(omp.wsloop)

병렬 루프를 위해 OpenMP 방언은 omp.wsloop 연산을 제공한다. 이는 루프 반복들이 현재 컨텍스트의 스레드들에 의해 병렬로 실행됨을 지정한다. 작업 공유 루프는 일반적으로 병렬 영역 내부에서 사용되어, 사용 가능한 스레드들에 반복을 분배한다:

omp.wsloop {
  omp.loop_nest (%i1, %i2) : index = (%c0, %c0) to (%c10, %c10) step (%c1, %c1) {
    %a = load %arrA[%i1, %i2] : memref<?x?xf32>
    %b = load %arrB[%i1, %i2] : memref<?x?xf32>
    %sum = arith.addf %a, %b : f32
    store %sum, %arrC[%i1, %i2] : memref<?x?xf32>
    omp.yield
  }
}

이 예제는 두 행렬을 더하는 중첩 루프를 병렬화하는 모습을 보여준다. omp.wsloop 안의 omp.loop_nest는 반복 변수, 경계, 스텝을 포함한 중첩 루프 구조를 나타낸다.

배리어를 통한 동기화(omp.barrier)

스레드 동기화를 위해 OpenMP 방언은 omp.barrier 연산을 제공하며, 이는 등장하는 지점에 명시적 배리어를 지정한다. 이를 통해 팀의 모든 스레드가 배리어에 도달하기 전에는 어떤 스레드도 다음으로 진행하지 않도록 보장한다:

// Code executed by all threads
omp.barrier
// Code executed after all threads reach the barrier

omp.barrier 연산은 입력이나 출력이 없는 단순 연산이며, 어셈블리 형식은 이름과 (있다면) 속성만 포함한다.

`omp` 방언

위의 병렬 루프 C 코드를 MLIR로 옮겨 보자:

func.func private @kernel(%input: memref<10xf32>, %output: memref<10xf32>) {
  %loop_ub = llvm.mlir.constant(9 : i32) : i32
  %loop_lb = llvm.mlir.constant(0 : i32) : i32
  %loop_step = llvm.mlir.constant(1 : i32) : i32

  omp.parallel {
      omp.wsloop {
        omp.loop_nest (%i) : i32 = (%loop_lb) to (%loop_ub) inclusive step (%loop_step) {
          %ix = arith.index_cast %i : i32 to index
          %input_val = memref.load %input[%ix] : memref<10xf32>
          %two = arith.constant 2.0 : f32
          %result = arith.mulf %input_val, %two : f32
          memref.store %result, %output[%ix] : memref<10xf32>
          omp.yield
        }
      }
    omp.barrier
    omp.terminator
  }

  return
}

그리고 정적으로 정의된 입력에 대해 커널을 호출하는 main 함수를 만들 수 있다:

memref.global constant @input : memref<10xf32> = dense<[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]>

// Define external references to printf and kernel
llvm.func @printf(!llvm.ptr, ...) -> i32
func.func private @kernel(%input: memref<10xf32>, %output: memref<10xf32>) -> ()

func.func @main() {
  %input = memref.get_global @input : memref<10xf32>
  %output = memref.alloc() : memref<10xf32>

  func.call @kernel(%input, %output) : (memref<10xf32>, memref<10xf32>) -> ()

  %lb = index.constant 0
  %ub = index.constant 10 
  %step = index.constant 1

  %fs = llvm.mlir.addressof @fmt : !llvm.ptr

  scf.for %iv = %lb to %ub step %step {
    %el = memref.load %output[%iv] : memref<10xf32>
    llvm.call @printf(%fs, %el) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, f32) -> i32
  }

  return
}

// Define a constant string for the format specifier
llvm.mlir.global private constant @fmt("%f\0A\00") {addr_space = 0 : i32}

덧붙여서, MLIR의 가변 인자(variadic) 함수는 C와 유사하게 유연한 인자 전달을 가능하게 한다. 우리가 사용할 흔한 패턴은 가변 개수의 인자를 받을 수 있는 printf 같은 함수를 호출하는 것이다. main 함수에서는 (가변 인자 함수인) printf를 호출해 결과를 출력한다. 아래는 C에서의 printf("%f\n", el)과 동등하며, MLIR에서 printf를 사용하는 방법을 보여준다.

// Define external printf function from stdio.h
llvm.func @printf(!llvm.ptr, ...) -> i32

// Define a constant string for the format specifier
llvm.mlir.global private constant @fmt("%f\0A\00") {addr_space = 0 : i32}

// Call the variadic printf function with the format specifier and type specialized arguments
%fs = llvm.mlir.addressof @fmt : !llvm.ptr
%el = llvm.mlir.constant(1.0 : f32) : f32
llvm.call @printf(%fs, %el) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, f32) -> i32

병렬 루프와 main 함수를 포함한 전체 예제는 다음과 같다:

memref.global constant @input : memref<10xf32> = dense<[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]>

llvm.func @printf(!llvm.ptr, ...) -> i32
llvm.mlir.global private constant @fmt("%f\0A\00") {addr_space = 0 : i32}

func.func private @kernel(%input: memref<10xf32>, %output: memref<10xf32>) {
  %loop_ub = llvm.mlir.constant(9 : i32) : i32
  %loop_lb = llvm.mlir.constant(0 : i32) : i32
  %loop_step = llvm.mlir.constant(1 : i32) : i32

  omp.parallel {
      omp.wsloop {
        omp.loop_nest (%i) : i32 = (%loop_lb) to (%loop_ub) inclusive step (%loop_step) {
          %ix = arith.index_cast %i : i32 to index
          %input_val = memref.load %input[%ix] : memref<10xf32>
          %two = arith.constant 2.0 : f32
          %result = arith.mulf %input_val, %two : f32
          memref.store %result, %output[%ix] : memref<10xf32>
          omp.yield
        }
      }
    omp.barrier
    omp.terminator
  }

  return
}

func.func private @main() {
  %input = memref.get_global @input : memref<10xf32>
  %output = memref.alloc() : memref<10xf32>

  call @kernel(%input, %output) : (memref<10xf32>, memref<10xf32>) -> ()

  %lb = index.constant 0
  %ub = index.constant 10 
  %step = index.constant 1

  %fs = llvm.mlir.addressof @fmt : !llvm.ptr

  scf.for %iv = %lb to %ub step %step {
    %el = memref.load %output[%iv] : memref<10xf32>
    llvm.call @printf(%fs, %el) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, f32) -> i32
  }

  return
}

SCF에서 OpenMP로의 변환 패스

-convert-scf-to-openmp 패스는 SCF 방언의 병렬 루프를 동등한 OpenMP 연산으로 자동 변환한다. 이 패스는 OpenMP 실행을 대상으로 하고 싶지만 OpenMP 방언 코드를 직접 작성하고 싶지 않을 때 특히 유용하다.

예를 들어 다음과 같은 SCF 병렬 루프는:

scf.parallel (%i) = (%c0) to (%c100) step (%c1) {
  %val = memref.load %A[%i] : memref<100xf32>
  %result = arith.addf %val, %val : f32
  memref.store %result, %B[%i] : memref<100xf32>
  scf.yield
}

OpenMP 연산으로 변환된다:

omp.parallel {
  omp.wsloop for (%i) : i32 = (%c0) to (%c100) step (%c1) {
    %val = memref.load %A[%i] : memref<100xf32>
    %result = arith.addf %val, %val : f32
    memref.store %result, %B[%i] : memref<100xf32>
    omp.yield
  }
  omp.terminator
}

이 패스는 여러 중요한 변환을 처리하는데, scf.parallel 연산을 omp.wsloop를 포함하는 omp.parallel 영역으로 바꾸면서 루프 경계와 스텝 크기를 보존한다. 또한 SCF의 리덕션 연산은 OpenMP 리덕션으로 매핑되며, 중첩 병렬 루프도 적절히 관리된다. 이 변환은 보통 더 큰 로워링 파이프라인의 일부로 사용되며, 대개 OpenMP 런타임 지원과 함께 실행할 수 있는 LLVM IR을 생성하기 위해 -convert-openmp-to-llvm가 뒤따른다.

External Resources

Image 1: mlir-egglog

Polyhedral Optimization Model

폴리헤드럴(polyhedral) 모델은 루프 중첩과 그 안의 계산을 다차원 공간에서의 기하학적 객체로 표현하는 수학적 프레임워크다.

우리가 이해해야 할 핵심 개념은 세 가지다:

Affine maps: 차원과 심볼을 결과로 매핑하는 다차원 준선형(quasi-linear) 함수. 예를 들어 (d0, d1, d2, s0) → (d0 + d1, s0 * d2)는 차원 인자 d0, d1, d2와 심볼 인자 s0를 두 개의 결과로 매핑하는 2차원 아핀 맵을 나타낸다.
Integer sets: 아핀 부등식과 등식으로 표현되는 차원과 심볼에 대한 제약. 예를 들어 (i)[N, M] : (i >= 0, -i + N >= 0, N - 5 == 0, -i + M + 1 >= 0)는 0 <= i < N, N = 5, 그리고 i <= M + 1을 만족하는 i 값들의 집합을 나타낸다.
Affine operations: affine.for, affine.if, affine.parallel처럼 아핀 맵과 정수 집합을 활용해 아핀 제약을 가진 루프 중첩과 조건문을 표현하는 연산.

이론을 먼저 잔뜩 깔기보다는, 그냥 예제를 몇 개 보자.

Matrix Multiplication

두 행렬 A i j 와 B j k 의 고전적인 행렬곱은 다음과 같다:

C i k=(A B)i k=∑j=1 N A i j B j k

이를 MLIR에서 명시적인 루프 중첩으로 표현하면 다음과 같다:

func.func @matmul(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index

  %M = memref.dim %A, %c0 : memref<?x?xf32>
  %N = memref.dim %B, %c1 : memref<?x?xf32>
  %K = memref.dim %A, %c1 : memref<?x?xf32>

  // Sequential implementation
  affine.for %i = 0 to %M {
    affine.for %j = 0 to %N {
      affine.for %k = 0 to %K {
        %a = affine.load %A[%i, %k] : memref<?x?xf32>
        %b = affine.load %B[%k, %j] : memref<?x?xf32>
        %c = affine.load %C[%i, %j] : memref<?x?xf32>
        %prod = arith.mulf %a, %b : f32
        %sum = arith.addf %c, %prod : f32
        affine.store %sum, %C[%i, %j] : memref<?x?xf32>
      }
    }
  }
  return
}

func.func @matmul_parallel(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index

  %M = memref.dim %A, %c0 : memref<?x?xf32>
  %N = memref.dim %B, %c1 : memref<?x?xf32>
  %K = memref.dim %A, %c1 : memref<?x?xf32>

  // Parallel implementation
  affine.parallel (%i, %j) = (0, 0) to (%M, %N) {
    affine.for %k = 0 to %K {
      %a = affine.load %A[%i, %k] : memref<?x?xf32>
      %b = affine.load %B[%k, %j] : memref<?x?xf32>
      %c = affine.load %C[%i, %j] : memref<?x?xf32>
      %prod = arith.mulf %a, %b : f32
      %sum = arith.addf %c, %prod : f32
      affine.store %sum, %C[%i, %j] : memref<?x?xf32>
    }
  }
  return
}

좀 더 정확히 말하면 affine.parallel은 다음을 만족하는 중첩 루프들의 집합을 정의한다:

각 반복은 다른 모든 반복과 독립적이다
반복들은 어떤 순서로든 또는 동시에 실행될 수 있다
연산은 리덕션(값을 합산하는 것 등)을 통해 결과를 생성하며, 결과를 결합하는 순서가 중요하지 않다

Affine Maps (`affine.apply`)

affine.apply의 다양한 사용법을 보여주는 포괄적인 예제를 보자:

// Define some reusable affine maps
#tile_map = affine_map<(d0) -> (d0 floordiv 32)>
#offset_map = affine_map<(d0)[s0] -> (d0 + s0)>
#complex_map = affine_map<(d0, d1)[s0] -> (d0 * 2 + d1 floordiv 4 + s0)>

func.func @affine_apply_examples() {
  // Create some test indices
  %c0 = arith.constant 0 : index
  %c42 = arith.constant 42 : index
  %c128 = arith.constant 128 : index
  
  // Example 1: Simple tiling calculation
  // Calculate which tile a given index belongs to (tile size = 32)
  %tile_idx = affine.apply #tile_map(%c42)
  
  // Example 2: Inline affine map for offset calculation
  %offset = affine.apply affine_map<(i) -> (i + 10)>(%c42)
  
  // Example 3: Using a symbol parameter
  %shifted = affine.apply #offset_map(%c42)[%c128]
  
  // Example 4: Multiple dimensions and a symbol
  %complex = affine.apply #complex_map(%c42, %c128)[%c0]
  
  // Example 5: Composition of affine applies
  %temp = affine.apply affine_map<(i) -> (i * 2)>(%c42)
  %final = affine.apply affine_map<(i) -> (i + 5)>(%temp)
  
  return
}

// Example showing how affine.apply can be used in a practical loop context
func.func @tiled_loop(%arg0: memref<256xf32>) {
  %c0 = arith.constant 0 : index
  %c256 = arith.constant 256 : index
  %c32 = arith.constant 32 : index

  // Outer loop iterates over tiles
  affine.for %i = 0 to 256 step 32 {
    // Inner loop processes elements within the tile
    affine.for %j = 0 to 32 {
      %idx = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%i, %j)
      %val = memref.load %arg0[%idx] : memref<256xf32>
      // Process value...
      memref.store %val, %arg0[%idx] : memref<256xf32>
    }
  }
  return
}

Affine Optimization Passes

루프 불변 코드 이동(Loop invariant code motion): 루프 인덱스에 의존하지 않는(루프 불변) 코드를 루프 밖으로 이동시키는 변환이다. 이는 루프 인덱스에 의존하지 않는 코드를 바깥 루프로 옮김으로써 병렬성을 드러내는 데 도움이 될 수 있다.

// Before
func.func @before(%A: memref<10xf32>, %B: memref<10xf32>) {
  %c0 = arith.constant 0 : index
  %c10 = arith.constant 10 : index
  %c1 = arith.constant 1 : index
  
  affine.for %i = 0 to 10 {
    %x = arith.constant 42.0 : f32  // This is loop invariant!
    %v = memref.load %A[%i] : memref<10xf32>
    %sum = arith.addf %v, %x : f32
    memref.store %sum, %B[%i] : memref<10xf32>
  }
  return
}

// After
func.func @after(%A: memref<10xf32>, %B: memref<10xf32>) {
  %c0 = arith.constant 0 : index
  %c10 = arith.constant 10 : index
  %c1 = arith.constant 1 : index
  
  %x = arith.constant 42.0 : f32  // Moved outside the loop
  affine.for %i = 0 to 10 {
    %v = memref.load %A[%i] : memref<10xf32>
    %sum = arith.addf %v, %x : f32
    memref.store %sum, %B[%i] : memref<10xf32>
  }
  return
}

루프 스큐잉(Loop skewing): 다른 루프 인덱스의 값에 따라 반복들을 이동시키며 반복 공간(iteration space)을 바꾸는 변환이다. 의존성을 변환하여 병렬성을 드러내는 데 도움이 될 수 있다.

// Before - Dependencies prevent parallelization
func.func @before(%A: memref<10x10xf32>) {
  affine.for %i = 1 to 10 {
    affine.for %j = 1 to 10 {
      %v1 = affine.load %A[%i-1, %j] : memref<10x10xf32>
      %v2 = affine.load %A[%i, %j-1] : memref<10x10xf32>
      %sum = arith.addf %v1, %v2 : f32
      affine.store %sum, %A[%i, %j] : memref<10x10xf32>
    }
  }
}

// After skewing - Inner loop can now be parallelized
func.func @after(%A: memref<10x10xf32>) {
  affine.for %i = 1 to 10 {
    affine.for %j = %i to %i + 9 {
      %i_new = affine.apply affine_map<(d0, d1) -> (d0)>(%i, %j)
      %j_new = affine.apply affine_map<(d0, d1) -> (d1 - d0)>(%i, %j)
      %v1 = affine.load %A[%i_new-1, %j_new] : memref<10x10xf32>
      %v2 = affine.load %A[%i_new, %j_new-1] : memref<10x10xf32>
      %sum = arith.addf %v1, %v2 : f32
      affine.store %sum, %A[%i_new, %j_new] : memref<10x10xf32>
    }
  }
}

루프 교환(Loop interchange): 루프의 중첩 순서를 바꾸는 변환으로, 캐시 지역성을 개선하고 병렬성을 드러낼 수 있다.

// Before - Column-major access pattern
func.func @before(%A: memref<16x16xf32>, %B: memref<16x16xf32>) {
  affine.for %i = 0 to 16 {
    affine.for %j = 0 to 16 {
      %v = affine.load %A[%j, %i] : memref<16x16xf32>
      affine.store %v, %B[%j, %i] : memref<16x16xf32>
    }
  }
}

// After - Row-major access pattern (better locality)
func.func @after(%A: memref<16x16xf32>, %B: memref<16x16xf32>) {
  affine.for %j = 0 to 16 {
    affine.for %i = 0 to 16 {
      %v = affine.load %A[%j, %i] : memref<16x16xf32>
      affine.store %v, %B[%j, %i] : memref<16x16xf32>
    }
  }
}

루프 퓨전(Loop fusion): 인접한 루프들을 결합해 지역성을 개선하고 루프 오버헤드를 줄이는 변환이다.

// Before fusion
func.func @before(%A: memref<10xf32>, %B: memref<10xf32>, %C: memref<10xf32>) {
  affine.for %i = 0 to 10 {
    %v1 = affine.load %A[%i] : memref<10xf32>
    %v2 = arith.mulf %v1, %v1 : f32
    affine.store %v2, %B[%i] : memref<10xf32>
  }
  
  affine.for %i = 0 to 10 {
    %v3 = affine.load %B[%i] : memref<10xf32>
    %v4 = arith.addf %v3, %v3 : f32
    affine.store %v4, %C[%i] : memref<10xf32>
  }
}

// After fusion
func.func @after(%A: memref<10xf32>, %B: memref<10xf32>, %C: memref<10xf32>) {
  affine.for %i = 0 to 10 {
    %v1 = affine.load %A[%i] : memref<10xf32>
    %v2 = arith.mulf %v1, %v1 : f32
    affine.store %v2, %B[%i] : memref<10xf32>
    %v3 = arith.addf %v2, %v2 : f32
    affine.store %v3, %C[%i] : memref<10xf32>
  }
}

루프 타일링(Loop tiling): 루프를 더 작은 덩어리로 나누어 캐시 지역성을 개선하는 변환이다.

// Before tiling
func.func @before(%A: memref<32x32xf32>) {
  affine.for %i = 0 to 32 {
    affine.for %j = 0 to 32 {
      %v = affine.load %A[%i, %j] : memref<32x32xf32>
      // ... computation ...
      affine.store %v, %A[%i, %j] : memref<32x32xf32>
    }
  }
}

// After tiling (tile size 8x8)
func.func @after(%A: memref<32x32xf32>) {
  affine.for %ti = 0 to 32 step 8 {
    affine.for %tj = 0 to 32 step 8 {
      affine.for %i = #map(%ti) to #map(%ti + 8) {
        affine.for %j = #map(%tj) to #map(%tj + 8) {
          %v = affine.load %A[%i, %j] : memref<32x32xf32>
          // ... computation ...
          affine.store %v, %A[%i, %j] : memref<32x32xf32>
        }
      }
    }
  }
}

이러한 최적화는 mlir-opt 도구에서 사용할 수 있다:

-affine-loop-coalescing - 연속된 루프들을 하나의 루프로 병합한다.
-affine-loop-fusion - 동일한 경계와 스텝을 갖는 루프들을 병합한다.
-affine-loop-invariant-code-motion - 루프 불변 코드를 루프 밖으로 이동한다.
-affine-loop-normalize - 루프 경계와 스텝을 정규화한다.
-affine-loop-tile - 루프를 타일링한다.
-affine-loop-unroll - 루프를 언롤한다.
-affine-loop-unroll-jam - 루프를 언롤하고 서로 합쳐(jam) 배치한다.
-affine-parallelize - 루프를 병렬화한다.
-affine-pipeline-data-transfer - 메모리 계층의 명시적으로 관리되는 레벨들 사이에서 비블로킹 데이터 전송을 파이프라이닝한다
-affine-scalrep - 스토어를 로드로 전달하고 중복 로드를 제거함으로써 affine memref 접근을 스칼라로 대체한다
-affine-simplify-structures - 맵/셋의 아핀 표현식을 단순화하고 memref를 정규화한다
-affine-super-vectorize - 타깃 독립적인 n-D 벡터 추상화로 벡터화한다

또한 나중에 사용할 -convert-affine-for-to-gpu도 있는데, 이는 affine 루프를 여러 GPU 아키텍처를 대상으로 할 수 있는 GPU 커널로 변환한다.

Convolution Kernels

2D 컨볼루션의 수학적 공식은 다음과 같다:

(A∗K)i j=∑u∑v K u v A(i+u)(j+v)

module {
  func.func @conv_2d(%input: memref<128x128xf32>, %filter: memref<16x16xf32>, %output: memref<113x113xf32>) {
    // Loop over the output matrix dimensions (113x113)
    affine.for %i = 0 to 113 {
      affine.for %j = 0 to 113 {
        // Use affine.parallel to accumulate values into %acc using iter_args
        %zero = arith.constant 0.0 : f32
        %acc = affine.for %fi = 0 to 16 iter_args(%acc = %zero) -> (f32) {
          %acc_inner = affine.for %fj = 0 to 16 iter_args(%acc_inner = %acc) -> (f32) {
            // Load filter value
            %filter_val = affine.load %filter[%fi, %fj] : memref<16x16xf32>

            // Load corresponding input value from the input matrix
            %input_val = affine.load %input[%i + %fi, %j + %fj] : memref<128x128xf32>

            // Multiply input value with filter value
            %prod = arith.mulf %input_val, %filter_val : f32

            // Add product to the accumulator
            %new_acc = arith.addf %acc_inner, %prod : f32
            affine.yield %new_acc : f32
          }
          affine.yield %acc_inner : f32
        }

        // Store the accumulated result in the output matrix
        affine.store %acc, %output[%i, %j] : memref<113x113xf32>
      }
    }
    return
  }
}

affine 연산을 제대로 최적화하고 병렬화하려면 특정한 패스 순서가 필요하다:

mlir-opt conv_2d.mlir \
  --affine-loop-normalize \
  --affine-parallelize \
  -lower-affine \
  --convert-scf-to-cf \
  --convert-cf-to-llvm \
  --convert-math-to-llvm \
  --convert-arith-to-llvm \
  --finalize-memref-to-llvm \
  --convert-func-to-llvm \
  --reconcile-unrealized-casts \
  -o conv_2d_opt.mlir

# Translate to LLVM IR
mlir-translate conv_2d_opt.mlir -mlir-to-llvmir -o conv_2d.ll

# Compile to object file
llc -filetype=obj --relocation-model=pic conv_2d.ll -o conv_2d.o

# Compile to shared library
clang -shared -fopenmp -o libconv2d.dylib conv_2d.o

여기서 -lower-affine 패스는 affine 방언을 scf 및 arith 방언의 조합으로 로워링한다.

import ctypes
import numpy as np
from ctypes import c_void_p, c_longlong, Structure

class MemRef2DDescriptor(Structure):
    """Structure matching MLIR's 2D MemRef descriptor"""

    _fields_ = [
        ("allocated", c_void_p),  # Allocated pointer
        ("aligned", c_void_p),  # Aligned pointer
        ("offset", c_longlong),  # Offset in elements
        ("shape", c_longlong * 2),  # Array shape (2D)
        ("stride", c_longlong * 2),  # Strides in elements
    ]

def numpy_to_memref2d(arr):
    """Convert a 2D NumPy array to a MemRef descriptor"""
    if not arr.flags["C_CONTIGUOUS"]:
        arr = np.ascontiguousarray(arr)

    desc = MemRef2DDescriptor()
    desc.allocated = arr.ctypes.data_as(c_void_p)
    desc.aligned = desc.allocated
    desc.offset = 0
    desc.shape[0] = arr.shape[0]
    desc.shape[1] = arr.shape[1]
    desc.stride[0] = arr.strides[0] // arr.itemsize
    desc.stride[1] = arr.strides[1] // arr.itemsize

    return desc

def run_conv2d():
    """Run 2D convolution using MLIR compiled module"""
    # Create input arrays
    input_matrix = np.ones((10, 10), dtype=np.float32)
    conv_filter = np.arange(9, dtype=np.float32).reshape(3, 3)
    result = np.zeros((8, 8), dtype=np.float32)

    # Load compiled module
    module = ctypes.CDLL("./libconv2d.dylib")

    # Prepare MemRef descriptors
    input_memref = numpy_to_memref2d(input_matrix)
    filter_memref = numpy_to_memref2d(conv_filter)
    result_memref = numpy_to_memref2d(result)

    # Set function argument types
    module.conv_2d.argtypes = [
        *[type(x) for x in input_memref._fields_],
        *[type(x) for x in filter_memref._fields_],
        *[type(x) for x in result_memref._fields_],
    ]

    # Call the function
    module.conv_2d(
        *[getattr(input_memref, field[0]) for field in input_memref._fields_],
        *[getattr(filter_memref, field[0]) for field in filter_memref._fields_],
        *[getattr(result_memref, field[0]) for field in result_memref._fields_]
    )

    return result

if __name__ == "__main__":
    result = run_conv2d()
    print(result)

OpenMP

OpenMP 병렬화를 보여주는 C의 간단한 예제를 보자:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int kernel(float *input, float *output) {
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 10; i++) {
            output[i] = input[i] * 2.0f;
        }
    }
}

int main() {
    float input[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    float output[10];
    kernel(input, output);
    for (int i = 0; i < 10; i++) {
        printf("%f ", output[i]);
    }
    return 0;
}

병렬 영역(omp.parallel)

이 영역은 모든 OpenMP 영역에 필요한 omp.terminator 연산으로 종료된다.

작업 공유 루프(omp.wsloop)

omp.wsloop {
  omp.loop_nest (%i1, %i2) : index = (%c0, %c0) to (%c10, %c10) step (%c1, %c1) {
    %a = load %arrA[%i1, %i2] : memref<?x?xf32>
    %b = load %arrB[%i1, %i2] : memref<?x?xf32>
    %sum = arith.addf %a, %b : f32
    store %sum, %arrC[%i1, %i2] : memref<?x?xf32>
    omp.yield
  }
}

배리어를 통한 동기화(omp.barrier)

// Code executed by all threads
omp.barrier
// Code executed after all threads reach the barrier

omp.barrier 연산은 입력이나 출력이 없는 단순 연산이며, 어셈블리 형식은 이름과 (있다면) 속성만 포함한다.

`omp` 방언

위의 병렬 루프 C 코드를 MLIR로 옮겨 보자:

func.func private @kernel(%input: memref<10xf32>, %output: memref<10xf32>) {
  %loop_ub = llvm.mlir.constant(9 : i32) : i32
  %loop_lb = llvm.mlir.constant(0 : i32) : i32
  %loop_step = llvm.mlir.constant(1 : i32) : i32

  omp.parallel {
      omp.wsloop {
        omp.loop_nest (%i) : i32 = (%loop_lb) to (%loop_ub) inclusive step (%loop_step) {
          %ix = arith.index_cast %i : i32 to index
          %input_val = memref.load %input[%ix] : memref<10xf32>
          %two = arith.constant 2.0 : f32
          %result = arith.mulf %input_val, %two : f32
          memref.store %result, %output[%ix] : memref<10xf32>
          omp.yield
        }
      }
    omp.barrier
    omp.terminator
  }

  return
}

그리고 정적으로 정의된 입력에 대해 커널을 호출하는 main 함수를 만들 수 있다:

memref.global constant @input : memref<10xf32> = dense<[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]>

// Define external references to printf and kernel
llvm.func @printf(!llvm.ptr, ...) -> i32
func.func private @kernel(%input: memref<10xf32>, %output: memref<10xf32>) -> ()

func.func @main() {
  %input = memref.get_global @input : memref<10xf32>
  %output = memref.alloc() : memref<10xf32>

  func.call @kernel(%input, %output) : (memref<10xf32>, memref<10xf32>) -> ()

  %lb = index.constant 0
  %ub = index.constant 10 
  %step = index.constant 1

  %fs = llvm.mlir.addressof @fmt : !llvm.ptr

  scf.for %iv = %lb to %ub step %step {
    %el = memref.load %output[%iv] : memref<10xf32>
    llvm.call @printf(%fs, %el) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, f32) -> i32
  }

  return
}

// Define a constant string for the format specifier
llvm.mlir.global private constant @fmt("%f\0A\00") {addr_space = 0 : i32}

// Define external printf function from stdio.h
llvm.func @printf(!llvm.ptr, ...) -> i32

// Define a constant string for the format specifier
llvm.mlir.global private constant @fmt("%f\0A\00") {addr_space = 0 : i32}

// Call the variadic printf function with the format specifier and type specialized arguments
%fs = llvm.mlir.addressof @fmt : !llvm.ptr
%el = llvm.mlir.constant(1.0 : f32) : f32
llvm.call @printf(%fs, %el) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, f32) -> i32

병렬 루프와 main 함수를 포함한 전체 예제는 다음과 같다:

memref.global constant @input : memref<10xf32> = dense<[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]>

llvm.func @printf(!llvm.ptr, ...) -> i32
llvm.mlir.global private constant @fmt("%f\0A\00") {addr_space = 0 : i32}

func.func private @kernel(%input: memref<10xf32>, %output: memref<10xf32>) {
  %loop_ub = llvm.mlir.constant(9 : i32) : i32
  %loop_lb = llvm.mlir.constant(0 : i32) : i32
  %loop_step = llvm.mlir.constant(1 : i32) : i32

  omp.parallel {
      omp.wsloop {
        omp.loop_nest (%i) : i32 = (%loop_lb) to (%loop_ub) inclusive step (%loop_step) {
          %ix = arith.index_cast %i : i32 to index
          %input_val = memref.load %input[%ix] : memref<10xf32>
          %two = arith.constant 2.0 : f32
          %result = arith.mulf %input_val, %two : f32
          memref.store %result, %output[%ix] : memref<10xf32>
          omp.yield
        }
      }
    omp.barrier
    omp.terminator
  }

  return
}

func.func private @main() {
  %input = memref.get_global @input : memref<10xf32>
  %output = memref.alloc() : memref<10xf32>

  call @kernel(%input, %output) : (memref<10xf32>, memref<10xf32>) -> ()

  %lb = index.constant 0
  %ub = index.constant 10 
  %step = index.constant 1

  %fs = llvm.mlir.addressof @fmt : !llvm.ptr

  scf.for %iv = %lb to %ub step %step {
    %el = memref.load %output[%iv] : memref<10xf32>
    llvm.call @printf(%fs, %el) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, f32) -> i32
  }

  return
}

SCF에서 OpenMP로의 변환 패스

예를 들어 다음과 같은 SCF 병렬 루프는:

scf.parallel (%i) = (%c0) to (%c100) step (%c1) {
  %val = memref.load %A[%i] : memref<100xf32>
  %result = arith.addf %val, %val : f32
  memref.store %result, %B[%i] : memref<100xf32>
  scf.yield
}

OpenMP 연산으로 변환된다:

omp.parallel {
  omp.wsloop for (%i) : i32 = (%c0) to (%c100) step (%c1) {
    %val = memref.load %A[%i] : memref<100xf32>
    %result = arith.addf %val, %val : f32
    memref.store %result, %B[%i] : memref<100xf32>
    omp.yield
  }
  omp.terminator
}

MLIR 파트 3 - Affine 방언과 OpenMP

Polyhedral Optimization Model

Matrix Multiplication

Affine Maps (`affine.apply`)

Affine Optimization Passes

Convolution Kernels

OpenMP

병렬 영역(omp.parallel)

작업 공유 루프(omp.wsloop)

배리어를 통한 동기화(omp.barrier)

`omp` 방언

SCF에서 OpenMP로의 변환 패스

External Resources

관련 추천 글

'affine' 방언 - MLIR

챕터 0: “구조화(Structured)” Linalg 연산 입문

MLIR 파트 4 - MLIR의 선형대수

'linalg' 다이얼렉트 - MLIR

Polyhedral Optimization Model

Matrix Multiplication

Affine Maps (`affine.apply`)

Affine Optimization Passes

Convolution Kernels

OpenMP

병렬 영역(omp.parallel)

작업 공유 루프(omp.wsloop)

배리어를 통한 동기화(omp.barrier)

`omp` 방언

SCF에서 OpenMP로의 변환 패스

External Resources

관련 추천 글

'affine' 방언 - MLIR

챕터 0: “구조화(Structured)” Linalg 연산 입문

MLIR 파트 4 - MLIR의 선형대수

'linalg' 다이얼렉트 - MLIR

MLIR 파트 3 - Affine 방언과 OpenMP

Polyhedral Optimization Model

Matrix Multiplication

Affine Maps (affine.apply)

Affine Optimization Passes

Convolution Kernels

OpenMP

병렬 영역(omp.parallel)

작업 공유 루프(omp.wsloop)

배리어를 통한 동기화(omp.barrier)

omp 방언

SCF에서 OpenMP로의 변환 패스

External Resources

관련 추천 글

'affine' 방언 - MLIR

챕터 0: “구조화(Structured)” Linalg 연산 입문

MLIR 파트 4 - MLIR의 선형대수

'linalg' 다이얼렉트 - MLIR

Polyhedral Optimization Model

Matrix Multiplication

Affine Maps (affine.apply)

Affine Optimization Passes

Convolution Kernels

OpenMP

병렬 영역(omp.parallel)

작업 공유 루프(omp.wsloop)

배리어를 통한 동기화(omp.barrier)

omp 방언

SCF에서 OpenMP로의 변환 패스

External Resources

관련 추천 글

'affine' 방언 - MLIR

챕터 0: “구조화(Structured)” Linalg 연산 입문

MLIR 파트 4 - MLIR의 선형대수

'linalg' 다이얼렉트 - MLIR

Affine Maps (`affine.apply`)

`omp` 방언

Affine Maps (`affine.apply`)

`omp` 방언