맨체스터 가비지 컬렉터와 purple-garden의 런타임

최근 맨체스터에 머무는 동안, purple-garden을 위한 mgc - manchester garbage collector #10을 설계하고 구현했다. 이 글은 mgc의 내부 동작과, 왜 이런 방식으로 설계되었는지에 대한 딥다이브다.

Intro

가비지 컬렉션 자체는 널리 연구된 주제지만, 여러 GC 기법을 조합하면 단일 전략에만 의존하는 것보다 더 나은 결과를 얻을 수 있다. 맨체스터 가비지 컬렉터(mgc)는 바로 그런 여러 패러다임의 합성물이다. 즉, 사전 할당된 메모리 영역으로의 할당, 재귀적 루트 셋(root set) 추적을 통한 도달성 분석, **세미스페이스(semi-space) 복사 기반의 압축(compacting)**을 결합한다. 동시에 이것이 완전히 새로운 아이디어는 아니지만, 이 런타임을 위해 특별히 엔지니어링된 설계다.

이 접근들을 결합하면 빠른 할당, 낮은 할당 지연, 그리고 단편화 감소를 얻을 수 있다.

Purple-garden

mgc는 purple-garden 런타임을 위해 특별히 설계되었다. Purple-garden은 성능과 낮은 메모리 프로파일에 초점을 맞춰 설계·구현된 미니멀 스크립팅 언어다. 참고: xnacly/purple-garden.

NIX

1fn greeting :: greetee { std::println("hello world to:" greetee) }
2greeting(std::env::get("USER")) # hello world to: $USER
3
4var v = "Hello World"
5std::println(std::runtime::type(v) std::len(v)) # str 11

Purple-garden은 스크립팅 언어로서의 임베드 용이성과 사용 편의성에 초점을 맞춘다. 이를 위해 작은 메모리 풋프린트, 고성능 런타임, 미니멀하지만 유용한 표준 라이브러리를 갖춘다. 구현은 C로 되어 있으며 레지스터 기반 바이트코드 컴파일러와 가상 머신이다.

위 예제는 (런타임 내부에서) 메모리를 할당하지 않기 때문에 이 글에 적합한 예시는 아니지만, purple-garden이 무엇을 지향하는지 보여주는 좋은 예시다.

런타임 힙에 부담을 주는 더 나은 예시는 다음의 재귀 문자열 연결이다:

NIX

1fn f :: n {
 2    match {
 3        n < 100000 {
 4            f(n+1)
 5            var s1 = std::str::append("a" "b")
 6            var s2 = std::str::append("c" "d")
 7            var pair = std::str::append(s1 s2)
 8        }
 9        { std::println(n) }
10    }
11}
12f(0)
13std::runtime::gc::cycle()

f는 GC를 트리거할 만큼 충분한 메모리를 할당하지 않기 때문에(참고: Triggering Garbage Collection), std::runtime::gc::cycle()로 수동 컬렉션을 실행한다. 또한 purple-garden은 GC 압력을 줄이기 위해 콜 프레임(call frame)용 별도 할당기를 사용한다.

Garbage collection stages

mgc는 세 단계로 나뉘며, 각 단계는 컬렉션 동안 런타임이 멈추는(stop-the-world) 것을 요구한다. 이 세 단계가 합쳐져 하나의 컬렉션 사이클을 이룬다:

TEXT

1+-------------+
 2|  Roots Set  |
 3| (VM regs,   |
 4| globals...) |
 5+------+------+
 6       |
 7       v
 8+-------------+
 9|  Mark Phase |
10| Mark all    |
11| live objects|
12+------+------+
13       |
14       v              Copy
15+-------------------+ live     +-------------------+
16|   Old Bump Space  | objects  |   New Bump Space  |
17|  (old allocator)  | -------> |  (new allocator)  |
18+-------------------+          +-------------------+
19       |
20       v
21+--------------+
22| Reset Old    |
23| Bump Alloc   |
24| (len=0,pos=0)|
25+------+-------+
26       |
27       v
28+-------------+
29| Swap Alloc  |
30| old <-> new |
31+-------------+

Mark

이 가상 머신에서 할당을 관리하기 위해, 가비지 컬렉터는 도달 가능한 모든 할당을 걸어가며(reachability walk) 도달 가능하다고 표시(mark)하기 위한 시작점들의 집합을 정의한다. 이 집합을 **루트 셋(root set)**이라고 한다. purple-garden(pg)에서는 루트 셋이 현재 레지스터들과 로컬 변수 바인딩을 보관하는 변수 테이블로 구성된다. 상위 스코프의 변수는 스코프 밖에 있어도 살아있는 것으로 간주되므로, GC는 부모 스코프까지 고려해야 한다. 이는 루트 셋을 재귀적으로 걸어가며 도달 가능한 모든 객체에 mark 비트를 세팅하는 방식으로 구현된다.

예를 들어 다음 코드에서 z() 호출 이후 GC가 실행된다면 a와 b는 살아있지만 c는 더 이상 살아있지 않으므로 안전하게 정리될 수 있다:

NIX

1# alive
2var a = allocating() 
3fn z { 
4    # dead
5    var c = allocating()
6}
7# alive
8var b = allocating() 
9z()

Copy

현재 할당된 집합 M 중 살아있는 부분집합 A가 결정되면, A의 값들은 이전 메모리 영역(from-space)에서 새 영역(to-space)으로 안전하게 복사할 수 있다. 두 공간은 모두 bump allocator로 백업되며, bump allocator는 mmap으로 할당된 메모리 영역들의 세그먼트 리스트로 구현되어 있다.

복사가 끝나면 from-space는 해제(deallocate)하지 않고 reset만 한다. 그런 다음 from-space와 to-space를 스왑해, 기존 from-space가 to-space가 되고 그 반대가 된다.

Rewriting references

복사로 인해 객체들이 새 메모리 위치로 이동하므로, 기존에 들고 있던 참조들은 더 이상 유효하지 않다. 따라서 새로 복사된 메모리 영역을 가리키도록 참조를 재작성(rewrite)해야 한다.

이를 위해 앞서 구성했던 루트 셋을 다시 한 번 순회하며 모든 참조를 업데이트한다.

Reference implementation

이제 위 개념들을 실제 C 코드로 옮긴 내용이 다음 장들이다.

1typedef struct {
 2  // from-space
 3  Allocator *old;
 4  // to-space
 5  Allocator *new;
 6  void *vm;
 7  GcHeader *head;
 8  size_t allocated;
 9  size_t allocated_since_last_cycle;
10} Gc;
11
12Gc gc_init(size_t gc_size) {
13  size_t half = gc_size / 2;
14  return (Gc){
15      .old = bump_init(half, 0),
16      .new = bump_init(half, 0),
17      .head = NULL,
18  };
19}

Allocator는 할당을 추상화하는 간단한 구조체다. 예를 들어 bump allocator를 위해:

1typedef struct {
 2  // Allocator::ctx refers to an internal allocator state and owned memory
 3  // areas, for instance, a bump allocator would attach its meta data (current
 4  // position, cap, etc) here
 5  void *ctx;
 6  Stats (*stats)(void *ctx);
 7  void *(*request)(void *ctx, size_t size);
 8  void (*reset)(void *ctx);
 9  void (*destroy)(void *ctx);
10} Allocator;

Keeping track of allocations

GC가 어떤 영역을, 어떤 크기로, 어떤 타입으로 내줬는지 알기 위해, 각 영역은 메타데이터 헤더와 함께 할당된다. 이 헤더는 다음으로 구성된다:

1typedef struct GcHeader {
2  unsigned int marked : 1;
3  unsigned int type : 3;
4  uintptr_t forward;
5  uint16_t size;
6  struct GcHeader *next;
7} GcHeader;

3비트의 type은 페이로드를 raw bytes, string, list, map 중 무엇으로 볼지 식별한다(아래 참고). 또한 헤더는 참조 재작성에 사용되는 forwarding 포인터, 해당 페이로드 크기(객체 크기는 64KB로 제한; 현재 런타임 요구와 일치), 그리고 다음 할당을 가리키는 포인터를 담는다. 이로써 헤더의 연결 리스트를 순회하는 방식으로 힙 스캔이 가능해지고, 이는 재작성 과정에서 도움이 된다.

1typedef enum {
 2  // just bytes
 3  GC_OBJ_RAW = 0b000,
 4  // a string with a reference to an inner string, can be allocated or not
 5  GC_OBJ_STR = 0b001,
 6  // list has zero or more children
 7  GC_OBJ_LIST = 0b010,
 8  // map holds allocated buckets with owned children
 9  GC_OBJ_MAP = 0b011,
10} ObjType;

따라서 할당은 다음과 같은 형태가 된다:

TEXT

1       allocation 
 2      (raw pointer)
 3            |
 4            |
 5            v
 6 +----------------------+
 7 | GcHeader             | <-- header
 8 +----------------------+
 9 |                      |
10 | payload (size B)     | <-- data handed out as ptr to the user
11 |                      |
12 +----------------------+

각 GcHeader는 32B이므로, 모든 힙 할당은 32B의 오버헤드가 추가된다.

1void *gc_request(Gc *gc, size_t size, ObjType t) {
 2  void *allocation = gc->old->request(gc->old->ctx, size + sizeof(GcHeader));
 3  void *payload = (char *)allocation + sizeof(GcHeader);
 4  GcHeader *h = (GcHeader *)allocation;
 5  h->type = t;
 6  h->marked = 0;
 7  h->size = size;
 8  h->next = gc->head;
 9  gc->head = h;
10  gc->allocated_since_last_cycle += size;
11  gc->allocated += size;
12  return (void *)payload;
13}

앞서 설명했듯 살아있는 객체를 모두 마킹한 뒤에는 그 살아있는 객체들을 복사한다. 이 과정 역시 헤더 체인을 걷지만 GcHeader.marked가 켜진 것들로 제한된다. 각 객체는 to-space로 복사되며 GcHeader.forward 필드는 새로운 메모리 영역을 가리키도록 설정된다.

(Almost) Zero cost abstractions and small Values

purple-garden 런타임의 값(Value)은 타입 인코딩을 위해 3비트를 사용한다:

1typedef enum {
 2  V_NONE,   // zero value, comparable to rusts Option::None
 3  V_STR,    // string view
 4  V_DOUBLE, // floating point number
 5  V_INT,    // integer
 6
 7  V_TRUE,   // booleans
 8  V_FALSE,
 9
10  V_ARRAY,  // containers / adts
11  V_OBJ,
12} ValueType;

각 타입의 데이터를 저장하기 위한 union(단, true/false는 추가 데이터가 필요 없으므로 제외). Str은 커스텀 문자열 추상화, List는 동적 증가 배열, Map은 해시 맵, 그리고 double, int64_t를 담는다.

1typedef struct Value {
 2  unsigned int type : 3;
 3  union {
 4    Str string;
 5    List *array;
 6    Map *obj;
 7    double floating;
 8    int64_t integer;
 9  };
10} Value;

그리고 값을 optional로 표시하기 위한 단일 비트 is_some을 추가해, 모든 타입이 optional 표현을 가질 수 있게 한다:

1typedef struct Value {
 2  unsigned int type : 3;
 3  unsigned int is_some : 1;
 4  union {
 5    Str string;
 6    List *array;
 7    Map *obj;
 8    double floating;
 9    int64_t integer;
10  };
11} Value;

Value 자체에 bit tag를 두면, V_NONE을 제외한 모든 타입은 is_some이 true/false 모두와 결합 가능한 상태를 가진다. 반면 V_NONE은 is_some이 false인 경우에만 결합 가능하다. 이 불변조건은 런타임에서 검사되며, 특정 Value 인스턴스가 optional인지 판별하는 전용 함수가 있다:

1inline bool Value_is_opt(const Value *v) {
2  return v->type == V_NONE || v->is_some;
3}

이 접근의 큰 장점은 런타임에서 optional을 구현할 때 할당 0, 복사 0, 간접 참조 0을 가능하게 한다는 점이다. 각 값은 Value.is_some 비트를 세팅하는 것만으로 optional 값이 된다. 만약 optionality를 V_OPTION으로 구현하고 Value.inner가 값이며 Value.is_some이 option의 유무를 나타내도록 했다면, 재귀적 구조 때문에 항상 할당이 필요했을 것이다.

NIX

1# optionals
2var opt_none = std::None()
3var opt_some = std::Some([])
4std::opt::or(opt_none "anything else") # -> "anything else"
5std::opt::unwrap(opt_some) # -> []
6std::opt::expect(opt_some "would panic with this message") # -> []
7std::opt::is_some(opt_some) # -> true 
8std::opt::is_none(opt_none) # -> true

optional과 상호작용하는 표준 라이브러리 함수는 std::opt에 있으며, 다른 패키지의 많은 함수들이 optional 값을 반환하거나 인자로 받는다. 이 값 설계 덕분에 이 패키지의 함수들은 구현이 매우 단순하다. 예를 들어 std::opt::{some, none, or}:

1static void pg_builtin_opt_some(Vm *vm) {
 2  Value inner = ARG(0);
 3  inner.is_some = true;
 4  RETURN(inner);
 5}
 6
 7// INTERNED_NONE is a static Value instance thats always available to the runtime
 8
 9static void pg_builtin_opt_none(Vm *vm) { RETURN(*INTERNED_NONE); }
10
11static void pg_builtin_opt_or(Vm *vm) {
12  Value lhs = ARG(0);
13  Value rhs = ARG(1);
14  ASSERT(Value_is_opt(&lhs), "Or: lhs wasnt an Optional");
15  if (!lhs.is_some) {
16    RETURN(rhs);
17  } else {
18    lhs.is_some = false;
19    RETURN(lhs);
20  }
21}

이 설계는 또한 숫자 타입 Value를 할당 없이 표현할 수 있게 해, 핫패스에서 힙 할당을 줄이는 대신 Value의 크기를 조금 더 크게 만드는 트레이드오프를 한다.

Differentiating heap and non-heap values

purple-garden에서 불리언, double, int는 힙 할당이 필요 없다. 컴파일 타임에 알려진 문자열은 인터프리터가 받는 초기 입력에 대한 단순한 윈도우(view)이므로, gc_alloc로 힙 할당되지 않으며 마킹이나 컬렉션의 대상도 아니다.

따라서 힙 값과 비힙 값을 구분하는 마커가 필요하다: Value.is_heap. 이 비트가 켜진 값만 마킹·컬렉션 대상이며, GC의 copy 및 rewrite 단계로 전달된다.

1typedef struct Value {
 2  unsigned int is_heap : 1;
 3  unsigned int type : 3;
 4  union {
 5    Str string;
 6    List *array;
 7    Map *obj;
 8    double floating;
 9    int64_t integer;
10  };
11} Value;

Dynamically growable bump allocation

시스템 콜을 통한 OS 상호작용은 비용이 크므로, purple-garden은 가능한 한 시스템 콜을 줄이도록 설계되었다. 이를 위해 성장 가능한 bump allocator를 사용한다. bump allocator가 요청을 만족시키지 못하면 OS로부터 새로운 메모리 청크를 요청한다. 이 청크는 이전 청크의 두 배 크기이며, 첫 청크는 실행 중인 시스템의 페이지 크기만큼 크다.

String abstraction

C 스타일 문자열은 길이가 항상 함께 있지 않아 사용하기 불편하므로, purple-garden은 문자열 추상화를 내장한다. 비용이 큰 C 문자열 연산을 개선하기 위해 해시와 길이를 1급 시민으로 만든다. Str은 자신이 관리하지 않는 버퍼에 대한 뷰(view)로, 버퍼 포인터, 길이, 해시를 가진다. 백킹 스토리지는 불변(immutable)이다. Str은 복사가 저렴하므로 Value에서 참조가 아니라 값으로 저장된다.

1typedef struct __Str {
2  uint64_t hash;
3  uint32_t len;
4  const uint8_t *p;
5} Str;

Str.length는 생성 시 계산된다. 런타임 컴파일 시점에 알려진 정적 문자열(에러 메시지, 값 타입 이름의 문자열 표현 등)은 STRING 매크로로 전달한다:

1#define STRING(str)                                                            \
2  ((Str){.len = sizeof(str) - 1, .p = (const uint8_t *)str, .hash = 0})

런타임 입력에 포함된 문자열에 대해서는 다른 접근을 사용한다. 예를 들어 다음을 보자:

NIX

1std::println("Hello" "World")

이 예제에서 "Hello"와 "World"의 길이와 해시는 렉서가 문자열 토큰으로 인식하는 즉시 알 수 있다. 문자열의 모든 문자를 순회해야 한다는 사실을 이용해, 렉서가 문자열을 인식하는 동안 길이와 해시를 동시에 계산한다:

1// ! error handling omitted for clarity
 2string: {
 3    // skip "
 4    l->pos++;
 5    size_t start = l->pos;
 6    uint64_t hash = FNV_OFFSET_BASIS;
 7
 8    char cc = l->input.p[l->pos];
 9    for (; cc > 0 && cc != '"'; l->pos++, cc = l->input.p[l->pos]) {
10        hash ^= cc;
11        hash *= FNV_PRIME;
12    }
13
14    Token *t = CALL(a, request, sizeof(Token));
15    *t = (Token){0};
16    t->type = T_STRING;
17    t->string = (Str){
18        .p = l->input.p + start,
19        .len = l->pos - start,
20        .hash = hash,
21    };
22
23    l->pos++;
24    return t;
25}

해싱은 FNV-1a로 한다. 런타임에 생성된 문자열의 해시를 계산하기 위해 보완적인 Str_hash 함수도 제공한다:

1inline uint64_t Str_hash(const Str *str) {
2  uint64_t hash = FNV_OFFSET_BASIS;
3  for (size_t i = 0; i < str->len; i++) {
4    hash ^= (uint64_t)str->p[i];
5    hash *= FNV_PRIME;
6  }
7  return hash;
8}

이 추상화 위에는 작은 내부 API가 구축되어 있다:

1char Str_get(const Str *str, size_t index);
2Str Str_from(const char *s);
3Str Str_slice(const Str *str, size_t start, size_t end);
4Str Str_concat(const Str *a, const Str *b, Allocator *alloc);
5bool Str_eq(const Str *a, const Str *b);
6void Str_debug(const Str *str);
7int64_t Str_to_int64_t(const Str *str);
8double Str_to_double(const Str *str);

그리고 str 표준 라이브러리 패키지는 str::append, str::lines, str::slice를 노출한다:

NIX

1std::str::append("Hello" " World") # Hello World
2std::str::slice("Hello" 0 2) # hel
3std::str::slice("Hello" 1 2) # el
4std::str::lines("hello\nworld") # [hello world]

Triggering garbage collection

짧게 실행되는 스크립트에서는 할당 지연이 더 중요하므로, 할당 시마다(혹은 더 나쁘게는 각 바이트코드 명령마다) 메모리 사용량이 GC 임계값을 넘었는지 검사하면 VM이 크게 느려지고 실행/할당 속도 모두에 지연이 생긴다.

이를 피하기 위해 임계값 검사는 스코프를 벗어날 때만 수행한다. 이렇게 하면 VM이 스코프를 떠나기 전에 살아있다고 고려해야 할 객체들을 생략할 수 있다. 제어 흐름과 메모리 사용량에 ‘기생(piggy backing)’해서 GC 사이클을 트리거하면 GC 동작을 더 쉽게 추론할 수 있다.

1    case OP_RET: {
 2      Frame *old = vm->frame;
 3      if (vm->frame->prev) {
 4        vm->pc = vm->frame->return_to_bytecode;
 5        vm->frame = vm->frame->prev;
 6      }
 7
 8      // [...] free lists for stack frames and stuff
 9
10      if (!vm->config.disable_gc &&
11          vm->gc->allocated >= (vm->config.gc_size * vm->config.gc_limit)) {
12        gc_cycle(vm->gc);
13      }
14      break;
15    }

runtime 표준 라이브러리는 std::runtime::gc::stats와 수동 GC 사이클 트리거 std::runtime::gc::cycle()을 위한 gc 패키지를 제공한다(글 첫 예시에서 보였다). 후자의 구현과 사용은 아래와 같다.

1  if (!vm->config.disable_gc) {
2    gc_cycle(vm->gc);
3  }
4}

NIX

1var combination = std::str::append("Hello" " " "World")
2std::println(combination)
3std::runtime::gc::cycle()

Marking a Value

앞서 설명했듯 is_heap이 켜진 값만 루트로 고려되고 GC가 관리해야 한다. 따라서 mark는 빠른 탈출 조건을 가진다:

1static inline void mark(Gc *gc, const Value *val) {
2  if (!val || !val->is_heap) {
3    return;
4  }
5
6  // [...]
7}

Str은 GC가 관리하는 내부 버퍼(GC_OBJ_RAW)를 감싸지만, Str 자체는 할당되지 않으므로 V_ARRAY/V_OBJ와 다르게 처리해야 한다:

1static inline void mark(Gc *gc, const Value *val) {
 2  // [...]
 3
 4  void *payload = NULL;
 5  switch ((ValueType)val->type) {
 6  case V_STR:
 7    GcHeader *raw = (GcHeader *)((uint8_t *)val->string.p - sizeof(GcHeader));
 8    raw->marked = true;
 9    break;
10    return;
11  case V_ARRAY:
12    payload = (void *)val->array;
13    break;
14  case V_OBJ:
15    payload = (void *)val->obj;
16    break;
17  default:
18    return;
19  }
20
21  // [...]
22}

배열과 객체의 payload는 할당된 것이므로 캐스팅해 할당한다. payload가 sizeof(GcHeader)보다 작으면 힙 손상 또는 GC 버그이므로 assert한다. 그리고 이미 마킹된 헤더는 조기 종료한다.

1static inline void mark(Gc *gc, const Value *val) {
 2  // [...]
 3
 4  ASSERT((uintptr_t)payload > sizeof(GcHeader),
 5         "payload too small, GC logic bug, this shouldnt happen");
 6
 7  GcHeader *h = (GcHeader *)((char *)payload - sizeof(GcHeader));
 8  if (!h || h->marked) {
 9    return;
10  }
11
12  h->marked = 1;
13
14  // [...]

문자열은 이미 처리했으므로, 이제 GcHeader->type 비트를 기준으로 스위치하여 배열의 각 멤버와 객체의 각 값을 재귀적으로 마킹한다.

1static inline void mark(Gc *gc, const Value *val) {
 2  // [...]
 3
 4  switch ((ObjType)h->type) {
 5  case GC_OBJ_LIST: {
 6    List *l = (List *)payload;
 7    for (size_t i = 0; i < l->len; i++) {
 8      mark(gc, &l->arr[i]);
 9    }
10    break;
11  }
12  case GC_OBJ_MAP:
13    Map *m = (Map *)payload;
14    for (size_t i = 0; i < m->cap; i++) {
15      MapEntry e = m->buckets[i];
16      mark(gc, &e.value);
17    }
18  default:
19    return;
20  }
21}

Letting mark loose on registers and call frames

gc_cycle에서 루트 셋을 마킹해야 하며, 루트 셋은 레지스터들과 현재 및 이전 모든 콜 프레임의 변수 테이블로 구성되므로, 각각에 대해 mark를 호출해야 한다:

1void gc_cycle(Gc *gc) {
 2  if (!gc->allocated_since_last_cycle) {
 3    return;
 4  }
 5  gc->allocated_since_last_cycle = 0;
 6  Vm *vm = ((Vm *)gc->vm);
 7  for (size_t i = 0; i < REGISTERS; i++) {
 8    const Value *ri = vm->registers + i;
 9    mark(gc, ri);
10  }
11
12  for (Frame *f = vm->frame; f; f = f->prev) {
13    for (size_t i = 0; i < f->variable_table.cap; i++) {
14      MapEntry *me = &f->variable_table.buckets[i];
15      if (me->hash) {
16        mark(gc, &me->value);
17      }
18    }
19  }
20
21  // [...]
22}

루트 셋이 참조하는 모든 Value 및 GcHeader를 마킹한 뒤, 이들을 to-space/new 할당기로 복사한다. 또한 복사된 헤더 체인을 다시 구축한다.

1void gc_cycle(Gc *gc) {
 2    // [...]
 3
 4  GcHeader *new_head = NULL;
 5  size_t new_alloc = 0;
 6  for (GcHeader *h = gc->head; h; h = h->next) {
 7    if (!h->marked) {
 8      continue;
 9    }
10
11    void *buf = CALL(gc->new, request, h->size + sizeof(GcHeader));
12    GcHeader *nh = (GcHeader *)buf;
13    void *new_payload = (char *)buf + sizeof(GcHeader);
14    memcpy(nh, h, sizeof(GcHeader) + h->size);
15    nh->next = new_head;
16    new_head = nh;
17    h->forward = (uintptr_t)new_payload;
18    nh->forward = 0;
19    nh->marked = 0;
20    new_alloc += h->size;
21  }
22
23    // [...]
24}

Forwarding pointers and Rewriting references

mark와 rewrite_nested는 스택을 터뜨릴 수 있으므로, 둘 다 향후에는 워크리스트(work list)를 사용할 예정이다.

각 GcHeader.forward는 새로 복사된 영역을 가리키도록 모든 참조를 업데이트하는 데 사용된다:

1static inline void *forward_ptr(void *payload) {
 2  if (!payload) {
 3    return NULL;
 4  }
 5
 6  GcHeader *old = (GcHeader *)((char *)payload - sizeof(GcHeader));
 7  if (!old) {
 8    // not a gc object, hitting this is a bug
 9    unreachable();
10  }
11
12  if (old->type < GC_OBJ_RAW || old->type > GC_OBJ_MAP) {
13    // either already in newspace or not a heap object; return payload unchanged
14    return payload;
15  }
16
17  if (old->forward) {
18    return (void *)old->forward;
19  }
20
21  // normally this would be unreachable, but since pg doesnt clear registers
22  // after insertions into adts or the variable table, references to heap data
23  // can be both in the variable table and registers at the same time, thus
24  // allowing for multiple forward_ptr calls since there are multiple references
25  // to a single point in memory. This results in double forwarding and other
26  // shenanigans. Just returning the payload if no forward was found is correct
27  // and a fix.
28  return payload;
29}

rewrite는 가능한 한 단순하다:

1static inline void rewrite(Gc *gc, Value *v) {
 2  if (!v->is_heap) {
 3    return;
 4  }
 5
 6  switch ((ValueType)v->type) {
 7  case V_STR:
 8    v->string.p = (const uint8_t *)forward_ptr((void *)v->string.p);
 9    break;
10  case V_ARRAY:
11    v->array = (List *)forward_ptr((void *)v->array);
12    break;
13  case V_OBJ:
14    v->obj = (Map *)forward_ptr((void *)v->obj);
15    break;
16  default:
17    return;
18  }
19}

rewrite_nested는 객체와 배열에 대해 중첩하여 재작성 과정을 수행하는 점에서 rewrite를 확장한다:

1void rewrite_nested(Gc *gc, Value *v) {
 2  rewrite(gc, v);
 3
 4  switch (v->type) {
 5  case V_ARRAY:
 6    for (size_t i = 0; i < v->array->len; i++) {
 7      rewrite_nested(gc, &v->array->arr[i]);
 8    }
 9    break;
10  case V_OBJ:
11    for (size_t i = 0; i < v->obj->cap; i++) {
12      MapEntry *me = &v->obj->buckets[i];
13      if (me->hash) {
14        rewrite_nested(gc, &me->value);
15      }
16    }
17    break;
18  default:
19    break;
20  }
21}

이를 gc_cycle에서 합치면, 레지스터들, 각 프레임의 변수 테이블 엔트리들, 그리고 GcHeader 체인의 각 헤더에 대해 모든 Value가 재작성된다.

1void gc_cycle(Gc *gc) {
 2  // [...]
 3
 4  for (size_t i = 0; i < REGISTERS; i++) {
 5    Value *ri = &vm->registers[i];
 6    rewrite(gc, ri);
 7  }
 8
 9  for (Frame *f = vm->frame; f; f = f->prev) {
10    for (size_t i = 0; i < f->variable_table.cap; i++) {
11      MapEntry *me = &f->variable_table.buckets[i];
12      if (me->hash) {
13        rewrite_nested(gc, &me->value);
14      }
15    }
16  }
17
18  for (GcHeader *h = new_head; h; h = h->next) {
19    switch (h->type) {
20    case GC_OBJ_LIST: {
21      List *l = (List *)((uint8_t *)h + sizeof(GcHeader));
22      for (size_t i = 0; i < l->len; i++) {
23        rewrite_nested(gc, &l->arr[i]);
24      }
25      break;
26    }
27    case GC_OBJ_MAP: {
28      Map *m = (Map *)((uint8_t *)h + sizeof(GcHeader));
29      for (size_t i = 0; i < m->cap; i++) {
30        MapEntry *me = &m->buckets[i];
31        if (me->hash) {
32          rewrite_nested(gc, &me->value);
33        }
34      }
35      break;
36    }
37    case GC_OBJ_STR: {
38      Str *str = (Str *)((uint8_t *)h + sizeof(GcHeader));
39      str->p = forward_ptr((void *)str->p);
40      break;
41    }
42    default:
43      break;
44    }
45  }
46
47  // [...]
48}

Bookkeeping after collection

1void gc_cycle(Gc *gc) {
2  // [...]
3  gc->head = new_head;
4  SWAP_STRUCT(gc->old, gc->new);
5  CALL(gc->new, reset);
6  gc->allocated = new_alloc;
7}

mark, copy, rewrite 이후 새 체인을 Gc.head에 연결한다.

SWAP_STRUCT로 from/to-space를 스왑한다:

1#define SWAP_STRUCT(A, B)                                                      \
2  do {                                                                         \
3    _Static_assert(__builtin_types_compatible_p(typeof(A), typeof(B)),         \
4                   "SWAP_STRUCT arguments must have identical types");         \
5                                                                               \
6    typeof(A) __swap_tmp = (A);                                                \
7    (A) = (B);                                                                 \
8    (B) = __swap_tmp;                                                          \
9  } while (0)

이제 (새로운 new가 된, 즉 이전 old였던) 할당기는 CALL로 reset한다. CALL은 매크로다:

1#define CALL(SELF, METHOD, ...) (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__)

이는 (gc->new)->reset((gc->new)->ctx);로 확장된다.

Tuning

가상 머신이 VmConfig로 설정 가능하듯, GC도 설정 가능하다:

1typedef struct {
 2  // defines the maximum amount of memory purple garden is allowed to allocate,
 3  // if this is hit, the vm exits with a non zero code
 4  uint64_t max_memory;
 5  // define gc heap size in bytes
 6  uint64_t gc_size;
 7  // gc threshold in percent 5-99%
 8  double gc_limit;
 9  // disables the garbage collector
10  bool disable_gc;
11
12  // [...]
13} Vm_Config;

이 설정은 Vm_new에 전달되어 Vm 구조체에 붙는다. GC를 완전히 비활성화하거나, GC 힙의 총 크기를 정하거나, GC가 사이클을 시작해야 하는 힙 사용량 임계값(퍼센트)을 설정할 수 있다.

임베딩이 아닌(인터프리터 실행) 목적에서는, purple-garden 바이너리가 받는 CLI 플래그 목록에 각 옵션의 짧은 이름, 기본값, 타입, 설명이 정의되어 있다(명확성을 위해 다른 옵션은 생략):

1#define CLI_ARGS                                                               \
2  X(gc_max, 0, GC_MIN_HEAP * 64l, LONG,                                        \
3    "set hard max gc space in bytes, default is GC_MIN_HEAP*64")               \
4  X(gc_size, 0, GC_MIN_HEAP * 2l, LONG, "define gc heap size in bytes")        \
5  X(gc_limit, 0, 70.0, DOUBLE,                                                 \
6    "instruct memory usage amount for gc to start collecting, in percent "     \
7    "(5-99%)")                                                                 \
8  X(no_gc, 0, false, BOOL, "disable garbage collection")

수정된 6cl 라이브러리로 깔끔한 도움말 텍스트를 생성한다(명확성을 위해 다른 옵션은 생략):

TEXT

1usage ./build/purple_garden_debug: [ +gc_max <1638400>] [ +gc_size <51200>]
 2                                   [ +gc_limit <70>] [ +no_gc]
 3                                   // [...]
 4                                   <file.garden>
 5
 6Option:
 7          +gc_max <1638400>
 8                set hard max gc space in bytes, default is GC_MIN_HEAP*64
 9          +gc_size <51200>
10                define gc heap size in bytes
11          +gc_limit <70>
12                instruct memory usage amount for gc to start collecting, in
13                percent (5-99%)
14          +no_gc
15                disable garbage collection
16          // [...]
17
18          +h/+help
19                help page and usage
20Examples:
21        ./build/purple_garden_debug +gc_max 1638400 +gc_size 51200 \
22                                    +gc_limit 0 +no_gc

사용자가 이를 설정하면, purple-garden의 메인 엔트리 포인트가 CLI 옵션으로 VmConfig를 구성한다:

1Vm vm = Vm_new(
2  (Vm_Config){
3      .gc_size = a.gc_size,
4      .gc_limit = a.gc_limit,
5      .disable_gc = a.no_gc,
6      .max_memory = a.gc_max,
7      .disable_std = a.no_std,
8      .no_env = a.no_env,
9  }, pipeline_allocator, &gc);

Comparing mgc to other gc idioms and other reasoning

여기서 가장 obvious한 질문은 목적과 복잡성이다. 왜 여러 패러다임을 활용하는 GC를 설계했는가? 각 패러다임 하나만으로도 충분해 보이는데.

Why Bump allocation instead of malloc

무엇보다 이 런타임에서 할당을 빠르게 유지하는 것이 중요하다. 이를 위해서는 bump allocation 외에는 선택지가 없다. 시스템 콜은 느리고, 작은 객체에 대해 여러 번 호출하면 더 느리며, 핫패스는 폭발한다. purple-garden은 해제보다 할당에 최적화되어 있으므로 bump allocation의 대안을 택할 수 없다.

Why Copying, Compacting and resetting whole Heap regions instead of single value dealloc

bump allocation은 단일 객체 해제를 허용하지 않으므로, 객체 수집을 제한하게 된다. 따라서 to-space/from-space 복사 수집 접근이 가장 자연스럽다. 복사는 대량 회수(bulk reclamation)와 단편화 감소라는 장점도 있으며, 단일 객체 해제보다 특히 유리하다.

Why Register and variable tracing vs heap scanning

루트 셋은 절대적이며 항상 알 수 있다. 도달성을 결정하는 이 값들을 알고 있는 것이 전체 GC 힙을 걷는 것보다 훨씬 효율적인 구현이다. 루트 값이 항상 알려져 있고 헤더도 체인으로 연결돼 있는데, 굳이 힙 전체를 스캔하며 사이클을 낭비할 이유가 없다.

Why not X

한편, 유명 런타임의 일부 GC가 사용하는 개념들 중 일부는 의도적으로 제외했다. 이는 기능 크립(feature creep)을 방지하기 위함이기도 하고, 런타임이 단일 스레드이며 GC는 stop-the-world이고, 일반적으로(특히 GC 상태와 루트 셋 포함) 어떤 상태도 동시에 변이(concurrent mutation)되는 것을 허용하지 않기 때문이다:

write barriers/incremental: 동시(concurrent) 컬렉터에 필요하지만, purple-garden은 단일 스레드이고 stop-the-world다. 컬렉션 시점에 모든 객체가 알려져 있다.
generational: 장수 객체에 최적화되지만, purple-garden은 임베더블 스크립팅 언어다. 게임에서 각 이벤트/엔티티마다 실행되는 플러그인 같은 용도로 빠르고 저렴하게 실행되도록 설계되었다. 짧은 실행과 그에 수반되는 할당에 특별히 초점을 맞춘다.
RC/ARC: 런타임 오버헤드가 크고 사이클은 별도의 비(非)자명한 탐지 로직이 필요해 핫패스를 느리게 만든다.
escape analysis: 힙 할당 감소에 최적화되지만, mgc는 할당 자체를 매우 싸게 만든다. 또한 purple-garden의 바이트코드는 AST에서 컴팩트한 포맷으로 컴파일되며, escape analysis를 수행할 만큼의 수명 정보가 있는 즉시 표현이 없다.

Gc invariants and assumptions

완전성을 위해 포함한 짧은 목록이다. 이 불변조건 중 하나라도 위반하면 정의되지 않은 동작(UB)이 발생한다.

레지스터가 클리어되지 않는다는 점에 의존
값이 잘못 복제되지 않는다는 점
Value.type=V_STR 및 Value.is_heap=1은 문자열이 힙에 할당됨을 의미하고, 그렇지 않으면 렉서가 ‘할당’한 것(기술적으로는 입력을 mmap하여 프로세스 메모리에 올린 것이지만 요지는 그렇다)
Value.is_heap 설정에 대한 VM 규율(discipline)

(non-)Portability to Rust and Design Improvements

나는 현재 여러 이유로 purple-garden을 Rust로 재작성하고 있다(언어 런타임을 만들어 본 적이 없다면 이해하기 쉽지 않은 이유들이다):

추상화가 줄어든다. C에서는 해싱, 해시맵, 배열, 배열 리스트 등을 전부 직접 만들어야 했는데, Rust에서는 std::collection으로 대체할 수 있다.
Rust에서는 Vec<Op>를 쓸 수 있어 append/insert가 내가 구현한 것보다 훨씬 좋아, 바이트코드 컴파일러가 더 낫고 더 짧아진다.
더 나은 바이트코드 포맷과 각 명령당 낭비되는 바이트 감소. 예를 들어 RET는 1바이트, ADD rA, rB 같은 것은 여러 바이트로 할 수 있다:

RUST

1enum Op<'vm> {
 2    Add {
 3        dst: u8,
 4        lhs: u8,
 5        rhs: u8,
 6    },
 7    // [...]
 8    Ret {
 9        /// used for peephole optimisation, merging multiple RET into a single
10        /// RET with a count
11        times: u8,
12    },
13}

각 컴포넌트에서 훨씬 나은 에러 핸들링. 현재 C 인터프리터는 assertion으로 그냥 abort한다.
Rust 해시맵을 통해 컴파일 타임 값 interning이 쉬워진다(기존에는 서로 다른 타입에 대해 해시맵 3개를 썼는데, 이제는 std::collections::HashMap 하나를 사용).
ARGS 명령은 더 이상 필요 없다. builtin과 사용자 함수 호출에 대한 인자 레지스터 오프셋과 개수를 인코딩하기 위해 ARGS가 있었지만, 이제는 SYS와 CALL이 명령 자체에 오프셋과 인자 수를 인코딩한다:

RUST

1enum Op<'vm> {
 2    // [...]
 3    Call {
 4        func: u16,
 5        args_start: u8,
 6        args_len: u8,
 7    },
 8    Sys {
 9        ptr: BuiltinFn<'vm>,
10        args_start: u8,
11        args_len: u8,
12    },
13}

SYS의 builtin 함수는 더 이상 타입 소거(type erasure), 캐스팅, 컴파일러 내 간접 참조가 필요 없다. 이전에는 64비트 포인터를 word에 저장할 수 없어서 컴파일러가 builtin 포인터 배열을 만들고 바이트코드에 그 인덱스를 넣어 VM이 올바른 builtin을 호출했는데, 이제는 단순한 함수 포인터가 된다. 바이트코드 레퍼런스는 위에 있으며, 타입 정의는 아래와 같다:

RUST

1type BuiltinFn<'vm> = fn(&mut Vm<'vm>, &[Value]);

Intro

이 접근들을 결합하면 빠른 할당, 낮은 할당 지연, 그리고 단편화 감소를 얻을 수 있다.

Purple-garden

NIX

1fn greeting :: greetee { std::println("hello world to:" greetee) }
2greeting(std::env::get("USER")) # hello world to: $USER
3
4var v = "Hello World"
5std::println(std::runtime::type(v) std::len(v)) # str 11

런타임 힙에 부담을 주는 더 나은 예시는 다음의 재귀 문자열 연결이다:

NIX

1fn f :: n {
 2    match {
 3        n < 100000 {
 4            f(n+1)
 5            var s1 = std::str::append("a" "b")
 6            var s2 = std::str::append("c" "d")
 7            var pair = std::str::append(s1 s2)
 8        }
 9        { std::println(n) }
10    }
11}
12f(0)
13std::runtime::gc::cycle()

Garbage collection stages

mgc는 세 단계로 나뉘며, 각 단계는 컬렉션 동안 런타임이 멈추는(stop-the-world) 것을 요구한다. 이 세 단계가 합쳐져 하나의 컬렉션 사이클을 이룬다:

TEXT

1+-------------+
 2|  Roots Set  |
 3| (VM regs,   |
 4| globals...) |
 5+------+------+
 6       |
 7       v
 8+-------------+
 9|  Mark Phase |
10| Mark all    |
11| live objects|
12+------+------+
13       |
14       v              Copy
15+-------------------+ live     +-------------------+
16|   Old Bump Space  | objects  |   New Bump Space  |
17|  (old allocator)  | -------> |  (new allocator)  |
18+-------------------+          +-------------------+
19       |
20       v
21+--------------+
22| Reset Old    |
23| Bump Alloc   |
24| (len=0,pos=0)|
25+------+-------+
26       |
27       v
28+-------------+
29| Swap Alloc  |
30| old <-> new |
31+-------------+

Mark

예를 들어 다음 코드에서 z() 호출 이후 GC가 실행된다면 a와 b는 살아있지만 c는 더 이상 살아있지 않으므로 안전하게 정리될 수 있다:

NIX

1# alive
2var a = allocating() 
3fn z { 
4    # dead
5    var c = allocating()
6}
7# alive
8var b = allocating() 
9z()

Copy

복사가 끝나면 from-space는 해제(deallocate)하지 않고 reset만 한다. 그런 다음 from-space와 to-space를 스왑해, 기존 from-space가 to-space가 되고 그 반대가 된다.

Rewriting references

이를 위해 앞서 구성했던 루트 셋을 다시 한 번 순회하며 모든 참조를 업데이트한다.

Reference implementation

이제 위 개념들을 실제 C 코드로 옮긴 내용이 다음 장들이다.

1typedef struct {
 2  // from-space
 3  Allocator *old;
 4  // to-space
 5  Allocator *new;
 6  void *vm;
 7  GcHeader *head;
 8  size_t allocated;
 9  size_t allocated_since_last_cycle;
10} Gc;
11
12Gc gc_init(size_t gc_size) {
13  size_t half = gc_size / 2;
14  return (Gc){
15      .old = bump_init(half, 0),
16      .new = bump_init(half, 0),
17      .head = NULL,
18  };
19}

Allocator는 할당을 추상화하는 간단한 구조체다. 예를 들어 bump allocator를 위해:

1typedef struct {
 2  // Allocator::ctx refers to an internal allocator state and owned memory
 3  // areas, for instance, a bump allocator would attach its meta data (current
 4  // position, cap, etc) here
 5  void *ctx;
 6  Stats (*stats)(void *ctx);
 7  void *(*request)(void *ctx, size_t size);
 8  void (*reset)(void *ctx);
 9  void (*destroy)(void *ctx);
10} Allocator;

Keeping track of allocations

GC가 어떤 영역을, 어떤 크기로, 어떤 타입으로 내줬는지 알기 위해, 각 영역은 메타데이터 헤더와 함께 할당된다. 이 헤더는 다음으로 구성된다:

1typedef struct GcHeader {
2  unsigned int marked : 1;
3  unsigned int type : 3;
4  uintptr_t forward;
5  uint16_t size;
6  struct GcHeader *next;
7} GcHeader;

1typedef enum {
 2  // just bytes
 3  GC_OBJ_RAW = 0b000,
 4  // a string with a reference to an inner string, can be allocated or not
 5  GC_OBJ_STR = 0b001,
 6  // list has zero or more children
 7  GC_OBJ_LIST = 0b010,
 8  // map holds allocated buckets with owned children
 9  GC_OBJ_MAP = 0b011,
10} ObjType;

따라서 할당은 다음과 같은 형태가 된다:

TEXT

1       allocation 
 2      (raw pointer)
 3            |
 4            |
 5            v
 6 +----------------------+
 7 | GcHeader             | <-- header
 8 +----------------------+
 9 |                      |
10 | payload (size B)     | <-- data handed out as ptr to the user
11 |                      |
12 +----------------------+

각 GcHeader는 32B이므로, 모든 힙 할당은 32B의 오버헤드가 추가된다.

1void *gc_request(Gc *gc, size_t size, ObjType t) {
 2  void *allocation = gc->old->request(gc->old->ctx, size + sizeof(GcHeader));
 3  void *payload = (char *)allocation + sizeof(GcHeader);
 4  GcHeader *h = (GcHeader *)allocation;
 5  h->type = t;
 6  h->marked = 0;
 7  h->size = size;
 8  h->next = gc->head;
 9  gc->head = h;
10  gc->allocated_since_last_cycle += size;
11  gc->allocated += size;
12  return (void *)payload;
13}

(Almost) Zero cost abstractions and small Values

purple-garden 런타임의 값(Value)은 타입 인코딩을 위해 3비트를 사용한다:

1typedef enum {
 2  V_NONE,   // zero value, comparable to rusts Option::None
 3  V_STR,    // string view
 4  V_DOUBLE, // floating point number
 5  V_INT,    // integer
 6
 7  V_TRUE,   // booleans
 8  V_FALSE,
 9
10  V_ARRAY,  // containers / adts
11  V_OBJ,
12} ValueType;

1typedef struct Value {
 2  unsigned int type : 3;
 3  union {
 4    Str string;
 5    List *array;
 6    Map *obj;
 7    double floating;
 8    int64_t integer;
 9  };
10} Value;

그리고 값을 optional로 표시하기 위한 단일 비트 is_some을 추가해, 모든 타입이 optional 표현을 가질 수 있게 한다:

1typedef struct Value {
 2  unsigned int type : 3;
 3  unsigned int is_some : 1;
 4  union {
 5    Str string;
 6    List *array;
 7    Map *obj;
 8    double floating;
 9    int64_t integer;
10  };
11} Value;

1inline bool Value_is_opt(const Value *v) {
2  return v->type == V_NONE || v->is_some;
3}

NIX

1# optionals
2var opt_none = std::None()
3var opt_some = std::Some([])
4std::opt::or(opt_none "anything else") # -> "anything else"
5std::opt::unwrap(opt_some) # -> []
6std::opt::expect(opt_some "would panic with this message") # -> []
7std::opt::is_some(opt_some) # -> true 
8std::opt::is_none(opt_none) # -> true

1static void pg_builtin_opt_some(Vm *vm) {
 2  Value inner = ARG(0);
 3  inner.is_some = true;
 4  RETURN(inner);
 5}
 6
 7// INTERNED_NONE is a static Value instance thats always available to the runtime
 8
 9static void pg_builtin_opt_none(Vm *vm) { RETURN(*INTERNED_NONE); }
10
11static void pg_builtin_opt_or(Vm *vm) {
12  Value lhs = ARG(0);
13  Value rhs = ARG(1);
14  ASSERT(Value_is_opt(&lhs), "Or: lhs wasnt an Optional");
15  if (!lhs.is_some) {
16    RETURN(rhs);
17  } else {
18    lhs.is_some = false;
19    RETURN(lhs);
20  }
21}

Differentiating heap and non-heap values

따라서 힙 값과 비힙 값을 구분하는 마커가 필요하다: Value.is_heap. 이 비트가 켜진 값만 마킹·컬렉션 대상이며, GC의 copy 및 rewrite 단계로 전달된다.

1typedef struct Value {
 2  unsigned int is_heap : 1;
 3  unsigned int type : 3;
 4  union {
 5    Str string;
 6    List *array;
 7    Map *obj;
 8    double floating;
 9    int64_t integer;
10  };
11} Value;

Dynamically growable bump allocation

String abstraction

1typedef struct __Str {
2  uint64_t hash;
3  uint32_t len;
4  const uint8_t *p;
5} Str;

Str.length는 생성 시 계산된다. 런타임 컴파일 시점에 알려진 정적 문자열(에러 메시지, 값 타입 이름의 문자열 표현 등)은 STRING 매크로로 전달한다:

1#define STRING(str)                                                            \
2  ((Str){.len = sizeof(str) - 1, .p = (const uint8_t *)str, .hash = 0})

런타임 입력에 포함된 문자열에 대해서는 다른 접근을 사용한다. 예를 들어 다음을 보자:

NIX

1std::println("Hello" "World")

1// ! error handling omitted for clarity
 2string: {
 3    // skip "
 4    l->pos++;
 5    size_t start = l->pos;
 6    uint64_t hash = FNV_OFFSET_BASIS;
 7
 8    char cc = l->input.p[l->pos];
 9    for (; cc > 0 && cc != '"'; l->pos++, cc = l->input.p[l->pos]) {
10        hash ^= cc;
11        hash *= FNV_PRIME;
12    }
13
14    Token *t = CALL(a, request, sizeof(Token));
15    *t = (Token){0};
16    t->type = T_STRING;
17    t->string = (Str){
18        .p = l->input.p + start,
19        .len = l->pos - start,
20        .hash = hash,
21    };
22
23    l->pos++;
24    return t;
25}

해싱은 FNV-1a로 한다. 런타임에 생성된 문자열의 해시를 계산하기 위해 보완적인 Str_hash 함수도 제공한다:

1inline uint64_t Str_hash(const Str *str) {
2  uint64_t hash = FNV_OFFSET_BASIS;
3  for (size_t i = 0; i < str->len; i++) {
4    hash ^= (uint64_t)str->p[i];
5    hash *= FNV_PRIME;
6  }
7  return hash;
8}

이 추상화 위에는 작은 내부 API가 구축되어 있다:

1char Str_get(const Str *str, size_t index);
2Str Str_from(const char *s);
3Str Str_slice(const Str *str, size_t start, size_t end);
4Str Str_concat(const Str *a, const Str *b, Allocator *alloc);
5bool Str_eq(const Str *a, const Str *b);
6void Str_debug(const Str *str);
7int64_t Str_to_int64_t(const Str *str);
8double Str_to_double(const Str *str);

그리고 str 표준 라이브러리 패키지는 str::append, str::lines, str::slice를 노출한다:

NIX

1std::str::append("Hello" " World") # Hello World
2std::str::slice("Hello" 0 2) # hel
3std::str::slice("Hello" 1 2) # el
4std::str::lines("hello\nworld") # [hello world]

Triggering garbage collection

1    case OP_RET: {
 2      Frame *old = vm->frame;
 3      if (vm->frame->prev) {
 4        vm->pc = vm->frame->return_to_bytecode;
 5        vm->frame = vm->frame->prev;
 6      }
 7
 8      // [...] free lists for stack frames and stuff
 9
10      if (!vm->config.disable_gc &&
11          vm->gc->allocated >= (vm->config.gc_size * vm->config.gc_limit)) {
12        gc_cycle(vm->gc);
13      }
14      break;
15    }

1  if (!vm->config.disable_gc) {
2    gc_cycle(vm->gc);
3  }
4}

NIX

1var combination = std::str::append("Hello" " " "World")
2std::println(combination)
3std::runtime::gc::cycle()

Marking a Value

앞서 설명했듯 is_heap이 켜진 값만 루트로 고려되고 GC가 관리해야 한다. 따라서 mark는 빠른 탈출 조건을 가진다:

1static inline void mark(Gc *gc, const Value *val) {
2  if (!val || !val->is_heap) {
3    return;
4  }
5
6  // [...]
7}

Str은 GC가 관리하는 내부 버퍼(GC_OBJ_RAW)를 감싸지만, Str 자체는 할당되지 않으므로 V_ARRAY/V_OBJ와 다르게 처리해야 한다:

1static inline void mark(Gc *gc, const Value *val) {
 2  // [...]
 3
 4  void *payload = NULL;
 5  switch ((ValueType)val->type) {
 6  case V_STR:
 7    GcHeader *raw = (GcHeader *)((uint8_t *)val->string.p - sizeof(GcHeader));
 8    raw->marked = true;
 9    break;
10    return;
11  case V_ARRAY:
12    payload = (void *)val->array;
13    break;
14  case V_OBJ:
15    payload = (void *)val->obj;
16    break;
17  default:
18    return;
19  }
20
21  // [...]
22}

1static inline void mark(Gc *gc, const Value *val) {
 2  // [...]
 3
 4  ASSERT((uintptr_t)payload > sizeof(GcHeader),
 5         "payload too small, GC logic bug, this shouldnt happen");
 6
 7  GcHeader *h = (GcHeader *)((char *)payload - sizeof(GcHeader));
 8  if (!h || h->marked) {
 9    return;
10  }
11
12  h->marked = 1;
13
14  // [...]

문자열은 이미 처리했으므로, 이제 GcHeader->type 비트를 기준으로 스위치하여 배열의 각 멤버와 객체의 각 값을 재귀적으로 마킹한다.

1static inline void mark(Gc *gc, const Value *val) {
 2  // [...]
 3
 4  switch ((ObjType)h->type) {
 5  case GC_OBJ_LIST: {
 6    List *l = (List *)payload;
 7    for (size_t i = 0; i < l->len; i++) {
 8      mark(gc, &l->arr[i]);
 9    }
10    break;
11  }
12  case GC_OBJ_MAP:
13    Map *m = (Map *)payload;
14    for (size_t i = 0; i < m->cap; i++) {
15      MapEntry e = m->buckets[i];
16      mark(gc, &e.value);
17    }
18  default:
19    return;
20  }
21}

Letting mark loose on registers and call frames

1void gc_cycle(Gc *gc) {
 2  if (!gc->allocated_since_last_cycle) {
 3    return;
 4  }
 5  gc->allocated_since_last_cycle = 0;
 6  Vm *vm = ((Vm *)gc->vm);
 7  for (size_t i = 0; i < REGISTERS; i++) {
 8    const Value *ri = vm->registers + i;
 9    mark(gc, ri);
10  }
11
12  for (Frame *f = vm->frame; f; f = f->prev) {
13    for (size_t i = 0; i < f->variable_table.cap; i++) {
14      MapEntry *me = &f->variable_table.buckets[i];
15      if (me->hash) {
16        mark(gc, &me->value);
17      }
18    }
19  }
20
21  // [...]
22}

루트 셋이 참조하는 모든 Value 및 GcHeader를 마킹한 뒤, 이들을 to-space/new 할당기로 복사한다. 또한 복사된 헤더 체인을 다시 구축한다.

1void gc_cycle(Gc *gc) {
 2    // [...]
 3
 4  GcHeader *new_head = NULL;
 5  size_t new_alloc = 0;
 6  for (GcHeader *h = gc->head; h; h = h->next) {
 7    if (!h->marked) {
 8      continue;
 9    }
10
11    void *buf = CALL(gc->new, request, h->size + sizeof(GcHeader));
12    GcHeader *nh = (GcHeader *)buf;
13    void *new_payload = (char *)buf + sizeof(GcHeader);
14    memcpy(nh, h, sizeof(GcHeader) + h->size);
15    nh->next = new_head;
16    new_head = nh;
17    h->forward = (uintptr_t)new_payload;
18    nh->forward = 0;
19    nh->marked = 0;
20    new_alloc += h->size;
21  }
22
23    // [...]
24}

Forwarding pointers and Rewriting references

mark와 rewrite_nested는 스택을 터뜨릴 수 있으므로, 둘 다 향후에는 워크리스트(work list)를 사용할 예정이다.

각 GcHeader.forward는 새로 복사된 영역을 가리키도록 모든 참조를 업데이트하는 데 사용된다:

1static inline void *forward_ptr(void *payload) {
 2  if (!payload) {
 3    return NULL;
 4  }
 5
 6  GcHeader *old = (GcHeader *)((char *)payload - sizeof(GcHeader));
 7  if (!old) {
 8    // not a gc object, hitting this is a bug
 9    unreachable();
10  }
11
12  if (old->type < GC_OBJ_RAW || old->type > GC_OBJ_MAP) {
13    // either already in newspace or not a heap object; return payload unchanged
14    return payload;
15  }
16
17  if (old->forward) {
18    return (void *)old->forward;
19  }
20
21  // normally this would be unreachable, but since pg doesnt clear registers
22  // after insertions into adts or the variable table, references to heap data
23  // can be both in the variable table and registers at the same time, thus
24  // allowing for multiple forward_ptr calls since there are multiple references
25  // to a single point in memory. This results in double forwarding and other
26  // shenanigans. Just returning the payload if no forward was found is correct
27  // and a fix.
28  return payload;
29}

rewrite는 가능한 한 단순하다:

1static inline void rewrite(Gc *gc, Value *v) {
 2  if (!v->is_heap) {
 3    return;
 4  }
 5
 6  switch ((ValueType)v->type) {
 7  case V_STR:
 8    v->string.p = (const uint8_t *)forward_ptr((void *)v->string.p);
 9    break;
10  case V_ARRAY:
11    v->array = (List *)forward_ptr((void *)v->array);
12    break;
13  case V_OBJ:
14    v->obj = (Map *)forward_ptr((void *)v->obj);
15    break;
16  default:
17    return;
18  }
19}

rewrite_nested는 객체와 배열에 대해 중첩하여 재작성 과정을 수행하는 점에서 rewrite를 확장한다:

1void rewrite_nested(Gc *gc, Value *v) {
 2  rewrite(gc, v);
 3
 4  switch (v->type) {
 5  case V_ARRAY:
 6    for (size_t i = 0; i < v->array->len; i++) {
 7      rewrite_nested(gc, &v->array->arr[i]);
 8    }
 9    break;
10  case V_OBJ:
11    for (size_t i = 0; i < v->obj->cap; i++) {
12      MapEntry *me = &v->obj->buckets[i];
13      if (me->hash) {
14        rewrite_nested(gc, &me->value);
15      }
16    }
17    break;
18  default:
19    break;
20  }
21}

이를 gc_cycle에서 합치면, 레지스터들, 각 프레임의 변수 테이블 엔트리들, 그리고 GcHeader 체인의 각 헤더에 대해 모든 Value가 재작성된다.

1void gc_cycle(Gc *gc) {
 2  // [...]
 3
 4  for (size_t i = 0; i < REGISTERS; i++) {
 5    Value *ri = &vm->registers[i];
 6    rewrite(gc, ri);
 7  }
 8
 9  for (Frame *f = vm->frame; f; f = f->prev) {
10    for (size_t i = 0; i < f->variable_table.cap; i++) {
11      MapEntry *me = &f->variable_table.buckets[i];
12      if (me->hash) {
13        rewrite_nested(gc, &me->value);
14      }
15    }
16  }
17
18  for (GcHeader *h = new_head; h; h = h->next) {
19    switch (h->type) {
20    case GC_OBJ_LIST: {
21      List *l = (List *)((uint8_t *)h + sizeof(GcHeader));
22      for (size_t i = 0; i < l->len; i++) {
23        rewrite_nested(gc, &l->arr[i]);
24      }
25      break;
26    }
27    case GC_OBJ_MAP: {
28      Map *m = (Map *)((uint8_t *)h + sizeof(GcHeader));
29      for (size_t i = 0; i < m->cap; i++) {
30        MapEntry *me = &m->buckets[i];
31        if (me->hash) {
32          rewrite_nested(gc, &me->value);
33        }
34      }
35      break;
36    }
37    case GC_OBJ_STR: {
38      Str *str = (Str *)((uint8_t *)h + sizeof(GcHeader));
39      str->p = forward_ptr((void *)str->p);
40      break;
41    }
42    default:
43      break;
44    }
45  }
46
47  // [...]
48}

Bookkeeping after collection

1void gc_cycle(Gc *gc) {
2  // [...]
3  gc->head = new_head;
4  SWAP_STRUCT(gc->old, gc->new);
5  CALL(gc->new, reset);
6  gc->allocated = new_alloc;
7}

mark, copy, rewrite 이후 새 체인을 Gc.head에 연결한다.

SWAP_STRUCT로 from/to-space를 스왑한다:

1#define SWAP_STRUCT(A, B)                                                      \
2  do {                                                                         \
3    _Static_assert(__builtin_types_compatible_p(typeof(A), typeof(B)),         \
4                   "SWAP_STRUCT arguments must have identical types");         \
5                                                                               \
6    typeof(A) __swap_tmp = (A);                                                \
7    (A) = (B);                                                                 \
8    (B) = __swap_tmp;                                                          \
9  } while (0)

이제 (새로운 new가 된, 즉 이전 old였던) 할당기는 CALL로 reset한다. CALL은 매크로다:

1#define CALL(SELF, METHOD, ...) (SELF)->METHOD((SELF)->ctx, ##__VA_ARGS__)

이는 (gc->new)->reset((gc->new)->ctx);로 확장된다.

Tuning

가상 머신이 VmConfig로 설정 가능하듯, GC도 설정 가능하다:

1typedef struct {
 2  // defines the maximum amount of memory purple garden is allowed to allocate,
 3  // if this is hit, the vm exits with a non zero code
 4  uint64_t max_memory;
 5  // define gc heap size in bytes
 6  uint64_t gc_size;
 7  // gc threshold in percent 5-99%
 8  double gc_limit;
 9  // disables the garbage collector
10  bool disable_gc;
11
12  // [...]
13} Vm_Config;

1#define CLI_ARGS                                                               \
2  X(gc_max, 0, GC_MIN_HEAP * 64l, LONG,                                        \
3    "set hard max gc space in bytes, default is GC_MIN_HEAP*64")               \
4  X(gc_size, 0, GC_MIN_HEAP * 2l, LONG, "define gc heap size in bytes")        \
5  X(gc_limit, 0, 70.0, DOUBLE,                                                 \
6    "instruct memory usage amount for gc to start collecting, in percent "     \
7    "(5-99%)")                                                                 \
8  X(no_gc, 0, false, BOOL, "disable garbage collection")

수정된 6cl 라이브러리로 깔끔한 도움말 텍스트를 생성한다(명확성을 위해 다른 옵션은 생략):

TEXT

1usage ./build/purple_garden_debug: [ +gc_max <1638400>] [ +gc_size <51200>]
 2                                   [ +gc_limit <70>] [ +no_gc]
 3                                   // [...]
 4                                   <file.garden>
 5
 6Option:
 7          +gc_max <1638400>
 8                set hard max gc space in bytes, default is GC_MIN_HEAP*64
 9          +gc_size <51200>
10                define gc heap size in bytes
11          +gc_limit <70>
12                instruct memory usage amount for gc to start collecting, in
13                percent (5-99%)
14          +no_gc
15                disable garbage collection
16          // [...]
17
18          +h/+help
19                help page and usage
20Examples:
21        ./build/purple_garden_debug +gc_max 1638400 +gc_size 51200 \
22                                    +gc_limit 0 +no_gc

사용자가 이를 설정하면, purple-garden의 메인 엔트리 포인트가 CLI 옵션으로 VmConfig를 구성한다:

1Vm vm = Vm_new(
2  (Vm_Config){
3      .gc_size = a.gc_size,
4      .gc_limit = a.gc_limit,
5      .disable_gc = a.no_gc,
6      .max_memory = a.gc_max,
7      .disable_std = a.no_std,
8      .no_env = a.no_env,
9  }, pipeline_allocator, &gc);

Comparing mgc to other gc idioms and other reasoning

여기서 가장 obvious한 질문은 목적과 복잡성이다. 왜 여러 패러다임을 활용하는 GC를 설계했는가? 각 패러다임 하나만으로도 충분해 보이는데.

Why Bump allocation instead of malloc

Why Copying, Compacting and resetting whole Heap regions instead of single value dealloc

Why Register and variable tracing vs heap scanning

Why not X

write barriers/incremental: 동시(concurrent) 컬렉터에 필요하지만, purple-garden은 단일 스레드이고 stop-the-world다. 컬렉션 시점에 모든 객체가 알려져 있다.
generational: 장수 객체에 최적화되지만, purple-garden은 임베더블 스크립팅 언어다. 게임에서 각 이벤트/엔티티마다 실행되는 플러그인 같은 용도로 빠르고 저렴하게 실행되도록 설계되었다. 짧은 실행과 그에 수반되는 할당에 특별히 초점을 맞춘다.
RC/ARC: 런타임 오버헤드가 크고 사이클은 별도의 비(非)자명한 탐지 로직이 필요해 핫패스를 느리게 만든다.
escape analysis: 힙 할당 감소에 최적화되지만, mgc는 할당 자체를 매우 싸게 만든다. 또한 purple-garden의 바이트코드는 AST에서 컴팩트한 포맷으로 컴파일되며, escape analysis를 수행할 만큼의 수명 정보가 있는 즉시 표현이 없다.

Gc invariants and assumptions

완전성을 위해 포함한 짧은 목록이다. 이 불변조건 중 하나라도 위반하면 정의되지 않은 동작(UB)이 발생한다.

레지스터가 클리어되지 않는다는 점에 의존
값이 잘못 복제되지 않는다는 점
Value.type=V_STR 및 Value.is_heap=1은 문자열이 힙에 할당됨을 의미하고, 그렇지 않으면 렉서가 ‘할당’한 것(기술적으로는 입력을 mmap하여 프로세스 메모리에 올린 것이지만 요지는 그렇다)
Value.is_heap 설정에 대한 VM 규율(discipline)

(non-)Portability to Rust and Design Improvements

나는 현재 여러 이유로 purple-garden을 Rust로 재작성하고 있다(언어 런타임을 만들어 본 적이 없다면 이해하기 쉽지 않은 이유들이다):

추상화가 줄어든다. C에서는 해싱, 해시맵, 배열, 배열 리스트 등을 전부 직접 만들어야 했는데, Rust에서는 std::collection으로 대체할 수 있다.
Rust에서는 Vec<Op>를 쓸 수 있어 append/insert가 내가 구현한 것보다 훨씬 좋아, 바이트코드 컴파일러가 더 낫고 더 짧아진다.
더 나은 바이트코드 포맷과 각 명령당 낭비되는 바이트 감소. 예를 들어 RET는 1바이트, ADD rA, rB 같은 것은 여러 바이트로 할 수 있다:

RUST

1enum Op<'vm> {
 2    Add {
 3        dst: u8,
 4        lhs: u8,
 5        rhs: u8,
 6    },
 7    // [...]
 8    Ret {
 9        /// used for peephole optimisation, merging multiple RET into a single
10        /// RET with a count
11        times: u8,
12    },
13}

각 컴포넌트에서 훨씬 나은 에러 핸들링. 현재 C 인터프리터는 assertion으로 그냥 abort한다.
Rust 해시맵을 통해 컴파일 타임 값 interning이 쉬워진다(기존에는 서로 다른 타입에 대해 해시맵 3개를 썼는데, 이제는 std::collections::HashMap 하나를 사용).
ARGS 명령은 더 이상 필요 없다. builtin과 사용자 함수 호출에 대한 인자 레지스터 오프셋과 개수를 인코딩하기 위해 ARGS가 있었지만, 이제는 SYS와 CALL이 명령 자체에 오프셋과 인자 수를 인코딩한다:

RUST

1enum Op<'vm> {
 2    // [...]
 3    Call {
 4        func: u16,
 5        args_start: u8,
 6        args_len: u8,
 7    },
 8    Sys {
 9        ptr: BuiltinFn<'vm>,
10        args_start: u8,
11        args_len: u8,
12    },
13}

SYS의 builtin 함수는 더 이상 타입 소거(type erasure), 캐스팅, 컴파일러 내 간접 참조가 필요 없다. 이전에는 64비트 포인터를 word에 저장할 수 없어서 컴파일러가 builtin 포인터 배열을 만들고 바이트코드에 그 인덱스를 넣어 VM이 올바른 builtin을 호출했는데, 이제는 단순한 함수 포인터가 된다. 바이트코드 레퍼런스는 위에 있으며, 타입 정의는 아래와 같다:

RUST

1type BuiltinFn<'vm> = fn(&mut Vm<'vm>, &[Value]);

맨체스터 가비지 컬렉터와 purple-garden의 런타임

Intro

Purple-garden

Garbage collection stages

Mark

Copy

Rewriting references

Reference implementation

Keeping track of allocations

(Almost) Zero cost abstractions and small Values

Differentiating heap and non-heap values

Dynamically growable bump allocation

String abstraction

Triggering garbage collection

Marking a Value

Letting mark loose on registers and call frames

Forwarding pointers and Rewriting references

Bookkeeping after collection

Tuning

Comparing mgc to other gc idioms and other reasoning

Why Bump allocation instead of malloc

Why Copying, Compacting and resetting whole Heap regions instead of single value dealloc

Why Register and variable tracing vs heap scanning

Why not X

Gc invariants and assumptions

(non-)Portability to Rust and Design Improvements

관련 추천 글

Alkyne GC · mcyoung

그린 티 가비지 컬렉터 - Go 프로그래밍 언어

Go 가비지 컬렉터 안내서 - Go 프로그래밍 언어

가비지 컬렉션은 역발상이다 · Nova

Intro

Purple-garden

Garbage collection stages

Mark

Copy

Rewriting references

Reference implementation

Keeping track of allocations

(Almost) Zero cost abstractions and small Values

Differentiating heap and non-heap values

Dynamically growable bump allocation

String abstraction

Triggering garbage collection

Marking a Value

Letting mark loose on registers and call frames

Forwarding pointers and Rewriting references

Bookkeeping after collection

Tuning

Comparing mgc to other gc idioms and other reasoning

Why Bump allocation instead of malloc

Why Copying, Compacting and resetting whole Heap regions instead of single value dealloc

Why Register and variable tracing vs heap scanning

Why not X

Gc invariants and assumptions

(non-)Portability to Rust and Design Improvements

관련 추천 글

Alkyne GC · mcyoung

그린 티 가비지 컬렉터 - Go 프로그래밍 언어

Go 가비지 컬렉터 안내서 - Go 프로그래밍 언어

가비지 컬렉션은 역발상이다 · Nova