URL: https://grafana.com/blog/how-prometheus-remote-write-v2-can-help-cut-network-egress-costs-by-as-much-as-50-/

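How Prometheus Remote Write v2 can help cut network egress costs by as much as 50%

In 2021, Grafana Labs CTO Tom Wilkie (then VP of Product) spoke at PromCon about the need to improve Prometheus' remote write functionality.

“We use 10 to 20 bytes to send a single sample over remote write, while Prometheus uses only 1 to 2 bytes per sample on local disk. There is a lot, a lot of room for improvement,” Wilkie said at the time. “Much of the work we want to do around atomicity and batching will let us put a symbol table in the remote write requests, and that will reduce bandwidth usage.”

Nearly five years later, we're happy to report that the work to ease these bandwidth constraints is paying off. Prometheus Remote Write v2 was proposed in 2024, and although it is still experimental, Prometheus backends and telemetry collectors are already adopting it and seeing notable benefits (namely, significant cost savings!).

In this post, we explain the advantages of v2 and show how to enable it in Alloy. We also share the dramatic improvement we saw in our own network egress costs and how your organization can achieve similar savings.

What is remote write, and what's good about v2?

When you want to send metrics to a Prometheus backend, you use Prometheus Remote Write. The remote write v1 protocol does a great job of sending metric samples, but it was designed before metric metadata (metric type, unit, and help text) became as essential as it is today. It is also not the most efficient wire protocol: every sample carries a lot of repeated text, and that adds up to very large payloads, as the series of a single histogram show: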

request_size_bytes_bucket{method="POST",response_code="200",server_address="otlp.example.com",le="0"}
request_size_bytes_bucket{method="POST",response_code="200",server_address="otlp.example.com",le="5"}
request_size_bytes_bucket{method="POST",response_code="200",server_address="otlp.example.com",le="10"}
request_size_bytes_bucket{method="POST",response_code="200",server_address="otlp.example.com",le="25"}
request_size_bytes_bucket{method="POST",response_code="200",server_address="otlp.example.com",le="50"}
...
request_size_bytes_bucket{method="POST",response_code="200",server_address="otlp.example.com",le="10000"}
request_size_bytes_bucket{method="POST",response_code="200",server_address="otlp.example.com",le="+Inf"}
request_size_bytes_sum{method="POST",response_code="200",server_address="otlp.example.com"}

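Remote write v2 adds first-class support for metadata in the sample payload. But the real efficiency gains and cost savings come from the symbol table implementation that Wilkie mentioned in his 2021 talk. Each distinct string is sent once in the symbol table, and every series refers to it by index: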

symbols: ["request_size_bytes_bucket", "method", "POST", "response_code", "200", "server_address", "otlp.example.com", "le", "0", "5", "10", "25", "50", ... "10000", "+Inf", "request_size_bytes_sum"]

0{1=2,3=4,5=6,7=8}
0{1=2,3=4,5=6,7=9}
0{1=2,3=4,5=6,7=10}
0{1=2,3=4,5=6,7=11}
0{1=2,3=4,5=6,7=12}
...
0{1=2,3=4,5=6,7=13}
0{1=2,3=4,5=6,7=14}
15{1=2,3=4,5=6}

The more strings are repeated across metric names, label names, label values, and metadata, the greater the efficiency gain over the original remote write format.
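To make the symbol table idea concrete, here is a minimal Python sketch of string interning over a small batch of series. It mirrors the simplified notation used in the example above rather than the actual v2 protobuf layout, and the `intern_series` function and its types are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def intern_series(series: List[Tuple[str, Labels]]) -> Tuple[List[str], List[List[int]]]:
    """Replace every string in a batch of series with an index into a shared symbol table."""
    symbols: List[str] = []
    index: Dict[str, int] = {}

    def ref(text: str) -> int:
        # Store each distinct string once; later uses reuse its index.
        if text not in index:
            index[text] = len(symbols)
            symbols.append(text)
        return index[text]

    encoded: List[List[int]] = []
    for name, labels in series:
        # First reference is the metric name, followed by name/value pairs.
        refs = [ref(name)]
        for key, value in labels.items():
            refs.extend((ref(key), ref(value)))
        encoded.append(refs)
    return symbols, encoded


# The same labels as in the histogram example, with three buckets and the sum.
common = {"method": "POST", "response_code": "200", "server_address": "otlp.example.com"}
batch = [("request_size_bytes_bucket", {**common, "le": le}) for le in ("0", "5", "10")]
batch.append(("request_size_bytes_sum", dict(common)))

symbols, encoded = intern_series(batch)
print(symbols)
print(encoded)
```

Every series after the first adds at most one or two new strings to the table, so the long label text is paid for once per request instead of once per series.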

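Why did this matter to Grafana?

Running Grafana Cloud generates an enormous amount of telemetry! We monitor millions of active series at 1 to 4 data points per minute (DPM), and this telemetry translates into significant network egress.

So last fall, Grafana Labs migrated all of its internal Prometheus monitoring workloads from remote write v1 to remote write v2. In exchange for a modest 5% to 10% increase in CPU and memory usage, this simple change cut the network egress costs of our internal telemetry by more than 50%. Given the rates the major cloud providers charge, the extra resource cost was negligible, while the network savings were very large.

Image 1: A time series graph on a Grafana dashboard in which the line drops from above 400 to 200

Note: If you see a different traffic reduction when you apply v2, you can experiment with the batching settings of the prometheus.remote_write component. Larger batches are likely to show a greater reduction in traffic.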
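The effect of batch size can be illustrated with a rough back-of-the-envelope model in Python. It counts only label text, assumes one byte per symbol reference, and ignores sample values, timestamps, protobuf framing, and compression, so the absolute numbers are not predictions. The workload, label names, and function names are all hypothetical. In Alloy, batching is tuned in the queue_config block of the endpoint (for example, max_samples_per_send); check the component reference for the settings available in your version.

```python
from typing import Callable, Dict, List

Series = Dict[str, str]  # label name -> label value, including "__name__"


def v1_string_bytes(batch: List[Series]) -> int:
    """Bytes of label text when every series repeats all of its strings."""
    return sum(len(k) + len(v) for s in batch for k, v in s.items())


def v2_string_bytes(batch: List[Series]) -> int:
    """Bytes of label text when each distinct string is sent once per request,
    plus an assumed 1 byte per symbol reference."""
    distinct = {text for s in batch for kv in s.items() for text in kv}
    refs = sum(2 * len(s) for s in batch)
    return sum(len(text) for text in distinct) + refs


def total_bytes(series: List[Series], batch_size: int,
                per_batch: Callable[[List[Series]], int]) -> int:
    """Split the series into requests of batch_size and add up the cost of each."""
    return sum(per_batch(series[i:i + batch_size])
               for i in range(0, len(series), batch_size))


# Hypothetical workload: 1,000 series that differ only in the "pod" label.
workload = [
    {"__name__": "http_requests_total", "cluster": "prod-us-east-0", "pod": f"api-{i:04d}"}
    for i in range(1000)
]

for size in (10, 100, 1000):
    v1 = total_bytes(workload, size, v1_string_bytes)
    v2 = total_bytes(workload, size, v2_string_bytes)
    print(f"batch={size:5d}  v1={v1}  v2={v2}  v2/v1={v2 / v1:.2f}")
```

In this model the v1 cost is the same at every batch size, while the v2 cost falls as batches grow, because each request carries its own symbol table and a larger batch spreads the shared strings over more series.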

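Why does this matter to you?

Observability costs can add up quickly, and teams often struggle to decide which telemetry is essential and which can be dropped. Remote write v2, however, is one of those changes that needs no careful evaluation or difficult discussion. You enable a new experimental feature and see the savings right away.

Note: If you're looking for ways to get more value out of your observability setup, Grafana Cloud has several features designed to reduce and optimize costs.

Enabling remote write v2 in Alloy

The remote write v2 specification is currently experimental in upstream Prometheus, and therefore experimental in Alloy as well. Both upstream Prometheus and Mimir support the current specification, but breaking changes may still happen before the specification's final release. For this reason, to enable remote write v2 in Alloy, you must configure Alloy to run with the --stability.level=experimental runtime flag.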

Alloy

After adding the experimental runtime flag, update the endpoint block of your prometheus.remote_write component and add the protobuf_message attribute with the value io.prometheus.write.v2.Request. For example:

prometheus.remote_write "grafana_cloud" {
  endpoint {
    protobuf_message = "io.prometheus.write.v2.Request"
    url = "https://example-prod-us-east-0.grafana.net/api/prom/push"

    basic_auth {
      username = "stack_id"
      password = sys.env("GCLOUD_RW_API_KEY")
    }
  }
}

It is just as simple with the Alloy Helm chart:

image:
  registry: "docker.io"
  repository: grafana/alloy
  tag: latest
alloy:
  ...
  configMap:
    content: |-
        ...

        prometheus.remote_write "metrics_service" {
          endpoint {
            protobuf_message = "io.prometheus.write.v2.Request"
            url = "https://example-prod-us-east-0.grafana.net/api/prom/push"

            basic_auth {
              username = "stack_id"
              password = sys.env("GCLOUD_RW_API_KEY")
            }
          }
        }

Kubernetes Monitoring Helm Chart

In the upcoming Kubernetes Monitoring Helm chart v3.8, there are two ways to configure a Prometheus destination to use remote write v2. You can set protobufMessage on the destination, the same way as in Alloy. Alternatively, you can use a shortcut and define remoteWriteProtocol on the destination, which outputs the correct protobufMessage in the rendered configuration.

destinations:
  - name: grafana-cloud-metrics
    type: prometheus
    url: https://example-prod-us-east-0.grafana.net/api/prom/push
    remoteWriteProtocol: 2
  - name: grafana-cloud-metrics-again
    type: prometheus
    url: https://example-prod-us-east-0.grafana.net/api/prom/push
    protobufMessage: io.prometheus.write.v2.Request

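What's next for Prometheus Remote Write?

We have been very encouraged by the benefits we're seeing from remote write v2, and we hope you can take advantage of them too. Beyond the v2 specification, further improvements to remote write are planned. For example:

Resource utilization and reliability improvements for the Prometheus Agent and Alloy's prometheus.remote_write component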
- tsdb/agent: Prevent unread segments from being truncated · Issue #17616

Grafana Cloud is the easiest way to get started with metrics, logs, traces, dashboards, and more. We have a generous forever-free tier and plans for every use case. Sign up for free now!
