Claude SWE-Bench 성능

Anthropic 엔지니어링

업그레이드된 Claude 3.5 Sonnet은 SWE-bench Verified라는 소프트웨어 엔지니어링 평가에서 49%를 기록하여, 기존 최고 모델의 45%를 넘었습니다. 이 글에서는 해당 모델을 기반으로 구축한 "에이전트" 구조를 설명하고, 개발자들이 Claude 3.5 Sonnet의 성능을 최대로 끌어낼 수 있도록 돕는 것을 목적으로 합니다.

SWE-bench는 실제 소프트웨어 엔지니어링 작업 완수 능력을 평가하는 AI 벤치마크입니다. 구체적으로는, 인기 오픈소스 Python 저장소의 GitHub 이슈를 실제로 해결할 수 있는지 테스트합니다. 벤치마크의 각 과제마다, AI 모델은 준비된 Python 환경과 이슈가 해결되기 전 시점의 저장소(로컬 작업본)를 받게 됩니다. 모델은 코드를 이해하고, 수정하고, 테스트를 거쳐 최종 솔루션을 제출해야 합니다.

각 솔루션은 원래 GitHub 이슈를 닫은 Pull Request의 실제 유닛 테스트로 채점됩니다. 즉, AI 모델이 인간 PR 작성자와 동등한 기능을 달성했는지 확인하는 것입니다.

SWE-bench는 AI 모델만 평가하는 것이 아니라, 전체 "에이전트" 시스템을 평가합니다. 여기서 "에이전트"란 AI 모델과 그 주변 소프트웨어 스캐폴딩(scaffolding, 보조 구조물)의 조합을 의미합니다. 이 스캐폴딩은 모델에 입력 프롬프트(명령)를 생성하고, 모델의 출력을 해석하여 액션을 취하며, 모델의 이전 행위 결과를 다음 프롬프트에 반영하는 상호작용 루프를 관리합니다. 같은 AI 모델을 써도 이 스캐폴딩 설계에 따라 에이전트의 성능 차이가 크게 날 수 있습니다.

코딩 능력 평가 벤치마크는 다양하지만, SWE-bench가 인기를 끄는 데에는 다음과 같은 이유가 있습니다:

실제 프로젝트의 엔지니어링 작업을 사용하며, 코딩 대회나 인터뷰 문제와 다릅니다.
아직 최고점이 멀었습니다 — SWE-bench Verified에서 50%를 넘긴 모델이 없습니다(2024년 6월 기준 Claude 3.5 Sonnet이 49% 기록).
모델 자체가 아니라 전체 "에이전트"를 평가합니다. 오픈소스 개발자와 스타트업들이 동일한 모델에 더 잘 어울리는 보조구조를 최적화하여 훨씬 우수한 성능을 내고 있습니다.

원래 SWE-bench 데이터셋에는 GitHub 이슈 외의 추가 문맥 없이는 풀 수 없는 문제가 일부 포함되어 있습니다(예: 반환할 특정 에러 메시지 등). SWE-bench-Verified는 500개의 문제를 인간이 직접 검토해서 반드시 풀 수 있도록 한 서브셋으로, 코딩 에이전트의 실질 성능을 가늠하는 가장 분명한 기준입니다. 이 글에서 언급하는 벤치마크도 SWE-bench Verified를 기준으로 합니다.

최첨단 성능 달성하기

툴 유징 에이전트

업그레이드된 Claude 3.5 Sonnet에 최적화한 에이전트 스캐폴딩 설계 철학은, 언어모델에게 최대한 많은 자율권을 주고 보조구조는 최소화한다는 것이었습니다. 에이전트는 프롬프트, bash 명령 실행용 Bash Tool, 파일/디렉터리 조회 및 편집용 Edit Tool만 있습니다. 모델이 끝났다고 결정하거나 200k 컨텍스트 길이를 초과할 때까지 샘플링을 반복합니다. 이 구조는 모델이 특정 패턴이나 워크플로우에 하드코딩되지 않고, 문제 해결 전략에 스스로 판단을 내리도록 합니다.

프롬프트는 모델에게 제안 형태의 접근법만 간단히 안내하며, 너무 길거나 자세하지 않습니다. 단계별로 움직임에 있어 자유를 주며, 토큰을 아끼지 않는 상황이라면 답변이 장문이 되도록 유도하는 것이 도움이 됩니다.

아래는 실제 에이전트 프롬프트 예시입니다:

<uploaded_files>
{location}
</uploaded_files>
I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:

<pr_description>
{pr_description}
</pr_description>

Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?
I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!

Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the <pr_description> is satisfied.

Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with `python <filename.py>` using the BashTool, to confirm the error
3. Edit the sourcecode of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well

Your thinking should be thorough and so it's fine if it's very long.

모델의 첫 번째 툴은 Bash 명령 실행입니다. 스키마는 간단히 실행할 명령을 입력으로 받습니다. 하지만, 툴의 설명은 더욱 상세하며, 입력 이스케이프, 인터넷 미접속, 명령 백그라운드 실행 등 다양한 주의사항을 안내합니다.

Bash Tool의 사양 예시:

{
   "name": "bash",
   "description": "Run commands in a bash shell\n
* When invoking this tool, the contents of the \"command\" parameter does NOT need to be XML-escaped.\n
* You don't have access to the internet via this tool.\n
* You do have access to a mirror of common linux and python packages via apt and pip.\n
* State is persistent across command calls and discussions with the user.\n
* To inspect a particular line range of a file, e.g. lines 10-25, try 'sed -n 10,25p /path/to/the/file'.\n
* Please avoid commands that may produce a very large amount of output.\n
* Please run long lived commands in the background, e.g. 'sleep 10 &' or start a server in the background.",
   "input_schema": {
       "type": "object",
       "properties": {
           "command": {
               "type": "string",
               "description": "The bash command to run."
           }
       },
       "required": ["command"]
   }
}

두 번째 툴(Edit Tool)은 파일 보기, 생성, 편집 작업을 모두 지원하는 더 복잡한 도구입니다. 역시 도구 설명에 모델이 오해하지 않도록 상세 가이드를 마련했습니다.

우리는 다양한 에이전트 특화 작업에서 이들 도구의 설명과 사양에 많은 공을 들였습니다. 모델이 스펙을 잘못 이해하거나 예상치 못한 문제가 발생할 수 있는 사례를 직접 테스트하고, 문제를 미연에 방지할 수 있도록 설명을 수정했습니다. 앞으로도 모델용 툴 인터페이스 설계에 인간용 툴 인터페이스 만큼이나 많은 주의가 필요하다고 생각합니다.

아래는 Edit Tool 설명 예시 일부입니다:

{
   "name": "str_replace_editor",
   "description": "Custom editing tool for viewing, creating and editing files\n
* State is persistent across command calls and discussions with the user\n
* If `path` is a file, `view` displays the result of applying `cat -n`. If `path` is a directory, `view` lists non-hidden files and directories up to 2 levels deep\n
* The `create` command cannot be used if the specified `path` already exists as a file\n
* If a `command` generates a long output, it will be truncated and marked with `<response clipped>` \n
* The `undo_edit` command will revert the last edit made to the file at `path`\n
\n
Notes for using the `str_replace` command:\n
* The `old_str` parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!\n
* If the `old_str` parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in `old_str` to make it unique\n
* The `new_str` parameter should contain the edited lines that should replace the `old_str`",
...

성능을 향상시키기 위해, 도구를 "에러 방지" 형태로 설계했습니다. 예를 들어, 에이전트가 루트 디렉터리 바깥에서 상대 경로를 잘못 쓰는 실수를 방지하기 위해 항상 절대경로 입력만 받도록 수정했습니다.

기존 파일 편집 방식을 다양하게 실험한 끝에, 문자열 교체(str_replace) 명령이 가장 안정적임을 확인했습니다. 모델이 특정 파일에서 old_str을 찾아, 하나만 일치하면 new_str으로 교체합니다. 일치가 0개나 2개 이상이면 에러 메시지를 보여주고 재시도하도록 합니다.

Edit Tool의 스펙 일부는 다음과 같습니다:

...
   "input_schema": {
       "type": "object",
       "properties": {
           "command": {
               "type": "string",
               "enum": ["view", "create", "str_replace", "insert", "undo_edit"],
               "description": "The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`."
           },
           "file_text": {
               "description": "Required parameter of `create` command, with the content of the file to be created.",
               "type": "string"
           },
           "insert_line": {
               "description": "Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.",
               "type": "integer"
           },
           "new_str": {
               "description": "Required parameter of `str_replace` command containing the new string. Required parameter of `insert` command containing the string to insert.",
               "type": "string"
           },
           "old_str": {
               "description": "Required parameter of `str_replace` command containing the string in `path` to replace.",
               "type": "string"
           },
           "path": {
               "description": "Absolute path to file or directory, e.g. `/repo/file.py` or `/repo`.",
               "type": "string"
           },
           "view_range": {
               "description": "Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.",
               "items": {
                   "type": "integer"
               },
               "type": "array"
           }
       },
       "required": ["command", "path"]
   }
}

결과

전반적으로, 업그레이드된 Claude 3.5 Sonnet은 기존 모델 및 이전 최고 성능 모델보다 더 높은 추론, 코딩, 수학 능력을 보였으며, 도구 및 스캐폴딩 구조와 결합했을 때 최고의 성능을 발휘했습니다.

모델	Claude 3.5 Sonnet (새)	기존 SOTA	Claude 3.5 Sonnet (구버전)	Claude 3 Opus
SWE-bench Verified 점수	49%	45%	33%	22%

이 표는 동일한 에이전트 구조로 여러 모델을 평가한 SWE-bench Verified 점수입니다.

에이전트 동작 예시

벤치마크 실행에 SWE-Agent 프레임워크를 기초로 하였으며, 아래 로그 예시에서는 텍스트 출력, 도구 호출, 도구 응답을 각각 THOUGHT, ACTION, OBSERVATION으로 구분해 표기했습니다(실제론 고정된 호출 순서 없이 모델을 자유롭게 작동시킴).

아래 예시는 Sonnet 3.5가 SWE-bench 문제를 해결하는 전형적인 과정을 보여줍니다.

첫 번째 블록에서는 실제 SWE-bench 태스크에서 {pr_description}이 채워진 최초 프롬프트 일부를 볼 수 있습니다. 이슈 재현용 단계가 명시되어 있어 모델이 조사 실마리를 확보할 수 있습니다.

<uploaded_files>
/repo
</uploaded_files>
I've uploaded a python code repository in the directory /repo (not in /tmp/inputs). Consider the following PR description:

<pr_description>
linear_model.RidgeClassifierCV's Parameter store_cv_values issue
#### Description
Parameter store_cv_values error on sklearn.linear_model.RidgeClassifierCV

#### Steps/Code to Reproduce
import numpy as np
from sklearn import linear_model as lm

#test database
n = 100
x = np.random.randn(n, 30)
y = np.random.normal(size = n)

rr = lm.RidgeClassifierCV(alphas = np.arange(0.1, 1000, 0.1), normalize = True, 
                                         store_cv_values = True).fit(x, y)

#### Expected Results
Expected to get the usual ridge regression model output, keeping the cross validation predictions as attribute.

#### Actual Results
TypeError: __init__() got an unexpected keyword argument 'store_cv_values'

lm.RidgeClassifierCV actually has no parameter store_cv_values, even though some attributes depends on it.

모델은 이후 Edit Tool을 사용해 저장소 구조를 탐색합니다. 아래는 텍스트 출력 및 도구 호출 인자(THOUGHT/ACTION), 도구 응답(OBSERVATION) 예시입니다:

THOUGHT:
I'll help you implement the changes to satisfy the PR description. Let's follow the steps you outlined.

1. First, let's explore the repo structure:
ACTION (str_replace_editor):
command:
view

path:
/repo

OBSERVATION:
Here's the files and directories up to 2 levels deep in /repo, excluding hidden items:
/repo
/repo/PULL_REQUEST_TEMPLATE.md
/repo/ISSUE_TEMPLATE.md
/repo/sklearn
/repo/sklearn/decomposition
/repo/sklearn/kernel_ridge.py
/repo/sklearn/model_selection
/repo/sklearn/covariance
/repo/sklearn/cluster
/repo/sklearn/base.py
/repo/sklearn/dummy.py
/repo/sklearn/naive_bayes.py
...

저장소 구조를 파악한 후, 모델은 이슈 재현 및 수정 확인을 위한 새 스크립트를 생성합니다:

THOUGHT:
Based on the repository structure, we'll need to modify the Ridge Classifier in the `sklearn/linear_model` directory.

Let's create a script to reproduce the error:
ACTION (str_replace_editor):
command:
create

path:
/repo/reproduce_error.py

file_text:
import numpy as np
from sklearn import linear_model as lm

#test database
n = 100
x = np.random.randn(n, 30)
y = np.random.normal(size = n)

rr = lm.RidgeClassifierCV(alphas=np.arange(0.1, 1000, 0.1), normalize=True, 
                         store_cv_values=True).fit(x, y)

OBSERVATION:
File created successfully at: /repo/reproduce_error.py

모델은 Bash Tool로 위 스크립트를 실행해 실제로 문제가 재현되는지 확인합니다:

THOUGHT:
Let's verify that we can reproduce the error:
ACTION (bash):
command:
python3 /repo/reproduce_error.py

OBSERVATION:
Traceback (most recent call last):
  File "/repo/reproduce_error.py", line 10, in <module>
    store_cv_values=True).fit(x, y)
TypeError: __init__() got an unexpected keyword argument 'store_cv_values'

이후 Edit Tool을 사용해 저장소 소스코드를 수정하고, 재현 스크립트를 재실행하여 수정이 성공했는지 확인합니다:

THOUGHT:
The error is reproducible. Looking at the code, I can see that the issue is that while the `RidgeClassifierCV` class inherits from `_BaseRidgeCV`, it fails to pass the `store_cv_values` parameter from its constructor through to the base class constructor.

Let's fix this:
ACTION (str_replace_editor):
command:
str_replace

path:
/repo/sklearn/linear_model/ridge.py

old_str:
    def __init__(self, alphas=(0.1, 1.0, 10.0), fit_intercept=True,
                 normalize=False, scoring=None, cv=None, class_weight=None):
        super(RidgeClassifierCV, self).__init__(
            alphas=alphas, fit_intercept=fit_intercept, normalize=normalize,
            scoring=scoring, cv=cv)
        self.class_weight = class_weight

new_str:
    def __init__(self, alphas=(0.1, 1.0, 10.0), fit_intercept=True,
                 normalize=False, scoring=None, cv=None, class_weight=None,
                 store_cv_values=False):
        super(RidgeClassifierCV, self).__init__(
            alphas=alphas, fit_intercept=fit_intercept, normalize=normalize,
            scoring=scoring, cv=cv, store_cv_values=store_cv_values)
        self.class_weight = class_weight

이 예에서 모델은 12단계 작업 끝에 솔루션을 제출했습니다. 이후 자동 테스트로 정상 해결이 확인되었습니다. 어떤 과제는 100턴 이상 걸리기도 하며, 컨텍스트를 모두 소진할 때까지 계속 시도하는 경우도 있습니다.

업데이트된 Claude 3.5 Sonnet은 이전 모델보다 자가수정(self-correct) 빈도와 다양한 시도 능력 면에서 우수함을 보였습니다.

도전과제(챌린지)

SWE-bench Verified는 강력한 평가이지만, 단순 단일턴 평가보다 실전 운용 난도가 높습니다. 우리가 겪었던 과제와 다른 AI 개발자들이 겪을 수 있는 챌린지들입니다.

긴 소요시간 및 토큰 비용: 위 예시는 12단계 만에 완료됐지만, 성공적인 실행의 상당수는 수백 턴·10만 토큰 이상을 요구했습니다. Claude 3.5 Sonnet은 집요하게 문제를 파고들 능력이 있지만, 그만큼 비용도 발생합니다.
채점: 실패 과제 검사 중, 모델이 문제 없이 처리했음에도 환경 문제(setup)나 패치 중복 적용이 원인인 사례도 있었습니다. 이런 시스템 이슈 해결이 AI 에이전트의 정확한 평가에 중요합니다.
비가시 테스트: 모델은 자신이 채점받을 테스트 파일을 볼 수 없습니다. 종종 성공했다고 생각해 제출하지만 실제로는 실패가 되는 것은 이 때문입니다. 이는 추상화 수준 차이(밴드에이드 수정보단 근본 강조 필요)나, 테스크 특이성 때문일 수 있습니다.
멀티모달: Claude 3.5 Sonnet은 뛰어난 비전 및 멀티모달 능력을 갖고 있지만, 파일시스템에 저장된 파일이나 URL 참조 파일을 직접 볼 방법은 구현하지 않았습니다. 이로 인해, 특히 Matplotlib 등 그래픽 기반 과제에서 모델이 혼동·환각(hallucination)에 빠질 수 있었습니다. 이 분야에서 개발자들이 개선할 수 있는 여지가 많으며, SWE-bench 역시 멀티모달 특화 평가를 신설했습니다. 앞으로 Claude가 이 평가에서 더 높은 점수를 기록할 것도 기대합니다.

업그레이드된 Claude 3.5 Sonnet은 단순 프롬프트와 범용 도구 2개만으로 SWE-bench Verified에서 49%를 기록, 기존 최고점(45%)을 경신했습니다. 앞으로 개발자분들이 Claude 3.5 Sonnet을 활용해 더 높은 SWE-bench 점수 달성법을 빠르게 찾게 될 것이라 확신합니다.

감사의 말

Erik Schluntz가 SWE-bench 에이전트 최적화 및 본 글 집필을 담당했습니다. Simon Biggs, Dawn Drain, Eric Christiansen는 벤치마크 구현을 도왔습니다. Shauna Kravec, Dawn Drain, Felipe Rosso, Nova DasSarma, Ven Chandrasekaran 및 다수의 기여자들이 Claude 3.5 Sonnet의 에이전트 코딩 성능을 높이기 위해 노력해주셨습니다.