Replication guide
This guide is written for someone who just found the repo and wants to test their own local model.
The repo does not provide model weights and does not launch a model server. You bring a local model runtime; this repo provides the eval cases, runner, scoring, and submission packaging.
1. Clone the repo
git clone https://github.com/robert896r1/qwen-realworld-accuracy-evals.git
cd qwen-realworld-accuracy-evals
2. Start your local model
Start your model using your own runtime: llama.cpp, Ollama, vLLM, LM Studio, or anything else that exposes an OpenAI-compatible endpoint.
The runner expects this kind of endpoint:
http://127.0.0.1:<port>/v1/chat/completions
If your base URL is:
http://127.0.0.1:8082/v1
then pass:
--base-url http://127.0.0.1:8082/v1
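To confirm the endpoint answers before involving the runner, a plain curl request against the standard OpenAI-compatible chat completions route is enough. This is a minimal sketch; the port and the "local" model name are placeholders for whatever your server uses:
# Hypothetical endpoint check: a working server returns a JSON chat completion.
curl -s http://127.0.0.1:8082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'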
Reference llama.cpp launch scripts from the original test runs live in:
examples/launch/llama-cpp/
Those scripts are examples only. Adjust model paths, ports, context size, KV cache, and runtime flags for your machine.
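For orientation, a llama.cpp launch in the spirit of those scripts might look like the sketch below. The model path, port, and context size are placeholders for your machine, not values taken from this repo:
# Illustrative llama.cpp server launch; adjust every value for your setup.
# The host/port must match the --base-url you pass to the runner.
llama-server -m /models/your-model-q8_0.gguf \
  --host 127.0.0.1 --port 8082 \
  -c 16384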
3. Run a two-case smoke test
Do this before running the full suite. It confirms that your endpoint, auth, JSON parsing, scoring, and output folder all work.
python tools/run_eval.py \
--base-url http://127.0.0.1:8082/v1 \
--profile-label my-local-smoke-test \
--case-limit 2
Important distinction:
- `--base-url` points to the model server that is already running.
- `--profile-label` names the output folder and result package. It does not change the model being tested.
- `--model` defaults to `local`; only set it if your server requires a specific model name.
If your server does require a model name:
python tools/run_eval.py \
--base-url http://127.0.0.1:8082/v1 \
--model local \
--profile-label qwen3.6-27b-unsloth-128k-q8-smoke \
--case-limit 2
Expected output:
- a new directory under `runs/local/`
- `run.json` with model/runtime metadata
- `raw.jsonl` with per-case outputs
- `score.json` with the exact score summary
- `notes.md`, `environment.txt`, and `launch_command.txt`
Validate the newest run:
RUN_DIR="$(ls -td runs/local/* | head -1)"
python tools/validate_run.py "$RUN_DIR"
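If validation passes and you want a quick look at the numbers, the output files are plain JSON. A minimal way to inspect them, assuming the `score.json` name from the expected-output list above:
# Pretty-print the score summary of the newest run.
python -m json.tool "$RUN_DIR/score.json"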
4. Run the full canonical suite
Once the smoke test works, run the full suite by removing `--case-limit`:
python tools/run_eval.py \
--base-url http://127.0.0.1:8082/v1 \
--profile-label qwen3.6-27b-unsloth-128k-q8
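The full suite runs every case, so expect it to take a while depending on your hardware. One optional way to keep a transcript is to tee the runner's output to a log file; the `logs/` path here is just a suggestion, not a repo convention:
# Same invocation as above, with console output captured to a log.
mkdir -p logs
python tools/run_eval.py \
  --base-url http://127.0.0.1:8082/v1 \
  --profile-label qwen3.6-27b-unsloth-128k-q8 \
  2>&1 | tee logs/full-run.log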
5. Package a shareable submission
RUN_DIR="$(ls -td runs/local/* | head -1)"
python tools/validate_run.py "$RUN_DIR"
python tools/package_submission.py "$RUN_DIR"
The zip lands in:
submissions/out/
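Before attaching it, you can sanity-check the newest archive's contents with unzip; the glob just picks the most recently written zip:
# List the contents of the most recent submission package.
unzip -l "$(ls -t submissions/out/*.zip | head -1)"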
Attach that zip to a "Submit run result" issue. Do not commit local run output directly.
6. Regenerate existing summary charts
The published charts are generated from the historical result set already committed under results/max-accuracy-v1/raw/:
python scripts/generate_summary_and_charts.py
cp charts/*.svg docs/assets/charts/
That chart script is for the current published result matrix. New community run packages should stay out of the canonical chart until reviewed.