These tests came from a working Codex + local Qwen workflow. Qwen’s job there is not to replace Codex; it acts as a second set of eyes for challenging output, validation, UI review, and long-context sanity checks.

The practical question was: which local llama.cpp profile should stay in rotation?

Short answer

  • bartowski-128k-f16, bartowski-128k-q8, and unsloth-128k-q8 tied at 36/39 exact and 45/48 weighted.
  • q8 KV showed no measured accuracy loss in this suite.
  • A 65k context window was the wrong envelope for the full workload because it cannot hold the >65k-token needle case.
  • unsloth-128k-f16 loaded, but the long-context requests timed out under local memory/throughput pressure.
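Given that q8 KV showed no measured accuracy loss, the 128k q8 profiles are the shape worth keeping loaded. A minimal llama-server launch sketch for that shape, using llama.cpp’s standard KV-cache flags; the model path and -ngl value are placeholders, not the tested configuration:

```shell
# Sketch: 128k-context llama.cpp server with a q8_0 KV cache.
# Model path and -ngl value are placeholders, not the tested setup.
#   -c 131072      : 128k context window
#   -fa            : flash attention (quantizing the V cache requires it)
#   --cache-type-* : store the KV cache as q8_0 instead of f16
llama-server \
  -m ./qwen-128k-q8_0.gguf \
  -c 131072 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -ngl 99
```

Dropping the KV cache from f16 to q8_0 roughly halves KV memory, which is what makes the full 128k window practical on local hardware.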

Headline figures

[chart: elapsed seconds per profile]

[chart: per-case result heatmap]

Result table

Profile              Exact   Weighted   Errors   Elapsed
unsloth-65k-f16      33/39   36/48      1        22.0s
unsloth-65k-q8       33/39   36/48      1        23.7s
bartowski-128k-f16   36/39   45/48      0        55.7s
bartowski-128k-q8    36/39   45/48      0        56.7s
unsloth-128k-f16     30/39   30/48      2        634.1s*
unsloth-128k-q8      36/39   45/48      0        57.6s

* Loaded, then timed out on both long-context prompts under local memory/throughput pressure. The elapsed value includes those timeouts.
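The exact and weighted columns come from the same case results tallied two ways: exact counts every passing case equally, while weighted gives some cases (such as the long-context needles) extra weight, which is why the maximum is 48 rather than 39. A minimal sketch of such a tally; the case names and weights are illustrative, not the actual suite:

```python
# Hypothetical scoring sketch: each case either passes or fails, and each
# carries a weight. Exact score counts passes; weighted score sums the
# weights of passing cases.

def score(results: dict, weights: dict) -> tuple:
    """results: {case_id: passed?}, weights: {case_id: weight}."""
    exact = sum(1 for cid, ok in results.items() if ok)
    weighted = sum(weights[cid] for cid, ok in results.items() if ok)
    return exact, weighted

# Tiny illustrative run: three cases, one double-weighted needle case.
results = {"format-json": True, "ui-review": True, "needle-long": False}
weights = {"format-json": 1, "ui-review": 1, "needle-long": 2}
print(score(results, weights))  # (2, 2) — the failed needle costs 2 weighted points
```

Under this kind of scheme, a profile that passes everything except the heavy long-context cases loses disproportionately on the weighted column, which is exactly the gap visible between the 65k and 128k profiles above.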
