Contributing real-world cases and runs
The repo is useful only if it stays grounded in real work without becoming a dump of private prompts or random benchmark puzzles.
The contribution model
There are two primary contribution types:
- Run results against the canonical suite.
- Case proposals derived from real coding-agent failure modes.
Both start as issues. Accepted suite changes are curated.
Run results
Run results should be packaged with:
```
python tools/package_submission.py runs/local/<run-dir>
```
The zip includes:
- `run.json`: model/profile/runtime metadata
- `score.json`: scored summary
- `raw.jsonl`: per-case outputs and scores
- `raw_responses/`: raw OpenAI-compatible responses
- `environment.txt`: Python/platform/server properties, where available
- `launch_command.txt`: launch command, if supplied
- `notes.md`: human-readable run notes
Attach that zip to a "Submit run result" issue instead of committing local run output directly.
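If you want to sanity-check the archive before attaching it, a small script like the one below can confirm the expected entries are present. This is an illustrative sketch only, not part of the repo's tooling; it assumes the filenames listed above and tolerates any top-level folder prefix inside the zip.

```python
# Illustrative sanity check; not part of the repo's tooling.
# Confirms a packaged run zip contains the entries listed above.
import sys
import zipfile

EXPECTED_FILES = {
    "run.json", "score.json", "raw.jsonl",
    "environment.txt", "launch_command.txt", "notes.md",
}

def check(zip_path: str) -> int:
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    # Match by basename so a top-level folder inside the zip doesn't matter.
    basenames = {name.rsplit("/", 1)[-1] for name in names}
    missing = sorted(EXPECTED_FILES - basenames)
    if not any("raw_responses/" in name for name in names):
        missing.append("raw_responses/")
    if missing:
        print("missing entries:", ", ".join(missing))
        return 1
    print("zip looks complete")
    return 0

if __name__ == "__main__":
    sys.exit(check(sys.argv[1]))
```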
Case proposals
A useful case proposal answers the following questions; a sketch of one possible shape follows the list.
- What real workflow failure inspired this?
- What is the minimum safe context needed to reproduce it?
- What should the sidecar model return?
- How is it scored?
- Why does it matter for a local model running alongside Codex, Claude Code, Cursor, Aider, or another frontier coding agent?
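One way to make a proposal concrete before filing the issue is to draft it as structured data, one field per question. The skeleton below is hypothetical; the field names are not the repo's actual case schema, and the content is an invented over-build example.

```python
# Hypothetical skeleton for drafting a case proposal before filing the issue.
# Field names are illustrative, not the repo's case schema.
case_proposal = {
    "origin": (
        "Frontier agent kept adding a retry queue and feature flags "
        "to what was asked for as a one-off migration script."
    ),
    "context": [
        # Minimum safe context: sanitized snippets only, no proprietary code.
        "sanitized task prompt given to the coding agent",
        "sanitized diff the agent produced",
    ],
    "expected_sidecar_output": (
        "Flag the extra infrastructure as over-building and recommend "
        "trimming the change to the stated one-off scope."
    ),
    "scoring": {
        "method": "rubric",
        "criteria": [
            "identifies the unnecessary components",
            "does not object to the parts that were actually requested",
        ],
    },
    "why_it_matters": (
        "Over-build detection is exactly the judgment a local sidecar can "
        "apply while the frontier agent is still drafting the change."
    ),
}
```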
Accepted case rules
Accepted cases should test companion-agent usefulness:
- challenge quality;
- over-build detection;
- directive fidelity;
- frontend/UI judgment;
- artifact triage;
- code-review precision;
- long-context attention.
They should not be generic trivia or abstract puzzle prompts.
Privacy rule
Describe the failure mode, not your business. Sanitize aggressively.
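Sanitization is easier to do consistently as a scrubbing pass before anything is pasted into an issue. The helper below is purely illustrative and not repo tooling; the patterns and placeholder names are assumptions, and automated substitution is a starting point, not a substitute for reviewing the result by hand.

```python
# Illustrative scrubbing pass; not repo tooling. Extend the patterns to cover
# whatever identifies your project, then review the output manually.
import re

# Hypothetical redactions: internal hostnames, ticket IDs, a product name.
REDACTIONS = [
    (re.compile(r"\b[\w.-]+\.internal\.example\.com\b"), "<internal-host>"),
    (re.compile(r"\bPROJ-\d+\b"), "<ticket>"),
    (re.compile(r"\bAcmePaymentsService\b"), "<service-name>"),
]

def sanitize(text: str) -> str:
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text

if __name__ == "__main__":
    raw = "PROJ-4821: AcmePaymentsService times out against db1.internal.example.com"
    print(sanitize(raw))
    # -> "<ticket>: <service-name> times out against <internal-host>"
```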