Google's Gemini 2.5 Flash: a fast multimodal model with a 1M-token context window and strong reasoning at low cost.
AAAWWW_gemini2.5_init baseline
Gemini-2.5-Flash FC — rank #15
MUSE + Gemini 2.5 Flash — rank #2, 41.1% resolved
SGV + Gemini 2.5 Flash SoM — rank #1 (Jul 2025)
Gemini 2.5 Flash — rank #7, calibration error 80.0%
Gemini 2.5 Flash Max Thinking (Sep 2025) — rank #43
Gemini-2.5-Flash-05-20 — rank #8, easy=99.2 med=80.0 hard=46.0