Updates & case studies

Deep dives into our projects, benchmarks, and experiments in frontier agent evaluation.