A higher offline eval score is a hypothesis, not proof. How to run controlled online experiments on prompts, models, and RAG: architecture, metrics, and statistics.