
Aman Khan: Arize, Evaluating AI, Designing for Non-Determinism | Learning from Machine Learning #11

Embracing Uncertainty: Why AI Evaluation Is Its Own Machine Learning Problem

On this episode of Learning from Machine Learning, I had the privilege of speaking with Aman Khan, Head of Product at Arize AI. Aman shared how evaluating AI systems isn't just a step in the process—it's a machine learning challenge in and of itself. Drawing powerful analogies between mechanical engineering and AI, he explained, "Instead of tolerances in manufacturing, you're designing for non-determinism," reminding us that complexity often breeds opportunity.

Aman's journey from self-driving cars to ML evaluation tools highlights the critical importance of robust systems that can handle failure. He encourages teams to clearly define outcomes, break down complex systems, and build evaluations into every step of the development pipeline. Most importantly, Aman's insights remind us that machine learning—much like life—is less deterministic and more probabilistic, encouraging us to question how we deal with the uncertainty in our own lives.

Thank you for listening. Be sure to subscribe and share with a friend or colleague. Until next time... keep on learning.

TLDR:

Evaluating AI product success requires:

  • balancing technical metrics with practical business outcomes

  • moving beyond subjective checks

  • simulating user interactions to proxy real-world performance

  • breaking down complex systems into measurable components

  • using evaluation data to continuously improve the end-user experience, which in turn drives business success

Takeaways

  • Generative AI vs. Traditional Software

    • Generative AI systems are non-deterministic by nature

      • They have an infinitely larger space of potential failure modes.

    • Traditional software is often designed to be deterministic

      • Code execution follows predictable paths

  • Break down complex AI systems for evaluation

    • Decompose AI products (like agents) into smaller, measurable components (e.g., router, tool calls, downstream steps) to identify specific areas for improvement

  • Build "judges" for components

    • Develop evaluations, potentially using LLMs, to assess the performance of individual system components against expected criteria and examples (see the router-judge sketch after this list)

  • Continuous improvement is key

    • Use the data and insights from technical evaluations to feed back into the development cycle

    • Continuously refine the system based on performance against defined criteria and desired outcomes

  • Current technology makes it possible to better proxy business outcomes: simulate interactions with the product using different user personas, then evaluate those simulated sessions (see the persona-simulation sketch after this list)

  • While evaluation provides data, knowing how to improve can still be challenging, often requiring identification of the limiting factors—whether the system is constrained by data, model capabilities, or available tools
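
To make the "judge" idea concrete, here is a minimal sketch of an LLM-as-judge evaluation for one decomposed agent component (the router). The example fields, the judge prompt, and the `call_llm` hook are illustrative assumptions, not Arize's implementation; any chat-completion client can be plugged in as `call_llm`.

```python
# Minimal sketch of an LLM-as-judge evaluation for one decomposed agent
# component (the router). The example fields, prompt, and `call_llm` hook
# are illustrative assumptions, not a specific vendor's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RouterExample:
    user_query: str     # what the user asked
    chosen_tool: str    # tool the agent's router actually selected
    expected_tool: str  # tool the spec (or a human label) says it should select


JUDGE_PROMPT = """You are evaluating an AI agent's router.
User query: {query}
Tool chosen by the router: {chosen}
Tool expected by the spec: {expected}
Answer with exactly one word: "correct" or "incorrect"."""


def judge_router(example: RouterExample, call_llm: Callable[[str], str]) -> bool:
    """Ask an LLM judge whether the router picked the expected tool."""
    verdict = call_llm(JUDGE_PROMPT.format(
        query=example.user_query,
        chosen=example.chosen_tool,
        expected=example.expected_tool,
    ))
    return verdict.strip().lower().startswith("correct")


def router_accuracy(examples: list[RouterExample],
                    call_llm: Callable[[str], str]) -> float:
    """Aggregate per-example verdicts into a component-level metric."""
    verdicts = [judge_router(ex, call_llm) for ex in examples]
    return sum(verdicts) / max(len(verdicts), 1)
```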
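
And here is a sketch of persona-based simulation as a proxy for business outcomes, under the same caveat: the personas, the `app` and `call_llm` hooks, and the yes/no resolution judge are assumptions for illustration, not a specific product's API.

```python
# Minimal sketch of persona-based simulation as a proxy for business outcomes.
# The personas, the `app` and `call_llm` hooks, and the yes/no resolution
# judge are illustrative assumptions, not a specific product's API.
from typing import Callable

PERSONAS = [
    "a first-time user who asks short, vague questions",
    "a power user who chains several requests into one message",
    "a frustrated customer trying to get a refund",
]


def simulate_session(persona: str,
                     app: Callable[[str], str],
                     call_llm: Callable[[str], str],
                     turns: int = 3) -> list[tuple[str, str]]:
    """Have an LLM play the persona against the product for a few turns."""
    transcript: list[tuple[str, str]] = []
    for _ in range(turns):
        user_msg = call_llm(
            f"You are {persona}. The conversation so far is {transcript!r}. "
            "Write your next message to the assistant."
        )
        transcript.append((user_msg, app(user_msg)))
    return transcript


def session_resolved(transcript: list[tuple[str, str]],
                     call_llm: Callable[[str], str]) -> bool:
    """Judge whether the simulated user's goal was met (a proxy outcome)."""
    verdict = call_llm(
        f"Conversation: {transcript!r}\nWas the user's goal met? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```

Metrics like router accuracy and session resolution rate are the kind of evaluation data that can feed back into the development cycle described above.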

Resources

AI Development Tools

Resources to learn more about Learning from Machine Learning
