On this episode of Learning from Machine Learning, I had the privilege of speaking with Aman Khan, Head of Product at Arize AI. Aman shared how evaluating AI systems isn't just a step in the process; it's a machine learning challenge in and of itself. Drawing powerful analogies between mechanical engineering and AI, he explained, "Instead of tolerances in manufacturing, you're designing for non-determinism," reminding us that complexity often breeds opportunity.
Aman's journey from self-driving cars to ML evaluation tools highlights the critical importance of robust systems that can handle failure. He encourages teams to clearly define outcomes, break down complex systems, and build evaluations into every step of the development pipeline. Most importantly, Aman's insights remind us that machine learning—much like life—is less deterministic and more probabilistic, encouraging us to question how we deal with the uncertainty in our own lives.
Thank you for listening. Be sure to subscribe and share with a friend or colleague. Until next time... keep on learning.
TLDR:
Evaluating AI product success requires:
balancing technical metrics with practical business outcomes
moving beyond subjective checks
simulating user interactions to proxy real-world performance
breaking down complex systems into measurable components
using evaluation data to continuously improve the end-user experience, which in turn drives business success
Takeaways
Generative AI vs. Traditional Software
Generative AI systems are non-deterministic by nature
They have a vastly larger space of potential failure modes.
Traditional software, by contrast, is often designed to be deterministic: code execution follows predictable paths
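To make the contrast concrete, here is a minimal sketch, assuming an OpenAI-compatible Python client (the model name and prompt are illustrative): the same input can produce a different output on every call, which is exactly what a conventional unit test cannot pin down.

```python
# A minimal sketch of non-determinism, assuming an OpenAI-compatible
# Python client; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Name one surprising failure mode of LLM-based agents."

# At temperature > 0, the same input can yield a different output on
# every call, unlike a deterministic function that a traditional unit
# test could pin to a single expected value.
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    print(response.choices[0].message.content)
```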
Break down complex AI systems for evaluation
Decompose AI products (like agents) into smaller, measurable components (e.g., router, tool calls, downstream steps) to identify specific areas for improvement
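As an illustration of what that decomposition can look like, here is a hypothetical sketch; the trace schema and component names are invented for the example, not taken from any specific framework:

```python
# Illustrative sketch of component-level evaluation. The Step schema
# and component names (router, tool_call, synthesis) are hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    component: str   # e.g. "router", "tool_call", "synthesis"
    input: str
    output: str
    expected: str    # reference answer for this step, if available

def eval_step(step: Step) -> bool:
    """Score one component in isolation; exact match is used here for
    brevity, but a real check might use fuzzy matching or an LLM judge."""
    return step.output.strip() == step.expected.strip()

trace = [
    Step("router", "What's the weather in Paris?", "weather_tool", "weather_tool"),
    Step("tool_call", "weather_tool(city='Paris')", "18C, cloudy", "18C, cloudy"),
    Step("synthesis", "18C, cloudy", "It's 18C and cloudy in Paris.",
         "It's 18C and cloudy in Paris."),
]

# Per-component scores localize failures: a wrong final answer with a
# passing router points at the tool call or synthesis step.
for step in trace:
    print(f"{step.component}: {'pass' if eval_step(step) else 'fail'}")
```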
Build "judges" for components
Develop evaluations, potentially using LLMs, to assess the performance of individual system components against expected criteria and examples
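A minimal LLM-as-judge sketch for a router component like the one above, again assuming an OpenAI-compatible client; the rubric, route names, and model name are illustrative:

```python
# Minimal LLM-as-judge sketch, assuming an OpenAI-compatible client;
# the model name, routes, and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating the output of a routing component.
Question: {question}
Chosen route: {route}
Allowed routes: search_tool, calculator, respond_directly

Answer with exactly one word: "correct" or "incorrect"."""

def judge_router(question: str, route: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, route=route)}],
        temperature=0.0,  # keep the judge itself as deterministic as possible
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")

print(judge_router("What is 17 * 23?", "calculator"))  # expected: True
```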
Continuous improvement is key
Use the data and insights from technical evaluations to feed back into the development cycle
Continuously refine the system based on performance against defined criteria and desired outcomes
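One common way to close that loop, sketched here with hypothetical names, is to turn observed failures into a fixed evaluation set and re-score the system after every prompt or model change:

```python
# Hypothetical regression-style eval loop: failures found in production
# become fixed test cases, and every change is re-scored against them.
failure_cases = [
    {"question": "What's 2+2?", "expected_route": "calculator"},
    {"question": "Latest news on Mars?", "expected_route": "search_tool"},
]

def run_evals(route_fn) -> float:
    """Return the fraction of stored failure cases the system now passes."""
    passed = sum(
        route_fn(case["question"]) == case["expected_route"]
        for case in failure_cases
    )
    return passed / len(failure_cases)

# `route_fn` stands in for the system under test; after each prompt or
# model change, the score shows whether past failures stay fixed.
baseline = run_evals(lambda q: "calculator")
print(f"pass rate: {baseline:.0%}")
```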
Current technology allows for better proxying of business outcomes by simulating interactions with the product using different user personas and then evaluating these simulations
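A sketch of what such persona-driven simulation might look like, assuming an OpenAI-compatible client; the personas, task, and app() stub are placeholders for the product under test:

```python
# Sketch of persona-driven simulation, assuming an OpenAI-compatible
# client; personas, model name, and the app() stub are illustrative.
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "a first-time user who writes short, vague questions",
    "an expert user who asks precise, multi-step questions",
    "a frustrated user who has already tried and failed twice",
]

def simulate_user(persona: str, task: str) -> str:
    """Have an LLM role-play the persona to generate a realistic opening message."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system",
                   "content": f"Role-play as {persona}. Write their first message."},
                  {"role": "user", "content": f"Task they want done: {task}"}],
    )
    return response.choices[0].message.content

def app(message: str) -> str:
    """Stand-in for the product under test."""
    return "I've issued the refund."  # placeholder reply

for persona in PERSONAS:
    message = simulate_user(persona, "get a refund for a duplicate charge")
    reply = app(message)
    # Each (persona, message, reply) triple can then be scored by a judge,
    # e.g. "did the reply resolve the user's request?"
    print(persona, "->", message[:60])
```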
While evaluation provides the data, knowing how to improve can still be challenging; it often requires identifying the limiting factor, whether that is the data, the model's capabilities, or the tools available to the system