On this episode of Learning from Machine Learning, I had the privilege of speaking with Aman Khan, Head of Product at Arize AI. Aman shared how evaluating AI systems isn't just a step in the process; it's a machine learning challenge in and of itself. Drawing powerful analogies between mechanical engineering and AI, he explained, "Instead of tolerances in manufacturing, you're designing for non-determinism," reminding us that complexity often breeds opportunity.
Aman's journey from self-driving cars to ML evaluation tools highlights the critical importance of robust systems that can handle failure. He encourages teams to clearly define outcomes, break down complex systems, and build evaluations into every step of the development pipeline. Most importantly, Aman's insights remind us that machine learning—much like life—is less deterministic and more probabilistic, encouraging us to question how we deal with the uncertainty in our own lives.
Thank you for listening. Be sure to subscribe and share with a friend or colleague. Until next time... keep on learning.
TLDR:
Evaluating AI product success requires:
balancing technical metrics with practical business outcomes
moving beyond subjective checks
simulating user interactions to proxy real-world performance
breaking down complex systems into measurable components
using evaluation data to continuously improve the end-user experience, which in turn drives business success
Takeaways
Generative AI vs. Traditional Software
Generative AI systems are non-deterministic by nature
They have a vastly larger space of potential failure modes.
Traditional software, by contrast, is often designed to be deterministic: code execution follows predictable paths
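To make the contrast concrete, here is a minimal sketch, assuming an OpenAI-compatible Python client (the model name and prompt are illustrative): the same input can produce a different output on every call, which is exactly what a conventional unit test cannot pin down.

```python
# A minimal sketch of non-determinism, assuming an OpenAI-compatible
# Python client; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Name one surprising failure mode of LLM-based agents."

# At temperature > 0, the same input can yield a different output on
# every call, unlike a deterministic function that a traditional unit
# test could pin to a single expected value.
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    print(response.choices[0].message.content)
```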
Break down complex AI systems for evaluation
Decompose AI products (like agents) into smaller, measurable components (e.g., router, tool calls, downstream steps) to identify specific areas for improvement
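As an illustration of what that decomposition can look like, here is a hypothetical sketch; the trace schema and component names are invented for the example, not taken from any specific framework:

```python
# Illustrative sketch of component-level evaluation. The Step schema
# and component names (router, tool_call, synthesis) are hypothetical.
from dataclasses import dataclass

@dataclass
class Step:
    component: str   # e.g. "router", "tool_call", "synthesis"
    input: str
    output: str
    expected: str    # reference answer for this step, if available

def eval_step(step: Step) -> bool:
    """Score one component in isolation; exact match is used here for
    brevity, but a real check might use fuzzy matching or an LLM judge."""
    return step.output.strip() == step.expected.strip()

trace = [
    Step("router", "What's the weather in Paris?", "weather_tool", "weather_tool"),
    Step("tool_call", "weather_tool(city='Paris')", "18C, cloudy", "18C, cloudy"),
    Step("synthesis", "18C, cloudy", "It's 18C and cloudy in Paris.",
         "It's 18C and cloudy in Paris."),
]

# Per-component scores localize failures: a wrong final answer with a
# passing router points at the tool call or synthesis step.
for step in trace:
    print(f"{step.component}: {'pass' if eval_step(step) else 'fail'}")
```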
Build "judges" for components
Develop evaluations, potentially using LLMs, to assess the performance of individual system components against expected criteria and examples
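A minimal LLM-as-judge sketch for a router component like the one above, again assuming an OpenAI-compatible client; the rubric, route names, and model name are illustrative:

```python
# Minimal LLM-as-judge sketch, assuming an OpenAI-compatible client;
# the model name, routes, and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating the output of a routing component.
Question: {question}
Chosen route: {route}
Allowed routes: search_tool, calculator, respond_directly

Answer with exactly one word: "correct" or "incorrect"."""

def judge_router(question: str, route: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, route=route)}],
        temperature=0.0,  # keep the judge itself as deterministic as possible
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")

print(judge_router("What is 17 * 23?", "calculator"))  # expected: True
```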
Continuous improvement is key
Use the data and insights from technical evaluations to feed back into the development cycle
Continuously refine the system based on performance against defined criteria and desired outcomes
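One common way to close that loop, sketched here with hypothetical names, is to turn observed failures into a fixed evaluation set and re-score the system after every prompt or model change:

```python
# Hypothetical regression-style eval loop: failures found in production
# become fixed test cases, and every change is re-scored against them.
failure_cases = [
    {"question": "What's 2+2?", "expected_route": "calculator"},
    {"question": "Latest news on Mars?", "expected_route": "search_tool"},
]

def run_evals(route_fn) -> float:
    """Return the fraction of stored failure cases the system now passes."""
    passed = sum(
        route_fn(case["question"]) == case["expected_route"]
        for case in failure_cases
    )
    return passed / len(failure_cases)

# `route_fn` stands in for the system under test; after each prompt or
# model change, the score shows whether past failures stay fixed.
baseline = run_evals(lambda q: "calculator")
print(f"pass rate: {baseline:.0%}")
```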
Current technology allows for better proxying of business outcomes by simulating interactions with the product using different user personas and then evaluating these simulations
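A sketch of what such persona-driven simulation might look like, assuming an OpenAI-compatible client; the personas, task, and app() stub are placeholders for the product under test:

```python
# Sketch of persona-driven simulation, assuming an OpenAI-compatible
# client; personas, model name, and the app() stub are illustrative.
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "a first-time user who writes short, vague questions",
    "an expert user who asks precise, multi-step questions",
    "a frustrated user who has already tried and failed twice",
]

def simulate_user(persona: str, task: str) -> str:
    """Have an LLM role-play the persona to generate a realistic opening message."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system",
                   "content": f"Role-play as {persona}. Write their first message."},
                  {"role": "user", "content": f"Task they want done: {task}"}],
    )
    return response.choices[0].message.content

def app(message: str) -> str:
    """Stand-in for the product under test."""
    return "I've issued the refund."  # placeholder reply

for persona in PERSONAS:
    message = simulate_user(persona, "get a refund for a duplicate charge")
    reply = app(message)
    # Each (persona, message, reply) triple can then be scored by a judge,
    # e.g. "did the reply resolve the user's request?"
    print(persona, "->", message[:60])
```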
While evaluation provides the data, knowing how to improve can still be challenging; it often requires identifying the limiting factor, whether that is the data, the model's capabilities, or the tools available to the system