Maxime Labonne: Thinking beyond Transformers | Learning from Machine Learning #12

Liquid AI, Model Architecture, Data Quality and being on the cutting edge of edge deployments

Seth Levine and Maxime Labonne

May 29, 2025

Takeaways

🚀 Just released an incredible conversation with Maxime Labonne, Head of Post-Training at Liquid AI, for Learning from Machine Learning!

From cybersecurity to building copilots at JP Morgan Chase, Maxime's journey through ML is fascinating.

🔥 The efficiency revolution is here: Liquid AI is tackling the real challenge of deploying models on edge devices with limited resources. Think distillation and model merging for hardware where there is no prepared cookbook or rulebook.

📊 Evaluation isn't simple: Single leaderboards? Not enough. The future belongs to multiple signals and use-case-specific benchmarks that actually matter for your application.

⚡ Architecture innovation: While everyone is obsessed with Transformers, sometimes you need to step back to leap forward. We discuss State Space Models (SSMs), Mixture of Experts (MoE), and Hyena Edge.

🎯 For ML newcomers:

Build breadth before diving deep
Get your hands dirty with code
Ship end-to-end projects (like his LLM Twin)

💡 The big unsolved puzzle? Data quality. We still don't have clear answers on what makes a truly great dataset in terms of accuracy, diversity, and complexity.

🔧 Production reality check: Models don't live in isolation. Real learning happens in production with actual user feedback. And here's a kicker: your UI choice (not everything needs to be a chatbot!) fundamentally shapes how people interact with your model.

Career Journey & Background

Started in cybersecurity: Labonne began his ML career during his PhD, applying machine learning to detect cyber attacks. He found that ML models could learn what constitutes an attack better than humans could define it explicitly
Progressive focus on NLP: Moved from Airbus AI lab to JP Morgan Chase, working on internal code copilots and gaining expertise in transformers before GPT's release
Current role: Now leads post-training efforts at Liquid AI, focusing on making models more efficient

Liquid AI's Mission & Edge Computing Challenges

Two major challenges for edge deployment:

Training efficiency: Creating models that are 1000x smaller than frontier models while maintaining useful capabilities through techniques like distillation and model merging
Deployment complexity: Moving beyond standardized GPU infrastructure to diverse edge devices (phones, drones, satellites, IoT devices)

Key advantages of edge deployment:

Privacy (data stays on device)
Cost efficiency (no per-token API costs)
Specialized performance (focused models can compete with much larger general-purpose ones)

Architecture Innovation Beyond Transformers

The transformer challenge: Modern transformers aren't just the 2017 "Attention Is All You Need" paper; they're eight years of optimization with techniques like flash attention
Alternative approaches: State Space Models (SSMs), mixture of experts, hybrid architectures, and model merging
Trade-offs required: Often need to sacrifice some quality initially to gain speed/efficiency, then recover performance through better training

Key Insights on AI Development

Data Quality Framework

Labonne identifies three critical properties for good training data:

Accuracy: Does the sample correctly answer the question?
Diversity: Does it cover all relevant use cases?
Complexity: Is it challenging enough to actually train the model?

System-Level Thinking

Models are just one component of AI systems
User interface design is crucial - it guides how people interact with models
Multiple levers exist beyond just improving the model (UI/UX, generation parameters, preprocessing rules, etc.)

Evaluation & Benchmarks

Skeptical of single metrics: Chatbot Arena and other benchmarks each provide an imperfect signal on their own, so several must be combined
Community-driven evaluation: Encourages multiple specialized benchmarks for specific use cases
Benchmark decay: Benchmarks lose value over time as they leak into training data and scores saturate
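
One simple way to combine several evaluation signals is to average each model's rank across benchmarks. The sketch below is only an illustration of that idea, not a method described in the episode; the benchmark names, model names, and scores are all made up.

```python
def mean_rank(scores_by_benchmark):
    """Average each model's rank (1 = best) across several benchmarks."""
    ranks = {}
    for scores in scores_by_benchmark.values():
        # Higher score is better, so sort descending before assigning ranks.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


# Hypothetical scores: every model tops exactly one benchmark.
scores = {
    "benchmark_a": {"model_x": 0.71, "model_y": 0.64, "model_z": 0.69},
    "benchmark_b": {"model_x": 0.40, "model_y": 0.55, "model_z": 0.52},
    "benchmark_c": {"model_x": 0.80, "model_y": 0.78, "model_z": 0.83},
}
combined = mean_rank(scores)
best = min(combined, key=combined.get)
print(combined, best)
```

In this toy data each model wins one benchmark, so any single leaderboard would crown a different winner. The aggregate favors the model that is consistently near the top.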

Broader Perspectives

On AI Hype

Even AI practitioners contribute to hype cycles (citing the Reflection 70B example)
Hype drowns out genuinely important developments
Importance of actually testing models rather than just sharing posts

Career Advice

Breadth first: Understand the entire ML ecosystem before specializing
Hands-on approach: Code implementations yourself to truly understand concepts
End-to-end projects: Work on complete pipelines from data collection to deployment

Learning Philosophy

Applies ML concepts to personal learning - emphasizing exposure to diverse, high-quality "tokens" across different modalities and contexts, similar to how language models benefit from varied training data.

References

Liquid AI
Hyena Edge
Armin Thomas
Michael Poli
Stefano Massaroli
Hugging Face
Clementine Fourrier
State Space Models
Mixture of Experts
LLM Course on GitHub
MergeKit
OpenLLM Leaderboard

Maxime Labonne’s Links

💻 mlabonne - GitHub

𝕏 Follow me on X

🤗 Hugging Face

💻 Blog

📙 LLM Engineer's Handbook

Resources to learn more about Learning from Machine Learning

Glossary

Chain of Thought: A technique or capability where an LLM processes and generates a sequence of reasoning steps before providing a final answer, potentially leading to improved results, especially for complex problems.

Chatbot Arena: An online platform where users can compare the outputs of different Large Language Models side-by-side and vote on which response is better, used as a benchmark for LLM performance.

Distillation: A training technique where a smaller, "student" model learns from the outputs (often probability distributions) of a larger, "teacher" model, typically resulting in a more efficient model.
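
As a rough illustration of learning from a teacher's probability distribution, the sketch below computes a KL-divergence loss between temperature-softened teacher and student outputs. It is a minimal example under stated assumptions, not Liquid AI's training recipe: the logits, the temperature value, and the function names are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student))


# Hypothetical logits over a three-token vocabulary.
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

print(distillation_loss(close_student, teacher_logits))
print(distillation_loss(far_student, teacher_logits))
```

The loss shrinks as the student's distribution approaches the teacher's. In practice this term is usually combined with a standard loss on ground-truth labels.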

Edge Devices: Computing devices located at or near the source of data generation or consumption, such as mobile phones, drones, satellites, and IoT devices, which often have limited computing resources.

Evaluation: The process of assessing the performance and capabilities of machine learning models using various metrics and benchmarks.

Flash Attention: A technique used to optimize the attention mechanism in Transformer models, making them more memory-efficient and faster, especially for long sequences.

Frontier Models: The most advanced and powerful Large Language Models available at a given time, characterized by their very large size (up to trillions of parameters) and general-purpose capabilities.

Genetic Algorithm: A search and optimization algorithm inspired by the process of natural selection and evolution, used in the context of automatically designing neural network architectures (like in the STAR algorithm for Hyena Edge).

GPT-2: A Transformer-based language model developed by OpenAI, an earlier version of the GPT series.

Hallucination: In the context of LLMs, this refers to the generation of plausible-sounding but factually incorrect or nonsensical information.

Hugging Face: A platform and community known for providing open-source tools and models for machine learning, particularly for NLP and Transformer-based models.

Hyena Edge: A research paper and associated architecture designed to be highly efficient on edge devices by exploring alternatives and combinations of attention, recurrence, and convolutions.

Inference: The process of using a trained machine learning model to make predictions or generate outputs on new, unseen data.

IoT (Internet of Things): The network of physical objects—"things"—that are embedded with sensors, software, and other technologies for the purpose of connecting and exchanging data with other devices and systems over the internet.

KV Cache: A cache that stores the keys and values computed by the attention layers during the inference phase of transformer models. This allows the model to reuse them when generating subsequent tokens instead of recomputing them, significantly speeding up the process.
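
To make the reuse concrete, here is a minimal single-head sketch in plain Python. It assumes the query, key, and value vectors have already been projected; the class name `KVCache` and the toy vectors are hypothetical and do not come from any particular library.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all given positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Holds the keys and values of every token processed so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the new token's key and value are appended; earlier ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Toy (query, key, value) triples standing in for three generated tokens.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]
print(cached_outputs)
```

Each step adds one key and one value. The trade-off is that the cache grows with the sequence length, which matters on memory-constrained devices.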

MergeKit: A library used for merging the weights of different machine learning model checkpoints to create a new model with potentially improved performance or combined capabilities.

Mixture of Experts (MoE): A neural network architecture where different "expert" sub-networks specialize in processing different types of input data, and a "gate" network determines which experts to activate for a given input, allowing for conditional computation and potentially greater efficiency.
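
The gating idea can be sketched in a few lines. The example below assumes a linear gate and top-k routing, which is one common design rather than the only one; the experts, gate weights, and the function name `moe_forward` are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=1):
    """Route input x to the top_k experts chosen by a linear gate."""
    # One gate score per expert: dot product of x with that expert's gate row.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i],
                    reverse=True)[:top_k]
    # Renormalize the gate probabilities over the selected experts only.
    probs = softmax([scores[i] for i in chosen])
    outputs = [experts[i](x) for i in chosen]  # unselected experts never run
    dim = len(outputs[0])
    mixed = [sum(p * out[d] for p, out in zip(probs, outputs))
             for d in range(dim)]
    return mixed, chosen


# Three toy "experts" standing in for feed-forward sub-networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

output, chosen = moe_forward([3.0, 1.0], gate_weights, experts, top_k=1)
print(output, chosen)
```

Only the selected experts are evaluated, so the compute per input scales with `top_k` rather than with the total number of experts.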

Model Architecture: The specific design and structure of a neural network, defining how its layers and components are arranged and connected.

Model Capacity: Refers to the number of parameters in a neural network, which influences its ability to learn and store information.

Model Merging: A technique where the weights of multiple trained machine learning models are combined to create a new model.
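
The simplest form of merging is a weighted average of matching parameters. The sketch below assumes checkpoints are plain dictionaries mapping parameter names to flat lists of floats; it is an illustration only and does not use or reproduce MergeKit's API. The parameter names and values are hypothetical.

```python
def merge_linear(checkpoints, coefficients):
    """Weighted average of checkpoints that share parameter names and shapes."""
    if len(checkpoints) != len(coefficients):
        raise ValueError("need one coefficient per checkpoint")
    total = sum(coefficients)  # normalize so the coefficients act as weights
    merged = {}
    for name in checkpoints[0]:
        tensors = [ckpt[name] for ckpt in checkpoints]
        merged[name] = [
            sum(c * t[i] for c, t in zip(coefficients, tensors)) / total
            for i in range(len(tensors[0]))
        ]
    return merged


# Two hypothetical checkpoints with the same architecture.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

merged = merge_linear([model_a, model_b], [0.5, 0.5])
print(merged)
```

Merging this way requires no additional training, only that the models share an architecture.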

Modality: In machine learning, refers to different types of data, such as text, images, audio, or code.

Neural Beagle: A model published by Maxime Labonne on Hugging Face (https://huggingface.co/mlabonne).

NLP (Natural Language Processing): A field of computer science and artificial intelligence concerned with enabling computers to understand, interpret, and generate human language.

Octave: Free and open-source software similar to MATLAB, used for numerical computations.

OpenLLM Leaderboard: A leaderboard previously hosted by Hugging Face that ranked Large Language Models based on their performance on various benchmarks.

Parameters: The trainable weights and biases within a neural network that are adjusted during the training process.

Post-training: The phase after a base model has been pre-trained on a large dataset, where the model is further fine-tuned or adapted for specific tasks or domains (e.g., supervised fine-tuning, instruction tuning).

Pre-processing: The steps taken to clean, transform, and format raw data into a suitable input for a machine learning model.

PyTorch: An open-source machine learning framework originally developed by Facebook's AI Research lab (now Meta AI), widely used for building and training neural networks.

Sequential Data: Data that comes in a specific order, where the position and relationships between elements are important, such as text, time series, or DNA sequences.

Signal-to-Noise Ratio: The ratio of useful information (signal) to irrelevant or misleading information (noise) in a dataset or communication.

Silver Bullet: A simple and seemingly magical solution to a difficult problem. The discussion suggests that a single architecture is not a silver bullet for LLMs.

SSM (State Space Model): A class of sequential models that use a latent state to process input and generate output, inspired by dynamic systems. Mentioned as an alternative to Transformer architecture.
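
The latent-state idea can be illustrated with a small discrete linear recurrence. The sketch below assumes a diagonal state matrix and a scalar input, which is a simplification; it is not the Hyena Edge architecture, and the parameter values are made up.

```python
def ssm_scan(inputs, a, b, c):
    """Run a diagonal linear state space model over a scalar input sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise, one entry per state dimension)
    y_t = sum(c * h_t)
    """
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        # The whole history is summarized in a fixed-size state.
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


# Hypothetical parameters: one slowly decaying channel, one memoryless channel.
a = [0.5, 0.0]
b = [1.0, 1.0]
c = [1.0, 1.0]

ys = ssm_scan([1.0, 0.0, 0.0], a, b, c)
print(ys)  # [2.0, 0.5, 0.25]
```

The state has the same size no matter how long the sequence is, so memory use during generation stays constant. That property is one reason such models are attractive for constrained hardware.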

Supervised Fine-tuning (SFT): A post-training technique where a pre-trained model is trained on a dataset of labeled examples (input-output pairs) to adapt it to a specific task.

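STAR (Synthesis of Tailored Architectures): An algorithm that uses a genetic algorithm to automatically design neural network architectures, such as Hyena Edge.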

Throughput: In computing, the rate at which data can be processed or transferred. In the context of LLMs, it often refers to the speed of generating output tokens (e.g., tokens per second), or to how many requests the LLM can process in a specific timeframe. Higher throughput means the LLM can serve more requests or produce more output in the same amount of time.

Tokens: The basic units of text (words, subwords, or characters) that machine learning models process.

Transformer: A neural network architecture that relies on the attention mechanism to weigh the importance of different parts of the input data. It has been the dominant architecture for LLMs.

UI (User Interface): The visual layout and design of an application that allows users to interact with it.

UX (User Experience): The overall experience a user has when interacting with a product or system, including their feelings and perceptions.

Weights: The parameters within a neural network that are learned during training, determining the strength of connections between neurons.