Leland McInnes: UMAP, HDBSCAN & the Geometry of Data | Learning from Machine Learning #10

Decomposing Black Boxes, Decoding Data's Hidden Geometry

In this episode of Learning from Machine Learning, we explore the intersection of pure mathematics and modern data science with Leland McInnes, the mind behind an ecosystem of tools for unsupervised learning including UMAP, HDBSCAN, PyNNDescent and DataMapPlot. As a researcher at the Tutte Institute for Mathematics and Computing, McInnes has fundamentally shaped how we approach and understand complex data.

Leland views data through a unique geometric lens, drawing from his background in algebraic topology to uncover hidden patterns and relationships within complex datasets. This perspective led to the creation of UMAP, a breakthrough in dimensionality reduction that preserves both local and global data structure to allow for incredible visualizations and clustering. Similarly, his clustering algorithm HDBSCAN tackles the messy reality of real-world data, handling varying densities and noise with remarkable effectiveness.

But perhaps what's most striking about Leland isn't just his technical achievements – it's his philosophy toward algorithm development. He champions the concept of "decomposing black box algorithms," advocating for transparency and understanding over blind implementation. By breaking down complex algorithms into their fundamental components, Leland argues, we gain the power to adapt and innovate rather than simply consume.

For those entering the field, Leland offers pointed advice: resist the urge to chase the hype. Instead, find your unique angle, even if it seems unconventional. His own journey – applying concepts from algebraic topology and fuzzy simplicial sets to data science – demonstrates how breakthrough innovations often emerge from unexpected connections.

Throughout our conversation, Leland’s passion for knowledge and commitment to understanding shine through. His approach reminds us that the most powerful advances in data science often come not from following the crowd, but from diving deep into fundamentals and drawing connections across disciplines.

There's immense value in understanding the tools you use, questioning established approaches, and bringing your unique perspective to the field. As Leland shows us, sometimes the most significant breakthroughs come from seeing familiar problems through a new lens.


Learning from Machine Learning is part of an ongoing series exploring the minds shaping modern machine learning. If you enjoyed this conversation, consider sharing it with a colleague or friend who might find it valuable.


Takeaways

  • Decompose black boxes: Understanding the internal components of algorithms (not just their parameters) is essential for adapting them to new problems and creating innovation

  • Search answers known questions; exploration reveals unknown patterns: The biggest challenge with unstructured data isn’t finding what you’re looking for, but discovering what you didn’t know was there

  • Resist the hype: Work on what interests you in less crowded fields. Breakthrough innovations often emerge from unexpected connections across disciplines

  • Build composable tools: Design algorithms as assemblies of interchangeable components (like Lego bricks) rather than monolithic solutions

  • Unsupervised learning has no single “right answer”: Unlike supervised learning with clear metrics, clustering and dimensionality reduction depend entirely on your use case. If it’s doing what you need, that’s good enough


Summary

The geometry of data matters more than you think. Leland’s background in algebraic topology and pure mathematics gives him a unique perspective: he sees data as geometric structures waiting to be understood. This isn’t just an academic exercise. It’s the foundation for practical tools that have transformed how data scientists explore unstructured data, from text and images to videos and system logs.

Build composable Lego bricks, not monolithic solutions. Leland views algorithms as assemblies of interchangeable components. He designed libraries like PyNNDescent to be reusable building blocks he could leverage across projects, and he describes his libraries as being built from components that can be modified and swapped, such as HDBSCAN’s modular stages for density estimation, connectivity, tree building, and cluster extraction. The key insight: if you can decompose an algorithm into its fundamental pieces, you can adapt it to solve problems its original creators never imagined.

Visualization isn’t just pretty pictures. It’s enlightenment. When Leland works with domain experts on their datasets, their first question upon seeing a UMAP visualization is almost always: “What’s that and why is it there?” That moment of discovering something unexpected in your own data is what drives exploration forward. This insight led him to build Datamapplot, which packages all his learned tweaks and optimizations into a tool that produces publication-ready visualizations out of the box, democratizing access to the kind of polished outputs that previously required days of matplotlib fiddling.

Why it matters: Whether you’re doing topic modeling, analyzing research data, or creating computational art, Leland’s tools and philosophy remind us that the most powerful insights come not from having all the answers, but from having the right tools to ask better questions. His approach demonstrates how breakthrough innovations often emerge from unexpected connections, from seeing familiar problems through a new lens.


Career Journey and Background

Leland McInnes grew up with mathematics: his father was a math professor at the University of Canterbury in New Zealand. When he headed to university, he secured direct entry to second-year math and physics, but quickly discovered he preferred the math. After completing pure mathematics studies focused on topology and algebra, he took an unconventional detour:

“I actually took a break from academia, went and worked in industry and with government for a couple years before heading back to do a PhD so I have a little bit of experience with actually working with real data as opposed to pure theoretical math.”

This combination of theoretical rigor and practical experience would prove essential. At the Tutte Institute for Mathematics and Computing, a Canadian research institute, Leland gravitated toward data science and machine learning questions despite his pure math background.

The Geometry of Everything

His mathematical lens shapes everything he builds:

“My background was in topology and algebra. So I tend to think of things in terms of algebraic structures and the geometry of things. So I’m always thinking in terms of the geometry of the data.”

This geometric perspective is more than aesthetic; it’s fundamental to unsupervised learning. When you’re trying to figure out the underlying structure of data without predefined answers, thinking about the shape and relationships in high-dimensional space becomes crucial.

The UMAP and HDBSCAN Ecosystem

HDBSCAN: Density-Based Clustering That Works

Leland began with clustering algorithms over a decade ago, working with lower-dimensional data. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) extends the classic DBSCAN algorithm but handles variable densities. This is a critical improvement for real-world data that doesn’t cluster uniformly.

The breakthrough wasn’t inventing a new algorithm from scratch, but making it practical:

“I really read their paper and wrote a very basic implementation that was quite slow and was like, this does a better job of clustering than anything else. And then I just read some other papers, other random papers and was like, well, if I take this paper and this paper and just smush them together, then it’ll go faster.”

The key was optimizing nearest neighbor search, moving from O(n²) to O(n log n) complexity using space-partitioning trees. This “magpie approach” (grabbing the best pieces from different papers and gluing them together) would become a pattern in Leland’s work.
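To make the complexity point concrete, here is a minimal sketch of a space-partitioning tree at work, using scikit-learn’s KDTree as a familiar stand-in (the data and parameters are illustrative, not drawn from the episode):

```python
import numpy as np
from sklearn.neighbors import KDTree

# Toy data: 10,000 points in 3 dimensions (the low-dimensional regime where trees shine).
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 3))

# Building the tree costs roughly O(n log n); each query is then roughly O(log n),
# instead of scanning all n points for every query as in the naive O(n^2) approach.
tree = KDTree(X)
distances, indices = tree.query(X, k=5)  # 5 nearest neighbors of every point
```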

UMAP: Beyond t-SNE

As Leland’s work expanded to higher-dimensional data, he encountered a limitation: most clustering algorithms quietly assume data has relatively few dimensions (two to maybe fifty). Modern vectorization techniques produce hundreds or thousands of dimensions. He needed dimensionality reduction that would work.

t-SNE, introduced in 2008 by Laurens van der Maaten and Geoffrey Hinton, had demonstrated the power of nonlinear dimensionality reduction for visualization. But Leland saw room for improvement:

“I was very impressed with how effective that approach was and so I just wanted to build a method that would work with the math and theory that I was familiar with.”

UMAP (Uniform Manifold Approximation and Projection) emerged from combining:

  • Fuzzy simplicial sets (building on a paper by David Spivak)

  • Topological data analysis concepts

  • Graph-based representations of high-dimensional data

  • Efficient optimization techniques

The results: faster than t-SNE, better preservation of global structure (how clusters relate to each other), and a theoretical foundation Leland could extend in multiple directions.
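In practice, the umap-learn library wraps all of this in a scikit-learn-style interface. A minimal sketch, using the digits dataset purely as an illustrative stand-in:

```python
import umap
from sklearn.datasets import load_digits

digits = load_digits()

# Reduce 64-dimensional digit images to 2 dimensions for visualization.
# n_neighbors balances local versus global structure; min_dist controls how
# tightly points are packed together in the low-dimensional embedding.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(digits.data)  # shape: (n_samples, 2)
```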

The Composability Philosophy

What makes UMAP particularly powerful is its modular design:

“There’s constructing a representation of the high dimensional data in some sort of graph based way... there’s how you’re going to optimize a low dimensional representation. Again, this is just components and pieces.”

You can swap components: embed onto a torus instead of a plane for periodic data, use different distance metrics, embed as Gaussians with uncertainty rather than points. The extensive hyperparameter set isn’t feature creep. It’s exposing the compositional structure so users can adapt the algorithm to their specific needs.
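A hedged sketch of what that swapping looks like in umap-learn: both the input metric and the output space are parameters, so the same pipeline can compare text vectors by angle and embed them onto a sphere via the haversine output metric rather than a flat plane (the parameter values here are illustrative, not recommendations):

```python
import umap

# Swap the input metric: compare high-dimensional vectors by angle, not length.
# Swap the output space: embed onto a sphere instead of the usual 2D plane
# (a custom callable output_metric can similarly target a torus for periodic data).
spherical_reducer = umap.UMAP(
    metric="cosine",            # distance used in the high-dimensional space
    output_metric="haversine",  # distance used in the low-dimensional embedding
    n_components=2,
    random_state=42,
)
# embedding = spherical_reducer.fit_transform(X)  # X: your (n_samples, n_features) vectors
```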

From Theory to Practice: The Complete Pipeline

Vectorizers: Getting to Vectors

Before you can explore unstructured data, you need to represent it mathematically:

“The first step is just getting the data into a useful representation... vectorizing the data. Now there’s a lot of neural embedding techniques and they work super well, so I don’t have too many solutions on that front. But if you have other kinds of data, we have a library called Vectorizers that tries to deal with some of the ways of getting other kinds of unstructured data into vector formats.”

Once everything is vectors, distance becomes meaningful and you have geometry to explore.
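As a simple illustration of that first step, here is a sketch using scikit-learn’s TfidfVectorizer as a stand-in (the Vectorizers library Leland mentions targets other kinds of unstructured data; this only shows text becoming vectors with a meaningful distance):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = [
    "UMAP preserves local and global structure",
    "HDBSCAN clusters data with varying densities",
    "Dimensionality reduction helps visualize embeddings",
]

# Turn raw text into a sparse document-term matrix of TF-IDF weights.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Once everything is a vector, distance (and therefore geometry) is meaningful.
pairwise = cosine_distances(X)  # (3, 3) matrix of document-to-document distances
```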

PyNNDescent: The Reusable Building Block

Nearest neighbor search appears everywhere in Leland’s work: in HDBSCAN, in UMAP, and in his new clustering library EVōC. Rather than reimplementing it repeatedly:

“I needed a thing that would do that in the ways that I needed it done... I wanted to be able to do anything, and I needed to be able to work with sparse data structures as well.”

PyNNDescent packages approximate K-nearest neighbor graph construction efficiently, handling diverse metrics and sparse data. It’s designed for reuse, embodying the principle that good tools should be composable.
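A minimal sketch of how PyNNDescent is typically used; the data is synthetic and the parameter values are illustrative:

```python
import numpy as np
from pynndescent import NNDescent

rng = np.random.default_rng(0)
data = rng.normal(size=(5_000, 128))    # e.g. 128-dimensional embedding vectors
queries = rng.normal(size=(10, 128))

# Build an approximate k-nearest-neighbor index; alternative metrics and
# sparse input matrices are also supported.
index = NNDescent(data, metric="cosine", n_neighbors=30)
neighbor_indices, neighbor_distances = index.query(queries, k=10)
```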

Datamapplot: Making Visualization Accessible

After seeing workshop participants struggle to create polished visualizations of their UMAP outputs, Leland had an epiphany:

“It’s not fair that I have the time to sit down and tweak all the parameters on the plot, but these other people, they don’t have that experience of having done a lot of these kinds of plots and they don’t have the time to fiddle with matplotlib for a couple days to make the plot look just, just so.”

Datamapplot packages his accumulated plotting wisdom into a library that produces beautiful static or interactive visualizations out of the box. It handles tedious details like spacing, labeling, and color choices, and can make the difference between a rough draft and a publication-ready figure.
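A hedged sketch of the intended workflow, assuming datamapplot’s create_plot entry point and using placeholder coordinates and labels in place of a real UMAP/HDBSCAN output:

```python
import numpy as np
import datamapplot

# Placeholder inputs: 2D coordinates (e.g. from UMAP) and one label string per point
# (e.g. topic names derived from HDBSCAN clusters).
rng = np.random.default_rng(0)
coords = rng.normal(size=(1_000, 2))
labels = np.array(["Topic A", "Topic B"])[rng.integers(0, 2, size=1_000)]

# One call handles label placement, color choices, and spacing,
# producing a polished static figure out of the box.
fig, ax = datamapplot.create_plot(coords, labels)
fig.savefig("data_map.png", bbox_inches="tight")
```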

Key Insights and Highlights

There Is No “Ground Truth” in Unsupervised Learning

Unlike supervised learning with clear accuracy metrics, unsupervised learning resists simple evaluation:

“People expect the clustering algorithm to produce like the true clusters, but there aren’t any singular true clusters. It depends on what kind of things you want to get out. And so expecting there to be some magical true answer and these other algorithms are approximating or how good do they compare to the true clusters? There isn’t a single right answer.”

This is frustrating but liberating. “If it’s doing what you need, that’s good enough.” The art lies in understanding what you need.

Decompose Everything

Leland sees algorithms as assemblies, not monoliths:

“HDBSCAN [is] a pile of Lego bricks. There are a bunch of different things. So there’s a density estimation step, there’s a connectivity related step, there’s building a tree of clusters step, and then there’s a cluster extraction step.”

Even LDA (Latent Dirichlet Allocation) for topic modeling can be decomposed:

“You can also just view it as a matrix factorization algorithm and decompose it into the parts of how matrix factorization works... if you had count data that was much more Poisson distributed, well then you need a prior for the Poisson, there’s gamma distributions for that. You could build an LDA-like algorithm pretty easily that would work for an entirely different data type.”

This mindset (understanding components rather than treating algorithms as black boxes) is what enables innovation.
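A small sketch of that matrix factorization view using scikit-learn: both LDA and NMF factor a document-term count matrix into a document-topic matrix and a topic-word matrix, so swapping one for the other is just swapping a component (the tiny corpus and parameters are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "topology and geometry of data",
    "clustering with density estimates",
    "topic models factor count matrices",
]

# Document-term count matrix: the thing being factorized.
counts = CountVectorizer().fit_transform(docs)

# LDA viewed as a factorization: doc_topics @ topic_words approximates counts.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # (n_docs, n_topics)
topic_words = lda.components_            # (n_topics, n_terms)

# Swap the factorization component (different priors and assumptions), keep the rest.
nmf_doc_topics = NMF(n_components=2, init="nndsvd").fit_transform(counts)
```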

The Power of Seeing Your Data

Visualization creates moments of discovery:

“I’ve worked with domain experts on datasets that they care about, that I realize I know nothing about, but we can work through, get it to a plot, stick it in front of them. And their first question is almost always, what’s that and why is it there? Because there’s something new in the data that they didn’t realize was there.”

Interactive plotting amplifies this: once you see something interesting, you need tools to dig into it immediately.

The Search Problem

Modern tools excel at search, but search assumes you know what you’re looking for:

“One of the biggest difficulties with unstructured data right now is that we have great tools for search and picking through, but that assumes you already know what you’re looking for. And a lot of the time you don’t know what you’re looking for, especially if it’s a brand new dataset and you have no idea what’s in it.”

Unsupervised learning tools complement search by revealing the big picture, helping you ask better questions rather than just answering known questions.

Unexpected Applications and Impact

COVID-19 Research

In 2020, UMAP appeared repeatedly in pandemic research:

“As a mathematician, that was not something that I felt I would ever be able to contribute to. And yet at the same time, like that was at the start of the pandemic when everything was panic stations. It was really interesting to see that I had managed to do something that helped some people somehow on solving this problem.”

Computational Art

Artists like Refik Anadol use UMAP among other ML tools to create art:

“That’s not a use case that I ever had in mind... the outputs that you can get are quite beautiful.”

Transforming Topic Modeling

Modern topic modeling workflows (exemplified by BERTopic) use UMAP for dimensionality reduction and HDBSCAN for clustering by default. Leland’s ecosystem has become the infrastructure layer for exploring textual and other unstructured data.
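A hedged sketch of that default-style pipeline with BERTopic, passing UMAP and HDBSCAN in explicitly so the components stay visible (the document list is a placeholder and the parameter values are illustrative):

```python
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

docs = ["..."]  # placeholder: in practice, a list of many documents

# Make the pipeline explicit: embed -> reduce (UMAP) -> cluster (HDBSCAN) -> topics.
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probabilities = topic_model.fit_transform(docs)
```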

Broader Perspectives

Don’t Follow the Hype

For aspiring researchers and data scientists:

“Don’t follow the hype in the same direction that everyone else is going because you’re not going to make a dent in the field that everyone is already working on. Go do whatever is interesting to you that isn’t necessarily the hype thing and be good at something that you’re good at and that’s probably going to be good enough. The time will come when whatever you’re working on will come around.”

Interdisciplinary Spaces Create Value

“If you can find some time to stretch into subject areas that people aren’t otherwise necessarily working on, that can make a big difference... just being able to have enough of a stretch to grab things from different fields that other people aren’t putting together. That’s often what you need to create something new.”

Simplicity Over Complexity

On sentence embeddings and transformer models:

“I’m a fan of what’s the simplest possible thing you can do that still does a good enough job. And I’d be really interested to know if you can produce sentence embeddings like 98% as good as the transformer model with something that’s just way simpler and easier to understand because the internal workings of what that pre-trained model has learned, that’s harder to pick apart.”

RAG Isn’t a Silver Bullet

On the hype around retrieval-augmented generation:

“The idea, I’ll just vectorize everything and I’ll just find the most relevant documents... then you haven’t really used cosine similarity... Information retrieval is really hard. It’s hard. And people have been working on it for a long time.”

Real solutions require incorporating semantic patterns, lexical patterns, metadata, and multiple filtering strategies. The hard work that preceded LLMs still matters.

The State of Unsupervised Learning

“I think there’s a whole lot of scope for work to be done still in unsupervised learning... I look at the state of things and I’m like, this could all be so much better. It’s not like I have the answers for how to make it better, but I can definitely see lots of room and directions for improvement.”

Future Directions

Temporal Topic Modeling

Extending topic modeling to handle data changing over time remains challenging. Leland has been exploring Mapper (from topological data analysis) and recent extensions that could address this problem.

Evolving UMAP Embeddings

Current UMAP retraining gives qualitatively similar but rotated/flipped results. The transform method fits new data into the existing embedding but can’t adapt when truly new clusters appear:

“If a new cluster of data actually shows up, it’s just going to squeeze it in amongst all the other data that’s there.”

Making UMAP evolve dynamically as new data arrives (adapting the embedding rather than forcing new points into the old structure) is an active area of development.
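The current behavior described above corresponds to UMAP’s fit/transform split; a minimal sketch with placeholder arrays standing in for old and newly arriving data:

```python
import numpy as np
import umap

# Placeholder data: yesterday's vectors and today's new arrivals.
rng = np.random.default_rng(0)
existing_data = rng.normal(size=(2_000, 50))
new_data = rng.normal(size=(200, 50))

# Fit an embedding on the data you have today.
mapper = umap.UMAP(n_neighbors=15, random_state=42).fit(existing_data)

# New points are projected into the existing structure: fine when they resemble
# what was seen before, but a genuinely new cluster gets squeezed in among the
# old ones rather than opening up space of its own.
new_coords = mapper.transform(new_data)
```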

Simpler Embeddings

Do we really need massive transformer models for every task?

“Do you need the full power of a giant transformer based neural network to make that work?... I’d be really interested to know if you can produce sentence embeddings like 98% as good as the transformer model with something that’s just way simpler and easier to understand.”

Lessons from Research and Life

Embrace Ignorance

Working with domain experts on their datasets taught Leland humility:

“There’s a whole lot that I know almost nothing about. In fact, almost everything I know almost nothing about. And that’s always worth keeping in mind, is how many other different things there are out there and how little you know.”

You Don’t Need All the Answers

“Research also taught me I definitely don’t have all the answers and that’s okay. You’ve got to learn to live with that. You don’t have all the answers but maybe you can get there.”

Be Patient with Problems

“You just need to be patient with problems. Sometimes these things just need to sit in the back of your brain for a long time and then eventually I don’t know what subconscious process works back there but eventually answers do come out so just be patient.”

Concluding Thoughts

Leland’s work exemplifies a crucial principle in modern data science: the tools we build should empower exploration, not just execution. His libraries don’t just solve technical problems. They change how we think about our data. When you can visualize thousands of documents as a geometric landscape, when you can cluster without knowing the “right” number of clusters, when you can reduce dimensions while preserving meaning, you’re not just analyzing data differently. You’re asking different questions entirely.

The beauty of Leland’s approach is that it scales down as well as up. Whether you’re a researcher exploring a novel dataset, a practitioner building production systems, or a student learning the fundamentals, the philosophy remains the same: understand the pieces, question the assumptions, and don’t be afraid to combine ideas in ways no one else has tried.

As data continues to grow in volume and complexity, the need for tools that help us explore rather than just search becomes increasingly critical. Leland’s contributions remind us that sometimes the most practical solutions come from the most theoretical foundations, and that the future of data science belongs to those willing to look beyond the hype and build something genuinely useful.


Resources for Leland McInnes' Work and Libraries

Leland’s Github

UMAP
HDBSCAN
PyNNDescent
DataMapPlot

Chapters
  • 00:00 Understanding Unstructured Data Challenges

  • 03:02 The Journey into Mathematics and Machine Learning

  • 05:57 Exploring Unsupervised Learning

  • 09:10 The Role of Libraries in Data Processing

  • 11:53 Advancements in Clustering Algorithms

  • 14:46 The Breakthrough of UMAP

  • 18:06 Evaluating Dimensionality Reduction Techniques

  • 20:51 Unexpected Applications of UMAP

  • 24:04 The Importance of Visualization in Data Science

  • 28:16 Iterative Processes in Data Analysis

  • 35:12 Decomposing Algorithms for Better Understanding

  • 38:20 The Hype vs. Reality of AI and Machine Learning

  • 43:13 Unanswered Questions in Machine Learning

  • 45:35 Advice for Aspiring Data Scientists

  • 50:13 Lessons from a Career in Research


Glossary of Key Terms

Algebraic Topology: A branch of mathematics that uses tools from abstract algebra to study topological spaces, focusing on properties preserved under continuous deformations.

Approximate K-Nearest Neighbors (KNN): Algorithms that find the k closest points to a query point in high-dimensional space quickly by accepting approximate rather than exact results.

Black Box: An algorithm or system whose internal workings are not understood or accessible, treated only through its inputs and outputs.

Clustering: The task of grouping data points so that points in the same group are more similar to each other than to those in other groups.

Composability: The ability to combine simple components or functions to build more complex systems, like Lego bricks.

Cosine Similarity: A measure of similarity between two vectors based on the cosine of the angle between them, commonly used in information retrieval.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): A clustering algorithm that groups together points that are closely packed together, marking points in low-density regions as outliers.

Density-Based Clustering: Clustering methods that identify clusters as regions of high point density separated by regions of low density.

Dimensionality Reduction: Techniques for reducing the number of features/dimensions in data while preserving important structure and relationships.

Domain Expert: Someone with deep knowledge and expertise in a specific field or subject matter area.

Embedding: A representation of data (like words, sentences, or images) as vectors (numbers) in a continuous space where semantic similarity is preserved.

Euclidean Distance: The straight-line distance between two points in Euclidean space, calculated using the Pythagorean theorem.

Fuzzy Simplicial Sets: Mathematical structures that extend simplicial complexes (geometric objects) with fuzzy/probabilistic boundaries, used as theoretical foundation for UMAP.

Gaussian: A probability distribution with a bell curve shape, characterized by mean and variance; also called normal distribution.

Geometry of Data: Understanding data through spatial and structural relationships, treating high-dimensional datasets as geometric objects.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise): An extension of DBSCAN that can handle varying densities and builds a hierarchy of clusters.

High-Dimensional Data: Data with many features or dimensions (often hundreds or thousands), where traditional methods may struggle due to the “curse of dimensionality.”

Hyperparameter: A parameter whose value is set before training begins and controls the learning process or algorithm behavior (e.g., number of neighbors in UMAP).

Information Retrieval: The science of searching for and retrieving information, traditionally focused on text documents but applicable to any data.

Interactive Plotting: Visualizations that allow users to interact with the display (zoom, pan, click on points, etc.) to explore data dynamically.

K-Nearest Neighbors Graph: A graph where each point is connected to its k closest points, used as a foundation for many dimensionality reduction and clustering algorithms.

LDA (Latent Dirichlet Allocation): A probabilistic topic modeling algorithm that represents documents as mixtures of topics and topics as mixtures of words.

Manifold Learning: A class of dimensionality reduction techniques based on the assumption that high-dimensional data lies on or near a lower-dimensional manifold.

Mapper: A topological data analysis algorithm based on Morse theory that creates simplified representations of high-dimensional data structure.

Matrix Factorization: Decomposing a matrix into a product of matrices, used in many machine learning algorithms including collaborative filtering and topic modeling.

Metric: A function that defines distance between data points (e.g., Euclidean, cosine, Manhattan).

Minimum Spanning Tree (MST): A tree connecting all points in a graph with minimum total edge weight, used internally in HDBSCAN for clustering.

Morse Theory: A branch of mathematics studying the topology of manifolds by analyzing critical points of functions, used in Mapper algorithm.

Nearest Neighbor Search: Finding the closest point(s) to a query point, fundamental to many machine learning algorithms.

O(n log n): Computational complexity indicating that algorithm runtime grows proportionally to n times the logarithm of n, where n is the input size.

O(n²): Computational complexity indicating that algorithm runtime grows proportionally to the square of input size—much slower than O(n log n) for large n.

Outlier: A data point that differs significantly from other observations, potentially indicating noise, errors, or interesting anomalies.

Periodic Structure: Data that loops back on itself, like angles (0° and 360° are the same) or days of the week.

PyNNDescent: Python library for approximate nearest neighbor search using the nearest neighbor descent algorithm.

RAG (Retrieval-Augmented Generation): A technique combining information retrieval with language model generation to produce more accurate, grounded responses.

Semantic Similarity: How similar two pieces of text (or other data) are in meaning, as opposed to surface-level features.

Sentence Embedding: A fixed-length vector representation of a sentence that captures its semantic meaning.

Space-Partitioning Tree: A data structure that recursively divides space into regions to enable efficient nearest neighbor search.

Sparse Data: Data where most values are zero or absent, often represented in compressed formats.

Supervised Learning: Machine learning where the algorithm learns from labeled training data with known correct outputs.

t-SNE (t-Distributed Stochastic Neighbor Embedding): A dimensionality reduction technique particularly good for visualization, which inspired UMAP.

Tabular Data: Data organized in rows and columns like a spreadsheet or database table.

Temporal Topic Modeling: Topic modeling that accounts for how topics evolve and change over time.

Topology: The mathematical study of properties preserved under continuous deformations (stretching, bending, but not tearing or gluing).

Topological Data Analysis (TDA): Applying topological and geometric techniques to analyze the shape of data.

Torus: A doughnut-shaped surface, useful for embedding data with two periodic dimensions.

Transformer Model: A neural network architecture (introduced 2017) using self-attention mechanisms, forming the basis of most modern language models.

UMAP (Uniform Manifold Approximation and Projection): A dimensionality reduction algorithm for visualization and preprocessing, preserving both local and global structure.

Unsupervised Learning: Machine learning where the algorithm finds patterns in unlabeled data without predefined correct answers.

Vectorization: Converting data (text, images, audio) into numerical vector representations that algorithms can process.

Visualization: Creating graphical representations of data to reveal patterns, outliers, and structure that may not be apparent in raw numbers.
