Batch clustering algorithms that don't require the number of clusters to be pre-specified
I am training an embedding model on a classification dataset with ~20k classes. The goal is to use the embeddings to cluster a much larger set of data in a way that would extend the original classification dataset. I am using a hierarchical clustering method, and it is working well on subsets of the data.
The problem is this: I have ~3.3 million 756-dimensional embeddings that I need to cluster. My hierarchical clustering method uses far too much memory at that scale, so I would like to do the clustering in batches. However, hierarchical clustering does not work when batched, so I'm looking for a clustering method that:
- Does not require pre-specification of the number of clusters
- Can be run in batches such that the batched result is the same as the result of running on the whole dataset at once
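
For a sense of scale, the memory involved can be estimated roughly. The sketch below rests on assumptions that are not stated above: the embeddings are stored as float32, and the hierarchical clustering implementation materializes a condensed pairwise distance matrix in float64 (as many standard agglomerative implementations do). The helper names are hypothetical and only for illustration.

```python
# Back-of-the-envelope memory estimate (assumptions: float32 embeddings,
# float64 condensed pairwise distance matrix; helper names are illustrative).

def embedding_bytes(n_points: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the raw embedding matrix."""
    return n_points * dim * bytes_per_value


def condensed_distance_bytes(n_points: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for all n*(n-1)/2 pairwise distances."""
    return n_points * (n_points - 1) // 2 * bytes_per_value


n_points = 3_300_000  # ~3.3 million embeddings
dim = 756             # embedding length

print(f"embeddings:         {embedding_bytes(n_points, dim) / 1e9:.1f} GB")
print(f"pairwise distances: {condensed_distance_bytes(n_points) / 1e12:.1f} TB")
```

Under those assumptions the embeddings themselves take about 10 GB, while the full set of pairwise distances would take roughly 43.6 TB, which is why the whole dataset cannot be clustered in one pass this way.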
Can anyone suggest a path forward? Thanks!