Title: Pruning Nearest Neighbor Cluster Trees Abstract: Clustering methods are widely used in data analysis, yet the problem of clustering is often ill-defined in practice. More precisely, a typical dataset contains empirical structures which we might want to view as 'clusters', and some we might want to view as 'noise'. Even when we assume the existence of a 'ground truth' clustering -under a particular formalism of our choice-, the ground truth is unknown in practice and it remains unclear which particular empirical structures obey the ground truth and which are spurious artifacts of the noisy data. A natural formalism proposed by Hartigan (1981) is one where clusters are regions of high density of the data. Here the data is assumed to be drawn i.i.d from some distribution F with density f, and clusters are simply the connected components (CCs) of the level sets of f. The collection of CCs over all level sets forms a nested hierarchy which is called the 'cluster tree of f'. The cluster tree is the unknown ground truth, and we will show that there is now a clean way to define and subsequently identify spurious structures. We show two main results. First we show that, given finite data, the cluster tree of an unknown distribution is well approximated by the empirical tree formed by certain subgraphs of nearest neighbor (kNN) graphs. Our finite sample bounds automatically imply consistency, i.e. any level of the underlying cluster tree is eventually recovered given enough data. The second and perhaps most important result is a finite sample guarantee about the pruning of spurious structures. We carefully work out the tradeoff between aggressive and conservative pruning and are able to guarantee the removal of all spurious structures in the empirical tree while at the same time guaranteeing the recovery of salient clusters. Consistency is maintained despite pruning. These are the first such finite sample pruning guarantees in the context of clustering. This talk is based on joint work with Ulrike von Luxburg.