Commit cc81015

jbednar authored and jnothman committed
[MRG] Clarified that DBSCAN is deterministic apart from data ordering (scikit-learn#7852)
1 parent f7620be commit cc81015

1 file changed, +10 -9 lines changed

doc/modules/clustering.rst

@@ -746,17 +746,18 @@ by black points below.
 
 .. topic:: Implementation
 
-    The algorithm is non-deterministic, but the core samples will
-    always belong to the same clusters (although the labels may be
-    different). The non-determinism comes from deciding to which cluster a
-    non-core sample belongs. A non-core sample can have a distance lower
-    than ``eps`` to two core samples in different clusters. By the
+    The DBSCAN algorithm is deterministic, always generating the same clusters
+    when given the same data in the same order. However, the results can differ when
+    data is provided in a different order. First, even though the core samples
+    will always be assigned to the same clusters, the labels of those clusters
+    will depend on the order in which those samples are encountered in the data.
+    Second and more importantly, the clusters to which non-core samples are assigned
+    can differ depending on the data order. This would happen when a non-core sample
+    has a distance lower than ``eps`` to two core samples in different clusters. By the
     triangular inequality, those two core samples must be more distant than
     ``eps`` from each other, or they would be in the same cluster. The non-core
-    sample is assigned to whichever cluster is generated first, where
-    the order is determined randomly. Other than the ordering of
-    the dataset, the algorithm is deterministic, making the results relatively
-    stable between runs on the same data.
+    sample is assigned to whichever cluster is generated first in a pass
+    through the data, and so the results will depend on the data ordering.
 
     The current implementation uses ball trees and kd-trees
     to determine the neighborhood of points,
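
The order dependence described in the new wording can be illustrated with a small, self-contained sketch. This is a toy 1-D DBSCAN written for illustration only, not scikit-learn's implementation: the function names ``dbscan_1d`` and ``cluster_by_value``, the sample values, and the settings ``eps=10`` and ``min_samples=4`` are all assumptions chosen so that the point ``18`` is a non-core sample within ``eps`` of a core sample in each of two clusters. The sketch also assumes that a point counts towards its own neighborhood and that neighbors are those at distance at most ``eps``; the data contains no ties at exactly ``eps``, so that choice does not affect the result here.

```python
def dbscan_1d(points, eps, min_samples):
    """Toy DBSCAN on 1-D values; returns one label per point (-1 = noise)."""
    n = len(points)
    # eps-neighborhood of each point (the point itself is included).
    neighbors = [
        [j for j in range(n) if abs(points[i] - points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [-1] * n
    next_label = 0
    for seed in range(n):  # a single pass through the data, in input order
        if labels[seed] != -1 or not is_core[seed]:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = next_label  # first cluster to reach q keeps it
                    if is_core[q]:
                        stack.append(q)  # only core samples expand a cluster
        next_label += 1
    return labels


def cluster_by_value(points, eps=10, min_samples=4):
    """Map each (unique) point value to its cluster label."""
    return dict(zip(points, dbscan_1d(points, eps, min_samples)))


# Two dense groups with one non-core point (18) between them.
points = [0, 3, 6, 9, 18, 27, 30, 33, 36]
forward = cluster_by_value(points)         # original order
backward = cluster_by_value(points[::-1])  # same data, reversed order

print(forward[18] == forward[9])     # True: 18 joins the left group
print(backward[18] == backward[27])  # True: 18 joins the right group
print(forward[0], backward[0])       # 0 1: same core grouping, different label
```

In both runs the core samples are partitioned identically; only the label numbers and the assignment of the non-core point change, and repeating either run with the same ordering gives the same result.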

0 commit comments
