@@ -746,17 +746,18 @@ by black points below.

  .. topic:: Implementation

- The algorithm is non-deterministic, but the core samples will
- always belong to the same clusters (although the labels may be
- different). The non-determinism comes from deciding to which cluster a
- non-core sample belongs. A non-core sample can have a distance lower
- than ``eps`` to two core samples in different clusters. By the
+ The DBSCAN algorithm is deterministic, always generating the same clusters
+ when given the same data in the same order. However, the results can differ when
+ data is provided in a different order. First, even though the core samples
+ will always be assigned to the same clusters, the labels of those clusters
+ will depend on the order in which those samples are encountered in the data.
+ Second and more importantly, the clusters to which non-core samples are assigned
+ can differ depending on the data order. This would happen when a non-core sample
+ has a distance lower than ``eps`` to two core samples in different clusters. By the
  triangular inequality, those two core samples must be more distant than
  ``eps`` from each other, or they would be in the same cluster. The non-core
- sample is assigned to whichever cluster is generated first, where
- the order is determined randomly. Other than the ordering of
- the dataset, the algorithm is deterministic, making the results relatively
- stable between runs on the same data.
+ sample is assigned to whichever cluster is generated first in a pass
+ through the data, and so the results will depend on the data ordering.

  The current implementation uses ball trees and kd-trees
  to determine the neighborhood of points,
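The order dependence described in the added paragraph can be reproduced with a small sketch. This is a minimal pure-Python illustration, not scikit-learn's implementation: the function `dbscan_1d`, the one-dimensional sample points, and the values `eps=1.0` and `min_samples=4` are all assumptions chosen for the example. It assumes a point counts itself among its neighbors and that a non-core point joins the first cluster that reaches it.

```python
def dbscan_1d(points, eps, min_samples):
    """Hypothetical 1-D DBSCAN sketch; returns labels in input order, -1 = noise."""
    n = len(points)
    # Brute-force neighborhoods; each point is its own neighbor.
    neighbors = [
        [j for j in range(n) if abs(points[i] - points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):  # a single pass through the data, in input order
        if labels[i] != -1 or not is_core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # non-core points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Two dense groups with one non-core point (0.0) within eps of a core
# sample on each side (-0.9 and 0.9, which are 1.8 apart).
points = [-1.8, -1.5, -1.2, -0.9, 0.0, 0.9, 1.2, 1.5, 1.8]
forward = dict(zip(points, dbscan_1d(points, eps=1.0, min_samples=4)))

rev_points = points[::-1]
backward = dict(zip(rev_points, dbscan_1d(rev_points, eps=1.0, min_samples=4)))

print(forward[0.0] == forward[-0.9])    # True: joins the left cluster
print(backward[0.0] == backward[0.9])   # True: joins the right cluster
```

In both runs the core samples are grouped identically, but the cluster labels swap and the non-core point at `0.0` changes sides, which matches the two effects the new text lists.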