RANDOM CENTROID INITIALIZATION FOR IMPROVING CENTROID-BASED CLUSTERING

A method for improving centroid-based clustering is suggested. The improvement is built on diversification of the k-means++ initialization. The k-means++ algorithm, claimed to be a better version of k-means, is tested in a computational set-up where the dataset size, the number of features, and the number of clusters are varied. The statistics obtained from this testing show that, in roughly 50 % of clustering instances, k-means++ outputs worse results than k-means with random centroid initialization. The impact of the random centroid initialization solidifies as both the dataset size and the number of features increase. In order to reduce the possible underperformance of k-means++, the k-means algorithm is run on a separate processor core in parallel to the k-means++ algorithm, whereupon the better result is selected. The number of k-means++ runs is set to be not less than the number of k-means runs. By incorporating the seeding method of random centroid initialization, the k-means++ algorithm gains about 0.05 % accuracy in every second clustering instance.


Initialization in centroid-based clustering
The centroid-based clustering problem is to partition N data points (observations, objects) into k clusters (groups) by minimizing the sum of within-cluster squared Euclidean distances (Gonzalez, 1985; Hartigan & Wong, 1979; Ikotun et al., 2023). Centroid-based clustering, although a specific field in cluster analysis, has many practical applications (Ostrovsky et al., 2006; Phillips, 2002; Mahajan, Nimbhorkar, & Varadarajan, 2009). Clustering flat objects (which have two features) is often perceived as a metric facility location problem (Li, 2011; Megiddo & Tamir, 1982). The task of this problem is to find the best warehouse locations to optimally service a given set of consumers, whose locations are taken as the data to be clustered, with warehouses seen as cluster centers (centroids) (MacQueen, 1967; Romanuke, 2018b; Mahajan, Nimbhorkar, & Varadarajan, 2009; Jafar et al., 2021). For instance, centroid-based clustering is used to rationally assign mobiles (consumers) to base stations (which, in a more rigorous manner, are referred to as centroids) of a wireless communication network (Romanuke, 2019). Mounting locations of base stations also can be determined by centroid-based clustering. It is also invoked to build complex models of decision-making (Jafar & Saeed, 2022).
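To keep the objective explicit, a minimal formal statement of this minimization (with notation introduced here only for illustration) is

$$\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \left\| \mathbf{x}_i - \mathbf{c}_j \right\|_2^{2}, \qquad \mathbf{c}_j = \frac{1}{\left| C_j \right|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i ,$$

where the clusters $C_1, \dots, C_k$ partition the $N$ data points and $\mathbf{c}_j$ is the centroid of cluster $C_j$.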
In practice, the fastest method for centroid-based clustering is the k-means algorithm, which is an efficient heuristic (Lloyd, 1982; Bottou & Bengio, 1994; Hamerly, 2010; Kanungo et al., 2002). The algorithm quickly converges to a local optimum (an approximate minimum), so it is usually run multiple times and the best approximate minimum is selected (Fränti & Sieranoja, 2019; Celebi et al., 2013; Kanungo et al., 2004). The k-means problem is solved using either Lloyd's or Elkan's algorithm, where the latter is more efficient on dense data because it exploits the triangle inequality (Ostrovsky et al., 2006; Vattani, 2011; Press et al., 2007).
While k-means chooses k initial cluster centroids at random, the k-means++ algorithm initializes the centroids in a specific way. It uses a heuristic to find centroid seeds for k-means clustering. According to Arthur & Vassilvitskii (2007), k-means++ improves both the running time of Lloyd's algorithm and the approximate minimum of the sum of within-cluster squared Euclidean distances. Nevertheless, it is easy to see that the k-means++ advantage does not hold for every centroid-based clustering problem. The advantage works on average. For example, at some random state (determining pseudorandom number generation), a dataset of 2500 points scattered uniformly within a unit square, with half of a standard normal noise added, is partitioned into 24 clusters more accurately by initializing the 24 centroids at random, where 50 runs of each algorithm are used. Indeed, an approximate minimum of 119.0397 is obtained by the random initialization within 677.9 milliseconds on a dual-core Intel Core i5-7200U @ 2.50 GHz processor, whereas the k-means++ algorithm takes about 777.8 milliseconds to reach only an approximate minimum of 120.947 (see Figure 1, where the centroids are marked with circles). Thus, in this particular counterexample, k-means++ is 1.577 % less accurate and 12.84 % slower. This is a considerable loss in both accuracy and computational speed.
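A comparison of this kind can be sketched with scikit-learn by contrasting init="random" with init="k-means++" over 50 restarts each. The dataset below only mimics the description above (uniform points in the unit square plus half of a standard normal noise), so the exact scores, timings, and random state of the counterexample are not reproduced:

```python
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)  # arbitrary random state, not the one used in the paper
n_points, n_clusters, n_runs = 2500, 24, 50

# 2500 points scattered uniformly in the unit square plus half of standard normal noise
X = rng.uniform(0.0, 1.0, size=(n_points, 2)) + 0.5 * rng.standard_normal((n_points, 2))

def best_score(init, runs):
    """Run KMeans `runs` times with the given initialization and return the best score and time."""
    t0 = time.perf_counter()
    km = KMeans(n_clusters=n_clusters, init=init, n_init=runs, max_iter=300, tol=1e-4)
    km.fit(X)
    return km.inertia_, time.perf_counter() - t0

score_rand, time_rand = best_score("random", n_runs)
score_pp, time_pp = best_score("k-means++", n_runs)
print(f"random   : {score_rand:.4f} in {time_rand:.3f} s")
print(f"k-means++: {score_pp:.4f} in {time_pp:.3f} s")
```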
Therefore, the random centroid initialization (which essentially is the k-means algorithm) in centroid-based clustering can significantly outperform k-means++. But what is the random initialization performance on average? How does it relate to the k-means++ performance, i.e., what is the relationship between these two approaches? These questions are still open and need to be answered in order to ascertain the realistic extent of the disadvantage of "careful seeding".

Motivation and goal
An average advantage in performance has a general problem (that might be punned as a disadvantage): it does not guarantee that the advantage occurs in every single instance (Chakrabarty & Swagatam, 2022; Arthur & Vassilvitskii, 2006). The k-means++ algorithm has been believed to outperform other approaches to centroid-based clustering, including k-means with random centroid initialization. The k-means++ underperformance in the particular counterexample in Figure 1 may be just an occasional outlier instance, but it nonetheless suggests that such outliers do exist. Does the pseudorandom number generator state influence it? Maybe the counterexample is just an occurrence that appeared at 50 algorithm runs and disappears at other numbers of runs? As a matter of fact, it does not. Figure 2 presents a plot of how the sum of within-cluster squared Euclidean distances changes versus the number of algorithm runs for the particular pseudorandom number generator state. It is clearly seen that, as the algorithm is run more times, k-means++ continues underperforming, finally reaching an approximate minimum of 119.2911 after 78 runs. Meanwhile, the randomly initialized centroids here make the sum of within-cluster squared Euclidean distances equal to 120.2943 at the very beginning. This is already better than 120.947, the k-means++ performance after 50 runs (Figure 1). An approximate minimum of 118.984 is reached after 56 runs, following the leap down from the preceding value of 119.0397 related to Figure 1. By starting with a different pseudorandom number generator state, the same dataset is clustered differently. Unlike the previous state, whose performance is visualized in Figure 2, this time the random centroid initialization starts worse than k-means++ (Figure 3), but eventually reaches an approximate minimum of 118.8904 after 146 runs. Despite a better start lasting until 69 runs, k-means++ still underperforms, reaching only an approximate minimum of 119.5583 after 38 runs (which is worse than that in Figure 2). Therefore, this confirms that the starting relationship between the performances can be broken as the algorithms are run more times.
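The "best score so far versus number of runs" curves of Figures 2 and 3 can be traced with a sketch like the following, where each restart is a single KMeans run and the running minimum of the score is recorded (variable names and random states are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (2500, 2)) + 0.5 * rng.standard_normal((2500, 2))

def running_minimum(init, max_runs, n_clusters=24, seed=0):
    """Best inertia observed after 1, 2, ..., max_runs single restarts."""
    best, curve = np.inf, []
    for state in range(seed, seed + max_runs):
        km = KMeans(n_clusters=n_clusters, init=init, n_init=1,
                    max_iter=300, tol=1e-4, random_state=state)
        best = min(best, km.fit(X).inertia_)
        curve.append(best)
    return curve

# Plotting these two curves against the run index gives the kind of
# comparison shown in Figures 2 and 3.
curve_pp = running_minimum("k-means++", 150)
curve_rand = running_minimum("random", 150)
```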
Proceeding from the possible underperformance of k-means++ in centroid-based clustering, the goal is to suggest a method by which the underperformance could be reduced by using the random centroid initialization. To meet the goal, the following five tasks are to be accomplished:
1. To formalize a computational set-up to gather statistics of the performance of the k-means++ algorithm versus k-means with random centroid initialization.
2. To study the statistics and suggest a method of how the random centroid initialization could improve centroid-based clustering to reduce the possible underperformance of k-means++.
3. To show the performance of the suggested method versus k-means++.
4. To discuss the practical applicability and scientific significance of the suggested method for centroid-based clustering.
5. To make an appropriate conclusion and give an outlook of how the research might be developed further.

Computational set-up
To gather statistics of the performance of the k-means++ algorithm versus k-means with random centroid initialization, a computational simulation is set up as follows. The number of algorithm runs, denoted by $A_{\mathrm{runs}}$, is 200. The maximal number of iterations (for every run) is set at 300. The number of dataset points to be clustered, denoted by $N$, is selected from a range (1) of six dataset sizes. The number of clusters, denoted by $k$, is selected from a range of five values, and the number of object features, denoted by $m$, is selected from 2 to 5. For every pair $(N, m)$, the dataset is generated as $N$ points, where the value of feature $l$ of point $i$ is
$$x_{li} = \xi_{li} + 0.5 \nu_{li},$$
in which $\xi_{li}$ is a value of a random variable distributed uniformly on the open interval $(0; 1)$, and $\nu_{li}$ is a value of a random variable distributed normally with zero mean and unit variance. The dataset is partitioned into $k$ clusters for every $k$ in the range. The performance includes both accuracy and computational (running) time. The accuracy is measured by the sum of within-cluster squared Euclidean distances
$$D = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \left\| \mathbf{x}_i - \mathbf{c}_j \right\|^2 \qquad (7)$$
for centroids $\mathbf{c}_j$ of clusters $C_j$, which can be referred to as a score. The scores and running times of k-means and k-means++ are denoted by $D_{\mathrm{rand}}$, $D_{++}$ and $T_{\mathrm{rand}}$, $T_{++}$, respectively. The relative tolerance regarding sum (7) is set at 0.0001 to declare convergence. The relative difference between the k-means and k-means++ performances
$$\mu_{D} = 100\,\% \cdot \frac{D_{++} - D_{\mathrm{rand}}}{D_{++}} \qquad (8)$$
and the relative difference between their running times
$$\mu_{T} = 100\,\% \cdot \frac{T_{++} - T_{\mathrm{rand}}}{T_{++}} \qquad (9)$$
are calculated as percentages, so that positive values imply that k-means++ is worse. Besides, both algorithms can be run one after another (in any order, but not in parallel) for $A^{*}_{\mathrm{runs}}$ runs each, whereupon the better of the two scores
$$D_{\mathrm{mix}} = \min \left\{ D_{\mathrm{rand}}, \, D_{++} \right\} \qquad (10)$$
is taken as the respective solution (herein, both the algorithms are used in a mix). The relative difference between the mix algorithm and k-means++ performances (11) and the relative difference between their running times (12) are calculated analogously to (8) and (9) and show how much minimum (10) improves the solution with respect to the k-means++ solution. In this set-up, the mix algorithm is configured with $A^{*}_{\mathrm{runs}} = 80$ runs of k-means and 80 runs of k-means++, i.e., 160 runs in total, which is done intentionally so as not to increase its running time with respect to the 200-run k-means++.
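A condensed sketch of one cell of this simulation, under the assumptions above about the data generation and the normalization of the relative differences (helper names such as relative_difference are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_dataset(n_points, n_features, rng):
    """Uniform points on (0, 1) per feature plus half of standard normal noise."""
    return (rng.uniform(0.0, 1.0, size=(n_points, n_features))
            + 0.5 * rng.standard_normal((n_points, n_features)))

def score(X, k, init, runs):
    """Best sum of within-cluster squared distances over `runs` restarts."""
    km = KMeans(n_clusters=k, init=init, n_init=runs, max_iter=300, tol=1e-4)
    return km.fit(X).inertia_

def relative_difference(d_pp, d_other):
    """Positive values mean k-means++ is worse (normalization by the k-means++ score is assumed)."""
    return 100.0 * (d_pp - d_other) / d_pp

rng = np.random.default_rng(0)
X = generate_dataset(10000, 3, rng)
k = 16
d_pp = score(X, k, "k-means++", 200)       # baseline: 200 runs of k-means++
d_rand = score(X, k, "random", 200)        # k-means with random initialization, 200 runs
d_mix = min(score(X, k, "random", 80),     # mix: 80 runs of each, 160 in total
            score(X, k, "k-means++", 80))
print(relative_difference(d_pp, d_rand), relative_difference(d_pp, d_mix))
```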

Statistics of the performance
As every instance to cluster has been generated for a triple $(N, m, k)$, with each triple repeated five times, the following data form arrays of size 6 × 4 × 5 × 5 for $A_{\mathrm{runs}} = 200$ and $A^{*}_{\mathrm{runs}} = 80$. These arrays are averaged over the last dimension. Then the resulting 6 × 4 × 5 arrays of averages are averaged over the second dimension, getting rid of the number of features. So, the relative difference between the k-means and k-means++ performances by (8) becomes a 6 × 5 matrix (Table 1). So do the relative difference between the k-means and k-means++ running times by (9), presented in Table 2, the relative difference between the mix algorithm and k-means++ performances by (11), presented in Table 3, and the relative difference between the mix algorithm and k-means++ running times by (12), presented in Table 4. Positive percentage values, highlighted in bold in Tables 1-4, imply that the k-means++ algorithm is worse. As the number of points increases, k-means may perform even better, and the difference becomes more solid (Table 1). However, there is no solid advantage in its running time (Table 2). The only exception is the dataset of 1000 points, for which k-means performs far faster as the number of clusters is increased. On average, k-means is 0.1248 % less accurate and 5.5815 % faster than k-means++. As the number of features increases from 2 up to 5, the accuracy loss drops from 0.2897 % down to 0.051 %, whereas the speedup grows from 1.9388 % up to 6.536 %. On average, the mix does not outperform k-means++ (Table 3), losing about 0.0265 % in accuracy. However, the loss drops down to 0.0037 % as the dataset becomes larger (for instance, this is clearly seen in the column for 64 clusters).
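Assuming the raw relative differences are stored in a NumPy array of shape (6, 4, 5, 5) indexed by dataset size, number of features, number of clusters, and repetition, the averaging just described amounts to:

```python
import numpy as np

mu_D = np.random.randn(6, 4, 5, 5)   # placeholder for the relative differences by (8)
per_triple = mu_D.mean(axis=-1)      # average over the five repetitions -> shape (6, 4, 5)
table_1 = per_triple.mean(axis=1)    # average over the number of features -> shape (6, 5)
```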
Moreover, the mix is 22.2059 % faster (see Table 4, in which only the cells corresponding to the bold cells in Table 3 are highlighted in bold). So, the main inference from the statistics is that, as both the number of points and the number of features increase, the impact of the random centroid initialization solidifies. As the number of clusters is increased, no such or any other distinct pattern is observed. However, there is another distinct property of the statistics that is as solid as the mentioned inference. It concerns the rate of instances for which k-means performs better than k-means++ (based on the data underlying Table 1). In fact, about a half of the 600 instances (the total number of instances) are clustered more accurately by k-means (Table 5). The dependence on the number of features, on the dataset size, and on the number of clusters is hardly perceivable. Nearly the same quality of performance is seen for the mix algorithm (Table 6), where a tie on the same performance is broken by the shorter running time of the mix algorithm.
On average, the k-means performance is better, the same, and worse in 49.8333 %, 0.5 %, and 49.6667 % of all instances generated. The respective averaged rates of the mix performance are 36.6667 %, 13 %, and 50.3333 %. Obviously, the mix algorithm configured for $A^{*}_{\mathrm{runs}} = 80$ against 200 runs of k-means++ is always faster (Table 4). Meanwhile, k-means does not always outrun k-means++ (Table 7), which is also inferred from Table 2. An exception occurs when the dataset is of a medium to small size, although there are two cells in Table 7, for the 10000-point and 25000-point datasets, where k-means++ is likelier to be faster. As the dataset size increases, k-means++ becomes slower, and the slowness builds up. On average, k-means is faster at a 68.3333 % rate. The obtained data remain nearly the same if the computational simulation is repeated (starting from different pseudorandom number generator states). In general, the abovementioned main inference from the statistics remains the same. Moreover, the advantage of the mix algorithm solidifies further if the number of k-means and k-means++ runs (inside this algorithm) is increased from 80 up to 115.

The improved performance
Whichever number $A_{\mathrm{runs}}$ is, a number $A^{*}_{\mathrm{runs}}$ cannot be selected such that k-means would consistently outperform k-means++. Nevertheless, by ignoring the seeding method of random centroid initialization, k-means++ loses about 0.05 % accuracy in every second instance to cluster. All the more, much higher accuracy losses are likely. For instance, a 1000-point dataset with 4 features has been partitioned into 16 clusters by k-means++ with a score of 417.222, whereas k-means has performed on the same dataset with a score of 414.5155 (that is, 0.6487 % more accurately). At the same time, k-means may perform too poorly on smaller-sized problems. Thus, another 1000-point dataset with 2 features has been partitioned into 64 clusters by k-means++ with a score of 13.6557, whereas the k-means score is 14.8535 (that is, 8.7712 % less accurate).
To prevent potential accuracy losses, the random centroid initialization is incorporated into k-means++. Inasmuch as k-means cannot be used standalone, the mix algorithm is a simple solution. However, minimum (10) still can occasionally exceed $D_{++}$ if the number $A^{*}_{\mathrm{runs}}$ is only a fraction of $A_{\mathrm{runs}}$ (i.e., when $A^{*}_{\mathrm{runs}} < A_{\mathrm{runs}}$). If both k-means and k-means++ are initiated by the same pseudorandom number generator state and run for the same number of runs, then inequality
$$D_{\mathrm{mix}} \leqslant D_{++} \qquad (13)$$
is guaranteed; if not, then inequality (13) does not always hold. Meanwhile, it is not about the competition between the two approaches. It is rather about the diversity of scores among which the minimum should be selected. It is similar to the hybridization and fuzzification used for decision-making in multi-attribute and multi-objective problems (Jafar et al., 2022; Jafar et al., 2020a; Romanuke, 2018a). The best solution is to run both k-means and k-means++ on parallel cores, whichever number $A^{*}_{\mathrm{runs}}$ is, whereupon minimization operation (10) is concluded. Besides, inasmuch as k-means++ is more robust, it can be run more times than k-means. For example, of an overall 160 runs, k-means is run 40 times and k-means++ is run 120 times, where four processor cores may be used to run the algorithms in parallel. This improves the performance in every second instance to cluster.
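A minimal sketch of this parallel scheme, using Python's multiprocessing pool to emulate four cores; the 40/120 split of runs follows the example above, scikit-learn's KMeans stands in for both algorithms, and the dataset is an illustrative stand-in:

```python
from multiprocessing import Pool

import numpy as np
from sklearn.cluster import KMeans

def best_inertia(args):
    """Best score of `runs` restarts with the given initialization on one worker."""
    X, k, init, runs, seed = args
    km = KMeans(n_clusters=k, init=init, n_init=runs, max_iter=300,
                tol=1e-4, random_state=seed)
    return km.fit(X).inertia_

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, (10000, 3)) + 0.5 * rng.standard_normal((10000, 3))
    k = 16
    # One worker runs k-means (random init) 40 times; three workers share
    # 120 runs of k-means++, 40 each, so four cores are busy in parallel.
    tasks = [(X, k, "random", 40, 1)] + [(X, k, "k-means++", 40, s) for s in (2, 3, 4)]
    with Pool(processes=4) as pool:
        scores = pool.map(best_inertia, tasks)
    print("mix score:", min(scores))  # the better result is selected by (10)
```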

Discussion and conclusion
In nearly a half of instances to cluster, using k-means is seemingly redundant, but the other half really needs the random centroid initialization for improving centroid-based clustering. It is worth remembering that the clustering result score is a value of a random variable, so the improved performance can be guaranteed only as an expectation. Such an improvement makes sense for applications in management and engineering featuring repeatability of decision-making events (Romanuke, 2018a; Fränti & Sieranoja, 2019; Romanuke, 2021). Non-repeatable decision-making problems will nonetheless benefit from the improved centroid-based clustering when, e.g., subproblems of measuring similarity are solved (Jafar et al., 2020b; Jafar et al., 2021; Romanuke, 2018b; Romanuke, 2019).
After all, why does k-means++ underperform? The answer is quite simple: k-means++ sometimes underperforms because its first centroid is chosen at random, so a poor choice of the first centroid is likely to lead to poorer accuracy. Obviously, equal running times of k-means and k-means++ are very unlikely, but the occurrence when k-means++ converges faster than k-means (for the same number of algorithm runs and the same pseudorandom number generator state) is not excluded. Since the k-means++ centroid initialization is a sequential process, the k-means++ slowdown deepens as the problem size increases.
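For reference, a minimal sketch of D²-sampling seeding in the spirit of k-means++ (a simplified illustration, not any particular library's implementation), which makes it explicit that the first centroid is a uniformly random point and that each subsequent seed requires a sequential pass over the data:

```python
import numpy as np

def kmeanspp_seeds(X, k, rng):
    """Pick k initial centroids: the first one uniformly at random, the rest by D^2 sampling."""
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]  # the very first centroid is a purely random point
    for _ in range(k - 1):
        # squared distance from every point to its nearest already-chosen centroid
        d2 = np.min(((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # the next centroid is sampled proportionally to these squared distances
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centroids)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 2)) + 0.5 * rng.standard_normal((500, 2))
seeds = kmeanspp_seeds(X, 5, rng)  # seed quality depends on the random first pick
```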
Based on a series of computational simulations, it must be concluded that, by incorporating the seeding method of random centroid initialization, the k-means++ algorithm gains about 0.05 % accuracy in every second instance to cluster. To maintain the same running time, the incorporation is best done non-sequentially in practice: the k-means algorithm is run on a separate processor core in parallel to the k-means++ algorithm, whereupon the better result is selected. The impact of the random centroid initialization solidifies as both the dataset size and the number of features increase. The approach can be developed further by substituting the sequential process of centroid initialization in the k-means++ algorithm with the random centroid initialization as an alternative choice of the first centroid.

Figure 1.
The particular counterexample, in which centroid-based clustering is done 1.577 % more accurately and 12.84 % faster by using the random initialization (the top subplot) than by using k-means++ (the bottom subplot)

Figure 2.
The sum of within-cluster squared Euclidean distances versus the number of algorithm runs for k-means++ (black dots) and random (red dotted squares) centroid initialization

Figure 3.
The performance for a pseudorandom number generator state different from that in Figure 2, where k-means++ still underperforms despite a better start

Table 5.
Percentage of instances on which the k-means performance is better or worse

Table 6.
Percentage of instances on which the mix performance is better or worse

Table 7.
Percentage of instances on which k-means performs faster or slower