Algorithms with greedy heuristic procedures for mixture probability distribution separation

Authors

  • Lev Kazakovtsev Reshetnev University, prosp. Krasnoyarskii Rabochii, Department of Systems Analysis and Operations Research, Krasnoyarsk, Russian Federation
  • Dmitry Stashkov Reshetnev University, prosp. Krasnoyarskii Rabochii, Department of Systems Analysis and Operations Research, Krasnoyarsk, Russian Federation
  • Mikhail Gudyma Reshetnev University, prosp. Krasnoyarskii Rabochii, Department of Systems Analysis and Operations Research, Krasnoyarsk, Russian Federation
  • Vladimir Kazakovtsev ITMO University, Department of Computer Educational Technologies, St. Petersburg, Russian Federation

DOI:

https://doi.org/10.2298/YJOR171107030K

Keywords:

Clustering, Variable Neighbourhood Search, Genetic Algorithm, Greedy Heuristic, Agglomerative Heuristic, Expectation Maximization.

Abstract

For clustering problems based on the model of mixture probability distribution separation, we propose new Variable Neighbourhood Search algorithms (VNS) and evolutionary genetic algorithms (GA) with greedy agglomerative heuristic procedures and compare them with known algorithms. New genetic algorithms implement a global search strategy with the use of a special crossover operator based on greedy agglomerative heuristic procedures in combination with the EM algorithm (Expectation Maximization). In our new VNS algorithms, this combination is used for forming randomized neighbourhoods to search for better solutions. The results of computational experiments made on classical data sets and the testings of production batches of semiconductor devices shipped for the space industry demonstrate that new algorithms allow us to obtain better results, higher values of the log likelihood objective function, in comparison with the EM algorithm and its modifications.

References

Orlov, V. I., Stashkov, D. V., Kazakovtsev L. A., Stupina A. A., Fuzzy clustering of EEE components for space industry, IOP Conference Series: Materials Science and Engineering, 155 (2016) Article ID 012026.

Korolev, V. Yu., EM-algorithm, its modifications and their application to the problem of separation of mixtures of probability distributions, Theoretical review, IPIRAN, Moscow, 2007.

Bishop, C., Pattern Recognition and Machine Learning. Springer, 2006.

Newcomb, S., A generalized theory of the combination of observations so as to obtain the best result, American Journal of Mathematics, 8 (4) (1886) 343-366.

Pearson, K., Contributions to the Mathematical Theory of Evolution, Philosophical Transactions of the Royal Society of London, 185 (1894) 71-110.

Everitt, B. and Hand, D. J., Finite Mixture Distributions, Monographs on Applied Probability and Statistics, Chapman and Hall, UK, 1981.

McLachlan, G. J., and Basford, K. E., Mixture models. Inference and applications to clustering, Marcel Dekker, New York, 1988.

Ayvazyan, S. A., Buhshtaber, V. M., Enyukov, I. S., and Meshalkin, L. D., Applied Statistics: Classification and Dimension Reduction [Prikladnaja statistika: klassifikatsiya i snizhenie razmernosti], Finansy i statistika, Moscow, 1989.

McKendrick, A. G., Applications of mathematics to medical problems, Proceedings of the Edinburgh Mathematical Society, 44 (1926) 98-130.

Healy, M. J. R., and Westmacott, M. H., Missing Values in Experiments Analyzed on Automatic Computers, Applied Statistics, 5 (1956) 203-206.

Shlezinger, M. I., The Interaction of Learning and Self-Organization in Pattern Recognition, Kibernetika, 4 (2) (1968) 81-88.

Dempster, A., Laird, N., and Rubin, D., Maximum Likelihood Estimation from Incomplete Data, Journal of the Royal Statistical Society, Series B, 39 (1977) 1-38.

Kazakovtsev, L. A., and Antamoshkin, A. N., Greedy Heuristic Method for Location Problems, Vestnik SibGAU, 16 (2) (2015) 317-325.

Kazakovtsev, L. A., and Antamoshkin, A. N., Genetic Algorithm with Fast Greedy Heuristic for Clustering and Location Problems, Informatica, 3 (38) (2014) 229-240.

Celeux, G., and Govaert, A., Classification EM Algorithm for Clustering and Two Stochastic Versions, Rapport de Recherche de l'INRIA 1364, Centre de Rocquencourt, 1991.

Celeux, G., and Diebolt, J., The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem, Computational Statistics Quarterly, 2 (1) (1985) 73-82.

Picchini, U., and Samson, A., Coupling Stochastic EM and Approximate Bayesian Computation for Parameter Inference in State-space Models, Cornell University Library, 2017, arXiv:1512.04831v6, DOI:10.1007/s00180-017-0770-y.

Matarazzo, T.J., and Pakzad, S.N., STRIDE for Structural Identification using Expectation Maximization: Iterative Output-Only Method for Modal Identification, Journal of Engineering Mechanics, 142 (2016), DOI:10.1061/(ASCE)EM.1943-7889.0000951.

Zaheer, M., Wick, M., Tristan, J.-B., Smola A., and Steele, G., Exponential Stochastic Cellular Automata for Massively Parallel Inference, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, 2016, 966-975.

Zinovyev, A., Data Visualization in Political and Social Sciences, International Encyclopedia of Political Science, Sage Publications, CA, 2011.

Yin, J., Zhang Y., and Gao L., Accelerating Expectation-Maximization Algorithms with Frequent Updates, Proceedings of the IEEE International Conference on Cluster Computing, Beijing, 2012, 275-283.

Kazakovtsev, L. A., Orlov, V. I., Stupina, A. A., and Kazakovtsev, V. L., New Genetic Algorithm with Greedy Heuristic for Clustering Problems with Unknown Number of Clusters, Facta Universitatis, series Mathematics and Informatics, 30 (1) (2015) 89-106.

Kaufman, L., and Rousseeuw, P. J., Clustering by Means of Medoids, in: Y. Dodge (ed.), Statistical Data Analysis Based on the L1 Norm and Related Methods, North-Holland, 1987, 405-416.

Mladenovic, N., and Hansen, P. Variable Neighborhood Search, Comput. Oper. Res., 24 (1997) 1097-1100.

Hansen, P., Brimberg, J., Urosevic, D., and Mladenovic N., Solving Large p-Median Clustering Problems by Primal Dual Variable Neighborhood Search, Data Mining and Knowledge Discovery, 19 (3) (2009) 351-375.

Hansen, P., Variable Neighborhood Search, Search Methodology, in: E.K. Burke, and G. Kendall (eds.), Springer, US, (2005) 211-238.

Brimberg, J., Hansen, P., and Mladenovic, N., Attraction Probabilities in Variable Neighborhood Search, 4OR: A Quarterly Journal of Operations Research, 8 (2010) 181-194.

Hansen, P., Mladenovic, N., and Moreno Perez, J. A., Variable Neighborhood Search: Methods and Applications, 4OR: A Quarterly Journal of Operations Research, 6 (2008) 319-360.

Martins, P., Goal Clustering: VNS Based Heuristics, Cornell University Library, 2017, arXiv:1705.07666.

Stashkov, D. V., Variable Neighborhood Search Algorithms for Problem of Mixture Distributions Separation, Sistemy upravleniya i informacionnye tehnologii, 1 (67) (2017) 18-24.

Dua, D., and Karra Taniskidou, E., UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science, 2017.

Franti, P. et al., Clustering Datasets [http://cs.uef.fi/sipu/datasets/], 2015.

Nielsen, S. F., The Stochastic EM Algorithm: Estimation and Asymptotic Results, Bernoulli, 6 (2000) 457-489.

Smucker, M. D., Allan, J. and Carterette, B., A Comparison of Statistical Significance Tests for Information Retrieval, Proceedings of the sixteenth ACM conference on Conference on information and knowledge management (CIKM 07), ACM, New York, 2007, 623-632.

Park, H. M., Comparing Group Means: The t-Test and One-way ANOVA Using STATA, SAS, and SPSS, Indiana University, 2009.

Mann, Henry B., Whitney, Donald R., On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other, Annals of Mathematical Statistics, 18 (1) (1947) 50-60.

Fay, Michael P., Proschan, Michael A., Wilcoxon-Mann-Whitney or t-Test? On Assumptions for Hypothesis Tests and Multiple Interpretations of Decision Rules, Statistics Surveys, 4 (2010) 139.

Downloads

Published

2019-02-01

Issue

Section

Research Articles