International Journal of Electronic Engineering and Computer Science
International Journal of Electronic Engineering and Computer Science, Vol.5, No.1, Mar. 2020, Pub. Date: Feb. 14, 2020
An Automated Data Pre-processing Technique for Machine Learning in Critical Systems
Pages: 1-9
Authors
[01] Monica Madyembwa, Computer Science Department, National University of Science and Technology, Bulawayo, Zimbabwe.
[02] Kernan Mzelikahle, Computer Science Department, National University of Science and Technology, Bulawayo, Zimbabwe.
[03] Sibonile Moyo, Computer Science Department, National University of Science and Technology, Bulawayo, Zimbabwe.
Abstract
In many critical systems, the quality of data analysis is an important factor, particularly when the results of the analysis contribute to decision making. Data cleaning techniques are applied during the data preparation stage, before data analysis techniques are applied to a dataset. There is a strong causal relationship between the quality of data preparation and the quality of data analysis results, so data cleaning techniques have a direct bearing on the results of the data analysis stage. In this paper, we propose the use of intelligent data cleaning techniques in place of traditional deterministic methods. We show that using machine learning techniques to clean data, particularly to fill in missing values, improves the quality of subsequent data analysis. Seven (7) flight-level datasets from the US Department of Transportation (Bureau of Transportation Statistics) were used to assess whether the quality of subsequent data analysis is significantly affected by the choice of data pre-processing technique. A set of experiments was designed to compare the performance of data analysis techniques on data prepared using different data cleaning techniques. Three (3) data analysis techniques, namely the LSTM, FFANN and RNN, were used in the comparative study to determine how each technique performs depending on the data cleaning technique used. The results indicate that using machine learning techniques, such as BOSOM and K-means clustering, in data preparation increases the quality of subsequent data analysis. The quality of data analysis was measured using performance metrics such as the Cross-Entropy loss and the Mean Square Error. Both metrics show improved performance for each data analysis technique when the data is cleaned using machine learning methods.
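To make the idea of clustering-based filling of missing data concrete, the following is a minimal sketch of K-means imputation in plain Python. It is not the authors' implementation: the function name `kmeans_impute`, the use of `None` to mark missing cells, the farthest-point initialisation and the toy data are all assumptions made for illustration, and BOSOM is not shown.

```python
def kmeans_impute(rows, k=2, iters=20):
    """Fill None cells with the value of the nearest cluster centroid.

    Hypothetical helper for illustration only (assumed name and interface).
    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])

    # Start from column-mean imputation (a simple deterministic baseline).
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(observed) / len(observed))
    filled = [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
              for r in rows]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def obs_sq_dist(row, centroid):
        # Distance over observed coordinates only, so guesses do not bias it.
        return sum((x - c) ** 2 for x, c in zip(row, centroid) if x is not None)

    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centroids = [list(filled[0])]
    while len(centroids) < k:
        far = max(filled, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))

    for _ in range(iters):
        # Assignment step: nearest centroid for each row.
        labels = [min(range(k), key=lambda c: obs_sq_dist(r, centroids[c]))
                  for r in rows]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [filled[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Re-impute missing cells from the centroid of the assigned cluster.
        for i, r in enumerate(rows):
            for j in range(n_cols):
                if r[j] is None:
                    filled[i][j] = centroids[labels[i]][j]
    return filled


# Toy data (made up): two well-separated groups, one missing cell in each.
rows = [[1.0, 1.2], [0.8, 1.0], [1.1, None],
        [10.0, 10.2], [9.8, 10.0], [None, 9.9]]
filled = kmeans_impute(rows, k=2)
print(filled[2], filled[5])
```

On this toy input the two filled cells settle near the mean of the observed values in their own cluster (about 1.1 and 9.9), whereas column-mean imputation would place them at 6.46 and 4.54, between the two groups. That gap is the kind of difference in prepared data whose downstream effect the comparative study measures.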
Keywords
Long Short-Term Memory (LSTM), Bat Optimised Self Organised Map (BOSOM), Artificial Neural Networks, Data Pre-processing Techniques, K-Means Clustering