Synthetic data generator for testing of classification rule algorithms

Romana Seidlová, Jaroslav Poživil, Jaromír Seidl, Lukáš Malecl
2017 Neural Network World  
We developed a data generation system that systematically creates test datasets meeting user requirements such as the number of rows, the number and type of attributes, the number of missing values, class noise and imbalance ratio. These datasets can be used to test algorithms designed to solve the classification rule problem. We used them to optimize the parameters of a classification algorithm based on the behavior of ant colonies, but they can be used to advantage in other applications too. The program generates output files in ARFF format. Two standard probability distributions and one user-defined distribution are used in data generation: the uniform distribution, the normal distribution, and an irregular distribution for nominal attributes. To our knowledge, our system is probably the first synthetic data generation system that systematically generates datasets for examining and evaluating classification rule algorithms.
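
As an illustration only, the parameters listed in the abstract (row count, missing values, class noise, imbalance ratio, the three distributions, ARFF output) could be combined roughly as in the sketch below. This is not the authors' program: the function names `generate_dataset` and `to_arff`, the attribute names, the nominal weights, and the class-dependent shift of the numeric attributes are all assumptions made for the example.

```python
import random

def generate_dataset(n_rows, imbalance_ratio=3.0, missing_rate=0.1,
                     noise_rate=0.05, seed=0):
    """Return rows of [uniform, normal, nominal, class] (hypothetical layout)."""
    rng = random.Random(seed)
    # Imbalance ratio is taken here as majority:minority = imbalance_ratio:1.
    n_minority = round(n_rows / (1.0 + imbalance_ratio))
    labels = ["neg"] * (n_rows - n_minority) + ["pos"] * n_minority
    rng.shuffle(labels)
    # "Irregular" distribution: user-supplied weights for a nominal attribute.
    colours, weights = ["red", "green", "blue"], [0.6, 0.3, 0.1]
    rows = []
    for label in labels:
        # Assumption: shift numeric attributes by class so rules are learnable.
        shift = 1.0 if label == "pos" else 0.0
        values = [
            rng.uniform(0.0, 1.0) + shift,              # uniform distribution
            rng.gauss(shift, 1.0),                      # normal distribution
            rng.choices(colours, weights=weights)[0],   # irregular distribution
        ]
        # Blank out each attribute value with probability missing_rate.
        values = [None if rng.random() < missing_rate else v for v in values]
        # Class noise: flip the label with probability noise_rate.
        if rng.random() < noise_rate:
            label = "neg" if label == "pos" else "pos"
        rows.append(values + [label])
    return rows

def to_arff(rows, relation="synthetic"):
    """Serialize rows as ARFF text; missing values are written as '?'."""
    def fmt(v):
        if v is None:
            return "?"
        return f"{v:.4f}" if isinstance(v, float) else str(v)
    lines = [f"@relation {relation}", "",
             "@attribute a_uniform numeric",
             "@attribute a_normal numeric",
             "@attribute a_colour {red,green,blue}",
             "@attribute class {neg,pos}", "", "@data"]
    lines += [",".join(fmt(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(to_arff(generate_dataset(10, seed=42)))
```

Fixing the seed makes a generated dataset reproducible, which is convenient when the same file is reused to compare parameter settings of a classifier.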
doi:10.14311/nnw.2017.27.010 fatcat:unm7domiijhdfkugpbiqgnxhnu