Classifier Learning from Difficult Data (CLDD) Session 1

Time and Date: 10:35 - 12:15 on 12th June 2019

Room: 0.6

Chair: Michal Wozniak

284 Keynote: ARFF data source library for distributed single/multiple instance, single/multiple output learning on Apache Spark [abstract]
Abstract: Apache Spark has become a popular framework for distributed machine learning and data mining. However, it lacks support for operating on Attribute-Relation File Format (ARFF) files in a native, convenient, transparent, efficient, and distributed way. Moreover, Spark does not support advanced learning paradigms represented in the ARFF definition, including learning from data comprising single/multiple instances and/or single/multiple outputs. This paper presents an ARFF data source library that provides native support for ARFF files and single/multiple instance and/or single/multiple output learning on Apache Spark. The data source seamlessly extends the Apache Spark machine learning library, allowing all ARFF file varieties, attribute types, and learning paradigms to be loaded. It lets researchers incorporate a large number of diverse datasets and develop scalable solutions for learning problems of increased complexity. The data source is implemented in Scala, just like the Apache Spark source code itself, but it can be used from Java, Scala, and Python. It is free and open source, available on GitHub under the Apache License 2.0. (An illustrative usage sketch follows the author list below.)
Jorge Gonzalez Lopez, Sebastián Ventura and Alberto Cano
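Below is a minimal PySpark sketch of what loading an ARFF file through such a data source could look like. The Maven coordinates, the format identifier "org.apache.spark.ml.arff", and the option name "multiInstance" are placeholders, not the library's actual names; the project's GitHub README documents the real values.

```python
# Hypothetical PySpark usage of an ARFF data source; identifiers are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("arff-example")
         # Placeholder package coordinates for the ARFF data source jar.
         .config("spark.jars.packages", "org.example:spark-arff_2.11:1.0.0")
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.ml.arff")   # assumed format identifier
      .option("multiInstance", "false")      # assumed option name
      .load("data/iris.arff"))

df.printSchema()   # attribute types inferred from the ARFF header
df.show(5)
```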
540 On the role of cost-sensitive learning in imbalanced data oversampling [abstract]
Abstract: Learning from imbalanced data is still considered one of the most challenging areas of machine learning. Among the plethora of methods dedicated to alleviating the challenge of skewed distributions, the two most distinct are data-level sampling and cost-sensitive learning. The former modifies the training set by either removing majority instances or generating additional minority ones. The latter associates a penalty cost with the minority class in order to mitigate the classifier's bias towards the better represented class. While these two approaches have been extensively studied on their own, no works so far have tried to combine their properties. Such a direction seems highly promising, as in many real-life imbalanced problems we may obtain the actual misclassification cost, which should therefore be embedded in the classification framework regardless of the selected algorithm. This work aims to open a new direction for learning from imbalanced data by investigating the interplay between oversampling and cost-sensitive approaches. We show that there is a direct relationship between the misclassification cost imposed on the minority class and the oversampling ratios that aim to balance both classes. This becomes evident when popular skew-insensitive metrics are modified to incorporate the cost-sensitive element. Our experimental study clearly shows a strong relationship between sampling and cost, indicating that this new direction should be pursued in the future in order to develop new and effective algorithms for imbalanced data. (An illustrative sketch of the cost/oversampling correspondence follows below.)
Bartosz Krawczyk and Michal Wozniak
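The correspondence investigated in the paper can be illustrated with a toy experiment: one classifier is trained with a misclassification cost on the minority class, another on a training set oversampled by the matching ratio. The sketch below uses scikit-learn with plain random replication rather than the paper's oversampling methods; the dataset and the cost value are arbitrary.

```python
# Toy comparison of cost-sensitive learning vs. oversampling at the same ratio.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

cost = 9.0  # penalty assigned to misclassifying the minority class

# Cost-sensitive learning: penalise minority errors via class weights.
cs = LogisticRegression(class_weight={0: 1.0, 1: cost}, max_iter=1000).fit(X_tr, y_tr)

# Oversampling: replicate minority instances until the ratio matches the cost.
minority = np.where(y_tr == 1)[0]
extra = np.random.default_rng(0).choice(minority, size=int((cost - 1) * len(minority)))
X_os = np.vstack([X_tr, X_tr[extra]])
y_os = np.concatenate([y_tr, y_tr[extra]])
os_clf = LogisticRegression(max_iter=1000).fit(X_os, y_os)

for name, clf in [("cost-sensitive", cs), ("oversampled", os_clf)]:
    print(name, balanced_accuracy_score(y_te, clf.predict(X_te)))
```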
219 Characterization of Handwritten Signature Images in Dissimilarity Representation Space [abstract]
Abstract: The offline Handwritten Signature Verification (HSV) problem can be considered one involving difficult data, since it presents imbalanced class distributions, a high number of classes, a high-dimensional feature space, and a small number of learning samples. One way to deal with this problem is the writer-independent (WI) approach, which is based on the dichotomy transformation (DT). In this work, the difficulty of the data in the space induced by this transformation is analysed using the instance hardness (IH) measure. The paper also reports on how this better understanding can lead to better use of the data through a prototype selection technique. (A minimal sketch of the dichotomy transformation follows below.)
Victor L. F. Souza, Adriano L. I. Oliveira, Rafael M. O. Cruz and Robert Sabourin
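The dichotomy transformation underlying the writer-independent approach maps a pair of feature vectors to the absolute difference of their components, labelled as a within-writer or between-writer pair. The sketch below illustrates that mapping on random placeholder features; it is not the authors' pipeline and omits the instance hardness analysis.

```python
# Minimal sketch of the dichotomy transformation on placeholder signature features.
import numpy as np

rng = np.random.default_rng(0)
features = {w: rng.normal(size=(5, 16)) for w in range(3)}  # 3 writers, 5 signatures, 16-dim

def dichotomy_transform(x_query, x_reference):
    """Map a pair of samples to a vector in the dissimilarity space."""
    return np.abs(x_query - x_reference)

pairs, labels = [], []
writers = list(features)
for w in writers:
    sigs = features[w]
    # within-writer (positive) pairs
    for i in range(len(sigs)):
        for j in range(i + 1, len(sigs)):
            pairs.append(dichotomy_transform(sigs[i], sigs[j]))
            labels.append(1)
    # between-writer (negative) pairs against one other writer
    other = features[(w + 1) % len(writers)]
    for i in range(len(sigs)):
        pairs.append(dichotomy_transform(sigs[i], other[i]))
        labels.append(0)

X, y = np.asarray(pairs), np.asarray(labels)
print(X.shape, y.mean())  # dissimilarity vectors and class balance
```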

Classifier Learning from Difficult Data (CLDD) Session 2

Time and Date: 14:40 - 16:20 on 12th June 2019

Room: 0.6

Chair: Michal Wozniak

229 Missing Features Reconstruction and Its Impact on Classification Accuracy [abstract]
Abstract: In real-world applications, we can encounter situations where a well-trained model has to make predictions on a damaged dataset. The damage caused by missing or corrupted values can occur at the level of individual instances or at the level of entire features. Both situations have a negative impact on the usability of the model on such a dataset. This paper focuses on the scenario where entire features are missing, which can be understood as a specific case of transfer learning. Our aim is to experimentally investigate the influence of various imputation methods on the performance of several classification models. The impact of imputation is studied for traditional methods such as k-NN, linear regression, and MICE, compared to modern imputation methods based on a multi-layer perceptron (MLP) and gradient boosted trees (XGBT). For linear regression, MLP, and XGBT we also propose two approaches to using them for multiple-feature imputation. The experiments were performed on both real-world and artificial datasets with continuous features, with the number of missing features ranging from a single feature up to 50% of all features. The results show that MICE and linear regression are generally good imputers regardless of the conditions. On the other hand, the performance of MLP and XGBT is strongly dataset dependent: they perform best in some cases, but more often they perform worse than MICE or linear regression. (A simplified reconstruction sketch follows below.)
Magda Friedjungová, Daniel Vašata and Marcel Jiřina
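As a rough illustration of the missing-feature scenario, the sketch below trains a linear-regression imputer on complete data and uses it to reconstruct a feature that is entirely absent at prediction time. It covers only the single-feature case and is not the paper's experimental protocol; the dataset is a stand-in.

```python
# Reconstructing an entirely missing feature with a regressor trained on complete data.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, _ = load_diabetes(return_X_y=True)
X_train, X_test = train_test_split(X, random_state=0)

missing_col = 3
rest = [c for c in range(X.shape[1]) if c != missing_col]

# Train the imputer on the original, complete data.
imputer = LinearRegression().fit(X_train[:, rest], X_train[:, missing_col])

# At prediction time the column is absent, so reconstruct it from the remaining ones.
X_test_damaged = np.delete(X_test, missing_col, axis=1)
reconstructed = imputer.predict(X_test_damaged)

rmse = np.sqrt(np.mean((reconstructed - X_test[:, missing_col]) ** 2))
print(f"Reconstruction RMSE for feature {missing_col}: {rmse:.4f}")
```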
78 A Deep Malware Detection Method Based on General-Purpose Register Features [abstract]
Abstract: Existing detection methods based on low-level, micro-architectural features usually need a long sample length to detect malicious behaviours and can hardly identify non-signature malware, which inevitably affects detection efficiency and effectiveness. To address these problems, we propose to use the General-Purpose Registers (GPRs) as features and design a novel deep learning model for malware detection. Each register has specific functions, and changes in its content carry action information that can be used to detect illegal behaviours. We design a deep detection model that jointly fuses spatial and temporal correlations of the GPRs for malware detection while requiring only a short sample length. The proposed model learns discriminative characteristics of the GPRs that distinguish normal from abnormal processes, and can therefore also identify non-signature malware. Comprehensive experimental results show that our method outperforms state-of-the-art methods for detecting malicious behaviours from low-level features. (A generic sketch of such a spatio-temporal model follows below.)
Fang Li, Chao Yan, Ziyuan Zhu and Dan Meng
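A generic way to fuse spatial correlations across registers with temporal correlations across time steps is a 1-D convolution over the register dimension followed by a recurrent layer. The Keras sketch below illustrates that idea on dummy data; it is not the authors' architecture, and all shapes and hyperparameters are assumptions.

```python
# Illustrative spatio-temporal classifier over register traces (dummy data).
import numpy as np
import tensorflow as tf

T, R = 64, 16                                      # time steps, register features
x = np.random.rand(32, T, R).astype("float32")     # placeholder GPR traces
y = np.random.randint(0, 2, size=(32, 1))          # 0 = benign, 1 = malicious

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(T, R)),
    # "Spatial" mixing of register features at each time step.
    tf.keras.layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
    # Temporal dependencies across the (short) sample.
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=1, batch_size=8, verbose=0)
print(model.predict(x[:2]))
```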
415 A Novel Distribution Analysis for SMOTE oversampling method in Handling Class Imbalance [abstract]
Abstract: Class imbalance problems are often encountered in many applications. Such problems occur whenever a class is under-represented, i.e. has few data points compared to the other classes, yet this minority class is usually a significant one. One approach for handling imbalance is to generate new minority class instances to balance the data distribution. The Synthetic Minority Oversampling TEchnique (SMOTE) is one of the dominant oversampling methods in the literature. SMOTE generates data by linear interpolation between a minority class data point and one of its K nearest neighbors. In this paper, we present a theoretical and experimental analysis of the SMOTE method, exploring how faithfully SMOTE emulates the underlying density. To our knowledge, this is the first mathematical analysis of the SMOTE method. Moreover, we study the impact of different factors on generation accuracy, such as the data dimension, the number of examples, and the number of neighbors K, on both artificial and real datasets. (A minimal sketch of the SMOTE generation step follows below.)
Dina Elreedy and Amir Atiya
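The generation step analysed in the paper can be written in a few lines: a synthetic point is placed at a random position on the segment between a minority instance and one of its K nearest minority neighbours. The NumPy sketch below shows that core step on placeholder data; it omits the sampling-ratio control of full SMOTE implementations.

```python
# Core SMOTE generation step on placeholder minority-class data.
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate n_new synthetic minority samples by linear interpolation."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from the chosen point to all other minority points
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]           # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                           # uniform in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = np.random.default_rng(1).normal(size=(30, 2))
print(smote_sample(X_minority, k=5, n_new=10).shape)  # (10, 2)
```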
494 Forecasting purchase categories by transactional data: a comparative study of classification methods [abstract]
Abstract: Forecasting the purchase behavior of bank clients allows for the development of new recommendation and personalization strategies and results in better quality of service and customer experience. In this study, we consider the problem of predicting a client's purchase categories for the next time period from historical transactional data. We study the predictability of expenses for different Merchant Category Codes (MCCs) and compare the efficiency of different classes of machine learning models, including boosting algorithms, long short-term memory networks, and convolutional networks. The experimental study is performed on a massive dataset of debit card transactions covering 5 years and about 1.2M clients, provided by our partner bank. The results show that: (i) there is a set of MCC categories which are highly predictable (the exact number of categories varies with the thresholds for minimal precision and recall), and (ii) in most of the considered cases, convolutional neural networks perform better and thus may be recommended as a basic choice for tackling similar problems. (A sketch of one possible label construction follows below.)
Klavdiya Bochenina and Egor Shikov
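One plausible way to frame the task described above is as multi-label prediction: for each client, the set of MCCs with spending in the next period is the target for features built from previous periods. The pandas sketch below shows such a label construction on a toy transaction log; the column names and the monthly granularity are assumptions, not the authors' setup.

```python
# Building next-period multi-label MCC targets from a toy transaction log.
import pandas as pd

tx = pd.DataFrame({
    "client_id": [1, 1, 1, 2, 2],
    "month":     ["2019-01", "2019-01", "2019-02", "2019-01", "2019-02"],
    "mcc":       [5411, 5812, 5411, 5912, 5411],
    "amount":    [20.0, 35.5, 12.3, 8.0, 50.0],
})

# Per-client, per-month spending in each MCC (features for one period).
features = tx.pivot_table(index=["client_id", "month"], columns="mcc",
                          values="amount", aggfunc="sum", fill_value=0.0)

# Binary multi-label target: which MCCs are used in the *next* month.
purchased = (features > 0).astype(int)
targets = purchased.groupby(level="client_id").shift(-1).dropna()

print(features.head())
print(targets.head())
```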
439 Recognizing Faults in Software Related Difficult Data [abstract]
Abstract: In this paper we investigate the use of numerous machine learning algorithms, with an emphasis on multilayer artificial neural networks, in the domain of software source code fault prediction. The main contribution lies in enhancing the data pre-processing step as a partial solution to handling software-related difficult data. Before feeding the data into an artificial neural network, we apply PCA (Principal Component Analysis) and k-means clustering; the clustering step improves the quality of the whole dataset. Using the presented approach we were able to obtain a 10% increase in fault detection accuracy. To ensure reliable results, we use 10-fold cross-validation in our experiments. (A compact sketch of the pre-processing chain follows below.)
Michal Choras, Marek Pawlicki and Rafal Kozik
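The pre-processing chain described above can be approximated with scikit-learn: PCA for dimensionality reduction, k-means on the reduced data, and a multilayer perceptron evaluated with 10-fold cross-validation. Appending the cluster assignment as an extra feature is one possible interpretation of the clustering step, an assumption rather than the authors' exact procedure; the data and hyperparameters are placeholders.

```python
# Illustrative PCA + k-means pre-processing before an MLP fault-prediction model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for software metrics extracted from source code.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Dimensionality reduction followed by clustering of the reduced data.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pca)

# One interpretation of the clustering step: append the cluster id as a feature.
X_enhanced = np.column_stack([X_pca, clusters])

mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
scores = cross_val_score(mlp, X_enhanced, y, cv=10)  # 10-fold cross-validation
print(f"Mean accuracy: {scores.mean():.3f}")
```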