Classifier Learning from Difficult Data (CLDD) Session 1

Time and Date: 10:35 - 12:15 on 12th June 2019

Room: 0.6

Chair: Michal Wozniak

284	Keynote: ARFF data source library for distributed single/multiple instance, single/multiple output learning on Apache Spark Abstract: Apache Spark has become a popular framework for distributed machine learning and data mining. However, it lacks support for operating with Attribute-Relation File Format (ARFF) files in a native, convenient, transparent, efficient, and distributed way. Moreover, Spark does not support the advanced learning paradigms represented in the ARFF definition, including learning from data comprising single/multiple instances and/or single/multiple outputs. This paper presents an ARFF data source library that provides native support for ARFF files and for single/multiple instance and/or single/multiple output learning on Apache Spark. The data source seamlessly extends the Apache Spark machine learning library, allowing users to load all ARFF file varieties, attribute types, and learning paradigms. It allows researchers to incorporate a large number of diverse datasets and to develop scalable solutions for learning problems of increased complexity. The data source is implemented in Scala, like the Apache Spark source code, but it can be used from Java, Scala, and Python. The ARFF data source is free and open source, available on GitHub under the Apache License 2.0.	Jorge Gonzalez Lopez, Sebastián Ventura and Alberto Cano
540	On the role of cost-sensitive learning in imbalanced data oversampling Abstract: Learning from imbalanced data is still considered one of the most challenging areas of machine learning. Among the plethora of methods dedicated to alleviating the challenge of skewed distributions, the two most distinct ones are data-level sampling and cost-sensitive learning. The former modifies the training set by either removing majority instances or generating additional minority ones. The latter associates a penalty cost with the minority class in order to mitigate the classifier's bias towards the better represented class. While these two approaches have been extensively studied on their own, no works so far have tried to combine their properties. Such a direction seems highly promising, as in many real-life imbalanced problems we may obtain the actual misclassification cost, and thus it should be embedded in the classification framework regardless of the selected algorithm. This work aims to open a new direction for learning from imbalanced data by investigating the interplay between the oversampling and cost-sensitive approaches. We show that there is a direct relationship between the misclassification cost imposed on the minority class and the oversampling ratios that aim to balance both classes. This becomes evident when popular skew-insensitive metrics are modified to incorporate the cost-sensitive element. Our experimental study clearly shows a strong relationship between sampling and cost, indicating that this new direction should be pursued in the future in order to develop new and effective algorithms for imbalanced data.	Bartosz Krawczyk and Michal Wozniak
219	Characterization of Handwritten Signature Images in Dissimilarity Representation Space Abstract: The offline Handwritten Signature Verification (HSV) problem can be considered as having difficult data, since it presents imbalanced class distributions, a high number of classes, a high-dimensional feature space and a small number of learning samples. One way to deal with this problem is the writer-independent (WI) approach, which is based on the dichotomy transformation (DT). In this work, the difficulty of the data in the space produced by this transformation is analyzed using the instance hardness (IH) measure. The paper also reports on how this better understanding can lead to better use of the data through a prototype selection technique.	Victor L. F. Souza, Adriano L. I. Oliveira, Rafael M. O. Cruz and Robert Sabourin