Data-Driven Computational Sciences (DDCS) Session 4

Time and Date: 13:25 - 15:05 on 14th June 2017

Room: HG D 7.2

Chair: Craig Douglas

381 Improving Performance of Multiclass Classification by Inducing Class Hierarchies
Abstract: In recent decades, one issue that has received a lot of attention in classification problems is how to obtain better classifications. This problem becomes even more complicated when the number of classes is high. In this multiclass scenario, it is assumed that the class labels are independent of each other, and most techniques and methods proposed to improve the performance of classifiers rely on this assumption. An alternative way to address the multiclass problem is to distribute the classes hierarchically into a collection of multiclass subproblems, reducing the number of classes involved in each local subproblem. In this paper, we propose a new method for inducing a class hierarchy from the confusion matrix of a multiclass classifier. We then use the class hierarchy to learn a tree-like hierarchy of classifiers that solves the original multiclass problem in much the same way as the top-down hierarchical classification approach does for hierarchical domains. We experimentally evaluate the proposal on a collection of multiclass datasets, showing that, in general, the generated hierarchies outperform not only the original (flat) classification but also hierarchical approaches based on alternative ways of constructing the class hierarchy.
Daniel Andrés Silva Palacios, Cèsar Ferri and Maria José Ramírez Quintana
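As a rough illustration of how a class hierarchy might be induced from a confusion matrix, the sketch below repeatedly merges the two groups of classes that are confused with each other most often. It is not the authors' method: the similarity measure (mean mutual confusion rate), the average-linkage merging, the function name induce_hierarchy, and the example matrix are all assumptions made for illustration.

```python
from itertools import combinations


def induce_hierarchy(confusion, labels):
    """Build a binary class hierarchy by merging the most-confused classes.

    confusion[i][j] is the number of examples of true class i that were
    predicted as class j. Returns a nested tuple of labels.
    """
    # Row-normalise counts into rates so large classes do not dominate.
    rates = []
    for row in confusion:
        total = sum(row)
        rates.append([v / total if total else 0.0 for v in row])

    def similarity(i, j):
        # Assumed measure: mean of the two directed confusion rates.
        return (rates[i][j] + rates[j][i]) / 2.0

    def linkage(a, b):
        # Average linkage between two clusters of class indices.
        pairs = [(i, j) for i in a[1] for j in b[1]]
        return sum(similarity(i, j) for i, j in pairs) / len(pairs)

    # Each cluster is (subtree, member class indices).
    clusters = [(labels[i], [i]) for i in range(len(labels))]
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Hypothetical confusion matrix for four classes (rows: true class).
labels = ["cat", "dog", "car", "truck"]
confusion = [
    [40, 9, 1, 0],
    [8, 41, 0, 1],
    [0, 1, 42, 7],
    [1, 0, 12, 37],
]
hierarchy = induce_hierarchy(confusion, labels)
print(hierarchy)
```

Each internal node of the resulting tree would then get its own classifier that only has to separate its child groups, which is the top-down scheme the abstract refers to.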
574 The Impact of Large-Data Transfers in Shared WANs: An Empirical Study
Abstract: Computational science, especially in the era of Big Data, sometimes requires large data files to be transferred over high bandwidth-delay-product (BDP) wide-area networks (WANs). Experimental data (e.g., LHC, SKA), analytics logs, and filesystem backups are regularly transferred between research centres and between private-public clouds. Fortunately, a variety of tools (e.g., GridFTP, UDT, PDS) have been developed to transfer bulk data across WANs with high performance. However, large-data transfer tools are known to adversely affect other network applications on shared networks. Many of the tools explicitly ignore TCP fairness to achieve high performance. Users have experienced high latency and low bandwidth while a large-data transfer is underway. But there have been few empirical studies that quantify the impact of the tools. As an extension of our previous work using synthetic background traffic, we perform an empirical analysis of how the bulk-data transfer tools perform when competing with a non-synthetic, application-based workload (e.g., Network File System). Conversely, we characterize and quantify the impact of bulk-data transfers on the application-based traffic. For example, we show that the RTT for other applications can increase from about 130 ms to about 230 ms for the non-bulk-data users of a shared network.
Hamidreza Anvari and Paul Lu
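To give a sense of scale for the bandwidth-delay product mentioned in the abstract, the sketch below computes BDP = bandwidth × RTT, which is the amount of data that must be in flight to keep a link full. The two RTT values are the ones quoted in the abstract; the 10 Gbit/s link capacity and the function name bdp_bytes are assumptions for illustration, not figures from the paper.

```python
def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes for a link rate in bits/s and an RTT in ms."""
    # bits/s * ms = millibits; divide by 1000 for bits and by 8 for bytes.
    return bandwidth_bps * rtt_ms // 8000


LINK_BPS = 10_000_000_000  # hypothetical 10 Gbit/s WAN link (assumption)

for rtt_ms in (130, 230):  # RTT values quoted in the abstract
    mb = bdp_bytes(LINK_BPS, rtt_ms) / 1_000_000
    print(f"RTT {rtt_ms} ms -> BDP {mb:.1f} MB")
```

Under this assumed link rate, a sender needs on the order of 160 MB in flight at 130 ms to saturate the path, which is why window-limited TCP flows struggle on such networks.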