Data Driven Computational Sciences 2019 (DDCS) Session 1

Time and Date: 14:40 - 16:20 on 12th June 2019

Room: 0.4

Chair: Craig Douglas

160 Nonparametric Signal Ensemble Analysis for the Search for Extraterrestrial Intelligence (SETI) [abstract]
Abstract: It might be easier for intelligent extraterrestrial civilizations to be found when they mark their position with a bright laser beacon. Given the possible distances involved, however, it is likely that weak signal detection techniques would still be required to identify even the brightest SETI beacon. The Bootstrap Error-adjusted Single-sample Technique (BEST) is such a detection technique. The BEST has been shown to outperform the more traditional Mahalanobis distance metric in analysis of SETI data from a Project Argus near-infrared telescope. The BEST algorithm is used to identify unusual signals, and returns a distance in asymmetric nonparametric multidimensional central 68% confidence intervals (equivalent to standard deviations for 1-D data that are normally distributed, or Mahalanobis distance units for normally distributed data of d dimensions). Calculation of the Mahalanobis metric requires matrix factorization and is O(d3). In contrast, calculation of the BEST metric does not require matrix factorization and is O(d). Furthermore, the accuracy and precision of the BEST metric are greater than the Mahalanobis metric in realistic data collection scenarios (many more wavelengths available than observations at those wavelengths).
Robert Lodder
93 Parallel Strongly Connected Components Detection with Multi-partition on GPUs [abstract]
Abstract: The graph computing is often used to analyze complex relationships in the interconnected world, and the strongly connected components (SCC) detection in digraphs is a basic problem in graph computing. As graph size increases, many parallel algorithms based on GPUs have been proposed to detect SCC. The state-of-the-art parallel algorithms of SCC detection can accelerate on various graphs, but there is still space for improvement in: (1) Multiple traversals are time-consuming when processing real-world graphs; (2) Pivot selection is less accurate or time-consuming. We proposed an SCC detection method with multi-partition that optimizes the algorithm process and achieves high performance. Unlike existing parallel algorithms, we select a pivot and traverse it forward, and then select a vice pivot and traverse the pivot and the vice pivot backwards simultaneously. After updating the state of each vertex, we can get multiple partitions to parallelly detect SCC. At different phases of our approach, we use a vertex with the largest degree product or a random vertex as the pivot to balance selection accuracy and efficiency. We also implement WCC detection and 2-SCC to optimize our algorithm. And the vertices marked by the WCC partition are selected as the pivot to reduce unnecessary operations. We conducted experiments on the NVIDIA K80 with real-world and synthetic graphs. The results show that the proposed algorithm achieves an average detection acceleration of 8.8 x and 21 x when compared with well-known algorithms, such as Tarjan's algorithm and Barnat's algorithm.
Junteng Hou, Shupeng Wang, Guangjun Wu, Ge Fu and Siyu Jia
122 Efficient Parallel Associative Classification based on Rules Memoization [abstract]
Abstract: Associative classification refers to a class of algorithms that is very efficient in classification problems. In such domain, data are typically multidimensional with each instance represents a point in fixed-length attribute space, usually exploring from two very large sets: training and test datasets. Models, known as classifiers, are generated by class association rules mined in the training data and are handled on eager or lazy strategies to label classes for unlabeled instances of a test dataset. In such strategies is typical that unlabeled data are evaluated independently by a series of sophisticated and high costly computations, which may lead to an expressive overlap among classifiers that evaluate similar points in the attribute space. To overcome such drawbacks, we propose a parallel and high-performance associative classification based on a lazy strategy, which partial computations of similar classifiers are cached and shared efficiently. In this sense, a PageRank-driven similarity metric is introduced to measure computations affinity among unlabeled data instances, memoizing the generated association rules. The experiments results show that our similarity-based metric maximizes the reuse of rules cached and, consequently, improve outperform for application, with gains up to 60% in execution time and 40% higher cache hit rate, mainly in limited cache space conditions.
Michel Pires, Leonardo Rocha, Renato Ferreira and Wagner Meira Jr.
407 Extreme Value Theory based Robust Anomaly Detection [abstract]
Abstract: Most current clustering based anomaly detection methods use a scoring schema and thresholds to classify anomalies. These methods are often tailored to target specific data sets with "known" number of clusters. The paper provides a streaming extension to a generalized model that has limited data dependency and performs probabilistic anomaly detection and clustering simultaneously. This ensures that the cluster formation is not impacted by the presence of anomalous data, thereby leading to more reliable definition of "normal vs abnormal" behaviour\footnote{When anomaly detection is performed post clustering, the presence of anomalies gives a slightly skewed definition traditional/normal behavior. To avoid this, simultaneous clustering and anomaly detection is performed. The motivations behind developing the integrated CRP-EV model and the path that leads to the streaming model is discussed.
Sreelekha Guggilam, Abani Patra and Varun Chandola