Data Driven Computational Sciences 2019 (DDCS) Session 1

Time and Date: 14:40 - 16:20 on 12th June 2019

Room: 0.4

Chair: Craig Douglas

160	Nonparametric Signal Ensemble Analysis for the Search for Extraterrestrial Intelligence (SETI) [abstract] Abstract: It might be easier for intelligent extraterrestrial civilizations to be found when they mark their position with a bright laser beacon. Given the possible distances involved, however, it is likely that weak signal detection techniques would still be required to identify even the brightest SETI beacon. The Bootstrap Error-adjusted Single-sample Technique (BEST) is such a detection technique. The BEST has been shown to outperform the more traditional Mahalanobis distance metric in analysis of SETI data from a Project Argus near-infrared telescope. The BEST algorithm is used to identify unusual signals, and returns a distance in asymmetric nonparametric multidimensional central 68% confidence intervals (equivalent to standard deviations for 1-D data that are normally distributed, or Mahalanobis distance units for normally distributed data of d dimensions). Calculation of the Mahalanobis metric requires matrix factorization and is O(d^3). In contrast, calculation of the BEST metric does not require matrix factorization and is O(d). Furthermore, the accuracy and precision of the BEST metric are greater than the Mahalanobis metric in realistic data collection scenarios (many more wavelengths available than observations at those wavelengths).	Robert Lodder
93	Parallel Strongly Connected Components Detection with Multi-partition on GPUs [abstract] Abstract: Graph computing is often used to analyze complex relationships in the interconnected world, and strongly connected components (SCC) detection in digraphs is a basic problem in graph computing. As graph sizes increase, many GPU-based parallel algorithms have been proposed to detect SCCs. State-of-the-art parallel SCC detection algorithms achieve speedups on various graphs, but there is still room for improvement: (1) multiple traversals are time-consuming when processing real-world graphs; (2) pivot selection is either inaccurate or time-consuming. We propose an SCC detection method with multi-partition that optimizes the algorithm process and achieves high performance. Unlike existing parallel algorithms, we select a pivot and traverse forward from it, and then select a vice pivot and traverse backward from the pivot and the vice pivot simultaneously. After updating the state of each vertex, we obtain multiple partitions in which SCCs can be detected in parallel. At different phases of our approach, we use either the vertex with the largest degree product or a random vertex as the pivot to balance selection accuracy and efficiency. We also implement WCC detection and 2-SCC to optimize our algorithm, and vertices marked by the WCC partition are selected as pivots to reduce unnecessary operations. We conducted experiments on an NVIDIA K80 with real-world and synthetic graphs. The results show that the proposed algorithm achieves average detection speedups of 8.8x and 21x when compared with well-known algorithms such as Tarjan's algorithm and Barnat's algorithm.	Junteng Hou, Shupeng Wang, Guangjun Wu, Ge Fu and Siyu Jia
122	Efficient Parallel Associative Classification based on Rules Memoization [abstract] Abstract: Associative classification refers to a class of algorithms that is very efficient in classification problems. In this domain, data are typically multidimensional, with each instance representing a point in a fixed-length attribute space, usually drawn from two very large sets: training and test datasets. Models, known as classifiers, are generated from class association rules mined in the training data and are applied under eager or lazy strategies to assign class labels to unlabeled instances of a test dataset. In such strategies, unlabeled data are typically evaluated independently by a series of sophisticated and costly computations, which may lead to significant overlap among classifiers that evaluate similar points in the attribute space. To overcome these drawbacks, we propose a parallel, high-performance associative classification approach based on a lazy strategy, in which partial computations of similar classifiers are cached and shared efficiently. To this end, a PageRank-driven similarity metric is introduced to measure the affinity of computations among unlabeled data instances, memoizing the generated association rules. The experimental results show that our similarity-based metric maximizes the reuse of cached rules and consequently improves application performance, with gains of up to 60% in execution time and a 40% higher cache hit rate, mainly under limited cache space conditions.	Michel Pires, Leonardo Rocha, Renato Ferreira and Wagner Meira Jr.
407	Extreme Value Theory based Robust Anomaly Detection [abstract] Abstract: Most current clustering based anomaly detection methods use a scoring schema and thresholds to classify anomalies. These methods are often tailored to target specific data sets with a "known" number of clusters. The paper provides a streaming extension to a generalized model that has limited data dependency and performs probabilistic anomaly detection and clustering simultaneously. This ensures that the cluster formation is not impacted by the presence of anomalous data, thereby leading to a more reliable definition of "normal vs abnormal" behaviour. (When anomaly detection is performed post clustering, the presence of anomalies gives a slightly skewed definition of traditional/normal behavior. To avoid this, simultaneous clustering and anomaly detection is performed.) The motivations behind developing the integrated CRP-EV model and the path that leads to the streaming model are discussed.	Sreelekha Guggilam, Abani Patra and Varun Chandola

Data Driven Computational Sciences 2019 (DDCS) Session 2

Time and Date: 16:50 - 18:30 on 12th June 2019

Room: 0.4

Chair: Craig Douglas

141	An Implementation of Coupled Dual-Porosity-Stokes Model with FEniCS [abstract] Abstract: Coupled systems of porous media and conduits are heavily used in a variety of areas. A coupled dual-porosity-Stokes model has been proposed to simulate fluid flow in a coupled system of dual-porosity media and conduits. In this paper, we propose an implementation of this multi-physics model. We solve the system with the automated high performance differential equation solving environment FEniCS. We test the convergence rate of our implementation in both 2D and 3D, and also test its performance and scalability.	Xiukun Hu and Craig C. Douglas
443	Anomaly Detection in Social Media using Recurrent Neural Network [abstract] Abstract: In today’s information environment there is an increasing reliance on online and social media in the acquisition, dissemination and consumption of news. Specifically, the utilization of social media platforms such as Facebook and Twitter has increased as a cutting edge medium for breaking news. On the other hand, the low cost, easy access and rapid propagation of news through social media make the platform more sensitive to fake and anomalous reporting. The propagation of fake and anomalous news is not some benign exercise. The extensive spread of fake news has the potential to do serious and real damage to individuals and society. As a result, the detection of fake news in social media has become a vibrant and important field of research. In this paper, a novel application of machine learning approaches to the detection and classification of fake and anomalous data is considered. An initial clustering step with the K-Nearest Neighbor (KNN) algorithm is proposed before training the result with a Recurrent Neural Network (RNN). A preliminary application of the KNN phase before the RNN phase produces a quantitative and measurable improvement in the detection of outliers, and as such is more effective in detecting anomalies or outliers against the test dataset of 2016 US Presidential Election predictions.	Madhu Goyal
539	Conditional BERT Contextual Augmentation [abstract] Abstract: We propose a novel data augmentation method for labeled sentences called conditional BERT contextual augmentation. Data augmentation methods are often applied to prevent overfitting and improve generalization of deep neural network models. Recently proposed contextual augmentation augments labeled sentences by randomly replacing words with more varied substitutions predicted by a language model. BERT demonstrates that a deep bidirectional language model is more powerful than either a unidirectional language model or the shallow concatenation of a forward and backward model. We retrofit BERT to conditional BERT by introducing a new conditional masked language model task. The well trained conditional BERT can be applied to enhance contextual augmentation. Experiments on six different text classification tasks show that our method can be easily applied to both convolutional and recurrent neural network classifiers to obtain clear improvements.	Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han and Songlin Hu
552	An innovative and reliable water leak detection service supported by data-intensive remote sensing processing [abstract] Abstract: In the scope of the H2020 WADI project, an airborne water leak detection surveillance service, based on manned and unmanned aerial vehicles, is being developed to provide water utilities with adequate information on leaks in large water distribution infrastructures outside urban areas. Given the high cost associated with water infrastructure network repairs, a reliability layer is necessary to improve the trustworthiness of the WADI leak identification, based on complementary technologies for leak detection. Herein, a methodology based on the combined use of Sentinel remote sensing data and a water leak pathways model is presented, based on data-intensive computing. The resulting water leak detection reliability service, provided to the users through a web interface, targets prompt and cost-effective infrastructure repairs with the required degree of confidence in the detected leaks. The web platform allows for both data analysis and visualization of Sentinel images and relevant leak indicators at the sites selected by the user. The user can provide aerial imagery inputs, to be processed together with Sentinel remote sensing data at the satellite acquisition dates identified by the user. The platform provides information about the location and time evolution of the detected leaks, and will be linked in the future with the outputs from water pathway models.	Ricardo Martins, Anabela Oliveira, André Fortunato, Alberto Azevedo, Elsa Alves and Alexandra Carvalho