Data-Driven Computational Sciences (DDCS) Session 1

Time and Date: 14:10 - 15:50 on 13th June 2017

Room: HG D 7.2

Chair: Craig Douglas

214 Data resolution effects on a coupled data driven system for forest fire propagation prediction [abstract]
Abstract: Every year, millions of forest hectares are burned worldwide, with important consequences for the atmosphere, biodiversity and the economy. A correct prediction of the fire evolution allows fire-fighting resources to be managed properly. Therefore, it is crucial to use reliable and fast simulations to predict the evolution of the fire. WRF-SFIRE is a wildland fire simulator that couples a meteorological model, the Weather Research and Forecasting Model (WRF), with a forest fire simulator, SFIRE. This coupling strategy reproduces the interaction between the propagation of the fire and the atmosphere surrounding it. The mesh resolution used to solve the atmosphere evolution has a deep impact on the prediction of small-scale meteorological effects. At the same time, the ability to introduce these small-scale meteorological events into the forest fire simulation improves the quality of the data that drive the simulation and, therefore, the fire propagation predictions. However, this improvement can be affected by the instability of the problem being solved. This paper states the convergence problem due to mesh resolution when using WRF-SFIRE and describes a proposal to overcome it. The proposed scheme has been tested using a real case that took place in Catalonia (northeast Spain) in 2005.
Àngel Farguell, Ana Cortés, Tomàs Margalef, Josep Ramón Miró and Jordi Mercader
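
The mesh-resolution question raised in the abstract above can be illustrated with a generic grid-convergence check. The sketch below uses an invented smooth field as a stand-in for a coupled-model output; it only shows the pattern of comparing a quantity of interest across successive refinements and is not WRF-SFIRE.

    # Generic grid-convergence check sketch: evaluate a quantity of interest at
    # successively finer meshes and watch whether the change between refinements
    # shrinks. The toy integrand is an assumption, not a fire or weather model.
    import numpy as np

    def quantity_of_interest(n_cells):
        """Midpoint-rule area of a smooth 'burned region' indicator on [0, 1]^2."""
        h = 1.0 / n_cells
        x = (np.arange(n_cells) + 0.5) * h
        xx, yy = np.meshgrid(x, x)
        field = np.exp(-((xx - 0.4) ** 2 + (yy - 0.6) ** 2) / 0.02)
        return field.sum() * h * h

    prev = None
    for n in (25, 50, 100, 200, 400):
        q = quantity_of_interest(n)
        change = "" if prev is None else f"  change vs coarser: {abs(q - prev):.2e}"
        print(f"{n:4d} x {n:<4d} cells: Q = {q:.6f}{change}")
        prev = q
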
442 Data Assimilation of Wildfires with Fuel Adjustment Factors in FARSITE using Ensemble Kalman Filtering [abstract]
Abstract: This paper presents an extension of the wildfire simulation tool FARSITE that adds data assimilation capabilities for both fire perimeters and fuel adjustment factors to improve the accuracy of wildfire spread predictions. While fire perimeters characterize the overall burn scar of a wildfire, fuel adjustment factors are fuel-model-specific calibration numbers that adjust the rate of spread for each fuel type independently. Data assimilation updates of both fire perimeters and fuel adjustment factors are calculated with an Ensemble Kalman Filter (EnKF) that exploits the uncertainty information on the simulated fire perimeter, the fuel adjustment factors and a measured fire perimeter. The effectiveness of the proposed data assimilation is illustrated on a wildfire simulation representing the 2014 Cocos fire, tracking time-varying fuel adjustment factors based on noisy and limited-spatial-resolution observations of the fire perimeter.
Thayjes Srivas, Raymond de Callafon, Daniel Crawl and Ilkay Altintas
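
For readers unfamiliar with the EnKF step described above, the following is a minimal sketch of a stochastic (perturbed-observation) EnKF update for a joint state made of fire-perimeter coordinates and fuel adjustment factors. Dimensions, the observation operator H and all numbers are assumptions for illustration; this is not the FARSITE extension itself.

    # Minimal stochastic Ensemble Kalman Filter update (illustration only).
    import numpy as np

    def enkf_update(ensemble, observe, y_obs, obs_cov, rng):
        """ensemble: (n_members, n_state); observe: maps a state vector -> observation."""
        n_members = ensemble.shape[0]
        Y = np.array([observe(x) for x in ensemble])          # predicted observations
        X_mean, Y_mean = ensemble.mean(0), Y.mean(0)
        Xp, Yp = ensemble - X_mean, Y - Y_mean                 # anomalies
        Pxy = Xp.T @ Yp / (n_members - 1)                      # cross-covariance
        Pyy = Yp.T @ Yp / (n_members - 1) + obs_cov            # innovation covariance
        K = Pxy @ np.linalg.inv(Pyy)                           # Kalman gain
        y_pert = rng.multivariate_normal(y_obs, obs_cov, size=n_members)
        return ensemble + (y_pert - Y) @ K.T                   # updated ensemble

    # Example: 20 members, state = 10 perimeter points (x, y) + 3 fuel factors
    rng = np.random.default_rng(1)
    ens = rng.normal(size=(20, 23))
    H = lambda x: x[:20:2]            # observe only x-coordinates of the perimeter
    y = rng.normal(size=10)
    ens_post = enkf_update(ens, H, y, 0.1 * np.eye(10), rng)
    print(ens_post.shape)
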
187 Optimization strategy exploration in a wildfire propagation data driven system [abstract]
Abstract: The increasing capacity to gather data on an ongoing wildfire operation has spurred methods and strategies to incorporate these data into flexible models that improve forecasting accuracy and validity. In this paper we discuss the optimization strategy included in an inverse modelling algorithm based on a semi-empirical fire spread model fed with airborne infrared images. The algorithm calibrates seven parameters and incorporates a topographic diagnostic wind model. The optimization problem is shown to be non-smooth, and thus the choice of solution strategy is critical with regard to efficiency and time constraints. Three optimization strategies are evaluated in a synthetic real-scale scenario to select the most efficient one. Preliminary results are discussed and compared.
Oriol Rios, M. Miguel Valero, Elsa Pastor and Eulalia Planas
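
As a hedged illustration of the inverse-modelling idea above, the sketch below calibrates the parameters of a toy elliptical spread model against a synthetic observed perimeter with a derivative-free optimizer (Nelder-Mead, a common choice for non-smooth problems). The toy model, parameter names and error metric are assumptions, not the authors' algorithm.

    # Derivative-free calibration of spread-model parameters (illustrative only).
    import numpy as np
    from scipy.optimize import minimize

    def toy_spread_model(params, t=1.0, n=64):
        """Elliptical fire front grown from the origin; params = (a, b, theta)."""
        a, b, theta = params
        phi = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
        x, y = a * t * np.cos(phi), b * t * np.sin(phi)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        return (rot @ np.vstack([x, y])).T                      # (n, 2) perimeter

    def perimeter_error(params, observed):
        return np.mean(np.linalg.norm(toy_spread_model(params) - observed, axis=1))

    # Synthetic "observed" perimeter from known parameters, plus noise
    rng = np.random.default_rng(0)
    observed = toy_spread_model((2.0, 1.0, 0.4)) + rng.normal(0, 0.05, size=(64, 2))

    result = minimize(perimeter_error, x0=[1.0, 1.0, 0.0],
                      args=(observed,), method="Nelder-Mead")
    print(result.x, result.fun)
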
127 Feature Based Grid Event Classification from Synchrophasor Data [abstract]
Abstract: This paper presents a method for automatic classification of power disturbance events in an electric grid by means of distributed parameter estimation and clustering techniques applied to synchrophasor data produced by phasor measurement units (PMUs). Disturbance events detected in the PMU data are subjected to a parameter estimation routine to extract features that include oscillation frequency, participation factor, damping factor, and pre- and post-event frequency offset. The parameters are used to classify events, and classification rules are deduced from a training set of known events using nonlinear programming. Once the classification rules are set, the approach can be used to automatically classify events not seen in the training set. The proposed event classification is illustrated on data from a microPMU system developed by Power Standards Lab, for which disturbance events were measured over several months.
Sai Akhil Reddy Konakalla and Raymond de Callafon
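
A hedged sketch of the parameter-estimation step described above: fit a damped oscillation plus frequency offset to a disturbance segment, then classify the event from the fitted features. The model form, thresholds and class names are invented for illustration and are not the paper's rules.

    # Feature extraction from a synthetic PMU frequency-deviation signal.
    import numpy as np
    from scipy.optimize import curve_fit

    def disturbance_model(t, amp, damping, freq_hz, offset):
        return amp * np.exp(-damping * t) * np.cos(2 * np.pi * freq_hz * t) + offset

    # Synthetic signal sampled at 60 frames/s for 5 seconds
    t = np.arange(0, 5, 1 / 60)
    rng = np.random.default_rng(0)
    signal = disturbance_model(t, 0.05, 0.8, 1.2, -0.01) + rng.normal(0, 0.002, t.size)

    popt, _ = curve_fit(disturbance_model, t, signal, p0=[0.1, 1.0, 1.0, 0.0])
    amp, damping, freq_hz, offset = popt

    # Toy rule-based classification on the extracted features
    if offset < -0.005:
        label = "generation loss"
    elif offset > 0.005:
        label = "load loss"
    else:
        label = "oscillation event"
    print(popt, label)
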
586 A Framework for Provenance Analysis and Visualization [abstract]
Abstract: Data provenance is a fundamental concept in scientific experimentation. However, efficient and user-friendly mechanisms are needed for its understanding and use. Research in software visualization, ontologies and complex networks can help in this process. This paper presents a framework to assist the understanding and use of data provenance through visualization techniques, ontologies and complex networks. The framework generates new information using ontologies and provenance graph analysis and highlights results through new visualization techniques. The framework was used in the E-SECO scientific ecosystem platform.
Weiner Oliveira, Lenita M. Ambrósio, Regina Braga, Victor Ströele, José Maria N. David and Fernanda Campos
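
A minimal provenance-graph sketch, loosely related to the graph analysis mentioned above: activities and data artifacts as nodes, dependency edges between them, and a simple lineage query. The entity names are invented and do not come from the E-SECO platform.

    # Tiny provenance graph and lineage query with networkx.
    import networkx as nx

    prov = nx.DiGraph()
    # Edges point from inputs to the activities and outputs that depend on them
    prov.add_edge("raw_data.csv", "clean_step", relation="used")
    prov.add_edge("clean_step", "clean_data.csv", relation="generated")
    prov.add_edge("clean_data.csv", "train_step", relation="used")
    prov.add_edge("train_step", "model.pkl", relation="generated")

    # Lineage of an artifact: every node it (transitively) depends on
    lineage = nx.ancestors(prov, "model.pkl")
    print(sorted(lineage))
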

Data-Driven Computational Sciences (DDCS) Session 2

Time and Date: 16:20 - 18:00 on 13th June 2017

Room: HG D 7.2

Chair: Craig Douglas

242 Human Identification and Localization by Robots in Collaborative Environments [abstract]
Abstract: Environments in which mobile robots and humans must coexist tend to be quite dangerous to the humans. Many employers have resorted to separating the two groups, since the robots move quickly and do not maneuver around humans easily, resulting in human injuries. In this paper we provide a roadmap towards integrating the two worker groups (humans and robots) to increase both efficiency and safety. Improved human-to-robot communication and collaboration has implications for multiple applications. For example: (1) robots that manage all aspects of dispensing items (e.g., drugs in pharmacies or supplies and tools in a remote workplace), reducing human errors; (2) robots capable of operating in dangerous locations that triage injured subjects using remote sensing of vital signs; (3) 'smart' crash carts that move themselves to a required location in a hospital or in the field, help dispense drugs and tools, save time and money, and prevent accidents.
Craig C. Douglas and Robert A. Lodder
257 Data-driven design of an Ebola therapeutic [abstract]
Abstract: Data-driven computational science has found many applications in drug design. Molecular data are commonly used to design new drug molecules. Engineering process simulations guide the development of the Chemistry, Manufacturing, and Controls (CMC) section of Investigational New Drug (IND) applications filed with the FDA. Computer simulations can also guide the design of human clinical trials. Formulation is very important in drug delivery: the wrong formulation can render a drug product useless. The amount of preclinical (animal and in vitro) work that must be done before a new drug candidate can be tested in humans can be a problem. The cost of these cGxP studies is typically $3 to $5 million. If the wrong drug product formulation is tested, new iterations of the formulation must be tested, with additional costs of $3 to $5 million each. Data-driven computational science can help reduce this cost. In the absence of existing human exposure, a battery of tests involving acute and chronic toxicology and cardiovascular, central nervous system, and respiratory safety pharmacology must be performed in at least two species before the FDA will permit testing in humans. However, for many drugs (such as those beginning with natural products) there is a history of human exposure. In these cases, computer modeling of a population to determine human exposure may be adequate to permit phase 1 studies with a candidate formulation in humans. The CDC's National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. The NHANES database can be mined to determine the average and 90th-percentile exposures to a food additive, and early human formulation testing can be conducted at levels beneath those to which the US population is ordinarily exposed through food. These data can be combined with data mined from international chemical shipments to validate an exposure model. This paper describes the data-driven formulation testing process using a new candidate Ebola treatment that, unlike vaccines, can be used after a person has contracted the disease. This drug candidate's mechanism of action permits it to potentially be used against all strains of the virus, a characteristic that vaccines might not share.
Robert Lodder
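
A heavily hedged sketch of the kind of exposure summary mentioned above (mean and 90th-percentile intake of a food additive from survey-style records). The column names and data are invented; real NHANES analyses also require the survey sampling weights, which are omitted here for brevity.

    # Toy exposure summary from survey-style records.
    import numpy as np
    import pandas as pd

    records = pd.DataFrame({
        "subject_id": range(1, 7),
        "additive_mg_per_day": [12.0, 3.5, 0.0, 27.1, 8.4, 15.9],
    })

    mean_exposure = records["additive_mg_per_day"].mean()
    p90_exposure = np.percentile(records["additive_mg_per_day"], 90)
    print(f"mean = {mean_exposure:.1f} mg/day, 90th percentile = {p90_exposure:.1f} mg/day")
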
383 Transforming a Local Medical Image Analysis for Running on a Hadoop Cluster [abstract]
Abstract: There is progressive digitization in many medical fields, such as digital microscopy, which leads to an increase in data volume and processing demands on the underlying computing infrastructure. This paper explores the scaling behaviour of a Ki-67 analysis application that processes medical image tiles originating from a WSI (Whole Slide Image) file. Furthermore, it describes how the software was ported from a Windows PC to a distributed Linux cluster environment. A test for platform independence revealed a non-deterministic behaviour of the application, which was fixed successfully. The speedup of the application is determined. The slope of the increase is quite close to 1, i.e. there is almost no loss due to parallelization overhead. Beyond the cluster's hardware limit (72 cores, 144 threads, 216 GB RAM) the speedup saturates at a value around 64. This is a strong improvement over the original software, whose speedup is limited to two.
Marco Strutz, Hermann Heßling and Achim Streit
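
A small sketch of the speedup and efficiency figures discussed above, with speedup S(n) = T(1) / T(n) and parallel efficiency E(n) = S(n) / n. The timing values below are made up for illustration and are not the paper's measurements.

    # Speedup and efficiency from (assumed) runtimes.
    t_serial = 3600.0                                        # seconds on one core (assumed)
    timings = {8: 470.0, 32: 121.0, 72: 56.0, 144: 56.5}     # cores -> runtime (assumed)
    for cores, t in sorted(timings.items()):
        speedup = t_serial / t
        print(f"{cores:4d} cores: speedup {speedup:5.1f}, efficiency {speedup / cores:.2f}")
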
208 Decentralized Dynamic Data-Driven Monitoring of Dispersion Processes on Partitioned Domains [abstract]
Abstract: The application of mobile sensor-carrying vehicles for online estimation of dynamic dispersion processes is extremely beneficial. Based on current estimates that rely on past measurements and forecasts obtained from a discretized PDE model, the movement of the vehicles can be adapted, resulting in measurements at more informative locations. In this work, a novel decentralized monitoring approach based on partitioning the spatial domain into several subdomains is proposed. Each sensor is assigned to the subdomain it is located in and is only required to maintain a process and multi-vehicle model related to its subdomain. In this way, the vast communication requirements of related centralized approaches and costly full-model simulations are avoided, making the presented approach more scalable with respect to a larger number of sensor-carrying vehicles and a larger problem domain. The approach consists of a new prediction and update method based on a domain decomposition method and a partitioned variant of the Ensemble Square Root Filter that requires only a minimal exchange of data between sensors on neighboring subdomains. Furthermore, a cooperative vehicle controller is applied in such a way that a dynamic adaptation of the sensor distribution becomes possible.
Tobias Ritter, Stefan Ulbrich and Oskar von Stryk
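
The partitioning idea above can be illustrated with a very small sketch: split a 1-D spatial domain into subdomains, assign each sensor vehicle to the subdomain it lies in, and let each subdomain keep only its local slice of the ensemble. The filter update itself is omitted; all numbers are invented and this is not the authors' partitioned Ensemble Square Root Filter.

    # Domain partitioning and local-ensemble bookkeeping (illustration only).
    import numpy as np

    n_cells, n_subdomains, n_members = 120, 4, 25
    cells_per_sub = n_cells // n_subdomains
    rng = np.random.default_rng(0)

    # Global ensemble of the discretized concentration field (members x cells)
    global_ensemble = rng.normal(size=(n_members, n_cells))

    # Sensor positions in [0, 1); each sensor belongs to exactly one subdomain
    sensor_pos = rng.uniform(0.0, 1.0, size=6)
    sensor_sub = (sensor_pos * n_subdomains).astype(int)

    local_state = {}
    for s in range(n_subdomains):
        cols = slice(s * cells_per_sub, (s + 1) * cells_per_sub)
        local_state[s] = {
            "ensemble": global_ensemble[:, cols].copy(),   # only the local block
            "sensors": np.where(sensor_sub == s)[0].tolist(),
        }

    for s, info in local_state.items():
        print(f"subdomain {s}: {info['ensemble'].shape[1]} cells, sensors {info['sensors']}")
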
265 A Framework for Direct and Transparent Data Exchange of Filter-stream Applications in Multi-GPUs Architectures [abstract]
Abstract: Massive data generation has been pushing for significant advances in computing architectures, reflected in heterogeneous architectures composed of different types of processing units. The filter-stream paradigm is typically used to exploit the parallel processing power of these new architectures. The efficiency of applications in this paradigm is achieved by exploiting a set of interconnected computers (a cluster) using filters and coordinated communication between them. In this work we propose, implement and test a generic abstraction for direct and transparent data exchange of filter-stream applications in heterogeneous clusters with multi-GPU (Graphics Processing Unit) architectures. This abstraction hides from programmers all the low-level implementation details related to GPU communication and the control related to the location of filters. Further, we consolidate this abstraction into a framework. Empirical assessments using a real application show that the proposed abstraction layer eases the implementation of filter-stream applications without compromising overall application performance.
Leonardo Rocha, Gabriel Ramons, Guilherme Andrade, Rafael Sachetto, Daniel Madeira, Renan Carvalho, Renato Ferreira and Fernando Mourão
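
For readers new to the paradigm, the following is a CPU-only, single-machine sketch of a filter-stream pipeline: filters connected by streams (queues), with the transport hidden behind put/get so filter code never deals with the wiring. It illustrates the general idea only and is not the multi-GPU framework proposed in the paper.

    # Minimal filter-stream pipeline with multiprocessing queues.
    from multiprocessing import Process, Queue

    def source(out_q, n=10):
        for i in range(n):
            out_q.put(i)
        out_q.put(None)                      # end-of-stream marker

    def square_filter(in_q, out_q):
        while (item := in_q.get()) is not None:
            out_q.put(item * item)
        out_q.put(None)

    def sink(in_q):
        total = 0
        while (item := in_q.get()) is not None:
            total += item
        print("sum of squares:", total)

    if __name__ == "__main__":
        q1, q2 = Queue(), Queue()
        stages = [Process(target=source, args=(q1,)),
                  Process(target=square_filter, args=(q1, q2)),
                  Process(target=sink, args=(q2,))]
        for p in stages: p.start()
        for p in stages: p.join()
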

Data-Driven Computational Sciences (DDCS) Session 3

Time and Date: 9:00 - 10:40 on 14th June 2017

Room: HG D 7.2

Chair: Craig Douglas

382 Multiscale and Multiresolution methods for Sparse representation of Large datasets -- Application to Ice Sheet Data [abstract]
Abstract: In this paper, we present a strategy for studying a large observational dataset at different resolutions to obtain a sparse representation in a computationally efficient manner. Such representations are crucial for many applications, from modeling and inference to visualization. Resolution here stems from the variation of the correlation strength among the different observation instances. The motivation behind the approach is to make a large dataset as small as possible by removing all redundant information so that the original data can be reconstructed with minimal loss of information. Our past work borrowed ideas from multilevel simulations to extract a sparse representation. Here, we introduce the use of multi-resolution kernels. We have tested our approach on a carefully designed suite of analytical functions along with gravity and altimetry time series datasets from a section of the Greenland Ice Sheet. In addition to providing a good strategy for data compression, the proposed approach also finds application in efficient sampling procedures and error filtering in the datasets. The results presented in the article clearly establish the promising nature of the approach, along with prospects of its application in different fields of data analytics in scientific computing and related domains.
Abani Patra, Prashant Shekhar and Beata Csatho
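
One simple way to illustrate kernel-based sparse selection at a chosen resolution (length scale) is greedy pivoted Cholesky on the kernel matrix, sketched below. The kernel choice, stopping rule and data are assumptions; the paper's multiresolution strategy is more elaborate than this.

    # Greedy selection of representative points from a kernel at a given length scale.
    import numpy as np

    def rbf_kernel(X, Y, length_scale):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)

    def greedy_sparse_points(X, length_scale, tol=1e-2, max_points=50):
        """Pick points until the residual kernel diagonal drops below tol."""
        K = rbf_kernel(X, X, length_scale)
        diag = K.diagonal().copy()
        L = np.zeros((X.shape[0], 0))
        chosen = []
        while diag.max() > tol and len(chosen) < max_points:
            j = int(diag.argmax())
            col = (K[:, j] - L @ L[j]) / np.sqrt(diag[j])
            L = np.hstack([L, col[:, None]])
            diag = np.clip(diag - col ** 2, 0.0, None)
            chosen.append(j)
        return chosen

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(400, 2))            # stand-in for observation sites
    coarse = greedy_sparse_points(X, length_scale=3.0)
    fine = greedy_sparse_points(X, length_scale=0.5)
    print(len(coarse), "coarse points;", len(fine), "fine points")
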
451 Fast Construction of Emulators via Localization [abstract]
Abstract: To make a Bayesian prediction of the chances of a volcanic hazard impacting a particular region requires an estimate of the mass flow consequent to an eruption, for tens of thousands of input parameters. These inputs include physical parameters, computational factors, and spatial locations. Mass flow estimates can be determined by computer simulations, which are often too slow to be used for all the necessary input evaluations. Statistical emulators provide a very fast procedure for estimating the mass flow, along with a measure of the error in that estimate. But construction of many classical emulators, such as the Gaussian stochastic process emulator, requires inversion of a covariance matrix whose dimension is equal to the number of inputs – again, too slow to be useful. To speed up the emulator construction, some approaches downsample the input space, which ignores expensive and potentially important simulation results. Others propose truncating the covariance to a small-width diagonal band, which is easy to invert. Here we propose an alternative method. We construct a localized emulator around every point at which the mass flow is to be estimated, and tie these localized processes together in a hierarchical fashion. We show how this approach fits into the theory of Gauss-Markov random fields to demonstrate its efficacy.
E Bruce Pitman, Abani K Patra and Keith Dalbey
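
A minimal "localized emulator" sketch related to the idea above: instead of one global Gaussian process over all simulator runs, fit a small GP on the k nearest runs around each prediction point, so only k x k covariance matrices are ever inverted. The toy simulator, kernel and k are assumptions; the paper's hierarchical Gauss-Markov construction is not reproduced here.

    # Local Gaussian-process emulation around a prediction point.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X_runs = rng.uniform(-3, 3, size=(2000, 2))                 # simulator inputs
    y_runs = np.sin(X_runs[:, 0]) * np.cos(X_runs[:, 1])        # "mass flow" proxy

    nbrs = NearestNeighbors(n_neighbors=40).fit(X_runs)

    def local_emulate(x_new):
        _, idx = nbrs.kneighbors(x_new.reshape(1, -1))
        idx = idx[0]
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-6)
        gp.fit(X_runs[idx], y_runs[idx])
        mean, std = gp.predict(x_new.reshape(1, -1), return_std=True)
        return mean[0], std[0]                                   # estimate + error measure

    print(local_emulate(np.array([0.5, -1.0])))
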
287 From Extraction to Generation of Design Information - Paradigm Shift in Data Mining via Evolutionary Learning Classifier System [abstract]
Abstract: This paper aims at generating as well as extracting design strategies for a real-world problem using an evolutionary learning classifier system. Data mining of a design optimization result, treated as a virtual database, specifies design information and discovers latent design knowledge; it is essential for decision making in real-world problems. Although we have employed several methods, from classical statistics to artificial intelligence, to obtain design information from optimization results, they cannot recognize anything beyond the prepared database. In this study, we have applied an evolutionary learning classifier system as a data mining technique to a real-world engineering problem. Consequently, it not only extracted known design information but also successfully generated design strategies that could not be extracted from the database. The generated design rules do not, in practice, constitute innovative knowledge, because the prepared dataset includes Pareto solutions owing to complete exploration up to the edge of the feasible region in the optimization. However, this limitation is independent of the method; our evolutionary learning classifier system is a useful method for incomplete datasets.
Kazuhisa Chiba and Masaya Nakata
294 Case study on: Scalability of preprocessing procedure of remote sensing in Hadoop [abstract]
Abstract: In the field of remote sensing, the recent growth of image sizes has drawn remarkable attention to processing these files in a distributed architecture. The divide-and-conquer rule is the main attraction in the analysis of scalable algorithms, while fault tolerance in data parallelism is a further requirement. In this regard, the Apache Hadoop architecture has become a promising and efficient MapReduce model. In satellite image processing, large-scale images limit single-computer analysis, whereas the Hadoop Distributed File System (HDFS) offers a remarkable solution for handling these files through its inherent data parallelism. This architecture is well suited for structured data, as such data can easily be distributed equally and accessed selectively in terms of relevance. Images are considered unstructured matrix data in Hadoop, and the whole dataset is relevant for any processing. Naturally, it becomes a challenge to maintain data locality with equal data distribution. In this paper, we introduce a novel technique that decodes the standard format of raw satellite data and localizes the distributed preprocessing step on equal splits of the dataset in Hadoop. For this purpose, a suitable modification of the Hadoop interface is proposed. For the case study on the scalability of the preprocessing steps, Synthetic Aperture Radar (SAR) and Multispectral (MS) data are used in a distributed environment.
Sukanta Roy, Sanchit Gupta and S N Omkar
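
A toy sketch of the equal-split idea above: cut a raster into row-aligned chunks of (nearly) equal size so each split can be preprocessed independently. Real HDFS splitting works on byte ranges of the stored file and needs a custom InputFormat, which this code does not show; the image size and split count are invented.

    # Row-aligned equal splits of a raster image (illustration only).
    import numpy as np

    def row_aligned_splits(n_rows, n_splits):
        """Return (start_row, end_row) pairs covering all rows as evenly as possible."""
        edges = np.linspace(0, n_rows, n_splits + 1).astype(int)
        return list(zip(edges[:-1], edges[1:]))

    image = np.random.default_rng(0).integers(0, 255, size=(10980, 10980), dtype=np.uint8)
    for start, end in row_aligned_splits(image.shape[0], n_splits=8):
        tile = image[start:end]                       # independent preprocessing unit
        print(f"rows {start:5d}-{end:5d}: mean intensity {tile.mean():.1f}")
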
308 Collaborative SVM for Malware Detection [abstract]
Abstract: Malware has been the primary threat to computers and networks for years. Traditionally, supervised learning methods are applied to detect malware, but supervised learning models need a great number of labeled samples to train the models beforehand, and it is impractical to label enough malicious code manually. Insufficient training samples yield imperfect detection models, and as a result satisfactory detection results cannot be obtained. In this paper, we present a new algorithm called collaborative SVM, based on semi-supervised learning and independent component analysis. With collaborative SVM, only a few labeled samples are needed while the detection performance remains high. In addition, we propose a general framework with independent component analysis to relax the restrictive conditions of collaborative training. Experiments confirm the effectiveness of our model.
Zhang Kai, Li Chao, Wang Yong, Xiaobin Zhu and Haiping Wang
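
As a related (but simpler) illustration of the semi-supervised idea above, the sketch below trains an SVM via scikit-learn's self-training wrapper, showing how a few labeled samples plus many unlabeled ones can still yield a usable detector. Collaborative SVM as described in the abstract additionally uses ICA and collaborative training between views, which this example does not implement; the dataset is synthetic.

    # Semi-supervised SVM with self-training (not the paper's algorithm).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Pretend only ~5% of the training samples are labeled (-1 marks "unlabeled")
    rng = np.random.default_rng(0)
    y_semi = y_train.copy()
    y_semi[rng.uniform(size=y_semi.size) > 0.05] = -1

    model = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
    model.fit(X_train, y_semi)
    print("accuracy with ~5% labels:", model.score(X_test, y_test))
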

Data-Driven Computational Sciences (DDCS) Session 4

Time and Date: 13:25 - 15:05 on 14th June 2017

Room: HG D 7.2

Chair: Craig Douglas

381 Improving Performance of Multiclass Classification by Inducing Class Hierarchies [abstract]
Abstract: In the last decades, one issue that has received a lot of attention in classification problems is how to obtain better classifications. This problem becomes even more complicated when the number of classes is high. In this multiclass scenario, it is usually assumed that the class labels are independent of each other, and most techniques and methods proposed to improve the performance of classifiers rely on this assumption. An alternative way to address the multiclass problem is to hierarchically distribute the classes into a collection of multiclass subproblems, reducing the number of classes involved in each local subproblem. In this paper, we propose a new method for inducing a class hierarchy from the confusion matrix of a multiclass classifier. We then use the class hierarchy to learn a tree-like hierarchy of classifiers for solving the original multiclass problem, in a similar way to how the top-down hierarchical classification approach works in hierarchical domains. We experimentally evaluate the proposal on a collection of multiclass datasets, showing that, in general, the generated hierarchies not only outperform the original (flat) classification but also hierarchical approaches based on other ways of constructing the class hierarchy.
Daniel Andrés Silva Palacios, Cèsar Ferri and Maria José Ramírez Quintana
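
A sketch of the core idea above: derive class similarity from a flat classifier's confusion matrix and cluster it into a class hierarchy. The symmetrization, distance definition and linkage choice are assumptions, and the paper then trains one classifier per internal node of the induced tree, which is omitted here.

    # Induce a class hierarchy from a confusion matrix (illustration only).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict

    X, y = load_digits(return_X_y=True)
    y_pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=5)
    cm = confusion_matrix(y, y_pred).astype(float)

    # Row-normalize, symmetrize, and turn "often confused" into "close"
    P = cm / cm.sum(axis=1, keepdims=True)
    similarity = (P + P.T) / 2.0
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)

    # Condensed distance vector for scipy's agglomerative clustering
    iu = np.triu_indices_from(distance, k=1)
    Z = linkage(distance[iu], method="average")
    print(dendrogram(Z, no_plot=True)["ivl"])   # leaf order = induced class grouping
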
574 The Impact of Large-Data Transfers in Shared WANs: An Empirical Study [abstract]
Abstract: Computational science, especially in the era of Big Data, sometimes requires large data files to be transferred over high bandwidth-delay-product (BDP) wide-area networks (WANs). Experimental data (e.g., LHC, SKA), analytics logs, and filesystem backups are regularly transferred between research centres and between private and public clouds. Fortunately, a variety of tools (e.g., GridFTP, UDT, PDS) have been developed to transfer bulk data across WANs with high performance. However, large-data transfer tools are known to adversely affect other network applications on shared networks. Many of the tools explicitly ignore TCP fairness to achieve high performance. Users have experienced high latencies and low bandwidth when a large-data transfer is underway, but there have been few empirical studies that quantify the impact of the tools. As an extension of our previous work using synthetic background traffic, we perform an empirical analysis of how the bulk-data transfer tools perform when competing with a non-synthetic, application-based workload (e.g., Network File System). Conversely, we characterize and quantify the impact of bulk-data transfers on the application-based traffic. For example, we show that the RTT latency for other applications can increase from about 130 ms to about 230 ms for the non-bulk-data users of a shared network.
Hamidreza Anvari and Paul Lu