Data-Driven Computational Sciences (DDCS) Session 3

Time and Date: 9:00 - 10:40 on 14th June 2017

Room: HG D 7.2

Chair: Craig Douglas

382 Multiscale and Multiresolution methods for Sparse representation of Large datasets -- Application to Ice Sheet Data [abstract]
Abstract: In this paper, we present a strategy for studying a large observational dataset at different resolutions to obtain a sparse representation in a computationally efficient manner. Such representations are crucial for many applications, from modeling and inference to visualization. Resolution here stems from the variation of the correlation strength among the different observation instances. The motivation behind the approach is to make a large dataset as small as possible by removing all redundant information, so that the original data can be reconstructed with minimal loss of information. Our past work borrowed ideas from multilevel simulations to extract a sparse representation. Here, we introduce the use of multiresolution kernels. We have tested our approach on a carefully designed suite of analytical functions along with gravity and altimetry time series datasets from a section of the Greenland Ice Sheet. In addition to providing a good strategy for data compression, the proposed approach also finds application in efficient sampling procedures and error filtering in datasets. The results presented in the article clearly establish the promise of the approach, along with prospects for its application in different fields of data analytics in scientific computing and related domains.
Abani Patra, Prashant Shekhar and Beata Csatho
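In rough terms, the redundancy-removal idea in the abstract above can be sketched as a coarse-to-fine pruning pass: a point is retained only when the points already kept cannot reconstruct it within a tolerance. The sketch below is purely illustrative (piecewise-linear reconstruction stands in for the paper's multiresolution kernels, and all names and parameters are invented):

```python
import math

def interp(x, kept):
    """Piecewise-linear estimate at x from the kept (x, y) points."""
    lo = max((p for p in kept if p[0] <= x), key=lambda p: p[0], default=None)
    hi = min((p for p in kept if p[0] >= x), key=lambda p: p[0], default=None)
    if lo is None:
        return hi[1]
    if hi is None or hi[0] == lo[0]:
        return lo[1]
    t = (x - lo[0]) / (hi[0] - lo[0])
    return (1 - t) * lo[1] + t * hi[1]

def sparsify(points, strides, tol):
    """Coarse-to-fine passes: at each resolution (stride), keep a point only
    if the already-kept points fail to reconstruct it within tol."""
    pts = sorted(points)
    kept = [pts[0], pts[-1]]              # always keep the domain boundary
    for stride in strides:                # e.g. coarse 32 down to fine 1
        for p in pts[::stride]:
            if p not in kept and abs(interp(p[0], kept) - p[1]) > tol:
                kept.append(p)
        kept.sort()
    return kept

# dense 1-D "observations": 101 samples of a smooth signal
data = [(i / 10.0, math.sin(i / 10.0)) for i in range(101)]
sparse = sparsify(data, strides=[32, 16, 8, 4, 2, 1], tol=0.05)
print(f"kept {len(sparse)} of {len(data)} points")
```

The kept subset is much smaller than the input yet reconstructs every original sample to within roughly the chosen tolerance, which is the compression/reconstruction trade-off the abstract describes.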
451 Fast Construction of Emulators via Localization [abstract]
Abstract: Making a Bayesian prediction of the chance that a volcanic hazard impacts a particular region requires an estimate of the mass flow consequent to an eruption, for tens of thousands of inputs. These inputs include physical parameters, computational factors, and spatial locations. Mass flow estimates can be determined by computer simulations, which are often too slow to be run for all the necessary input evaluations. Statistical emulators provide a very fast procedure for estimating the mass flow, along with a measure of the error in that estimate. But constructing many classical emulators, such as the Gaussian Stochastic Process emulator, requires inverting a covariance matrix whose dimension equals the number of inputs, which is again too slow to be useful. To speed up emulator construction, some approaches downsample the input space, ignoring expensive and potentially important simulation results. Others truncate the covariance to a small-width diagonal band, which is easy to invert. Here we propose an alternative method: we construct a localized emulator around every point at which the mass flow is to be estimated, and tie these localized processes together in a hierarchical fashion. We show how this approach fits into the theory of Gauss-Markov random fields and demonstrate its efficacy.
E Bruce Pitman, Abani K Patra and Keith Dalbey
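A minimal 1-D illustration of the localization idea above: instead of inverting one covariance matrix over all inputs, a small Gaussian-process fit is built from only the few training runs nearest each prediction point. This is a hypothetical sketch, not the authors' hierarchical Gauss-Markov construction; the kernel, nugget, and neighborhood size are arbitrary choices:

```python
import math

def sqexp(a, b, ell=1.0):
    """Squared-exponential covariance with length scale ell."""
    return math.exp(-((a - b) ** 2) / (2.0 * ell * ell))

def solve(A, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def local_gp_predict(xstar, X, y, m=5, nugget=1e-8):
    """Fit a GP to only the m training runs nearest xstar and predict there;
    an m-by-m solve replaces inverting the full covariance matrix."""
    m = min(m, len(X))
    idx = sorted(range(len(X)), key=lambda i: abs(X[i] - xstar))[:m]
    Xl = [X[i] for i in idx]
    yl = [y[i] for i in idx]
    K = [[sqexp(Xl[i], Xl[j]) + (nugget if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    alpha = solve(K, yl)                       # alpha = K^{-1} y
    return sum(sqexp(xstar, Xl[j]) * alpha[j] for j in range(m))

# "simulator" runs: 21 slow evaluations of a stand-in response surface
X = [0.5 * i for i in range(21)]
y = [math.sin(x) for x in X]
print(local_gp_predict(3.3, X, y))
```

Each prediction costs only an m-by-m solve rather than an n-by-n inversion, which is the computational saving motivating localization; tying the local models together consistently is the harder problem the paper addresses.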
287 From Extraction to Generation of Design Information - Paradigm Shift in Data Mining via Evolutionary Learning Classifier System [abstract]
Abstract: This paper aims at generating as well as extracting design strategies for a real-world problem using an evolutionary learning classifier system. Data mining of a design optimization result, treated as a virtual database, specifies design information and discovers latent design knowledge; this is essential for decision making in real-world problems. Although we have employed several methods, from classical statistics to artificial intelligence, to obtain design information from optimization results, such methods cannot reveal anything beyond the prepared database. In this study, we apply an evolutionary learning classifier system as a data mining technique to a real-world engineering problem. Consequently, it not only extracted known design information but also successfully generated design strategies that could not be extracted from the database. The generated design rules do not physically constitute innovative knowledge, because the prepared dataset includes Pareto solutions obtained through complete exploration to the edge of the feasible region in the optimization. However, this limitation is independent of the method; our evolutionary learning classifier system is a useful technique for incomplete datasets.
Kazuhisa Chiba and Masaya Nakata
294 Case study on: Scalability of preprocessing procedure of remote sensing in Hadoop [abstract]
Abstract: In the research field of remote sensing, the recent growth of image sizes has drawn remarkable attention to processing these files in a distributed architecture. The divide-and-conquer rule is the main attraction in the analysis of scalable algorithms. On the other hand, fault tolerance in data parallelism is a new requirement. In this regard, the Apache Hadoop architecture provides a promising and efficient MapReduce model. In satellite image processing, large-scale images impose limitations on single-computer analysis, whereas the Hadoop Distributed File System (HDFS) gives a remarkable solution for handling these files through its inherent data parallelism. This architecture is well suited for structured data, as structured data can easily be distributed equally and accessed selectively in terms of relevancy. Images, however, are treated as unstructured matrix data in Hadoop, and the whole dataset is relevant to any processing step. Naturally, it becomes a challenge to maintain data locality with equal data distribution. In this paper, we introduce a novel technique that decodes the standard format of raw satellite data and localizes the distributed preprocessing steps on equal splits of the dataset in Hadoop. For this purpose, a suitable modification of the Hadoop interface is proposed. For the case study on the scalability of the preprocessing steps, Synthetic Aperture Radar (SAR) and Multispectral (MS) images are used in the distributed environment.
Sukanta Roy, Sanchit Gupta and S N Omkar
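The data-locality problem the abstract describes can be illustrated by the core computation a modified input format would perform: choosing byte ranges that are as equal as possible yet aligned to whole records (scan lines), so that no mapper receives a torn record. This is a hypothetical sketch of the idea only; the paper's actual Hadoop interface modification is not reproduced here:

```python
def aligned_splits(file_bytes, record_bytes, n_splits):
    """Partition a raw raster file into n_splits byte ranges that are as
    equal as possible yet aligned to whole records (scan lines), so no
    worker ever reads a partial record."""
    n_records = file_bytes // record_bytes
    base, extra = divmod(n_records, n_splits)
    splits, start = [], 0
    for i in range(n_splits):
        count = base + (1 if i < extra else 0)   # spread the remainder
        end = start + count * record_bytes
        splits.append((start, end))
        start = end
    return splits

# a 1000 x 1000 image with 2 bytes per pixel: one scan line = 2000 bytes
splits = aligned_splits(file_bytes=2_000_000, record_bytes=2_000, n_splits=3)
print(splits)
```

Because every split boundary falls on a scan-line boundary, each mapper can decode its byte range independently, which is what makes equal distribution compatible with data locality.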
308 Collaborative SVM for Malware Detection [abstract]
Abstract: Malware has been the primary threat to computers and networks for years. Traditionally, supervised learning methods are applied to detect malware, but supervised models need a great number of labeled samples for training, and it is impractical to label enough malicious code manually. Insufficient training samples yield imperfect detection models, and satisfactory detection results cannot be obtained as a consequence. In this paper, we propose a new algorithm called collaborative SVM, based on semi-supervised learning and independent component analysis. With collaborative SVM, only a few labeled samples are needed while detection accuracy remains high. In addition, we propose a general framework built on independent component analysis, which relaxes the restrictive conditions of collaborative training. Experiments demonstrate the effectiveness of our model.
Zhang Kai, Li Chao, Wang Yong, Xiaobin Zhu and Haiping Wang
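The collaborative-training idea can be sketched, in a very reduced form, as two learners on two feature views (standing in for independent components) that pseudo-label unlabeled samples for each other. Nearest-centroid classifiers replace SVMs here to keep the sketch self-contained; everything below is illustrative, not the paper's algorithm:

```python
def centroids(labeled, view):
    """Per-class mean of one feature view (a stand-in for a per-view SVM)."""
    by_cls = {}
    for x, c in labeled:
        by_cls.setdefault(c, []).append(x[view])
    return {c: sum(v) / len(v) for c, v in by_cls.items()}

def classify(x, cents, view):
    """Predicted class and a confidence margin for one sample."""
    d = sorted((abs(x[view] - m), c) for c, m in cents.items())
    return d[0][1], d[1][0] - d[0][0]

def co_train(labeled, unlabeled, rounds=10):
    """Each view pseudo-labels its most confident unlabeled sample and adds
    it to the shared labeled pool, alternating until the pool is empty."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):                    # the two "independent" views
            if not unlabeled:
                return labeled
            cents = centroids(labeled, view)
            best = max(range(len(unlabeled)),
                       key=lambda i: classify(unlabeled[i], cents, view)[1])
            x = unlabeled.pop(best)
            labeled.append((x, classify(x, cents, view)[0]))
    return labeled

# two well-separated "benign"/"malicious" clusters, one known label per class
labeled = [((0.0, 0.1), 0), ((5.0, 5.1), 1)]
unlabeled = [(0.2, 0.0), (4.8, 5.2), (0.1, 0.3), (5.1, 4.9)]
trained = co_train(labeled, unlabeled)
print(trained)
```

Starting from one labeled sample per class, the loop grows a fully labeled training set, which is the point of the semi-supervised setting: a few labels go a long way when the two views agree.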