Sixth Workshop on Data Mining in Earth System Science (DMESS) Session 1

Time and Date: 16:20 - 18:00 on 2nd June 2015

Room: M209

Chair: Jay Larson

739	Data Mining in Earth System Science (DMESS 2015) Abstract: Spanning many orders of magnitude in time and space scales, Earth science data are increasingly large and complex and often represent very long time series, making such data difficult to analyze, visualize, interpret, and understand. Moreover, advanced electronic data storage technologies have enabled the creation of large repositories of observational data, while modern high performance computing capacity has enabled the creation of detailed empirical and process-based models that produce copious output across all these time and space scales. The resulting “explosion” of heterogeneous, multi-disciplinary Earth science data has rendered traditional means of integration and analysis ineffective, necessitating the application of new analysis methods and the development of highly scalable software tools for synthesis, assimilation, comparison, and visualization. This workshop explores various data mining approaches to understanding Earth science processes, emphasizing the unique technological challenges associated with utilizing very large and long time series geospatial data sets. Especially encouraged are original research papers describing applications of statistical and data mining methods (including cluster analysis, empirical orthogonal functions (EOFs), genetic algorithms, neural networks, automated data assimilation, and other machine learning techniques) that support analysis and discovery in climate, water resources, geology, ecology, and environmental sciences research.	Forrest M. Hoffman, Jitendra Kumar and Jay Larson
312	Pattern-Based Regionalization of Large Geospatial Datasets Using COBIA Abstract: Pattern-based regionalization -- spatial classification of an image into sub-regions characterized by relatively stationary patterns of pixel values -- is of significant interest for conservation and planning, as well as for academic research. A technique called complex object-based image analysis (COBIA) is particularly well-suited for pattern-based regionalization of very large spatial datasets. In COBIA, an image is subdivided into a regular grid of local blocks of pixels (complex objects) at minimal computational cost. Further analysis is performed on those blocks, which represent local patterns of a pixel-based variable. The variant of COBIA presented here works on pixel-classified images, uses a histogram of co-occurrence pattern features as the block attribute, and utilizes the Jensen-Shannon divergence to measure the distance between any two local patterns. In this paper the COBIA concept is utilized for unsupervised regionalization of a land cover dataset (pixel-classified Landsat images) into landscape types -- characteristic patterns of different land covers. This exploratory technique identifies and delineates landscape types by combining segmentation of a grid of local patterns with clustering of the segments. A test site with 3.5 x 10^8 pixels is regionalized in just a few minutes using a standard desktop computer. The computational efficiency of the presented approach allows for carrying out regionalizations of various high resolution spatial datasets on continental or global scales.	Tomasz Stepinski, Jacek Niesterowicz, Jaroslaw Jasiewicz
720	Fidelity of Precipitation Extremes in High Resolution Global Climate Simulations Abstract: Precipitation extremes have tangible societal impacts. Here, we assess whether current state-of-the-art global climate model simulations at high spatial resolutions capture the observed behavior of precipitation extremes in the past few decades over the continental US. We design a correlation-based regionalization framework to quantify precipitation extremes, where samples of extreme events for a grid box may also be drawn from neighboring grid boxes with statistically equal means and statistically significant temporal correlations. We model precipitation extremes by fitting the Generalized Extreme Value (GEV) distribution to time series of annual maximum precipitation. Non-stationarity of extremes is captured by including a time-dependent parameter in the GEV distribution. Our analysis reveals that the high-resolution model substantially improves the simulation of stationary precipitation extreme statistics, particularly over the Northwest Pacific coastal region and the Southeast US. Observational data exhibit significant non-stationary behavior of extremes only over some parts of the Western US, with declining trends in the extremes. While the high resolution simulations improve upon the low resolution model in simulating this non-stationary behavior, the trends are statistically significant only over some of those regions.	Salil Mahajan, Katherine Evans, Marcia Branstetter, Valentine Anantharaj, Juliann Leifeld
729	On Parallel and Scalable Classification and Clustering Techniques for Earth Science Datasets Abstract: A common observation in earth science is that datasets are growing massively in volume (e.g. through higher quality measurements) or in the number of dimensions (e.g. hyperspectral bands in satellite observations). Traditional data mining tools (R, Matlab, etc.) are becoming partly infeasible to use with those datasets. Parallel and scalable techniques have the potential to overcome these limits, but our analysis revealed that not all of the wide variety of new implementations are suited for data mining tasks in earth science. This contribution explains why by focusing on two distinct parallel and scalable data mining techniques used in High Performance Computing (HPC) environments in earth science case studies: (a) parallel Density-based Spatial Clustering of Applications with Noise (DBSCAN) for automated outlier detection in time series data and (b) parallel classification using multi-class Support Vector Machines (SVMs) for land cover identification in multi-spectral satellite datasets. In the paper we also compare recent ‘big data stacks’ with traditional HPC techniques.	Markus Götz, Matthias Richerzhagen, Gabriele Cavallaro, Christian Bodenstein, Philipp Glock, Morris Riedel, Jon Atli Benediktsson
322	Completion of a sparse GLIDER database using multi-iterative Self-Organizing Maps (ITCOMP SOM) Abstract: We present a novel approach named ITCOMP SOM that uses iterative self-organizing maps (SOM) to progressively reconstruct missing data in a highly correlated multidimensional dataset. This method was applied to the completion of a complex oceanographic dataset containing glider data from the EYE of the Levantine experiment of the EGO project. ITCOMP SOM provided reconstructed temperature and salinity profiles that are consistent with the physics of the phenomenon they sampled. A cross-validation test validated the approach, giving a root mean square error of 0.042 °C for the reconstruction of the temperature profiles and 0.008 PSU for the simultaneous reconstruction of the salinity profiles.	Anastase - Alexander Charantonis, Pierre Testor, Laurent Mortier, Fabrizio D'Ortenzio, Sylvie Thiria
698	A Feature-first Approach to Clustering for Highlighting Regions of Interest in Scientific Data Abstract: We present a simple clustering algorithm that classifies the points of a dataset by a combination of scalar variables' values and spatial locations. How heavily the spatial locations impact the algorithm is a tunable parameter. With no impact, the algorithm bins the data by calculating a histogram and classifies each point by a bin ID. With full impact, points are bunched together with their neighbors regardless of value. This approach is unsurprisingly very sensitive to this weighting; a sampling of possible values yields a wide variety of classifications. However, we have found that when tuned just right it is indeed possible to extract meaningful features from the resulting clustering. Furthermore, the principles behind our development of this technique are applicable both in tuning the algorithm and in selecting data regions. In this paper we provide the details of the design and implementation and demonstrate the use of the auto-tuned approach to extract interesting regions of real scientific data. Our target application is data derived from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) sensors.	Robert Sisneros
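
Illustrative note on paper 312: the COBIA abstract names the Jensen-Shannon divergence as the distance between the histograms of two local patterns. The sketch below shows that standard measure for two histograms given as raw counts. It is not the authors' implementation; the function name and the example histograms are assumptions made for illustration, and base-2 logarithms are chosen here so that the result lies between 0 and 1.

```python
import math


def jensen_shannon_divergence(hist_p, hist_q):
    """Jensen-Shannon divergence (base 2) between two histograms of counts.

    Hypothetical helper for illustration only. Both histograms must have
    the same number of bins and a positive total count.
    """
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(hist_p), sum(hist_q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have a positive total count")

    # Normalize counts to probability distributions.
    p = [x / total_p for x in hist_p]
    q = [x / total_q for x in hist_q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins with zero mass in `a` contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up example histograms over three land cover classes.
forest_block = [90, 10, 0]
urban_block = [5, 15, 80]
print(jensen_shannon_divergence(forest_block, forest_block))  # 0.0
print(jensen_shannon_divergence([1, 0], [0, 1]))              # 1.0
print(jensen_shannon_divergence(forest_block, urban_block))
```

Identical histograms give 0 and histograms with no shared bins give 1, so the value can serve as a bounded dissimilarity between two blocks. The measure is also symmetric in its two arguments.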