ICCS 2019 Main Track (MT) Session 7

Time and Date: 10:15 - 11:55 on 14th June 2019

Room: 1.5

Chair: To be announced

247 Representation Learning of Taxonomies for Taxonomy Matching [abstract]
Abstract: Taxonomy matching aims to discover category alignments between two taxonomies, an important operation in knowledge sharing that benefits many applications. Existing methods for taxonomy matching mostly depend on lexical string features and domain-specific information. In this paper, we consider representation learning of taxonomies, which projects categories and relationships into low-dimensional vector spaces. We propose a method that takes advantage of category hierarchies and siblings, exploiting a low-dimensional semantic space to model category relations via translation operations in that space. We cast taxonomy matching as a maximum-weight matching problem on bipartite graphs, which can be solved in polynomial time to generate globally optimal category alignments for two taxonomies. Experimental results on OAEI benchmark datasets show that our method significantly outperforms the baseline methods in taxonomy matching.
Hailun Lin
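The final step described in the abstract casts alignment as maximum-weight matching on a bipartite graph of categories. The sketch below illustrates that formulation on a toy similarity matrix; exhaustive search over permutations stands in for a polynomial-time algorithm such as the Hungarian method, and the similarity values (which the paper would derive from learned category embeddings) are invented for illustration.

```python
from itertools import permutations

def max_weight_matching(sim):
    """Exhaustive maximum-weight bipartite matching for a small square
    similarity matrix sim[i][j] between categories of taxonomy A (rows)
    and taxonomy B (columns).  Returns (best_score, list_of_pairs)."""
    n = len(sim)
    best_score, best_pairs = float("-inf"), None
    for perm in permutations(range(n)):  # every one-to-one alignment
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(perm))
    return best_score, best_pairs

# Hypothetical cosine similarities between category embeddings.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
]
score, pairs = max_weight_matching(sim)
print(pairs)  # → [(0, 0), (1, 1), (2, 2)]
```

For real taxonomies the brute-force search is replaced by a polynomial-time assignment solver, which is what makes the global alignment tractable.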
311 Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort [abstract]
Abstract: Scientific Named Entity Referent Extraction is often more complicated than traditional Named Entity Recognition (NER). For example, in polymer science, chemical structure may be encoded in a variety of nonstandard naming conventions, and authors may refer to polymers with conventional names, commonly used names, labels (in lieu of longer names), synonyms, and acronyms. As a result, accurate scientific NER methods are often based on task-specific rules, which are difficult to develop and maintain, and are not easily generalized to other tasks and fields. Machine learning models require substantial expert-annotated data for training. Here we propose polyNER: a semi-automated system for efficient identification of scientific entities in text. PolyNER applies word embedding models to generate entity-rich corpora for productive expert labeling, and then uses the resulting labeled data to bootstrap a context-based word vector classifier. Evaluation on materials science publications shows that polyNER’s combination of automated analysis with minimal expert input enables noticeably improved precision or recall relative to a state-of-the-art chemical entity extraction system. This remarkable result highlights the potential for human-computer partnership for constructing domain-specific scientific NER systems.
Roselyne Tchoua, Aswathy Ajith, Zhi Hong, Logan Ward, Kyle Chard, Alexander Belikov, Debra Audus, Shrayesh Patel, Juan de Pablo and Ian Foster
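The first stage of the pipeline described above, using word embeddings to surface entity-rich candidates for expert labeling, can be sketched as a nearest-neighbor query over word vectors. The vectors, vocabulary, and `candidates` helper below are all hypothetical; a real system would learn the embeddings (e.g. with word2vec) over a polymer-science corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy word vectors (made up for illustration).
vectors = {
    "polystyrene":  [0.9, 0.1, 0.0],
    "PMMA":         [0.8, 0.2, 0.1],
    "polyethylene": [0.85, 0.15, 0.05],
    "reaction":     [0.1, 0.9, 0.2],
    "temperature":  [0.0, 0.8, 0.6],
}

def candidates(seed, k=2):
    """Rank vocabulary by similarity to a seed entity; the top k become
    candidates handed to an expert for quick yes/no labeling."""
    sims = [(w, cosine(vectors[seed], v))
            for w, v in vectors.items() if w != seed]
    return [w for w, _ in sorted(sims, key=lambda x: -x[1])[:k]]

print(candidates("polystyrene"))  # → ['polyethylene', 'PMMA']
```

Words clustered near known polymer names are far more likely to be entities than random tokens, which is why this filtering makes the subsequent expert labeling productive.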
366 Evaluating the benefits of Key-Value databases for scientific applications [abstract]
Abstract: The convergence of Big Data applications with High-Performance Computing requires new methodologies to store, manage and process large amounts of information. Traditional storage solutions are unable to scale, which results in complex coding strategies. For example, the brain atlas of the Human Brain Project faces the challenge of processing large amounts of high-resolution brain images. Given the computing needs, we study the effects of replacing a traditional storage system with a distributed key-value database on a cell segmentation application. The original code uses HDF5 files on GPFS through a complex interface and imposes synchronizations. On the other hand, by using Apache Cassandra or ScyllaDB through Hecuba, the application code is greatly simplified. Also, thanks to the key-value data model, the number of synchronizations is reduced and the time dedicated to I/O scales as the number of nodes increases.
Pol Santamaria, Lena Oden, Yolanda Becerra, Eloy Gil, Raül Sirvent, Philipp Glock and Jordi Torres
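The benefit the abstract attributes to the key-value data model is that workers write independent keys rather than coordinating access to one shared file. The sketch below illustrates that access pattern only; the `put_block`/`get_block` helpers and composite key layout are invented for illustration and are not Hecuba's actual API (with Hecuba, a similar pattern maps transparently to Cassandra tables).

```python
# A plain dict stands in for the distributed key-value store.
store = {}

def put_block(subject, z, block):
    """Write one segmented image block under a composite key."""
    store[(subject, z)] = block

def get_block(subject, z):
    """Read a block back by its key."""
    return store[(subject, z)]

# Each worker writes its own (subject, slice) key, so no global
# synchronization is needed, unlike appending to a shared HDF5 file.
for z in range(4):
    put_block("subject-01", z, [z] * 3)  # placeholder segmentation output

print(get_block("subject-01", 2))  # → [2, 2, 2]
```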
427 Scaling the Training of Recurrent Neural Networks on Sunway TaihuLight Supercomputer [abstract]
Abstract: Recurrent neural network (RNN) models require longer training time as datasets grow and parameter counts increase. Distributed training with a large mini-batch size is a potential solution to accelerate the whole training process. This paper proposes a framework for large-scale training of RNN/LSTM models on the Sunway TaihuLight (SW) supercomputer. We apply a series of architecture-oriented optimizations to the memory-intensive kernels in RNN models to improve computing performance. A lazy communication scheme with an improved communication implementation, together with a distributed training and testing scheme, is proposed to achieve high scalability for distributed training. Furthermore, we explore training algorithms with large mini-batch sizes, in order to improve convergence speed without losing accuracy. The framework supports training RNN models with large numbers of parameters on up to 800 training nodes. The evaluation results show that, compared to training on a single computing node, training with the proposed framework can achieve a 100-fold convergence rate with a mini-batch size of 8,000.
Ouyi Li, Wenlai Zhao, Xuancheng Huang, Yushu Chen, Lin Gan, Hongkun Yu, Jiacheng Zhang, Yang Liu, Haohuan Fu and Guangwen Yang
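The lazy communication scheme mentioned in the abstract amortizes gradient exchange over several local steps. The toy sketch below shows the general idea only, not the paper's actual implementation: each simulated node accumulates gradients locally and a single averaging step (standing in for an all-reduce) replaces several per-step exchanges. All names and numbers are illustrative.

```python
def train(local_grads_per_node, comm_interval=2, lr=1.0):
    """Simulate lazy gradient communication: nodes accumulate local
    gradients and average them only every `comm_interval` steps."""
    n_nodes = len(local_grads_per_node)
    n_steps = len(local_grads_per_node[0])
    weight = 0.0
    acc = [0.0] * n_nodes  # per-node gradient accumulators
    for step in range(n_steps):
        for node in range(n_nodes):
            acc[node] += local_grads_per_node[node][step]
        if (step + 1) % comm_interval == 0:
            # One averaging exchange replaces comm_interval exchanges.
            weight -= lr * sum(acc) / n_nodes
            acc = [0.0] * n_nodes
    return weight

grads = [[0.1, 0.2, 0.3, 0.4],   # node 0's gradient per step (toy values)
         [0.3, 0.2, 0.1, 0.0]]   # node 1's
print(train(grads))
```

Deferring communication trades a small amount of gradient staleness for far fewer synchronization points, which is what lets the scheme scale to hundreds of nodes.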