Workshop on Biomedical and Bioinformatics Challenges for Computer Science (BBC) Session 2

Time and Date: 14:30 - 16:10 on 6th June 2016

Room: Rousseau East

Chair: Alfredo Tirado-Ramos

232 Forward Error Correction for DNA Data Storage [abstract]
Abstract: We report a substantial boost in the capacity for storing digital data in synthetic DNA. In principle, synthetic DNA is an ideal medium for archiving digital data over very long periods, because its achievable data density and longevity far outperform today's digital storage media. On the other hand, neither the synthesis, nor the amplification, nor the sequencing of DNA strands can be performed error-free today or in the foreseeable future. To make synthetic DNA usable as a digital data storage medium, forward error correction schemes must therefore be applied. To this end, we have developed an efficient and robust forward error correction scheme adapted to the DNA channel. We based the design of the required DNA channel model on data from a proof of concept conducted in 2012 by a team from Harvard Medical School*. Our forward error correction scheme copes with all error types of today's DNA synthesis, amplification, and sequencing processes, e.g. insertion, deletion, and swap errors. In a recent experiment, we stored and retrieved 22 MByte of digital data in synthetic DNA without error. The residual error probability we observed is already of the same order as that of hard disk drives and can easily be improved further. This demonstrates the feasibility of using synthetic DNA as a long-term digital data storage medium. In a planned next development step we will increase the amount of stored data into the GByte range; the presented forward error correction scheme is already designed for such volumes and far larger ones. (An illustrative sketch of the basic data-to-DNA mapping follows this entry.) *) Church, G. M.; Gao, Y.; Kosuri, S. (2012). "Next-Generation Digital Information Storage in DNA". Science 337 (6102): 1628
Meinolf Blawat, Klaus Gaedke, Ingo Huetter, Xiao-Ming Chen, Brian Turczyk, Samuel Inverso, Benjamin Pruitt, George Church
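A minimal Python sketch of the underlying idea, not the authors' coding scheme: digital data is mapped onto the four DNA bases (two information bits per nucleotide) and each fixed-length block is extended with a simple parity nucleotide. A real DNA storage code such as the one described above must additionally correct insertions and deletions; this toy example only detects substitutions, and the block size and names are arbitrary choices for illustration.

    BASES = "ACGT"                      # two information bits per nucleotide
    BASE_TO_VAL = {b: i for i, b in enumerate(BASES)}

    def bytes_to_dna(data: bytes) -> str:
        """Map each byte to four nucleotides (base-4 digits, most significant first)."""
        out = []
        for byte in data:
            for shift in (6, 4, 2, 0):
                out.append(BASES[(byte >> shift) & 0b11])
        return "".join(out)

    def dna_to_bytes(strand: str) -> bytes:
        """Inverse mapping: every four nucleotides become one byte."""
        data = bytearray()
        for i in range(0, len(strand), 4):
            value = 0
            for base in strand[i:i + 4]:
                value = (value << 2) | BASE_TO_VAL[base]
            data.append(value)
        return bytes(data)

    def add_parity(strand: str, block: int = 16) -> str:
        """Append one parity nucleotide (sum of base values mod 4) to each block,
        so a single substitution ("swap") error inside a block can be detected."""
        protected = []
        for i in range(0, len(strand), block):
            chunk = strand[i:i + block]
            parity = sum(BASE_TO_VAL[b] for b in chunk) % 4
            protected.append(chunk + BASES[parity])
        return "".join(protected)

    if __name__ == "__main__":
        payload = b"DNA storage demo"                  # 16 bytes -> 64 nucleotides
        strand = add_parity(bytes_to_dna(payload))     # 4 blocks of 16 nt + 4 parity nt
        print(strand)
        # Strip the parity nucleotides (every 17th base) to recover the payload.
        data_only = "".join(strand[i:i + 16] for i in range(0, len(strand), 17))
        assert dna_to_bytes(data_only) == payload

Because insertions and deletions shift every downstream nucleotide, practical DNA storage codes also need synchronization-aware decoding, which is precisely the part this toy parity check leaves out.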
435 Computationally characterizing genomic pipelines using high-confident call sets [abstract]
Abstract: In this paper, we describe several available high-confidence call sets that have been developed to test the accuracy of single nucleotide polymorphisms (SNPs) called from next-generation sequencing data. We use these call sets to test and parameterize the GATK best practices pipeline on the high-performance computing cluster at the University of Kentucky. An automated script to run the pipeline can be found at https://github.com/sallyrose0425/GATKBP. This study demonstrates the usefulness of high-confidence call sets in validating and optimizing bioinformatics pipelines, estimates the computational needs of genomic analysis, and provides scripts for an automated GATK best practices pipeline. (An illustrative outline of the main pipeline stages follows this entry.)
Xiaofei Zhang, Sally Ellingson
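The authors' automated pipeline script lives in the linked repository; the Python outline below is only an illustrative sketch of the usual GATK best practices stages (alignment, duplicate marking, base quality recalibration, variant calling), written with GATK4-style command names. All file names, resource files, and thread counts are placeholders, not taken from the paper.

    import subprocess

    REF = "reference.fasta"            # placeholder reference genome
    KNOWN_SITES = "known_sites.vcf"    # placeholder known-variants resource for BQSR
    SAMPLE = "sample"                  # placeholder sample prefix

    def run(cmd):
        """Run one pipeline stage and stop the pipeline if it fails."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Align reads with BWA-MEM, then coordinate-sort with samtools.
    #    (Read-group tags, indexing, and cluster job submission are omitted here.)
    with open(f"{SAMPLE}.sam", "w") as sam_out:
        subprocess.run(
            ["bwa", "mem", "-t", "8", REF,
             f"{SAMPLE}_R1.fastq.gz", f"{SAMPLE}_R2.fastq.gz"],
            stdout=sam_out, check=True)
    run(["samtools", "sort", "-o", f"{SAMPLE}.sorted.bam", f"{SAMPLE}.sam"])

    # 2. Mark PCR/optical duplicates.
    run(["gatk", "MarkDuplicates",
         "-I", f"{SAMPLE}.sorted.bam",
         "-O", f"{SAMPLE}.dedup.bam",
         "-M", f"{SAMPLE}.dup_metrics.txt"])

    # 3. Base quality score recalibration against known variant sites.
    run(["gatk", "BaseRecalibrator",
         "-I", f"{SAMPLE}.dedup.bam", "-R", REF,
         "--known-sites", KNOWN_SITES,
         "-O", f"{SAMPLE}.recal.table"])
    run(["gatk", "ApplyBQSR",
         "-I", f"{SAMPLE}.dedup.bam", "-R", REF,
         "--bqsr-recal-file", f"{SAMPLE}.recal.table",
         "-O", f"{SAMPLE}.recal.bam"])

    # 4. Call SNPs and indels with HaplotypeCaller; the resulting VCF can then be
    #    compared against a high-confidence call set to measure pipeline accuracy.
    run(["gatk", "HaplotypeCaller",
         "-R", REF, "-I", f"{SAMPLE}.recal.bam",
         "-O", f"{SAMPLE}.vcf.gz"])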
390 Denormalize and Delimit: How not to Make Data Extraction for Analysis More Complex than Necessary [abstract]
Abstract: There are many legitimate reasons why standards for formatting biomedical research data are lengthy and complex (Souza, Kush, & Evans, 2007). However, the common scenario of a biostatistician simply needing to import a given dataset into their statistical software is at best under-served by these standards. Statisticians are forced to act as amateur database administrators, pivoting and joining their data into a usable form before they can even begin the work they specialize in. Or worse, they find their choice of statistical tools dictated not by their own experience and skills, but by remote standards bodies or inertial administrative choices. This may limit academic freedom. If the formats in question require the use of one proprietary software package, it also raises concerns about vendor lock-in (DeLano, 2005) and stewardship of public resources. The logistics and transparency of data sharing can be made more tractable by an appreciation of the differences between the structural, semantic, and syntactic levels of data interoperability. The semantic level is a legitimately complex problem. Here we make the case that, for the limited purpose of statistical analysis, a simplifying assumption can be made at the structural level: the needs of a large number of statistical models can often be met with a modified variant of the first normal form, or 1NF (Codd, 1979). Once data are merged into one such table, the syntactic level becomes a solved problem, with many text-based formats available and robustly supported by virtually all statistical software, without the need for any custom or third-party client-side add-ons. We implemented our denormalization approach in DataFinisher, an open-source server-side add-on for i2b2 (Murphy et al., 2009), which we use at our site to enable self-service pulls of de-identified data by researchers. (A minimal sketch of the denormalization step follows this entry.)
Alex Bokov, Laura Manuel, Catherine Cheng, Angela Bos, Alfredo Tirado-Ramos
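A minimal sketch of the denormalization idea in Python with pandas (this is not DataFinisher itself, and the column and concept names are invented): facts stored in an i2b2-style one-row-per-observation layout are pivoted into a single wide, 1NF-like table with one row per patient, then written out as a plain delimited text file that essentially any statistical package can read.

    import pandas as pd

    # Long (entity-attribute-value) form, as a query against an i2b2-style star
    # schema might return it: one observed fact per row.
    facts = pd.DataFrame({
        "patient_num": [1, 1, 2, 2, 3],
        "concept":     ["age", "glucose", "age", "glucose", "age"],
        "value":       [54, 101, 61, 96, 47],
    })

    # Denormalize into a modified-1NF table: one row per patient, one column per
    # concept. Missing observations simply become empty cells.
    wide = facts.pivot_table(index="patient_num", columns="concept", values="value")

    # With the structural level settled, the syntactic level is trivial: any
    # delimited text format is readable by virtually all statistical software.
    wide.to_csv("analysis_ready.csv")
    print(wide)

Real clinical extracts would of course have to deal with repeated measurements, dates, and de-identification before this step; the point is only that the final hand-off to the statistician can be a single flat, delimited table.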