Recent Articles & Updates in Bioinformatics

  1. Introduction to Bioinformatics

  2. METAINTER: meta-analysis of multiple regression models

  3. Course: Pattern Recognition (4th edition)

  4. ADME SARfari: Comparative Genomics of Drug Metabolising Systems

  5. Balti and Bioinformatics On Air: 21st January 2015

  6. Introduction to Bioinformatics using NGS data

  7. T-Bioinfo Bioinformatics platform - Grosmannia clavigera

  8. Global Bioinformatics Market to Push Past US$9 Billion by 2020

  9. Cross Border Collaborations to Nurture Bioinformatics Research

  10. Bioinformatics: 25 Years of Integrating the Biological Sciences

  11. Introductory Bioinformatics (second course)

  12. A bioinformatics case study with insulin

  13. IM-TORNADO: A Tool for Comparison of 16S Reads

  14. Virophage Genomes Discovered from Yellowstone Lake Metagenomes

  15. Java 8 For Bioinformatics



Here are all the videos and slides from the breakout sessions that took place on the first day of DockerCon Europe. From original Docker use cases in bioinformatics and radio astronomy to more classic use cases in continuous delivery, these videos include a ton of Docker insights, tips and tricks.
  • Evaluating and ranking genome assemblers by Michael Barton
  • The Tale of a Docker-based Continuous Delivery Pipeline by Rafe Colton
  • Continuous Delivery leveraging on Docker CaaS by Adrien Blind
  • Docker in a big company? by Damien Duportal
  • Migrating a large code-base to Docker containers by Doug Johnson and Jonathan Lonzinski
  • Enable Fig to deploy to multiple Docker servers by Willy Kuo
  • Opinionated containers and the future of game servers by Brendan Fosberry
  • Python, Docker and Radio Astronomy by Gijs Molenaar


Discover how bioinformatics is becoming increasingly important to contemporary healthcare research and delivery. Learn about the principles and practices of bioinformatics, the challenges it faces and the problems it can help to solve.

METAINTER: meta-analysis of multiple regression models

Meta-analysis of summary statistics is an essential approach to guarantee the success of genome-wide association studies (GWAS). Application of the fixed or random effects model to single-marker association tests is a standard practice. More complex methods of meta-analysis involving multiple parameters have not been used frequently, a gap that could be explained by the lack of a respective meta-analysis pipeline. Meta-analysis based on combining p-values can be applied to any association test.
  • However, to be powerful, meta-analysis methods for high-dimensional models should incorporate additional information such as study-specific properties of parameter estimates, their effect directions, standard errors and covariance structure.
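
The two single-marker approaches mentioned above can be sketched in a few lines. This is a minimal illustration, not the METAINTER method itself: the function names and the input numbers are hypothetical, and the multi-parameter case (which needs the covariance structure) is not covered.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted (fixed-effect) pooled estimate for one marker."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p

def fisher_combined_p(pvalues):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k df under the null."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # test statistic divided by 2
    # Closed-form chi-square survival function, valid for even df (2k).
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical per-study effect estimates and standard errors for one marker.
beta, se, p_fixed = fixed_effect_meta([0.10, 0.10], [0.05, 0.05])
# Hypothetical per-study p-values for the same marker.
p_fisher = fisher_combined_p([0.05, 0.05])
```

Note that `fisher_combined_p` sees only the p-values, so it cannot tell whether the studies agree on effect direction. That is the kind of information the passage says a powerful method should use.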

Course: Pattern Recognition (4th edition)


Many problems in bioinformatics require classification: prediction of the class to which a certain object (e.g. a gene, protein, cell, patient, …) belongs. This calls for algorithms that can assign the most likely label (discrete output) to an object, given one or more measurements on that object. For most interesting problems, the underlying physics are too complex to explicitly formulate such an algorithm. In such cases, a machine learning approach is taken: an algorithm is constructed, with parameters that are tuned based on an available dataset of training examples. The algorithm should predict the labels for these examples as well as possible, yet still generalize, i.e. perform well on objects not seen before. Some examples of classification problems in bioinformatics are gene finding (sequence in, gene presence out), diagnostics (microarray data in, diagnosis out), data integration (measurements in, probability of interaction out), etc.
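
As a rough illustration of "parameters tuned on training examples", here is a nearest-centroid classifier, one of the simplest pattern recognition methods. It is a sketch chosen for brevity, not a method the course description names; the two-feature measurements and the labels are made up.

```python
def fit_centroids(samples, labels):
    """Training: the tuned parameters are simply the mean vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Prediction: assign the label whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

# Hypothetical training set: two expression measurements per patient.
train_x = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["healthy", "healthy", "disease", "disease"]
centroids = fit_centroids(train_x, train_y)

# An object not seen during training, used to check generalization.
predicted = predict(centroids, [2.9, 3.3])
```

Here `predicted` is the label for an unseen measurement. In practice, generalization is estimated on a held-out test set rather than a single example.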


Next occasion: March 23-27, 2015. VU University, Amsterdam, the Netherlands
Last occasion: 21 Jan – 25 Jan 2013, Amsterdam


  • After having followed this course, a student should have an overview of basic pattern recognition techniques and be able to recognize what method is most applicable to classification problems (s)he encounters in bioinformatics applications.

ADME SARfari: Comparative Genomics of Drug Metabolising Systems

ADME SARfari is a freely available web resource that enables comparative analyses of drug-disposition genes. It does so by integrating a number of publicly available data sources, and then providing specific analysis and predictive tools for drug metabolism researchers. The data includes the interactions of small molecules with ADME (Absorption, Distribution, Metabolism and Excretion) proteins responsible for the metabolism and transport of molecules; available pharmacokinetic (PK) data; protein sequences of ADME-related molecular targets for pre-clinical model species and human; alignments of the orthologues, including information on known SNPs (Single Nucleotide Polymorphisms); and information on the tissue distribution of these proteins. In addition, in-silico models have been developed which enable users to predict which ADME-relevant protein targets a novel compound is likely to interact with.

Balti and Bioinformatics On Air: 21st January 2015

The plan this year for the triumphant Balti and Bioinformatics series is to alternate between virtual, "on-air" meetings (where sadly you will need to provide your own balti curry) and real-life ones, which will mainly be held in Birmingham but may be in other places in England or Wales.

Balti and Bioinformatics On-Air

Introduction to Bioinformatics using NGS data

Course content

In collaboration with BILS, SciLifeLab will organize the course Introduction to Bioinformatics using NGS data. The course will provide an introduction to a wide range of analytical techniques for massively parallel sequencing, including basic Linux commands. We will pair lectures on the theory of analysis algorithms with practical computational exercises demonstrating the use of common tools for analyzing data from each of several common sequencing study designs.

Important dates

  • Application open: December 16
  • Application deadline: January 18
  • Confirmation to accepted students: January 21
  • Responsible teachers: Manfred Grabherr, Bengt Persson
  • If you don’t receive information according to the dates above, contact [email protected]

T-Bioinfo Bioinformatics platform - Grosmannia clavigera

NGS big data analysis on the revolutionary big data analysis platform developed at the Tauber Bioinformatics Institute in Haifa, Israel.

Global Bioinformatics Market to Push Past US$9 Billion by 2020

Transparency Market Research has released a new report of their analysis of the global bioinformatics market, titled ‘Global Bioinformatics Market (By Platforms, Tools and Services and By Applications: Preventive Medicine, Molecular Medicine, Gene Therapy Drug Development and Others) - Industry Analysis, Size, Share, Growth, Trends and Forecast, 2014 - 2020’. The report estimates that the bioinformatics market, valued at US$2.3 billion in 2012, will reach a value of more than US$9 billion by the end of the report’s forecast period, growing at a healthy CAGR throughout. The fragmented global bioinformatics market is segmented by the type of platform, content management tools, services, and geographical distribution.
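
The growth rate implied by the two figures quoted above can be checked with the standard compound annual growth rate formula. This sketch assumes the US$2.3 billion figure is a 2012 value and the US$9 billion figure a 2020 value, giving eight compounding years. The report may compute its own CAGR over a different base year, and "more than US$9 billion" makes the result a lower bound.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Assumed inputs: US$2.3 billion (2012) growing to US$9 billion (2020).
implied = cagr(2.3, 9.0, 2020 - 2012)
print(f"Implied CAGR: {implied:.1%}")
```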

  • According to platform, the bioinformatics market is divided into four categories: sequence manipulation, sequence analysis, sequence alignment, and structural analysis. There are two types of content management tools: general knowledge management tools and specific content management tools. According to the type of service provided, the global bioinformatics market is divided into four categories: data analysis services, database and management services, sequencing services, and others. By geography, the global bioinformatics market is divided into four regional markets: North America, Europe, Asia Pacific, and Rest of World.

Cross Border Collaborations to Nurture Bioinformatics Research

Rising overseas expansions and cross-border collaborations in the bioinformatics field have given new dimensions to the industry. A number of international alliances are bridging bioinformatics research gaps between different nations. Exponential growth in bioinformatics trade and in the sharing of research results has given a massive thrust to the market.
  • In their latest research study, “Global Bioinformatics Market Outlook 2019”, which is spread over 140 pages, RNCOS analysts found that the global bioinformatics market reached around US$3.7 billion in 2013, and they anticipate growth at a CAGR of around 19% during 2015-2019. The report is the outcome of in-depth research and comprehensive analysis of the bioinformatics market, its trends and future opportunities, covering a wide-ranging examination of the bioinformatics space.

Bioinformatics: 25 Years of Integrating the Biological Sciences

The 26th Presidential Faculty Lecture given by Jason Moore, BS, MA, MS, PhD, Third Century Professor, Professor of Genetics and Community and Family Medicine at the Geisel School of Medicine at Dartmouth.

Introductory Bioinformatics (second course)

The course sets out to introduce an extensive range of computing facilities vital for molecular biological research. This will be achieved primarily through "hands on" exercises based around an investigation of a well documented human disease. How information can be obtained both by analysis of raw sequence data and by interrogation of information resources will be demonstrated.
  • The last day of this course will be dedicated to a soft introduction to Next Generation Sequencing (NGS) data analysis.


  • The course is a user course, so the prime objective is learning how to use the various tools. However, where it is useful, the operation of the programs will be discussed as far as is required. Participants will know how to set up the programs in an informed fashion and will fully understand the output generated. On completion of this four-day training, they will also know how to implement this methodology elsewhere, using public domain software and data resources.

A bioinformatics case study with insulin

Blink is a database of protein BLAST search results. Using Blink can save you lots of time because it organizes BLAST results from all the organisms in the non-redundant protein sequence database, but getting to Blink can be tricky because it’s a little hard to find.

Why is this sequence in the NCBI database if it’s misidentified?

  • The presence of the cow insulin sequence from jack beans illustrates an important point about the NCBI database.  It’s an archive.  Sequences get entered that aren’t always right and they can persist.

IM-TORNADO: A Tool for Comparison of 16S Reads

16S rDNA hypervariable tag sequencing has become the de facto method for accessing microbial diversity. Illumina paired-end sequencing, which produces two separate reads for each DNA fragment, has become the platform of choice for this application. However, when the two reads do not overlap, existing computational pipelines analyze data from each read separately and underutilize the information contained in the paired-end reads.

Virophage Genomes Discovered from Yellowstone Lake Metagenomes

Virophages are a unique group of circular double-stranded DNA viruses that are considered parasites of giant DNA viruses, which in turn are known to infect eukaryotic hosts. In this study, the genomes of three novel Yellowstone Lake virophages (YSLVs), namely YSLV5, YSLV6, and YSLV7, were identified from Yellowstone Lake through metagenomic analyses. The relative abundances of these three novel virophages and of the previously identified Yellowstone Lake virophages YSLV1 to -4 were determined in different locations of the lake, revealing that most of the sampled locations, including both mesophilic and thermophilic habitats, had multiple virophage genotypes.


  • This study discovered novel virophages present within the Yellowstone Lake ecosystem using a conserved major capsid protein as a phylogenetic anchor for assembly of sequence reads from Yellowstone Lake metagenomic samples. The three novel virophage genomes (YSLV5 to -7) were completed by identifying specific environmental samples containing these respective virophages, and closing gaps by targeted PCR and sequencing.

Java 8 For Bioinformatics

Benefits of using Java for Bioinformatics

  1. Performance: Very early releases of Java earned it (quite rightly) a horrible reputation for performance. However, the modern Java Virtual Machine is extremely fast. In particular, the HotSpot JVM comes with a Just In Time (JIT) compiler, which compiles byte code to native code on the fly when it detects there may be a performance benefit to doing so. Because a lot of the processing we do in bioinformatic analysis is highly repetitive, our work benefits hugely from this.
  2. Multithreading: Java has a high-level abstraction for multithreading that transparently supports multiple processors. Since Java 5, there have been libraries to support blocking queues and executors, and since Java 7 there are libraries supporting fork-join functionality. These add to the performance benefits noted above and make it relatively easy to exploit parallelization in Java. Java 8 offers some new APIs that make this easier still.
  3. Robustness: The experience in our lab is that, while much of our code is written for “one-off” execution, there are data structures we commonly want to reuse. Java is primarily designed as a language to support reusable, robust components. We have, in common with many other labs, I suspect, developed an in-house set of libraries to manage these, and have developed simple class structures to represent genes, exons, genomes, etc., as well as some high-performing memory maps of FASTA files. We also use Picard, which provides the functionality of samtools in a Java API. GATK also provides similar reusable libraries.
  4. Familiarity: Java has been around for over 18 years now and has been highly influential in the development of more recent languages. It’s virtually impossible to hire a programmer who hasn’t had some exposure to Java, and so it’s relatively easy to bring new lab members up to speed on existing in-house code.
  5. Platform-Independence: Unlike commercial organizations, academic organizations usually allow a large amount of freedom for employees to choose a computational platform on which to work. Since Java runs on all major systems, it makes a good choice for in-house code, as that code is not dependent on the choice of platform for an investigator. Many UI-based informatics tools (including FastQC, IGV, IPA, and many others) take advantage of this. Bioinformatics is a field where most practitioners are, by nature, computer-savvy, so the problems associated with installing and maintaining JVMs are not an issue in this environment.


Cloud computing has been most influential in high-throughput sequence data analysis. With the volume of data multiplying every year, it is a daunting task for small and large laboratories alike to maintain and process the data for these analyses. Hadoop has been successfully used in bioinformatics as it meets the essential needs of biological data analysis. Hadoop consists of two parts: MapReduce and the Hadoop Distributed File System (HDFS). Employing these two parts, Hadoop can solve large data problems by using technology infrastructure more efficiently. Cloud-based analysis compares favorably with local computational clusters in both performance and cost, suggesting that cloud computing technologies might be a viable option to facilitate large-scale translational research in genomic medicine.
  • The traditional method for bioinformatics was to download databases and software and then proceed to analyze the data at hand using the downloaded data with the software installed locally. Bioinformatics cloud utilization can vary depending on the need of the task.
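
To make the MapReduce half of Hadoop more concrete, the sketch below imitates its three phases (map, shuffle, reduce) in a single Python process, counting k-mers in sequencing reads. It does not use any Hadoop API. The function names and toy reads are hypothetical. On a real cluster the map and reduce functions would run on many nodes over data blocks stored in HDFS.

```python
from collections import defaultdict
from itertools import chain

def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

reads = ["ACGTAC", "GTACGT"]  # hypothetical toy reads
mapped = chain.from_iterable(map_kmers(r) for r in reads)
kmer_counts = reduce_counts(shuffle(mapped))
print(kmer_counts)
```

Because each read is mapped independently and each key is reduced independently, both steps can be spread across machines. That independence is what lets this model scale to large sequence datasets.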