How I solved/hacked/cheated the first 4 Genome Biology bioinformatics challenge 2013



published by  Damian Kao
About Damian Kao
I am a third year PhD student in Aziz Aboobaker's lab at the University of Nottingham. Aziz recently moved to Oxford University, so I've also moved there. Our broad research topic is regeneration. Specifically, we use the freshwater flat worm (Schmidtea mediterranea) as our model organism to study the mechanism underlying regeneration. We tackle the complexities of our research with a combination of molecular experiments and next generation sequencing Read more



Genome biology recently hosted a 5 part bioinformatics challenge event, with the last challenge concluding on the 60th anniversary of the discovery of DNA (Thursday, April 25th. 2013).
I was able to solve the first 4 parts pretty easily. Unfortunately, I gave up after half an hour on the last challenge because I was hungry and wanted to go home.
In this post, I'll show what I did to solve/hack/cheat the first 4 challenges.


Step 1

The first challenge requires the contestant to find the longest 7-mer in a given fasta file. I wrote a quick and dirty python script to solve it:
import sys
from Bio import SeqIO
inFile = open(sys.argv[1],'r')
mer = {}
d in SeqIO.parse(inFile,'fasta'):
for reco rseq = str(record.seq)
len(seq) - 7): m = se
for i in range (q[i:i+7] if not mer.has_key(m):
mer[m] += 1 largest = 0 motiff =
mer[m] = 0 '' for m, count in mer.items(): if count > largest:
nt motiff
largest = count motiff = m pr
i
The answer for this challenge is: TAGCGAC


Step 2

The second challenge required the contestant to find all open read frames in a given genomic sequence, translate all ORFs, sort ORFs by size, then get the 25th amino acids from the top 15 ORFs. A bit more involved, but ultimately pretty trivial if you already have an ORF finding tool.
I used the EMBOSS getorf tool and generated all possible ORFs. Then I used a python script to sort and extract the answer:
import sys
from Bio import SeqIO
inFile = open(sys.argv[1],'r')
seqs = []
in SeqIO.parse(inFile,'fasta'): s
for recor deqs.append(str(record.seq))
n(x), reverse = True) print ''.join(x[24] for x in
seqs.sort(key = lambda x : l
e seqs[:15])
The answer is: THESECRETQFLIFE


Step 3

This step required the contestants to find the most differentially expressed gene given a set of fastq files and gene annotations on a genome. I cheated on this step. I didn't bother with mapping the reads or finding differentially expressed genes. 
In these challenges, a contestant advance to the next round by appending the answer of the previous round to the url: http://genomebiology.com/about/update/DNA60_
There were 90 genes in the annotation file, meaning the answer has to be one of the 90 genes. So I extracted the possible gene names from the file and wrote a script to try all possible urls until I found the correct url:
import sys,os
inFile = open('geneNames.data','r')
for line in inFile:
strip().upper() url = 'http
name = line .://genomebiology.com/about/update/DNA60_' + name
s.system(cmd) outFile = open('te
cmd = 'curl ' + url + " > temp" omp','r') data = outFile.read() outFile.close()
print url print 'FOUND'
if data.find('Sorry! We have looked everywhere') == -1:
sys.exit()
The answer is: CARB
*I would not have done it this way if the possible answers were thousands of genes. I do not codone DDoS-ing the genome biology website.


Step 4

In this step, given a set of 16s RNA metagenomic sequences file, the contestant needs to find the genus of the bacteria that is most abundant. 
Given that this is a genome biology contest, I assumed the data given will be ideal and most likely not real-world, noisy and messy sequences. I simply CD-HIT-EST all the sequences, found the CDHIT cluster with the most members, blasted the cluster representative on NCBI blastn and got my answer.
The answer is: HELICOBACTER