Creating BLAST app for Cyverse v3

Ken Youens-Clark
2017 protocols.io  
How I created a BLAST app for Cyverse.

1. If you haven't already, install the Cyverse SDK (https://github.com/cyverse/cyverse-sdk) so you have access to "jobs-submit" and such.

2. I created https://github.com/hurwitzlab/muscope-blast to hold the code for the Stampede/Cyverse app. The Agave docs (http://developer.agaveapi.co/) recommend keeping the app code in a 'stampede' directory -- perhaps a different dir for each execution system? I also created https://github.com/hurwitzlab/ohana for the code to build the BLAST dbs and such. (A sketch of the SDK setup and app layout follows the steps below.)
3. I pulled down the HOT data from the Cyverse Data Store at /iplant/home/scope/data/delong/HOT224-238. To ensure I had everything, I ran https://github.com/hurwitzlab/ohana/blob/master/scripts/check-md5.pl6 to check the files against their MD5 sums. I wrote https://github.com/hurwitzlab/ohana/blob/master/scripts/mk-blast.sh to concatenate all the contigs/genes/proteins files into one file per type, then index each with BLAST (see the indexing sketch after the steps).

4. The original Eggnog annotations of the predicted genes were delivered in a format that spread one annotation over two lines, so I wrote https://github.com/hurwitzlab/ohana/blob/master/scripts/merge-genes.pl6 to merge them. As there were 15M annotations, I struggled over how to store and retrieve them. I wanted a database like MySQL or Pg, but it's unlikely I could bring up a daemon-based server on Stampede, so I chose SQLite. The problem there is that I was quite certain it would be too slow to put 15M rows in one table, so I decided to make a db for each sample (103 of them). The script https://github.com/hurwitzlab/ohana/blob/master/scripts/pyloader.py loads the dbs (see the loading sketch after the steps).

5. The app's wrapper script can be any language or executable, but I tend to write these in bash. I often call mine 'run.sh' (https://github.com/hurwitzlab/muscope-blast/blob/master/stampede/run.sh) and base it off a template (https://github.com/kyclark/metagenomics-book/blob/master/bash/basic.sh) that accepts named arguments. This script queries the input file(s) against the BLAST dbs and then uses any resulting hits to predicted genes to query SQLite for annotations, which are written to an additional output file (see the run.sh sketch after the steps).
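For steps 1-2, a minimal sketch of the SDK setup and a typical app layout, assuming the standard Agave CLI scripts that ship with the Cyverse SDK; the file names under stampede/ (app.json, job.json) are typical for Agave apps and are assumptions here, not necessarily the exact contents of muscope-blast.

```bash
# Sketch only: assumes the Agave CLI scripts from the Cyverse SDK are on PATH;
# file names (app.json, job.json) are illustrative, not the exact repo layout.

# Step 1: install the SDK and authenticate (see the SDK README for details).
git clone https://github.com/cyverse/cyverse-sdk
export PATH="$PWD/cyverse-sdk/bin:$PATH"   # assumed install location
tenants-init                               # pick the CyVerse tenant
clients-create -S -N my-client
auth-tokens-create -S

# Step 2: keep the app code in a 'stampede' directory, e.g.:
#   muscope-blast/
#   └── stampede/
#       ├── app.json    # Agave app definition
#       ├── job.json    # example job description
#       └── run.sh      # wrapper script (step 5)

# Register the app and submit a test job.
cd muscope-blast
apps-addupdate -F stampede/app.json
jobs-submit -F stampede/job.json
```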
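For step 3, a minimal sketch of the concatenate-and-index idea behind mk-blast.sh, assuming NCBI BLAST+ is installed and that the downloaded samples sit in per-sample directories with contigs/genes/proteins FASTA files; the actual paths and file names used by mk-blast.sh may differ.

```bash
#!/usr/bin/env bash
# Sketch of the idea behind mk-blast.sh: concatenate per-sample FASTA files
# into one file per type, then index with BLAST+. Paths and layout are assumed.
# (Verifying the downloads, as check-md5.pl6 does, could be e.g.: md5sum -c MD5SUMS)
set -u

SAMPLE_DIR="$WORK/samples"   # assumed location of the HOT samples
OUT_DIR="$WORK/blast"
mkdir -p "$OUT_DIR"

# contigs and genes are nucleotide, proteins are amino acid
declare -A DBTYPE=( [contigs]=nucl [genes]=nucl [proteins]=prot )

for TYPE in contigs genes proteins; do
    CAT_FILE="$OUT_DIR/$TYPE.fa"
    cat "$SAMPLE_DIR"/*/"$TYPE".fa > "$CAT_FILE"

    makeblastdb -in "$CAT_FILE" \
                -dbtype "${DBTYPE[$TYPE]}" \
                -out "$OUT_DIR/$TYPE" \
                -title "$TYPE"
done
```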
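For step 4, the real loader is pyloader.py (Python); this is only a bash sketch of the same one-database-per-sample idea using the sqlite3 CLI, assuming the merged Eggnog annotations are tab-delimited files named after the sample, with gene ID and annotation columns and no header row.

```bash
#!/usr/bin/env bash
# Sketch of the per-sample SQLite idea behind pyloader.py, using the sqlite3
# CLI. Assumes one merged, tab-delimited annotation file per sample named
# <sample>.tab with two columns: gene_id, annotation (no header row).
set -u

ANNOT_DIR="$WORK/annotations"   # assumed location of the merged Eggnog files
DB_DIR="$WORK/sqlite"
mkdir -p "$DB_DIR"

for TAB in "$ANNOT_DIR"/*.tab; do
    SAMPLE=$(basename "$TAB" .tab)
    DB="$DB_DIR/$SAMPLE.db"

    # Load first, then index, so the bulk import stays fast.
    sqlite3 "$DB" <<SQL
CREATE TABLE IF NOT EXISTS annotation (
    gene_id    TEXT,
    annotation TEXT
);
.mode tabs
.import $TAB annotation
CREATE INDEX IF NOT EXISTS gene_idx ON annotation (gene_id);
SQL
done
```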
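For step 5, a bare-bones sketch of what a run.sh-style wrapper might do: accept named arguments, BLAST the query against a db, then look up annotations for any hits in the per-sample SQLite dbs. The argument names, db paths, BLAST program choice, and output layout are assumptions; the real run.sh in muscope-blast (and the basic.sh template it is based on) are more involved.

```bash
#!/usr/bin/env bash
# Sketch of a run.sh-style wrapper: named args, BLAST, then annotation lookup.
# Argument names, db paths, and output layout are assumptions, not the real app.
set -u

QUERY=""
OUT_DIR="blast-out"
BLAST_DB="$WORK/blast/genes"     # assumed location of the "genes" BLAST db
DB_DIR="$WORK/sqlite"            # assumed location of per-sample SQLite dbs

while getopts "q:o:d:" OPT; do
    case "$OPT" in
        q) QUERY="$OPTARG" ;;
        o) OUT_DIR="$OPTARG" ;;
        d) BLAST_DB="$OPTARG" ;;
        *) echo "Usage: $0 -q QUERY [-o OUT_DIR] [-d BLAST_DB]" >&2; exit 1 ;;
    esac
done

[[ -z "$QUERY" ]] && { echo "Missing -q QUERY" >&2; exit 1; }
mkdir -p "$OUT_DIR"

# BLAST the input against the predicted-genes db (tabular output).
HITS="$OUT_DIR/hits.tab"
blastn -query "$QUERY" -db "$BLAST_DB" -outfmt 6 -out "$HITS"

# For each hit subject, pull the Eggnog annotation out of SQLite into a
# second file. Naive lookup: scans every sample db for every gene ID.
ANNOTS="$OUT_DIR/annotations.tab"
> "$ANNOTS"
cut -f2 "$HITS" | sort -u | while read -r GENE_ID; do
    for DB in "$DB_DIR"/*.db; do
        sqlite3 -separator $'\t' "$DB" \
            "SELECT gene_id, annotation FROM annotation WHERE gene_id = '$GENE_ID';" \
            >> "$ANNOTS"
    done
done
```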
doi:10.17504/protocols.io.g27byhn