`-.`'.-' `-. .-'. `-. -./\.- .-' -. /_|\ .- `-. `/____\' .-'. `-. -./.-""-.\.- ' `-. /< (()) >\ .-' - .`/__`-..-'__\' .- ,...`-./___|____|___\.-'.,. ,-' ,` . . ', `-, ,-' `-,
Illuminati is a set of tools geared towards automating portions of Illumina’s secondary analysis pipeline. This application is custom tailored for a specific research institution and not architectured for general use. This may change in the future, and others dealing with Illumina software could still find it interesting.
Note: we will refer to this pipeline as the ‘primary’ analysis pipeline, even though Illumina documentation refers to it as the ‘secondary’ analysis pipeline.
Let’s say you have Illuminati installed here (like we do for some reason):
/n/ngs/tools/pilluminati
and you have a cluster with a head node ngs-cluster
You get a notification that Flowcell XYZ is done and ready for primary analysis.
First, we generate a run script that will
ssh ngs-cluster cd /n/ngs/tools/pilluminati ./bin/startup_run.rb XYZ
Where XYZ
is the flowcell ID for the flowcell you are running.
This will create a XYZ.sh
run script in your run folder.
Our run folder is in:
/n/ngs/runs
The startup_run.rb
has not done anything except create this run script. The run script will drive the primary analysis pipeline.
We can look over the run script before executing it
less /n/ngs/runs/XYZ.sh
We see a lot of comments – mostly data output from LIMS. We also see a series of commands that will be executed. Look it over.
Most important is the comments that come after the /n/ngs/tools/pilluminati/bin/config_maker.rb
. These comments represent what will go into our config.txt
and SampleSheet.csv
files. If something looks wrong, we can call this command manually beforehand, tweak the results, and comment it out of our run script.
Once you are satisfied, all you have to do is run the run script…
/n/ngs/runs/XYZ.sh
Make sure to do that on the cluster head node (ngs-cluster
). It will kick off the primary analysis pipeline which includes demultiplexing, alignment, and post processing. Emails will be sent to the pipeline administrator during the run process. When it is done, files should be in their project directories. LIMS should be updated with the QA/QC data. Everything should be beautiful.
General overview of the execution of the analysis pipeline this tool performs:
- Create configuration files
- SampleSheet.csv and config.txt are created using data from custom LIMS system
- Convert raw unaligned reads to Fastq format
- Demultiplex indexed lanes
- Perform alignment to reference genome
- Aggregate and rename unaligned reads
- Remove reads that do not pass filter
- Split custom barcoded lanes
- Analyze unaligned reads using fastqc
- Aggregate and rename export files
- Create Sample_Report for name / lane / sample mapping
- Distribute data to project directories
- Distribute stats and quality control analysis to qcdata directory
- CASAVA 1.8.x – this tool is primarily a wrapper around the CASAVA pipeline software help automate it. This tool is designed to deal specifically with CASAVA 1.8, and would fail completely with an older version. CASAVA binaries should be located at
CASAVA_PATH
(seelib/illuminati/constants.rb
).- Currently using CASAVA 1.8.1
- fastqc – we need fastqc available from the command line. Tested with fastqc version 0.9.2.
- This is run by the
fastqc.pl
script – located in thescripts
directory.
- This is run by the
- image magick – the fastqc.pl script also uses the
convert
tool of Image Magick to process the images created by fastqc. Expected to be on the path. - fastx_barcode_splitter.pl – needs to be availible from command line
- lims_data.pl – the most troublesome external dependency. lims_data.pl should sit in the
SCRIPT_PATH
directory. It is a perl script that connects to our LIMS system and provides flowcell and distribution information back to Illuminati. Future work will be to eliminate or reduce this external dependency.- NOTE – There is a crude work-around for not needing this LIMS connection. See External Data below.
- Custom Directory Structure – By default, Illuminati expects a certain directory structure to be in place to be able to run correctly.
lib/illuminati/config.rb
describes the default path structure, withlib/illuminati/constants.rb
providing some description of how these paths are used. Configuration of the paths and other options used by Illuminati can be done by creating a config file and setting theENV
variableILLUMINATI_CONFIG
to point to that path. Seeassests/example.config.yaml
for a starting point for the config file.
Frequently, we deal with flowcells processed from outsourcing centers and the resulting data returned to us. Each outsourcing center has their own way of doing things which results in their own tweaks and gotchas in the returned data.
Illuminati can handle outsourced data, but does not attempt to deal with all the nuances of the particular facility. It expects that the outsourced data flowcell directory has the Unaligned
and Aligned
sub-directories setup as if they came out of a flowcell run.
If this is satisfied, then the bin/post_run
command can be used to process outsourced orders using the -o
flag.
To use Illuminati, you need a few things in place:
- Flowcell data is in LIMS system and data provided by
lims_data.pl
is correct.
Once things are in order, running Illuminati should be a three step process:
Run the bin/startup_run.rb
script passing in the flowcell id:
$ cd /solexa/bin/illuminati $ ./bin/startup_run.rb 639AXXPY
This should generate an Admin script in /solexa/runs/
. This admin script is named <flowcell_id>.sh
and is meant to be executed to start off the main Illuminati process. So for example, this admin script would be named 639AXXPY.sh
.
Have a look at the admin script to make sure things look good
$ cd /solexa/runs less 639AXXPY.sh
First, you should see output from the LIMS system about the data of each of the flowcell’s lanes. This is the data that will be used to create the SampleSheet.csv
and config.txt
, so its important that its right.
Then comes the SampleSheet.csv
output.
Next, you will see the output that will go into the config.txt
file. Check this over and ensure the output doesn’t include errors about missing genomes or otehr information.
Finally, the admin script contains the shell commands that will kick off CASAVA with the files to be generated. You can see that there is a POST_RUN_COMMAND
in the make command that will automate the next step, alignment. However, if everything goes as planned (like it always does), then you won’t really need to deal with anything but this file.
If the admin script looks good, and you are ready to roll, then go!
$ ./639AXXPY.sh
This will actually create the config.txt
and SampleSheet.csv
files with the output as described in the admin script, then startup CASAVA’s BCL Converter.
Ok, so the config files CASAVA needs are automatically generated and the BCL converter starts up, but what else does Illuminati do?
Well, the BCL converter in CASAVA 1.8 now also automatically does the demultiplexing step as well. So, if your lanes are multiplexed, you get the demultiplexing for free!
After the BCL converter finishes, The POST_RUN_COMMAND
we saw in the make command kicks off the next step in the process: alignment. Technically, we didn’t need to create the config.txt
file until this step, but getting it done initially makes running the alignment process pretty simple.
The alignment portion of the CASAVA pipeline is started by the bin/align_runner.rb
script. Its a little wordy in there, but ultimately we are just calling 3 commands. First, CASAVA’s configureAlignment.pl
script is run without the --make
flag, so it just does a test run. The output from this test run is analyzed by the align_runner
. If there is a problem (specifically, if “ERROR” is found in the output), then Illuminati shuts down and emails you to get things going again. Usually, this means that there is a bug in the config.txt
file. If there isn’t an error, then we run configureAlignment.pl
for reals, and create the Aligned
directory. Then we run make inside that directory to start CASAVA’s ELAND aligner.
The alignment’s make command contains another POST_RUN_COMMAND
which starts up the final part of the primary analysis pipeline: the post run.
The post run functionality is contained in bin/post_run
. Its an elaborate maze of fun. But hopefully the method names at least hint at what is going on. Briefly, this script does the following:
- use
cat
to combine the multiple fastq.gz files that CASAVA makes into one for each lane / sample. - use
zcat
and thebin/fastq_filter.rb
script to filter out reads that don’t pass filter. - split lanes with custom barcodes using
fastx_barcode_splitter.pl
. - distribute fastq.gz files to their project directories.
- run fastqc on the filtered fastq.gz files and distribute the results.
- create Sample_Report.csv to summarize the sample data
- distribute the relevant stats files to the project directories.
- combine export files using
cat
and distribute these files. - distribute relevant qa / qc and stats files to the appropriate qcdata directory.
The post_run
tool has a lot of options if you want to run it manually, for outsourced orders, or to repeat a portion of the run.
Use bin/post_run -h
to get a list of options to use.
You also get emails along the way. Check out the EMAIL_LIST
in lib/illuminati/constants.rb
to see who gets emailed.
A log file is also created for each flowcell in /solexa/runs/log
. This will be used for future dashboard-style awesome-ness.
align_runner.rb
and post_run
also generate log files during their execution. If post_run
completes, it will produce a shell script with all the commands it did, so that theoretically, you could run it again if you wanted.
CASAVA’s make output for the bcl conversion and align steps are captured in their own make.out files.
If you want to avoid using the lims_data.pl
script or other external LIMS systems, you can do so by creating an external_data.yml
file and placing it in the flowcell’s root directory.
Essentially, the external_data.yml
contains all the required information about the samples on a run. As described in ExternalDataBase.rb, these required fields are:
:lane String name of lane (1 - 8), :name Sample name. :genome Code for genome used for lane. Should correlate to folder name in genomes dir, :protocol Should be either "eland_extended" or "eland_pair", :barcode_type Should be :illumina, :custom, or :none :barcode If :barcode_type is not :none, this provides the 6 sequence barcode
These are all under the :samples:
section of the YAML file. It also contains information about how the files are distributed, in the :distributions:
section:
:lane: Lane number :path: Path to distribute the data for this lane
These sections will probably be combined in the future to allow for distributions on a per-sample level.
Example external_data.yml file:
--- :samples: - :lane: "1" :name: 9.5dpc_HoxB1_input :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "2" :name: HoxB1_3FMS_myc_M2Flag_IP :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "3" :name: HoxB1_3FMS_myc_myc_IP :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "4" :name: HoxB1_3FMS_myc_input :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "5" :name: HoxB1_HF_M2Flag_IP :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "6" :name: HoxB1_HF_M2Flag_input :genome: mm9 :protocol: eland_extended :barcode_type: :none :barcode: "" - :lane: "7" :name: Pbx_input :genome: mm9 :protocol: eland_extended :bases: Y* :barcode_type: :none :barcode: "" - :lane: "8" :genome: phiX :name: Phi X :protocol: eland_extended :barcode_type: :none :barcode: "" :distributions: - :lane: 1 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 2 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 3 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 4 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 5 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 6 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX - :lane: 7 :path: /n/analysis/Krumlauf/bdk/Krumlauf-2011-06-23/639KBAAXX
See the specs for ExternalDataYml
for details.
CASAVA 1.8 has a pretty different structure for where it puts data now. Plus, the post_run
script creates some custom directories to do its job. Here’s an overview of the directories relevant to the primary analysis pipeline:
_ FLOWCELL_BASE_DIR | |___ Data | | | |___ Intensities | |___ Unaligned | | | |___ all | | | |___ filter | |___ Aligned | |___ all
The big change is that you don’t go digging in Data/Intensities/GERALD_*
directories anymore for your sequence / alignment files.
Instead, there are two separate directories right off the flowcell’s base directory called Unaligned
and Aligned
that are now used to store that data (technically they are user configurable, but these are the defaults, and thats what we’ll use).
Unaligned
contains your fastq.gz files. Illuminati adds two additional sub-directories – all
and filter
. all
contains the concatenated, unfiltered fastq.gz files with the correct naming structure. filter
contains the same files, but with the reads that didn’t pass filter removed.
Aligned
as you might guess contains the export.txt.gz files. Illuminati again adds the all
directory to store files using the naming convention we want to use.
Some effort has been put in to writing tests to ensure components of Illuminati are functional and don’t break when new features are added. These tests are currently incomplete, but it is a decent start. If you are on the particular machine this software was meant to be run on, you can simply run autotest to check all tests:
$ cd /solexa/bin/illuminati $ autotest
Everything should pass, or something is broken. If you are on another machine, some of the tests can be run using rspec. Some tests will fail due to requirements of file structure and database access. Future work includes making it configurable enough to test on other machines.
If you’d like to hack on Illuminati, start by forking the repo on GitHub:
https://github.com/vlandham/illuminati
To get all of the dependencies, run bundle install first. The best way to get
your changes merged back into core is as follows:
- Clone down your fork
- Create a thoughtfully named topic branch to contain your change
- Hack away
- Add tests and make sure at least the new features pass by running rspec
- If you are adding new functionality, document it in the README
- Do not change the version number, I will do that on my end
- If necessary, rebase your commits into logical chunks, without errors
- Push the branch up to GitHub
- Send a pull request to the vlandham/illuminati project.