.. _custom_pipeline_run:

Customize Grape for your own datasets
=====================================

Choosing a project name
-----------------------

Choose your project name wisely. Some examples:

    * CLL
    * ENCODE
    * HBM

Let's say you have a project about Drosophila and you are interested in Selenoproteins, 
you choose the project name 'Dsel'. We are going to use the project name 'MyProject'
here.

Layout
------

There are a number of top level folders containing configuration files, like accessions,
and  profiles. We'll come back to these later, but right now we will create a custom 
folder for our project inside the pipelines folder::

    $ cd grape
    $ cd pipelines
    $ mkdir MyProject
    $ cd MyProject

Configure the buildout.cfg
--------------------------

Write a buildout.cfg file for your project::

    [buildout]
    extends = ../dependencies.cfg
              ../../accessions/MyProject/db.cfg
              ../../profiles/MyProject/db.cfg

Adding a configuration file for the accessions
----------------------------------------------

Add a project folder to the top level accessions folder, and add a db.cfg file:

    cd grape
    cd accessions
    mkdir MyProject
    cd MyProject
    touch db.cfg

We'll cover how to configure this file after taking care of the profile.

Adding a configuration file for the profile
-------------------------------------------

Add a project folder to the top level profiles folder, and add a db.cfg file::

    cd grape
    cd profiles
    mkdir MyProject
    cd MyProject
    touch db.cfg

Configuring the profile
-----------------------

Let's copy over the configuration for the profile of the Test project::

    cd grape
    cd profiles
    cd MyProject
    cp ../Test/db.cfg .

We adapt this file to our case::

    [runs]
    parts = Male
            Female
    
    [pipeline]
    TEMPLATE   = ${buildout:directory}/src/pipeline/template3.0.txt
    PROJECTID  = MyProject
    DB         = MyProject_RNAseqPipeline
    COMMONDB   = MyProject_RNAseqPipelineCommon
    HOST       = pou
    THREADS    = 2
    MAPPER     = GEM
    MISMATCHES = 2
    CLUSTER    = mem_6
    ANNOTATION = /users/yourusername/Drosophilas/Dwill/dwil_all_r1.3.gff
    GENOMESEQ  = /users/yourusername/Genomes/Drosophila_willistoni/genome.fa
    
    [Male]
    recipe = grape.recipe.pipeline
    accession = Male
    
    [Female]
    recipe = grape.recipe.pipeline
    accession = Female

What we have done in this configuration is:

    1. Decide how to call the pipeline runs: Male and Female
    2. Configured the Databases in which to store the results: MyProject_RNAseqPipeline and MyProject_RNAseqPipelineCommon
    3. Given the location of the annotation and the genome
    4. Configured the pipelines to be run on the cluster with 2 threads

Configuring the accessions
--------------------------

Let's copy over the configuration for the profile of the MyProject project::

    cd grape
    cd accessions
    cd MyProject

Edit the db.cfg file we created earlier::

    [Female]
    file_location = /users/myusername/sequencing_drosophilas_saltans/RNAseq/fastq/lane8_W_female_read1_qseq.fastq
                    /users/myusername/sequencing_drosophilas_saltans/RNAseq/fastq/lane8_W_female_read2_qseq.fastq
    mate_id = Female.1
              Female.2
    pair_id = Female
              Female
    label = Female
            Female
    gender = female
    dataType=RNASeq
    cell=CELL
    rnaExtract=UNKNOWN
    localization=CELL
    replicate=1
    lab=CRG
    type=fastq
    readType=2x96
    qualities=phred
    species=Drosophila willistoni
    
    [Male]
    file_location = /users/myusername/sequencing_drosophilas_saltans/RNAseq/fastq/lane8_W_male_read1_qseq.fastq
                    /users/myusername/sequencing_drosophilas_saltans/RNAseq/fastq/lane8_W_male_read2_qseq.fastq
    mate_id = Male.1
              Male.2
    pair_id = Male
              Male
    label = Male
            Male
    gender = male
    dataType=RNASeq
    cell=CELL
    rnaExtract=UNKNOWN
    localization=CELL
    replicate=1
    lab=CRG
    type=fastq
    readType=2x96
    qualities=phred
    species=Drosophila willistoni

Now you have the two accessions defined and the profiles specify how to run the 
two pipelines. Now we need a database for storing the results of the pipeline runs.

Create databases for your project
---------------------------------

You need two databases for the MyProject project:

    1. MyProject_RNAseqPipeline
    2. MyProject_RNAseqPipelineCommon

The permissions you need to ask for are:

    1. rnaseqweb: read
    2. yourusername: read and write

The rnaseqweb user needs read access in order to show the statistical results.

You needs to have read write access.

Then you need to modify your MySQL configuration file: ~/.my.cnf::

    [client]
    host=mysqlserver
    port=3306
    user=yourusername
    password=123

Run the buildout
----------------

Run virtualenv::

    cd grape
    cd pipelines
    cd MyProject
    virtualenv --no-site-packages .

If you get an error, you may have to remove your .pydistutils.cfg file.

    .pydistutils.cfg

Run the bootstrap.py file with the python binary that has been made available by virtualenv in the bin folder::

    cd grape
    cd pipelines
    cd MyProject
    ./bin/python ../../bootstrap.py

Run the buildout::

    cd grape
    cd pipelines
    cd MyProject
    ./bin/buildout

The parts folder now contains everything you need to run the two pipelines::

    cd grape
    cd pipelines
    cd MyProject
    cd parts/
    tree
    .
    |-- Female
    |   |-- GEMIndices -> /users/yourusername/Drosophilas/Dwill/Pipeline/pipelines/MyProject/var/GEMIndices
    |   |-- bin -> /users/yourusername/Drosophilas/Dwill/Pipeline/pipelines/MyProject/var/pipeline/bin
    |   |-- clean.sh
    |   |-- execute.sh
    |   |-- lib -> /users/yourusername/Drosophilas/Dwill/Pipeline/pipelines/MyProject/var/pipeline/lib
    |   |-- read.list.txt
    |   |-- readData
    |   |   |-- lane8_W_female_read1_qseq.fastq -> /users/myusername/sequencing_drosophilas_saltans/RNAseq/fastq/lane8_W_female_read1_qseq.fastq
    |   |   `-- lane8_W_female_read2_qseq.fastq -> /users/myusername/sequencing_drosophilas_saltans/RNAseq/fastq/lane8_W_female_read2_qseq.fastq
    |   |-- results -> /users/yourusername/Drosophilas/Dwill/Pipeline/pipelines/MyProject/var/Female
    |   `-- start.sh
    |-- Male
    |   |-- GEMIndices -> /users/yourusername/Drosophilas/Dwill/Pipeline/pipelines/MyProject/var/GEMIndices
    |   |-- bin -> /users/yourusername/Drosophilas/Dwill/Pipeline/pipelines/MyProject/var/pipeline/bin
    |   |-- clean.sh
    |   |-- execute.sh
    |   |-- lib -> /users/yourusername/Drosophilas/Dwill/Pipeline/pipelines/MyProject/var/pipeline/lib
    |   |-- read.list.txt
    |   |-- readData
    |   |   |-- lane8_W_male_read1_qseq.fastq -> /users/myusername/sequencing_drosophilas_saltans/RNAseq/fastq/lane8_W_male_read1_qseq.fastq
    |   |   `-- lane8_W_male_read2_qseq.fastq -> /users/myusername/sequencing_drosophilas_saltans/RNAseq/fastq/lane8_W_male_read2_qseq.fastq
    |   |-- results -> /users/yourusername/Drosophilas/Dwill/Pipeline/pipelines/MyProject/var/Male
    |   `-- start.sh
    `-- buildout

Run the first pipeline
----------------------

Now it is time to run the first pipeline so that the index files for the genome and
annotation can be generated. Once these files are present we can run all the other 
pipelines in parallel.

Go to the parts folder and run the start script::

    cd grape
    cd pipelines
    cd MyProject
    cd parts/
    cd parts/Female
    ./start.sh

If you get errors, you can store them into an error.log file like this::

    cd grape
    cd pipelines
    cd MyProject
    cd parts/
    cd parts/Female
    ./start.sh 2> error.log

In case everything worked ok, you can run the execute script::

    cd grape
    cd pipelines
    cd MyProject
    cd parts/
    cd parts/Female
    ./execute.sh

Run the other pipeline
----------------------

The second pipeline is run exactly like the first one:

Go to the parts folder and run the start script::

    cd grape
    cd pipelines
    cd MyProject
    cd parts/
    cd parts/Male
    ./start.sh

If you get errors, you can store them into an error.log file like this::

    cd grape
    cd pipelines
    cd MyProject
    cd parts/
    cd parts/Male
    ./start.sh 2> error.log

In case everything worked ok, you can run the execute script::

    cd grape
    cd pipelines
    cd MyProject
    cd parts/
    cd parts/Male
    ./execute.sh