RACS: Rapid Analysis of ChIP-Seq data for contig based genomes
====  --------------------------------------------------------

These tools are a series of scripts developed to facilitate the analysis
of ChIP-Seq data; they have been applied to the organism Tetrahymena thermophila.


=== Content ===

* Overview

* Requirements

* Installation
  - Updating RACS
  - Integrity checks

* How to use the "RACS" pipeline
  - Counting Reads in ORF
    - Normalization of ORF by PF Cluster Scores
  - Determination of InterGenic Regions

* Additional Tools
  - Monitoring RACS runs
  - Downloading datasets
  - Comparing RACS results to MACS

* Performance & Parallelism

* Notes about the use of RAMdisk and storage space as "working space"

* Examples

* Citation & References

================

* Overview

Our computational pipeline, Rapid Analysis of ChIP-Seq data (RACS), can be used
for any genome that has files containing coordinate sequences of interest.
Our pipeline provides a unified tool to perform comprehensive ChIP-Seq data
analysis.
For instance, with RACS users obtain the coordinates of ChIP peaks as well as
information regarding their relative enrichment across the genome, i.e. the number
of significant peaks found within genic versus non-genic regions.
RACS is a versatile computational pipeline suitable for analyzing ChIP-Seq data
generated using any model organism.

RACS generates results for genic and intergenic regions, producing as
final products two CSV (Comma-Separated Values) files.
These are standard ASCII text files, which can be read with any typical
spreadsheet software or with R.


*******************************************************************************

This pipeline is open source under the MIT license, and researchers are welcome
to use it.
We would appreciate it if users let us know about any bugs they find or provide
feedback about their experience using RACS.
Please cite our paper (https://arxiv.org/abs/1905.02771) whenever you use RACS.

*******************************************************************************


* Requirements

The scripts should work on any Unix-like OS (ie. macOS and Linux).
They require the following programs to be installed:
	- SAMtools
	- BWA
	- R

The first two are needed for the determination of reads in open reading frames -ORF- (genic regions),
while SAMtools and R are needed for the determination of intergenic regions -IGR-.


The main scripts are located in the "core" directory.
Additionally we provide other subdirectories containing the following:
	- "hpc": submission scripts for typical job managers and schedulers in HPC environments
	- "tools": comparison tools developed to compare RACS against other software, eg. MACS;
			we also include a monitoring tool to observe the memory/RAMdisk usage while RACS is running
	- "datasets": examples of the data used to test and run our pipeline

- Other dependencies:
Our core scripts do not require any additional R packages; however, the
comparison tools, depending on the format of the data to compare against,
might use a spreadsheet reader package.
For instance, we have included one named "xlsx", which can read
proprietary spreadsheet formats.
To aid the process of identifying and installing the auxiliary R packages
needed by these tools, we included an R script called "setup.R", which can be
run from the command line to check and install these dependencies (more details
are presented in the Examples section).
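
For example, assuming the Rscript interpreter is available in your PATH, these checks can
be launched directly from a terminal (see also the Examples section):

	Rscript PATHtoRACSrepo/tools/setup.R	# check/install the auxiliary R packages used by the comparison tools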




* INSTALLATION

The scripts composing this pipeline are available as open-source tools,
and can be obtained from either of the following repositories:

	https://gitrepos.scinet.utoronto.ca/public/?a=summary&p=RACS

or

	https://bitbucket.org/mjponce/racs

Both repositories are synchronized, so the latest version of RACS is available from both
repositories simultaneously.

To obtain and install a copy of RACS on your computer, open a terminal (you will need git and
an internet connection!) and type:

	git clone https://gitrepos.scinet.utoronto.ca/public/RACS.git

This should clone (download and copy) the latest version of our pipeline onto your computer,
creating a directory called "RACS", containing the following files and sub-directories:

``
RACS
 ├── AUTHORS
 ├── CITATION
 ├── LICENSE
 ├── README
 ├── WARRANTY
 ├── core
 |   └── countReads.sh
 |   └── table.sh
 |   └── comb_tables.sh
 |   └── auxs
 |   |    ├── auxFns.sh
 |   |    └── testFns.sh
 |   └── intergenic
 |   |    ├── det-interGenes.sh
 |   |    ├── interGeneRegions.R
 |   |    └── interGenes.sh
 |   └── defns
 |   |   └── TT_gene.id
 |   |   └── TT_mRNA.id
 |   |   └── OXY_gene.id
 |   └── test
 |        └── lst
 ├── datasets
 |   └── TET_PostProcessing_MAC_Genome_Genic.xlsx
 |   └── TET_PostProcessing_MAC_Genome_Intergenic.xlsx
 |   └── TET_Ibd1_MAC_Genome_Genic.xlsx
 |   └── TET_Ibd1_MAC_Genome_Intergenic.xlsx
 |   └── TET_Ibd1_rDNA_Genic.xlsx
 |   └── TET_Ibd1_rDNA_Intergenic.xlsx
 |   └── OXY_Rpb1_MAC_Genome_Genic.xlsx
 |   └── OXY_Rpb1_MAC_Genome_Intergenic.xlsx
 |   └── get_GFF3-files.sh
 |   └── get_OXYchIPseq-files.sh
 ├── hpc
 │   └── submission.pbs
 |   └── submission.slurm
 |   └── IGR-jobs_parallel.gnu
 |   └── modules
 └── tools
     └── racs_monitor.sh
     └── setup.R
     └── compare.R
''


Updating RACS:
Because the RACS pipeline is under version control (git), you can easily get updates
and the latest additions to the pipeline by simply typing the following command in the
RACS directory:

	git pull

This will bring into your local installation the latest updates to the pipeline.
Similarly, any of the git functionalities should work as expected, eg. git log, git status, etc.



* Integrity checks
For integrity purposes we have included a series of tools that check whether the internal scripts'
dependencies, as well as the external programs, are available on the system, so that RACS
functionality is guaranteed.
Some RACS routines will run these integrity checks whenever appropriate at the moment of execution.
The user can also run them directly, by executing the following command in the shell:

  PATHtoRACSrepo/core/auxs/testFns.sh

at this point a series of checks will be executed and the results shown on the screen, eg.

*** CHECKING RACS pipeline INTERNAL INTEGRITY...
         countReads.sh ... ok!
         table.sh ... ok!
         comb_tables.sh ... ok!
         intergenic/interGenes.sh ... ok!
         intergenic/interGeneRegions.R ... ok!
         intergenic/det-interGenes.sh ... ok!
         auxs/testFns.sh ... ok!
         auxs/auxFns.sh ... ok!
*** CHECKING EXTERNAL DEPENDENCIES with auxiliary software: BWA, SAMTOOLS & RSCRIPT...
         bwa ... found!
         samtools ... found!
         Rscript ... found!

If any of the dependencies is not met, an error message describing the issue will be displayed,
followed by basic instructions on how to use RACS.


-------------------------------------------------------------------------------


* HOW TO USE THE "RACS" PIPELINE:

RACS offers two main functionalities:
- counting reads in ORFs (genic regions)
and
- identifying reads in intergenic regions (using information about the biology of the model organism)


* Counting Reads in ORF:

- "countReads": countReads.sh is the main script in the pipeline, it is a shell
  script in charge of implementing and distributing the actual pipeline to count
the reads. It combines several instances of shell commands and offloads
to the packages SAMtools and BWA.
This script can utilize multi-core parallelization when possible via threads
which can be explicitly input by the user or automatically detected by the script.

At the end of the execution, the script will create a directory named
'ORF_RACS_results-YYYYMMDD-HHMMSS' where 'YYYYMMDD-HHMMSS' indicates the
year-month-day and time when the script was executed.
The resulting files generated when executing the ORF countReads script,
including some intermediate files and the final table, will be located in this
"ORF_RACS_results-..." directory.
Some of these intermediate files located in the "ORF_RACS_results-..." directory,
e.g. the bam and bai files, can be useful to keep for additional analysis, such as
visually inspecting some of them with IGV.

In particular within this results' directory, there will be a file named
"FINAL.table....", which is a plain text file presenting the results for the
reads in the ORF. Specific details about this file are presented below.
This "FINAL.table...." file, which summarizes the results for the reads in the
ORF, is one of the mandatory 'inputs' for the IGR scripts of the RACS pipeline.

Summary of "Inputs" and "Outputs" of the ORF 'countReads.sh" script:
Inputs:
	- INPUT file (fastq.gz)
	- IP file (fastq.gz)
	- fasta assembly file (.fasta)
	- reference genome file (.gff3)

Outputs:
	- several sam, bam and bai files for each of the input (INPUT and IP) files
	- intermediate tables with reads for INPUT and IP files, named 'tableReads{INPUT/IP}.{input/ip-filename}'
	- final table sorted by scaffold and localization within the scaffold,
	named 'FINAL.table.{INPUTfilename}-{IPfilename}'


When executing the scripts, please indicate the full path to the location of
the script.  In this way the script will determine where the repository is
located and find the different tools needed during the execution.
Also, please do not "source" the scripts; they are meant to be run as executable scripts,
ie. execute them in this way,

	PATHtoRACSrepo/core/.../scriptNAME.sh  args	# correct way to run RACS scripts

do *NOT* execute them in this way,

	. PATHtoRACSrepo/core/.../scriptNAME.sh  args	# WRONG: do *NOT* run RACS scripts in this way

NOR in this way neither,

	source PATHtoRACSrepo/core/.../scriptNAME.sh  args	# WRONG: do *NOT* run RACS scripts in this way

Our scripts will also try to detect this situation and prevent the user from doing so.
Due to the modular implementation we designed for RACS, and in order to allow RACS
to locate its main components, the scripts need to be executed as described above
and not sourced.


Integrity & Sanity Checks
In each of the main RACS scripts we have incorporated several integrity checks which
are run at the very beginning of the scripts' execution. These tests include
determining whether the tools used (eg. BWA, SAMTOOLS or R) are installed in the
system and available for being used within the pipeline.
Similarly, by checking the location of the script, the pipeline verifies that the
other components of the pipeline are also in place and can be found so that the pipeline
can run without any problems.
In this way, there is no need to add the different scripts of the pipeline to the
PATH and the scripts are self-aware of where they are placed.
For achieving this, the scripts will need to be called specifying its full location in the computer.

The different scripts in the pipeline will also check that the arguments specified,
in particular when these are supposed to be existing files, actually exist!
We basically tried our best to apply defensive programming across the different scripts
that compose the RACS pipeline, to protect its proper execution and help the user establish
the proper way of using RACS.
We also included a simple testing routine, in the 'auxs' subdir, that can be used to run
some of the integrity tests that the pipeline performs during execution,
as described in the section above.


Arguments to the script:
  1st argument: file with INPUT reads, usually a ".fastq.gz" file, obtained from the sequencer
  2nd argument: file with IP reads, usually a ".fastq.gz" file, obtained from the sequencer
  3rd argument: reference genome file (fasta)
  4th argument: annotation file (gff3)
  5th argument: working space (if possible use RAMdisk --ie. /dev/shm/--, or
	/tmp in a computer with SSD)
  6th argument (optional): number of cores to use for BWA multi-threading.
	If this argument is not specified, the script will attempt to determine
	automatically the number of cores and use that number to run in multi-threading.
	If you want to run the script serially, ie. with just one core, you should
	set this argument to 1.
	If you want the script to automatically detect and use the cores, use ""
	for this argument when you are also specifying the 7th argument.
  7th argument (optional): used for specifying a definitions ("defns") file, when
	processing an arbitrary organism (other than Tetrahymena thermophila) and/or
	targeting specific filters/terms other than genes.

The main output file, FINAL.table."INPUTfile"-"IPfile"
(where "INPUTfile" and "IPfile" refer to the corresponding INPUT and IP files),
is ordered by scaffold localization.
Notice that this file, <<FINAL.table."INPUTfile"-"IPfile">>, is the one that
will be used in the second part of the pipeline when detecting the intergenic regions.
This final file is a CSV text file containing the following columns:

	location	name	description	geneSize	INPUT.reads	IP.reads

where
  "location" is the scaffold location of the gene
  "name" is the name of the gene
  "description" is a description of the gene
  "geneSize" represents the size of the gene
  "INPUT.reads" is the number of reads (calls) for the INPUT file
  "IP.reads" is the number of reads (calls) for the IP file



* Normalization of ORF by PF Cluster Scores
If your data contains the Passing Filter (PF) Cluster scores, you can use an additional shell script
that we have included to normalize your INPUT/IP reads.
This script is called "normalizeORF.sh" and is located in the core subdirectory.
The script requires three mandatory arguments:
	1st argument: "FINAL.table.*"  file generated from the RACS' ORF pipeline
	2nd argument: "PF-INPUT-value"  PF value corresponding to the INPUT file
	3rd argument: "PF-IP-value"  PF value corresponding to the IP file

Please notice that arguments 2 and 3 are the actual numerical values corresponding to the PF clusters
for the INPUT and IP files respectively.


* Determination of InterGenic Regions
The second main functionality of our RACS pipeline is the ability to determine
intergenic regions based on the results from the ORF reads and the biology of the
model organism.
To achieve this second part, we combined a series of shell scripts and R scripts.
As before, when executed the pipeline will perform some sanity and integrity
checks to guarantee that all the necessary pieces are in place.
All the scripts for this part of the pipeline are located in the 'intergenic' subdir.
The workflow driver is a simple shell script, 'det-interGenes.sh', which receives
four command-line arguments and calls an R script, 'interGeneRegions.R', and,
at the end of the execution, a second shell script, 'interGenes.sh'.
The R script, 'interGeneRegions.R', is the actual brain of this part of the pipeline,
in charge of determining the regions between genic boundaries.
It is designed in a modular fashion: a main driver script
(the 'interGeneRegions.R' itself) and a utility file where all the functions used by
the main driver are defined.
The second shell script, 'interGenes.sh', is in charge of computing the number of
reads/calls within the determined intergenic regions, for which it also uses SAMtools.
The main shell script keeps track of the time spent in each stage of the process
and offers a summary of it at the end of the execution.

Arguments to the script:
	arg1: final combined table generated by the ORF part of the RACS pipeline
	arg2: reference genome file (gff3)
	arg3: name of the file where to save the InterGenic Regions
	arg4: text file containing the names of the sorted BAM files generated by the ORF part of the pipeline; eg.
		alnDATASET_INDICATORS_ChIP_SXX_RY_ZZZ.fastq.gz-sorted.bam
		alnDATASET_INDICATORS_Input_SXX_RY_ZZZ.fastq.gz-sorted.bam




* Additional Tools

In addition to the RACS main scripts, we have developed and provide some additional
tools to help the user monitor RACS runs, download publicly available
datasets that can be used to test and reproduce our results, and compare RACS
results against other software.

- Monitoring RACS runs
  We provide a simple monitoring tool to observe the "working space" directory
indicated in the ORF pipeline. In principle this can be located either in RAMdisk
(ie. memory) or on a typical HDD. In particular when the user chooses to use
memory, it can be useful to monitor the memory utilization.
This tool is available as the 'tools/racs_monitor.sh' script; when executed it
will list the specified working directory, the space used within it and the overall
memory utilization.
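
A minimal usage sketch, assuming the monitor takes the working space to watch as its
(hypothetical) first argument; run it in a second terminal while the ORF pipeline is running:

	# watch the working space (here RAMdisk) used by an ongoing RACS run
	PATHtoRACSrepo/tools/racs_monitor.sh  /dev/shm/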

- Downloading datasets
  We provide two scripts that allow the user to download the reference
genomic files for Tetrahymena thermophila and Oxytricha trifallax, as well as
the publicly available ChIP-Seq data for Oxytricha trifallax.
Both scripts are available in the 'datasets' directory:

	get_GFF3-files.sh  will download the corresponding gff3 files for
T.thermophila and O.trifallax.

	get_OXYchIPseq-files.sh  will download two runs of ChIP next-generation sequencing
data for Oxytricha trifallax. Because this data is hosted in NCBI repositories,
the user must have the SRA toolkit installed in order to access and
download this data.
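
Before running the ChIP-Seq download script it may be worth checking that the SRA toolkit is
reachable; a minimal sketch, assuming the script relies on a standard SRA-toolkit utility
such as fastq-dump:

	# verify that the NCBI SRA toolkit is available in the PATH
	command -v fastq-dump > /dev/null || echo "NCBI SRA toolkit not found -- please install it first"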

- Comparison tools
  We provide an R script which helps with the comparison between results obtained
using RACS and MACS2.
Please be aware that this tool is still under active development and for now
is offered as a set of functions designed to be used within R, either
interactively or in batch mode, by invoking the functions defined in this
module.
The main functionalities offered are:
	- calculation of the overlap between results coming from different
		tools, eg. RACS vs MACS2, by defining an 'overlap score'
		to quantify the similarities between the results.
	- visual representations (static and interactive) of the overlap scores.
More concrete examples and uses are presented in the Examples section below.


-------------------------------------------------------------------------------

* Performance & Parallelism

RACS can utilize multi-core (threading) parallelism for its ORF pipeline.
This is achieved through the multi-threaded features of BWA.
A performance and scaling analysis plot is included in the "doc" subdirectory
of this repository -- see  doc/RACS_scaling.pdf  or  doc/RACS_Scaling.html (this
last file is an HTML file that can be opened with any browser and will display a
3D visualization of the scaling analysis with interactive features).

We should also note that in some cases, depending on the size of the data and the
hardware specifications (ie. number of cores) where the pipeline is executed,
results could differ slightly depending on how many cores/threads are used.
This, as disclaimed by the BWA developers, is a characteristic of BWA (see
https://github.com/lh3/bwa/issues/121 ).
In our analyses we found that this can occur when using a number of
threads larger than the actual number of physical cores in the computer (a
typical technique known as hyper-threading).
Nevertheless, this is an issue that users should be aware of, although our analysis
showed that the variations are not statistically significant.
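
To avoid this variability, one can explicitly set the 6th argument of countReads.sh to the
number of physical cores; a quick way to obtain that number on Linux is sketched below
(on macOS, 'sysctl -n hw.physicalcpu' reports the same information):

	# count physical cores (unique core/socket pairs), ignoring hyper-threaded siblings
	lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l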


* Notes about the use of RAMdisk and storage space as "working space"

When using the main script for counting reads, the user can indicate
whether to use a faster 'working space' than traditional spinning
disks (ie. HDD), such as memory (ie. RAMdisk) or a solid state drive (SSD).
In general, utilizing RAMdisk or SSDs results in a speed-up of roughly
10 to 30%, depending on hardware specifications and the size of the dataset
to be analyzed.
The larger the dataset, the more IO operations are needed, hence
larger datasets benefit the most from this.
This is, of course, assuming that the data and the subsequent auxiliary files
created during the analysis fit in ``memory''. If that is not the case,
then depending on the system and how it is configured, the result may be degraded
performance (e.g. some computers will swap --i.e. start using traditional HDD
space--) or even a crash (for instance, it is common in many HPC clusters to not
allow swapping).
Differences in performance between SSD and RAMdisk are almost negligible; again,
depending on hardware specs, this can be at most of the order of a few
percent.
Finally, it should be noted that by using RAMdisk (i.e. memory) as the working
space, users will reduce the overall computational time; however, this will
ultimately depend upon the amount of memory available, as this technique will
increase the utilization of RAM.
As a general estimate, at the moment of running the pipeline, users should
budget roughly one order of magnitude (i.e. 10x) more memory than
the size of the dataset to be processed.
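
A back-of-the-envelope check before choosing RAMdisk as the working space, using the example
file names from the Examples section and the rough 10x factor given above:

	# total size of the input data (INPUT, IP, fasta assembly and gff3 annotation)
	du -ch dataDIR/chipseq_INPUT_file.fastq.gz dataDIR/chipseq_IP_file.fastq.gz \
	       REFfiles/T_thermophila_June2014_assembly.fasta REFfiles/T_thermophila_June2014.gff3
	# compare roughly 10 times that total against the memory currently available
	free -h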


The following plot represents the typical behaviour of the storage used in the
"working space" area during a typical run of RACS.
The vertical axis represents the space used in the 'working space', in units
of the total size of the initial data (INPUT and IP files, plus reference
files --gff3 and fasta files--).
Ie. a value of 8 means 8 times the original size of the initial data.
The horizontal axis is runtime in seconds, and the '*' represents data points
showing the trend in use of working space.


  9 +-+---------+-----------+-----------+----------+-----------+---------+-+
    +           +           +           +          +           +           +
  8 +-+                                            ****                  +-+
    |                                         ******  ************         |
  7 +-+                                     *** **               *       +-+
    |                                 *******                    *         |
  6 +-+                            ******                        *       +-+
    |                              *                             *         |
  5 +-+                            *                             *       +-+
    |                             **                             *         |
    |                             *                              *         |
  4 +-+                          **                              *       +-+
    |                 ************                               *         |
  3 +-+               *                                          *       +-+
    |                **                                          *         |
  2 +-+              *                                           *       +-+
    |              ***                                           *         |
  1 ****************                                             *       +-+
    *           +           +           +          +           + *         +
  0 *-+---------+-----------+-----------+----------+-----------+-**------+-+
    0          500         1000        1500       2000        2500        3000


-------------------------------------------------------------------------------

* EXAMPLES

I) calling reads for genic regions (ORF)
I.i) the following command will run the countReads.sh script using:
	- 'dataDIR/chipseq_INPUT_file.fastq.gz'  as the file with the INPUT reads
	- 'dataDIR/chipseq_IP_file.fastq.gz' as the file with the IP reads
	- 'REFfiles/T_thermophila_June2014_assembly.fasta' as the reference genome for the T.thermophila organism
	- 'REFfiles/T_thermophila_June2014.gff3' as the annotation file for T.thermophila
	- '/tmp/' as the working space
	- notice that 'dataDIR' and 'REFfiles' are the directories where the data is located, and they can be given as absolute or relative paths

	PATHtoRACSrepo/core/countReads.sh   dataDIR/chipseq_INPUT_file.fastq.gz  dataDIR/chipseq_IP_file.fastq.gz  REFfiles/T_thermophila_June2014_assembly.fasta  REFfiles/T_thermophila_June2014.gff3  /tmp/

In this case, because we haven't specified the number of cores to use, the
script will try to automatically detect the number of cores in the computer and
use all of them when running the parallel parts of the pipeline.

I.ii) the following command will run the countReads.sh script using the same
arguments as before, but specifying "/dev/shm" (RAMdisk) instead
of "/tmp" as temporary storage for auxiliary files, and 16 threads in the
parallel regions of the pipeline.
Additionally it uses the system's 'time' command to measure how long the
pipeline takes to run.

	time PATHtoRACSrepo/core/countReads.sh   dataDIR/chipseq_INPUT_file.fastq.gz  dataDIR/chipseq_IP_file.fastq.gz  REFfiles/T_thermophila_June2014_assembly.fasta  REFfiles/T_thermophila_June2014.gff3  /dev/shm/  16


II) Determination of InterGenic Regions (IGR)
II.i) the following command will determine the intergenic regions, using:
	- 'FINAL.table.chipseq_INPUT_file-chipseq_IP_file' as the table determined in the ORF part of the pipeline; this file is located in the output subdirectory ORF_RACS_results-YYYYMMDD-HHMMSS
	- 'REFfiles/T_thermophila_June2014.gff3' as the organism's reference genome annotation file
	- 'interGENs_chipseqExp_INPUTid-IPid.csv' as the name of the table that the IGR part of the pipeline will
		generate, ie. this will be the output of this part of the pipeline
	- 'samples.file' as a text file containing the names of the BAM output files, also generated when
		running the ORF part of RACS; they are usually named as
                alnDATASET_INDICATORS_ChIP_SXX_RY_ZZZ.fastq.gz-sorted.bam
                alnDATASET_INDICATORS_Input_SXX_RY_ZZZ.fastq.gz-sorted.bam

	PATHtoRACSrepo/core/intergenic/det-interGenes.sh  FINAL.table.chipseq_INPUT_file-chipseq_IP_file  REFfiles/T_thermophila_June2014.sorted.gff3  interGENs_chipseqExp_INPUTid-IPid.csv  samples.file


II.ii) we included a submission script in the 'hpc' directory, named "IGR-jobs_parallel.gnu",
which will basically scan the current directory looking for sub-directories named "ORF_RACS_*",
which is how RACS names the output directories of its ORF pipeline.
When the script detects one of these directories, it will look inside it and generate the
corresponding 'samples.file' containing the names of all the aln*fastq.gz-sorted.bam files within
that directory.
When the search for ORF sub-directories is done, it will launch the IGR part of the pipeline in
*parallel* for ALL the ORF results using an 'embarrassingly parallel' approach via GNU Parallel.
Assuming you are located in a directory containing several ORF sub-directories, you just run it
in this way,

	PATHtoRACSrepo/hpc/IGR-jobs_parallel.gnu

In principle the script does not require any command-line arguments, but it contains several environment
variables that should be adjusted for the particular system where it will run.
Ie.
	- RACSloc="location where RACS is installed in your system"
	- REFgff3="location where the genome reference file of the organism is located in your system"

Additionally one could adjust the following variables:
	- PARALLEL_TASKS_PER_NODE="number of parallel jobs to run at the same time", although
		by default the script will try to determine the maximum number of cores available
		on the system
	- ORFdirs="ORF_RACS_results-", the matching pattern for the beginning of the directories generated
		by the RACS ORF pipeline


III) normalization and cut-off with negative control

We provide a shell script that can be used for normalizing and dealing with the
cut-offs when your data includes wild-types or negative controls.
The script can be found in the core directory, and is named "normalizeORF.sh".
It requires three mandatory arguments and accepts an optional fourth:
   - 1st argument: "FINAL.table.*"  file from RACS' ORF pipeline
   - 2nd argument: "PF-INPUT-value"  PF value corresponding to the INPUT file
   - 3rd argument: "PF-IP-value"  PF value corresponding to the IP file
   - 4th argument: 'A' or 'D' (OPTIONAL); when this 4th argument is specified, an additional table is created, ordered with respect to the IP/INPUT ratio in "A"scending or "D"escending order

       PATHtoRACSrepo/core/normalizeORF.sh  FINAL.table.XXXX  14694464  10148171
       PATHtoRACSrepo/core/normalizeORF.sh  FINAL.table.XXXX  14694464  10148171  A

Alternatively one could use a couple of spreadsheets available in the 'datasets'
directory, if the user prefers to proceed in an interactive manner.
See,
	datasets/TET_PostProcessing_MAC_Genome_Genic.xlsx
	datasets/TET_PostProcessing_MAC_Genome_Intergenic.xlsx

We have also included specific examples of these applied to some of the cases
we have analyzed using RACS, ie.
- for Tetrahymena thermophila:
	TET_Ibd1_MAC_Genome_Genic.xlsx
	TET_Ibd1_MAC_Genome_Intergenic.xlsx
	TET_Ibd1_rDNA_Genic.xlsx
	TET_Ibd1_rDNA_Intergenic.xlsx

- for Oxytricha trifallax:
	OXY_Rpb1_MAC_Genome_Genic.xlsx
	OXY_Rpb1_MAC_Genome_Intergenic.xlsx


IV) Processing generic organisms/terms, other than Tetrahymena thermophila

IV.i) Reference table manipulation
	During the analysis and determination of ORFs, the reference file for
the given organism is processed by selecting the appropriate terms and filters
to extract the corresponding entries.
In the subdirectory "core/defns", we present examples of the terms and
filters that have to be specified; eg.
	TT_gene.id  _ for selecting "genes" from Tetrahymena thermophila
	TT_mRNA.id  _ for selecting "mRNA" from Tetrahymena thermophila
	OXY_gene.id  _ for selecting "genes" from Oxytricha trifallax

For instance, the following command will select the mRNA entries for Tetrahymena thermophila,

	PATHtoRACSrepo/core/table.sh  REFfiles/T_thermophila_June2014.gff3  PATHtoRACSrepo/core/defns/TT_mRNA.id

or, the "genes" for the Oxytricha trifallax,

	PATHtoRACSrepo/core/table.sh  REFfiles/Oxytricha_trifallax_022112.gff3  PATHtoRACSrepo/core/defns/OXY_gene.id


IV.ii) ORF determination
	For analyzing generic organisms and terms, we need to propagate the
information passed to the "table.sh" script, as explained in the previous
example, to the ORF analysis,

	PATHtoRACSrepo/core/countReads.sh   dataDIR/chipseq_INPUT_file.fastq.gz  dataDIR/chipseq_IP_file.fastq.gz  REFfiles/T_thermophila_June2014_assembly.fasta  REFfiles/T_thermophila_June2014.gff3  /dev/shm/  16  PATHtoRACSrepo/core/defns/TT_mRNA.id

        PATHtoRACSrepo/core/countReads.sh  dataDIR/SRX483016_1.fastq.gz  dataDIR/SRX483017_1.fastq.gz  REFfiles/Oxytricha_trifallax_022112_assembly.fasta  REFfiles/Oxytricha_trifallax_022112.gff3  /dev/shm/  16  PATHtoRACSrepo/core/defns/OXY_gene.id

where "SRX483017_1.fastq.gz" corresponds to the data for the ChIP-Seq
run/experiment #1, and "SRX483016_1.fast1.gz" corresponds to the data for the
Input of the Oxytricha trifallax ChIP-Seq run/experiment #1.

IV.ii') Automatic core detection with arbitrary organism/term defns.:
	The following example is exactly the same as the previous one, with the
sole difference that the script will try to automatically detect and use all
the cores available in the computer; this is achieved by setting the 6th
argument to "", which is needed when indicating the seventh argument:

	PATHtoRACSrepo/core/countReads.sh  dataDIR/SRX483016_1.fastq.gz  dataDIR/SRX483017_1.fastq.gz  REFfiles/Oxytricha_trifallax_022112_assembly.fasta  REFfiles/Oxytricha_trifallax_022112.gff3  /dev/shm/  ""  PATHtoRACSrepo/core/defns/OXY_gene.id


IV.iii) IGR determination
	Determining IGRs for a generic organism does not require any special
considerations other than using the "FINAL.table.*" file generated by the RACS
ORF tool. Eg.

	PATHtoRACSrepo/core/intergenic/det-interGenes.sh  FINAL.table.SRX483016_1-SRX483017_1  REFfiles/Oxytricha_trifallax_022112.gff3  interGENs_run1_OXY.csv  sample.input

where the file "sample.input" contains the following files:

	alnSRX483016_1.fastq.gz-sorted.bam
	alnSRX483017_1.fastq.gz-sorted.bam
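
As also done in example IV.iv below, this file can be generated directly from within the ORF
results directory, e.g.:

	# collect the sorted BAM files produced by the ORF step into the samples file
	ls -1 *fastq.gz-sorted.bam > sample.input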


IV.iv) Full analysis of Oxytricha trifallax
	We describe, step by step, how to analyze the data for Oxytricha trifallax

	# 1) create a directory where the data will be placed
	mkdir oxy
	cd oxy

	# 2) use the auxiliary tools we provide to download the datasets for Oxytricha trifallax
	# 2.i) first the reference genome
	PATHtoRACSrepo/datasets/get_GFF3-files.sh
	# 2.ii) now the ChIP-seq data, you will need the NCBI SRA toolkit for this step!
	PATHtoRACSrepo/datasets/get_OXYchIPseq-files.sh

	# 3) determination of ORFs with RACS
	PATHtoRACSrepo/core/countReads.sh  SRX483016_1.fastq.gz  SRX483017_1.fastq.gz  Oxytricha_trifallax_022112_assembly.fasta  Oxytricha_trifallax_022112.gff3  /dev/shm/  16  PATHtoRACSrepo/core/defns/OXY_gene.id

	# 4) determination of the IGRs
	# 4.i) first, cd into the ORF_... directory generated by the previous step where the results from the ORF part of the RACS pipeline were placed
	cd ORF_...
	# 4.ii) create the "sample.input" file containing the names of the *.fastq.gz files to be processed, eg.
	ls -1 *fastq.gz-sorted.bam > sample.input
	# 4.iii) run the IGR determination part of the RACS pipeline
	PATHtoRACSrepo/core/intergenic/det-interGenes.sh FINAL.table.SRX483016_1-SRX483017_1 Oxytricha_trifallax_022112.gff3 interGENs_run1_OXY.csv  sample.input



V) Comparison to MACS and other tools
	In order to compare RACS results vs MACS, we provide an R script that
does this, generating some visual comparisons.
	Please notice that this tool is still under development.
	V.1) First start by checking that the necessary packages are installed,
		Rscript PATHtoRACSrepo/tools/setup.R

	V.2) Using the tool from an R session:
	V.2-i) launch R and source the "compare.R" file

		source("PATHtoRACSrepo/tools/compare.R")


	V.2-ii) Now you have a list of functions and datasets loaded, ready to
be used, including some test cases:

		ls()

 [1] "comparison"     "comparison.ALL" "ibdX"           "load.data"
 [5] "loadCheckPkg"   "macsIBD1"       "macsIBD2"       "overlap"
 [9] "sample1"        "sample2"        "sampleIBD1"     "sampleIBD2"
[13] "tests"          "vizDiffs"

For instance, 'sampleIBD1/sampleIBD2' are ORF results generated with RACS, while
'macsIBD1/macsIBD2' are the results obtained with MACS2.


	V.2-iii) Now you could use the ibdX() function to compare each of them,
eg.
		ibdX(sampleIBD1,macsIBD1, "comparison_ibd1.pdf")

Several static PDF plots will be generated, as well as interactive plots
(generated using the plotly library) which will be stored in HTML files.

	V.2-iv) Alternatively, one could inspect the results from RACS & MACS2
in more detail; for instance, let's start by computing the overlap
scores between the results from both programs:

        overlap.RACSvsMACS <- comparison(sampleIBD1,macsIBD1, DBG=FALSE)

this will return a data frame containing the scaffolds, the coordinates for the
beginning and end of the regions determined by RACS and MACS2 respectively, and an
overlap score; ie.

        scaffold        x1      x2      y1      y2      overlap

The 'overlap score' is computed in the following way:

        +1,	when RACS detects a region and MACS2 does not
        -1,	when MACS2 detects a region and RACS does not
        ratio between region sizes,	when there is an overlap between the regions determined by RACS and MACS2

Utilizing these overlap scores, one can then visualize this comparison using
the following function:

        vizDiffs(overlap.RACSvsMACS)

for generating a series of comparative static plots; or

        vizDiffs(overlap.RACSvsMACS, iplots=TRUE)

for generating additional interactive visualizations that will be saved in HTML
documents that can be opened in a web browser for interactive exploration.


-------------------------------------------------------------------------------


* CITATION & REFERENCES

This pipeline is open source under the MIT license, and researchers are welcome
to use it.
We would appreciate it if users let us know about any bugs they find or provide
feedback about their experience using RACS.

Please cite the following paper whenever you are using RACS:

- "RACS: Rapid Analysis of ChIP-Seq data for contig based genomes",
  Ponce et al., BMC under review -- https://arxiv.org/abs/1905.02771


Publications where RACS has been used:
- "The Med31 conserved component of the divergent Mediator complex in
  Tetrahymena thermophila participates in developmental regulation",
  Garg et al., Current Biology (2019) Vol 29(13).
  https://doi.org/10.1016/j.cub.2019.06.052

- "The bromodomain-containing protein Ibd1 links multiple chromatin-related
  protein complexes to highly expressed genes in Tetrahymena thermophila",
  Saettone et al., Epigenetics & Chromatin (2018) Vol 11(10).
  https://doi.org/10.1186/s13072-018-0180-6


-------------------------------------------------------------------------------