Genome Track Analyzer (AnCorr) is a set of tools for the genome-wide study of genomic features and their correlations.
A broad class of genetic and epigenetic tasks can be reduced to the study
of features distributed over the genome (genome tracks).
Our server contains the following tools applicable to genomic track investigations:
- Correlations between point-wise and stretch-wise genomic tracks
- Correlations between profiles (including expression and DNA-protein binding profiles)
- Correlations between point-wise and stretch-wise genomic tracks and expression profiles
- Statistical Kolmogorov-Smirnov and entropy tests for assessing the distribution of genomic tracks over the chromosomes
Methods and software based on analytical criteria enable rapid and efficient processing of the huge amounts of data stored in genome-scale datasets.
We have developed criteria for assessment of genome track inhomogeneity and of correlations between two genome tracks.
The full description of the criteria and their implementations can be found in the following article.
Kravatsky Yuri V., Chechetkin Vladimir R., Tchurikov Nikolai A., Kravatskaya Galina I.
Genome-wide study of correlations between genomic features and their relationship with the regulation of gene expression.
DNA Research 2015, vol.22, issue 1, pp. 109-119, DOI: 10.1093/dnares/dsu044, PMID: 25627242
Attention! A command-line version of Genome Track Analyzer is available on request!
This work is supported by RFBR grants nos. 14-04-01638 A and 17-04-02152 A.
Warning: all methods and criteria implemented at this server are statistics-based.
The number of pairs/points should therefore be large enough; we chose a threshold of 50 pairs/points.
For a lower number of pairs/points, some parts of the results are displayed in red,
denoting an insufficient number of points for statistics.
The following example shows such a correlation table row:
Chromosome | z | Correlation | p | zp | Ordering | p (zp) | Pairs | 1st set | 2nd set |
---|---|---|---|---|---|---|---|---|---|
chrY | -1.252 | | 0.21047 | -1.698 | | 0.08956 | 17 | 333 | 270 |
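The p values in the example row (0.21047 for z = -1.252 and 0.08956 for -1.698) agree, up to rounding of the displayed z values, with two-sided p-values of a standard normal distribution. The following is a minimal sketch under that assumption; the function name `two_sided_p` is hypothetical and is not part of AnCorr.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a z score, assuming a standard normal null distribution."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Values taken from the example table row.
for label, z in (("z", -1.252), ("zp", -1.698)):
    print(f"{label} = {z:+.3f} -> p = {two_sided_p(z):.4f}")
```

Because the z values in the table are rounded to three decimals, the recomputed p-values match the displayed ones only to about three significant digits.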
General layout of methods and criteria
- Names nomenclature
- Chromosome names in both datasets should be in the same nomenclature. AnCorr cannot autoconvert chromosome
names and cannot correlate chromosomes/profiles with names written in different nomenclatures.
- Chromosome names should not contain blank spaces (examples of correct names: chr10, NC_000013.10).
- The order of chromosomes in the input files is unimportant and can differ for first and second datasets.
Note that AnCorr can find both positive (correlation) and negative (anticorrelation) correlations.
- The meanings of these terms for genomic tracks are:
- Correlation means that points/stretches of both tracks tend to be located close to each other
- Anticorrelation means that points/stretches of both tracks tend to lie far apart from each other
AnCorr also calculates ordering information (zp) for the datasets.
A positive, statistically significant value means that the points/stretches in dataset 1 precede the points/stretches in dataset 2.
In the resulting table it is displayed as "1⇒2". A negative, statistically significant value means the reverse situation and is displayed as "2⇒1".
Output for all methods and criteria uses a common color layout.
z values | Meaning |
---|---|
|z|<1.8 | Insignificant, no correlation |
1.8≤|z|<1.96 | Fuzzy correlation (fuzzy in results) |
1.96≤|z|<2.58 | Significant correlation, p-value<0.05 |
|z|≥2.58 | Highly significant correlation (strong in results), p-value<0.01 |
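The thresholds in the table above can be expressed as a small lookup. This is only an illustrative sketch: the function names are hypothetical, and the use of |zp| ≥ 1.96 as the significance cut-off for the ordering label is an assumption (the text only says "statistically significant").

```python
def classify_z(z: float) -> str:
    """Map a z value to the significance classes of the table above."""
    a = abs(z)
    if a < 1.8:
        return "insignificant"
    if a < 1.96:
        return "fuzzy"
    if a < 2.58:
        return "significant"   # p-value < 0.05
    return "strong"            # p-value < 0.01


def ordering_label(zp: float, threshold: float = 1.96) -> str:
    """Return "1⇒2", "2⇒1" or "" for an ordering value zp.

    The threshold of 1.96 is an assumption, not a documented AnCorr setting.
    """
    if zp >= threshold:
        return "1⇒2"   # dataset 1 tends to precede dataset 2
    if zp <= -threshold:
        return "2⇒1"   # dataset 2 tends to precede dataset 1
    return ""


def enough_points(pairs: int, minimum: int = 50) -> bool:
    """True if the number of pairs/points reaches the threshold of 50 stated above."""
    return pairs >= minimum


# The chrY example row: z = -1.252, zp = -1.698, 17 pairs.
print(classify_z(-1.252), repr(ordering_label(-1.698)), enough_points(17))
```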
Correlations between point-wise and stretch-wise genomic tracks
This tool compares genome tracks using an algorithm that is robust to the inhomogeneity of genome tracks. The comparison of tracks is performed with their attribution relative to DNA strands.
- How to compare/calculate correlations between genome tracks using this tool:
- Select the files for Dataset 1 and Dataset 2. AnCorr can upload files only from your local computer; URLs are not supported.
- Set the file formats for both datasets correctly. AnCorr cannot autodetect file formats.
- Set the strand for correlation calculation. By default strand information is ignored
(both strands are taken into consideration).
- For BED/BED6/BED12/GFF formats you can also set the correlation point.
By default it is the midpoint of each stretch/segment; you can change it to
the start or to the end of each stretch/segment.
- You can also select the type of processing. By default AnCorr processes tracks chromosome-wide.
You can change the type to genome-wide calculation. In this mode all
data of each of the two tracks are joined without gaps with respect to the minimum/maximum data coordinates
for each chromosome in both tracks.
- Results are displayed as a color-coded table.
For each chromosome the result is either an empty cell (no significant correlation)
or a colored cell whose color indicates the type (positive/negative) of the correlation
and its strength.
To copy the results to MS Excel/LibreOffice Calc, press the "Select Table" button and then use the Copy/Paste feature
of your operating system. To copy/paste the resulting table together with its colors, use
one of the following browsers: Google Chrome, Internet Explorer, Safari, Opera. Firefox and Pale Moon cannot
copy/paste colored tables, even with plugins; this is a known, very old bug. Colored tables can be pasted
only into Microsoft Office products; LibreOffice Calc does not support pasting colored tables.
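The genome-wide mode described above (joining the data of both tracks without gaps, using the minimum/maximum coordinates per chromosome) can be read as the following coordinate transformation. This is one possible interpretation, not AnCorr's actual code: the function name, the restriction to chromosomes present in both tracks, and the name-sorted chromosome order are all assumptions.

```python
from typing import Dict, List, Tuple

Track = Dict[str, List[int]]  # chromosome name -> list of positions


def join_genome_wide(track1: Track, track2: Track) -> Tuple[List[int], List[int]]:
    """Concatenate per-chromosome data of two tracks into one gapless coordinate axis."""
    joined1: List[int] = []
    joined2: List[int] = []
    offset = 0
    # Assumption: only chromosomes shared by both tracks are joined, in sorted-name order.
    for chrom in sorted(set(track1) & set(track2)):
        both = track1[chrom] + track2[chrom]
        if not both:
            continue
        lo, hi = min(both), max(both)  # data range over BOTH tracks on this chromosome
        joined1.extend(p - lo + offset for p in track1[chrom])
        joined2.extend(p - lo + offset for p in track2[chrom])
        offset += hi - lo + 1          # next chromosome starts right after this range
    return joined1, joined2


t1 = {"chr1": [100, 200], "chr2": [50]}
t2 = {"chr1": [150, 300], "chr2": [10, 60]}
print(join_genome_wide(t1, t2))
```

Shifting both tracks by the same per-chromosome offset keeps the relative distances between their points unchanged, which is what a correlation between tracks depends on.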
Input formats are:
SGR/TXT can be used in one of the following notations:
Text format without strand information:
chr23 49302345
Text format with strand information:
chr23 49302345 +
BED/bedGraph can be used in one of the following notations:
BED without strand information (BED3):
chr23 49302345 49305671
BED with strand information:
chr23 49302345 49305671 -
BED6/BED12 full-featured BED format:
chr7 127474697 127475864 Pos4 0 +
GFF/GFF2/GFF3 full-featured GFF formats:
chr2R CisGenome protein_binding_site 8972000 8975706 196.39 + ID=region_299
By default:
- Strand information is ignored, even if it is present in the input files. In this case all points/stretches are used for
correlation calculation. You can limit the correlation calculation to one of the strands (Forward, 5´→3´, or Reverse, 3´→5´) if that
is meaningful for your data. Some features, e.g. SINEs or CpG islands, cannot be attributed to a strand due to their double-stranded nature.
- Midpoint is the default correlation point for BED/BED6/BED12/GFF files. Changing the correlation point makes sense only for stretches,
and especially for significantly long ones (>50 kbp).
AnCorr doesn't support file format autodetection, so please set the correct file types!
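To make the input notations and the correlation point concrete, here is a minimal parsing sketch for the example lines shown above. It is not AnCorr's parser: the `Feature` type, the format labels passed as `fmt`, whitespace splitting, and the integer midpoint are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: Optional[str]  # "+", "-" or None when no strand is given


def parse_line(line: str, fmt: str) -> Feature:
    """Parse one track line; the format must be given explicitly (no autodetection)."""
    f = line.split()
    if fmt == "TXT":    # chrom position [strand]
        pos = int(f[1])
        return Feature(f[0], pos, pos, f[2] if len(f) > 2 else None)
    if fmt == "BED":    # chrom start end [strand]
        return Feature(f[0], int(f[1]), int(f[2]), f[3] if len(f) > 3 else None)
    if fmt == "BED6":   # chrom start end name score strand
        return Feature(f[0], int(f[1]), int(f[2]), f[5])
    if fmt == "GFF":    # seqid source type start end score strand ...
        return Feature(f[0], int(f[3]), int(f[4]), f[6])
    raise ValueError(f"unsupported format: {fmt}")


def correlation_point(feature: Feature, point: str = "mid") -> int:
    """Reduce a stretch to a single coordinate: "start", "mid" (default) or "end"."""
    if point == "start":
        return feature.start
    if point == "end":
        return feature.end
    return (feature.start + feature.end) // 2  # integer midpoint (assumption)


bed6 = parse_line("chr7 127474697 127475864 Pos4 0 +", "BED6")
print(bed6, correlation_point(bed6))
```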
Correlations between point-wise and stretch-wise genomic tracks and expression profiles
This tool is designed to compare profiles and/or genome tracks by a robust method with their attribution relative to DNA strands.
Using this tool, you can compare/correlate genomic tracks and expression and DNA-protein binding profiles in any combination.
- How to compare/calculate correlations between genome tracks using this tool:
- Set the type of data for each file. It can be "Profile" or "Track". The server displays only the set of options available for the current data type.
- Choose the correct peak caller depending on your data. If you use pre-aligned sequenced profile data, choose the
MACS peak caller. MACS can process your data in ChIP-Seq-specific or in general (--nomodel) mode. If you use preprocessed profile
data in formats like SGR/TXT or non-overlapping BED/WIG, choose the built-in AnCorr peak caller. The further instructions
depend significantly on the chosen peak caller.
- How to correlate profiles processed by MACS peak caller:
- Upload the required files for Dataset 1 and/or Dataset 2 from your computer. AnCorr does not handle URLs.
You can upload up to 3 replicates and 1 control file per profile dataset to be processed by MACS. You can upload
all data for both datasets simultaneously. The data that you upload to the server should be pre-aligned. Recommended
aligner: the latest bowtie 1 build with the options --best -m 1 and output in SAM format (-S option).
Warning: do not try to upload NGS data directly from the sequencing machine to our server; MACS cannot process such data
and we do not have hardware sufficient for this kind of task.
- Supported file formats: AUTO (MACS determines the format of your data itself), BAM, BAMPE, BED, SAM, ELAND, ELANDMULTI,
ELANDEXPORT.
- The default mode of MACS is "CHiP-Seq". It means that MACS will try its best to process your data as pre-aligned ChIP-Seq data.
If you use another kind of pre-aligned NGS data and still want to process it with MACS, set the "General"
caller mode. It means that MACS will be executed with the --nomodel option to bypass the building of the shifting model.
- Set the correct genome size to help MACS process your data. We have pre-defined genome sizes for
the most popular research targets: Homo sapiens, Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae,
Caenorhabditis elegans. Mail us, and we can add the species of your interest too.
- MACS can detect both peaks as segments and peak summits (including subpeak detection). You can choose the point for
correlation calculations. For segments it is the midpoint by default and can be changed to the start or the end.
- How to correlate profiles processed by built-in AnCorr peak caller:
- Upload files for Dataset 1 and/or Dataset 2 from your computer. AnCorr does not handle URLs.
You can upload all data for both datasets simultaneously. Please use GZIPped profile files to save internet bandwidth!
- Set the file formats for profiles correctly. AnCorr cannot autodetect file formats!
"Profiles" can be: TXT, SGR, WIG, bigWig in GZIPped or plain form.
- For this caller you can change the default values of the Clusterization and Threshold parameters.
- How to correlate "Track" data:
- Set the file format for tracks correctly. AnCorr cannot autodetect file formats!
"Tracks" can be: TXT, SGR, BED, BED3, BED6/BED12 in GZIPped or plain form.
- For the "Track" data type you can set the strand for correlation calculation.
By default strand information is ignored (both strands are taken into consideration).
Strand information is ignored for expression and/or DNA binding profiles, where it is meaningless.
- For the "Track" data type, especially for BED/BED6/BED12/GFF formats, you can also set the correlation point.
By default it is the midpoint of each stretch/segment; you can change it to
the start or to the end of each stretch/segment.
- You can also select the type of calculation. By default AnCorr processes tracks chromosome-wide.
You can change the type to genome-wide calculation. In this mode all
data of each of the two tracks are joined without gaps with respect to the minimum/maximum data coordinates
for each chromosome in both tracks.
- Results are displayed as a color-coded table.
For each chromosome the result is either an empty cell (no significant correlation)
or a colored cell whose color indicates the type (positive/negative) of the correlation
and its strength.
To copy the results to MS Excel/LibreOffice Calc, press the "Select Table" button and then use the Copy/Paste feature
of your operating system. To copy/paste the resulting table together with its colors, use
one of the following browsers: Google Chrome, Internet Explorer, Safari, Opera. Firefox and Pale Moon cannot
copy/paste colored tables, even with plugins; this is a known, very old bug. Colored tables can be pasted
only into Microsoft Office products; LibreOffice Calc does not support pasting colored tables.
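The pre-alignment step recommended in the MACS instructions above (bowtie 1 with --best -m 1 and SAM output via -S) could be prepared on your own machine roughly as follows. This sketch only builds the command line; the index name and file names are hypothetical placeholders, and bowtie 1 must be installed separately.

```python
import shlex
from typing import List


def bowtie_command(index: str, reads: str, out_sam: str) -> List[str]:
    """Build a bowtie 1 command using the options recommended in the text."""
    return [
        "bowtie",
        "--best",     # report the best alignment
        "-m", "1",    # suppress reads with more than one reportable alignment
        "-S",         # write output in SAM format
        index,        # basename of a prebuilt bowtie 1 index (placeholder)
        reads,        # input reads file (placeholder)
        out_sam,      # output SAM file to upload to the server (placeholder)
    ]


cmd = bowtie_command("genome_index", "sample.fastq", "sample.sam")
print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run the alignment, the list can be passed to `subprocess.run(cmd, check=True)`.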
Input formats for genomic tracks/stretches are:
SGR/TXT can be used in one of the following notations:
Text format without strand information:
chr23 49302345
Text format with strand information:
chr23 49302345 +
BED/bedGraph can be used in one of the following notations:
BED without strand information (BED3):
chr23 49302345 49305671
BED with strand information:
chr23 49302345 49305671 -
BED6/BED12 full-featured BED format:
chr7 127474697 127475864 Pos4 0 +
GFF/GFF2/GFF3 full-featured GFF formats:
chr2R CisGenome protein_binding_site 8972000 8975706 196.39 + ID=region_299
Input formats for expression and/or DNA-protein binding profiles are:
SGR/TXT The same as for points above; for profiles the strand field is ignored even if it is present.
BED/bedGraph The same as for stretches above; for profiles it should be bedGraph with no strand field.
WIG full-featured Wiggle track format
bigWig binary condensed version of WIG (see the bigWig track format description)
By default for genome tracks:
- Strand information is ignored, even if it is present in the input files. In this case all points/stretches are used for
correlation calculation. Some data, e.g. expression and DNA binding profiles, SINEs or CpG islands, cannot be attributed to a strand due to their
double-stranded nature. You can limit the correlation calculation to one of the strands (Forward, 5´→3´, or Reverse, 3´→5´) if that
is meaningful for your data.
- Midpoint is the default correlation point for BED/BED6/BED12/GFF files. Changing the correlation point makes sense only for stretches,
and especially for significantly long ones (>50 kbp).
By default for expression and/or DNA profiles:
- Strand information is always ignored.
- The peaks are clustered by the center-of-mass rule. The clustered peaks are then used for the assessment of correlations.
The procedure is as follows. Preprocessing filters out insignificant values below a given threshold (by default, the mean
plus 2 standard deviations, σ). As the features under study may be related to cooperative protein binding, the significant values
exceeding the threshold are clustered by the following rule: consecutive points closer than a given distance (500 bp by default) belong to
the same cluster. The values of the threshold and the clustering distance are defined by the user. The site corresponding to a cluster is
determined by the center-of-mass rule.
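The procedure above (thresholding at the mean plus 2σ, clustering consecutive significant points closer than 500 bp, and taking the center of mass) can be sketched as follows. This is a simplified illustration, not the built-in AnCorr peak caller: the function and parameter names are hypothetical, the population standard deviation is an assumption, and the signal values are assumed to be non-negative.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def cluster_peaks(profile: List[Tuple[int, float]],
                  n_sigma: float = 2.0,
                  max_gap: int = 500) -> List[float]:
    """Return center-of-mass positions of clusters of significant profile points.

    profile: (position, value) pairs for one chromosome.
    """
    values = [v for _, v in profile]
    threshold = mean(values) + n_sigma * pstdev(values)  # mean + n_sigma * sigma

    # Keep only significant points, ordered by position.
    significant = sorted((p, v) for p, v in profile if v > threshold)

    clusters: List[List[Tuple[int, float]]] = []
    for pos, val in significant:
        # A point joins the current cluster if it is closer than max_gap to the previous one.
        if clusters and pos - clusters[-1][-1][0] < max_gap:
            clusters[-1].append((pos, val))
        else:
            clusters.append([(pos, val)])

    # Center of mass: value-weighted mean position of each cluster.
    return [sum(p * v for p, v in c) / sum(v for _, v in c) for c in clusters]


# Flat background of 1.0 every 100 bp with three strong points.
demo = [(i * 100, 1.0) for i in range(40)]
demo[10] = (1000, 20.0)
demo[12] = (1200, 30.0)
demo[30] = (3000, 20.0)
print(cluster_peaks(demo))
```

In this toy profile the points at 1000 and 1200 are 200 bp apart and merge into one cluster, while the point at 3000 forms its own cluster.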
This tool transparently supports gzipped files. You can upload .sgr.gz, .bed.gz and .wig.gz files to AnCorr.
AnCorr doesn't support file format autodetection, so please set the correct file types!
Assessment of genome track inhomogeneity
When studying correlations between genome tracks, it is useful to begin with the preliminary assessment of input data.
In our package the (in)homogeneity of the length distribution related to the genome tracks is assessed by two methods.
- The distribution of normalized fragment lengths is compared with the one-fragment de Finetti distribution by
the Kolmogorov-Smirnov criterion.
- The homogeneity of the length distribution can also be assessed with an entropy-like function.
The higher the value of the entropy, the stronger the variations in fragment lengths, i.e.
the higher the inhomogeneity of the input data.
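As an illustration of the first method, the sketch below computes a Kolmogorov-Smirnov statistic for normalized fragment lengths. It assumes that the one-fragment distribution has the textbook form F(x) = 1 - (1 - x)^(n-1) for n fragments whose normalized lengths sum to 1 (the marginal distribution of a single fragment under uniformly random breakage); the exact normalization and significance assessment used by AnCorr are described in the cited article and may differ. The entropy-like function is not reproduced here because its definition is not given in this text.

```python
from typing import List


def ks_statistic_one_fragment(lengths: List[float]) -> float:
    """KS distance between normalized fragment lengths and F(x) = 1 - (1 - x)**(n - 1).

    The reference CDF is an assumption (uniform random breakage into n fragments).
    """
    n = len(lengths)
    if n < 2:
        raise ValueError("at least two fragments are required")
    total = float(sum(lengths))
    xs = sorted(length / total for length in lengths)  # normalized lengths, sum to 1

    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = 1.0 - (1.0 - x) ** (n - 1)
        # Compare the reference CDF with the empirical CDF just before and at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Four equal fragments: far more regular than random breakage would produce.
print(ks_statistic_one_fragment([5, 5, 5, 5]))
```

Only the statistic is computed; since fragment lengths that sum to a fixed total are not independent, standard KS p-value tables should not be applied to it without further justification.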