The mammalian epigenome contains thousands of heterochromatin nanodomains (HNDs) marked by di- and trimethylation of histone H3 at lysine 9, which have a typical size of 3-10 nucleosomes. However, the (epi)genetic determinants of their location and boundaries are only partly understood. Here, we compare four HND types in mouse embryonic stem cells, that are defined by histone methylases SUV39H1/2 or GLP, transcription factor ADNP or chromatin remodeller ATRX. Based on a novel chromatin hierarchical lattice framework termed ChromHL, we are able to predict HND maps with singe-nucleotide resolution. We find that HND nucleation can be rationalized by DNA sequence specific protein binding to PAX3/9, ADNP and LINE1 repeats. Depending on type of microdomains, boundaries are determined either by CTCF binding sites or by nucleosome-nucleosome and nucleosome-HP1 interactions. Our new framework allows predicting how patterns of H3K9me2/3 and other chromatin nanodomains are established and changed in processes such as cell differentiation.
This code is written in combination of R and MATLAB. It allows calculations of the epigenetic chromatin landscape for a individual genomic region, as well as bulk calculations for a set of genomic regions defined in the FASTA file.
- MATLAB - https://uk.mathworks.com/ (tested on >R2018a); this should also work on Octave (https://www.gnu.org/software/octave) instead of MATLAB, but it was not tested.
- R - https://cran.r-project.org/ (tested on >v2.12)
- bedtools - https://bedtools.readthedocs.io/ (tested on >v2.24)
- Clone the github repository.
- From a MATLAB command line, run
multiple_run_all.m
.- This will generate one text file per input FASTA file (four, in the case of the demo) containing the predicted HP1 binding, and the probabilities of being in each of the chromatin states (three chromatin states in this example).
- Make a directory
individual_regions
containing the individual FASTA files asgene_region_XXXX.fa
. - Alter the header
-t
injob_submit.sh
for the number of tasks (ceiling(number of regions/1000)). - Change the other submission parameters in the header of the
job_submit.sh
script. - Submit the job script
job_submit.sh
.- The text output files will be generated in a directory
Text_output
ready for processing.
- The text output files will be generated in a directory
- From an R command line, run
plot_chr_states.R
inside the directory containing the text outputs. The basic visualisation will show the chromatin states per lattice unit, with the centre of the region marked with a vertical dotted line.
-
Initialise heterochromatin locations (either by calculating binding affinity or by location of relevant genomic feature, such as repeat)
a. Binding affinity of initiation factors is calculated using a sliding window along the sequence, which is then averaged using a geometric mean across a 501-bp centred window. If the mean affinity in a lattice unit is greather than some fixed threshold, then the lattice unit is initialised as being in heterochromatin state, otherwise it remains as euchromatin.
b. For repeats or other genomic features, the lattice unit where the centre of the feature is located is initialised as being in heterochromatin state
-
Set up CTCF binding sites by calculating binding affinity
a. As for the initiation factors, the binding affinity for CTCF is calculated using a sliding window, without any further averaging; if the affinity within a lattice unit crosses a fixed threshold, the unit is initialised as the CTCF-bound state.
-
Compute states based on initiation sites and CTCF binding sites, based on biophysical parameters such as the far-field statistical weights for each chromatin state, the statistical weight of forming a boundary between two chromatin states, the cooperativity of the associated protein HP1 and the binding constant of HP1 to each chromatin state
More detailed information on the calculation loop is here: https://github.com/TeifLab/ChromHL/blob/master/Program_Flow.md
- The sequences of the region(s) to simulate.
- The weight matrices for all proteins binding to the sequence that are taken into account.
- The location of sequence repeats (as part of a BED-like file), if repeats are being used as initiation sites.
- Concentration (in units of initial HP1 concentration) of HP1 binding.
- Probability of each chromatin state per lattice unit.
- Lattice units are often taken to be approximately equal to the NRL for the region of study, so ~179-189bp. This is set in
CalculatePAX39Affinity.m
andCalculateCTCFAffinity.m
. - The initiation of heterochromatin based on TF-binding is currently in a sliding geometric-average window of 501bp; this can be changed by the user in
CalculatePAX39Affinity.m
. - The code is based on the Fortran version from @epigenereg, converted into MATLAB by @geejaytee, with some scripts for analysis in R by @geejaytee.
Thorn G.J., Clarkson C.T., Rademacher A., Mamayusupova H., Schotta G., Rippe K. and Teif V.B. (2022) DNA sequence-dependent formation of heterochromatin nanodomains. Nature Communications 13, 1861. https://doi.org/10.1038/s41467-022-29360-y
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.