ATRX modulates the escape from a telomere crisis

Telomerase activity is the principal telomere maintenance mechanism in human cancers; however, 15% of cancers utilise a recombination-based mechanism referred to as alternative lengthening of telomeres (ALT) that leads to long and heterogeneous telomere length distributions. Loss-of-function mutations in the Alpha Thalassemia/Mental Retardation Syndrome X-Linked (ATRX) gene are frequently found in ALT cancers. Here, we demonstrate that the loss of ATRX, coupled with telomere dysfunction during crisis, is sufficient to initiate activation of the ALT pathway and that it confers replicative immortality in human fibroblasts. Additionally, loss of ATRX combined with a telomere-driven crisis in HCT116 epithelial cancer cells led to the initiation of an ALT-like pathway. In these cells, a rapid and precise telomeric elongation and the induction of C-circles were observed; however, this process was transient and the telomeres ultimately continued to erode such that either the cells died or the escape from crisis was associated with telomerase activation. In both of these instances, telomere sequencing revealed that all alleles, irrespective of whether they were elongated, were enriched in variant repeat types that appeared to be cell-line specific. Thus, our data show that the loss of ATRX combined with telomere dysfunction during crisis induces the ALT pathway in fibroblasts and enables a transient activation of ALT in epithelial cells.

two rotations of TTAGGG correspond to GTTAGG and GGTTAG, in total generating two sets of 6 sequence rotations for the forward and reverse telomere motifs. We refer to each of these disjoint sets of telomere sequence rotations as telmers, which represent the forward and reverse canonical telomere repeat motifs.
Next, each kmer was compared against the telmer sets using Edlib to align sequences, with an edit distance of < 2 used as the threshold to determine a match. If a kmer matched a forward telmer it was given a label of 1, a label of 2 denoted a match with a reverse telmer, whilst 0 denoted no match. Transition probabilities were manually set as follows: The model was then normalised by calling the "bake" method. Sequences were then segmented and classified. The telomere array was identified by a segment label of 1 or 2. Sub-telomere sequences were identified as segments of background sequence extending from the PCR primer to the start of the telomere array. Interstitial insertions were identified as blocks of background sequence found within the telomere array, and end insertions were identified as blocks of background sequence positioned adjacent to the telorette sequence at the end of the telomere array.
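The rotation sets and the kmer labelling step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the script used in the study: the function names are hypothetical, the reverse motif is assumed to be the reverse complement CCCTAA, and a small pure-Python Levenshtein function stands in for the Edlib alignment so that the example has no dependencies. The subsequent segmentation model is not reproduced here.

```python
def rotations(motif):
    """Return the set of all cyclic rotations of a repeat motif."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def edit_distance(a, b):
    """Levenshtein distance (a stand-in for the Edlib alignment in the text)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


FORWARD_TELMER = rotations("TTAGGG")
REVERSE_TELMER = rotations("CCCTAA")  # assumption: reverse-complement motif


def label_kmer(kmer, max_dist=1):
    """Label a kmer as 1 (forward telmer), 2 (reverse telmer) or 0 (no match).

    A match requires an edit distance < 2, i.e. at most max_dist = 1.
    Testing the forward set first is an assumption about tie handling.
    """
    if any(edit_distance(kmer, rot) <= max_dist for rot in FORWARD_TELMER):
        return 1
    if any(edit_distance(kmer, rot) <= max_dist for rot in REVERSE_TELMER):
        return 2
    return 0
```

Because the kmer and each rotation have the same length here, an edit distance of 1 corresponds to a single substitution, so a variant such as TCAGGG is still labelled as forward telomere sequence.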

Cleaning of sequence data
Further filtering steps were performed that aimed to remove potential sequencing or PCR artifacts and other anomalous sequences from the raw sequencing data. Together these aimed to 1) remove unexpected non-sub-telomeric sequences that were occasionally amplified due to low homology with sequencing primers; 2) remove STELA-like products that showed evidence of primer swapping, occurring when the sequenced primer did not match the expected sub-telomere sequence; 3) remove STELA-like products that had no discernible sub-telomere sequence; 4) remove apparent concatemers of STELA products deemed to be an artifact of PCR or PacBio sequencing.
Firstly, the expected sub-telomere sequences associated with each of the PCR primers were extracted from the GRCh38 human reference genome, corresponding to the reference sequence from the primer site to the start of the telomere repeat array. For each input sequence, the sub-telomere segment was then aligned to the expected reference-derived sub-telomere sequence using Edlib with the argument mode = 'HW'. To meet filtering goals 1, 2 and 3 listed above, sequences were discarded if any of the following conditions were met: there was no alignment with any reference sub-telomere; the edit distance of the alignment was > 0.1 × the input sequence length; the total sub-telomere segment length was < 30 bp; or the PCR primer did not match the expected sub-telomere class.
To identify concatemers (goal 4), interstitial insertions were mapped to the GRCh38 reference genome using bwa mem with options '-x pacbio -a' to generate all mappings. An optimal set of alignments was then chosen using dodi align 4 (found online at: https://github.com/kcleal/dodi), supplying the list of target sub-telomere loci in ".bed" format using the --include option. Dodi align outputs a spanning set of alignments consisting of primary and supplementary alignments, but filters out secondary alignments and nested alignments. Supplying a list of target regions with the --include option favours alignments that fall within those regions, and can thus be used to identify a spanning set of alignments that preferentially includes target regions of the genome. If an alignment was identified that overlapped one of the target sub-telomere regions, the sequence was regarded as a concatemer and discarded.
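The discard conditions for filtering goals 1 to 3 can be expressed as a single predicate. The sketch below is illustrative only: the function name, argument names and class labels are hypothetical, and it assumes that the alignment step has already produced an edit distance (or no alignment at all) for each read.

```python
def should_discard(edit_dist, query_len, subtel_len, primer_class, aligned_class,
                   max_error_frac=0.1, min_subtel_len=30):
    """Return True if a read fails any of the sub-telomere filters.

    edit_dist     -- edit distance of the best alignment, or None if the
                     segment aligned to no reference sub-telomere
    query_len     -- length of the aligned input sequence (bp)
    subtel_len    -- total length of the sub-telomere segment (bp)
    primer_class  -- sub-telomere class implied by the PCR primer
    aligned_class -- sub-telomere class of the best-matching reference
    """
    if edit_dist is None:                         # no alignment found
        return True
    if edit_dist > max_error_frac * query_len:    # too many alignment errors
        return True
    if subtel_len < min_subtel_len:               # sub-telomere segment too short
        return True
    if primer_class != aligned_class:             # evidence of primer swapping
        return True
    return False
```

For example, `should_discard(5, 100, 100, "17p", "17p")` keeps the read, whereas the same read with a sub-telomere segment of 25 bp, or with a primer class that differs from the aligned class, is discarded.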

Telomere variant repeat abundance
Each of the target telomere variant repeats was converted into a corresponding telmer set, as described. Of these, 15 were 6 bp in length and one was a 7-mer (TTAAGGG), giving a set of 97 rotation sequences that mapped to 16 telmers. To quantify the abundance of telomere variant repeats, the whole telomere repeat array including any insertions was analysed, from the start of the telomere array to the beginning of the telorette sequence. For the 6 bp telmers, the telomere array sequence was decomposed into 6 bp kmers. If any kmer exactly matched a rotation sequence, the corresponding telmer count was incremented. The same procedure was repeated for the 7-mer telomere variant repeat, noting that the counts for the 7-mer are therefore not independent of the TAAGGG telmer, as every TTAAGGG will also be counted as TAAGGG, but not vice versa.
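The counting step for the 6 bp telmers can be sketched as below. Several points are assumptions made for illustration: the repeat list is a small subset (only TAAGGG is named in the text; the other entries are commonly reported variants), the function names are hypothetical, and the array is decomposed with overlapping windows, which the text does not specify. The 7-mer pass is not shown.

```python
def rotations(motif):
    """Return the set of all cyclic rotations of a repeat motif."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


# Illustrative subset only; the study used 15 hexamer variants plus TTAAGGG.
EXAMPLE_REPEATS = ["TTAGGG", "TCAGGG", "TGAGGG", "TAAGGG"]


def count_telmers(array_seq, repeats=EXAMPLE_REPEATS, k=6):
    """Count exact matches of k-bp windows against each telmer's rotations."""
    # Map every rotation sequence back to the telmer it represents.
    lookup = {rot: rep for rep in repeats for rot in rotations(rep)}
    counts = {rep: 0 for rep in repeats}
    for i in range(len(array_seq) - k + 1):  # overlapping windows (assumption)
        telmer = lookup.get(array_seq[i:i + k])
        if telmer is not None:
            counts[telmer] += 1
    return counts
```

With overlapping windows, a single variant unit contributes to several windows, so the counts measure relative abundance rather than the number of repeat units.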

Allele separation by telomere variant repeat content
Different telomere alleles were first identified by manual inspection of reads. To generate a prototype signature for each of the alleles, a random selection of reads was drawn from the bulk data, corresponding to each of the target alleles, separating a minimum of 20 example reads for each allele.
For each sequence, only the first 100 bp of the telomere repeat array was analysed further. Counts for the 16 telmer classes were then determined, as described. Additionally, the counts of any 6 bp kmers that did not match a telmer were also recorded. Thus, the first 100 bp of the telomere repeat array was converted into a count matrix with 17 columns. For each collection of input sequences, the mean across count matrices was taken, generating a single count matrix or signature for each allele.
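A minimal sketch of how a read might be converted into a count vector and how the example reads might be averaged into an allele signature is given below. The names are hypothetical, overlapping 6 bp windows are assumed, and only 6 bp telmers are handled, so the vector has one column per supplied repeat plus a final column for unmatched kmers (17 columns in the study).

```python
def rotations(motif):
    """Return the set of all cyclic rotations of a repeat motif."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def count_vector(telomere_seq, repeats, window=100, k=6):
    """Count telmer matches in the first `window` bp of a telomere array.

    The last column collects kmers that match no telmer.
    """
    region = telomere_seq[:window]
    lookup = {rot: col for col, rep in enumerate(repeats)
              for rot in rotations(rep)}
    other = len(repeats)                       # index of the "no match" column
    vec = [0] * (len(repeats) + 1)
    for i in range(len(region) - k + 1):       # overlapping windows (assumption)
        vec[lookup.get(region[i:i + k], other)] += 1
    return vec


def mean_signature(count_vectors):
    """Column-wise mean of equally sized count vectors for one allele."""
    if not count_vectors:
        raise ValueError("at least one count vector is required")
    width = len(count_vectors[0])
    if any(len(v) != width for v in count_vectors):
        raise ValueError("all count vectors must have the same length")
    n = len(count_vectors)
    return [sum(col) / n for col in zip(*count_vectors)]
```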
Next, each read from the bulk data was processed in the same way, deconstructing the first 100 bp of the telomere array into a count matrix. To classify reads into different alleles, the cosine similarity between the derived count matrix and each allele signature was calculated, and a threshold of 0.10 was used to identify a match.
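The classification step can be sketched as follows. The text does not state how the 0.10 threshold is applied or how a read matching several signatures is resolved, so this example makes two explicit assumptions: a read is assigned to the signature with the highest cosine similarity, and only if that similarity is at least the threshold. The function names and the toy signatures in the usage note are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_read(read_vec, signatures, threshold=0.10):
    """Return the name of the best-matching allele signature, or None.

    Assumptions: the best-scoring signature wins, and a match requires
    similarity >= threshold.
    """
    best_name, best_sim = None, -1.0
    for name, signature in signatures.items():
        sim = cosine_similarity(read_vec, signature)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None
```

With toy signatures `{"allele_1": [10, 0, 1], "allele_2": [0, 10, 1]}`, a read vector of `[9, 1, 0]` is assigned to `allele_1`, and an all-zero vector is left unclassified.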

Programming and statistics
Scripts were written in Python 3, and statistical testing was carried out using the SciPy package 5.