The author has declared that no competing interests exist.
FAM230C, a long intergenic non-coding RNA (lincRNA) gene in human chromosome 13 (chr13), is a member of a group of lincRNA genes termed family with sequence similarity 230. An analysis using bioinformatics search tools and alignment programs was undertaken to determine properties of FAM230C and its related genes. Results reveal that a DNA translocation element, the Translocation Breakpoint Type A (TBTA) sequence, which consists of satellite DNA, Alu elements, and AT-rich sequences, is embedded in the FAM230C gene. Eight lincRNA genes related to FAM230C also carry TBTA sequences. These genes were formed from a large segment of the 3’ half of the FAM230C sequence duplicated in chr22, and they reside specifically in regions of low copy repeats (LCR22s), in or close to the 22q11.2 region. 22q11.2 is a chromosomal segment that undergoes a high rate of DNA translocation and is prone to genetic deletions. FAM230C-related genes present in other chromosomes do not carry the TBTA motif and were formed from the 5’ half region of the FAM230C sequence. These findings identify a high specificity in lincRNA gene formation by gene sequence duplication in different chromosomes.
Long non-coding RNA (lncRNA) genes make up a major portion of the human genome [
In this study, we analyzed a family of lncRNA genes related to the long intergenic non-coding RNA (lincRNA) gene FAM230C and addressed their composition and origins. Although the functions of RNA transcripts from this family are not known, we discovered that, at the DNA level, many of the genes carry a prominent DNA translocation breakpoint motif and that these genes were formed and are concentrated in a fragile chromosomal region of human chromosome 22 (chr22), 22q11.2.
22q11.2 is a region that displays a high frequency of chromosomal translocation [
Here we show that the lincRNA gene FAM230C, which is on chromosome 13 (chr13), carries the TBTA/AT-rich motif in its sequence. More significantly, eight related lincRNA genes that stem from copies of the FAM230C sequence also carry the motif. These genes were formed and are present only in chr22, exclusively in low copy repeats (LCR22s) situated within or close to the 22q11.2 region [
A total of seventeen lncRNA genes are found in various chromosomes that originate from different segments of the FAM230C sequence, of which nine do not carry the TBTA. Here we propose that the FAM230C sequence serves as a pool for formation of diverse lncRNA genes by sequence duplication and subsequent modification.
With respect to RNA transcription, data provided by NCBI on RNA-seq transcript expression from the eight FAM230C-related genes on chr22 show that RNA transcripts are expressed almost exclusively in the testes (
Translocation Breakpoint Type A sequence [
Alu sequences are from the Dfam website
Human Satellite I sequence is from NCBI GenBank: X00470.1 [
Human chr22 sequence is from NCBI, Homo sapiens chromosome 22, GRCh38.p7 Primary Assembly, Sequence ID:
lincRNA gene sequences, exon and intron specifications, and chromosomal coordinates are from the Ensembl Genome Browser,
LINC01663
Ensembl: LINC01663 ENSG00000276095 NCBI: LINC01663 NCBI ID: 100996432 Vega: AC008103.3 OTTHUMG00000188102
LINC01660 (AC011718.2)
Ensembl: LINC01660 ENSG00000274044 NCBI: LINC01660 NCBI ID: 729461 Vega: AC011718.2 OTTHUMG00000188347
LINC01662 (AC008132.15)
Ensembl: LINC01662 ENSG00000182824 NCBI: LINC01662 NCBI ID: 642643 Vega: AC008103.3 OTTHUMG00000187471
FAM230B
Ensembl: FAM230B ENSG00000215498 NCBI: FAM230B NCBI ID: 642633 Vega: FAM230B OTTHUMG00000150782
AP000552.1 (KB-1183D5.13)
Ensembl: AP000552.1 ENSG00000206142 NCBI: LOC100996335 NCBI ID: 100996335, Vega: KB-1183D5.13 OTTHUMG00000150795
AC007731.1
Ensembl: AC007731.1 ENSG00000188280 NCBI: LOC101927859 NCBI ID: 101927859, Vega: AC007731.1 OTTHUMG00000150686
AC008079.1
Ensembl: AC008079.1 ENSG00000187979 NCBI: LOC100996415 NCBI ID: 100996415 Vega: Not characterized
LINC01658 (AP000345.1)
Ensembl: LINC01658 ENSG00000178248 NCBI: LINC01658 NCBI ID: 388882, Vega: AP000345.1 OTTHUMG00000150669
1. AP000552.3 ENSG00000237407; KB-1183D5.14 OTTHUMG00000150793 (chr22)
2. FAM230A NCBI ID
3. AP003900.1 ENSG00000277693 (chr21)
4. EIF3FP1 NCBI ID: 54053 (chr21)
5. EIF3FP2 NCBI ID: 838880 (chr13)
6. EIF3FP3 NCBI ID: 339799 (chr2)
7. DUXAP9 ENSG00000225210 (chr14)
8. DUXAP10 ENSG00000244306 (chr14)
9. CECR7 ENSG00000237438 (chr22)
Genes # 3–9 carry sequences from the 5’ half of FAM230C, genes # 1 and 2 from the 3’ half of FAM230C. As more lncRNA genes are annotated and characterized, this number may increase.
For alignment of two or more nucleotide sequences, the EMBL-EBI Clustal Omega Multiple Sequence Alignment program, website:
RepeatMasker analysis [
The Dfam RepeatMasker website is:
To find sequences similar to the Translocation Breakpoint Type A sequence and FAM230C, the BLAST search engine was used at the following website:
NCBI BLAST, website:
The databases targeted were Human genomic + transcript and reference genomic sequences. Searches were run with the default settings and with the option Optimize for Highly similar sequences.
To find lncRNA genes that have sequences similar to a given lncRNA gene sequence, the Ensembl BLAT/BLAST Search Engine was used: (
The data in Supporting Information,
Central to variability and expansion of AT-rich sequences in chromosome 22 is the repeat element Translocation Breakpoint Type A, NCBI GenBank accession: AB261997.1 [
The figure shows the HSATI, Alu, and AT-rich regions and the PATRR. Positions 1–306 represent a direct repeat of positions 814–1119. Positions of the satellite/Alu/AT-rich regions in the Translocation Breakpoint Type A sequence were determined by RepeatMasker analysis.
Start | End | Satellite/Alu /AT-rich | Class/family* |
---|---|---|---|
1 | 117 | HSAT1 | Satellite |
118 | 272 | AluYm1 | SINE/Alu |
310 | 364 | (AT)n | Simple repeat |
370 | 930 | HSAT1 | Satellite |
931 | 1085 | AluYm1 | SINE/Alu |
1124 | 1469 | (TA)n | Simple repeat |
1686 | 1880 | (ATTATAT)n | Simple repeat |
*Data from RepeatMasker analysis of sequence from NCBI GenBank: AB261997.1
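The direct repeat noted for the TBTA sequence (positions 1–306 repeated at 814–1119) implies a constant offset of 813 nt, and this can be cross-checked against the RepeatMasker coordinates in the table above. The following is a minimal sketch only: the annotation tuples are transcribed from the table, while the function and variable names are illustrative and are not part of the original analysis.

```python
# Direct repeat reported for the TBTA sequence: 1-306 is repeated at 814-1119.
REPEAT_A = (1, 306)
REPEAT_B = (814, 1119)
OFFSET = REPEAT_B[0] - REPEAT_A[0]  # distance between the two copies

# Rows transcribed from the RepeatMasker table (start, end, repeat name).
ANNOTATIONS = [
    (1, 117, "HSAT1"),
    (118, 272, "AluYm1"),
    (310, 364, "(AT)n"),
    (370, 930, "HSAT1"),
    (931, 1085, "AluYm1"),
    (1124, 1469, "(TA)n"),
    (1686, 1880, "(ATTATAT)n"),
]


def shift(interval, offset):
    """Move a closed interval by a fixed number of positions."""
    start, end = interval
    return (start + offset, end + offset)


def shifted_copy_is_annotated(start, end, name):
    """True if the interval, moved by OFFSET, lies inside an annotation
    that carries the same repeat name."""
    s, e = shift((start, end), OFFSET)
    return any(n == name and a <= s and e <= b for a, b, n in ANNOTATIONS)


if __name__ == "__main__":
    print("offset:", OFFSET)
    print("AluYm1 copy found:", shifted_copy_is_annotated(118, 272, "AluYm1"))
    print("HSAT1 copy found:", shifted_copy_is_annotated(1, 117, "HSAT1"))
```

Under this reading, the AluYm1 element at 118–272 maps exactly onto the second AluYm1 annotation, and the first HSAT1 block maps into the end of the second HSAT1 block.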
In terms of origin, the TBTA sequence has an 85% identity with Satellite 1 subspecies, Human Satellite I (NCBI GenBank: X00470.1), which is described as a sequence that “includes a male specific 2.47 kb tandemly repeated unit containing one Alu family member per repeat” [
Start | End | Satellite/Alu /repeat | Class/family* |
---|---|---|---|
1 | 505 | HSAT1 | Satellite |
506 | 792 | AluSc8 | SINE/Alu |
796 | 903 | (TATATGT)n | Simple repeat |
*Data from RepeatMasker analysis of sequence from NCBI GenBank: X00470.1.
These signatures are useful for the identification of TBTA-related sequences, as there can be an ambiguity in alignment of nucleotide sequences due to the internal repeats and the low complexity of AT-rich regions displayed by the TBTA/AT-rich element. Thus the satellite/Alu/AT-rich signature is used with the lincRNA gene analyses discussed here.
FAM230C is a gene termed Homo sapiens family with sequence similarity 230 member C, long intergenic non-coding RNA (NCBI Gene ID: 26080). Its designation in Vega is RP11-341D18.3 (OTTHUMG00000189381) and in Ensembl is FAM230C (ENSG00000279516). The gene consists of 37,928 bp and is present in chromosome 13 at coordinates chr13:18194697–18232624. It has 8 exons, and its transcript is classified as a processed transcript of unknown function.
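The stated gene length follows from the chromosomal coordinates if both end positions are counted, which is the usual convention for 1-based, inclusive genome browser coordinates (assumed here). A short sketch of that arithmetic, with an illustrative helper name:

```python
def span_length(start: int, end: int) -> int:
    """Length of a closed, 1-based interval [start, end]."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# Coordinates of FAM230C on chr13 as given in the text.
FAM230C_START, FAM230C_END = 18_194_697, 18_232_624
FAM230C_LENGTH = span_length(FAM230C_START, FAM230C_END)

if __name__ == "__main__":
    print(f"FAM230C spans {FAM230C_LENGTH:,} bp")
```

The same helper can be applied to any other coordinate pair quoted in the text to confirm that a reported length and its coordinates agree.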
FAM230C contains three copies of the HSAT1/Alu/AT-rich sequence of the TBTA. RepeatMasker analysis of the region of the FAM230C gene that contains the TBTA shows the same signature pattern, with the HSAT1/Alu/AT-rich motif repeated three times, although there are minor differences involving two Alu subspecies (
Start | End | Satellite/Alu /AT-rich # | Class/family | Repeat |
---|---|---|---|---|
17010 | 17576 | HSATI | Satellite | 1 |
17577 | 17854 | AluSc8 | SINE/Alu | |
17864 | 18258 | (TA)n | Simple repeat | |
18259 | 18824 | HSATI | Satellite | 2 |
18825 | 19117 | AluSc8 | SINE/Alu | |
19118 | 20117 | (TATATTA)n | Simple repeat | |
20119 | 20607 | HSATI | Satellite | 3 |
20610 | 20761 | AluYm1 | SINE/Alu | |
20800 | 20952 | (TA)n | Simple repeat | |
21862 | 22032 | HSATI | Satellite | |
*Data from RepeatMasker analysis of sequence of FAM230C
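The satellite/Alu/AT-rich signature can be located programmatically in a list of RepeatMasker annotations by scanning for three consecutive rows whose classes are Satellite, SINE/Alu and Simple repeat. The sketch below is illustrative only: the rows are transcribed from the table above, the function name is hypothetical, and the 50 nt tolerance for gaps between adjacent elements is an assumption chosen for this example rather than a parameter of the original analysis.

```python
from typing import List, Tuple

Row = Tuple[int, int, str, str]  # start, end, repeat name, class/family

# Rows transcribed from the RepeatMasker table for FAM230C.
FAM230C_ROWS: List[Row] = [
    (17010, 17576, "HSATI", "Satellite"),
    (17577, 17854, "AluSc8", "SINE/Alu"),
    (17864, 18258, "(TA)n", "Simple repeat"),
    (18259, 18824, "HSATI", "Satellite"),
    (18825, 19117, "AluSc8", "SINE/Alu"),
    (19118, 20117, "(TATATTA)n", "Simple repeat"),
    (20119, 20607, "HSATI", "Satellite"),
    (20610, 20761, "AluYm1", "SINE/Alu"),
    (20800, 20952, "(TA)n", "Simple repeat"),
    (21862, 22032, "HSATI", "Satellite"),
]

SIGNATURE = ("Satellite", "SINE/Alu", "Simple repeat")


def find_signatures(rows: List[Row], max_gap: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) for each run of three consecutive annotations
    that matches SIGNATURE with at most max_gap nt between neighbours."""
    ordered = sorted(rows)
    hits = []
    for i in range(len(ordered) - 2):
        window = ordered[i:i + 3]
        classes = tuple(row[3] for row in window)
        gaps = [window[k + 1][0] - window[k][1] - 1 for k in range(2)]
        if classes == SIGNATURE and all(0 <= g <= max_gap for g in gaps):
            hits.append((window[0][0], window[2][1]))
    return hits


if __name__ == "__main__":
    for n, (start, end) in enumerate(find_signatures(FAM230C_ROWS), 1):
        print(f"repeat {n}: {start}-{end}")
```

Matching on the class column rather than on the repeat name keeps the scan tolerant of the Alu subfamily differences (AluSc8 versus AluYm1) noted for FAM230C.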
An alignment of the TBTA sequence with the sequence of repeat #3 in FAM230C is in
The nucleotide positions in both sequences are shown. The NCBI Align Two Sequences Nucleotide BLAST program was used.
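Percent identity values such as those quoted in this study are derived from pairwise alignments. As a minimal sketch of how such a value can be computed from two already-aligned sequences, the function below counts identical, non-gap columns and divides by the total number of alignment columns. This denominator follows the way BLAST reports identities; other programs may exclude gap columns, so the choice is an assumption. The example sequences are invented for illustration and are not taken from the TBTA or FAM230C.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the
    same nucleotide. Gaps are written as '-' and never count as matches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Hypothetical toy alignment: one mismatch and one gap in ten columns.
    print(percent_identity("ATTATAT-GC", "ATTACATAGC"))
```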
Blast/Blat searches using the Ensembl genome browser (
22q11.2 coordinates are from Guna et al [
lincRNA gene | Chromosomal location | Length; LCR22 position | lincRNA nt positions | FAM230C nt positions |
---|---|---|---|---|
1. LINC01663 (AC008103.3) | chr22: 18,872,943-18,895,007 | 22,065 bp; LCR22A | 314-21974 | 17003-37928 |
2. LINC01660 (AC011718.2) | chr22: 18,361,223-18,391,705 | 30,483 bp; LCR22A | 4028-21304 | 16903-37928 |
3. LINC01662 (AC008132.15) | chr22: 18,733,314-18,758,506 | 25,913 bp; LCR22A | 1-16346 | 17872-37928 |
4. FAM230B | chr22: 21,167,158-21,192,756 | 25,599 bp; LCR22D | 1-16529 | 17410-37928 |
5. AP000552.1 (KB-1183D5.13) | chr22: 21,300,390-21,325,642 | 25,253 bp; LCR22D | 1-16588 | 17408-37928 |
6. AC007731.1 | chr22: 20,338,205-20,354,972 | 16,768 bp; LCR22B | 1-16551 | 17401-37928 |
7. AC008079.1 | chr22: 18,177,438-18,206,515 | 29,078 bp; LCR22A | 380-16994 | 16994-37928 |
8. | chr | 26,095 bp; LCR22F | 1-16761 | 16668-37928 |
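The FAM230C nt positions in the table above can be compared with the location of the three HSATI/Alu/AT-rich repeat units in FAM230C (positions 17010–20952 according to the RepeatMasker analysis of FAM230C). The sketch below computes the overlap between each duplicated segment and that region. It is a coordinate check only and makes no statement about sequence conservation within the overlap. The names are illustrative, and row 8 is labelled generically because its gene name is not given in the table.

```python
FAM230C_LENGTH = 37_928

# Span of repeat units 1-3 in FAM230C, from its RepeatMasker table.
TBTA_REGION = (17_010, 20_952)

# FAM230C nt positions shared with each chr22 lincRNA gene (table above).
SEGMENTS = {
    "LINC01663": (17_003, 37_928),
    "LINC01660": (16_903, 37_928),
    "LINC01662": (17_872, 37_928),
    "FAM230B": (17_410, 37_928),
    "AP000552.1": (17_408, 37_928),
    "AC007731.1": (17_401, 37_928),
    "AC008079.1": (16_994, 37_928),
    "row 8": (16_668, 37_928),
}


def overlap(a, b):
    """Number of positions shared by two closed intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


if __name__ == "__main__":
    region_size = TBTA_REGION[1] - TBTA_REGION[0] + 1
    for gene, segment in SEGMENTS.items():
        shared = overlap(segment, TBTA_REGION)
        print(f"{gene}: {shared} of {region_size} nt of the repeat region")
```

Every segment runs to position 37928, the 3’ end of FAM230C, and every segment starts upstream of position 20952, so all eight overlap the repeat-containing region.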
As an example, presence of the HSAT1/Alu/AT-rich repeat signature in lincRNA gene LINC01660 (AC011718.2) is shown in
sequences. The figure shows the similarity of the LINC01660 (AC011718.2) sequence with that of the TBTA and extends from approximately the middle of the PATRR (TBTA position 1686) to the 3’ end of the TBTA (position 2540). Excluding the AT-variable sequences and a PATRR sequence rearrangement, the entire TBTA sequence is found in LINC01660. Clustal Omega Multiple Sequence Alignment program (EMBL-EBI) was used for alignment.
Start | End | Satellite/Alu /repeat # | Class/family | Repeat |
---|---|---|---|---|
2413 | 4128 | (ATAATAT)n | Simple repeat | |
4129 | 4694 | HSATI | Satellite | 1 |
4695 | 4849 | AluYm1 | SINE/Alu | |
4888 | 4964 | (TA)n | Simple repeat | |
4965 | 5530 | HSATI | Satellite | 2 |
5531 | 5685 | AluYm1 | SINE/Alu | |
5724 | 6118 | (TA)n | Simple repeat | |
7047 | 7159 | AluSz | SINE/Alu | |
*Data from RepeatMasker analysis of sequence of LINC01660
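RepeatMasker labels simple repeats by their unit, for example (ATTATAT)n, (TATATTA)n or (ATAATAT)n. A tandem array can be described equally well by any rotation of its unit, or by the reverse complement of the unit when read from the opposite strand, so differently labelled repeats are not necessarily different sequences. The following sketch reduces a unit to a canonical form (the alphabetically smallest rotation over both strands) so that labels can be compared. The canonical form is a convention adopted for this illustration, and agreement between two labels does not by itself establish a common origin.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G and T."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def rotations(unit: str):
    """All cyclic rotations of a repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def canonical_unit(unit: str) -> str:
    """Smallest rotation of the unit or of its reverse complement."""
    unit = unit.upper()
    return min(rotations(unit) + rotations(reverse_complement(unit)))


if __name__ == "__main__":
    # Unit labels taken from the RepeatMasker tables in this study.
    for label in ("ATTATAT", "TATATTA", "ATAATAT", "TATATGT", "TA", "AT"):
        print(label, "->", canonical_unit(label))
```

With this convention the three 7-nt AT-rich units reported for the TBTA, FAM230C and LINC01660 reduce to the same canonical unit, whereas the (TATATGT)n unit of Human Satellite I does not.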
There are a total of forty repeats of segments of the TBTA in chromosome 22; twenty-four of these repeats are part of the eight lincRNA genes shown in
Seven of the eight genes of
The sequences in sections B and C (highlighted in yellow) are outside of but contiguous with the AC007731.1 gene. Bottom: schematic of regions A, B, and C (highlighted in yellow) that represent the close identity of FAM230C sequences with those of chr22. The chr22 sequence that has a high identity with FAM230C and is contiguous with the 5' side of AC007731.1 (region B) is ~389 bp long, and the FAM230C-like sequence contiguous with the 3' side of AC007731.1 (region C) is ~150 bp. The upstream region of AC007731.1, termed A, consists of a 2711 bp segment of chr22 that has a high identity with 5’ half sequences of FAM230C.
In terms of gene composition, there are other sequences that form part of the eight lincRNA genes, in addition to the FAM230C 3’ half sequence. For example, the 3’ end of the AC007731.1 lncRNA gene sequence partially overlaps and is antisense to protein gene USP41 on chr22 and thus shares some USP41 sequences; the remaining and major part of the AC007731.1 sequence consists of FAM230C sequences. In other examples, FAM230B and five other lincRNA genes share a common sequence close to their 3’ ends that extends beyond the region of identity with the 3’ end of FAM230C (Supporting Information,
From the NCBI website that provides gene expression values in human tissues (
A ncRNA gene termed AP000552.3 (ENSG00000237407) has a sequence similar to a short 3’ half segment of FAM230C. This is a small gene of 3185 bp encoded within the sequence of the large lincRNA gene AP000552.1. It is transcribed in the reverse direction from AP000552.1 and on the opposite strand. It has an identity with the antisense sequence strand of FAM230C at nt positions 31657–34832 and its entire sequence consists only of a 3’ segment of FAM230C and does not include the TBTA sequence. Thus this gene is not encoded in a separate locus but within a section of the AP000552.1 gene, and additionally, differs from the eight genes in terms of composition and size. It appears to be in a separate ncRNA gene category.
Another ncRNA gene, FAM230A (NCBI ID
Importantly, there are lincRNA genes that have an identity only with the 5’ half segment of FAM230C, do not contain the TBTA/AT-rich motif and most reside in chromosomes other than chr22. The most prominent is AP003900.1 ENSG00000277693 on chr21. This lincRNA gene has a high sequence identity to FAM230C (98%) and its entire sequence consists of most of the 5’ half sequence of FAM230C (Supporting Information,
In other examples, the 5’ half sequence of the FAM230C gene on human chromosome 13 carries a small ncRNA gene within its sequence that produces a reverse strand transcript. This is termed the eukaryotic translation initiation factor 3 subunit F pseudogene 2 (EIF3FP2, NCBI ID:838880; Ensembl AL356585.1 ENSG00000279081). The EIF3FP2 gene is 2097 bp long. The gene is situated within the 5’ half of the FAM230C sequence at nt positions 11274–13870 and carries no TBTA sequences. Two closely related genes, EIF3FP1 (NCBI ID: 54053) in chr21, which is encoded within AP003900.1, and EIF3FP3 (NCBI ID: 339799) in chr2, also carry only a segment of the 5’ end sequence of FAM230C, do not have the TBTA sequence and reside in chromosomes other than chromosome 22.
Additionally, several other pseudogenes also have homology with the 5’ end segment of FAM230C, lack the TBTA/AT-rich motif, and reside in a chromosome other than chr22; these include the homeobox pseudogenes DUXAP9 (ENSG00000225210) and DUXAP10 (ENSG00000244306) on chr14. Of interest, several disease-related aspects of both DUXAP9 and DUXAP10 have been reported [
An additional gene that carries a segment of the 5’ end sequence of FAM230C is CECR7. This gene is an exception among the lncRNA genes that have 5’ half sequences of FAM230C in that it is located in chr22. It contains only a small section of the 5’ end of FAM230C (1937 bp) and is situated at chr22:17036570–17058792, which is far removed from the 22q11.2 region. It is possible that the FAM230C 5’ sequence present in CECR7 originated via transposition of this small sequence and that this gene is not a product of duplication of the FAM230C sequence, as the chromosomal region of CECR7 is devoid of other FAM230C sequences. Functions related to CECR7 have recently been shown and may point to important cancer-related processes [
To summarize, with the exception of CECR7, there are six ncRNA genes that carry only 5’ segments of FAM230C, do not have the TBTA motif and are situated in chromosomes other than chromosome 22. In contrast, the eight genes described in
Palindromic PATRR AT-rich stem loop sequences are found at DNA breakpoints located within the LCR22B segmental duplication in chromosomal region 22q11.2 and several constitutional translocations may involve this region [
One concept of why AT-rich and other Pu/Py sequences are stored in lincRNA and/or protein genes is that genes may provide stability for these motifs; however, their presence increases the probability of DNA breakage and translocation within these genes, which can alter or inactivate the gene [
TBTA sequences were previously found in lincRNA gene exons [
The TBTA motif is not found in FAM230C-related lncRNA genes in chromosomes other than chr22, even though these genes may also have formed from FAM230C duplications in those chromosomes. For example, chr21 contains repeats of the TBTA/AT-rich motif that are part of a copy of the FAM230C sequence present in chr21, but the TBTA sequences are not incorporated into the lincRNA gene AP003900.1 that was formed from the 5’ end region of the FAM230C copy in chr21. FAM230C sequences without the TBTA segments are also found in lncRNA pseudogenes in chromosomes 2, 9 and 14. The number of genes here is relatively small, seventeen FAM230C-related genes in total, yet they show a pattern. Perhaps a cellular regulatory mechanism secures the formation of TBTA-containing FAM230C-related lincRNA genes only in or near the 22q11.2 region of chr22, and the formation of FAM230C-related lncRNA genes without the TBTA in other chromosomes.
In terms of transcription from FAM230C-related lincRNA genes present in or close to the 22q11.2 region, RNA transcripts are found exclusively in human testes with one exception, LINC01658 (AP000345.1) where there is minor expression in other tissues [
RNA expression during embryonic development from these and other lncRNA genes in the 22q11.2 deletion region is of interest for assessing their possible involvement in developmental abnormalities that result from loss of the genes. Ensembl and Expression Atlas have reported RNA expression values for a number of lncRNAs in developing tissues (
The FAM230C gene sequence serves as a source for formation of other lincRNA genes and as a source for spreading of TBTA/AT-rich sequences in chr22. Seventeen lncRNA genes carrying FAM230C sequences have been detected, eight of which contain the TBTA/AT-rich motif. Significantly, the eight genes are all in chr22, localized in or near the critical 22q11.2 deletion region, and all are within low copy repeats, the LCR22 segmental duplications. This work helps define properties of a lincRNA gene family in the chromosomal region 22q11.2 and suggests the mode of lncRNA gene formation of this family.
Data from NCBI Genes & Expression website:
(PDF)
The positional start site for each gene is close to the end of the identity with FAM230C. Chromosomal coordinates for the common sequence shared by the six genes are 18495612–18500180, Homo sapiens chromosome 22, GRCh38.p7 Primary Assembly.
(PDF)
(PDF)
Folding of DNA sequence for secondary structure was with the mFold Web Server:
(PDF)
Data compiled from NCBI Genes & Expression website:
(PDF)
I thank Dr. Deyou Zheng, Albert Einstein College of Medicine for kindly providing parameters for the LCR22F segmental duplication. This paper is dedicated to granddaughter Michelle, who has so bravely coped with DiGeorge Syndrome.
TBTA: Translocation Breakpoint Type A sequence
HSATI: Human satellite I sequences
lncRNA: long non-coding RNA
lincRNA: long intergenic non-coding RNAs
PATRR: palindromic AT-rich repeat
chr: chromosome