Chromosome 18 is one of the 23 pairs of chromosomes that make up the human karyotype.
Chromosome 18 appears to have the lowest gene density of any human chromosome and is one of only three chromosomes for which trisomic people survive to term.
There are also several genetic disorders derived from chromosome 18 trisomy and aneuploidy.
Here we report the finished sequence and gene annotation of human chromosome 18, which will allow a better understanding of the normal and disease biology of this chromosome.
Despite the low density of protein-coding genes on chromosome 18, we found that the proportion of non-protein-coding sequences conserved among mammals is close to that of the entire genome.
Extending this analysis to the entire human genome, we found that the density of conserved non-protein-coding sequences is largely uncorrelated with gene density.
This has important implications for the nature and roles of non-protein-coding sequence elements.
The International Human Genome Sequencing Consortium (IHGSC) recently completed a human genome sequence and published a report on the finishing of the human genome.
Now, papers containing detailed reports on each human chromosome are bringing to light aspects of this work’s biomedical and evolutionary implications.
Here we describe the completion of a physical map, a high-quality finished sequence, and a gene catalog for human chromosome 18, which represents approximately 2.7% of the human genome.
The extremely low density of protein-encoding genes on chromosome 18 offers an opportunity to study the conservation of non-protein-encoding sequences.
It was recently observed that, in addition to protein-coding sequences, ~ 3% of the human genome shows a degree of evolutionary conservation among mammals significantly higher than the background.
It is not clear whether this sequence consists primarily of gene-related regulatory elements or whether it represents other elements not closely coupled to genes.
These alternatives can be explored by comparing gene-rich and gene-poor chromosomes to see whether the proportion of conserved non-protein-coding sequence scales with gene density or is unrelated to it.
The finished sequence of chromosome 18 contains 76,117,153 bases and is interrupted by three euchromatic gaps, a gap at the 18q telomere, and the centromeric heterochromatin.
These gaps are refractory to current cloning and mapping technology. The sizes of the euchromatic gaps were estimated by alignment with the conserved syntenic regions in the mouse genome (ref. 4).
The size of the telomere gap was estimated using the size of the telomeric half-YAC (yeast artificial chromosome).
The total size of these gaps is estimated at 118 kb. This corresponds to <0.2% of the euchromatic chromosome length, substantially less than the average for the human genome (ref. 3; see also refs 5-7).
Of the final sequence, 79% was generated by the Broad Institute of MIT and Harvard, 20% by the RIKEN Genomic Sciences Center, and the remaining 1% by three other research groups.
Details of the clone map and sequence construction are described in the Supplementary Information.
Several analyses confirm that almost the entire euchromatic region of chromosome 18 is present and accurately represented in the finished sequence.
Of the 332 gene sequences in the RefSeq data set mapped to chromosome 18, all are present and complete in the final sequence.
Furthermore, the finished sequence agrees with the genetic and radiation hybrid maps.
We assessed the local accuracy of the clone path by aligning paired end sequences from a human fosmid library (designated WIBR2, representing 10× physical coverage) to the finished sequence.
Clone path errors can be detected by identifying discrepancies between the distances separating fosmid ends in the finished sequence and those expected from insert size constraints.
The analysis revealed a single aberrant region, which turned out to result from a bacterial artificial chromosome (BAC) clone containing a 21 kb deletion that was present in the source genome.
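The end-pair consistency check described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the chromosome 18 analysis: the insert-size bounds, the `EndPair` record, and the clone names are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed insert-size bounds for a fosmid library (illustrative values only,
# not taken from the paper).
MIN_INSERT_BP = 30_000
MAX_INSERT_BP = 50_000


@dataclass
class EndPair:
    clone_id: str   # hypothetical fosmid clone name
    left_pos: int   # placement of one end read on the finished sequence
    right_pos: int  # placement of the paired end read


def discordant_pairs(pairs, min_bp=MIN_INSERT_BP, max_bp=MAX_INSERT_BP):
    """Return (clone_id, apparent_span) for pairs whose span violates the bounds."""
    flagged = []
    for p in pairs:
        span = abs(p.right_pos - p.left_pos)
        if not (min_bp <= span <= max_bp):
            flagged.append((p.clone_id, span))
    return flagged


pairs = [
    EndPair("fosmid_A", 1_000_000, 1_040_000),  # 40 kb apart: within bounds
    EndPair("fosmid_B", 2_000_000, 2_019_000),  # 19 kb apart: shorter than allowed
    EndPair("fosmid_C", 3_000_000, 3_061_000),  # 61 kb apart: longer than allowed
]
print(discordant_pairs(pairs))
```

A cluster of fosmids that all appear too short over the same region would be the kind of signal that points to sequence missing from the clone path, such as a deletion within a BAC clone.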
Finally, an independent quality assessment exercise commissioned by NHGRI estimated the accuracy of the finished sequence at less than one error per 100,000 bases (ref. 11; J. Schmutz, personal communication).
A manually curated gene catalog was produced, annotating 337 gene loci and 171 pseudogene loci on chromosome 18. These include all previously known genes on chromosome 18.
All ‘novel transcript’ genes had expressed sequence tag (EST) evidence. For ‘putative genes’, only a subset of the exons was supported by one or more spliced ESTs.
Only a tiny fraction of all loci, those in the ‘novel’ and ‘putative’ categories, were annotated as genes on the basis of spliced EST evidence alone.
Some “gene fragment” loci may be pseudogenes.
Using aligned EST evidence, we extended as many previously known gene models as possible at their 5′ or 3′ ends.
Approximately 57% of the RefSeq and Mammalian Gene Collection (MGC) transcripts could be extended.
The 5′ end extensions averaged 321 bp, and the 3′ end extensions averaged 1,131 bp.
Furthermore, a new 5′ exon was found for 14% of RefSeq or MGC transcripts, and a new 3′ exon was found for 2.2%.
The ability to extend gene models probably reflects the expanded databases of transcripts and ESTs.
A sample of the extended gene models was validated in the laboratory.
An average of 10.7 exons per known full-length transcript was found, comparable to recently published reports for other human chromosomes.
Internal exon lengths average 155 bp, and the average transcript length is 3.1 kb for complete transcripts of known genes.
There is evidence of extensive alternative splicing, with gene loci averaging 3.1 distinct transcripts and 71% having at least two transcripts.
This alternative splicing rate is comparable to recent reports. The longest gene on chromosome 18 is DCC (deleted in colorectal carcinoma), spanning 1,190,632 bp.
DCC also contains the longest intron, at 411,177 bp. The longest mature transcript is laminin α3 (LAMA3), at 10,585 bp. The longest single exon is found in TCF4: a 3′ exon of 5,700 bp.
The gene with the most identified splice forms is TGIF (TGFβ-induced factor), which appears to have ten splice forms, of which RefSeq transcripts represent two.
Of the 171 pseudogenes on chromosome 18, approximately two-thirds are processed pseudogenes (without introns) that arose by retroposition, and the remaining third are non-processed.
In addition, four transfer RNA genes were identified on the chromosome. An analysis of gene families revealed that several families have multiple members present on chromosome 18.
These include members of the laminin and cadherin families of cell adhesion molecules and a group of ten serpin protease inhibitors.
Careful analysis of the gene models found 59 overlapping gene pairs on chromosome 18, suggesting that gene overlap may be 2-4 times more common than previously thought.
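Counting overlapping gene pairs amounts to an interval-intersection test on the annotated gene spans. The sketch below is a minimal illustration under assumed conventions: inclusive coordinates, no strand requirement, and hypothetical gene names. The source does not state the exact overlap criteria that were used.

```python
from itertools import combinations


def overlapping_pairs(genes):
    """Return sorted name pairs for genes whose [start, end] spans intersect.

    `genes` is a list of (name, start, end) tuples with inclusive coordinates.
    """
    pairs = []
    for (name_a, start_a, end_a), (name_b, start_b, end_b) in combinations(genes, 2):
        # Two closed intervals intersect when each starts before the other ends.
        if start_a <= end_b and start_b <= end_a:
            pairs.append(tuple(sorted((name_a, name_b))))
    return sorted(pairs)


genes = [
    ("geneA", 100, 500),
    ("geneB", 450, 900),      # overlaps the 3' end of geneA's span
    ("geneC", 1_000, 1_200),
    ("geneD", 1_100, 1_150),  # nested inside geneC
]
print(overlapping_pairs(genes))
```

The all-pairs comparison is quadratic in the number of genes, which is acceptable for a few hundred loci; a sort-and-sweep approach would scale better for whole-genome catalogs.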
With an average of 4.4 genes per megabase (Mb), chromosome 18 has the lowest gene density of published human chromosomes.
This density of genes cannot be explained by random fluctuations around the mean of the whole genome.
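The reported density can be reproduced approximately from figures given elsewhere in the text. This sketch assumes the density is the number of gene loci (337) divided by the finished sequence length (76,117,153 bases); the denominator actually used for the published value is not stated here and may differ slightly.

```python
GENE_LOCI = 337              # gene loci reported in the curated catalog
FINISHED_BASES = 76_117_153  # bases in the finished sequence


def genes_per_mb(gene_count, length_bp):
    """Gene density in genes per megabase (1 Mb = 1,000,000 bp)."""
    return gene_count / (length_bp / 1_000_000)


density = genes_per_mb(GENE_LOCI, FINISHED_BASES)
print(f"{density:.1f} genes per Mb")
```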
The low gene density is reflected in the low percentage of transcribed sequence (28.5%) and the tiny fraction of the chromosome contained in exons (1.14% in all exons, 1.06% in coding exons).
The G+C content (39.8%) is low, consistent with the known positive correlation between G+C content and gene density.
Chromosome 18 contains 24 gene deserts, which comprise 28 Mb or ~38% of the total length of the chromosome.
The sparsest region of the chromosome harbors only three genes across 4.5 Mb.
Additionally, chromosome 18 has the longest median intron length among all chromosomes, reflecting a genome-wide inverse correlation between intron size and gene density.
Despite being gene-poor, chromosome 18 is not enriched in repetitive sequences. Fossils of transposable elements cover 43.5% of the chromosome, which is typical for the genome as a whole.
Chromosome 18 also has a relatively low rate of segmental duplication (segmental duplications are defined as having greater than 90% identity and being longer than 1 kb).
Segmental duplications constitute ~2.5% (1.92 Mb) of the chromosome, with a higher representation of interchromosomal duplications (2.13%) than intrachromosomal duplications (0.55%).
Some sequences are represented in both types of duplication.
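The working definition of a segmental duplication given above (greater than 90% identity and longer than 1 kb) can be expressed as a simple filter over alignment hits. The hit list below is hypothetical, and the sketch says nothing about how the alignments themselves were generated.

```python
def is_segmental_duplication(identity_pct, length_bp):
    """Apply the thresholds stated in the text: >90% identity and >1 kb."""
    return identity_pct > 90.0 and length_bp > 1_000


# Hypothetical alignment hits: (identity in percent, aligned length in bp)
hits = [(98.5, 12_000), (89.9, 50_000), (95.0, 800), (90.5, 1_001)]
kept = [hit for hit in hits if is_segmental_duplication(*hit)]
print(kept)
```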
Parental origin and cell division stage of the error
Parental origin in Bugge’s sample was determined in all 100 cases.
In four cases, the origin of the additional chromosome 18 was paternal, and all four were consistent with either a post-zygotic mitotic (PZM) error or a non-crossover MII error.
It is not possible to distinguish between these two classes. However, it seems doubtful that all of these paternal cases arose from meiotic events that generated only non-crossovers.
Therefore, the parsimonious assumption is that they were all due to post-zygotic errors.
In the remaining 96 cases, the extra chromosome was of maternal origin. In 34 cases, the error occurred at MI, and 15 of these showed no evidence of crossover.
There were 49 maternal MII cases with evident crossover and seven non-crossover cases that were either maternal PZM or MII.
Again, the definitive classification is not possible; however, since there are four clear paternal cases, this group most likely includes some post-zygotic errors.
Therefore, there is no justification for treating all seven as non-crossover MII.
However, given such a large sample of maternal MII crossover cases, we assume that there is also a small number of maternal non-crossover MII cases, so there is no justification for omitting all seven from the analysis.
We therefore assumed equal numbers of paternal and maternal PZM cases (four each), leaving three that we classified as maternal non-crossover MII.
Among the seven cases, there were slight differences in the informativeness of the markers.
The differences were minimal, and a single subset of three cases was randomly selected for analysis.
In Fisher’s sample, there were two paternal cases, again assumed to be PZM.
Three maternal cases were either PZM or non-crossover MII.
The paucity of genes on chromosome 18 likely explains why it is one of only three autosomes (along with chromosomes 13 and 21) for which trisomic individuals routinely survive.
Although chromosomes 18 and 21 have roughly the same number of genes, trisomy of chromosome 18 (Edwards syndrome) has much more severe effects than trisomy of chromosome 21 (Down syndrome).
Edwards syndrome occurs in 1 in 5,000 live births, and ~90% of affected people die before their first birthday.
In contrast, Down syndrome is more common (1 in 800 live births), and affected people can often cope with the many health consequences and survive into adulthood.
The availability of gene catalogs for these two chromosomes will facilitate work to elucidate how the contributions of specific genes lead to such different clinical outcomes.
Four other syndromes are caused by large-scale abnormalities of chromosome 18, including three partial monosomies caused by deletion of part of the p or q arms (18p-, 18q-, and ring 18) and tetrasomy of the p arm.
At least 45 loci on chromosome 18 have been implicated in genetic disorders. The list includes at least four disorders for which the responsible gene and the molecular mechanism of the disease have been characterized.
For two of these diseases (methemoglobinemia and erythropoietic protoporphyria), there is evidence of novel alternative splice forms that would lead to alterations in the coding sequence.
Comparative genetic analysis revealed a locus that may represent a recently evolved gene in the primate lineage, although its function is unknown.
Among annotated multi-exon genes lying in blocks of conserved synteny among mammals, only one lacks exon conservation with rodents and dog: C18orf2, a RefSeq predicted gene.
Within this conserved syntenic block, there is a primate-specific inversion of ~100 kb (present in both humans and chimpanzees).
One of the endpoints of this inversion lies in the middle of the coding region of the gene, with the result that the coding region is not contiguous in the dog or rodent genomes.
Partial sequencing of this gene in apes suggests that it is conserved at least as far back as the orangutan. Chromosome 18 was compared with its chimpanzee counterpart, chromosome 18 (ref. 16).
The average sequence divergence is 1.25%, close to the genome average.
On a larger scale, human chromosome 18 differs from its great ape counterparts by a human-specific pericentric inversion accompanied by a human-specific inverted duplication.
Consequently, human 18p corresponds to the proximal region of chimpanzee 18q.
Since large-scale chromosomal rearrangements can facilitate speciation (refs 19, 20), this inversion may have played a role in hominin evolution.
Finally, we explore the still mysterious nature of conserved non-protein-coding sequences.
The recent comparison of the human and mouse genomes (ref. 4) led to the surprising discovery that ~5% of the human genome shows evolutionary conservation above the background level.
Similar results have been observed in comparisons between the human and rat genomes.
As only 1-2% of the human genome consists of protein-coding exons, the majority of the human sequence under purifying selection does not encode proteins.
In principle, these non-protein-coding sequences could be associated with protein-coding genes, such as elements that directly or indirectly regulate gene expression, or they could be independent of protein-coding genes, such as elements that play a structural role in chromosomal architecture or that encode RNA genes.
We calculated the overall proportion of bases on each chromosome under purifying selection and partitioned this proportion into protein-coding and non-protein-coding components.
The computational analysis closely followed that used in recent mammalian comparisons.
For each human chromosome, we compared the proportion of total sequence under selection and the proportion of non-protein-coding sequence under selection with the proportion of coding sequence.
Chromosome 18 contains a low overall proportion of sequence under selection, but this is almost entirely explained by its low coding density.
Approximately 4.2% of the bases on chromosome 18 appear to be under purifying selection, consisting of 0.6% in the exons of protein-coding genes and 3.6% in non-protein-coding elements.
The proportion of non-protein-coding sequence under selection is typical for human chromosomes.
Note that chromosomes 19 and 22 are atypical in this analysis; their numerous local gene family expansions make orthology assignment difficult.
Since chromosomes vary widely in size, we repeated the analysis in 5 Mb windows across the human genome. Although there is more scatter in the data, the overall conclusion is very similar.
Notably, the average proportion of non-protein-coding sequence under selection in a window is ~3.8%, and it is slightly negatively correlated (R² = 0.08) with the proportion of coding sequence in the window.
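The windowed comparison can be illustrated with a small sketch. It assumes per-base 0/1 masks marking coding positions and positions under selection, splits them into fixed-size windows, and computes the squared Pearson correlation between the two per-window fractions. The masks and the window size of 5 positions are toy values chosen for readability; the analysis in the text used 5 Mb windows, and its method for calling bases under selection is not reproduced here.

```python
def pearson_r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length, non-constant lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


def window_fractions(coding_mask, selected_mask, window):
    """Per-window (coding fraction, fraction selected but not coding)."""
    coding, noncoding_selected = [], []
    # Only full windows are used; a trailing partial window is ignored.
    for start in range(0, len(coding_mask) - window + 1, window):
        c = coding_mask[start:start + window]
        s = selected_mask[start:start + window]
        coding.append(sum(c) / window)
        noncoding_selected.append(
            sum(1 for ci, si in zip(c, s) if si and not ci) / window
        )
    return coding, noncoding_selected


# Toy masks (1 = coding / under selection), four windows of five positions each.
coding_mask = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0]
selected_mask = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

coding, noncoding_selected = window_fractions(coding_mask, selected_mask, window=5)
r2 = pearson_r_squared(coding, noncoding_selected)
print(coding, noncoding_selected, round(r2, 3))
```

An R² close to zero on real data, as with the value of 0.08 reported above, indicates that the coding fraction of a window explains very little of the variation in its conserved non-protein-coding fraction.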
The analysis shows that the density of conserved non-protein-coding sequences is largely independent of the density of protein-coding genes.
Interestingly, an examination of noncoding sequence aligned between human and chicken also showed a negative correlation with coding content.
In addition, a study of highly conserved noncoding sequences in intergenic regions of human chromosome 21 did not identify a tight coupling with the starts and ends of genes (refs 24, 25).
What is the nature of non-protein-coding elements?
First, the elements could encode transcripts not translated into proteins, such as small RNA genes or large regulatory RNAs.
Second, they could play a structural role, with a constant density of these elements needed to maintain chromosomal structure independent of gene density.
Such structural elements could be evolutionarily essential for the maintenance of a region but dispensable if the entire region were deleted.
This could explain the recent observation in mice that a 1 Mb deletion in a gene desert containing highly conserved elements has no discernible phenotypic effect.
Third, the elements may be primarily related to the regulation of protein-encoding genes, but their distribution may be inversely associated with gene density.
Genes in gene-poor regions may have more elaborate regulatory controls, partly explaining the relative scarcity of genes in such areas.