Chromosome 18 is one of the 23 pairs of chromosomes that make up the human karyotype.
Chromosome 18 appears to have the lowest gene density of any human chromosome and is one of only three chromosomes for which trisomic people survive to term.
There are also a number of genetic disorders derived from chromosome 18 trisomy and aneuploidy.
Here we report the finished sequence and gene annotation of human chromosome 18, which will allow a better understanding of the normal and disease biology of this chromosome.
Despite the low density of protein-coding genes on chromosome 18, we found that the proportion of non-protein-coding sequence conserved among mammals is close to that of the entire genome.
Extending this analysis to the entire human genome, we found that the density of conserved non-protein-coding sequence is largely uncorrelated with gene density.
This has important implications for the nature and roles of non-protein coding sequence elements.
The International Human Genome Sequencing Consortium (IHGSC) recently completed a sequence of the human genome and published a report on the finishing of the human genome.
Now, papers containing detailed reports on each human chromosome are bringing to light aspects of the biomedical and evolutionary implications of this work.
Here we describe the completion of a physical map, a high-quality finished sequence, and a gene catalog for human chromosome 18, which represents approximately 2.7% of the human genome.
The extremely low density of protein-encoding genes on chromosome 18 offers an opportunity to study the conservation of non-protein-encoding sequences.
It was recently observed that, in addition to protein-coding sequences, ~3% of the human genome shows a degree of evolutionary conservation among mammals that is significantly higher than background.
It is not clear whether this sequence consists primarily of gene-related regulatory elements or whether it represents other elements not closely coupled to genes.
These alternatives can be explored by comparing gene-rich and gene-poor chromosomes to see whether the proportion of conserved non-protein-coding sequence scales with gene density or is unrelated to it.
The finished sequence of chromosome 18 contains 76,117,153 bases and is interrupted by three euchromatic gaps, a gap at the 18q telomere and a gap containing the centromeric heterochromatin.
These gaps are refractory to current cloning and mapping technology. The sizes of the euchromatic gaps were estimated by alignment with the conserved syntenic regions in the mouse genome (ref. 4).
The size of the telomere gap was estimated using the size of the telomeric half-YAC (yeast artificial chromosome) clone.
The total size of these gaps is estimated at 118 kb. This corresponds to <0.2% of the euchromatic chromosome length, substantially lower than the average in the human genome (ref. 3; see also refs 5-7).
Of the final sequence, 79% was generated by the Broad Institute of MIT and Harvard, 20% by the RIKEN Genomic Sciences Center, and the remaining 1% by three other research groups.
Details of the construction of the clone map and of the sequencing are described in the Supplementary Information.
Several analyses verify that almost the entire euchromatic region of chromosome 18 is present and accurately represented in the finished sequence.
Of the 332 gene sequences in the RefSeq data set that have been mapped to chromosome 18, all are present and complete in the final sequence.
Furthermore, the finished sequence shows excellent alignment with the genetic and radiation hybrid maps.
We assessed the local accuracy of the clone path by aligning the paired end sequences from a human fosmid library (designated WIBR2, representing 10× physical coverage) to the finished sequence.
By identifying discrepancies between the distances separating fosmid ends in the finished sequence and those expected from insert-size constraints, clone path errors can be detected.
The analysis revealed a single aberrant region, which turned out to be the result of a bacterial artificial chromosome (BAC) clone containing a 21 kb deletion that was present in the source genome.
Finally, an independent quality assessment exercise commissioned by NHGRI estimated the accuracy of the finished sequence at less than one error per 100,000 bases (ref. 11; J. Schmutz, personal communication).
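The fosmid end-pair check described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for the paper: the `EndPair` type, the `discordant_pairs` function, the 30-50 kb insert-size bounds and the example placements are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndPair:
    """Placement of one fosmid's two end reads on the finished sequence."""
    name: str
    left: int   # leftmost aligned base of the pair (bp)
    right: int  # rightmost aligned base of the pair (bp)

    @property
    def span(self) -> int:
        """Distance between the two ends as measured on the sequence."""
        return self.right - self.left


def discordant_pairs(pairs, min_span=30_000, max_span=50_000):
    """Return pairs whose span on the sequence violates the insert-size bounds.

    The 30-50 kb bounds are illustrative assumptions, not the paper's values.
    """
    return [p for p in pairs if not (min_span <= p.span <= max_span)]


# Made-up placements for illustration only.
pairs = [
    EndPair("fosmid_a", 1_000_000, 1_040_000),  # 40 kb span: consistent
    EndPair("fosmid_b", 2_000_000, 2_019_000),  # 19 kb span: too short
    EndPair("fosmid_c", 3_000_000, 3_062_000),  # 62 kb span: too long
]

flagged = discordant_pairs(pairs)
print([p.name for p in flagged])
```

In this sketch, a span that is too short would be expected where sequence is missing from the clone path, and a span that is too long where extra sequence is present. A real analysis would look for clusters of discordant pairs rather than single outliers.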
A hand-curated gene catalog was produced, comprising 337 gene loci and 171 pseudogene loci on chromosome 18. These include all previously known genes on chromosome 18.
All ‘novel transcript’ genes had expressed sequence tag (EST) evidence. For ‘putative genes’, only a subset of the exons was supported by one or more spliced ESTs.
Only a small fraction of all loci, those in the ‘novel’ and ‘putative’ categories, were scored as genes on the basis of spliced EST evidence alone.
Some ‘gene fragment’ loci may be pseudogenes.
Using aligned EST evidence, it was possible to extend many of the previously known gene models at their 5′ or 3′ ends.
Approximately 57% of RefSeq and Mammalian Gene Collection (MGC) transcripts could be extended.
The 5′ end extensions averaged 321 bp, and the 3′ end extensions averaged 1,131 bp.
Furthermore, a new 5′ exon was found for 14% of RefSeq or MGC transcripts, and a new 3′ exon was found for 2.2%.
The ability to extend gene models probably reflects the expanded databases of transcripts and ESTs.
A sample of the extended gene models was validated in the laboratory.
An average of 10.7 exons per known full-length transcript was found, comparable to recently published reports for other human chromosomes.
Internal exon lengths average 155 bp, and the average transcript length is 3.1 kb for complete transcripts of known genes.
There is evidence of extensive alternative splicing, with gene loci averaging 3.1 distinct transcripts and 71% having at least two transcripts.
This alternative splicing rate is comparable to recent reports. The longest gene on chromosome 18 is DCC (deleted in colorectal carcinoma), which spans 1,190,632 bp.
DCC also contains the longest intron, at 411,177 bp. The longest mature transcript is laminin α3 (LAMA3) at 10,585 bp. The longest single exon is found in TCF4, a 3′ exon of 5,700 bp.
The gene with the most identified splice forms is TGIF (TGFβ-induced factor), which appears to have ten splice forms, of which two are represented by RefSeq transcripts.
Of the 171 pseudogenes on chromosome 18, approximately two-thirds are processed pseudogenes (without introns) that arose by retroposition, and the remaining third are unprocessed.
In addition, four transfer RNA genes were identified on the chromosome. An analysis of gene families revealed that several families have multiple members present on chromosome 18.
These include members of the laminin and cadherin families of cell adhesion molecules, and a group of ten serpin protease inhibitors.
Careful analysis of the gene models found 59 overlapping gene pairs on chromosome 18, suggesting that gene overlap may be 2-4 times more common than previously thought.
With an average of 4.4 genes per megabase (Mb), chromosome 18 has the lowest gene density of published human chromosomes.
This low gene density cannot be explained by random fluctuation around the genome-wide mean.
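The gene density quoted above can be reproduced from figures given earlier in the text. The snippet below is a small worked check, assuming that the density was computed as gene loci divided by the length of the finished sequence; the function name `genes_per_mb` is introduced here for illustration.

```python
GENE_LOCI = 337              # gene loci in the curated catalog
FINISHED_BASES = 76_117_153  # bases in the finished sequence


def genes_per_mb(n_genes: int, n_bases: int) -> float:
    """Gene density expressed as genes per megabase of sequence."""
    return n_genes / (n_bases / 1_000_000)


density = genes_per_mb(GENE_LOCI, FINISHED_BASES)
print(f"{density:.1f} genes per Mb")  # prints "4.4 genes per Mb"
```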
The low gene density is reflected both in the low percentage of transcribed sequence (28.5%) and in the small fraction of the chromosome contained in exons (1.14% in all exons, 1.06% in coding exons).
The G+C content (39.8%) is also low, consistent with the known positive correlation between G+C content and gene number.
Chromosome 18 contains 24 gene deserts, which together comprise 28 Mb or ~38% of the total length of the chromosome.
The sparsest region of the chromosome harbors only three genes in more than 4.5 Mb.
Chromosome 18 also has the longest median intron length among all chromosomes, reflecting a genome-wide inverse correlation between intron size and gene density.
Despite being gene-poor, chromosome 18 is not enriched in repeat sequences. Fossils of transposable elements cover 43.5% of the chromosome, which is typical for the genome as a whole.
Chromosome 18 also has a relatively low rate of segmental duplication (segmental duplications are defined as having greater than 90% identity and being longer than 1 kb).
Segmental duplications constitute ~2.5% (1.92 Mb) of the chromosome, with a higher representation of interchromosomal duplications (2.13%) than intrachromosomal duplications (0.55%).
Some sequences are represented in both types of duplication.
Parental origin and cell division error
Parental origin in Bugge’s sample was determined in all 100 cases.
In four cases, the origin of the additional chromosome 18 was paternal, and all four were consistent with either a post-zygotic mitotic (PZM) error or a non-crossover MII error.
It is not possible to distinguish between the two classes. However, it is highly unlikely that these paternal cases arose from meiotic events that generated only non-crossovers.
Therefore, the parsimonious assumption is that they were all due to post-zygotic errors.
In the remaining 96 cases, the extra chromosome was of maternal origin. In 34 cases the error was maternal (mat) MI, of which 15 showed no evidence of crossover.
There were 49 cases of mat MII with evident crossover and seven non-crossover cases that were either maternal PZM or MII.
Again, definitive classification is not possible. However, since there are four apparent paternal cases, it is most likely that this group includes some post-zygotic errors.
Therefore, there is no justification for treating all seven as non-crossover MII.
However, given such a large sample of mat MII crossover cases, we assume that there are a small number of non-crossover mat MII cases, so there is no justification for omitting all seven from the analysis.
Therefore, we assumed an equal number of paternal and maternal PZM errors (four cases each), leaving three that we classified as non-crossover mat MII.
Among the seven cases, there were small differences in the informativeness of the markers.
The differences were extremely small and a single subset of three cases was randomly selected for analysis.
In Fisher’s sample there were two paternal cases, again assumed to be PZM.
There were three maternal cases that were either PZM or non-crossover MII.
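The allocation described above for the seven ambiguous maternal cases in Bugge's sample can be written out as a short sketch. The case identifiers and the fixed random seed are hypothetical and exist only to make the illustration reproducible; the text states only that a subset of three cases was chosen at random.

```python
import random

PATERNAL_PZM = 4  # paternal cases, all assumed to be post-zygotic errors

# Hypothetical identifiers for the seven ambiguous maternal cases.
ambiguous_maternal = [f"case_{i}" for i in range(1, 8)]

# Assume as many maternal as paternal post-zygotic errors.
assumed_maternal_pzm = PATERNAL_PZM
n_noncrossover_mii = len(ambiguous_maternal) - assumed_maternal_pzm

# Fixed seed (an assumption) so the random subset can be reproduced.
rng = random.Random(18)
selected = sorted(rng.sample(ambiguous_maternal, n_noncrossover_mii))
print(n_noncrossover_mii, selected)
```

Because the seven cases differ only slightly in marker informativeness, which three are drawn should have little effect on the downstream analysis.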
The paucity of genes on chromosome 18 likely explains why it is one of the three autosomes (the others being chromosomes 13 and 21) for which trisomic individuals routinely survive.
Although chromosomes 18 and 21 have roughly the same number of genes, trisomy of chromosome 18 (Edwards syndrome) has much more severe effects than trisomy of chromosome 21 (Down syndrome).
Edwards syndrome occurs in 1 in 5,000 live births, and ~90% of affected people die before their first birthday.
In contrast, Down syndrome is more common (1 in 800 live births), and affected people are often able to cope with the many health consequences and survive into adulthood.
The availability of gene catalogs for these two chromosomes will facilitate work to elucidate how the contributions of specific genes lead to such different clinical outcomes.
Four other syndromes are caused by large abnormalities of chromosome 18, including three partial monosomies caused by deletion of part of the p or q arm (18p-, 18q- and ring 18) and tetrasomy of the p arm.
At least 45 loci on chromosome 18 have been implicated in genetic disorders. The list includes at least four disorders for which the responsible gene and the molecular mechanism of the disease have been characterized.
For two of these diseases (methemoglobinemia and erythropoietic protoporphyria), there is evidence of novel alternative splice forms that would lead to alterations in the coding sequence.
Comparative genomic analysis revealed a locus that may represent a recently evolved gene in the primate lineage, although its function is unknown.
Among the annotated multi-exon genes contained in conserved synteny blocks among mammals, only one lacks exon conservation with rodents and dog: C18orf2, a predicted RefSeq gene.
Within this block of conserved synteny there is a ~100 kb primate-specific inversion in the region (present in both humans and chimpanzees).
One of the end points of this inversion is in the middle of the coding region of the gene, with the result that the region is not contiguous in the genomes of dogs or rodents.
Partial sequencing of this gene in apes suggests that it is conserved at least as far back as the orangutan. Human chromosome 18 was compared with its chimpanzee counterpart, chromosome 18 (ref. 16).
The average sequence divergence is 1.25%, which is close to the genome average.
On a larger scale, the human chromosome 18 karyotype differs from its great ape counterparts by a human-specific pericentric inversion with a human-specific inverted duplication.
As a consequence, human 18p corresponds to the proximal region of chimpanzee 18q.
Since large-scale chromosomal rearrangements can facilitate speciation (refs 19, 20), this inversion may have played a role in hominin evolution.
Finally, we attempt to explore the still mysterious nature of conserved non-protein-coding sequences.
The recent comparison of the human and mouse genomes (ref. 4) led to the surprising discovery that ~5% of the human genome shows evolutionary conservation greater than the background rate.
Similar results have been observed in comparisons between the human and rat genomes.
As only 1-2% of the human genome consists of protein-coding exons, this indicates that the majority of the human sequence under purifying selection does not encode proteins.
In principle, these non-protein-coding sequences could be associated with protein-coding genes, such as elements that directly or indirectly regulate their expression, or they could be independent of protein-coding genes, such as elements that play a structural role in chromosomal architecture or that encode RNA genes.
We calculated the overall proportion of bases on each chromosome that are under purifying selection and partitioned this proportion into protein-coding and non-protein-coding sequence.
The computational analysis closely followed that used in recent mammalian comparisons.
The proportion of total sequence under selection and of protein-coding sequence under selection was compared with the proportion of coding sequence for each human chromosome.
Chromosome 18 contains a low overall proportion of sequence under selection, but this is almost entirely explained by its low coding density.
Approximately 4.2% of the bases on chromosome 18 appear to be under purifying selection, consisting of 0.6% in the exons of protein-coding genes and 3.6% in non-protein-coding elements.
The proportion of non-protein-coding sequence under selection is typical for human chromosomes.
Note that chromosomes 19 and 22 are atypical in this analysis because numerous local gene family expansions make orthology assignment difficult.
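The partition reported above for chromosome 18 can be checked with simple arithmetic. The sketch below uses only the percentages quoted in the text; the variable names and the derived share of selected sequence that is non-protein-coding are introduced here for illustration and do not appear in the paper.

```python
# Percentages of chromosome 18 bases under purifying selection, from the text.
CODING_UNDER_SELECTION = 0.6     # in exons of protein-coding genes
NONCODING_UNDER_SELECTION = 3.6  # in non-protein-coding elements

total_under_selection = CODING_UNDER_SELECTION + NONCODING_UNDER_SELECTION

# Derived quantity: share of the selected sequence that does not encode protein.
noncoding_share = NONCODING_UNDER_SELECTION / total_under_selection

print(f"total under selection: {total_under_selection:.1f}%")
print(f"non-protein-coding share of selected sequence: {noncoding_share:.0%}")
```

On these figures, roughly six in every seven selected bases on chromosome 18 lie outside protein-coding exons.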
Since chromosomes vary widely in size, we repeated the analysis in 5 Mb windows across the human genome. Although there is more scatter in the data, the overall conclusion is very similar.
Notably, the average proportion of non-protein-coding sequence under selection in a window is ~3.8%, and it is slightly negatively correlated (R² = 0.08) with the proportion of coding sequence in the window.
The analysis shows that the density of conserved sequences that do not code for proteins is largely independent of the density of genes that code for proteins.
It is interesting to note that examination of aligned noncoding sequences between humans and chickens showed a negative correlation with coding content, and that a study of highly conserved noncoding sequences in intergenic regions of human chromosome 21 did not identify a tight coupling with the starts and ends of genes (refs 24, 25).
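The 5 Mb window analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the computation used in the paper: the helper names `window_fractions` and `r_squared` are introduced here, annotations are assumed to be non-overlapping half-open intervals, and the toy coordinates are made up.

```python
def window_fractions(chrom_length, intervals, window=5_000_000):
    """Fraction of each fixed-size window covered by the given intervals.

    Intervals are half-open (start, end) pairs assumed not to overlap.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    covered = [0] * n_windows
    for start, end in intervals:
        pos = start
        while pos < end:
            idx = pos // window
            piece_end = min((idx + 1) * window, end)  # clip at window edge
            covered[idx] += piece_end - pos
            pos = piece_end
    # The last window may be shorter than the others.
    sizes = [min(window, chrom_length - i * window) for i in range(n_windows)]
    return [c / s for c, s in zip(covered, sizes)]


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Toy example: a 20-base "chromosome" cut into 5-base windows.
coding = window_fractions(20, [(0, 2), (4, 6), (12, 13)], window=5)
conserved_noncoding = window_fractions(20, [(2, 3), (7, 9), (16, 18)], window=5)
print(coding, conserved_noncoding)
print(r_squared(coding, conserved_noncoding))
```

With real data, `coding` would hold the coding fraction of each 5 Mb window and `conserved_noncoding` the fraction of non-protein-coding sequence under selection. An R² near zero, as reported in the text, would indicate that the two quantities vary largely independently.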
What is the nature of non-protein coding elements?
First, the elements could encode transcripts that are not translated into proteins, such as small RNA genes or large regulatory RNAs.
Second, they could play a structural role, with a constant density of these elements necessary to maintain chromosomal structure independent of gene density.
Such structural elements could be evolutionarily essential for the maintenance of a region, but could be dispensable if the entire region were eliminated.
This could explain the recent observation in mice that a 1 Mb deletion in a gene desert containing highly conserved elements has no discernible phenotypic effect.
Third, the elements may be largely related to the regulation of protein-encoding genes, but their distribution may be inversely related to gene density.
It is possible that genes in gene-poor regions tend to have more elaborate regulatory controls, and this may partly explain the relative scarcity of genes in such regions.
In any case, it is clear that the finished sequence of the human genome will reveal many features of biological function and provide a firm foundation for future systematic analyses.