This web page was produced as an assignment for an undergraduate course at Davidson College.

Article Link

The Simons Genome Diversity Project: 300 genomes from 142 diverse populations


The 1000 Genomes Project was carried out to detail human genetic diversity by sequencing at least 1,000 human genomes (it ultimately sequenced 2,504 individual genomes). While the number of individuals included in that study gives the data set depth, only 26 distinct human populations were represented. In this research, Mallick et al. aimed to deepen the understanding of human genetic diversity by sequencing the genomes of an additional 300 individuals from 142 distinct human populations. From these data, the researchers estimated the genetic drift across populations and the time of divergence of populations from an African common ancestor. They also aimed to answer questions about the divergence of Australian, New Guinean, and Andamanese populations from an African common ancestor and about the relative mutation rates of African and non-African subgroups.

Understanding the extent of human genetic diversity is a critical step in studying human genetic disease. Genome sequencing to identify disease-related mutations is an increasingly useful tool in medicine. However, to identify a "mutation", an individual genome is compared to a reference, and every base pair that does not match the reference is considered a mutation. Because much of the variation between genomes reflects normal genetic diversity rather than disease, knowledge of this sequence variation is critical for identifying the mutations that actually contribute to disease. For this reason, continuing to expand on the 1000 Genomes Project and improving the reference genome to represent individuals of all populations are critical steps in extending the coverage of genomic medicine across populations.
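
To make the reference-comparison idea concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the sequences are short toy strings that are already aligned, and the function name `candidate_variants` and the `known_benign_positions` set are hypothetical, standing in for a catalog of common, non-disease variation such as the one this kind of project helps build.

```python
def candidate_variants(reference, sample, known_benign_positions=frozenset()):
    """List mismatches between an aligned sample and the reference.

    Positions in known_benign_positions (hypothetical catalog of common
    population variation) are skipped, leaving candidates for follow-up.
    Returns tuples of (position, reference_base, sample_base).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base and i not in known_benign_positions
    ]


# Toy sequences, invented for illustration only.
reference = "ACGTACGTAC"
sample = "ACGAACGTTC"

all_differences = candidate_variants(reference, sample)
after_filtering = candidate_variants(reference, sample, known_benign_positions={3})
```

Without the catalog, both mismatching positions would be flagged; once position 3 is known to be ordinary population variation, only the other difference remains as a candidate. A more diverse catalog therefore means fewer false leads for individuals from under-sampled populations.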

Explanation of figures:

Figure 1:

Figure 1a: This figure is a neighbor-joining tree in which subpopulations of the African, American, East Asian, Oceanian, South Asian, and West Eurasian groups are placed into a phylogenetic tree based on genomic sequence similarity. The tree was constructed by starting with the Khoe-San African group, which is believed to be the first distinct African population to diverge from a common ancestor (Schlebusch et al., 2012). This group is believed to have diverged into a distinct genetic population before the migration out of Africa and is therefore commonly used as a stand-in for the common ancestor of African and non-African populations. The figure shows the divergence of populations from the Khoe-San population measured by individual nucleotide changes (pairwise divergence). It shows a divergence from African to West Eurasian, then to South Asian and East Asian simultaneously, and finally to Oceanian and American.
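
As a rough illustration of the pairwise divergence that feeds a neighbor-joining tree, the sketch below computes the fraction of differing sites between every pair of aligned sequences. The sequences and the names `pop_A`, `pop_B`, and `pop_C` are invented for this example, and the tree-building step itself is not shown.

```python
from itertools import combinations


def pairwise_divergence(seq_a, seq_b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)


def divergence_matrix(samples):
    """Pairwise divergence for every unordered pair of named samples."""
    return {
        (name_a, name_b): pairwise_divergence(samples[name_a], samples[name_b])
        for name_a, name_b in combinations(sorted(samples), 2)
    }


# Toy data, invented for illustration only.
toy_samples = {
    "pop_A": "ACGTACGTAC",
    "pop_B": "ACGTACGTAT",
    "pop_C": "ACGAACTTAT",
}
matrix = divergence_matrix(toy_samples)
```

In this toy data set, `pop_A` and `pop_B` differ at the fewest sites, so a neighbor-joining algorithm working from this matrix would be expected to join them first.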

Figure 1b: This figure details the ratio of X to autosome diversity in different populations. The X to autosome diversity ratio can be used to estimate the selection pressure on X-linked vs. autosomal genes (Hammer et al., 2010). We note that in this figure, the X to autosome diversity ratio is lower in non-Africans than in Africans and is lower in Pygmy populations than in other African populations. One offered explanation for this is male-driven admixture. Male-driven admixture occurs when a male from another genetic population reproduces with a female from a given population. In this case, because males contribute one X chromosome to their daughters and only a Y chromosome to their sons, the genetic diversity introduced on the X chromosome by cross-population reproduction should be lower than on the autosomes, where all offspring receive one copy of each chromosome from the father with the different genetic background. Therefore, populations that experience an increased incidence of cross-population reproduction have lower X to autosome diversity ratios. A potential explanation for Pygmy populations having a lower X to autosome diversity ratio is that this group consisted largely of hunter-gatherers, so the population was very mobile and may have mixed frequently with genetically distinct populations.
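
For context on what the ratio would look like without any of these effects, the sketch below uses the standard textbook formulas for effective population size on the autosomes and on the X chromosome given the numbers of breeding males and females. This is a simplified neutral model and not the analysis used in the paper: it ignores selection and admixture, and the function name and population sizes are made up for illustration.

```python
def expected_x_to_autosome_ratio(n_males, n_females):
    """Expected X/autosome diversity ratio under a simple neutral model.

    Uses the classical effective-size formulas:
      autosomes: Ne_A = 4 * Nm * Nf / (Nm + Nf)
      X:         Ne_X = 9 * Nm * Nf / (4 * Nm + 2 * Nf)
    Diversity is assumed proportional to effective size (same mutation rate).
    """
    ne_autosome = 4 * n_males * n_females / (n_males + n_females)
    ne_x = 9 * n_males * n_females / (4 * n_males + 2 * n_females)
    return ne_x / ne_autosome


equal_sexes = expected_x_to_autosome_ratio(1000, 1000)   # baseline of 3/4
more_males = expected_x_to_autosome_ratio(1000, 500)     # male-biased breeding
more_females = expected_x_to_autosome_ratio(500, 1000)   # female-biased breeding
```

With equal numbers of breeding males and females the expected ratio is 0.75, because there are three X chromosomes for every four autosomes in the population. In this simple model, a male-biased breeding population pushes the ratio below 0.75 and a female-biased one pushes it above, which gives a baseline for reading the departures shown in the figure.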

Figure 1c: A heatmap of Neanderthal ancestry across sample populations. Populations ranged from 0 to 3% Neanderthal ancestry. We note that the highest percentage of sequence shared with Neanderthals is found in populations from East Asia. This agrees with what is known about Neanderthals: they lived in Eastern Europe and Northwest Asia, and populations migrating out of Africa likely intermixed with them before continuing to migrate into East and Southeast Asia.

Figure 1d: A heatmap of Denisovan ancestry across sample populations. Here we see that higher sequence similarity to Denisovans is found in Southeast Asia and Oceania. Denisovans are thought to have inhabited a region overlapping with and east of the Neanderthal range, from North to Southeast Asia. Thus, the higher percentage of Denisovan sequence similarity shown in this figure is good verification that the sampled individuals have long-rooted ancestry in the region, as Denisovan sequence similarity has been linked to these regions in other studies (Sankararaman et al., 2016).

Figure 2:

Figure 2a-c: These figures examine the cross-coalescence rate of various populations over time. Cross-coalescence rate is essentially the rate of genetic drift over time, measured by comparing the sequence similarity of specific genes, SNPs, STRs, etc. between populations and calculating a time of divergence from a most recent common ancestor (MRCA). In this way, researchers can identify trends in divergence from an MRCA among different populations over time based on current genomic sequence. A higher cross-coalescence rate suggests that the population is moving towards a common ancestor quickly (when moving from left to right on the x axis). Therefore, populations with higher cross-coalescence rates at lower kya (thousands of years ago) will have diverged less from the MRCA. In figure 2a, we note that the populations converge around 200 thousand years ago. Because this figure compares present-day African populations to a number of other genetically distinct populations, it was from these data that the researchers proposed that the most recent common ancestor of present-day human populations was living around 200,000 years ago. Figure 2b shows the cross-coalescence rate of present-day African hunter-gatherer populations, suggesting that populations diverged within Africa between 50 and 100 thousand years ago, with an MRCA living around 100 thousand years ago. Figure 2c shows the cross-coalescence rate of non-Africans over time, demonstrating that the genetic divergence among these groups occurred largely within the last 50 thousand years. It is predicted that much of this genetic diversity arose during the time that these populations were migrating out of Africa (estimated at around 50,000 years ago).
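
To show how a curve like these can be turned into a divergence estimate, here is a small Python sketch. It assumes the relative cross-coalescence rate as defined by the MSMC method (twice the between-population coalescence rate divided by the sum of the two within-population rates); the text above does not name the method, so this is an assumption. The rates are invented toy numbers, and the 0.5 threshold is a commonly used convention rather than something taken from the figure.

```python
def relative_cross_coalescence(rate_between, rate_within_1, rate_within_2):
    """Relative cross-coalescence rate for one time interval.

    Values near 1 suggest the two populations behave as one population;
    values near 0 suggest they are fully separated.
    """
    return 2 * rate_between / (rate_within_1 + rate_within_2)


def estimated_split_time(times_kya, relative_rates, threshold=0.5):
    """First time point, going back in time, at which the rate reaches the threshold."""
    for time_kya, rate in zip(times_kya, relative_rates):
        if rate >= threshold:
            return time_kya
    return None  # the curve never reaches the threshold


# Toy values, ordered from recent to ancient (thousands of years ago).
times_kya = [10, 50, 100, 200]
between = [0.0, 0.2, 0.6, 1.0]
within_1 = [1.0, 1.0, 1.0, 1.0]
within_2 = [1.0, 1.0, 1.0, 1.0]

relative_rates = [
    relative_cross_coalescence(b, w1, w2)
    for b, w1, w2 in zip(between, within_1, within_2)
]
split_kya = estimated_split_time(times_kya, relative_rates)
```

Read from left to right, the toy curve rises from 0 towards 1, and the point where it crosses 0.5 is taken as the approximate split time of the two populations.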

Figure 2d-f: Figures 2d, 2e, and 2f display the effective population size of the populations shown in figures 2a, 2b, and 2c, respectively. Effective population size is an estimate of the number of individuals in a population that are able to contribute to the next generation by producing offspring (Kliman, 2008). The effective population size is estimated using the pairwise sequentially Markovian coalescent (PSMC) model, which estimates the number of reproducing individuals required to produce the observed genetic drift in a population over time. What we notice in each of these figures (d-f) is that the time at which the population sizes converge corresponds to the predicted time of population divergence from figures 2a-c. For example, in figure 2d, the population sizes appear to be the same around 200,000 years ago, which is the same time at which figure 2a shows the populations beginning to diverge, suggesting that at this time the populations were the same size because they had converged to a single population.
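
PSMC itself is too involved to reproduce here, but the link between genetic diversity and effective population size can be sketched with the standard constant-size relationship theta = 4 * Ne * mu for diploid autosomes. This is a much simpler calculation than PSMC, which infers how Ne changes through time. The diversity value and the per-generation mutation rate below are assumed, illustrative numbers and are not taken from the paper.

```python
def effective_population_size(theta_per_site, mutation_rate_per_site):
    """Constant-size estimate of Ne from theta = 4 * Ne * mu (diploid autosomes)."""
    if mutation_rate_per_site <= 0:
        raise ValueError("mutation rate must be positive")
    return theta_per_site / (4 * mutation_rate_per_site)


# Assumed, illustrative inputs (not values from the paper).
theta = 0.001   # per-site nucleotide diversity
mu = 1.25e-8    # mutations per site per generation

ne_estimate = effective_population_size(theta, mu)
```

With these assumed inputs the estimate is on the order of 20,000 individuals. Under this relationship, a population with half the diversity would be assigned half the effective population size, which is why changes in diversity over time can be read as changes in population size in figures 2d-f.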

Figure 3: