In the human genome, there is roughly one polymorphism every 1,000 bases, though in some regions of the genome this rate can quadruple. These Single Nucleotide Polymorphisms (SNPs) are one manifestation of genetic variation. When SNPs occur, they segregate according to recombination rates, the advantages or disadvantages of the mutation, and the population structure that exists and persists during the lifespan of the SNP. Following a genetic mixing event, for example, one initially sees entire chromosomes, or close to entire chromosomes, coming from each constituent population. As generations pass, recombination splits the SNP haplotype blocks into smaller pieces. The rate at which these blocks shorten therefore depends on the rate of recombination and the stability of the recombination product, so the length of conserved haplotypes can be used to infer the age of a mutation or whether it has been under selection. An important consideration, however, is that the rate of recombination is not uniform across the genome; rather, there are recombination hot spots that can skew estimates of haplotype age or selection. Because much of the recombination is concentrated in hot spots, the haplotype blocks between them stay longer than expected under a uniform model.
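The shrinking of haplotype blocks over generations can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the text: recombination is uniform along the chromosome, breakpoints accumulate along a single lineage as a Poisson process (one expected crossover per Morgan per generation), and every breakpoint ends a block. The function name `mean_block_length` and its parameters are hypothetical.

```python
import random

def mean_block_length(chrom_morgans, generations, rng, n_reps=2000):
    """Estimate the mean haplotype block length (in Morgans).

    Assumes a uniform recombination rate, so the gaps between breakpoints
    accumulated over `generations` meioses are exponential with rate
    `generations` per Morgan. Hot spots are deliberately ignored.
    """
    total_blocks = 0
    for _ in range(n_reps):
        position = rng.expovariate(generations)  # location of the first breakpoint
        blocks = 1
        while position < chrom_morgans:
            blocks += 1  # each breakpoint inside the chromosome starts a new block
            position += rng.expovariate(generations)
        total_blocks += blocks
    # Total simulated length divided by the total number of blocks
    return n_reps * chrom_morgans / total_blocks

rng = random.Random(1)
for g in (1, 10, 100):
    print(g, round(mean_block_length(1.0, g, rng), 4))
```

Under these assumptions the mean block length on a chromosome of length L Morgans after g generations is close to L / (gL + 1), so older mixing events leave shorter blocks. A non-uniform recombination map would change these numbers, which is the caveat raised above.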
Every position in the genome can be thought of as a tree when compared across individuals. Depending on where you look within the genome, the tree you obtain from one set of SNPs will differ from the tree you obtain from another. The trick is to use the available SNP data to infer the underlying trees, and then the overarching phylogenetic relationships. For example, the Y chromosome undergoes little to no recombination and thus can produce a highly accurate tree, as it is passed down from father to son. Likewise, we can look at mitochondrial DNA, which is passed down from mother to child. While these trees can be highly accurate, autosomal trees are confounded by recombination and thus predict phylogenetic relationships less accurately. Gene trees are best built from regions of low recombination, since recombination mixes trees. In general, there are about 1 to 2 recombinations per chromosome per generation.
Humans show linkage over about 10,000 base pairs, reflecting a history that goes back about 10,000 generations. Fruit fly linkage disequilibrium blocks, on the other hand, span only a few hundred bases. Fixation of an allele occurs over a time proportional to the size of the population. For a population of about 10,000, it takes on the order of 10,000 generations to reach that point. When a population grows, the effect of genetic drift is reduced. Curiously enough, the variation in humans looks like what would have arisen in a population of only 10,000.
If long haplotypes are mapped to genetic trees, approximately half of the depth is on the first branch; most morphological changes lie deep in the tree because there was more time to mutate. One simple model of mutation without natural selection is the Wright-Fisher neutral model, which uses binomial sampling: each generation is drawn at random from the allele frequencies of the previous one. In this model, a SNP will eventually either reach fixation (frequency 1) or die out (frequency 0).
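The binomial sampling step of the Wright-Fisher neutral model can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it assumes a diploid population of constant size with non-overlapping generations and no new mutation, and the function name `wright_fisher` and its parameters are hypothetical.

```python
import random

def wright_fisher(n_individuals, start_freq, max_generations, rng):
    """Simulate one neutral allele; return its frequency in each generation."""
    n_copies = 2 * n_individuals  # diploid: two allele copies per individual
    count = round(start_freq * n_copies)
    trajectory = [count / n_copies]
    for _ in range(max_generations):
        if count == 0 or count == n_copies:
            break  # allele has died out or reached fixation
        p = count / n_copies
        # Binomial sampling: each copy in the next generation carries the
        # allele independently with probability p (no selection).
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        trajectory.append(count / n_copies)
    return trajectory

rng = random.Random(0)
for run in range(5):
    traj = wright_fisher(50, 0.5, 10_000, rng)
    print(f"run {run}: ended at frequency {traj[-1]} after {len(traj) - 1} generations")
```

Repeating the simulation with larger values of `n_individuals` tends to lengthen the trajectories, which is consistent with fixation time scaling with population size and with drift being weaker in larger populations.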
In the human genome, there are 10-20 million common SNPs. This is less diversity than is found in chimpanzees, implying that humans are genetically closer to one another.
With this genetic similarity in mind, comparing human sub-populations can give information about common ancestors and suggest historical events. The similarity between two sub-populations can be measured by plotting the frequency of each SNP in one population against its frequency in the other; the more distant the populations, the more the points spread. The plot below, for example, shows the relative dissimilarity of European American and Native American populations along with the greater similarity of European American and Chinese populations. The plots indicate that there was a divergence in the past between Chinese and Native Americans, evidence for the North American migration bottleneck that has been hypothesized by archaeologists. The spread among different populations within Africa is quite large. We can measure spread with the fixation index (Fst), which describes the proportion of the total variance in allele frequency that lies between populations.
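As a rough illustration of how Fst summarizes this spread, the sketch below computes it for a single biallelic SNP from the allele frequencies in two populations, using the heterozygosity form Fst = (H_T - H_S) / H_T. It assumes two populations of equal size and ignores sampling error; the function name `fst_two_populations` is hypothetical, and published estimators include corrections that are omitted here.

```python
def fst_two_populations(p1, p2):
    """Fst for one biallelic SNP, given its allele frequency in two populations.

    Assumes equal population sizes and known (not estimated) frequencies.
    """
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity if the two populations were pooled
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # SNP is monomorphic overall, so there is no variance to partition
    # Mean expected heterozygosity within each population
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total

for p1, p2 in [(0.5, 0.5), (0.4, 0.6), (0.2, 0.8), (0.0, 1.0)]:
    print(p1, p2, round(fst_two_populations(p1, p2), 3))
```

Identical frequencies give an Fst of 0, a SNP fixed for different alleles in the two populations gives 1, and frequencies of 0.2 and 0.8 give approximately 0.36. Averaging over many SNPs gives a genome-wide measure of the scatter seen in the frequency plots.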
Several studies have shown that unsupervised clustering of genetic data can recover self-reported labels of ethnic identity. Rosenberg's experiment used a Bayesian clustering algorithm. The authors took a sample of 1000 people (50 populations, 20 people per population) and clustered them by their SNP data without tagging anyone with a population label, so that they could see how the algorithm would cluster individuals without knowledge of ethnicity. They tried many different numbers of clusters to find the optimal number. With 2 clusters, East Asians and non-East Asians were separated. With 3 clusters, Africans were separated from everyone else. With 4, East Asians and Native Americans were separated. With 5, the smaller sub-populations began to emerge.
When waves of humans left Africa, genetic diversity decreased; the small number of people in each group that left allowed serial founder events to occur. These serial founder events led to the formation of sub-populations with less genetic diversity. This founder effect is demonstrated by the fact that genetic diversity decreases with distance from Africa and that West Africans have the highest diversity of any human sub-population.