Back in 2009, Reich et al. theorized that the current South Asian gene pool was basically made up of two founding genetic components: Ancestral North Indian (ANI) and Ancestral South Indian (ASI). The distilled ANI, they noted, was more similar to the genomes of modern Northwest Europeans than to those of the Adygei from the Caucasus. This is obviously out of whack with geography, but it does make sense based on what I've seen in my experiments on the Pakistani samples from the HGDP. Many of them, especially the Pathans, carry numerous segments, or haploblocks, that look North European. This gave me the idea to try to reconstruct the ANI genome from such fragments. The first chromosome of my composite sample, which I call the "ANI composite", is available for download here. It's a PLINK PED file in Illumina AB format with 19,261 SNPs.
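For anyone who wants to inspect a file like this outside PLINK, here is a minimal Python sketch of reading one PED line. It relies only on the standard PED layout (six leading columns for family ID, individual ID, father, mother, sex and phenotype, followed by two alleles per SNP). The function names and the sample line are made up for illustration, and "0" is assumed to be the no-call code, which is PLINK's default.

```python
from typing import List, Tuple


def parse_ped_line(line: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split one PED line into its six metadata fields and per-SNP allele pairs."""
    fields = line.split()
    meta, alleles = fields[:6], fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per SNP")
    # Pair up consecutive alleles: (a1, a2) for SNP 1, (a1, a2) for SNP 2, ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return meta, genotypes


def no_call_rate(genotypes: List[Tuple[str, str]], missing: str = "0") -> float:
    """Fraction of SNPs where at least one allele is a no-call."""
    if not genotypes:
        return 0.0
    n_missing = sum(1 for a, b in genotypes if a == missing or b == missing)
    return n_missing / len(genotypes)


if __name__ == "__main__":
    # Hypothetical four-SNP line in AB coding, not taken from the real file.
    example = "ANI ANI_composite 0 0 0 -9 A A A B 0 0 B B"
    meta, genotypes = parse_ped_line(example)
    print(meta[1], len(genotypes), no_call_rate(genotypes))
```

Run on the real file, the number of genotype pairs per line should match the SNP count in the accompanying MAP file.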
Below are several PCA plots featuring the "ANI composite", obviously not including the HGDP samples used to make it (see below). Overall, it most closely resembles my reference samples from Eastern Europe. I have to admit that I was very pleased to see it behaving like a set of genotypes from a real human subject across many dimensions of genetic variation. PCA is very sensitive to anomalies, such as unusually long runs of homozygosity, so the fact that my composite can pass for a normal sample on these plots is fantastic.
So how did I do this? Well, it wasn't very difficult, but a bit tedious, so I need a break before continuing. I used information from my earlier experiments with ADMIXMAP, HAPMIX and RHH Counter to locate and delineate North European-like segments in phased Pakistani HGDP samples. I phased the data myself with BEAGLE, in a pool of South Asian and Middle Eastern samples, so as not to bias the results of phasing and imputation towards Northern Europe. In order to keep the alleles in phase when loaded into PLINK, I duplicated the haplotypes, producing completely homozygous individuals out of each one. Then I created an ANI composite dummy with 100% no calls, and loaded the haplotypes into this sample with a Python script. The first to load were the Pathan haplotypes, followed by the Burusho. I chose individuals from these two groups to make up the backbone of the putative ANI genome because they always seem to come out most "North European" in my ADMIXTURE and PCA/MDS runs compared to other South Asians. The empty spaces were filled with haplotypes from the Brahui and Balochi. Below is a list of all the samples used:
The phased data and the "ANI" haplotypes used in this experiment are available on request from eurogenesblog [at] hotmail [dot] com. I welcome feedback and suggestions on how to improve my methodology. Admittedly, this was a test run, so it's unlikely to be perfect.
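The loading step described earlier can be sketched in Python roughly as follows. This is an illustration only, not the script used for the experiment. It assumes each donor haplotype is a string of A/B alleles, that the North European-like segments are already available as inclusive SNP index ranges, and that "0" marks a no-call. The post does not say how the two chromosome copies of the composite were populated, so the sketch fills a single haploid track. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

MISSING = "0"  # assumed no-call code (PLINK's default)

# One donor = (phased haplotype, inclusive (start, end) SNP index ranges
# flagged as North European-like). Both inputs are hypothetical here.
Donor = Tuple[str, Sequence[Tuple[int, int]]]


def duplicate_haplotype(haplotype: str) -> List[Tuple[str, str]]:
    """Turn one haplotype into a fully homozygous genotype list,
    so the phase survives being stored as unphased PED genotypes."""
    return [(allele, allele) for allele in haplotype]


def build_composite(n_snps: int, donors: Sequence[Donor]) -> List[str]:
    """Start from 100% no-calls and fill from donors in priority order.

    Earlier donors win: a position already filled is never overwritten.
    """
    composite = [MISSING] * n_snps
    for haplotype, segments in donors:
        for start, end in segments:
            for i in range(start, end + 1):
                if composite[i] == MISSING and haplotype[i] != MISSING:
                    composite[i] = haplotype[i]
    return composite


if __name__ == "__main__":
    # Toy data: a backbone donor first, then a gap-filling donor.
    backbone = ("AABBA", [(0, 1)])
    filler = ("BBBBB", [(1, 3)])
    print(build_composite(5, [backbone, filler]))
```

Passing the donors in the order Pathan, Burusho, Brahui, Balochi would reproduce the priority described in the post, with the last two groups only filling positions the first two left empty.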