"Wow! I've never seen anything like that before," my colleague Chao-Qiang Lai exclaimed when examining output from his analysis of genome-wide association study (GWAS) data. He was looking for genetic markers influencing serum triglyceride (TG) levels as part of the GOLDN study, which is looking at the genetics of the response to lipid-lowering medication. Chao's preliminary analysis indicated that SNP rs2880301 was associated with TG levels with a p-value of 10^-218. He showed me some data from the scan of the Affymetrix 6.0 genotyping chip and we postulated that we could be looking at some type of CNV (copy number variant) or deletion, but the lack of minor allele homozygotes troubled us.
What intrigued us right from the start was that our colleagues at other institutions, who are also analyzing the GOLDN GWAS data, did not report this SNP in their initial findings. The dbSNP entry for rs2880301 indicates a C-to-T variant with an allele frequency of 0.24 in the four primary HapMap populations from the USA, Nigeria, China and Japan. No difference in allele frequency among populations means no sign of positive (or negative) selection on this variant. No, none indeed, as we were to learn later.
So Chao dug deeper into his data, and he and I shot ideas back and forth. After I suggested looking at sex, he saw that when the SNP and sex were together in the same model, the analysis did not complete. Then, looking at the individual genotypes, he saw that all the men had genotype CT and all the women CC. This was from a total of just over 800 subjects.
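A simple cross-tabulation shows why the joint model could not complete: if every genotype class contains only one sex, the SNP and sex terms carry the same information and the model cannot separate them. Here is a minimal sketch in Python, assuming hypothetical records of (sex, genotype) pairs. The function names and the toy data are illustrative only and are not taken from the GOLDN analysis.

```python
from collections import Counter

def crosstab(records):
    """Count subjects for each (sex, genotype) combination."""
    return Counter(records)

def perfectly_confounded(records):
    """Return True when sex and genotype map one-to-one onto each other.

    If the number of distinct (sex, genotype) pairs equals both the number
    of distinct sexes and the number of distinct genotypes, each sex has
    exactly one genotype and vice versa.
    """
    pairs = set(records)
    sexes = {sex for sex, _ in pairs}
    genotypes = {genotype for _, genotype in pairs}
    return len(pairs) > 1 and len(pairs) == len(sexes) == len(genotypes)

# Toy data (made up): every man is called CT, every woman CC.
toy = [("M", "CT")] * 4 + [("F", "CC")] * 5

for (sex, genotype), n in sorted(crosstab(toy).items()):
    print(sex, genotype, n)
print("perfectly confounded:", perfectly_confounded(toy))
```

A table with empty off-diagonal cells like this one is a quick warning sign worth checking before fitting any model that includes both terms.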
OK, time for me to step in and see where this "SNP" maps in the genome. My first query was the flanking sequence supplied by Affymetrix. This 33-bp segment maps nearly perfectly to two places: chromosome 13, within intron 1 of the TPTE2 gene, in agreement with both dbSNP and Affymetrix's annotation of the SNP, and, curiously, a spot on the Y chromosome. (TPTE2 is a membrane-associated phosphatase that acts on the 3-position phosphate of inositol phospholipids and could be argued to be relevant to TG biology.) The only residue not matching is the "polymorphic" base of the "SNP": a C is found on chr13 and a T on Y. Thus, the SNP becomes a marker of sex, and Chao was right: it is a type of deletion (females carry no Y chromosome), but a deletion he had not envisioned.
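The comparison of the two hits comes down to lining up the matching segments and listing where they differ. Here is a rough sketch in Python, assuming the two sequences are already aligned to the same length without gaps. The sequences below are short made-up placeholders, not the real 33-bp Affymetrix flanking sequence, and `mismatches` is a hypothetical helper.

```python
def mismatches(seq_a, seq_b):
    """List (position, base_a, base_b) wherever two aligned sequences differ.

    Positions are 0-based. Both sequences must have the same length,
    i.e. they are assumed to be aligned without gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]

# Made-up toy sequences standing in for the chr13 and Y hits.
chr13_like = "ACGTTGCACGA"
chrY_like = "ACGTTGTACGA"

print(mismatches(chr13_like, chrY_like))  # [(6, 'C', 'T')]
```

A single mismatch sitting exactly at the annotated "polymorphic" position is the pattern described above.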
Is rs2880301 then a marker for gender? Not really. I compared the genomic regions where the homologous SNP sequences were found on both chromosomes, extending over 6 kbp in each direction. I saw a region of high sequence identity between chr13 and Y, over 96%, across a ~5-kbp segment. Running RepeatMasker indicated that rs2880301 falls within an L1 LINE, a common repeat element. Thus, while it is intriguing that an array of repeats (70% of the 13-kbp segment of chr13 is masked by RepeatMasker) is conserved, and in the same order, between chromosomes 13 and Y, SNP rs2880301 is not really a SNP. All subjects are C on chr13 and all Y chromosomes are T.
What we then had in our data were five genotypes: CC on chr13 for all women, CC on chr13 for all men and T on Y. Thus, the "allele frequencies" of C between 0.75 and 0.80 and T between 0.20 and 0.25 seen by us and others, including the HapMap data, roughly correspond to populations that are half to slightly more than half women.
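The arithmetic behind those apparent allele frequencies is easy to check. Here is a small sketch in Python, assuming the chip calls every woman CC and every man CT, as we saw in our data. The function name is hypothetical.

```python
def called_allele_freqs(frac_women):
    """Apparent (C, T) allele frequencies for a given fraction of women.

    Assumes every woman is called CC and every man CT, so each subject
    contributes two called alleles and only men contribute a T.
    """
    if not 0.0 <= frac_women <= 1.0:
        raise ValueError("frac_women must be between 0 and 1")
    frac_men = 1.0 - frac_women
    t_freq = frac_men / 2.0  # one T among the two alleles called per man
    c_freq = 1.0 - t_freq
    return c_freq, t_freq

for frac_women in (0.5, 0.6):
    c_freq, t_freq = called_allele_freqs(frac_women)
    print(f"women={frac_women:.0%}  C={c_freq:.2f}  T={t_freq:.2f}")
```

With half the subjects female the apparent T frequency is 0.25, and with 60% female it is 0.20, which brackets the range quoted above.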