Friday, February 18, 2011

10 years with the human genome

This week marks the 10-year anniversary of the publications of a (nearly) completed human genome sequence. Much has been made already of this passage of time, as well as what we can look forward to in the next ten years.

What I thought I would do in this space is share a little personal story on my connection with this achievement. In late spring of 2001, I happened to search the Internet and PubMed for my name because I wanted to check whether any conference presentations or publications from laboratories in which I had previously worked had been released. To my surprise, I found a website in Japan with a title something like "list of authors," which contained a collection of names of former colleagues from my days in the Genome Sequencing Center at Cold Spring Harbor Laboratory. That seemed strange, so I investigated a bit and learned that we were included on the Nature paper describing the human genome - along with some 5000+ other authors (hence the special listing on this website, and no hits in PubMed). Needless to say, that was quite a thrill. I quickly updated my CV to include this landmark publication.

Back in 1997 to 1999, as the publicly funded project to sequence the human genome was ramping up and dollars were dangled in front of genome centers around the USA and the globe, we at CSHL were trying to deposit as much finished sequence into GenBank as possible. Monthly and quarterly totals of base pairs deposited were key to securing grant money. An introduction to all this came within my first two weeks as the Computational Fellow (post-doc) with Dick McCombie when I was told I would be leading the analysis segment of his Genome Sequencing course. I learned the ins and outs of a new computer system and new software tools (I came from a cell biology lab) just in time to teach the students. We worked hard during that 2-week course to sequence a 143-kbp BAC clone containing some critical HIV/AIDS-relevant genes: CCR2, CCR5 and CCR6. You can view the sequence entry I deposited to GenBank here, accession U95626.

From this initial BAC, we worked on many more to try to show that we could put high-quality sequence data together and to get as much sequence finished as possible. Of course, our main funding was to contribute to the Arabidopsis thaliana genome and so the human projects (BACs and cosmid/fosmid clones) took second priority. But we did contribute enough sequence to warrant inclusion on the paper and Dick was kind enough to remember everyone who had passed through his lab during those years.

Wednesday, February 16, 2011

PCSK, cholesterol homeostasis and osteoporosis

Today, I saw a news release on a series of articles concerning the PCSK gene family published by Dr. Nabil Seidah's group at the Institut de recherches cliniques de Montréal. The combined body of work suggests that the PCSK enzymes could influence health from cholesterol homeostasis to osteoporosis.

PCSK stands for proprotein convertase subtilisin/kexin. This means that enzymes of this family convert a larger proprotein into a smaller functional entity. PCSK9 is certainly the best-publicized member of this family, with much known about genetic variants associating with myocardial infarction, heart disease and plasma lipid levels, particularly LDL-cholesterol. PCSK9 interacts with the LDL-cholesterol receptor.

PCSK9 also shows decreased expression in a circadian rhythmic fashion in mouse liver depleted for Mir122. This comes from a report by Gatfield, Schibler, et al. 2009 Genes Dev. 23:1313-26.

Here are some other interesting bits about members of the PCSK gene family.

PCSK2 - homolog of nematode gene C51E3.7 which is involved in determination of adult lifespan. SNPs in PCSK2 may increase susceptibility to myocardial infarction and type 2 diabetes, which are both age-related afflictions. A QTL for HDL has been mapped to the vicinity of Pcsk2 in mouse: Hdlq19.

Interestingly, some of my own work on literature mining with the Biomax BioLT tool indicated that both PCSK7 and PCSK1N have relationships with HDL-cholesterol. A QTL for HDL at the PCSK7 locus has been described.

Heterozygous knock-out mice for Pcsk1 show increased adipose mass. Transgenic expression in mice of Pcsk1n driven by an actin promoter yielded adult-onset obesity. This gene, in human, was recently proposed as a candidate obesity/type 2 diabetes (T2DM) gene by Chang Hsu (2011 Diabetes, in press) but did not pass their test for Fst measures of positive selection.

An interesting paper by Tiffin, Hide, et al. (2006) suggested that PCSK2 and PCSK7 are candidate obesity and T2DM genes.

Certainly interesting phenotypes here. Keep your eyes on these genes.

Thursday, February 10, 2011

Transcription factor databases

The following is a guest-post by my colleague Jacqueline Lane (with some editing by me). She has been interested in identifying novel transcription factors (TF) involved in obesity and genetic variants in their binding sites as well as in the TF genes themselves.

Jackie has put together a list of TF-gene interaction databases she is willing to share here. There are three types of data:

1) TF-gene interaction
This is a compilation of databases with TF-gene interaction data. This might be of the most interest because it lists many databases. See There is also the oRegAnno database, which is easy to view if you click on the tfview link on the right-hand side; see Lastly, TF-gene binding data can also be found at

2) TF-TF interactions
This is a database of TF-TF co-activators and co-repressors (TFs that direct transcription of a gene in concert). This helps with determining tissue/temporal specific combinatorial regulation. See

3) TF co-activators
The TF co-factor database lists proteins that bind to TFs, but not directly to DNA. These protein interactions can give a better picture of the full interaction. Find the data at

Friday, February 4, 2011

A water flea's phenotypic plasticity and HDL-cholesterol in humans

This week marked the announcement of the completion of the genome sequence of the water flea Daphnia pulex. I remember peering through a microscope in my first biology classes, amazed at the activity and diversity of structures of these creatures. Now, the 200-megabase genome has been deduced. One of the startling discoveries is that the small D. pulex genome is packed full with more than 30000 genes, far exceeding the number in the human genome. Some 13000 genes were identified in the paper by Colbourne, et al. as paralogs - arising from gene duplication.

Here is part A of figure 1 from the paper illustrating major differences in gene numbers between D. pulex and other animal genomes.

So, why all these paralogous genes? Well, the upshot here is one of likely gene duplication as a means to build an inventory of possibilities for a wide range of phenotypes. This scenario is spelled out rather nicely by Dieter Ebert in an accompanying overview. The water flea is remarkably able to sense its predators in a very precise manner and in turn activate any of a number of genes that direct expression of defense mechanisms. Some of these are structural features such as protective helmets, tail spines and neck teeth. Herein is the water flea's phenotypic plasticity - different environments induce expression of different subsets of the vast genome for the purpose of evading the predator. A gene for each bad guy swimming nearby.

Now, let's consider humans and their environment. In particular, I'd like to offer the example of diet, which for most people is high in fat and sugar, and the important blood lipid HDL-cholesterol, so-called "good cholesterol." Regular readers of this blog know that our research expends a good deal of effort in describing gene-environment interactions (GxEs). This is a situation where one allele of a genetic variant like a SNP associates with disease risk only when a given environmental factor passes a certain threshold. We have compiled a series of these GxEs for phenotypes pertinent to metabolic syndrome - phenotypes such as body weight, BMI, blood lipids, blood pressure, glucose and insulin levels, as well as heart disease and type 2 diabetes risk. Those data are available here. If you mine those data, you'll notice that there are by far more GxEs reported in the literature for HDL-cholesterol than for any other commonly measured phenotype.
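To make the idea of a GxE a little more concrete, here is a minimal Python sketch. It is not taken from our compiled data or from any published analysis: the function name, the 35% threshold and the toy records are all made up for illustration. It simply compares HDL between carriers and non-carriers of a hypothetical risk allele, separately below and above a dietary fat threshold.

```python
from statistics import mean

def allele_effect_by_exposure(records, threshold):
    """Mean phenotype difference (carriers minus non-carriers),
    computed separately below and at/above an exposure threshold.

    records: iterable of (carries_allele, exposure, phenotype) tuples.
    Each stratum must contain at least one carrier and one non-carrier,
    otherwise statistics.mean raises StatisticsError.
    """
    strata = (
        ("low", lambda exposure: exposure < threshold),
        ("high", lambda exposure: exposure >= threshold),
    )
    effects = {}
    for label, in_stratum in strata:
        carriers = [p for c, e, p in records if c and in_stratum(e)]
        others = [p for c, e, p in records if not c and in_stratum(e)]
        effects[label] = mean(carriers) - mean(others)
    return effects

# Made-up toy records, NOT real measurements:
# (carries risk allele, % of energy from fat, HDL in mg/dL)
toy_records = [
    (True, 25, 52), (False, 25, 52),
    (True, 28, 50), (False, 28, 50),
    (True, 38, 40), (False, 38, 48),
    (True, 42, 38), (False, 42, 46),
]

print(allele_effect_by_exposure(toy_records, threshold=35))
```

In this toy set the allele makes no difference on the lower-fat diets and shows up only above the threshold, which is the pattern a GxE describes. A real analysis would of course use a regression model with an interaction term and proper statistics rather than a difference of means.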

Thus, it seems to me that the water flea has a lot of very similar genes, mostly in paralogous pairs, to cope with slight changes in its environment. Humans do not. Eating a sub-optimal diet will likely drive HDL levels down (unhealthy). There are also age-related, natural declines in HDL. At the same time, there are a number of variants in our genomes that show an environmental sensitivity with respect to HDL - there are many ways to activate a program of increased risk (by lowering HDL levels). And similar cases can be presented for LDL, triglycerides, total cholesterol, blood pressure, waist circumference, body weight, etc. So, while it may take years of indulging in a sub-optimal diet before an adverse event such as diabetes or atherosclerosis is diagnosed, perhaps it is our (relatively) small number of genes, each with a collection of variants, that sets us up for sensitivity to what we put in our mouths. If we can't eat right, then perhaps more genes would be the answer to a better defense against a poor diet.

Wednesday, February 2, 2011

Synonymous SNPs are not so synonymous

Early this week, an excellent paper by Brest, Darfeuille-Michaud, Hofman, et al. in Nature Genetics provided a prime example of going beyond genome-wide association studies (GWAS) to dissect the functional consequences of a genetic variant associated with disease risk. In so doing, the authors provide another case of synonymous SNPs not being so synonymous.

Here are what I find to be the key points of the research presented in this report:

1. The exonic SNP c.313C>T (rs10065172) is in perfect linkage disequilibrium (r2=1.0) with a deletion polymorphism of 20 kbp mapping upstream of the IRGM gene. This deletion has been strongly associated with Crohn's disease in several European populations or those with European ancestry. What is important here is that a SNP can act as a tag or proxy for the deletion.

2. The c.313C>T variant alters codon 105 of the IRGM coding sequence from CTG to TTG. Both codons call for leucine upon translation and so this SNP is classified as synonymous. The authors speculate that there could be allele-specific consequences to protein expression. Based on two other reports from other groups, the authors decided to investigate whether allele-specific interactions between the IRGM transcript and a microRNA could be at play here. They observed a predicted binding between the IRGM transcript and microRNA-196 (miR-196, comprising miR-196A, encoded by the A1 and A2 genes, and miR-196B) that was affected by the variation at SNP c.313C>T. Importantly, they show that not only is the miR-196-IRGM interaction real but that expression of miR-196 is elevated in inflammatory epithelia from Crohn's sufferers. These results underscore the point that synonymous SNPs are not so synonymous. The different alleles can exhibit different functions that have health consequences.

3. From GWAS to function. Although this paper does not report original results from GWAS, it builds on those results in an important way. There are four key papers reporting GWAS results for IRGM and Crohn's disease. These papers are by Parkes et al (2007), the Wellcome Trust Case Control Consortium (2007), Barrett et al (2008) and Franke et al (2010). So, in just over three years from the initial discovery of association of this once rather unremarkable gene (only 5 papers were published on IRGM prior to the initial GWAS report of 2007, most reporting a role in autophagy), we now have a much deeper understanding of how a synonymous variant leads to the disease condition.
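For readers who like to check the arithmetic behind points 1 and 2, here is a small Python sketch. It is my own illustration, not the authors' method: the function names are made up, the codon table is only a tiny excerpt of the standard genetic code, and any haplotype counts passed to the r2 function are hypothetical. The first two functions confirm that coding position 313 falls in codon 105 and that CTG and TTG both encode leucine; the third computes r2 between two biallelic loci from the four haplotype counts using the usual definition r2 = D^2 / (pA(1-pA)pB(1-pB)).

```python
# Tiny excerpt of the standard genetic code: only what this example needs.
CODON_TABLE = {"CTG": "Leu", "TTG": "Leu", "CCG": "Pro"}

def codon_number(cds_position):
    """1-based codon index containing a 1-based coding-sequence position."""
    return (cds_position - 1) // 3 + 1

def is_synonymous(ref_codon, alt_codon):
    """True when both codons translate to the same amino acid."""
    return CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]

def ld_r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic loci from the four haplotype counts.

    A/a are the alleles at the first locus (e.g. the SNP) and B/b the
    alleles at the second (e.g. deletion present/absent). Both loci must
    be polymorphic, otherwise the denominator is zero.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = n_AB / total - p_A * p_B          # disequilibrium coefficient D
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))

print(codon_number(313))            # 105
print(is_synonymous("CTG", "TTG"))  # True
```

When only the AB and ab haplotypes are ever observed (for example, hypothetical counts of 30, 0, 0 and 70), ld_r_squared comes out at 1 up to floating-point rounding. That is what perfect linkage disequilibrium means, and it is why the SNP can stand in as a proxy for the deletion.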

This is very nice work indeed and can be held up as an example of the success of GWAS in laying a foundation for getting at the mechanism of a disease.