Friday, November 5, 2010

ASHG 2010 conference notes - 5 Nov 2010

Notes from ASHG 2010 (American Society of Human Genetics)
Washington, D.C. 5 November 2010

E. Kang – Reliable eQTL mapping with F1 generations of inbred mice by measuring allele-specific differential expression

Inbred A:
nnnAnnnnnCnnnnnAnnnnnnGnnn (variant positions showing alleles)

Inbred B:
nnnTnnnnnGnnnnnAnnnnnnCnnn

Inbred C:
nnnAnnnnnGnnnnnTnnnnnnGnnn

Then, the inbred F1s:

AB F1:
nnnAnnnnnCnnnnnAnnnnnnGnnn – high expressor of a given gene
nnnTnnnnnGnnnnnAnnnnnnCnnn – low expressor

BC F1:
nnnTnnnnnGnnnnnAnnnnnnCnnn – low expressor
nnnAnnnnnGnnnnnTnnnnnnGnnn – high expressor

CA F1:
nnnAnnnnnGnnnnnTnnnnnnGnnn – high expressor
nnnAnnnnnCnnnnnAnnnnnnGnnn – high expressor

Thus, the possible causal alleles are the A at SNP 1 and the G at SNP 4.
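The elimination logic can be sketched in a few lines of Python. This is only an illustration of the reasoning in these notes, not the speaker's actual method. The function name `candidate_causal_alleles` and the dictionary layout are assumptions made for the example, and the haplotypes are the toy sequences from the F1 crosses above.

```python
# Toy haplotypes from the F1 crosses; labels are the expression phenotypes.
HAPLOTYPES = {
    "A": ("nnnAnnnnnCnnnnnAnnnnnnGnnn", "high"),
    "B": ("nnnTnnnnnGnnnnnAnnnnnnCnnn", "low"),
    "C": ("nnnAnnnnnGnnnnnTnnnnnnGnnn", "high"),
}


def candidate_causal_alleles(haplotypes):
    """Return (snp_number, allele) pairs where every high expressor carries
    the same allele and no low expressor carries it."""
    seqs = [seq for seq, _ in haplotypes.values()]
    # Variant positions are the columns that are not the placeholder 'n'.
    variant_cols = [i for i in range(len(seqs[0])) if seqs[0][i] != "n"]
    hits = []
    for snp_number, col in enumerate(variant_cols, start=1):
        high = {seq[col] for seq, label in haplotypes.values() if label == "high"}
        low = {seq[col] for seq, label in haplotypes.values() if label == "low"}
        if len(high) == 1 and not (high & low):
            hits.append((snp_number, next(iter(high))))
    return hits


print(candidate_causal_alleles(HAPLOTYPES))  # prints [(1, 'A'), (4, 'G')]
```

SNPs 2 and 3 drop out because the two high-expressing haplotypes carry different alleles there.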

They worked with 71 million SNPs from six F1 strains built from four parental lines.


S. Montgomery – eQTL discovery with RNAseq

Regulatory haplotypes found with HapMap3 data were essentially concordant with 1000G data. So, getting closer to the causal variant? Yes, he states, because p-values are getting stronger.

More rare variants were observed in outliers of expression of a given gene.

For RNAseq, look for many individuals with heterozygous haplotypes. The putative regulatory SNPs they discover lie anywhere from just upstream of the gene to a point within the gene. The magnitude: 60,000 with p-value < 0.05 and 10 or more RNAseq reads (across a total of 3500 genes).


P. ‘t Hoen – Expression association with fasting glucose levels

See their recent paper in Nucleic Acids Res 38:e165, entitled "Tissue-specific transcript annotation and expression profiling with complementary next-generation sequencing technologies."

~62% of transcript reads from blood samples encode hemoglobin. Still, 9562 genes are expressed at > 0.3 transcripts per cell.

SNP rs11605924 maps within intron 1 of CRY2 and associates with higher expression when plasma glucose is low – but this is a circadian rhythm gene, which makes things quite interesting.


V. Strumba – cis eQTLs across ten brain regions

170 humans – psychiatric disorders + controls

The region is 500 kbp upstream and downstream of the gene, including the gene, too. 45,000 SNP-gene expression pairs passed FDR of 0.05 in at least one brain region. 58% of SNP-gene expression pairs are specific to one of the ten brain regions tested.
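As a minimal sketch of the cis window described above, the check below tests whether a SNP falls inside the gene or within 500 kbp of either end. The function name and the coordinates in the usage lines are hypothetical, and the sketch assumes the SNP and gene are already known to be on the same chromosome.

```python
CIS_FLANK_BP = 500_000  # 500 kbp on each side of the gene, per the talk


def in_cis_window(snp_pos, gene_start, gene_end, flank=CIS_FLANK_BP):
    """True if the SNP lies within the gene or within `flank` bp of either end.

    Assumes the SNP and the gene are on the same chromosome and that
    gene_start <= gene_end.
    """
    return gene_start - flank <= snp_pos <= gene_end + flank


# Hypothetical coordinates, for illustration only.
print(in_cis_window(1_000_000, 1_400_000, 1_450_000))  # inside the upstream flank
print(in_cis_window(2_000_001, 1_400_000, 1_500_000))  # 1 bp past the downstream flank
```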


A. Dimas – Sex-specific eQTLs

After identification, they did follow-up in twins for replication.

An interesting example is SPO11, a gene with a sex-specific eQTL each for males and females. The two eQTL SNPs are ~760 kbp apart: the female SNP maps to PCK1 and the male SNP maps to RAB22A. Importantly, the eQTL is not observed when the sexes are pooled and analyzed together.


T. Zeller – Cardiovascular disease-associated eQTLs

Of 950 CAD-associated SNPs, 34 SNPs associated with expression at p [...] LIPA increases expression of LIPA, associates with lower HDL-C, and associates with lower systolic blood pressure. But there is no difference in expression in CAD subjects vs controls. However, a difference did appear in a meta-analysis of 21,428 CAD cases vs 38,361 controls.

LYZ encodes lysozyme. Lower expression of LYZ associates with CAD. They identified an intergenic SNP that associates with LYZ mRNA levels – rs11166777.


J. Curran – Selenoprotein S and cardiovascular disease risk

A SNP at position -105, changing G to A, associates with differential expression of the SELS gene when cells are treated with tunicamycin, an endoplasmic reticulum stressor, but shows no difference in mRNA levels under basal conditions. The G allele shows the higher expression.


E. Gamazon (abstract 195) – High proportion of transcripts associated with insulin sensitivity in fat and muscle are associated with eQTLs

SCAN is a SNP and CNV annotation database that they built and used in the following analyses.

Top GWAS hits are significantly enriched for eQTL SNPs (see Nicolae, Gamazon et al. 2010 PLoS Genet).

From 184 subjects, they looked at fat and muscle biopsies plus their insulin sensitivity data (in order to classify individuals as insulin sensitive or insulin resistant). Of those, 167 were selected for genotyping (Affymetrix 6.0) and gene expression (Agilent array). In adipose, there is a significant enrichment for eQTL SNPs. Some T2DM SNPs were shown to have eQTL characteristics. For example, rs864745 associates with expression of JAZF1, a T2DM locus, in muscle.

In muscle, ten genes are differentially expressed between the insulin sensitive and the insulin resistant individuals. One of these is PPARGC1A. In adipose, the story is one of more genes – 172 genes are differentially expressed between the insulin sensitive and the insulin resistant subjects at greater than or equal to 1.5-fold. However, few eQTL SNPs were identified from these 182 events (10 in muscle plus 172 in adipose). They conclude that transcript regulation is mostly trans. Nearly all of the cis eQTL candidates did not hold up to further analysis.


J. Zhao – TCF7L2 variants and functional consequences

They used ChIP-seq but observed nothing from extracts from pancreatic islet cells. They noted (from the literature?) a connection between TCF7L2 and cancer. For example, TCF7L2 binds in the region far upstream of the MYC oncogene.

[LP: Are any of the 1095 TCF7L2 binding sites they observe (within 50 kbp of 866 genes) disrupted by SNPs?]


J. Florez – Meta-analysis of proinsulin levels

The phenotype is fasting proinsulin adjusted for fasting insulin in a manner that seemed to require a fair amount of thought on their part. Then, they did the GWAS – where TCF7L2 and SLC30A8 served as positive controls. They noted six loci:

/ C2CD4A / C2CD4B

A seventh locus is SNP rs306549 in DDX31 where the association is found only in women.


N. Palmer – Loci for type 2 diabetes in African-Americans

14.7% of African-American adults have T2DM and one in four elderly women suffer from the disease or end-stage kidney disease.

They used principal component analysis to model the admixture.

The original cohort was 965 cases and 1029 controls. The replication population was 709 cases and 690 controls. For the meta-analysis, they had ~3100 cases and ~3100 controls.

754 SNPs were selected for replication. 122 SNPs were nominally and directionally consistent to proceed with validation. They found loci in:


During the Q&A, the issue was raised that some controls will go on to develop T2DM in the future. [LP: Rather unfair question as this can be the case for so many studies that were presented at ASHG. In fact, you can control for this, somewhat, with age-matched controls.]


W. Wei (Institute for Genetics and Molecular Medicine) – Epistasis and genetic control of BMI

Pairwise genome scan identified seven gene-gene pairs reaching statistical significance. A significant number of genes in the 35 gene-gene pairs (the seven above plus another 28 based on candidate approaches) have a role in smoking and alcohol addiction. He showed some gene-gene interaction networks – nice and very similar to what we are doing.

See, for example, his paper in Heredity entitled, "Controlling false positives in the mapping of epistatic QTL."


N. Timpson – Effect of BMI on risk of heart disease

They segmented the population by ~4 units of BMI because this is the standard deviation for this population between those with heart disease and those without. After showing a lot of analysis methods and approaches, the main point was that an increase in BMI of about four units leads to an OR of ~1.52 in risk for ischemic heart disease. Thus, BMI is causally related to ischemic heart disease (OR ~1.5). He used an allele score to represent lifecourse changes in BMI.


E. Speliotes – GWAS for fatty liver disease

Five loci identified:


Thursday, November 4, 2010

ASHG 2010 conference notes - 4 Nov 2010

Notes from ASHG 2010 (American Society of Human Genetics)
Washington, D.C. 4 November 2010

A. Goldstein – Challenges to identification of high-risk alleles

High-risk alleles are rare to very rare and typically have a relative risk greater than 5.

Challenges to finding high-risk alleles
There really is no major high-risk gene
Lack of power or informativeness
Underlying complexity of genetics
Clinical and epidemiological heterogeneity and/or misclassification
Follow-up of linkage results

Illustrations of challenges
BRCA1 – 10% of risk of breast cancer
BRCA2 – 12% of risk of breast cancer
Existence of a "BRCA3" with high-risk is rather unlikely

CDKN2A/ARF – ~20% risk for melanoma
CDK4 – ~1% risk for melanoma

So, increase power of the study. Better use or incorporate:
Molecular genetic data
Functional genomics data
Epidemiological and clinical data

New technology may help – such as NextGen sequencing


J. Bailey-Wilson – Complex traits really are complex

Major environmental risk factors may be common
Major genetic risk alleles for serious diseases tend to be rare in population
- Due to selection
- A major locus may have many “risk” alleles

She offers breast cancer as a model. Traditional approaches identified BRCA1 and BRCA2, but then came GWAS.

Linkage is very powerful to detect high penetrance risk alleles in families. Association is very powerful to detect common risk alleles but – if each family has a different, rare or private allele/variant, association will not succeed.

Why has “the gene” not been found?
- False positive linkage
- Have the right gene but don’t understand it yet
- Haven’t yet sequenced fully the region defined by the linkage study
- It is not a gene but a regulatory region
- Could be a long, non-coding RNA
- MicroRNAs and intronic variants, too

Synonymous variants are interesting – change the kinetics of translation!

She is hopeful that more sequencing will be done under broad linkage peaks. But need to phenotype well to fully test for GxE influence.


E. Wijsman – Cardiovascular QTLs and large pedigrees

They are looking at familial combined hyperlipidemia (FCHL) in 4 families with 253 subjects. They looked at 600 STRs and 48K SNPs on CVD chip. The phenotype of choice is plasma APOB. For plasma APOB levels, they noted a LOD score of 3.1 on chromosome 4.

Across this large APOB linkage peak, they used each SNP as a covariate to see which one(s) abolish the peak. Then, which gene? Do exome sequencing. All this identified a SNP in LRBP but direct genotyping of the entire pedigree brought the variance from 0.4 to ~0.18 – killed it. So, need to generate many candidate variants for quick screening by genotyping the entire pedigree – because finding one SNP and testing it in a one-by-one manner is not efficient.

The exome data may identify a haplotype which extends to the non-exome.


N. Camp – Analytical strategies to identify rare risk variants using extended high-risk pedigrees

They use Utah family data: 2.2 million individuals over three to eleven generations, with hospital records.


J. Degner – Using genome-wide sensitivity data to infer transcription factor binding

Transcription factor binding sites (TFBS) are poorly annotated. They use ENCODE’s DNase I data. Their tool uses 230 position weight matrices and 800,000 sites. They also have an article in press at Genome Research. So, use this to check GWAS hits. An example is a binding site QTL for PEBPI.


I. Aneas – What are the downstream targets of Tbx20?

- differential expression in Tbx20 wildtype vs knockout mice, in heart tissue
- ChIP-seq data from embryo gives 2000 binding sites, from adult gives 4000 binding sites

Combining the above gives 2000 genes. This set is enriched for ion transport and calcium homeostasis functions.


A. Letourneau – Effect of trisomy 21 on gene expression

They used a twin study – monozygotic twins where one is trisomic for Chr21 and the other not. Many genes on Chr21 and elsewhere in the genome show differential expression. Many Chr21 genes show >1.5-fold increase in expression for trisomic:normal comparison. 58 genes show Chr21-trisomy-specific alternate splicing. [LP: This has got to be a harbinger of what is possible with careful analysis of the effect of CNVs.]


T. Teslovich – Sequencing of 400 cases, 200 controls at 26 genes for type 2 diabetes

Goal: Identify rare variants in genes implicated by GWAS.

To date, the most interesting finding is GCKR variant E584X (stop codon). In study #1, the minor allele frequency (MAF) was 0.56% in cases and 0.80% in controls. In study #2, the MAF was 0.08% in cases and 0.15% in controls. (I missed values for study #3.) The point here is that the differences in allele frequencies are not significant. So, go to the Metabochip with 14,000 cases and 17,000 controls. This is on-going…


H. Daoud – Exome sequencing in ALS families

Six candidate genes were identified that are shared in two ALS families, but none are shared in three families. This is indicative of the heterogeneity of ALS.


D. MacArthur – Loss-of-function mutations in healthy human genomes

LOF is a premature stop, splice site disruption, small indel leading to a frameshift, others.

Data from the 1000G pilot:
- 1088 stop SNPs
- 643 splice disruptors
- 956 small (< 40 bp) frameshift indels
- 147 genes disrupted by large indels

Implication is each person has many of these types of variant. ~25% (453 of ~1743) LOF variants did not pass manual validation. OK, so a few of these LOF variants actually are from RefSeq errors and gene model errors. Gene models will be corrected in the next release of Gencode so that subsequent clinical sequencing won’t have to deal with this. In other words, there will be no error.

The estimate is that there are ~140 true LOF variants per individual and about 35 of these are homozygous.

Wednesday, November 3, 2010

ASHG 2010 conference notes - 3 Nov 2010

Notes from ASHG 2010 (American Society of Human Genetics)
Washington, D.C.
3 November 2010

John Rossi (City of Hope National Medical Center) – SNPs in human microRNA genes affect biogenesis and function

miRNAs regulate translation and degradation of mRNAs. Identifying targets of the miRNAs is a major challenge.


Euan Ashley (Stanford University) – What to do with all the sequence data?

Examine the genome of S. Quake with its 6 billion data points.

A rare variants algorithm – tough because a single database does not exist or is private and in varying format. Thus, they use catalogs of common variants for this Patient Zero prototype. With common variants, they need genotype frequencies much more than odds ratio or p-value of association (in the population) when applying population data to the individual.

Dealing with novel variants presents another challenge but some new tools were built by their team (e.g., using SNP-based changes in free energy of RNA folding).

They want to put the genetic risk of the individual in the context of risk for that patient – a 40-yr old White male. For example, he already has a 50% increased risk for obesity given certain non-genetic parameters. It is also necessary to consider environmental risk. An example figure showed how such information on risk can be presented to the patient, with a bar indicating how risk changes for this person. In this case, there is an increase in risk of obesity from about 10% to about 60%.
- Data are coming, lots and lots!
- We need to deal with large amounts of data
- Databases need to be reconfigured to facilitate genome interpretation
- Physicians need to learn how to communicate such genetic results with patients


Russ Altman (Stanford University) – Pharmacogenomics

He started with a screenshot of and used it to highlight a few SNPs relevant to warfarin dosing.

The focus of the talk was to analyze S. Quake’s genome and evaluate ~2500 SNPs and CNVs with pharmacological implications. They used common variants. Within CYP2C19, Quake has a known variant resulting in 50% reduction in metabolizing rate (he’s heterozygous). He then presented a table with column headers of: Drug, Summary, Level of evidence, PMID, Gene, rsID.

Then on to the novel SNPs found in the Quake genome and organized in the same type of table. The focus was on those SNPs that change an amino acid and are predicted to be deleterious, with predicted potential drug impact. He, as a physician, cannot say, “These SNPs have not been studied before and we will ignore the data (on predicted impact).” Instead, acknowledge those SNPs and genes and drugs and go in a different but equivalent direction with regard to advice and treatment.


Job Dekker (University of Massachusetts Medical School) – HiC and higher order folding of the human genome

Started with chromosome 21 to identify higher order organization of the genome. The 5C method was employed to identify millions of chromatin-chromatin interactions across the entire genome. Their finding is that genes often become physically close to elements that are 1 to 10 Mb away from that gene. This is a long-range distance, but still on the same chromosome. They have identified some 3000 such examples.


Arend Sidow (Stanford University) – What is the functional fraction of the portion of the variable part of the human genome?

How big is the functional fraction of our total genetic variation? “Our” is a key word: It could relate to population or to a single person or haploid genome. For the amount of total genetic variation, consider derived alleles.

0.5% of haploid genome is deviant – but what fraction is functional?

He used p53 (TP53), with its SNPs and repeats, as an example suggesting that 10% of variants are functional. They use GERP – genomic evolutionary rate profiling (Cooper 2005 Genome Res). See Davydov (PLoS Comp Biol, in press). That work shows that 225 Mb, 7.3% of the genome, is functional.

What is the functional fraction of the variation in human?

0.5% of the genome, 3 million variants. Functional: 3-8%, 300,000 to 1,000,000 bp, with most (~90%) mapping to non-coding sites.


Erin Kaminsky (Emory University) – Towards evidence-based criteria for clinical interpretation of CNVs

15,749 subjects (from 7 different studies) were genotyped for CNVs as were ~10,400 controls. I think the pathology was for neurological disorders. Pathogenic CNVs were identified in ~17% of cases.

She presented a table of CNV deletions at 22q11.2 (found in 93 cases and 0 controls), 15q13.2-q13.3 (epilepsy, 46 cases, 0 controls), 15q11.2-q13.3 (Angelman, 41 cases, 0 controls), 16p11.2 (autism, 67 cases, 5 controls), and 1q21.1 (microcephaly, 55 cases, 3 controls). The group also looked at duplications.

They used p-value to classify the CNV as pathogenic or not. There was nothing like pathway analysis or gene expression data to go along with this.
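The notes do not say which test produced the p-values. As one plausible illustration, a one-sided Fisher exact test can be run on the deletion counts from the table above using only the standard library. The function name `enrichment_p` is made up for this sketch, and the control total is taken as exactly 10,400 even though the notes give it only approximately.

```python
from math import comb

N_CASES = 15_749
N_CONTROLS = 10_400  # assumption: the notes say "~10,400"


def enrichment_p(case_carriers, control_carriers,
                 n_cases=N_CASES, n_controls=N_CONTROLS):
    """One-sided Fisher exact p-value: the probability of seeing at least
    `case_carriers` carriers among cases if carrying the CNV were unrelated
    to case/control status (hypergeometric upper tail)."""
    carriers = case_carriers + control_carriers
    total = n_cases + n_controls
    tail = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, min(carriers, n_cases) + 1)
    )
    # Python integer division stays accurate even for very large integers.
    return tail / comb(total, carriers)


for region, cases, controls in [("22q11.2", 93, 0),
                                ("16p11.2", 67, 5),
                                ("1q21.1", 55, 3)]:
    print(region, enrichment_p(cases, controls))
```

A small p-value here only says that carriers are over-represented among cases. As noted above, it does not bring in pathway or expression evidence.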


N. Wasserman – MYC, GWAS for cancer and the nearby gene desert

This region near to MYC is a gene desert but it is a region of regulation (see Wasserman 2010 Genome Res).

How then to identify such long-range regulatory potential? They use BACs (bacterial artificial chromosomes) as enhancer traps!

FTO. The obesity associations fall within a 50-kbp block of LD that includes the last half of intron 1, exon 2 and most of intron 2. Fto-/- mice are smaller and leaner, and have less adipose than control. Thus, tissue-specific upregulation of FTO should lead to the obese condition. The result is enhancers in this 50-kbp region enhance expression in many tissues just like normal Fto (mouse).

They then used 13 different contigs spanning this 50 kbp region to tile across the LD block to find tissue-specific enhancer elements in zebrafish, then to mouse. They found a brain enhancer and then deleted that enhancer from the BAC enhancer trap to show that that small segment is necessary to drive expression in brain.


Jared Maguire (Broad Institute) – Using conditional mutation rate to interpret variation in the genome

They use adjacent bases as an explanation for local variability. They look at 3-mers in the coding sequence but he offered an example of GCG > GTG as a known sequence-context-driven C > T change from CpG islands. (I thought CpG islands were not typically found in coding sequence.)

They look for genes with higher SNP burden than others. No specific genes were given.


M. Eberle (Illumina, Inc) – Illumina NextGen genotype arrays

15-20% increase in the number of common variants based on latest NextGen and 1000G data. Can they build haplotypes? They use 1.4 million SNPs for imputation based on 60 CEPH samples. He thinks this will improve when more samples are added. This process gives 7.7 million total SNPs. Many show concordance. Genotype calls for rare variants are very accurate: Rare variants show similar accuracy to common variants and overall concordance is 99.96%.


Li – Global patterns of RNA editing in humans

RDDs = RNA-DNA differences

Traditional RNA editors are the ADARs (A>I) and APOBECs (C>U). RDDs are not traditional.

RNA preps from 27 CEU B cell samples were sequenced along with the genomic DNA. From the DNA side, they retained only monomorphic sites not in dbSNP, HapMap, 1000G data. From the RNA side, they required greater than 20 reads per position, greater than 20% of those reads with sequence different than the DNA.
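The filtering rules above can be written as a small per-position check. This is a sketch of the thresholds as recorded in these notes, not the authors' pipeline. The function and constant names are hypothetical, and how a site is flagged as a known polymorphism is left to the caller.

```python
MIN_READS = 20           # "greater than 20 reads per position"
MIN_ALT_FRACTION = 0.20  # "greater than 20% of those reads" differ from the DNA


def is_rdd_candidate(total_reads, differing_reads, is_known_polymorphism):
    """Apply the DNA-side and RNA-side filters from the notes to one position."""
    if is_known_polymorphism:        # site listed in dbSNP, HapMap or 1000G
        return False
    if total_reads <= MIN_READS:     # depth must be strictly greater than 20
        return False
    return differing_reads / total_reads > MIN_ALT_FRACTION


print(is_rdd_candidate(50, 20, False))   # 40% of 50 reads differ
print(is_rdd_candidate(100, 20, False))  # exactly 20%, not above the threshold
```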

They find 3762 (+/-1647) RDD events per subject. Overall, there were 20,753 events in 4507 genes. When requiring that the event/gene be present in more than half the subjects, there were 10,117 events; 3776 events were detected in all the subjects.

30.8% of the 101,574 grand total events were A>G or T>C. 19.3% were C>T or G>A. But all others were seen. About 25% of the events are in coding sequence.

What percent of the reads show the RDD? Of all 101,574 events, median level is 97%! These affect splicing. These affect disease susceptibility. These modify disease manifestation. The question remains if these mRNAs are degraded or translated.


J. Knight – Psoriasis susceptibility loci and genetic interaction between HLA-C and ERAP1.

Their GWAS identified many immune system genes. They then looked for pair-wise interactions between SNPs that replicated and those concordant with other studies. They used a dominant model to do this.


M. Hannibal – Identification of a gene involved in Kabuki syndrome

This is a rare syndrome and so they began the search by looking for a SNP that is in the exome data but not in HapMap or dbSNP. 78% of 104 kindreds have MLL2 mutations. MLL2 methylates histone H3 on lysine 4, H3K4.

ASHG 2010 conference notes - 2 Nov 2010

Notes from ASHG 2010 (American Society of Human Genetics)
Washington, D.C.
November 2, 2010

Eric Lander (Broad Institute) – The human genome project: A decade later

The draft (~90% complete) of the human genome was announced in June, 2000 and published in February, 2001. The finished (~99.3%) sequence was announced in April, 2003 and published in October, 2004.

With the sequence available, we can now build maps of all kinds. Some types include structure maps, maps of molecular function and disease maps. We can also put together a catalog of signatures – allowing us to build platforms for gene expression and proteomics.

In 2000, the completed eukaryotic genomes numbered four (S. cerevisiae, C. elegans, D. melanogaster, A. thaliana). 38 prokaryotic genomes were known. In 2010, the genomes of 250 eukaryotes are complete, 4000 bacteria/viruses and at least 500 human genomes. This has happened for various reasons, a primary one being the drop in cost of sequencing; it has fallen ~100,000-fold since 1999.

Understanding the genome. In 2000, the thought was there are 35,000 to 100,000 protein-coding genes, regulatory sequences were not so numerous, there was some non-coding sequence, and transposons and such were considered junk. In 2010, the gene count is 21,000, much more information is in the genome than we thought (~25% of evolutionarily conserved sequences are non-coding and number about 3 million elements (by sequencing and comparing the genomes of 29 mammals)), transposons are big players in the dissemination of these conserved elements, the epigenome, and the approximate 5000 large inter-genic non-coding RNAs.

Mendelian traits. In 1990, we knew the source of 70. In 2000, that number was 1300. In 2010 that stands at 2900 Mendelian disorders identified (see OMIM). There are about 1800 more to know.

The basis of disease – complex diseases and traits. In 1990, we knew only about HLA, number = 1. In 2000, that was ~25, with things like APOE and Alzheimer disease. In 2010 that has risen to ~1100 with respect to 165 common disease traits. But there is disappointment in GWAS because the effect size is small and there is this missing heritability. He thinks that rare variants are not needed because heritability increases as the number of subjects in the GWAS increases, because population genetics suggests that for many common diseases rare variants explain less than other variants, (point #3 I missed), and epistasis hugely distorts the estimate of variance (a – GWAS finds all loci, b – but the loci explain 33% of variance, c – thus we need to use GWAS to identify the biology and then look at variance).

Cancer. In 1990 we knew of 12 solid tumor cancer genes. In 2000 that number was 80. In 2010 it is 240. New pathways are being discovered as pertinent in certain cancers.

History of human populations. He rushed through this and did not really provide any information that is not widely published.


John Stamatoyannopoulos (University of Washington) – Using ENCODE to read the human genome: Function and disease

ENCODE is used to guide interpretation of disease-associated genetic variation (GWAS). Many GWAS point to non-coding GWAS SNPs – 47% in introns, 2% in promoters, 7% coding, 14% are 50-100 kbp from nearest known gene, 10% are 1-50 kbp from nearest known gene, 18% are >100 kbp from nearest known gene.

DNase I hypersensitivity site (D1HS) maps overlaid on inflammatory bowel disease GWAS near PTGER4. He uses data from relevant cell lines Th2, Th1, B lymphocytes and sees signals of histone marks in those cells.

Cancer GWAS at 8q24 (upstream of MYC). One SNP lands in a H3K27Ac site, a binding site for TCF7L2 (in colonic cells) and a D1HS.

26% of GWAS SNPs fall in D1HSs. This is ~2.5-fold enrichment. GWAS SNPs for cognition, Parkinson disease, bipolar disorder, and others, map to D1HSs found only in brain. He sees a similar result for heart with Q-T interval, atrial fibrillation, EKG traits and response to statin therapy.

ENCODE is heading to a point of nucleotide resolution in order to better define the regulatory genome.


Nathalie Cartier (INSERM) – Gene therapy for neurodegenerative diseases

Brain: 2% of body weight but 25% of all cholesterol.
[LP: Hence the Alzheimer-lipid links]


Michael Meaney – Environmental regulation of the neural epigenome

Environmental factors are social (parental) and economic (food, shelter, safety).

Parental care leads to epigenetic marks which lead to changes in gene expression which then lead to a phenotype. His example is licking of young rat pups (in the first one to two weeks of life) by rat mothers. This licking (care) leads to changes in phenotypic responses to stress, neural development, female reproduction and metabolism. He intends to discuss the endocrine response to stress. Expression of specific genes in specific brain region(s).

[Cool stuff – but delivered like a speed reading of a journal article...]