
Structural variants (SVs) and short tandem repeats (STRs) comprise a broad group of diverse DNA variants which vastly differ in their sizes and distributions across the genome. Here, we
identify genomic features of SV classes and STRs that are associated with gene expression and complex traits, including their locations relative to eGenes, likelihood of being associated
with multiple eGenes, associated eGene types (e.g., coding, noncoding, level of evolutionary constraint), effect sizes, linkage disequilibrium with tagging single nucleotide variants used in
GWAS, and likelihood of being associated with GWAS traits. We identify a set of high-impact SVs/STRs associated with the expression of three or more eGenes via chromatin loops and show that
they are highly enriched for being associated with GWAS traits. Our study provides insights into the genomic properties of structural variant classes and short tandem repeats that are
associated with gene expression and human traits.
Structural variants (SVs) and short tandem repeats (STRs) are important categories of genetic variation that account for the majority of base pair differences between individual genomes and
are enriched for associations with gene expression1,2,3. SVs and STRs are comprised of several diverse classes of variants (e.g., deletions, insertions, multi-allelic copy number variants
(mCNVs), and mobile element insertions (MEIs)), and multiple algorithmic approaches and deep whole genome sequencing are required to accurately identify and genotype variants in these
different classes4. Due to the complexity of calling SVs and STRs, previous genetic association studies have generally not identified a comprehensive set of these variants but rather have
focused on one or a few of the class types, and therefore the genomic properties of SVs and STRs associated with gene expression and/or complex traits are not well characterized.
SV classes and STRs vary in genomic properties including size, distribution across the genome, and impact on nucleotide sequences, but previous studies have not investigated whether these
differences influence the likelihood of being an expression quantitative trait locus (eQTL), eQTL effect sizes, or the properties of eQTL genes (eGenes) such as gene type or level of
evolutionary constraint3,5,6,7. Further, it is unknown if the variant classes may affect gene expression through different mechanisms such as altering gene copy number or three-dimensional
spatial features of the genome. A comprehensive SV and STR data set generated using high-depth whole genome sequencing (WGS) from a population sample with corresponding RNA-sequencing data
could be used to assess whether genomic features of SV classes and STRs are associated with properties of eGenes and eQTLs.
SVs and STRs have also been associated with complex traits, though they have been studied considerably less often in genome-wide association studies (GWAS) than single nucleotide variants
(SNVs), and the overall contribution of SVs and STRs to complex traits is not well understood8,9,10,11,12,13,14,15. One difficulty with studying differences between SV classes and STRs in
GWAS is that it is unknown whether the SV classes are differentially tagged by SNVs on genotyping arrays. A collection of hundreds of subjects genotyped for a full range of SVs, STRs, SNVs,
and insertions/deletions (indels) could be used to assess the functional impact of SVs and STRs on complex traits using existing SNV-based GWAS and identify dark regions of the genome not
captured by array GWAS.
In this study, as part of the i2QTL Consortium, we use RNA-sequencing data from induced pluripotent stem cells (iPSCs) from the iPSCORE and HipSci collections7,16,17 along with a
comprehensive call set of SVs and STRs from deep WGS data4 to identify variants associated with iPSC gene expression and characterize the genomic properties of these SV and STR eQTLs. We
observe that SVs are more likely to act as eQTLs than SNVs when in distal regions (> 100 kb from eGenes) and that duplications and mCNVs are more likely to have distal eQTLs and multiple
eGenes compared to other SV classes and STRs. eGenes for mCNV eQTLs are also less likely to be protein coding and more likely to have strong effect sizes relative to other SV classes and
STRs. We examine the LD of SVs and STRs with GWAS variants and find that mCNVs and duplications are poorly tagged by GWAS SNVs compared to other variant classes. Overall, 11.4% of common SVs
and STRs are in strong LD with a SNV associated with at least one of 701 unique GWAS traits, and deletion, rMEI, ALU, and STR lead eQTL variants are enriched for GWAS associations,
establishing that these variant classes have underappreciated roles in common traits. Finally, we find a highly impactful set of SVs and STRs located near high-complexity loop anchors that
localize near multiple genes in three-dimensional space and are enriched for being associated with the expression of multiple genes and GWAS traits. This work establishes that different classes of SVs and
STRs vary in their functional properties and provides a valuable, comprehensive eQTL data set for iPSCs.
We performed a cis-eQTL analysis using RNA sequencing data from iPSCs derived from 398 donors in the iPSCORE and HipSci projects along with a comprehensive map of genetic variation (37,296
SVs, 588,189 STRs, and ~48 M SNVs and indels (Supplementary Data 1)) generated using deep WGS from these same donors4. These variants include several classes of SVs including biallelic
duplications and deletions; multi-allelic copy number variants (mCNVs); mobile element insertions (MEIs) including LINE1, ALU, and SVA; reference mobile element insertions (rMEI);
inversions; and unspecified break-ends (BNDs). We identified 16,018 robustly expressed autosomal genes and tested for cis associations between the genotypes of all common (MAF ≥ 0.05) SVs
(9,313), STRs (33,608), indels (~1.52 M), and SNVs (~5.83 M) within 1 megabase of a gene body using a linear mixed model approach (Fig. 1a and Supplementary Data 2, “Methods” section). We
detected associations between 11,197 eGenes (FDR 30% coding bases, > 65% coding mRNA bases, a duplication rate lower than 75%, a median 5′ bias below 0.4, a 3′ bias below 4, a 5′–3′ bias
between 0.2 and 2, a median coefficient of variation of coverage of the 1000 most expressed genes below 0.8, and a free-mix value below 0.05.
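The legible thresholds in this list can be expressed as a simple per-sample filter. The following is a minimal sketch, not the pipeline used in the study: the class and field names are hypothetical, the inclusive bounds on the 5′–3′ bias are an assumption, and the criteria that precede "> 65% coding mRNA bases" are omitted because the text is incomplete at that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RnaSeqQc:
    """Per-sample RNA-seq QC metrics (field names are hypothetical)."""
    coding_mrna_bases_pct: float       # percent of mRNA bases that are coding
    duplication_rate_pct: float        # percent duplicated reads
    median_5p_bias: float              # median 5' bias
    bias_3p: float                     # 3' bias
    bias_5p_to_3p: float               # 5'-3' bias
    median_cv_coverage_top1000: float  # median CV of coverage, 1000 most expressed genes
    freemix: float                     # free-mix contamination estimate


def passes_qc(qc: RnaSeqQc) -> bool:
    """Apply only the thresholds that are fully legible in the text.

    Assumption: the 5'-3' bias range is treated as inclusive.
    """
    return (
        qc.coding_mrna_bases_pct > 65.0
        and qc.duplication_rate_pct < 75.0
        and qc.median_5p_bias < 0.4
        and qc.bias_3p < 4.0
        and 0.2 <= qc.bias_5p_to_3p <= 2.0
        and qc.median_cv_coverage_top1000 < 0.8
        and qc.freemix < 0.05
    )
```

A sample is retained only if every criterion holds, so a single failing metric (for example, a free-mix value of 0.1) excludes it.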
Subsequently, gene expression values were normalized across lines that passed quality control. For this, we derived edgeR-corrected42,43 transcripts per million (TPM) gene-level quantifications per
iPSC line from the feature count information. After this normalization, we removed samples that had low expression correlation (20% of samples at an average TPM > 0.5 among samples that
expressed the gene) in the 398 HipSci and iPSCORE donors (Supplementary Data 1). We performed association tests using a linear mixed model (LMM), accounting for population structure and
sample repeat structure as random effects (with a kinship matrix estimated using PLINK45). All models were fit using LIMIX46 (https://limix.readthedocs.io/).
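To make the structure of such a test concrete, the following is a minimal NumPy sketch of a single-variant LMM association test with a kinship random effect. It is not the LIMIX API and not the study's implementation. It assumes maximum-likelihood estimation over a fixed grid of variance ratios and compares the models with and without the genotype using a likelihood ratio test with one degree of freedom. All function and parameter names are hypothetical.

```python
import math

import numpy as np


def _ml_loglik(y_rot, X_rot, eigvals, delta):
    """ML log-likelihood of y ~ N(X b, sg2 * (K + delta * I)) in the rotated basis."""
    w = 1.0 / (eigvals + delta)                # inverse variances up to the scale sg2
    Xw = X_rot * w[:, None]
    beta = np.linalg.solve(X_rot.T @ Xw, Xw.T @ y_rot)  # weighted least squares
    resid = y_rot - X_rot @ beta
    n = y_rot.shape[0]
    sigma2 = np.sum(w * resid ** 2) / n        # ML estimate of sg2
    return -0.5 * (n * math.log(2.0 * math.pi)
                   + np.sum(np.log(eigvals + delta))
                   + n * math.log(sigma2) + n)


def _max_loglik(y_rot, X_rot, eigvals, log_deltas):
    """Maximize the log-likelihood over a grid of log(delta) values."""
    return max(_ml_loglik(y_rot, X_rot, eigvals, math.exp(ld)) for ld in log_deltas)


def lmm_lrt(y, genotype, covariates, kinship, log_deltas=None):
    """Likelihood ratio test for one variant; returns (statistic, p-value)."""
    if log_deltas is None:
        log_deltas = np.linspace(-10.0, 10.0, 81)   # assumed search grid
    eigvals, eigvecs = np.linalg.eigh(kinship)      # K = U diag(s) U^T
    eigvals = np.clip(eigvals, 0.0, None)           # guard against tiny negative values
    n = y.shape[0]
    X0 = np.column_stack([np.ones(n), covariates])  # null: intercept + covariates
    X1 = np.column_stack([X0, genotype])            # alternative: + genotype
    y_rot = eigvecs.T @ y
    ll0 = _max_loglik(y_rot, eigvecs.T @ X0, eigvals, log_deltas)
    ll1 = _max_loglik(y_rot, eigvecs.T @ X1, eigvals, log_deltas)
    stat = max(2.0 * (ll1 - ll0), 0.0)
    pval = math.erfc(math.sqrt(stat / 2.0))         # chi-square survival function, 1 df
    return stat, pval
```

Rotating the data by the eigenvectors of the kinship matrix makes the covariance diagonal, so each grid point costs only a weighted least-squares fit. Because the null model is nested in the alternative and both use the same grid, the test statistic is non-negative.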
Before QTL testing, gene expression levels were log-transformed and standardized. Significance was tested using a likelihood ratio test. To adjust for global differences in expression
across samples, we included the first 50 PEER factors (calculated across all 1,367 lines using log-transformed expression values) as covariates in the model. In order to adjust for multiple
testing, we used an approximate permutation scheme, analogous to the approach proposed in Ongen et al.47. Briefly, for each gene, we ran LIMIX on 1,000 permutations of the genotypes while
keeping covariates, kinship, and expression values fixed. We then adjusted for multiple testing using this empirical null distribution. To control for multiple testing across genes, we used
Storey’s q values48. Genes with significant eQTLs were reported at an FDR