P. K. Rogan1,2,4, E. J. Mucaki1, A. Stuart3, N. Bryans2, E. Dovigi1, B. C. Shirley2, C. Viner2, J. H. Knoll3,4, P. Ainsworth4. Departments of Biochemistry1, Computer Science2, and Pathology3, Western University, and Cytognomix Inc4, London, ON N6A 2C1 Canada.
High-throughput sequencing (HTS) of both healthy and disease singletons yields many novel and low-frequency variants of uncertain significance (VUS). Some non-coding sequence variants have been shown to contribute significantly to the phenotypes of high-penetrance disorders. We have developed an approach to predict the pathogenicity of non-coding VUS based on comprehensive information analysis of changes in DNA and RNA sequences bound by regulatory factors. Using cleavable solution microarrays, we are capturing and enriching for non-coding variants in genes known to harbor mutations that increase breast cancer risk. Oligo baits covering ATM, BRCA1, BRCA2, CDH1, CHEK2, PALB2 and TP53 were synthesized for solution hybridization with a custom cleavable microarray spanning the complete coding and intergenic regions 10 kb upstream and downstream of each gene. Non-exonic sequences are densely populated with repetitive sequences that can affect short-read assembly. A novel probe design method was used to capture both repeat-free sequences and divergent repeat sequences that are effectively single copy. After SBS sequencing of 13 patient samples in our laboratory, information theory-based sequence analysis was used to prioritize non-coding variants that occurred within sequence elements recognized by proteins or protein complexes. The novel VUS identified are being investigated for effects on mRNA splicing, transcription factor binding sites (TFBS), and untranslated regions (UTR). We have developed and applied information theory-based models for exon recognition, which predict the relative abundance of natural, cryptic, and mutant splice isoforms resulting from predicted mutations using the combined donor and acceptor site strengths of each mRNA species. We have applied a similar approach to detect mutations in the promoters of BRCA1 and BRCA2 that alter the strengths of TFBS.
Information weight matrices were automatically computed by entropy minimization for ATF3, BATF, BCL3, BCLAF, c-Jun, c-Myc, CTCF, EGR1, EP300, ETS1, FOSL2, FOXA1, FOXM1, GABP, GATA3, GRP20, HSF1, IRF4, MEF2A, NFIC, NFkB, PU.1, RAD21, RXRA, TCF12, TCF7L2, and YY1 TFBS from the global set of ENCODE ChIP-seq regions embedded within DNase I hypersensitive domains. These models were then used to evaluate novel variants discovered by sequence analysis of breast cancer patients for alterations in TFBS binding strength. This strategy covers non-coding regions in breast cancer genes more comprehensively than repeat masking, and introduces a unified framework for systematic interpretation of VUS that may affect expression.
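
As an illustration of how an information weight matrix can be used to score a variant, the following is a minimal sketch and not the pipeline used in this work. It assumes the binding sites are already aligned (the entropy-minimization alignment step is not shown), uses the simplified weight 2 + log2 f(b, l) with a pseudocount and no small-sample correction, and runs on invented toy sequences. The function names `information_weight_matrix`, `individual_information`, and `delta_ri` are hypothetical.

```python
import math

BASES = "ACGT"


def information_weight_matrix(sites, pseudocount=0.5):
    """Build a per-position weight matrix (in bits) from aligned sites.

    Simplified weight: 2 + log2(frequency). A pseudocount avoids log2(0);
    no small-sample correction is applied (an assumption of this sketch).
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: 2.0 + math.log2(counts[b] / total) for b in BASES})
    return matrix


def individual_information(matrix, seq):
    """Sum the weights of the bases observed in seq (site strength in bits)."""
    if len(seq) != len(matrix):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[base] for column, base in zip(matrix, seq))


def delta_ri(matrix, ref, alt):
    """Change in site strength (bits) caused by a variant; negative = weaker."""
    return individual_information(matrix, alt) - individual_information(matrix, ref)


# Invented toy alignment, for illustration only (not ENCODE data).
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA", "TGAGTCA", "TGACTCA"]
riw = information_weight_matrix(toy_sites)

ref_site = "TGACTCA"
alt_site = "TGACTCG"  # hypothetical variant at the last position
change = delta_ri(riw, ref_site, alt_site)
print(f"Reference strength: {individual_information(riw, ref_site):.2f} bits")
print(f"Variant strength:   {individual_information(riw, alt_site):.2f} bits")
print(f"Change:             {change:.2f} bits")
```

In this sketch, a variant that lowers the summed weight is read as weakening the predicted binding site, which is the sense in which variants are prioritized by changes in TFBS strength.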