Methods
Sources of Data
Copyright (C) 2010 Jennifer D. Warrender, MSc Bioinformatics, Newcastle University
This study required FBN1 and CFTR protein sequences, together with control data of known variants for both genetic diseases. FASTA protein sequences of FBN1 and CFTR for different species were taken from the corresponding family trees provided by the Tree families database. Links are found below. Ambiguous sequences and fragments were not included. The species used for each gene are listed below.
The control data of known variations for Marfan’s syndrome and cystic fibrosis were obtained from three sources: Dr. Ciaron McAnulty of the Northern Genetic Service, The Newcastle upon Tyne Hospitals NHS Trust; the Universal Mutation Database for FBN1 (Beroud et al., 2000); and the Cystic Fibrosis Mutation Database for CFTR. Links are found below.
The data were then filtered to remove splicing variants, insertions, deletions, frameshifts and repetitions. This left 271 pathogenic and 4 neutral variations for Marfan’s syndrome, and 668 pathogenic and 18 neutral variations for cystic fibrosis. The control data were then used to create substitution files. A substitution file is a text file, used by all PON tools, that holds all the query substitutions. Each line represents one substitution in the format A1NA2, where A1 is the wild-type amino acid, N is the position within the protein and A2 is the substituting amino acid. For example, the Marfan’s syndrome substitution R62C is the replacement of R (arginine) with C (cysteine) at position 62 of the FBN1 protein.
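The A1NA2 format described above can be illustrated with a minimal Python sketch. It is not the script used in this study: the function names `parse_substitution` and `build_substitution_lines` are hypothetical, the optional check against a reference protein sequence is an added assumption, and "repetitions" is interpreted here simply as repeated entries. Anything that is not a single amino-acid substitution (for example a deletion written as F508del) is skipped.

```python
import re

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# A1NA2: wild-type residue, 1-based position, substituting residue.
SUBSTITUTION = re.compile(
    rf"([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])"
)


def parse_substitution(text):
    """Return (wild_type, position, substitute), or None if not A1NA2."""
    match = SUBSTITUTION.fullmatch(text.strip())
    if match is None:
        return None
    wild_type, position, substitute = match.groups()
    return wild_type, int(position), substitute


def build_substitution_lines(raw_variants, protein_sequence=None):
    """Keep well-formed, non-repeated substitutions in input order.

    If protein_sequence is given (an assumption of this sketch, not a
    step stated in the text), a substitution is kept only when its
    wild-type residue matches the sequence at that position.
    """
    seen = set()
    lines = []
    for raw in raw_variants:
        parsed = parse_substitution(raw)
        if parsed is None:
            continue  # not a single amino-acid substitution
        wild_type, position, substitute = parsed
        if protein_sequence is not None:
            if position > len(protein_sequence):
                continue  # position lies outside the protein
            if protein_sequence[position - 1] != wild_type:
                continue  # wild-type residue disagrees with the sequence
        line = f"{wild_type}{position}{substitute}"
        if line in seen:
            continue  # repeated entry
        seen.add(line)
        lines.append(line)
    return lines
```

Writing the returned list to a text file, one entry per line, gives a file with the layout described above: one substitution per line.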
• FBN1 – Homo sapiens, Bos taurus, Canis familiaris, Gasterosteus aculeatus, Macaca mulatta, Monodelphis domestica, Mus musculus, Pan troglodytes, Rattus norvegicus, Tetraodon nigroviridis, Xenopus tropicalis
• CFTR – Homo sapiens, Bos taurus, Canis familiaris, Macaca mulatta, Mus musculus, Oryzias latipes, Pan troglodytes, Pongo pygmaeus, Rattus norvegicus, Tetraodon nigroviridis, Xenopus tropicalis
Optimization of parameters for the assessment of Unclassified Disease Gene Sequence Variants