Methods
Sources of Data
For this study, protein sequences and control data of known variants for 7 genetic diseases were required.
For this study, 7 MSA tools, 2 MSA benchmarks, 6 PON tools and 5 other tools were evaluated and used.
To determine which MSA was optimal, every possible MSA size and combination of the input sequences had to be created. These were created using the modified CombinationGenerator program. Each MSA consisted of H. sapiens and n other sequences, where n is greater than 1. The resulting MSAs were then run in conjunction with the PON tools.
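The enumeration of sequence sets described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the CombinationGenerator program used in the study: the function name `msa_subsets`, its parameters and the species labels are hypothetical, and the sketch only lists which sequences go into each MSA (it does not run an alignment tool).

```python
from itertools import combinations


def msa_subsets(reference, others, min_others=2):
    """Yield every sequence set made of the reference plus n other sequences.

    min_others=2 reflects the condition that n is greater than 1.
    """
    for n in range(min_others, len(others) + 1):
        for combo in combinations(others, n):
            # The reference sequence (H. sapiens) is always included.
            yield (reference,) + combo


# Hypothetical labels, for illustration only.
others = ["species_A", "species_B", "species_C", "species_D"]
subsets = list(msa_subsets("H_sapiens", others))

for subset in subsets:
    print(", ".join(subset))
```

With k other sequences available, this produces 2^k - k - 1 sequence sets, so the number of MSAs to run grows quickly as sequences are added.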
The PON benchmark consists of the Matthews Correlation Coefficient (MCC) and the PM% (the percentage of predicted mutations). The MCC takes into account both the true predictions and the false predictions. It returns a value between -1 and +1, where a coefficient of +1 means that 100% of the predictions are correct, a coefficient of 0 means that the predictions are no better than random (about 50% correct for a balanced data set) and a coefficient of -1 means that 0% are correct.
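The MCC is conventionally computed from the four confusion-matrix counts: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The following is a minimal Python sketch of that standard formula, not the code used in the study. The function name `mcc` is hypothetical, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: the MCC is undefined here, so report 0.
        return 0.0
    return numerator / denominator


# Illustrative counts only, not data from the study.
print(mcc(50, 50, 0, 0))    # every prediction correct
print(mcc(25, 25, 25, 25))  # predictions no better than random
print(mcc(0, 0, 50, 50))    # every prediction wrong
```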
Automation
The majority of this study was automated on a Linux system, using Bash scripts, Java classes and Perl modules to automate the optimization of the parameters.
Copyright (C) 2011 Jennifer D. Warrender, Newcastle University
Optimization of parameters for the assessment of Unclassified Disease Gene Sequence Variants