PROJECT TITLE : Date Estimation in Lineage-Linked Databases
NAME : Vegard Brox
SUPERVISOR : Brian Randell
SECOND SUPERVISOR : Nick Rossiter

ABSTRACT :
The aim of this project is to produce a system that scans a lineage-linked database in order to provide estimated years for undated events. Such databases are used by genealogists for holding family trees and associated information. The system to be produced should first traverse the database, inserting the year range 0000/9999 in all blank year fields. It should then repeatedly traverse the database, applying appropriate logical and heuristic constraints in order to narrow the year ranges, so as to make all the year ranges in the database consistent with each other, or report errors if it is not possible to make the year ranges consistent.

AIMS :
- Explore how mathematical relaxation can be used to estimate dates in lineage-linked databases.
- Analyze how best to process lineage-linked databases.
- Get a better understanding of the software development process.
- Get more experience in doing a project.

OBJECTIVES :
- Determine reasonable heuristics for estimating dates in a lineage-linked database.
- Determine the internal representation of the database.
- Develop a program which performs the estimation.
- Make the program easy to use.
- Report errors and inconsistencies in the original database to the user.
- Test how well the program is able to operate on real-life databases.
- Make the program freely available through a webpage that explains what the program does.

DELIVERABLES :
The main deliverable from the project is the dissertation, which describes the development process. The project should also result in a well-tested, robust program with a graphical user interface which can be used for estimating dates in lineage-linked databases. A webpage that explains what the program does and offers it for download should also be made, as the idea is to make the program freely available.
FOCUS :
- Problem analysis: 20%
- Design: 20%
- Implementation: 20%
- Testing: 20%
- Analysis of result: 20%

BACKGROUND SKILLS :
- Software development skills, including analysis, design and programming skills (not necessarily for a specific tool or language).
- Project skills, including project planning and time management.

SKILLS TO BE ACQUIRED :
- Understanding of the GEDCOM standard which is used to define the lineage-linked databases.
- Knowledge of the chosen programming language.
- Understanding and use of the chosen tools for design, programming and project planning.
- General knowledge of mathematics, including mathematical relaxation.
- Experience in producing a robust and non-trivial program.

REFERENCES :
- D. Hawgood: GEDCOM Data Transfer (3rd edition), Cardon 1999. A book explaining the GEDCOM protocol.
- Edith Chipo Lwanda: Date-estimation in Genealogical Databases. MSc project done on a related topic at University of Newcastle in 1992.

RESOURCES :
The only hardware needed should be a normal PC. A list of the required software cannot be given until the tools to be used have been chosen.

SPECIFICATION :
The aim of this project is to produce a system that scans a lineage-linked database in order to provide estimated years for undated events. (Such databases are used by genealogists for holding family trees and associated information, though these "trees" are more accurately described as acyclic directed graphs.) The database will be given as a GEDCOM file - this is a documented standard ASCII representation used for exporting and importing such databases. The program to be produced should first build and traverse an internal representation of the GEDCOM file, inserting the year range 0000/9999 in all blank year fields. It should then repeatedly traverse the internal representation, applying appropriate logical and heuristic constraints in order to narrow the year ranges, so as to make all the years and year ranges in the file consistent with each other.
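As an illustration of the first traversal, the following is a minimal sketch in Python of how blank year fields might be given the range 0000/9999. The language, the `YearRange` type, the `initial_range` function and the small `recorded` table are all assumptions made for this example; the proposal leaves the language and the internal representation open.

```python
from dataclasses import dataclass
from typing import Optional

# The widest possible range, 0000/9999, as given in the specification.
MIN_YEAR, MAX_YEAR = 0, 9999


@dataclass
class YearRange:
    """Hypothetical internal representation of an event's possible years."""
    earliest: int = MIN_YEAR
    latest: int = MAX_YEAR


def initial_range(year: Optional[int]) -> YearRange:
    """A recorded year becomes a one-year range; a blank year gets 0000/9999."""
    if year is None:
        return YearRange()
    return YearRange(year, year)


# Hypothetical events for one person: event name -> recorded year (None if blank).
recorded = {"birth": 1850, "marriage": None, "death": None}
ranges = {event: initial_range(year) for event, year in recorded.items()}
```

Storing a known year as a range whose two ends are equal means that later traversals can treat dated and undated events in the same way.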
An example of a logical constraint is that a person's death cannot precede his/her marriage. The heuristic constraints are perhaps less obvious, but can include, for example, that people do not live for more than 110 years or have children before they are 10 years old. The numbers used by the heuristic constraints should be options whose values the user can change, but they should have reasonable default values.

The program will aim to perform such traversals until a complete traversal is made during which no year range is further narrowed. (In effect, it will have performed a simple "mathematical relaxation" process.) However, due to errors in the original data in the file, the program might find it impossible to determine a consistent set of year intervals - something which becomes apparent when a year range becomes negative, so to speak, i.e. when its earliest year is later than its latest year. It will have to be decided what the program should do when it encounters an error - whether it can continue the estimation at all, and if so, which data can still be used.

The program is to be written for a PC or a Macintosh, have a good quality user interface, and be validated using the numerous GEDCOM files that are readily available. The requirement is to produce a fully operational and well-documented system, worthy of being used in anger by a genealogist.

The project can be split into a number of main stages - initial stage, design, programming, experimentation, report writing, and final stage. A brief description of what each of these stages should include is given below. The development process will be based on the spiral model for software development.

Initial stage: One task will be to perform research and background reading into necessary standards and technologies, e.g. the GEDCOM standard. Which logical and heuristic constraints to use will have to be decided at this stage, and reasonable default values for the heuristic constraints should be set.
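The constraints and default values to be decided at this stage drive the relaxation process described in the specification. The following is a minimal sketch of that process in Python, under several assumptions: ranges are stored as (earliest, latest) tuples for a single person with one child, every constraint is written as "the later event falls between min_gap and max_gap years after the earlier one", and the names `Limits`, `narrow`, `apply_gap` and `relax` are hypothetical. Only the figures 110 and 10 come from the examples in the text; the choice of constraints is illustrative, not a decided set.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    # Heuristic defaults from the examples in the text; meant to be user-adjustable.
    max_lifespan: int = 110
    min_parent_age: int = 10


class InconsistentRange(Exception):
    """Raised when a year range becomes 'negative' (earliest > latest)."""


def narrow(ranges, key, lo, hi):
    """Intersect ranges[key] with [lo, hi]; return True if it became narrower."""
    old = ranges[key]
    new = (max(old[0], lo), min(old[1], hi))
    if new[0] > new[1]:
        raise InconsistentRange(f"{key}: {new[0]}/{new[1]}")
    ranges[key] = new
    return new != old


def apply_gap(ranges, earlier, later, min_gap, max_gap):
    """Enforce earlier + min_gap <= later <= earlier + max_gap in both directions."""
    e_lo, e_hi = ranges[earlier]
    changed = narrow(ranges, later, e_lo + min_gap, e_hi + max_gap)
    l_lo, l_hi = ranges[later]
    if narrow(ranges, earlier, l_lo - max_gap, l_hi - min_gap):
        changed = True
    return changed


def relax(ranges, limits=None):
    """Repeat full traversals until one narrows nothing; return the pass count."""
    limits = limits or Limits()
    constraints = [
        ("birth", "death", 0, limits.max_lifespan),
        ("birth", "marriage", 0, limits.max_lifespan),
        ("marriage", "death", 0, limits.max_lifespan),  # death cannot precede marriage
        ("birth", "child_birth", limits.min_parent_age, limits.max_lifespan),
    ]
    passes, changed = 0, True
    while changed:
        passes += 1
        changed = False
        for constraint in constraints:
            if apply_gap(ranges, *constraint):
                changed = True
    return passes


# Hypothetical data: marriage and child's birth are dated, birth and death are blank.
person = {"birth": (0, 9999), "marriage": (1875, 1875),
          "death": (0, 9999), "child_birth": (1880, 1880)}
passes = relax(person)
print(passes, person["birth"], person["death"])  # 3 (1770, 1870) (1875, 1980)
```

In this sketch an inconsistency stops the estimation by raising an exception. Whether the program should instead record the error and continue with the remaining data is the open question raised in the specification.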
It should also be decided quite early which platform and programming language/environment to use, and, if necessary, more knowledge about them should be gained. Finally, a detailed project plan for the (rest of the) project should be made sometime during the initial stage.

Design: A design methodology should be chosen, and a detailed design of the resulting application should be carried out. As the task for the program has not been well explored earlier, it would be an advantage to start the implementation and testing quite early in order to discover unexpected problems. Before any implementation starts, there will be a design phase focusing mostly on the basics of the program. After an initial version has been implemented, a new phase will allow for re-design caused by discovered problems (if any), and more detailed design of the remaining parts of the program.

Implementation and testing: The program will have to be implemented using the selected language and environment. A help system should also be made to explain the basic operation of the program and help users with common problems. This stage also includes testing to ensure that the program works as expected. This testing can be carried out on both constructed and real-life data.

Experimentation: The aim in this stage will be to determine how well the program performs on real data, i.e. how much it manages to narrow down the year intervals given a number of real databases. It should also be investigated how often errors occur and, where possible, what caused them.

Documentation: The project is supposed to result in a (roughly) 50-page dissertation, and the main task at this stage will be to determine what to include in the dissertation, and to write it. A user manual and a maintenance manual will also have to be written.
Final stage: The idea is to make the finished program available for free use by genealogists, and one of the tasks in the final stage could be to make a webpage presenting the program and allowing people to download it.

The main milestones for the project will be at the end of each stage, although it will be natural to do some of the tasks in parallel. The provisional dates for the milestones are:
- Initial stage finished: 8/10 1999
- Initial design finished: 29/10 1999
- Initial implementation finished: 26/11 1999
- Final design finished: 17/12 1999
- Implementation and testing finished: 25/2 2000
- Experimentation finished: 31/3 2000
- Documentation finished: 5/5 2000
- Final stage finished: 5/5 2000