PROJECT TITLE       : Date Estimation in Lineage-Linked Databases

NAME                : Vegard Brox

SUPERVISOR          : Brian Randell
SECOND SUPERVISOR   : Nick Rossiter

ABSTRACT :
	The aim of this project is to produce a system that scans a lineage-linked
	database in order to provide estimated years for undated events. Such 
	databases are used by genealogists for holding family trees and associated
	information. The system to be produced should first traverse the database, 
	inserting the year range 0000/9999 in all blank year fields. It should then
	repeatedly traverse the database, applying appropriate logical and 
	heuristic constraints in order to narrow the year ranges, so as to make all
	the year ranges in the database consistent with each other, or report 
	errors if it is not possible to make the year ranges consistent.


AIMS : 
	- Explore how mathematical relaxation can be used to estimate dates in 
	  lineage-linked databases.
	- Analyze how best to process lineage-linked databases.
	- Get a better understanding of the software development process.
	- Get more experience in doing a project.
       
       
OBJECTIVES :

	- Determine reasonable heuristics for estimating dates in a lineage-linked
	  database.
	- Determine internal representation of the database.
	- Develop a program which performs the estimation.
	- Make the program easy to use.
	- Report errors and inconsistencies in the original database to the user.
	- Test how well the program is able to operate on real-life databases.
	- Make the program freely available through a web-page that explains what
	  the program does.
     

DELIVERABLES :
	
	The main deliverable from the project is the dissertation, which describes 
	the development process. The project should also result in a 
	well-tested, robust program with a graphical user interface which can be 
	used for estimating dates in lineage-linked databases. A webpage that 
	explains what the program does and offers it for download should also be 
	made, as the idea is to make the program freely available. 


FOCUS : 

	Problem analysis:		20%
	Design:				20%
	Implementation:			20%
	Testing:			20%
	Analysis of result:		20%


BACKGROUND SKILLS :

	- Software development skills, including analysis, design and programming 
	  skills (not necessarily for a specific tool or language).
	- Project skills, including project planning and time management.

      
SKILLS TO BE ACQUIRED :

	- Understanding of the GEDCOM standard which is used to define the lineage-
	  linked databases.
	- Knowledge of the chosen programming language.
	- Understanding and use of the chosen tools for design, programming and 
	  project planning.
	- General knowledge of mathematics, including mathematical relaxation.
	- Experience in producing a robust and non-trivial program.
    
      
REFERENCES :

	- D. Hawgood: GEDCOM Data Transfer (3rd edition), Cardon 1999.	
	  A book explaining the GEDCOM protocol.
	- Edith Chipo Lwanda: Date-estimation in Genealogical Databases.
	  MSc project done on a related topic at University of Newcastle in 1992.
      

RESOURCES :

	The only hardware needed should be a normal PC. The software required 
	cannot be listed until the tools to be used have been chosen. 


SPECIFICATION :

	The aim of this project is to produce a system that scans a lineage-linked
	database in order to provide estimated years for undated events. (Such 
	databases are used by genealogists for holding family trees and associated 
	information, though these "trees" are more accurately described as acyclic 
	directed graphs.) The database will be given as a GEDCOM file - this is a 
	documented standard ASCII representation used for exporting and importing 
	such databases. 
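
	As a rough illustration of the format, each GEDCOM line consists of a
	level number, an optional cross-reference identifier, a tag and an
	optional value. The Python sketch below splits such lines into fields and
	picks out candidate years from a DATE value. Python is used only for
	illustration, the names GedcomLine, parse_line and years_in are
	hypothetical, and the year extraction is a deliberate simplification that
	ignores calendars and date qualifiers.

```python
import re
from typing import List, NamedTuple, Optional


class GedcomLine(NamedTuple):
    level: int            # nesting depth, 0 for a new record
    xref: Optional[str]   # e.g. "@I1@" on the line that opens a record
    tag: str              # e.g. "INDI", "BIRT", "DATE"
    value: str            # rest of the line, "" if absent


# level, optional @XREF@, tag, optional value
_LINE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?:\s(.*))?$")


def parse_line(text: str) -> GedcomLine:
    """Split one line of a GEDCOM file into its fields."""
    m = _LINE.match(text.rstrip("\r\n"))
    if m is None:
        raise ValueError(f"not a GEDCOM line: {text!r}")
    level, xref, tag, value = m.groups()
    return GedcomLine(int(level), xref, tag, value or "")


def years_in(date_value: str) -> List[int]:
    """Return every 3- or 4-digit number in a DATE value (simplified)."""
    return [int(y) for y in re.findall(r"\b\d{3,4}\b", date_value)]
```

	For example, parse_line("2 DATE 12 MAR 1850") yields level 2, tag DATE
	and value "12 MAR 1850", from which years_in extracts the single year
	1850. A DATE line with no recognisable year would leave the event undated.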
	
	The program to be produced should first build and traverse an internal
	representation of the GEDCOM file, inserting the year range 0000/9999 in 
	all blank year fields. It should then repeatedly traverse the internal 
	representation, applying appropriate logical and heuristic constraints in 
	order to narrow the year ranges, so as to make all the years and year 
	ranges in the file consistent with each other. An example of a logical 
	constraint is that a person's death cannot precede his/her marriage. The heuristic 
	constraints are perhaps less obvious, but can include that people do not 
	live for more than 110 years or have children before they are 10 years 
	old etc. The numbers used by the heuristic constraints should be options 
	whose values the user can change, but they should have reasonable 
	default values.
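
	The constraints above can be sketched as functions that intersect year
	ranges. The following Python fragment is only an illustration under
	stated assumptions: the names YearRange, Options and the three constraint
	functions are hypothetical, and the defaults of 110 and 10 years are
	simply the example figures mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class YearRange:
    lo: int = 0      # earliest possible year (blank fields start at 0000)
    hi: int = 9999   # latest possible year (blank fields end at 9999)

    def narrow(self, lo: Optional[int] = None,
               hi: Optional[int] = None) -> "YearRange":
        """Intersect this range with [lo, hi]; a bound of None is ignored."""
        new_lo = self.lo if lo is None else max(self.lo, lo)
        new_hi = self.hi if hi is None else min(self.hi, hi)
        return YearRange(new_lo, new_hi)


@dataclass
class Options:
    max_lifespan: int = 110    # user-adjustable heuristic limit
    min_parent_age: int = 10   # user-adjustable heuristic limit


def death_not_before_marriage(marriage: YearRange, death: YearRange
                              ) -> Tuple[YearRange, YearRange]:
    """Logical constraint: marriage year <= death year."""
    return marriage.narrow(hi=death.hi), death.narrow(lo=marriage.lo)


def lifespan(birth: YearRange, death: YearRange, opts: Options
             ) -> Tuple[YearRange, YearRange]:
    """Heuristic: birth <= death <= birth + max_lifespan."""
    new_birth = birth.narrow(lo=death.lo - opts.max_lifespan, hi=death.hi)
    new_death = death.narrow(lo=birth.lo, hi=birth.hi + opts.max_lifespan)
    return new_birth, new_death


def parent_age(parent_birth: YearRange, child_birth: YearRange, opts: Options
               ) -> Tuple[YearRange, YearRange]:
    """Heuristic: child born at least min_parent_age years after the parent."""
    new_parent = parent_birth.narrow(hi=child_birth.hi - opts.min_parent_age)
    new_child = child_birth.narrow(lo=parent_birth.lo + opts.min_parent_age)
    return new_parent, new_child
```

	With the default options, a person born in 1800 with a blank death year
	would have the death range narrowed from 0000/9999 to 1800/1910, and a
	parent of a child born in 1850 would have a blank birth year narrowed to
	0000/1840.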
	
	The program will aim to perform such traversals until a complete traversal 
	is made during which no year range is further narrowed. (In effect, it will 
	have performed a simple "mathematical relaxation" process.) However, due to 
	errors in the original data in the file, the program might find it 
	impossible to determine a consistent set of year intervals - something 
	which becomes apparent when a year range becomes negative, so to speak 
	(i.e. its lower bound exceeds its upper bound). It 
	will have to be decided what the program should do when it encounters an 
	error - whether it can continue the estimation at all, and if so, which 
	data can still be used.
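
	The repeated traversal described above can be sketched as a loop that
	runs until a full pass changes nothing. In this Python illustration every
	constraint is reduced to one assumed form, min_gap <= year(b) - year(a)
	<= max_gap, and the names relax, InconsistentData and NO_LIMIT are
	hypothetical. Stopping at the first empty range is just one possible
	policy, since how errors should be handled is still to be decided.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]                   # (earliest year, latest year)
Constraint = Tuple[str, str, int, int]    # (a, b, min_gap, max_gap)

UNKNOWN: Range = (0, 9999)   # range inserted for a blank year field
NO_LIMIT = 9999              # max_gap for a pure "a not after b" ordering


class InconsistentData(Exception):
    """Raised when a range becomes empty (lower bound above upper bound)."""


def relax(ranges: Dict[str, Range],
          constraints: List[Constraint]) -> Dict[str, Range]:
    """Narrow ranges until a complete pass leaves every range unchanged."""
    ranges = dict(ranges)    # work on a copy, leave the input untouched
    changed = True
    while changed:
        changed = False
        for a, b, min_gap, max_gap in constraints:
            a_lo, a_hi = ranges[a]
            b_lo, b_hi = ranges[b]
            # b lies between a + min_gap and a + max_gap ...
            new_b = (max(b_lo, a_lo + min_gap), min(b_hi, a_hi + max_gap))
            # ... so a lies between b - max_gap and b - min_gap.
            new_a = (max(a_lo, new_b[0] - max_gap),
                     min(a_hi, new_b[1] - min_gap))
            for name, new in ((b, new_b), (a, new_a)):
                if new[0] > new[1]:
                    raise InconsistentData(f"{name}: {new[0]}/{new[1]}")
                if new != ranges[name]:
                    ranges[name] = new
                    changed = True
    return ranges
```

	Given a birth in 1800, a marriage in 1825, a blank death year, and the
	constraints ("birth", "death", 0, 110) and ("marriage", "death", 0,
	NO_LIMIT), the death range settles at 1825/1910 after two passes. The loop
	always terminates because the ranges are integer intervals that can only
	shrink.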
	
	The program is to be written for a PC or a Macintosh, have a good quality 
	user interface, and be validated using the numerous GEDCOM files that are 
	readily available. The requirement is to produce a fully operational and 
	well-documented system, worthy of being used in anger by a genealogist. 

	The project can be split into six main stages - initial stage, design, 
	implementation and testing, experimentation, documentation, and final 
	stage. A brief 
	description of what each of these stages should include is given below. The 
	development process will be based on the spiral model for software 
	development.
	
	Initial stage:
	One task will be to perform research and background reading into necessary 
	standards and technologies, e.g. the GEDCOM standard. Which logical and 
	heuristic constraints to use will have to be decided at this stage, and 
	reasonable default values for the heuristic constraints should be set. It 
	should also be decided quite early which platform and programming 
	language/environment to use, and more knowledge about these should be 
	gained if necessary. Finally, a detailed project plan for the (rest of 
	the) project should 
	be made sometime during the initial stage. 
	
	Design:
	A design methodology should be chosen, and a detailed design of the 
	resulting application should be carried out. As the task for the program
	has not been well-explored earlier, it would be an advantage to start the
	implementation and testing quite early in order to discover unexpected
	problems. Before any implementation starts there will be a design phase 
	focusing mostly on the basics of the program. After an initial version has
	been implemented, a new phase will allow for re-design caused by 
	discovered problems (if any), and more detailed design of the remaining 
	parts of the program.
	
	Implementation and testing:
	The program will have to be implemented using the selected language and 
	environment. A help system should also be made for the program to explain 
	the basic operation of the program and help users with common problems. This 
	stage also includes testing to ensure that the program works as expected. 
	This testing can be carried out on both constructed and real-life data.
	
	Experimentation:
	The aim in this stage will be to determine how well the program performs on 
	real data, i.e. how much it manages to narrow down the year intervals given 
	a number of real databases. It should also be investigated how often errors 
	occur and perhaps what caused the errors.
	
	Documentation:
	The project is supposed to result in a (roughly) 50-page dissertation, and 
	the main task at this stage will be to determine what to include in the 
	dissertation, and to write it. A user manual and a maintenance manual 
	will also have to be written.
	
	Final stage:
	The idea is to make the finished program available for free use by 
	genealogists, and one of the tasks in the final stage could be to make a 
	webpage presenting the program and allowing people to download it. 
	
	The main milestones for the project will be at the end of each stage, 
	although it will be natural to do some of the tasks in parallel. The 
	provisional dates for the milestones are:
	- Initial stage finished:				 8/10 1999
	- Initial design finished:				29/10 1999
	- Initial implementation finished:			26/11 1999
	- Final design finished:				17/12 1999
	- Implementation and testing finished:			25/2  2000
	- Experimentation finished:				31/3  2000
	- Documentation finished:				 5/5  2000
	- Final stage finished:					 5/5  2000