Indexing Go

Evidence Tier: DOCUMENTED

Published in academic literature

For: Researchers & Academics, General Public & Enthusiasts

App Summary

Indexing Go is a genealogical research tool that uses handwriting recognition to automatically index historical census records and improve their accuracy for family history researchers. The app's scientific basis is a deep learning model that combines a convolutional neural network with a Long-Short-Term-Memory (LSTM) network, trained on a dataset of 2.4 billion labeled images from the 1940 U.S. Census. The associated research reports that this technology can correct mistakes made by the original human indexers and expand the number of searchable fields, improving the quality of historical data for research.

Detailed Description

Functionality & Mechanism

Indexing Go is a mobile data contribution tool for crowdsourced transcription of historical census records. The interface presents users with digitized images of individual cells from census documents, such as a name or an occupation, and captures the transcriptions they enter for the handwritten text. This human-generated data serves as a training and validation set for a handwriting recognition algorithm that combines a convolutional neural network (CNN) with a Long-Short-Term-Memory (LSTM) network to automate large-scale indexing.

Evidence & Research Context

  • The associated research details the underlying algorithm, which integrates a convolutional neural network with a Long-Short-Term-Memory (LSTM) network for handwriting recognition.
  • The system's design leverages a training dataset of 2.4 billion labeled sub-images derived from the 1940 U.S. Census.
  • A pilot application of the algorithm on a 1930 census dataset demonstrated a character error rate (CER) of 10.4% for names.
  • To enhance accuracy, the system incorporates data from the FamilySearch Family Tree to correct transcription errors and identify alternative name spellings.
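
The character error rate cited above can be illustrated with a short sketch. The listing does not say how the study computed CER, so this assumes the common definition: total character edit distance between reference and predicted transcriptions, divided by the total number of reference characters. The function names and the sample names are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(pairs):
    """CER over (reference, prediction) pairs, under the assumed definition."""
    total_edits = sum(levenshtein(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        raise ValueError("no reference characters to score")
    return total_edits / total_chars


# Hypothetical (reference, model output) name pairs, for illustration only.
sample = [("Johnson", "Johnsen"), ("Mary", "Mary"), ("Clement", "Clemant")]
cer = character_error_rate(sample)
print(f"CER = {cer:.1%}")
```

Under this definition, a CER of 10.4% for names corresponds to roughly one character edit per ten reference characters.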

Intended Use & Scope

This application is designed for volunteers, genealogists, and family history researchers as a data contribution and verification platform. Its primary utility is to improve the accuracy and completeness of large-scale digital census archives. The tool does not function as a genealogical search engine; it is intended exclusively for performing transcription and indexing micro-tasks.

Studies & Publications

1 publication

Peer-reviewed research associated with this app.

Development/Design Paper

Using Hand-Writing Recognition to Auto Index the US Census Records

Clement et al. (2019) · SSHA Annual Meeting

Describes the research-driven development of this app
Recent breakthroughs in handwriting recognition have the capability to improve the quality of the 1940 Census data and expand the set of fields that are available to use for research. Our hand-writing recognition algorithm uses new data augmentation and normalization methods applied to a convolutional neural network that feeds into a Long-Short-Term-Memory (LSTM) network. We also have a unique advantage in having access to a training set of unprecedented size. Census records consist of a set of rows for each person and columns for each of the fields of information for that person. We've developed an algorithm to extract the sub-image in each cell of the census record and match these with the indexed data for that cell. This provides us with a labeled training set of 2.4 billion images from the 1940 census (18 fields x 132 million individuals). We are using our algorithm to re-index the 1940 census, fix mistakes made by the original human indexers, and expand the number of fields that are indexed. We conducted a pilot study on the 1930 census using a small training set and have already achieved a character error rate (CER) of 10.4% for names. We also make use of the FamilySearch Family Tree, a crowdsourced genealogical database which includes a substantial number of individuals linked to the 1940 census. These sources have often been attached to the Family Tree by family members who have access to additional information about these people, which improves the accuracy of the linkages to these sources. We use information from these sources to correct mistakes in the index of the 1940 census and identify alternative name spellings and nicknames for the individual.
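
The pairing of cell sub-images with indexed values described in the abstract can be sketched roughly as follows. The paper's actual extraction method is not given here, so this assumes the row and column boundaries of the form have already been located; the names LabeledCell and label_cells, and the pixel coordinates, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledCell:
    row: int       # person row on the census page
    field: str     # column name, e.g. "name"
    box: tuple     # (left, top, right, bottom) in pixels
    label: str     # existing human-indexed transcription


def label_cells(row_edges, col_edges, fields, index_rows):
    """Pair each cell's bounding box with its indexed value.

    row_edges / col_edges are assumed, pre-detected grid line positions;
    index_rows holds one dict (field -> transcription) per person row.
    """
    cells = []
    for r, record in enumerate(index_rows):
        top, bottom = row_edges[r], row_edges[r + 1]
        for c, field in enumerate(fields):
            label = record.get(field)
            if not label:
                continue  # unindexed cell: no label to train on
            box = (col_edges[c], top, col_edges[c + 1], bottom)
            cells.append(LabeledCell(r, field, box, label))
    return cells


# Hypothetical two-person, two-field page.
cells = label_cells(
    row_edges=[0, 40, 80],
    col_edges=[0, 300, 500],
    fields=["name", "occupation"],
    index_rows=[{"name": "Mary Smith", "occupation": "Teacher"},
                {"name": "John Smith", "occupation": "Farmer"}],
)

# Size check for the figure quoted in the abstract.
total_images = 18 * 132_000_000
print(len(cells), total_images)
```

The product 18 x 132 million is 2,376,000,000, which rounds to the 2.4 billion images quoted above.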

Indexing Go

Free