Author: Gary Spikeberg
Once we had run all of the directories through OCR, we had a text file containing every entry from each directory. However, to actually put this data into the Time Traveler, we needed to do a little more work: we had to break each person’s entry into its component parts and isolate their address. We could have done this by hand, but we found that most entries across all of the directories had three basic parts: the person’s name, their profession, and where they lived. Because these components were separated by commas, it was easy to write a program that would recognize the different parts of each entry and “parse” them. The program had to be tweaked as we went, because the directories were slightly different each year, and it also cleaned up all of those extra “specks” cluttering our data. When all was said and done, we had parsed over 86,000 entries for people living and working in the Keweenaw between 1888 and 1939!
The last step was to take all of those parsed entries and run them through a process called geocoding. Fellow Time Traveler Daniel Trepal has written an informative blog post about geocoding. To give a brief explanation, though: geocoding is the process of assigning each person we could find a place on the map. We matched the addresses from the city directories to the addresses we recorded on our collection of historical maps, which let us link each person directly to the corresponding building in the Time Traveler. This process wasn’t perfect: of our roughly 86,000 directory entries, we were able to map about 74,500 (a little under 87%). However, compared with similar projects, these are actually really good results. It really speaks to the care those historical map makers took when creating the maps we use every day here at the Keweenaw Time Traveler.
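For readers curious what the comma-based parsing and address matching described above might look like in practice, here is a minimal Python sketch. It is not the program we actually used. The function names (`parse_entry`, `normalize_address`, `geocode`), the speck-cleaning rule, the sample entry, and the building lookup table are all hypothetical and chosen only for illustration.

```python
import re


def parse_entry(line):
    """Split one directory line into name, profession, and address.

    Assumes the three parts are separated by commas, as most entries were.
    Returns None for lines that do not fit that pattern.
    """
    # Hypothetical cleanup rule for OCR "specks": keep only letters, digits,
    # spaces, and a few common punctuation marks.
    cleaned = re.sub(r"[^A-Za-z0-9 ,.'&-]", "", line)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()

    # Split on the first two commas only, so an address that itself
    # contains a comma stays in one piece.
    parts = [part.strip() for part in cleaned.split(",", 2)]
    if len(parts) != 3 or not all(parts):
        return None

    name, profession, address = parts
    return {"name": name, "profession": profession, "address": address}


def normalize_address(address):
    """Lowercase an address and drop punctuation so small differences still match."""
    return re.sub(r"[^a-z0-9 ]", "", address.lower()).strip()


def geocode(entries, buildings):
    """Attach a building ID to every entry whose address appears in `buildings`.

    `buildings` is a hypothetical lookup from normalized address to building ID,
    standing in for the addresses recorded from the historical maps.
    """
    matched = []
    for entry in entries:
        key = normalize_address(entry["address"])
        if key in buildings:
            matched.append({**entry, "building_id": buildings[key]})
    return matched


# Made-up sample data, not taken from a real directory.
sample_line = "Maki John*, miner, ~214 Pine St."
sample_buildings = {"214 pine st": "bldg-001"}

entry = parse_entry(sample_line)
mapped = geocode([entry], sample_buildings)
```

A simple exact-match lookup like this is only one way to link addresses to buildings. Real directories vary enough from year to year that the parsing and matching rules would need the same kind of tweaking described above.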
Thank you for reading this brief look at how we created one of our longstanding datasets for the Time Traveler. I can’t wait for you all to see what’s coming next!
2 Comments
Steve Mintz
5/18/2022 10:15:57 am
Gary. This is a fascinating explanation of the work required just to get directory information from xeroxed, non-standard print into a useful state for KeTT. Although I am far from an expert, I am shocked you could even get an 87% hit rate. Well done KeTT team!
Matt Kievit
5/31/2022 03:43:05 pm
Are all of the Polk City directories currently active in the Time Traveler (e.g., prior to the June 1st re-launch)? It seems as though my family is basically missing prior to 1917, even though we know they were there.