Keweenaw Time Traveler

Illuminating Life in the City - The City Directory Dataset

5/17/2022

Author: Gary Spikeberg
While the upcoming relaunch events will focus largely on the new Explorer App and our newly processed datasets, such as the US Census, I wanted to take some time to talk about one of the datasets returning Time Travelers might be quite familiar with: the City Directories. In fact, the very first job I did for the Time Traveler after I was hired in the spring of 2016 was scanning, page by page, all of the Polk City Directories available to us in the Michigan Tech Archives. The directories we were able to map covered Calumet & Laurium, Houghton & Hancock, Dollar Bay, Hubbell, and Lake Linden for various years between 1888 and 1939.

The plan was to run all of these directories through a process called Optical Character Recognition (OCR), in which a computer program “looks” at the digitized pages and reads the text on them. However, as you can see from this example, there is a lot of “stuff” on these pages that isn’t the list of names, professions, and addresses that we needed. The simplest way to isolate that information was to crop each page so that only the list of names was left.
Picture
Picture
If you look closely, however, you will notice specks dotted around the page, leftovers from when the original directories were photocopied into the volumes held at the archives (though some originals survive). The OCR program “read” these specks and transcribed them just as it did the entries, but we dealt with them at a later step in the process. Another problem that made the OCR process more difficult was that a physical book doesn’t lie perfectly flat when you scan it. Text is skewed or sometimes warped depending on where that page sat in the book. To the computer it’s like trying to read something upside down without knowing it isn’t the right way up. Fortunately, the program we used for the OCR, ABBYY FineReader, had a tool that could both crop images and correct skewed text at the same time, streamlining the processing we needed to do. By surrounding the text and aligning the grid to where the text should be aligned, we could ensure that the computer had the best possible chance of giving us a good reading. You can also see that advertisements appeared in the margins of the pages, just like in a modern phone book; these would also confuse the OCR when printed sideways, as pictured here.
Once we had run all of the directories through OCR, we had a text file of all of the entries from each directory. To actually put this data into the Time Traveler, though, we needed to do a little more work: we had to break each person’s entry into its component parts and isolate just their address. We could have done this by hand, but we found that most people across all of the directories had three basic parts to their entry: their name, their profession, and where they lived. Because these components were separated by commas, it was very easy to write a program that would recognize the different parts of each entry and “parse” them. This program had to be tweaked as we went, because the directories were slightly different each year, and it also cleaned up all of those extra “specks” cluttering up our data. When all was said and done, we had parsed over 86,000 entries for people living and working in the Keweenaw between 1888 and 1939!
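For readers curious what that kind of parsing might look like, here is a minimal sketch in Python. It is not the program we actually used: the function name, the three-letter speck threshold, and the sample lines are all made up for illustration, and it assumes every usable entry has exactly three comma-separated parts.

```python
import re

def parse_entry(line):
    """Split one OCR'd directory line into name, profession, and address.

    Hypothetical layout assumed here: "Name, profession, address".
    Returns None for lines that do not look like a real entry, such as
    stray specks the OCR transcribed as punctuation.
    """
    text = line.strip()
    # Assumed rule of thumb: a line with fewer than three letters is a speck.
    if len(re.findall(r"[A-Za-z]", text)) < 3:
        return None
    # Split on commas and trim spaces and stray marks from each part.
    parts = [p.strip(" .·'`") for p in text.split(",")]
    parts = [p for p in parts if p]
    if len(parts) != 3:
        return None  # unusual entries are left for manual review
    name, profession, address = parts
    return {"name": name, "profession": profession, "address": address}

# Invented sample lines, not taken from a real directory.
sample_lines = [
    "Aho John, miner, 412 Pine",
    ". ' ,",
    "Beck Mary, teacher, 208 5th",
]
parsed = [e for e in map(parse_entry, sample_lines) if e is not None]
```

In this sketch the speck-only line is dropped and the other two lines come back as dictionaries. A real directory would need extra rules for each year's quirks, which is the kind of tweaking described above.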
The last step was to take all of those parsed entries and run them through a process called geocoding. Fellow Time Traveler Daniel Trepal has written an informative blog post about geocoding. To give a brief explanation, though, geocoding is the process of assigning a place on the map to everyone we could possibly find. We were able to match the addresses from the city directories to the addresses we recorded on our collection of historical maps, and directly link each person to that same building in the Time Traveler. This process wasn’t perfect: of our roughly 86,000 directory entries, we were able to map about 74,500 (a little under 87%). Compared with similar projects, however, these are really good results. It really speaks to the care those historical map makers took when creating the maps we use every day here at the Keweenaw Time Traveler.
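As a rough illustration of the matching idea, the sketch below looks up each parsed address in a table of map addresses and then works out the match rate. It is a simplified stand-in, not our actual geocoding workflow: the function names, the building IDs, and the normalization rule are assumptions made for this example.

```python
def normalize(address):
    """Lowercase, drop periods, and collapse spacing so near-identical
    spellings of an address compare equal (an assumed, simplistic rule)."""
    return " ".join(address.lower().replace(".", " ").split())

def geocode(entries, map_buildings):
    """Link each entry to a building whose map address matches exactly
    after normalization. map_buildings maps address -> building ID."""
    lookup = {normalize(addr): bid for addr, bid in map_buildings.items()}
    matched, unmatched = [], []
    for entry in entries:
        building = lookup.get(normalize(entry["address"]))
        if building is None:
            unmatched.append(entry)
        else:
            matched.append({**entry, "building_id": building})
    return matched, unmatched

def match_rate(matched_count, total_count):
    """Percentage of entries that were placed on the map."""
    return 100.0 * matched_count / total_count

# The rounded figures quoted above: about 74,500 of roughly 86,000 entries.
rate = match_rate(74_500, 86_000)
```

With those rounded figures the rate works out to between 86.6% and 86.7%, which is where "a little under 87%" comes from.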
Thank you for reading this brief look at how we created one of our longstanding datasets for the Time Traveler. I can’t wait for you all to see what’s coming next!
2 Comments
Steve Mintz
5/18/2022 10:15:57 am

Gary. This is a fascinating explanation of the work required just to get directory information from xeroxed, non-standard print into a useful state for KeTT. Although I am far from an expert, I am shocked you could even get an 87% hit rate. Well done KeTT team!

Matt Kievit
5/31/2022 03:43:05 pm

Are all of the Polk City directories currently active in the Time Traveler (e.g., prior to the June 1st re-launch)? It seems as though my family is basically missing prior to 1917, even though we know they were there.
