Census Records Transcription

Spokeo transcribed 200 million handwritten census records for its people search engine.

Building a Universal People Search Engine

Founded in 2006, Spokeo is a leading people search platform that is on a mission to reunite friends and family and empower users to discover information about their own online footprints.

Using a proprietary technology to organize information, Spokeo allows users to search for a name, address, email or phone number and find comprehensive, easy-to-understand online profiles for millions of people around the world. As Spokeo CEO Harrison Tang explains, “People come to us to find data they can’t find anywhere else. There are a variety of use cases, from people looking for information about their long-lost family members, best friends, or old classmates to companies looking to verify their customers’ identities.”

“We’re a mission-driven company,” Tang says. “We believe in helping people find and connect with others. To that end, we’ve successfully organized more than 12 billion records. It’s all part of what we call the ‘universal people search initiative,’ and the idea is to organize all of these records (whether they’re regular people, historical figures, or people who have passed away) and bring them together so our customers can find the most comprehensive information out there.”

The Challenge of Transcribing 200 Million Handwritten Records

In 2012, the Spokeo team began looking for ways to incorporate historical U.S. census data from 1790 to 1940 into its platform. Originally, the product team tried using Optical Character Recognition (OCR) technology but quickly discovered it wasn’t an adequate solution.

“We tried different OCR solutions, but the handwriting is so unique from decade to decade,” says Tang. “A person’s handwriting in the 1790s is so different from a person’s handwriting today that OCR just couldn’t do it. We had two engineers evaluate different solutions and even thought about building a solution internally, but at the end of the day, OCR wasn’t working. From that point, we determined crowdsourcing was really the way we needed to go.”

Finding the Right Workforce Virtualization Partner

Spokeo evaluated a variety of options to crowdsource its census data transcription, including Amazon Mechanical Turk.

Ultimately, however, Spokeo decided to partner with CloudFactory, betting on its on-demand workforce and scalable tech platform to deliver cost-effective, accurate and fast results. Says Tang, “We knew we needed to work with a partner that had both the technology and the human resources capabilities to tackle the scale of the project. Of course, cost was important to us, and the cost CloudFactory offers is competitive. Once we really evaluated the CloudFactory solution—the technology component that manages all of these global workers and the human component that trains and empowers people to do the tasks at hand—we realized that we really could tackle this problem at scale. And with 200 million unique records, it’s not an easy problem to solve.”

Launching an Epic Census Transcription Project

By 2013, CloudFactory had built an infrastructure and foundation that allowed it to ramp up fast depending on the nature and size of its clients’ projects. Yet the Spokeo project was larger in scope than any the CloudFactory team had encountered thus far. Prior to its partnership with Spokeo, the CloudFactory office in Nepal employed 430 workers. Given the scope of the census transcription project, the CloudFactory team needed to scale...fast. That process proved to be a turning point in the company’s growth.

John Snowden, then VP of Solutions at CloudFactory and based in Nepal, watched the CloudFactory workforce grow from 589 to over 3,000 in five months. “We did several recruiting events and had almost 700 people show up at an early event. At one point we rented a movie theater and filled the entire theater. It was obvious that people wanted to work, and we had an opportunity for them to gain some digital-age skills in the process.”

On March 11, 2013, CloudFactory kicked off the census transcription project, which ran for fifteen months. Throughout, the CloudFactory team focused on delivering results quickly and with high accuracy. “When we started out, we were delivering batches on a weekly basis,” Snowden says. “At the start, we delivered around 700,000 records a week, but there were later weeks when we were delivering ten or eleven million records a week.”

Human Intelligence + Technology = Superior Transcription Speed and Accuracy

To receive images and return transcriptions, the CloudFactory team set up a bidirectional API. Says Snowden, “The input was a photo of a page of a census sheet that had been digitized from the U.S. archives. Each sheet included header data and up to 50 individual names of people from the census. The output was a census name: first name, last name, and age. So, we would deliver one name as one unit. Spokeo sent us the inputs via the API, and we delivered the outputs that mapped to an API endpoint in Spokeo’s system.”

“Speed was important, but it wasn’t the type of speed requirement we often see from other clients who want a turnaround in minutes or hours,” says Snowden.
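The input and output units Snowden describes (one census sheet in, one name record out, up to 50 names per sheet) could be modeled roughly as follows. This is only an illustrative sketch: the class names, field names, `validate_sheet_output`, and the age bound are all hypothetical and are not taken from the actual Spokeo or CloudFactory API.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_NAMES_PER_SHEET = 50   # from the text: "up to 50 individual names"
MAX_PLAUSIBLE_AGE = 130    # assumption: sanity bound, not from the source


@dataclass(frozen=True)
class CensusSheet:
    """Input unit: one digitized census page (hypothetical fields)."""
    sheet_id: str
    image_url: str


@dataclass(frozen=True)
class NameRecord:
    """Output unit: one transcribed person from a sheet (hypothetical fields)."""
    sheet_id: str
    line_number: int           # position of the person on the sheet, 1-based
    first_name: str
    last_name: str
    age: Optional[int]         # None when the age is illegible


def validate_sheet_output(sheet: CensusSheet, records: List[NameRecord]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if len(records) > MAX_NAMES_PER_SHEET:
        problems.append(f"{len(records)} records exceed the per-sheet limit")
    seen_lines = set()
    for rec in records:
        if rec.sheet_id != sheet.sheet_id:
            problems.append(f"line {rec.line_number}: belongs to sheet {rec.sheet_id}")
        if not 1 <= rec.line_number <= MAX_NAMES_PER_SHEET:
            problems.append(f"line {rec.line_number}: out of range")
        elif rec.line_number in seen_lines:
            problems.append(f"line {rec.line_number}: duplicate line")
        seen_lines.add(rec.line_number)
        if not (rec.first_name.strip() or rec.last_name.strip()):
            problems.append(f"line {rec.line_number}: empty name")
        if rec.age is not None and not 0 <= rec.age <= MAX_PLAUSIBLE_AGE:
            problems.append(f"line {rec.line_number}: implausible age {rec.age}")
    return problems
```

Treating each name as its own record, as the quote describes, means a single illegible line can be flagged or redone without holding back the rest of the sheet.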

“We had a project manager sampling the accuracy. Our accuracy standard was around 90%, and at the end of the day, CloudFactory was able to transcribe the data in a very, very accurate fashion.”
- Harrison Tang, Spokeo CEO

“In this case, it was ‘Here are 400 million records; we need them by May.’ Accuracy was the real challenge. To determine the accuracy, we broke the records up into individual names and had our data science team pull out batches to analyze. Accuracy was around 93% at the start of the project, and by the end it was around 95%.”
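Checking accuracy by pulling batches of individual names, as described above, amounts to estimating a proportion from a sample. The source does not say how the batches were drawn or scored, so the sketch below makes its own assumptions: a simple random sample, a caller-supplied `is_correct` check (for example, comparison against a reviewer's transcription), and a Wilson score interval to express the uncertainty. The function name and parameters are hypothetical.

```python
import math
import random
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_accuracy(
    records: Sequence[T],
    is_correct: Callable[[T], bool],
    sample_size: int,
    seed: int = 0,
    z: float = 1.96,  # roughly a 95% confidence level
) -> Tuple[float, float, float]:
    """Estimate accuracy from a random sample.

    Returns (observed accuracy, lower bound, upper bound), where the
    bounds come from the Wilson score interval.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(records), min(sample_size, len(records)))
    n = len(sample)
    if n == 0:
        raise ValueError("cannot estimate accuracy from an empty sample")

    correct = sum(1 for rec in sample if is_correct(rec))
    p = correct / n

    # Wilson score interval for a binomial proportion.
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)
```

The interval narrows as the sample grows, which helps in deciding how many names to review before concluding that accuracy has cleared a target such as the 90% standard mentioned in the pull quote.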

As one of CloudFactory’s first large-scale projects, the Spokeo initiative was a milestone in the company’s self-assessment process. “It gave us an incredible confidence in our product,” Snowden says. “We increased our confidence so much in our ability to find workers to do the work, to onboard them expediently, and to train them on a very niche task at scale. Transcribing handwritten names from 1920s America is not the same as transcribing handwritten names today. We knew that if we could do that, we could do a lot of different things. In terms of serving clients, we learned we could meet client needs while massively scaling our operations. Having 3,000+ workers working seven to ten hours a week and never hitting any true bottlenecks showed us we could handle this level of scale.”

Leveraging Census Data to Become the People Search Leader

The ability to incorporate historical census data into its product has been a huge asset to Spokeo. As one of the fastest-growing companies in universal people search, Spokeo now has the most comprehensive, accurate, up-to-date information online.

"In terms of impacting our business, this takes us beyond a traditional people search engine. The fact that we could overcome this huge technical challenge shows us and other people that we’re not afraid to be innovative, break down walls and do something people haven’t done before. It also speaks to what Spokeo stands for. We’re primarily a mission-driven company, and this is the type of project that speaks to our focus and shows that we are accomplishing our mission," says Tang.

Spokeo Census by the Numbers

[Infographic: Spokeo census project by the numbers]
