South China Morning Post begins digitising a century worth of content

By Korey Lee

South China Morning Post

Hong Kong


Over the past several years, we’ve been working on an internal project code-named “Vintage” to digitise our archives. The process entails several detailed steps and is often manual and time-consuming.

Turning our “dusty” archives into digital artifacts in our data warehouse would enable us to leverage our legacy for a myriad of purposes:

  • Make the history of Hong Kong and China searchable and accessible for educational institutions and research.
  • Increase efficiency and ease of reference within our newsroom.
  • Syndicate content to partners, news agencies, and businesses.
  • Make selected content available to South China Morning Post (SCMP) readers.
  • License archival content to individuals, companies, or institutions for commercial purposes.

The first step is taking the microfilm from the archives and turning it into high-resolution digital scans. We scanned at 300 DPI, although 600 DPI is recommended; higher resolution is better if time and memory allow, particularly when the broadsheet is large format. Because of distortion from wear and tear on the print copy over time, or smudges on the newsprint, small fonts can be difficult to decipher.
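To make the time-and-memory trade-off concrete, here is a rough back-of-the-envelope sketch. The page dimensions and the 8-bit grayscale assumption are illustrative placeholders, not SCMP’s actual scan settings; the point is only that doubling the DPI quadruples the pixel count.

```python
def uncompressed_scan_bytes(width_in: float, height_in: float, dpi: int,
                            bytes_per_pixel: int = 1) -> int:
    """Approximate uncompressed size of one scanned page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Hypothetical broadsheet page size in inches (an assumption for illustration).
PAGE_W, PAGE_H = 15.0, 22.0

size_300 = uncompressed_scan_bytes(PAGE_W, PAGE_H, 300)
size_600 = uncompressed_scan_bytes(PAGE_W, PAGE_H, 600)

print(f"300 DPI: {size_300 / 1e6:.1f} MB per page")
print(f"600 DPI: {size_600 / 1e6:.1f} MB per page")
print(f"600 DPI needs {size_600 / size_300:.0f}x the storage of 300 DPI")
```

Real scans are usually compressed, so the absolute numbers will differ, but the fourfold ratio between the two resolutions holds for the raw pixel data.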

A scan via microfilm from SCMP’s first day of publication on November 6, 1903.

Once the high-resolution scans are completed, we need to transform these scans into text via OCR (optical character recognition) so we can begin mapping each article into a semi-structured or structured format. We did so with XML (extensible markup language) since it’s both human- and machine-readable.

A sample of XML output from the OCR process.

As you can see, the mapping has some inconsistencies and requires further cleaning and transformation, including the removal of extra spaces, special characters, and erroneous letters.
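The kind of cleanup described above might look something like the following minimal sketch. The function name `clean_ocr_text`, the allow-list of characters, and the sample string are all assumptions for illustration; a real pipeline would tune these rules against the actual OCR output.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """Apply a few simple, rule-based fixes to raw OCR text."""
    # Normalise Unicode compatibility forms (e.g. the ligature "ﬁ" becomes "fi").
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split by a hyphen at a line break ("Gover-\nnment").
    # Simplification: this also joins genuine hyphenated compounds that wrap.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Replace anything outside an allow-list of characters with a space.
    text = re.sub(r"[^\w\s.,;:!?'\"()&$%/-]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# Made-up sample imitating typical OCR noise.
sample = "The  Gover-\nnment   of Hong~Kong |  announced ^ today .."
print(clean_ocr_text(sample))
```

Rules like these handle stray symbols and spacing; erroneous letters (for example "rn" misread as "m") generally need dictionary- or model-based correction, which this sketch does not attempt.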

The final step in the process is to convert that text into structured data and transfer it to our data warehouse.
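As a rough illustration of this step, the sketch below parses OCR-style XML into typed records that could then be loaded into a warehouse table. The element names (`page`, `article`, `date`, `headline`, `body`), the `ArticleRecord` class, and the sample content are hypothetical; the actual schema of the OCR output is not shown here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date


@dataclass
class ArticleRecord:
    published: date
    headline: str
    body: str


def parse_articles(xml_text: str) -> list[ArticleRecord]:
    """Turn an XML document into a list of structured article records."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for node in root.iter("article"):
        records.append(ArticleRecord(
            # Raises ValueError if the date is missing or not ISO formatted.
            published=date.fromisoformat(node.findtext("date", default="").strip()),
            # split()/join normalises internal whitespace left over from OCR.
            headline=" ".join(node.findtext("headline", default="").split()),
            body=" ".join(node.findtext("body", default="").split()),
        ))
    return records


# Made-up sample document using the hypothetical schema.
SAMPLE = """
<page>
  <article>
    <date>1903-11-06</date>
    <headline>  Shipping   Intelligence </headline>
    <body>Arrivals and departures
      at the harbour.</body>
  </article>
</page>
"""

for record in parse_articles(SAMPLE):
    print(record)
```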

In the past few months, our data engineering team has taken a century of our historical archives and transformed it into structured data, which is now in our data warehouse. We took a look at the archives and found some interesting insights.

Plotting our average article output per week, we see a small dip during World War I and then a substantial drop during 1941-1945, when the Japanese occupied Hong Kong during World War II. However, the SCMP continued to grow its volume of coverage over time through the 1970s into the late 1990s.
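A weekly-output figure like the one described above could be computed along these lines. This is a minimal sketch using only the standard library; the function name and the sample dates are assumptions, and in practice this would more likely be a query against the warehouse.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import mean


def average_weekly_output(pub_dates):
    """Map each ISO year to the mean article count per week.

    Only weeks with at least one article are counted, so weeks with no
    surviving output do not pull the average down.
    """
    per_week = Counter()
    for d in pub_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, weekday)
        per_week[(iso[0], iso[1])] += 1

    by_year = defaultdict(list)
    for (year, _week), count in per_week.items():
        by_year[year].append(count)

    return {year: mean(counts) for year, counts in sorted(by_year.items())}


# Made-up publication dates, one per article.
dates = [date(1903, 11, 6), date(1903, 11, 7), date(1903, 11, 9), date(1904, 1, 4)]
print(average_weekly_output(dates))
```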

Once this data is available in our warehouse, it enables us to run various NLP (natural language processing) models against it, including sentiment analysis, readability scoring, keyword tagging, and topic analysis. However, archival content poses particular challenges: the news cycle is ever-changing alongside the world we live in, so training an algorithm across a century’s worth of topics is difficult.

Using unsupervised keyword tagging to count recurring words (excluding “stop words” like “the,” “is,” and “and”) may be a more effective approach to extracting recurring themes over time in our content. After doing this we find, not surprisingly, that our top keywords include “China,” “Hong Kong,” “British,” “Chinese,” “government,” and “police.”
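In its simplest form, counting recurring words while excluding stop words can be sketched as below. This is a plain term-frequency count, not necessarily the method used on the archive; the stop-word list, the function name `top_keywords`, and the sample sentences are assumptions.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "in", "to", "on", "for", "by", "at", "was"}


def top_keywords(articles, n=5):
    """Return the n most frequent non-stop-word tokens across all articles."""
    counts = Counter()
    for text in articles:
        # Lowercase and keep alphabetic tokens (with an optional apostrophe part).
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample texts.
articles = [
    "The police in China reported to the government.",
    "The government of China and the police.",
]
print(top_keywords(articles, 3))
```

Because this tokenises on single words, a multi-word term such as “Hong Kong” would be counted as two separate tokens; capturing it as one keyword would need phrase detection or n-gram counting on top of this.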

Looking at the keyword “China,” we see that its usage on a per-article basis has fluctuated over time. As we have only recently completed importing this data into our warehouse, we are just starting to scratch the surface of the insights that digitising a century of historical news coverage can reveal.
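A per-article usage measure of this kind can be sketched as the mean number of keyword occurrences per article, grouped by year. The function name, the input shape of `(year, text)` pairs, and the sample data are assumptions for illustration.

```python
import re
from collections import defaultdict


def keyword_rate_by_year(articles, keyword):
    """articles: iterable of (year, text). Returns {year: mean occurrences per article}."""
    # Whole-word, case-insensitive match so "China" does not match inside other words.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    totals = defaultdict(lambda: [0, 0])  # year -> [occurrences, article count]
    for year, text in articles:
        totals[year][0] += len(pattern.findall(text))
        totals[year][1] += 1
    return {year: hits / n for year, (hits, n) in sorted(totals.items())}


# Made-up sample articles.
sample_articles = [
    (1911, "China and China."),
    (1911, "No mention here."),
    (1949, "China, China, China."),
]
print(keyword_rate_by_year(sample_articles, "China"))
```

Normalising by article count matters here because total output varies so much from year to year; raw counts would mostly reflect how many articles were published.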

By bringing historical perspectives alive through the infusion of today’s data technology, we look forward to revealing more insights, news findings, and learnings in the near future.

