Over the past several years, we’ve been working on an internal project code-named “Vintage” to digitise our archives. The process involves several detailed steps and is often manual and time-consuming.
Turning our “dusty” archives into digital artifacts in our data warehouse would enable us to leverage our legacy for a myriad of purposes:
- Make the history of Hong Kong and China searchable and accessible to educational institutions and researchers.
- Increase efficiency and ease of reference for our newsroom.
- Syndicate content to partners, news agencies, and businesses.
- Make selected content available to South China Morning Post (SCMP) readers.
- License archival content to individuals, companies, or institutions for commercial purposes.
The first step is taking the microfilm from the archives and turning it into high-resolution digital scans. We scanned at 300 DPI, although 600 DPI is generally recommended; higher resolution is better, subject to time and memory constraints, particularly for large-format broadsheets. Wear and tear on the print copy over time, or smudges on the newsprint, can distort the image and make small fonts difficult to decipher.
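To make the time and memory trade-off concrete, here is a rough sketch of how scan size grows with resolution. The page dimensions (15 x 22.75 inches) and the 8-bit greyscale depth are illustrative assumptions, not figures from our archive, and `scan_size` is a hypothetical helper.

```python
def scan_size(width_in: float, height_in: float, dpi: int, bytes_per_pixel: int = 1):
    """Return (width_px, height_px, uncompressed_bytes) for a page scanned at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# Assumed broadsheet page size in inches; 1 byte per pixel = 8-bit greyscale.
PAGE_WIDTH_IN, PAGE_HEIGHT_IN = 15, 22.75

for dpi in (300, 600):
    w, h, size = scan_size(PAGE_WIDTH_IN, PAGE_HEIGHT_IN, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {size / 1e6:.0f} MB uncompressed")
```

Doubling the resolution doubles both pixel dimensions, so the uncompressed size of each page roughly quadruples.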
Once the high-resolution scans are completed, we need to transform them into text via OCR (optical character recognition) so we can begin mapping each article into a semi-structured or structured format. We did so with XML (Extensible Markup Language) since it is both human- and machine-readable.
The mapping has some inconsistencies and requires further cleaning and transformation, such as the removal of extra spaces, special characters, and erroneous letters.
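A minimal sketch of this kind of cleaning, using only Python's standard library. The function name `clean_ocr_text` and the set of characters treated as legitimate are assumptions for illustration; fixing erroneous letters usually needs a dictionary or language model and is not attempted here.

```python
import re
import unicodedata

# Assumption: anything outside letters, digits, whitespace and this
# punctuation set is treated as OCR noise.
NOISE = re.compile(r"""[^\w\s.,;:!?'"()&$%/-]""")
WHITESPACE = re.compile(r"\s+")


def clean_ocr_text(raw: str) -> str:
    """Normalise Unicode forms, drop stray symbols, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = NOISE.sub(" ", text)       # replace special characters with a space
    text = WHITESPACE.sub(" ", text)  # collapse runs of spaces, tabs, newlines
    return text.strip()


# Made-up example of noisy OCR output.
print(clean_ocr_text("HONG   KONG,  Jan. 5 ~~ The Governor |  arrived  today.§"))
```

Replacing noise with a space rather than deleting it avoids accidentally gluing two words together; the whitespace pass then tidies up the result.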
The final step in the process is to convert that text into structured data and transfer it to our data warehouse.
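As an illustration of this conversion, the sketch below parses one article from XML into a typed record that could be loaded as a warehouse row. The tag names (`article`, `date`, `page`, `headline`, `body`), the `ArticleRecord` fields, and the sample article are all hypothetical; the actual schema is not shown here.

```python
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class ArticleRecord:
    published: date
    page: int
    headline: str
    body: str


def parse_article(xml_text: str) -> ArticleRecord:
    """Turn one <article> element (assumed schema) into a structured record."""
    root = ET.fromstring(xml_text)
    return ArticleRecord(
        published=date.fromisoformat(root.findtext("date", "").strip()),
        page=int(root.findtext("page", "0")),
        # split/join collapses line breaks and repeated spaces left by OCR
        headline=" ".join(root.findtext("headline", "").split()),
        body=" ".join(root.findtext("body", "").split()),
    )


# Made-up sample article for illustration only.
sample = """<article>
  <date>1925-06-19</date>
  <page>3</page>
  <headline>  Harbour   traffic resumes </headline>
  <body>Shipping in the harbour
     resumed yesterday.</body>
</article>"""

record = parse_article(sample)
print(asdict(record))
```

A missing or malformed date raises a `ValueError` here, which is one simple way to flag articles that need manual review before loading.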
In the past few months, our data engineering team has taken a century of our historical archives and transformed it into structured data, which is now in our data warehouse. We took a look at the archives and found some interesting insights.
Plotting our average article output per week, we see a small dip during World War I and then a substantial drop from 1941 to 1945, when the Japanese occupied Hong Kong during World War II. However, the SCMP continued to grow its volume of coverage over time, through the 1970s and into the late 1990s.
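The weekly average behind a plot like this could be computed along the following lines. This is a sketch assuming each article carries a publication date; `avg_articles_per_week` is a hypothetical helper, and weeks with no articles at all are left out of the average rather than counted as zero.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import mean


def avg_articles_per_week(dates):
    """Map each ISO year to the mean article count over weeks that have articles."""
    # (iso_year, iso_week) -> number of articles published that week
    per_week = Counter(d.isocalendar()[:2] for d in dates)
    by_year = defaultdict(list)
    for (year, _week), count in per_week.items():
        by_year[year].append(count)
    return {year: mean(counts) for year, counts in sorted(by_year.items())}


# Made-up publication dates for illustration only.
example_dates = [
    date(1941, 1, 6), date(1941, 1, 7), date(1941, 1, 8),  # same ISO week
    date(1941, 1, 14),                                      # following week
    date(1942, 3, 2),
]
weekly_avg = avg_articles_per_week(example_dates)
```

Grouping by ISO week keeps weeks that straddle a year boundary intact, at the cost of assigning a few late-December or early-January days to the neighbouring ISO year.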
Once this data is available in our warehouse, it enables us to run various NLP (natural language processing) models against it, including sentiment analysis, readability scoring, keyword tagging, and topic analysis. However, archival content poses particular challenges: the news cycle changes along with the world we live in, so training an algorithm across a century’s worth of topics is difficult.
An unsupervised approach to keyword tagging, which counts recurring words (excluding “stop words” like “the,” “is,” and “and”), may be a more effective way to extract recurring themes in our content over time. After doing this we find, not surprisingly, that our top keywords include “China,” “Hong Kong,” “British,” “Chinese,” “government,” and “police.”
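A minimal sketch of this counting approach, using only the standard library. The stop-word list is a tiny illustrative sample (real lists are much longer), `top_keywords` is a hypothetical helper, and the tokeniser only handles single words, so a two-word name such as “Hong Kong” would need extra phrase handling.

```python
import re
from collections import Counter

# Assumption: a deliberately short stop-word list for illustration.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "on", "for", "was"}


def top_keywords(articles, n=10):
    """Return the n most frequent non-stop-words across all article texts."""
    counts = Counter()
    for text in articles:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up article snippets for illustration only.
snippets = [
    "The police and the government of China",
    "China is large and the police",
]
print(top_keywords(snippets, n=2))
```

No labelled training data is needed, which is what makes this kind of counting practical across a century of shifting topics.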
Looking at the keyword “China,” we see that its usage on a per-article basis has fluctuated over time. Having only recently finished importing this data into our warehouse, we are just starting to scratch the surface of the insights that digitising a century of historical news coverage can reveal.
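One way to express a keyword's per-article usage is the share of each year's articles that mention it at least once. The sketch below assumes records arrive as `(year, text)` pairs; `keyword_share_by_year` is a hypothetical helper and the sample records are made up.

```python
import re
from collections import defaultdict


def keyword_share_by_year(records, keyword):
    """Return {year: fraction of that year's articles containing the keyword}."""
    # Whole-word, case-insensitive match, so "China" does not match "Chinatown".
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in records:
        totals[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up records for illustration only.
records = [
    (1950, "China trade grows"),
    (1950, "Harbour news from Chinatown"),
    (1951, "Talks in china resume"),
]
share = keyword_share_by_year(records, "China")
```

Normalising by the number of articles per year matters here because total output changes so much over the century; raw counts would mostly track volume rather than interest.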
By bringing historical perspectives to life with today’s data technology, we look forward to revealing more insights, news findings, and learnings in the near future.