Analyzing the ‘life’ of newspapers with Machine Learning

Netherlands eScience Center
7 min read · Sep 26, 2018


Marcel Broersma, professor of Media and Journalism Studies and Director of the Research Centre for Media and Journalism Studies (CMJS) at the University of Groningen

Photography: Elodie Burrillon | http://hucopix.com

Journalism offers citizens a window on the world

“What is the role and function of news and journalism in, and for, society? And how does this role change over time? These are questions that fascinate me. This includes a focus on long-term historical shifts, but also on very current changes in journalism, for example related to the rise of social media and changing patterns of news use. I study journalism as a cultural form with specific stylistic and textual conventions that have an impact on how citizens obtain and experience knowledge about society, and act upon this. For most citizens journalism offers a window on the world. The views we get from that window have a huge impact on our daily lives.”

What is the role and function of news and journalism in, and for, society? And how does this role change over time?

Marcel Broersma is Principal Investigator of the collaborative research project NEWSGAC which studies how genres in newspapers and television news can be detected automatically using Machine Learning technology. The project brings together expertise from journalism history scholars, specialists in data modelling, integration and analysis, digital collection experts and eScience Research Engineers.

The views we get from journalism have a huge impact on our daily lives

Exploring new territories

“What I particularly like in research is exploring new territories. My research has always been on the interface of different disciplines, trying to shed new light on journalism as a research object. In addition, I value and enjoy multi- and interdisciplinary collaboration — in Groningen I founded the Centre for Digital Humanities to foster such work. These interests converge in digital humanities projects in which we push the boundaries by using computational methods to analyze questions that are key in journalism studies.”

Understanding the changing nature of journalism

“In this project we are interested in two complicated issues. First, we want to analyze on a large scale how journalism has changed in the twentieth century from a practice centered around views and opinions to fact-centered reporting. We do so by analyzing the “life” of genres in newspapers. Throughout history, genres such as the interview and the reportage were “invented” as part of a shift towards active reporting in which on-site observation is important and sources are critically assessed. Other genres such as the report and the opinionated essay have either disappeared or decreased in volume and importance.”

“In my previous NWO-VIDI project (Reporting at the Boundaries of the Public Sphere. Form, Style and Strategy of European Journalism, 1880–2005) we did a manual content analysis of about 125,000 historical newspaper articles. We now use this annotated dataset as a training set for algorithms that can identify the genre of historical news articles. Ideally, this will allow us to study the shift from opinion-oriented journalism to event-centered reporting on a large scale.”
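To give a sense of what training a genre classifier on an annotated corpus involves, here is a minimal sketch using scikit-learn. The tiny example articles, genre labels, and the TF-IDF-plus-logistic-regression setup are illustrative assumptions, not the project's actual pipeline:

```python
# Minimal sketch of supervised genre classification.
# The toy articles and labels below stand in for an annotated corpus
# such as the project's 125,000 manually coded newspaper articles.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "The reporter visited the scene and interviewed three witnesses.",
    "In my opinion, the government's policy is deeply misguided.",
    "The city council met yesterday and approved the new budget.",
    "We must ask ourselves what this decision means for our nation.",
]
train_labels = ["reportage", "opinion", "report", "opinion"]

# TF-IDF features feeding a linear classifier: one of many possible setups.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

prediction = model.predict(["A commentary arguing that we must rethink policy."])
print(prediction[0])
```

A real experiment would of course hold out part of the annotated data for evaluation and compare several feature sets and algorithms, which is exactly what the project's workspace is designed to support.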

This project will open the black box of Machine Learning

“In addition, the project will open the black box of Machine Learning by comparing, assessing and visualizing the effects of applying various algorithms on heterogeneous historical data with genre features that shift over time. This will enable scholars to critically evaluate the methodological effects of various machine learning approaches, while developing an approach that can deal with the overabundance of available historical newspaper material.”

A difficult task for computers

“The biggest challenge is that the genre of a newspaper article is not straightforward; labeling the genre of a historical newspaper text demands a lot of interpretation. Even for human coders this category is hard to code, because they take contextual factors into account in addition to textual features. This makes it a very difficult task for the computer. But if we can do this, we gain a lot, which is why we decided to focus on this category.”

Fascinating conversations

“Collaboration between us as domain specialists, computer scientists, eScience Research Engineers and the collection specialists at the archives (who provide access to, and knowledge about, their digital collections) is crucial for a project like this. To solve the challenges ahead of us, we need the knowledge and expertise of every one of these stakeholders. This also raises issues, because we do not always speak each other’s language. That results in fascinating conversations and discussions from which we all learn. I think it is fun to explore this new shared terrain, also because we all see what we can gain from it.”

It is fun to explore this new shared terrain, also because we all see what we can gain from it

“Prior to the current project on genre, I worked with the eScience Center on a project that mapped the online behavior of politicians on Twitter. That project also centered around a very complex content category, which we likewise tried to classify automatically with machine learning.”

“On both projects I have worked with Erik Tjong Kim Sang, whom I already knew a little from his work on Twitter in computational linguistics. Our collaboration has been very smooth and enjoyable. Together with our computer scientist and postdoc Aysenur Bilgin (CWI), Erik offers crucial knowledge and skills that I and the other domain specialists in the team don’t have, and Aysenur and Erik both come up with new ideas and solutions based on our discussions. It’s exactly the kind of collaboration you need in a project like this.”

Sharing knowledge and tools with other researchers

“Internally, we have open discussions in the team in which we share knowledge, ideas and expertise. We learn a lot from each other. Externally, we discuss our findings and approach with the scholarly community through papers and (conference) presentations. The archives profit from the knowledge and tools created in this project to make their collections more accessible. But most importantly, we are working on a virtual workspace in which researchers can upload their own annotated dataset, experiment with different machine learning algorithms, and tweak features to ultimately assess the output and make an informed choice about the best performing algorithm. This is important because certain machine learning algorithms can score high on overall accuracy, but might still underperform on validity. We think this is a useful tool for other researchers who could make an impact in the CLARIAH framework, which also supports this project.”
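The point about accuracy versus validity can be made concrete with a toy example, assuming scikit-learn metrics and invented labels: a classifier that always predicts the dominant genre looks accurate overall, yet is worthless for the rare genre a historian may actually care about.

```python
# Toy illustration: high overall accuracy can mask failure on a rare class.
# The genre labels and class distribution here are invented for illustration.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["report"] * 9 + ["opinion"] * 1
y_pred = ["report"] * 10  # always predicts the majority genre

overall = accuracy_score(y_true, y_pred)   # 0.9: looks impressive
rare_f1 = f1_score(y_true, y_pred, labels=["opinion"],
                   average=None, zero_division=0)[0]  # 0.0: useless for "opinion"

print(overall, rare_f1)
```

This is why the workspace reports more than a single accuracy number, so researchers can judge how an algorithm performs on each category before trusting its output.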

Digitization can have a major impact on how we study journalism

“I hope that, in the longer term, we will be able to answer research questions that have only been tested on smaller datasets by analyzing research material on a much larger scale. Content analysis of news texts has also been a very labor-intensive method, partly because collections were hard to access. The size of our manually annotated dataset of 125,000 articles in our previous project is unprecedented, but it is still very little compared to the volume of news texts produced in the long twentieth century. The digitization of newspaper collections and the use of computational methods can thus have a major impact in my field. We can pose new questions and answer existing questions based on more data.”

“Moreover, we work with latent categories and “fuzzy” data. If we manage to develop a classifier that does the job, this might also have an impact on developing algorithms that can classify other “complex” categories. Last but not least, algorithmic transparency is becoming an increasingly important issue for society. How do we know whether algorithms do what they should do? How can we assess their performance? Hopefully, our workspace can offer some insights into how machine learning algorithms work and open the black box.”

From left to right: Erik Tjong Kim Sang, Lotte Wilms, Marcel Broersma, Frank Harbers, Kim Smeenk, Aysenur Bilgin, Tom Klaver and Jacco van Ossenbruggen.

Much to gain from interdisciplinary collaborations

“I hope this type of research, on the interface of specific domains and computer science, will by then be more common. There is much to gain! The road will surely be bumpy, but I also hope that by then we will have solved many issues and developed solid classifiers to label latent categories. These are the most interesting ones in the humanities.”

About the NEWSGAC project

The NEWSGAC project was initiated by Marcel and his colleague Frank Harbers (both from University of Groningen), together with Jacco van Ossenbruggen and Laura Hollink (both from CWI), and in cooperation with the National Library (KB) and the Netherlands Institute for Sound and Vision (NISV). Aysenur Bilgin works on it as a postdoctoral researcher at CWI and Kim Smeenk as a junior researcher at CMJS. Two eScience Research Engineers from the Netherlands eScience Center are part of the NEWSGAC team: Erik Tjong Kim Sang and Tom Klaver.


Dutch national center for digital expertise. We help those in academic research develop open, sustainable, high-quality software tools.