Reading news, visually
The StoryTeller application gives you a different perspective on the news
When I started working with the Computational Linguistics group of the VU University Amsterdam two years ago, I didn’t know much about Natural Language Processing. I used to think that communicating human beings were pretty understandable, especially in everyday news reporting. I’ve since learned that I was terribly wrong.
Computers do not understand us. Human language is so abstract and full of context, innuendo and cultural references, that to an observer without the same background (the computer, in this case) we are nearly impossible to understand.
Shaka, when the walls fell.
In this episode of Star Trek: The Next Generation (“Darmok”), the English-speaking captain of the starship Enterprise is forced to communicate with an alien race whose language is completely different. The aliens’ words are translated into English, but they make no sense at all, because the aliens speak entirely in metaphors.
Since this is an episodic series, the impasse is of course resolved eventually, but the point is well made.
Human(oid) communication is a very difficult problem indeed.
Natural Language Processing
This is where science must find a solution. If we ever want computers to answer complex queries from humans (or even make us a cup of Earl Grey on voice command), we have to make computers and people understand each other.
Great strides have already been made, but many challenges persist. One of those challenges is event-based data like news stories. While literary works are usually well-researched repositories of knowledge, or stories with a well-defined structure, the news carries an extra layer of challenges because of its time-based nature. The facts of today might be the fake news of tomorrow, and the opinions of politicians and other prominent speakers may shift over time. Even concepts themselves are not as stable as you might think. The news, with all its complexities, is the chosen research topic of the group of professor Piek Vossen.
Enter the NewsReader project, a European-funded multi-partner research project that aims to help humanity overcome at least some of the challenges described above. To help improve the communication between computers and people, a software pipeline was constructed. This pipeline is a collection of software packages, each dealing with a specific part of human language, which can be connected to form a more complete picture of the stories in newspapers.
The NewsReader pipeline extracts what happened to whom, when and where from billions of news stories, company databases and biographies, and stores the results in a structured database, enabling more precise search over this immense stack of information. It supports multiple languages (English, Spanish, Italian and Dutch) and allows its users to find complex interconnections between participants, events and perspectives on these events.
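To give a feel for what “structured” means here, a record produced by such a pipeline might look roughly like the sketch below. The field names and data are purely illustrative; the real NewsReader output is RDF-based and far richer.

```python
# Illustrative sketch of a structured event record: who, what, when, where.
# Field names and values are made up for this example, not the actual
# NewsReader schema.
event = {
    "event": "meeting",
    "label": "Leaders meet at summit",
    "participants": ["United Kingdom", "David Cameron"],
    "time": "2014-06-27",
    "location": "Brussels",
    "sources": ["article-123", "article-456"],
}

def participants_of(events, actor):
    """Return all events in which a given actor participates."""
    return [e for e in events if actor in e["participants"]]

print(participants_of([event], "David Cameron"))
```

Once events live in a structure like this, a query such as “all events involving David Cameron” becomes a simple filter instead of a full-text search.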
Before I joined the project, however, a new and exciting challenge appeared. The output of the pipeline was too complex for humans to understand.
A term usually applied to the inane utterings of politicians or their representatives, it quite accurately describes the issues with scientific output in computational linguistics. At least to us humans, English-speaking humans, or me alone, depending on how narrowly this particular selection is made. On second thought, let me explain.
Before new users can make sense of the output of the NewsReader pipeline, they encounter an enormous mass of words, arranged in mentions (instances of events that are mentioned in the news) and their attributes: dates, cited persons, authors, event participants, labels, groups, perspectives and so on, all neatly arranged in a computer-readable data structure. Needless to say, I didn’t understand any of its significance until it was explained to me.
Climax events and Storylines
The pipeline spews forth a slurry of events and their mentions in the news, but fortunately there is now some structure to the data. Stories are formed by identifying a ‘climax’ event and connecting it to other events through their participants, labels, groups and other metadata. These stories should follow the ancient structure of all human stories, in use at least since the first recorded epic, that of Gilgamesh. Incidentally (and probably not coincidentally), that epic tale also played a large part in the resolution of the Star Trek episode I linked earlier.
To detect climax events, the software uses multiple Natural Language Processing modules, including named-entity recognition and linking, semantic role labeling, time-expression detection and normalization, and nominal and event coreference resolution. Processing a single news article results in a semantic interpretation of the mentions of events, their participants and their time anchoring in a sequence of text.
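As a rough mental model, such a pipeline is a chain of stages, each enriching the annotations produced by the previous one. The sketch below is a toy version: the stage names and the trivial heuristics inside them are stand-ins, not the actual NewsReader modules.

```python
# Toy model of an NLP pipeline: each stage receives the document and the
# annotations gathered so far, and adds its own layer. The heuristics are
# deliberately naive stand-ins for real NLP modules.
def named_entities(doc, ann):
    # Pretend capitalized words are named entities.
    ann["entities"] = [w for w in doc.split() if w.istitle()]
    return ann

def time_expressions(doc, ann):
    # Pretend bare numbers are time expressions.
    ann["times"] = [w for w in doc.split() if w.isdigit()]
    return ann

def run_pipeline(doc, stages):
    ann = {}
    for stage in stages:
        ann = stage(doc, ann)
    return ann

doc = "Cameron visited Brussels in 2014"
print(run_pipeline(doc, [named_entities, time_expressions]))
```

The point is the architecture, not the heuristics: because every stage shares one annotation structure, modules can be swapped, reordered or added independently, which is what makes a multi-partner pipeline practical.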
Once the NewsReader pipeline has identified climax events, it first looks for events that belong to the same storyline as the largest climax event in the selected data, and links them to that climax event to form a story. It does so in a greedy fashion, until it can find no more events to link. It then moves on to the next-largest climax event left in the pool and continues. This creates a set of stories, linked by single events with a number of participants, citations, authors and perspectives.
Movie narrative charts
Very early on in the project, we realized that one visualization would not be enough. The complexity of the connections between events, actors and mentions was just too great to capture in a single image, especially if that image had to be algorithmically constructed instead of painstakingly hand-crafted. While we used the legendary Randall Munroe’s Movie Narrative Charts (see image) as a source of inspiration for one of our visualizations, we knew that we could never fit as much information into an automatically generated image.
We therefore decided on three different visualizations in different tabs (or views), linked through shared filters and selections.
Visualize and conquer!
(Hold your horses, a link will be provided soon… I just need to explain a little more…)
The first of our chosen visualizations is a variation on the bubble chart. It shows events on a timeline, with the size (and color, for distinctiveness) of the bubbles as a measure of their importance. Each line in the chart is a story, with a topic (on the left) and event labels.
Too much is shown at once. We know. Luckily we have the handy tool of interaction to help us. From this view, we can filter on stories, intervals and (relative) importance of events.
More interested in who was present or who was mentioned?
(or colorful spaghetti?)
That’s what the relations tab is for. This is also where the previously mentioned xkcd comic comes in handy. We see participants on the left and single mentions on the right, with colorful lines drawn through the events in which these actors appear together.
This graph can of course also be filtered; here we see all events linked to both the United Kingdom and David Cameron.
What about these perspectives you mentioned at the start?
This is the last tab of our web app. Here a user can visually find (and filter on) events with sentiments (negative, neutral or positive), events that refer to the past or the present, events that the original speaker was certain or uncertain about, and so on. We can also filter on specific citations by people and on the authors of articles.
Try it!
(please allow a few moments for the page to load; the amount of data in here is quite … big)
Ugh, too much work, don’t you have a video?
Ok, cool, so how did you make this? Can I use it? Can I cannibalize your code? Use it to take over the world?
Whoa! Yes! All of the software we develop at the eScience Center is open source, with a very permissive attribution-only license. So go ahead and try it out.
If you are going to use it: awesome! I’d appreciate a message if you encounter any issues. The code is meant to be quite readable and usable, so if it’s not, I’d like to know. Don’t hesitate to open an issue on GitHub either. I promise I’ll be nice :)