Still coding texts by hand for social science text analysis? Use Doccano instead to speed up the process!

Sven van der Burg
Netherlands eScience Center
Dec 21, 2021


By Sven van der Burg and Femke van Esch

Figure: Example of manual text coding. This is a paragraph from a political speech in which the causal relations between different concepts are annotated.

Are you a social sciences or humanities researcher who depends on text coding like that in the picture above? Think of text classification (e.g. sentiment analysis), sequence labelling (e.g. named entity recognition) or sequence-to-sequence tasks (e.g. text summarization). If you recognize yourself in any of the following statements, read on!

  • I need more data, but it is too time-consuming to do text annotation.
  • I used NVivo or MAXQDA, but they are too complex or do not fit well with what I need in my project.
  • I tried using Excel or Access, but after the data is gathered, I spend hours cleaning up errors in the data entry.
  • I print out texts that I want to code. I highlight groups of words with differently coloured markers, representing two concepts that are causally linked to each other. I end up with a big pile of colourful sheets of paper. Then I type out these codings in, for instance, an Excel or Access file. Finally, I use this file to do my analysis on.

Manual coding has generated good results in the past. In fact, the paper-and-spreadsheet pipeline described in the last statement produced very valuable data for the ‘Automated Cognitive Mapping’ project that we are carrying out together with researchers from Utrecht University. As a result of the collaboration between us technical people and the domain experts from Utrecht University, we started using Doccano, which speeds up the coding process tremendously! Yay for interdisciplinary research!

Why is Doccano so elegant and how does it work?

A screenshot of the Doccano interface

Doccano is a simple online tool for text coding. Let’s introduce Doccano and its benefits:

  • Coder-centered: It is designed to make the coder’s life as easy as possible. You view one text at a time at the center of your screen. All you have to do is select a piece of text with your pointer and click or press a shortcut key to add a label (for example: ‘concept 1’). When the complete text is coded, just click next. It’s as simple as that.
  • No more errors in data entry: As a data analyst, you don’t need to worry about how exactly a coder enters the data, as you would when data is coded in an Excel sheet, for example. Doccano stores every annotation in the same consistent format, so the data you download is free of data entry errors.
  • Open-source: The programming code for Doccano is open for everyone to see and edit (under sensible restrictions, of course). This makes it free to use and easy to set up yourself in any scenario. In addition, if you want a new feature, you can always add it yourself or request it in the community. Yay for open-source software!
  • Collaborative coding: Doccano allows different coders to code the same text, which is very useful if you want to get reliable labels.

How does Doccano compare to NVivo or MAXQDA?

NVivo and MAXQDA also allow you to select sections of documents, but they are mainly useful for thematic analysis using extensive coding trees. Both are paid packages, whereas Doccano is completely free and easier to work with. Doccano is easily combined with other analysis frameworks like SPSS, Excel, R, or Python, whereas NVivo and MAXQDA want you to do your complete analysis within their program.

How does Doccano compare to AmCAT?

AmCAT also allows token-level and sentence-level coding (see their documentation), but it’s certainly not as coder-centered as Doccano. In addition, it is much more complex than Doccano with its simple interface.

Next steps

We hope that we’ve convinced you that using Doccano can be a simple improvement in text analysis pipelines. So, what’s next?

  • Try out the Doccano demo to get a feel for how it works.
  • Because of its open-source nature, you have to run Doccano in a cloud environment yourself. Fortunately, you can do this without any technical knowledge. Just follow the first part of this excellent guide to set up Doccano on Heroku (ignore the prerequisites; you don’t need them). You will have your own free Doccano environment within 15 minutes.
  • Follow the further instructions in that guide to add a dataset and tags within Doccano to kickstart your first coding project! NB: Double-check whether you are allowed to upload your data to Heroku, as there may be privacy or security concerns. If there are, you could discuss installing Doccano on your institution’s own infrastructure.
  • You can download your coded data as a .jsonl file. This file can be opened with Notepad or converted to .csv with an online tool. If you are not familiar with programming, you can turn the coded sections in the Doccano output from positions (‘token position’) back into text with the Excel text function MID.
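
If you are comfortable with a little Python, the same conversion from positions back to text can be sketched in a few lines. This is a minimal sketch, not Doccano's own tooling: it assumes each line of the export is a JSON object with a "text" field and a "label" field holding [start, end, label name] triples with character offsets (end not included). The export layout differs between Doccano versions, so check your own file first. The sample data and the function names are made up for illustration.

```python
import csv
import io
import json

# Hypothetical two-document export. ASSUMPTION: each line has a "text" field
# and a "label" field with [start, end, label_name] triples (character
# offsets, end exclusive). Your Doccano version may use a different layout.
SAMPLE_JSONL = (
    '{"id": 1, "text": "Austerity causes unemployment.", '
    '"label": [[0, 9, "concept 1"], [17, 29, "concept 2"]]}\n'
    '{"id": 2, "text": "Trade creates growth.", '
    '"label": [[0, 5, "concept 1"], [14, 20, "concept 2"]]}\n'
)


def extract_spans(jsonl_text):
    """Return one row per coded section, including the text it covers."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        doc = json.loads(line)
        for start, end, label in doc.get("label", []):
            rows.append({
                "doc_id": doc.get("id"),
                "label": label,
                "start": start,
                "end": end,
                # Slice the original text to recover the coded words.
                "span_text": doc["text"][start:end],
            })
    return rows


def spans_to_csv(rows):
    """Write the rows as CSV text that Excel, R or SPSS can open."""
    buffer = io.StringIO()
    fieldnames = ["doc_id", "label", "start", "end", "span_text"]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_spans(SAMPLE_JSONL)
csv_text = spans_to_csv(rows)
print(csv_text)
```

To use this on a real export, read the downloaded file into a string (for example with open(...).read()) instead of using the sample, and write csv_text to a .csv file. Keeping one row per coded section makes it easy to filter or count labels afterwards.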
