Immaculate Speech and Hallucinations
When (not) to use OpenAI’s Whisper to transcribe audio in a social sciences project
Audio recordings are important data in many research projects across the social sciences. For example, researchers often record interviews or focus group sessions. Online platforms, such as YouTube, provide audio data that are publicly or commercially available. Whenever the goal is to analyze speech content, the recordings must first be transcribed. Then, the transcripts can be analyzed qualitatively, for example with thematic analysis, or with quantitative methods from Natural Language Processing.
Transcribing what has been said is usually done manually, which is costly, laborious, and difficult to reproduce (imagine the transcripts get lost!). For these reasons, automated tools for audio transcription have been developed using machine learning¹. The company OpenAI recently released a free and open-source model for audio transcription and translation called Whisper. In contrast to previous models, Whisper is claimed to be more robust to noise and accents. For English, it is supposed to produce transcriptions close to those of human transcribers, even when the model is used off the shelf without fine-tuning it to the dataset at hand. Finally, it is relatively easy to use because of its simple Python interface.
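To give an impression of that interface, here is a minimal sketch of transcribing a single recording (the file name interview.mp3 is a placeholder, and the whisper package needs to be installed first):

```python
import whisper

# Load a pretrained checkpoint; "base" is small and fast,
# larger checkpoints ("medium", "large") are usually more accurate.
model = whisper.load_model("base")

# Transcribe the recording; the file name is a placeholder.
result = model.transcribe("interview.mp3")

# The full transcript as plain text, plus time-stamped segments.
print(result["text"])
for segment in result["segments"]:
    print(f'{segment["start"]:.1f}s - {segment["end"]:.1f}s: {segment["text"]}')
```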
In this post, I will give examples from the social sciences for which Whisper might or might not be a good solution. The examples are based on English, but should also apply to other languages for which Whisper has been reported to perform well (e.g., Spanish or Dutch)². For technical details about the model, I recommend consulting the paper, blog post, and model card introducing Whisper.
When to Use Whisper: Matching Text to a Dictionary
Social scientists often use dictionaries of words that have certain attributes and match them to the text they investigate. For example, a dictionary could contain verbs with positive or negative sentiment (like “care” and “destroy”). If automatically generated transcripts contain transcription errors, matching them to dictionaries becomes difficult because a transcribed word can differ from the dictionary entry even though both refer to the same word (“detroy” would not match “destroy”). Moreover, if the transcripts lack punctuation or syntactic boundaries, it is impossible to aggregate matches over certain parts of the text (e.g., at the sentence or paragraph level). Speech is often unstructured, which makes it hard even for humans to set such boundaries naturally.
These shortcomings could be solved by postprocessing the transcripts manually, but that reintroduces the initial problems of manual transcription. Whisper, however, has been shown to provide transcripts that are mostly grammatically and syntactically correct, as well as semantically coherent. It also removes many of the filler words and much of the noise that commonly occur in speech. Thus, transcripts generated by Whisper can be matched to a dictionary with little extra effort.
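As a rough sketch of such a dictionary analysis on a Whisper transcript: the toy sentiment dictionary and the file name below are made up for illustration; in practice, one would use a validated lexicon from the literature.

```python
import re
import whisper

# A toy sentiment dictionary; real projects would use a validated lexicon.
SENTIMENT_DICT = {"care": 1, "help": 1, "destroy": -1, "hate": -1}

model = whisper.load_model("base")
result = model.transcribe("focus_group.mp3")  # placeholder file name

# Whisper returns punctuated text, so matches can be aggregated per sentence.
sentences = re.split(r"(?<=[.!?])\s+", result["text"].strip())

for sentence in sentences:
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = sum(SENTIMENT_DICT.get(token, 0) for token in tokens)
    print(score, sentence)
```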
When NOT to Use Whisper: Analyzing Speech Patterns
When analyzing transcripts of speech qualitatively, exactly how things were said can lead to many insights. For example, if speakers make many pauses or use a lot of filler words, this might reflect their deliberation or hesitation about what they are saying. They could also repeat words right after each other to emphasize their importance. Here, it is important that transcripts reflect what was said as closely as possible.
Using Whisper off-the-shelf in this case might bias the results because it tends to “polish” the structure and coherence of the speech in the transcription. It is also known to sometimes insert common words when nothing was said, which is called “hallucination”. Thus, Whisper could artificially create, alter, or even remove the effect to be investigated.
Other Scenarios Where Whisper Can Be Useful
In addition to the previous examples, I also want to mention two other good use cases for Whisper. The transcripts can be fed into a second machine learning model that, for example, predicts the sentiment of each sentence. The sentence structure returned by Whisper is necessary for such sentence-level predictions, and many recent language models that perform sentiment prediction benefit from clean and structured input text. However, one should keep in mind that errors in Whisper’s transcriptions will likely propagate into the sentiment predictions.
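A minimal sketch of this two-step pipeline could look as follows. The sentiment classifier is an assumption here: the example uses the default checkpoint of the Hugging Face transformers sentiment pipeline, and any sentence-level classifier that fits the research question could take its place.

```python
import re
import whisper
from transformers import pipeline

model = whisper.load_model("base")
result = model.transcribe("interview.mp3")  # placeholder file name

# Split the punctuated transcript into sentences.
sentences = re.split(r"(?<=[.!?])\s+", result["text"].strip())

# A generic off-the-shelf sentiment classifier; the default checkpoint
# is an assumption and should be chosen to fit the research question.
classifier = pipeline("sentiment-analysis")

for sentence, prediction in zip(sentences, classifier(sentences)):
    print(prediction["label"], round(prediction["score"], 2), sentence)
```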
Another use case is the transcription of audio recordings in languages that are not known a priori, for example, when using data from YouTube. Whisper can automatically detect the language and transcribe the recordings. Moreover, it can also translate the transcripts to English. This combination can make the analysis of multilingual datasets a lot easier. However, Whisper transcribes some languages better than others, which researchers should consider to avoid bias.
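A minimal sketch of this workflow, assuming a recording in an unknown language (the file name is again a placeholder and a multilingual checkpoint is required):

```python
import whisper

model = whisper.load_model("medium")  # multilingual checkpoint

# Detect the spoken language from the first 30 seconds of audio.
audio = whisper.load_audio("unknown_language.mp3")  # placeholder file name
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print("Detected language:", max(probs, key=probs.get))

# Transcribe in the original language ...
original = model.transcribe("unknown_language.mp3")

# ... or translate the speech directly into English.
english = model.transcribe("unknown_language.mp3", task="translate")
print(english["text"])
```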
Concluding Remarks
In this post, I gave some examples of how Whisper can be applied in social science research. Whisper shines when structured and coherent transcripts are important. In contrast, when the transcribed text should mimic the original speech closely, the model might be less useful. With Whisper, many tasks can be solved using a single tool, whereas traditional approaches require many processing steps with different methods. Finally, I want to highlight that, in any case, the transcripts should at least be partially checked for unexpected results by someone who is familiar with the recordings and the research domain. Discussing the approach and results with a machine learning expert will also not hurt. To those who found inspiration in this post: happy whispering!
Reference: Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://cdn.openai.com/papers/whisper.pdf
[1]: In the field of machine learning, audio transcription falls under the task of Automatic Speech Recognition (ASR).
[2]: This figure gives a quick overview of Whisper’s transcription performance across languages: https://github.com/openai/whisper/blob/main/language-breakdown.svg