Shakespeare audio: a tiny text and speech corpus

The tinyshakespeare miniature corpus (~1MB) by Andrej Karpathy is a convenient resource for prototyping language models. Recently I found myself in need of a tinyshakespeare-like audio corpus, so I wrote a script that builds one. The script and instructions are on GitHub.

The corpus is based on ~10 hours of English recordings of Shakespeare plays and contains the voices of several female and male speakers. Getting it is a bit less elegant than for the text corpus, as I did not want to store any audio files in the repository. But it is not complicated either: clone the repository, optionally install ffmpeg and sox, run the script, and wait. The script automatically downloads the recordings from LibriVox, installs whisper.cpp to create transcripts with timestamps, converts the mp3s to wavs, downsamples them to 8-bit, 8 kHz audio, and splits everything into a train and a test set.
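To give a feel for the conversion step, here is a minimal sketch of how an mp3 could be turned into an 8-bit, 8 kHz wav by calling ffmpeg from Python. This is not necessarily how the script in the repository does it (it may use sox, or different options). The helper names `build_ffmpeg_cmd` and `convert` are made up for illustration, and the downmix to mono is my assumption.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path, rate: int = 8000) -> list[str]:
    """Build an ffmpeg command that converts src to an 8-bit PCM wav."""
    return [
        "ffmpeg",
        "-y",                 # overwrite dst if it already exists
        "-i", str(src),       # input file, e.g. an mp3 from LibriVox
        "-ac", "1",           # assumption: downmix to a single (mono) channel
        "-ar", str(rate),     # resample to 8 kHz by default
        "-acodec", "pcm_u8",  # unsigned 8-bit PCM, i.e. 8-bit wav samples
        str(dst),
    ]


def convert(src: Path, dst: Path) -> None:
    """Run the conversion; requires ffmpeg to be installed and on PATH."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
```

Calling `convert(Path("act1.mp3"), Path("act1.wav"))` would then write the downsampled file next to the original (the file names here are placeholders).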

At the end, you will have three types of resources for each split:

These are the resulting file sizes and durations:

split    .txt      .wav (8kHz)    duration
train    328kB     217.6MB        7:33:21
test     91kB      58.8MB         2:02:32
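As a quick sanity check, the wav sizes follow almost directly from the durations: at 8 kHz with one byte per sample, every second of audio takes 8000 bytes. The sketch below redoes that arithmetic. It assumes mono audio, ignores the small wav header, and takes 1MB to mean 10^6 bytes; the function names are just for illustration.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def pcm_size_mb(hms: str, sample_rate: int = 8000,
                bytes_per_sample: int = 1, channels: int = 1) -> float:
    """Size of raw PCM audio in MB (10^6 bytes), wav header ignored."""
    n_bytes = hms_to_seconds(hms) * sample_rate * bytes_per_sample * channels
    return n_bytes / 1e6


for split, duration in [("train", "7:33:21"), ("test", "2:02:32")]:
    print(f"{split}: {pcm_size_mb(duration):.1f} MB")
# prints:
# train: 217.6 MB
# test: 58.8 MB
```

Both values match the table, which suggests the wavs are indeed single-channel.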

By default, the script combines the per-act recordings of three Shakespeare plays (Romeo & Juliet, Hamlet, and As You Like It), but it can easily be configured to download more from LibriVox. See GitHub for more detailed instructions!