The goals of this initiative were:
- Use open source software and process everything locally. We tried Azure Cognitive Services, but the pricing was not acceptable.
- Convert a text file to a WAV file.
- Convert the WAV file to an OGG file.
- Attach the OGG file to each post.
- Show an audio player on each post.
After a few hours, we had the software installed on our servers and a Python script that met all of these requirements. After another couple of hours of work, the script was running and attaching the final OGG file to each post as a media file.
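For readers who want to try something similar, here is a minimal sketch of the text to WAV to OGG steps. It is not our production script. It assumes the Coqui TTS command-line tool (`tts`) and an `ffmpeg` build with the `libvorbis` encoder are on the PATH. The function names are hypothetical, and attaching the file to a post is left out because that step depends on the publishing platform.

```python
import html
import subprocess
from pathlib import Path

MODEL_NAME = "tts_models/en/ljspeech/vits"  # the model named in this post


def tts_command(text: str, wav_path: Path, model_name: str = MODEL_NAME) -> list[str]:
    """Build a Coqui TTS CLI call that speaks `text` into `wav_path`."""
    return ["tts", "--text", text, "--model_name", model_name,
            "--out_path", str(wav_path)]


def ffmpeg_command(wav_path: Path, ogg_path: Path) -> list[str]:
    """Build an ffmpeg call that re-encodes the WAV as Ogg Vorbis."""
    # -y overwrites an existing output file; libvorbis must be compiled in.
    return ["ffmpeg", "-y", "-i", str(wav_path), "-c:a", "libvorbis", str(ogg_path)]


def audio_player_html(ogg_url: str) -> str:
    """Return a plain HTML5 audio element pointing at the OGG file."""
    return f'<audio controls src="{html.escape(ogg_url, quote=True)}"></audio>'


def text_to_ogg(text_path: Path) -> Path:
    """Convert one text file to speech; returns the path of the OGG file."""
    text = text_path.read_text(encoding="utf-8")
    wav_path = text_path.with_suffix(".wav")
    ogg_path = text_path.with_suffix(".ogg")
    subprocess.run(tts_command(text, wav_path), check=True)
    subprocess.run(ffmpeg_command(wav_path, ogg_path), check=True)
    return ogg_path
```

Calling `text_to_ogg(Path("post.txt"))` would write `post.wav` and `post.ogg` next to the input file. Passing the commands as lists rather than shell strings avoids quoting problems when the post text contains quotes or other special characters.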
We are pretty happy with the results. We chose the model tts_models/en/ljspeech/vits, which has a good balance of tone and accuracy. The speech is not perfect or 100% human-like, but that was not one of the goals. We wanted something that would let a blind person, or someone who cannot look at a screen at the moment, consume the content without reading.
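As an illustration of what that model name means and how it can be used from Python, here is a small sketch. It assumes the Coqui TTS package, whose model names follow a type/language/dataset/model pattern. `describe_model` and `synthesize` are hypothetical helper names, not part of the library.

```python
MODEL_NAME = "tts_models/en/ljspeech/vits"


def describe_model(model_name: str) -> dict:
    """Split a Coqui-style model name into its four components."""
    model_type, language, dataset, model = model_name.split("/")
    return {"type": model_type, "language": language,
            "dataset": dataset, "model": model}


def synthesize(text: str, wav_path: str, model_name: str = MODEL_NAME) -> str:
    """Speak `text` into a WAV file using the Coqui TTS Python API."""
    # Imported here so the helper above works even without TTS installed.
    from TTS.api import TTS

    tts = TTS(model_name=model_name, progress_bar=False)  # downloads on first use
    tts.tts_to_file(text=text, file_path=wav_path)
    return wav_path
```

Here `describe_model(MODEL_NAME)` shows that the model is a VITS model for English trained on the LJSpeech dataset. Swapping in a different model should only require changing `MODEL_NAME`.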
Converting an average post of 2000 characters takes about 15–30 seconds, which isn't too bad.
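To get a rough feel for longer posts, the figures above can be turned into a back-of-the-envelope estimate. This sketch assumes conversion time scales linearly with character count, which we have not measured, so treat the output as a ballpark only.

```python
# Figures from this post: about 2000 characters in 15 to 30 seconds.
REFERENCE_CHARS = 2000
FAST_SECONDS = 15
SLOW_SECONDS = 30


def estimate_seconds(num_chars: int) -> tuple[float, float]:
    """Return a (low, high) estimate of conversion time in seconds.

    Assumes time grows linearly with the number of characters.
    """
    scale = num_chars / REFERENCE_CHARS
    return (scale * FAST_SECONDS, scale * SLOW_SECONDS)


print(estimate_seconds(5000))  # (37.5, 75.0)
```

That works out to roughly 67 to 133 characters per second, so a 5000 character post would take somewhere between 40 seconds and a bit over a minute under this assumption.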
Text to speech has come a long way since the 1980s, when the best you could get was the kind of robotic voice that Stephen Hawking used. In the coming decades, it is likely that synthesized speech will become indistinguishable from a human voice.
Source: TTS on GitHub
Technomancer is a science and tech enthusiast who enjoys writing about software and AI and other tech topics.