I recommend checking out the previous parts if you don’t know what this “neural narratives” thing is about. In short, I wrote a system in Python for having multi-character conversations with large language models (like Llama 3.1), in which the characters are isolated in terms of memories and bios, so nothing leaks to the other participants like it does in Mantella. Here’s the GitHub repo.
In the previous entry, I came up with the notion of producing audio voice lines for the conversations. Mantella had spoiled me in that regard: hearing those fictional characters answer you in reasonably good voices while you stared at them did wonders for immersion. And it was a bad idea to shove that possibility into my mind, because it prevented me from sleeping last night. Instead, I moved to my desk at three in the morning and started implementing it. Now, every generated character gets assigned a voice model according to their peculiarities, and each speech turn produces voice lines that the user can play by clicking the speech bubbles. It works perfectly.
I learned that it’s a terrible idea to play audio server-side, because it crashes the server. Flask, the web framework my app is built on (or maybe this applies to all web systems), also doesn’t let the client access just any file on the server, so I had to move all the audio-playing logic to JavaScript.
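For reference, the server side of that split can be tiny. Below is a minimal sketch, not the exact code from my repo, of a Flask route that exposes the generated clips so the browser-side JavaScript can fetch and play them; the folder and route names are placeholders.

# Minimal sketch: expose generated voice lines over HTTP so the client can play them.
# The folder and route names are placeholders, not the actual ones from my repo.
from pathlib import Path
from flask import Flask, abort, send_from_directory

app = Flask(__name__)
AUDIO_DIR = Path("generated_audio").resolve()  # hypothetical output folder

@app.route("/voice_lines/<path:filename>")
def voice_line(filename: str):
    # send_from_directory refuses paths that escape AUDIO_DIR.
    if not (AUDIO_DIR / filename).is_file():
        abort(404)
    return send_from_directory(AUDIO_DIR, filename)

# On the client, the speech bubble's click handler then only needs something like:
#   new Audio(`/voice_lines/${filename}`).play();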
Take this example chat I had with a new character, who was automatically assigned a matching voice from the relatively few I’ve introduced into the system so far:

The short convo produced this audio exchange:
As in the original Mantella system, the quality of the voice models varies greatly; sometimes they sound like theater students reading a script, recorded on a home mic. Also, the generation process sometimes clips off the very end of the final sentence. Still, I can hardly complain. Listening to the characters adds so much life to the conversations you can have through this app that I see myself enjoying it for a long time to come (and not only for smut).
I’m amazed that I got this running. So, how did it happen?
In the beginning, I thought that setting up my own local XTTS server (XTTS being a model for generating voice lines) was a good idea. I struggled every step of the way for a few hours, fighting against obscure documentation, until I finally managed to generate a sample voice line, only to find out that it sounded like ass. Why, I have no idea. So I discarded that notion and instead looked into Mantella’s codebase, which is up on GitHub, to see how they connected to RunPod pods to request voice generation. RunPod is a sort of online rental system for compute and server time: you can set up a pre-configured little server whose only job is to generate voice lines, and as long as you can connect to it, you’re set. It only costs seventeen cents an hour, too. Once I managed to query the list of available voice models from the RunPod pod, I knew I was going to get through this thing.
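In case it helps anyone, the pod interaction boils down to two HTTP calls. This is a rough sketch that assumes the pod runs an xtts-api-server-style API like the one Mantella talks to; the endpoint names, URL, and payload fields are my best guess at that interface, so adjust them to whatever your pod actually exposes.

# Rough sketch of talking to the RunPod pod, assuming an xtts-api-server-style API.
# The pod URL, endpoints, and payload fields are assumptions, not verified values.
import requests

POD_URL = "https://your-pod-id.proxy.runpod.net"  # placeholder pod address

def list_voice_models(base_url: str = POD_URL) -> list[str]:
    # Ask the pod which voice models (speakers) it has available.
    response = requests.get(f"{base_url}/speakers_list", timeout=30)
    response.raise_for_status()
    return response.json()

def generate_voice_line(text: str, voice_model: str, out_path: str,
                        base_url: str = POD_URL) -> str:
    # Request synthesis of one speech turn and save the returned audio locally.
    payload = {"text": text, "speaker_wav": voice_model, "language": "en"}
    response = requests.post(f"{base_url}/tts_to_audio/", json=payload, timeout=120)
    response.raise_for_status()
    with open(out_path, "wb") as audio_file:
        audio_file.write(response.content)
    return out_path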
So I had a list of all the voice models I could rely on, and it turned out to hold around five hundred and fifty of them. They’re trained on game voices, so there’s a whole breadth of possible voices to use. How to classify them? Should I create a page on my site with a simple select box, letting the user (meaning me) scroll through a list that long?
ChatGPT, even its latest Orion preview version, told me that it knows of no online service that could classify the more than five hundred voice samples I had produced from those voice models. I would have to do it manually, but to begin with it would be enough to introduce twenty or so models into the system. What tags can be applied to a voice? I relied on ChatGPT to figure that out. Now that I have that list, classifying each voice model is as easy, if time-consuming, as listening to its sample on a loop while adding the appropriate tags. So far I have ended up with the following JSON file of voice models:
{
"npcmmel": [
"MALE",
"ADULT",
"CONFIDENT",
"STEADY",
"SMOOTH",
"CLEAR",
"FORMAL",
"CHARMING",
"NO SPECIAL EFFECTS"
],
"npcmlucasmiller": [
"MALE",
"ADULT",
"CALM",
"FAST-PACED",
"SMOOTH",
"CASUAL",
"KIND",
"NO SPECIAL EFFECTS"
],
"robotmsnanny": [
"FEMALE",
"YOUNG ADULT",
"STEADY",
"WARM",
"CASUAL",
"MELODIC",
"YOUTHFUL",
"NO SPECIAL EFFECTS"
],
"npcma951": [
"MALE",
"ADULT",
"ANXIOUS",
"SLOW",
"AIRY",
"SKEPTICAL",
"NO SPECIAL EFFECTS"
],
"npcfphyllisdaily": [
"FEMALE",
"ADULT",
"STOIC",
"SLOW",
"MONOTONE",
"INSTRUCTIONAL",
"CALCULATING",
"NO SPECIAL EFFECTS"
],
"femalechild": [
"FEMALE",
"CHILDLIKE",
"PLAYFUL",
"STEADY",
"AIRY",
"MELODIC",
"INNOCENT",
"NO SPECIAL EFFECTS"
],
"femaleyoungeager": [
"FEMALE",
"YOUNG ADULT",
"HOPEFUL",
"FAST-PACED",
"CLEAR",
"INTENSE",
"OPTIMISTIC",
"NO SPECIAL EFFECTS"
],
"femalevampire": [
"FEMALE",
"MIDDLE-AGED",
"ARROGANT",
"STEADY",
"SMOOTH",
"AUTHORITATIVE",
"CYNICAL",
"NO SPECIAL EFFECTS"
],
"femalekhajiit": [
"FEMALE",
"ADULT",
"CALM",
"STEADY",
"GRAVELLY",
"CASUAL",
"PHILOSOPHICAL",
"NO SPECIAL EFFECTS"
],
"femaleuniqueghost": [
"FEMALE",
"YOUNG ADULT",
"RESIGNED",
"STEADY",
"ETHEREAL",
"MELODIC",
"INNOCENT",
"GHOSTLY"
],
"femaleghoul": [
"FEMALE",
"ADULT",
"MENACING",
"STEADY",
"RASPY",
"INTENSE",
"ENERGETIC",
"NO SPECIAL EFFECTS"
],
"femaleboston": [
"FEMALE",
"ADULT",
"CALM",
"DRAWLING",
"SOFT-SPOKEN",
"WARM",
"FLIRTATIOUS",
"SULTRY",
"NO SPECIAL EFFECTS"
]
}
I wrote a function that narrows down the list of voice models by going through the tag categories: gender, age, emotion, tempo, volume, texture, style, personality, and special effects. If at some point no voice model matches, it returns a random one from the previous filtering step. I’ll probably add a simple button to the characters section of the site that redoes the process for any existing character, in case some other random fitting voice works better.
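Sketched out, and with made-up helper names rather than the exact code from the repo, that assignment logic looks roughly like this: walk the tag categories in order, keep only the voice models carrying the character’s desired tag for each category, and fall back to a random pick from the last non-empty set if a filter would leave nothing.

# Simplified sketch of the voice-assignment logic described above; the category
# order, file name, and function names are illustrative, not the repo's exact code.
import json
import random

TAG_CATEGORIES = ["gender", "age", "emotion", "tempo", "volume",
                  "texture", "style", "personality", "special_effects"]

def assign_voice_model(desired_tags: dict[str, str],
                       voice_models_path: str = "voice_models.json") -> str:
    # desired_tags maps a category (e.g. "gender") to a tag (e.g. "FEMALE").
    with open(voice_models_path, encoding="utf-8") as f:
        voice_models = json.load(f)

    candidates = list(voice_models)
    for category in TAG_CATEGORIES:
        wanted = desired_tags.get(category)
        if not wanted:
            continue
        narrowed = [name for name in candidates
                    if wanted.upper() in voice_models[name]]
        if not narrowed:
            # Nothing matches this category: return a random model
            # from the previous (still non-empty) filtering step.
            return random.choice(candidates)
        candidates = narrowed
    return random.choice(candidates)

# Example: assign_voice_model({"gender": "FEMALE", "age": "ADULT", "emotion": "CALM"})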
That’s all, I guess. When I first got the idea of programming this conversation system with characters controlled by large language models, I knew that the multi-character convos would be the most difficult part. The second most difficult thing I pictured was actually making them talk out loud. No idea what big thing could be coming next. Anyway, back to the brothel.
EDIT: here’s a multi-char convo in audiobook form.