Neural narratives in Python #7

I recommend checking out the previous parts if you don’t know what this “neural narratives” thing is about. In short, I wrote a system in Python to have multi-character conversations with large language models (like Llama 3.1), in which the characters are isolated in terms of memories and bios, so nothing leaks to other participants like in Mantella. Here’s the GitHub repo.

Last time, I managed to nail creating voice lines for the dialogues of my app, relying on a RunPod server dedicated to generating audio. Now I want to take the app on a serious test run to see what it lacks.

Starting from scratch, I created a new world, a new region of that world, a new area of that region, and a new location of that area. I won’t detail the specifics, because the app itself should reveal them the way a story would. If in the course of testing the system I feel that something must be implemented to cover for its shortcomings, I will do so.

The process of creating a new playthrough requires you to provide some notion of the character you want to play as. I ended up with a good depiction of the guy.

I have the initial setting and the player character (including his automatically assigned voice model). A story needs other characters, so I went to the section that shows the character creation guidelines that have been generated for this combination of places.

I turned them into post-its. Anyway, my protagonist is a police detective, so he could do with a partner. I grabbed the second guideline and made her a woman.

The app doesn’t allow you to access most of the specific data of a character except in very particular circumstances, so in general, you have to glean the specifics of a character from their looks and the conversations you have with them.

In a story, you need some sense of where you are. There’s a system in place for the LLM to generate a description from the first-person perspective of the player character, and at this point it’s trivial to generate a voice line for it:

Let’s interact with the sole other character around (for now). As I was running a conversation with the protagonist’s partner, I ran into the first issue: when the app had to generate a voice line for a bit of ambient text, the server (the RunPod pod) returned a 404. Apparently, even if a pod is technically running, it can intermittently produce 404 errors for whatever reason. I’ll need to program in some retry system to cover these cases.

I did do that. Let’s continue.
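
For reference, here is a minimal sketch of what such a retry system could look like. It isn’t necessarily how the app does it: the function name, the set of retryable status codes, and the exponential backoff are all assumptions made for illustration. The `send` argument stands in for whatever function actually posts the text to the pod and returns a status code plus the audio bytes.

```python
import time
from typing import Callable, Tuple

# Status codes treated as transient. Including 404 is an assumption based on
# the intermittent 404s seen from the pod in this run, not a general rule.
RETRYABLE_STATUS_CODES = {404, 502, 503, 504}


def request_with_retries(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call send() until it succeeds, fails for good, or attempts run out.

    send() is a hypothetical callable returning (status_code, body).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status == 200:
            return body
        # Give up on non-transient errors, or once the last attempt has failed.
        if status not in RETRYABLE_STATUS_CODES or attempt == max_attempts:
            raise RuntimeError(f"Voice line request failed with status {status}")
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))

    # Not reachable when max_attempts >= 1; kept so the function never
    # returns None implicitly.
    raise RuntimeError("Voice line request failed")
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in normal use you would leave it at its default.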

The “stop your a coffee” was my blunder. Old, stupid fingers.

The couple of grizzled detectives exited the police station, back to the grim city surrounding it.

Now I want my characters to move to the mentioned location, some bar. Even though I didn’t create any other locations for this run other than the police station, there is a lingering issue with the interface: when plenty of possible locations exist, if you press the button “Search for location,” it may link locations you don’t want (like a cave, a hospital, etc.). Now that I was looking specifically for a bar, I figured that I may as well fix this issue.

It took quite a while, but now the user can only search for locations by type. In fact, if no locations are available, because they have already been used or they don’t match the area’s categories (you don’t want a fantasy bar in a cosmic horror story), the select and the button will be disabled.
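
The filtering behind that interface can be sketched roughly as follows. This is an illustration under assumptions, not the app’s actual code: the `LocationTemplate` class, its fields, and both function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class LocationTemplate:
    """Hypothetical shape of a stored location; field names are assumptions."""
    name: str
    type: str                   # e.g. "bar", "hospital", "cave"
    categories: FrozenSet[str]  # e.g. {"noir"}, {"fantasy"}
    used: bool = False


def _is_available(template: LocationTemplate, area_categories: Iterable[str]) -> bool:
    # Available means: not used yet, and sharing at least one category with the area.
    return not template.used and bool(template.categories & set(area_categories))


def available_location_types(
    templates: Iterable[LocationTemplate], area_categories: Iterable[str]
) -> List[str]:
    """Types to offer in the select; an empty list means 'disable the controls'."""
    wanted = set(area_categories)
    return sorted({t.type for t in templates if _is_available(t, wanted)})


def search_location(
    templates: Iterable[LocationTemplate],
    area_categories: Iterable[str],
    location_type: str,
    rng: Optional[random.Random] = None,
) -> Optional[LocationTemplate]:
    """Pick a random available location of the requested type, or None."""
    wanted = set(area_categories)
    candidates = [
        t for t in templates
        if t.type == location_type and _is_available(t, wanted)
    ]
    if not candidates:
        return None
    return (rng or random.Random()).choice(candidates)
```

The page would then disable the select and the button whenever `available_location_types(...)` comes back empty.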

Well, that was all for today. I expected to do more, but reworking that interface was arduous.

Neural narratives in Python #6

In the previous entry, I came up with the notion of producing audio voice lines for the conversations. Mantella had spoiled me in that regard: hearing those fictional characters answering you in reasonably good voices while you stared at them did wonders for immersion. And it was a bad idea to shove that possibility into my mind, because it prevented me from sleeping last night. Instead, I moved to my desk at three in the morning and started implementing it. Now, every generated character gets assigned a voice model according to their peculiarities, and each speech turn produces voice lines that the user can play by clicking the speech bubbles. It works perfectly.

I learned that it’s a terrible idea to play audio server-side, because it crashes the server. Flask, the web framework my app is built on, also doesn’t let the client access arbitrary files on the server (which is probably true of every web framework), so I had to move all the audio-playing logic to JavaScript.

Given this example chat I had with a new character, who had been assigned a matching voice automatically among the relatively few I’ve introduced into the system so far:

The short convo produced this audio exchange:

Like in the original Mantella system, the quality of voice models varies greatly; sometimes they sound like theater students reading a script, recorded on a home mic. Also, the process of generating the clips sometimes shears the very end of their final sentence. Still, I can hardly complain. Listening to the characters adds so much life to the conversations you can have through this app that I see myself enjoying it for a long time to come (and not only for smut).

I’m amazed that I got this running. So, how did it happen?

In the beginning, I thought that setting up my own, local XTTS server (XTTS being a model for generating voice lines) was a good idea. I struggled every step of the way for a few hours, fighting against obscure documentation, until I finally managed to generate a sample voice line, only to find out that it sounded like ass. Why, I have no idea. So I discarded that notion and instead looked into Mantella’s codebase, which is up on GitHub, to see how they connected to the RunPod pods to request voice generation. RunPod is a sort of online rental service for computing and server time: you can set up a pre-configured little server whose only job is to generate voice lines, and as long as you can connect to it, you’re set. It only costs seventeen cents an hour, too. Once I managed to query the list of available voice models from the RunPod pod, I knew I was going to get through this thing.

So, I had a list of all possible voice models I could rely on, and it turned out to be about five hundred fifty. They are trained from game voices, so there’s a whole breadth of possible voices one can use. How to classify them? Should I create a page on my site with a simple select box, letting the user (meaning me) scroll through a list that long?

ChatGPT, even its latest Orion preview version, clarified that it knows of no online service that could classify the more than five hundred voice samples I had produced from those voice models. I would have to do it manually, but to begin with it would be enough to introduce twenty or so models into the system. What tags can be applied to a voice? I relied on ChatGPT to figure that out. Now that I have that list, classifying each voice model is as easy, if time-consuming, as listening to its sample on a loop while adding appropriate tags. I have ended up, so far, with the following JSON file of voice models:

{
  "npcmmel": [
    "MALE",
    "ADULT",
    "CONFIDENT",
    "STEADY",
    "SMOOTH",
    "CLEAR",
    "FORMAL",
    "CHARMING",
    "NO SPECIAL EFFECTS"
  ],
  "npcmlucasmiller": [
    "MALE",
    "ADULT",
    "CALM",
    "FAST-PACED",
    "SMOOTH",
    "CASUAL",
    "KIND",
    "NO SPECIAL EFFECTS"
  ],
  "robotmsnanny": [
    "FEMALE",
    "YOUNG ADULT",
    "STEADY",
    "WARM",
    "CASUAL",
    "MELODIC",
    "YOUTHFUL",
    "NO SPECIAL EFFECTS"
  ],
  "npcma951": [
    "MALE",
    "ADULT",
    "ANXIOUS",
    "SLOW",
    "AIRY",
    "SKEPTICAL",
    "NO SPECIAL EFFECTS"
  ],
  "npcfphyllisdaily": [
    "FEMALE",
    "ADULT",
    "STOIC",
    "SLOW",
    "MONOTONE",
    "INSTRUCTIONAL",
    "CALCULATING",
    "NO SPECIAL EFFECTS"
  ],
  "femalechild": [
    "FEMALE",
    "CHILDLIKE",
    "PLAYFUL",
    "STEADY",
    "AIRY",
    "MELODIC",
    "INNOCENT",
    "NO SPECIAL EFFECTS"
  ],
  "femaleyoungeager": [
    "FEMALE",
    "YOUNG ADULT",
    "HOPEFUL",
    "FAST-PACED",
    "CLEAR",
    "INTENSE",
    "OPTIMISTIC",
    "NO SPECIAL EFFECTS"
  ],
  "femalevampire": [
    "FEMALE",
    "MIDDLE-AGED",
    "ARROGANT",
    "STEADY",
    "SMOOTH",
    "AUTHORITATIVE",
    "CYNICAL",
    "NO SPECIAL EFFECTS"
  ],
  "femalekhajiit": [
    "FEMALE",
    "ADULT",
    "CALM",
    "STEADY",
    "GRAVELLY",
    "CASUAL",
    "PHILOSOPHICAL",
    "NO SPECIAL EFFECTS"
  ],
  "femaleuniqueghost": [
    "FEMALE",
    "YOUNG ADULT",
    "RESIGNED",
    "STEADY",
    "ETHEREAL",
    "MELODIC",
    "INNOCENT",
    "GHOSTLY"
  ],
  "femaleghoul": [
    "FEMALE",
    "ADULT",
    "MENACING",
    "STEADY",
    "RASPY",
    "INTENSE",
    "ENERGETIC",
    "NO SPECIAL EFFECTS"
  ],
  "femaleboston": [
    "FEMALE",
    "ADULT",
    "CALM",
    "DRAWLING",
    "SOFT-SPOKEN",
    "WARM",
    "FLIRTATIOUS",
    "SULTRY",
    "NO SPECIAL EFFECTS"
  ]
}

I wrote a function that narrows down the list of voice models one category of tags at a time: gender, age, emotion, tempo, volume, texture, style, personality, and special effects. If at some point there are no matching voice models, it returns a random one from the previous filtering. I’ll probably add a simple button to the characters section of the site that redoes the process for any existing character, in case another random fitting voice works better.
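
A minimal sketch of that narrowing logic, assuming the JSON above has been loaded into a dict that maps each voice model name to its list of tags. The function name and signature are made up for illustration, and so is the assumption that the character’s desired tags arrive already ordered by category (gender first, special effects last).

```python
import random
from typing import Dict, List, Optional, Sequence


def select_voice_model(
    voice_models: Dict[str, List[str]],
    desired_tags: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Narrow the candidates one tag at a time and pick one at random.

    If a tag would leave no candidates, stop narrowing and pick from the
    result of the previous filtering instead.
    """
    if not voice_models:
        raise ValueError("voice_models must not be empty")

    rng = rng or random.Random()
    candidates = list(voice_models)

    for tag in desired_tags:
        narrowed = [name for name in candidates if tag in voice_models[name]]
        if not narrowed:
            # Nothing matches this tag: fall back to the previous filtering.
            break
        candidates = narrowed

    return rng.choice(candidates)


# Usage with a few entries taken from the JSON file above.
voices = {
    "npcmmel": ["MALE", "ADULT", "CONFIDENT", "STEADY", "SMOOTH"],
    "npcma951": ["MALE", "ADULT", "ANXIOUS", "SLOW", "AIRY"],
    "femalechild": ["FEMALE", "CHILDLIKE", "PLAYFUL", "STEADY", "AIRY"],
}
print(select_voice_model(voices, ["MALE", "ADULT", "ANXIOUS"]))
```

Since the pick is random whenever several voices survive the filtering, calling the function again for the same character is all a “redo” button would need to do.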

That’s all, I guess. When I first got the idea about programming this conversation system with characters controlled by large language models, I knew that programming the multi-char convos would be the most difficult thing. The second most difficult thing that I pictured was actually making them talk out loud. No idea what big thing could be coming next. Anyway, back to the brothel.

EDIT: here’s a multi-char convo in audiobook form.