Right now the audio transcription text that comes back is a mess. It's hard to know who said what, and I don't get any value from the blurb, but I keep it in case I need to refer back to it.
So, I recommend developing the transcription AI so that it can recognize voices and separate out Person 1 vs Person 2 vs Person 3.
Then, I recommend formatting the output something like:
Person 1: blah
Person 2: blah
Person 1: blah
Person 3: blah
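
To illustrate the format above, here is a rough Python sketch of the formatting step only. It assumes the voice-recognition step already hands back (speaker ID, text) pairs in spoken order; that input shape, the speaker IDs, and the function name format_transcript are all made up for illustration and are not something the current tool provides.

```python
from typing import Iterable, List, Tuple

# Hypothetical input: (speaker_id, text) pairs in spoken order, as a
# voice-recognition step might emit them. The IDs are arbitrary labels.
Segment = Tuple[str, str]


def format_transcript(segments: Iterable[Segment]) -> List[str]:
    """Render segments as 'Person N: text' lines.

    Speakers are numbered in order of first appearance, and consecutive
    segments from the same speaker are merged into one line.
    """
    labels = {}  # speaker_id -> "Person N"
    turns = []   # list of [label, text] in spoken order
    for speaker_id, text in segments:
        text = text.strip()
        if not text:
            continue  # skip empty segments
        label = labels.setdefault(speaker_id, f"Person {len(labels) + 1}")
        if turns and turns[-1][0] == label:
            turns[-1][1] += " " + text  # same speaker keeps talking
        else:
            turns.append([label, text])
    return [f"{label}: {text}" for label, text in turns]


if __name__ == "__main__":
    example = [
        ("spk_a", "Hi there."),
        ("spk_a", "How are you?"),
        ("spk_b", "Good."),
        ("spk_a", "Great."),
        ("spk_c", "Hello all."),
    ]
    print("\n".join(format_transcript(example)))
```

Merging back-to-back segments from the same speaker keeps each turn on one line, so the result reads like a script instead of a wall of text.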