Combining the UWP SpeechSynthesizer and AudioGraph APIs

As it seems to be a 'pop'-ular thing to do...

Published on 25 January 2017

Synchronicity is a wonderful thing.

Just this morning I was considering using the new SpeechSynthesizer capabilities of the UWP platform to add spoken language to my ToddlerBox app for Xbox. I had already started using the AudioGraph classes to play sounds in the app, so I ideally wanted to continue using this API to emit speech.

Then, during my morning... ahem... ablutions, I came across this post by Mike Taulty, who was looking to do the same thing but for different reasons. It seems that the RaspberryPi has a firmware issue that causes a popping noise every time speech is emitted using the MediaPlayer, and AudioGraph seems to be a way of resolving it.

The problem

Mike had implemented a means of emitting speech via AudioGraph by saving the SpeechSynthesisStream to a temporary file and then using multiple AudioFileInputNode instances to render the speech to the AudioGraph.

"Well", I thought, "there's got to be a better way. How hard can this be..."

Turns out the answer is: "Not all that hard, but...".

My approach

I wanted to find a way to eliminate the need for the temporary files and render the speech stream directly to the graph.

To do this, I first saved the SpeechSynthesisStream to a file so that I could examine the content. As expected, the file turned out to be a simple 32-bit, mono, ADPCM waveform in WAV/RIFF format.
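In case you want to do the same, something along these lines does the trick (a sketch rather than the exact code from the app; the text and file name are placeholders):

    // Sketch: synthesize some speech and dump the stream to a temporary file
    // so that its RIFF/WAV content can be inspected in a hex editor.
    var synthesizer = new SpeechSynthesizer();
    SpeechSynthesisStream speechStream =
        await synthesizer.SynthesizeTextToStreamAsync("Hello world");

    StorageFile file = await ApplicationData.Current.TemporaryFolder.CreateFileAsync(
        "speech.wav", CreationCollisionOption.ReplaceExisting);

    using (IRandomAccessStream fileStream = await file.OpenAsync(FileAccessMode.ReadWrite))
    {
        await RandomAccessStream.CopyAsync(speechStream, fileStream);
    }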

Having previously messed about with AudioGraph, I knew there was a way of creating an in-memory waveform and that the Windows-Universal-Samples GitHub repository had an AudioCreation sample that showed how to do this.

Fundamentally, this sample shows how to use the QuantumStarted event of the AudioFrameInputNode to dynamically add AudioFrame instances to the node, which are then rendered to the output node. An extract from the sample is shown here:

  unsafe private AudioFrame GenerateAudioData(uint samples)
  {
      // Buffer size is (number of samples) * (size of each sample)
      // We choose to generate single channel (mono) audio. For multi-channel, multiply by number of channels
      uint bufferSize = samples * sizeof(float);
      AudioFrame frame = new Windows.Media.AudioFrame(bufferSize);

      using (AudioBuffer buffer = frame.LockBuffer(AudioBufferAccessMode.Write))
      using (IMemoryBufferReference reference = buffer.CreateReference())
      {
          byte* dataInBytes;
          uint capacityInBytes;
          float* dataInFloat;

          // Get the buffer from the AudioFrame
          ((IMemoryBufferByteAccess)reference).GetBuffer(out dataInBytes, out capacityInBytes);

          // Cast to float since the data we are generating is float
          dataInFloat = (float*)dataInBytes;

          float freq = 1000; // choosing to generate frequency of 1kHz
          float amplitude = 0.3f;
          // 'graph' and 'theta' are fields of the sample's containing class;
          // theta carries the sine wave's phase across successive frames
          int sampleRate = (int)graph.EncodingProperties.SampleRate;
          double sampleIncrement = (freq * (Math.PI * 2)) / sampleRate;

          // Generate a 1kHz sine wave and populate the values in the memory buffer
          for (int i = 0; i < samples; i++)
          {
              double sinValue = amplitude * Math.Sin(theta);
              dataInFloat[i] = (float)sinValue;
              theta += sampleIncrement;
          }
      }

      return frame;
  }

Imitation being the sincerest form of flattery, I then refactored this code to read the binary data from the SpeechSynthesisStream rather than generate a sine wave as shown above. This was greatly facilitated by the WindowsRuntimeStreamExtensions.AsStreamForRead method, which allowed me to use basic Stream methods (specifically Stream.ReadByte()) instead of having to mess about with IBuffer instances.
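In other words, wrapping the WinRT stream becomes a one-liner (a sketch, reusing the speechStream from the snippet above):

    // Wrap the WinRT SpeechSynthesisStream as a System.IO.Stream so that
    // Position, Length and ReadByte() can be used directly.
    // (AsStreamForRead is an extension method in the System.IO namespace.)
    _stream = speechStream.AsStreamForRead();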

In short order, I ended up with the code below (where _stream is a member of the containing class holding the Stream wrapped around the underlying SpeechSynthesisStream):

    private unsafe void QuantumStarted(AudioFrameInputNode sender, FrameInputNodeQuantumStartedEventArgs args)
    {
        uint numSamplesNeeded = (uint)args.RequiredSamples;

        if (numSamplesNeeded != 0 && _stream.Position < _stream.Length)
        {
            uint bufferSize = numSamplesNeeded * sizeof(float);
            AudioFrame frame = new AudioFrame(bufferSize);

            using (AudioBuffer buffer = frame.LockBuffer(AudioBufferAccessMode.Write))
            {
                using (IMemoryBufferReference reference = buffer.CreateReference())
                {
                    byte* dataInBytes;
                    uint capacityInBytes;

                    // Get the buffer from the AudioFrame
                    ((IMemoryBufferByteAccess)reference).GetBuffer(out dataInBytes, out capacityInBytes);

                    // Copy the raw waveform bytes from the speech stream into the
                    // frame, zero-filling once the stream is exhausted
                    for (int i = 0; i < bufferSize; i++)
                    {
                        if (_stream.Position < _stream.Length)
                        {
                            dataInBytes[i] = (byte)_stream.ReadByte();
                        }
                        else
                        {
                            dataInBytes[i] = 0;
                        }
                    }
                }
            }

            _frameInputNode.AddFrame(frame);
        }
    }

And to my surprise, it worked!

I encapsulated this code into a class named AudioSpeechInputNode and made this class implement IAudioInputNode so it could be treated like any other node in the AudioGraph. Finally, I added an extension method to AudioGraph that creates an instance of this node in the same way that other nodes are created. This is shown below:

    AudioSpeechInputNode speechInputNode = await _graph.CreateSpeechInputNodeAsync(new SpeechSynthesizer(), "As input node");
    speechInputNode.AddOutgoingConnection(_outputNode); // device output node
    speechInputNode.Stop();

With this node in hand you're then at liberty to call the Start, Stop and Reset methods as you see fit.
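For context, here's roughly what the surrounding graph setup looks like (a sketch only: _graph and _outputNode are the fields referenced above, the render category is a guess, and checking each result's Status is omitted for brevity):

    // Sketch of the graph and device output node that the speech node plugs into
    AudioGraphSettings settings = new AudioGraphSettings(AudioRenderCategory.Speech);
    CreateAudioGraphResult graphResult = await AudioGraph.CreateAsync(settings);
    _graph = graphResult.Graph;

    CreateAudioDeviceOutputNodeResult outputResult = await _graph.CreateDeviceOutputNodeAsync();
    _outputNode = outputResult.DeviceOutputNode;

    // ... create and connect the speech input node as shown above ...

    _graph.Start();          // the graph runs continuously...
    speechInputNode.Start(); // ...and the speech node is started whenever speech is wanted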

Et voilà, a SpeechSynthesisStream rendered in an AudioGraph without the need for an intermediate file.

You said there was a 'but' ...

Well, yes. Three of them actually.

The big 'but'

While this approach certainly removes the need for temporary files and the 'popping' sound each time speech is emitted, I'm afraid to say it does not resolve the 'popping' noise encountered when the application starts on a RaspberryPi.

Being a good nerd, I had a spare RaspberryPi 3 hanging around with a recent version of Windows 10 IoT Core installed. It took just a few minutes to recompile my sample app to ARM and deploy it to the Pi, whereupon I could confirm that there is no popping when emitting speech but there is when the application starts. In fact, I receive three distinct 'pops' during application start-up which, by studiously placing breakpoints, I isolated to AudioGraph.CreateAsync (two pops) and AudioGraph.Start (one pop).

Microsoft would have us believe that this is an issue with the RaspberryPi firmware but, as it also occurs on the DragonBoard 410c, I'm more inclined to believe it's an issue with the Windows drivers. On a hunch, I've just ordered a USB Sound Adapter from Amazon. This device is meant to be Windows and RaspberryPi compatible (which doesn't necessarily mean it'll work with IoT Core on the RPi) and, if it works, I'll be very interested to see whether I still get the popping noises when the application starts. I'll update this post once I have an answer...

Update: I'm pleased to say that this device not only works with Windows 10 IoT Core running on the RaspberryPi 3 but also resolves the issue with popping noises when the application starts. Of course, it would probably also solve the issue with popping noises when rendering speech through MediaPlayer, making my solution above less necessary.

The intermediate 'but'

My code makes a number of assumptions about the format of the SpeechSynthesisStream and encapsulates these as constants. It would be much better to read the format from the WAVE 'fmt ' chunk of the underlying RIFF structure in the stream but, being a pragmatic, YAGNI-principled developer... I skipped this for now.
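For the curious, reading the format would look something like this (again, a sketch rather than code from the app; it assumes a well-formed stream and no exotic chunk ordering):

    // Sketch: walk the RIFF chunks and build AudioEncodingProperties from the
    // 'fmt ' chunk instead of relying on hard-coded constants.
    // (AudioEncodingProperties is in Windows.Media.MediaProperties; BinaryReader in System.IO.)
    private static AudioEncodingProperties ReadWaveFormat(Stream stream)
    {
        using (var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true))
        {
            reader.ReadBytes(12); // skip 'RIFF', the overall size and 'WAVE'

            while (true)
            {
                string chunkId = new string(reader.ReadChars(4));
                uint chunkSize = reader.ReadUInt32();

                if (chunkId == "fmt ")
                {
                    reader.ReadUInt16();                    // format tag (1 == PCM)
                    ushort channels = reader.ReadUInt16();
                    uint sampleRate = reader.ReadUInt32();
                    reader.ReadUInt32();                    // average bytes per second
                    reader.ReadUInt16();                    // block align
                    ushort bitsPerSample = reader.ReadUInt16();

                    return AudioEncodingProperties.CreatePcm(sampleRate, channels, bitsPerSample);
                }

                reader.ReadBytes((int)chunkSize);           // skip any other chunk
            }
        }
    }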

The small 'but'

As is probably very obvious, the code above is in no way optimised. I'm sure there are much better and faster ways of storing and copying the binary data from the SpeechSynthesisStream into the AudioBuffer (perhaps just using an intermediate byte array would help) but, for now, this code works fine.
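By way of illustration, the per-byte loop could be replaced with a single Stream.Read and a Marshal.Copy, something like this (a sketch only, using the same locals as the handler above; Marshal lives in System.Runtime.InteropServices):

    // Sketch: bulk-read a quantum's worth of bytes and block-copy them into the
    // frame buffer, rather than calling Stream.ReadByte() once per byte.
    byte[] chunk = new byte[bufferSize];
    int bytesRead = _stream.Read(chunk, 0, chunk.Length);

    Marshal.Copy(chunk, 0, (IntPtr)dataInBytes, bytesRead);

    for (uint i = (uint)bytesRead; i < bufferSize; i++)
    {
        dataInBytes[i] = 0; // zero-fill any shortfall at the end of the stream
    }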

Show me the code

All the code for the above can be found in a UWP sample app within the BlogProjects repository of my GitHub account.

Do get in touch if you find this code helpful or have suggestions for improving it.