When starting Audio Manager, I receive one of the following error messages for several plug-ins: “Failed to load plug-in ‘…’. Could not load file or assembly ‘…’ or one of its dependencies. Operation is not supported. (Exception from HRESULT: 0x80131515)” / “Failed to load plug-in ‘…’. Die Datei oder Assembly “…” oder eine Abhängigkeit davon wurde nicht gefunden. Der Vorgang wird nicht unterstützt. (Ausnahme von HRESULT: 0x80131515)”

Windows may block plug-ins from being loaded if your browser / extraction program flags them as having been downloaded from the Internet.

The solution is to “unblock” the downloaded archive file prior to extracting it. It is not sufficient to unblock the AudioManager.exe file. You can unblock a file by right-clicking it in Windows Explorer, selecting “Properties”, and then clicking the “Unblock” button at the bottom or (in Windows 10 and later) checking the “Unblock” checkbox. (If there is no “Unblock” button/checkbox, the file has already been unblocked.)

How can I make the dialogue application play a sound?

On the dialogue side, you need to send an output act that includes a representation of type AudioTrack. In this representation, you can either use the trackId or the uri attribute to specify the sound to be played. The first is the ID of a track in your library and is usually a bit faster in execution. The second specifies the URL of a local (file://…) or remote (http://…) audio file. You need to address the output act to a device that can play audio. This can be a speaker device provided by Audio Manager, e.g. the same device you use for the TTS.

If you want to use Audio Manager to play the sound, you need to ensure that its configuration includes a SoundLibraryConnector. To this end, the XML config file needs to include a Connector section like the following:

<Devices xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="AudioDevices.xsd">
  <Device id="spk1">
    <Connector id="tts" class="DFKI.Automotive.Audio.TTS.DesktopTtsHandler"/>
    <Connector id="snd" class="DFKI.Automotive.Audio.SoundLibraryConnector">
      <dir>…</dir>
    </Connector>
  </Device>
</Devices>

You may adjust the sound library directory (<dir>…</dir>) as necessary. You can then place uncompressed .WAV files into that directory or a subdirectory. You can use the file name (without extension) as the trackId in the AudioTrack representation (see above) to refer to a file.
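
To illustrate how the trackId relates to the files in the library directory, the following Python sketch builds the same kind of mapping (file name without extension to audio file). This is only an illustration of the convention described above, not Audio Manager’s actual lookup code:

```python
from pathlib import Path

def build_track_index(library_dir):
    """Map trackIds (file names without extension) to .wav files,
    scanning the library directory and all its subdirectories."""
    index = {}
    for wav in Path(library_dir).rglob("*.wav"):
        index[wav.stem] = wav  # e.g. "chime" -> <library_dir>/sounds/chime.wav
    return index
```

An AudioTrack representation with trackId “chime” would then resolve to the file chime.wav found anywhere below the library directory.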

What is the difference between the Desktop Speech and Speech Server ASR/TTS engines?

Audio Manager supports both engines, but there are a few differences.

  • Eric Brown (Microsoft) answered the following: “The APIs are mostly the same, but the underlying engines are different. Typically, the Server engine is designed to accept telephone-quality audio for command & control applications; the Desktop engine is designed to accept higher-quality audio for both command & control and dictation applications. You can use [the Desktop version] on a server OS, but it’s not designed to scale nearly as well as [the Server version]. The differences are that the Server engine won’t need training, and will work with lower-quality audio, but will have a lower recognition quality than the Desktop engine.” (Source: http://stackoverflow.com/questions/2977338/what-is-the-difference-between-system-speech-recognition-and-microsoft-speech-re/2982910#2982910)
  • Furthermore, the Server ASR does not include a dictation grammar.
  • In addition, language module availability may differ between the two (see the corresponding sections in the AM manual).

When using streaming, there is a delay between TTS generation and the audible output.

Try to reduce the buffer size for your StreamingPlaybackDevice.
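
The latency a playback buffer adds is simply its size divided by the byte rate of the audio stream, so you can estimate the effect of a buffer size change in advance. A small Python sketch (the buffer size and audio format below are example values, not Audio Manager defaults):

```python
def buffer_latency_ms(buffer_bytes, sample_rate, bit_depth, channels):
    """Latency in milliseconds added by a playback buffer of the given size."""
    byte_rate = sample_rate * (bit_depth // 8) * channels  # bytes per second
    return 1000.0 * buffer_bytes / byte_rate

# A 32 KiB buffer for 16-bit stereo audio at 48 kHz delays playback
# by about 171 ms; halving the buffer halves that delay.
print(round(buffer_latency_ms(32 * 1024, 48000, 16, 2)))  # 171
```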

Streaming does not seem to work reliably. Sometimes, audio does not arrive at all or only with delay.

  • Check if you have sufficient bandwidth for streaming. Streaming uncompressed data at a constant bitrate requires a certain bandwidth, which you can calculate. Other network traffic going on at the same time reduces the available bandwidth, and in WLANs, other network nodes may reduce it as well. You can conserve bandwidth by reducing the audio format quality, e.g. by choosing a sample rate of 16 kHz instead of 48 kHz, or mono instead of stereo.
  • Repeated data loss can also cause delays and drops. If you are using a wireless connection with bad transmission conditions (distance, obstacles, interference…), you are likely to experience drops. This is particularly bad with TCP-based streaming (including HTTP streaming), since each drop will cause streaming to block until a retransmission occurs.
  • Make sure that your client device has enough processing power. We have observed interruptions in streaming that were mistakenly attributed to the network but were actually caused by an endpoint device with low processing power (Google Glass) that was unable to handle high-bitrate streaming in parallel with other tasks.
  • Generally, the streaming monitors (configuration pages for streaming virtual devices / servers) in Audio Manager will help you analyze throughput and identify the bottleneck of your streaming.
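
The bandwidth needed for uncompressed PCM streaming is simply sample rate × bit depth × channels, so the savings from reducing the format quality are easy to quantify. A quick Python sketch:

```python
def pcm_bandwidth_kbps(sample_rate, bit_depth, channels):
    """Constant bitrate (in kbit/s) of an uncompressed PCM audio stream."""
    return sample_rate * bit_depth * channels / 1000

# Dropping from 48 kHz 16-bit stereo to 16 kHz 16-bit mono
# reduces the required bandwidth to one sixth.
print(pcm_bandwidth_kbps(48000, 16, 2))  # 1536.0 kbit/s
print(pcm_bandwidth_kbps(16000, 16, 1))  # 256.0 kbit/s
```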

Trying to run the Audio Manager Local Client using Mono yields the following error message: “Could not load file or assembly ‘Microsoft.VisualBasic, Version=, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a’ or one of its dependencies.”

You need to install the mono-basic package.

A physical device (speaker or microphone) was unplugged. When I plug it back in, it does not resume playing / recording.

  • Recovery from such disconnection situations is a new feature of Audio Manager that may not yet work under all circumstances.
  • In any case, if the device is not re-detected, invoke a manual device scan by selecting Devices > List Devices in the console menu, and then clicking the Refresh button.

When trying to set up a server for streaming, I receive the error message “Failed to start the streaming server”.

You need to run Audio Manager as an administrator to host a streaming server.

How can I enable barge-in for my ASR scenario?

Generally, barge-in means that the user can start speaking before a TTS output has finished, thereby interrupting the TTS. The required method depends on the ASR mode.

  • In push-to-talk mode, your application needs to listen for the ASR event that signals that the PTT button was pressed. There is both a connector message and a SiAM-dp DeviceStateChanged client notification available for that purpose. When this signal is received, the application should cancel all TTS output.
  • In speak-to-activate mode, your application similarly needs to listen for the ASR event that signals speech, and cancel all TTS output in that case. An additional challenge in this mode is to prevent the TTS from triggering ASR events when loudspeakers are used (echo). You will need to add acoustic echo cancellation (AEC) for that. Audio Manager provides the Voice Capture DSP for that purpose, which you can add to your configuration (see “Voice Capture DSP” in section “Common Plug-ins” of the Audio Manager documentation for details).

The quality of speech recognition is poor. What can I do to improve it?

This can actually have numerous reasons and requires a systematic analysis of all factors, including signal, configuration, grammar, etc. You can use the following as a troubleshooting guideline.

  • Test in a quiet environment. Background noise (especially speech) can severely degrade recognition quality. Special types of microphones (e.g. directional) can reduce the effects of background noise. Also make sure no TTS output interferes with the recording. (If you need TTS at the same time as recording, you will need to configure a special barge-in setup – see the corresponding question.)
  • Always speak into the microphone from a close distance (5-10cm). A headset may help.
  • Depending on the type of microphone, it should point towards the speaker.
  • In push-to-talk mode, make sure you do not start speaking too early.
  • Use an external microphone. Even a very cheap external microphone works usually much better than a microphone built into a laptop or other device, which picks up fan noise and vibration and is further away from the source.
  • Attach a windshield to the microphone to reduce the effect of air movement in the environment and of plosives such as P and T. This is especially important for (high-end) condenser microphones.
  • Make sure that the correct microphone is selected as input. Sometimes, even though an external microphone is connected, the internal one is still used for ASR because it is set as default recording device or manually configured (this can be hard to detect).
  • Check the microphone’s amplification settings in the system audio control panel or on the external sound board, and make sure no post-processing effects are enabled. For microphones requiring 24/48 V phantom power, ensure this power is provided.
  • If the audio is too quiet, you may try to apply Automatic Gain Control. Audio Manager provides such a mechanism as part of the Voice Capture DSP.
  • Try with a different microphone and placement.
  • Test the audio signal for noise and artifacts. You can use the configuration dialog of the ASR to play back the last recognizer input. This is preferred over other (external) recording applications, as it represents the final post-processed signal used for the actual recognition. (If a third-party ASR does not support this feature, you can have Audio Manager write the recording to a disk file using the corresponding connector.) You should be able to hear a clear, undistorted speech signal without missing segments. Sudden signal breaks and noise may indicate a cable break or short-circuit.
  • If there is background noise you cannot eliminate by changing the physical setup, you can try to use a noise suppression filter. Audio Manager provides such a mechanism as part of the Voice Capture DSP.
  • If you are using a barge-in setup, you may need to adjust the echo cancellation parameters.
  • Be sure that the language (culture) is set correctly in the configuration and in the grammar files.
  • If there are names or badly recognized words in your grammar, especially if they come from a different language, specifying them by phonetic transcription instead of text notation is likely to increase recognition accuracy.
  • If you use a grammar, always try to minimize the number of rules / utterances possible at any time. The more utterances can be recognized, the higher the possible ambiguity. Utterances which you don’t need to recognize should be disabled by dynamically updating the grammar.
  • If utterances (or even nonverbal audio input) that are not part of your grammar are wrongfully recognized as application vocabulary, try to add a “garbage dictation grammar” (see the corresponding option in the ASR parameters in the AM documentation).
  • Dictation (free text input) is a very difficult task with untrained speaker profiles (the default for the ASRs supported by Audio Manager). Short utterances are error-prone in any case because there are so many alternatives; complete sentences should work better. If you know that only a single speaker will use the system, consider enabling training via configuration parameters where available.
  • Using very short words in your grammar is generally more challenging for ASR. You can try to change your input accordingly.
  • You can sometimes fine-tune the behavior of certain ASRs by looking at individual recognition alternatives returned and their confidences. You may also be able to change the alternatives behavior through configuration parameters – check the ASR reference for details.
  • Try with a different speaker. Some voices are more difficult to recognize than others. Accents and dialects do not make things easier either.
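
If you have saved a recording to a WAV file (see the playback / disk-recording hint above), you can also check its signal level programmatically. The following sketch uses only the Python standard library and assumes 16-bit PCM; a peak near 1.0 suggests clipping, while a very low RMS points to a wrong input device, missing amplification, or missing phantom power:

```python
import struct
import wave

def wav_peak_and_rms(path):
    """Return (peak, rms) of a 16-bit PCM WAV file, normalized to 0..1."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    peak = max(abs(s) for s in samples) / 32768
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5 / 32768
    return peak, rms
```

For multi-channel files, the interleaved samples yield a combined level across all channels, which is sufficient for this kind of sanity check.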

Playback is stuttering (e.g. TTS).

In setups with many connectors, devices, channels, or high sampling rates, performance issues can lead to stuttering. First, check whether other processes with a high CPU load may be the cause. Next, try the following:

  • Reduce resampling by configuring connectors to use the device sampling rate OR change the physical device sampling rate to match the connector’s rate.
  • A 1:1 channel mapping reduces the required channel multiplexing.
  • Enable buffering on the logical device level (TBD) at the cost of response time.
  • If you have implemented your own connectors, try to “pre-compute” audio data if possible.
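
To see why matching sample rates helps, consider that any rate mismatch forces the audio pipeline to interpolate every output sample. The following naive linear resampler illustrates the kind of per-sample work that a matching configuration avoids (an illustration only, not Audio Manager’s actual resampling code):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler: one interpolation per
    output sample, i.e. work proportional to the output length."""
    if src_rate == dst_rate:
        return list(samples)  # matching rates: nothing to do
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] + (nxt - samples[j]) * frac)
    return out
```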

When trying to use the ASR, I receive the error message “Failed to add connector of type ‘DFKI.Automotive.Audio.ASR.DesktopAsrHandler’: No recognizer of the required ID found. Parameter name: culture (ApplicationException)”.

You are trying to use a recognizer language with the desktop ASR for which the corresponding language pack is not installed (or not functioning) on your system. If you use Windows 8.x or Windows 7 Ultimate, you can install additional language packs via the control panel (Language settings in 8.x, or Windows Update in Windows 7). In Windows 10, you have to use the Settings app to install the speech input module for the desired language (click Time and Language, then Region and Language, select a language (you may need to add it first), click Options, and then click the Download button next to Speech Recognition).

If you don’t have this option, consider changing the DesktopAsrHandler to SpeechServerAsrHandler by replacing the corresponding strings in your configuration file. For the Speech Server ASR, you can download any required language packs from the web (see “Installing the Speech Server ASR”) independent from the OS language.

When trying to use the Speech Server ASR/TTS, I receive one of the following error messages: “Failed to locate connector of type ‘DFKI.Automotive.Audio.ASR.SpeechServerAsrHandler’: Type not found or TypeLoadException.” / “Failed to locate connector of type ‘DFKI.Automotive.Audio.ASR.ServerTtsHandler’: Type not found or TypeLoadException.” / “Retrieving the COM class factory for component with CLSID {…} failed due to the following error: 80040154 Class not registered”

Most likely, Speech Server is not installed or working correctly.

  • Make sure the runtime and the language pack(s) are installed and that you have selected the 32-bit version of Speech Server and (if applicable) of the voice / recognition engine (even if you are running a 64-bit OS).
  • Alternatively, change the Audio Manager configuration from SpeechServerAsrHandler to DesktopAsrHandler or from ServerTtsHandler to DesktopTtsHandler. Whether this works depends on your OS version, OS language, installed language packs, and grammar language.
  • See the section on SAPI ASR and TTS plug-ins in the Audio Manager documentation for more details.

I hear audio only from one speaker. In earlier versions, I had perfect stereo output.

In version 1.0.8, some processing behavior was changed (actually fixed), which may have this side effect when using old (“wrong”) config files. Check your config file for a line such as

<PhysicalDevice channel="0">…</PhysicalDevice>

and change it into

<PhysicalDevice channel="-1">…</PhysicalDevice>

There is no TTS output, but the connector is attached and there is no error message.

Try to set the bit depth to 16 using the configuration parameter bitdepth.

Where can I find details about an error that occurs in Audio Manager?

  • Some additional details can be found in the system event log. To open it, run “eventvwr” from the command line. Then, under “Application and Service logs”, select “SiAM-dp”. You can also quickly access the event log from the Audio Manager console’s menu.
  • If no details are written to the event log, the log or event source may not be installed. Run Audio Manager explicitly as an administrator to do that.

Audio Manager crashes on the first start.

When starting for the first time, it is recommended to run explicitly with administrator privileges. Right-click the executable in Windows Explorer and choose “Run as Administrator”.

After starting the Audio Manager, I receive a message that every port can only be used once.

Make sure that Audio Manager is not already running. Check the tray area of the taskbar and the process list. If it is not running, check whether another application is using the same port.

When starting the Audio Manager, the window does not contain any text.

Make sure that the required version of the .NET Framework is installed.