The quality of speech recognition is poor. What can I do to improve it?

Poor recognition can have many causes, so it requires a systematic analysis of all factors, including the audio signal, the configuration, and the grammar. You can use the following as a troubleshooting guideline.

  • Test in a quiet environment. Background noise (especially speech) can severely degrade recognition quality. Special types of microphones (e.g. directional) can reduce the effects of background noise. Also make sure no TTS output interferes with the recording. (If you need TTS at the same time as recording, you will need to configure a special barge-in setup; see the corresponding question.)
  • Always speak into the microphone from a close distance (5-10cm). A headset may help.
  • Depending on the type of microphone, it should point towards the speaker.
  • In push-to-talk mode, make sure you do not start speaking too early.
  • Use an external microphone. Even a very cheap external microphone usually works much better than a microphone built into a laptop or other device, which picks up fan noise and vibration and is further away from the source.
  • Attach a windshield to the microphone to reduce the effect of air movement in the environment and of plosives such as P and T. This is especially important for (high-end) condenser microphones.
  • Make sure that the correct microphone is selected as input. Sometimes, even though an external microphone is connected, the internal one is still used for ASR because it is set as default recording device or manually configured (this can be hard to detect).
  • Make sure the amplification settings for the microphone, set in the system audio control panel or on the external sound board, are correct, and that post-processing effects are disabled. For microphones requiring 24/48V phantom power, ensure this power is provided.
  • If the audio is too quiet, you can try applying Automatic Gain Control. Audio Manager provides such a mechanism as part of the Voice Capture DSP.
  • Try with a different microphone and placement.
  • Test the audio signal for noise and artifacts. You can use the configuration dialog of the ASR to play back the last recognizer input. This is preferred over other (external) recording applications, as it represents the final post-processed signal used for the actual recognition. (If a third-party ASR does not support this feature, you can have Audio Manager write the recording to a disk file using the corresponding connector.) You should be able to hear a clear, undistorted speech signal without missing segments. Sudden signal breaks and noise may indicate a cable break or short-circuit.
  • If there is background noise you cannot eliminate by changing the physical setup, you can try to use a noise suppression filter. Audio Manager provides such a mechanism as part of the Voice Capture DSP.
  • If you are using a barge-in setup, you may need to adjust the echo cancellation parameters.
  • Be sure that the language (culture) is set correctly in the configuration and in the grammar files.
  • If there are names or badly recognized words in your grammar, especially if they come from a different language, specifying them by phonetic transcription instead of text notation is likely to increase recognition accuracy.
  • If you use a grammar, always try to minimize the number of rules / utterances possible at any time. The more utterances that can be recognized, the higher the potential ambiguity. Utterances that you do not need to recognize should be disabled by dynamically updating the grammar.
  • If utterances (or even nonverbal audio input) that are not part of your grammar are wrongly recognized as application vocabulary, try adding a “garbage dictation grammar” (see the corresponding option in the ASR parameters in the AM documentation).
  • Dictation (free text input) is very difficult with untrained speaker profiles (the default for the ASRs supported by Audio Manager). Short utterances are expected to be error-prone in any case because there are so many alternatives. Complete sentences should work better. If you know that only a single speaker will use the system, consider enabling training via configuration parameters where the ASR supports it.
  • Very short words in your grammar are generally more challenging for ASR. Try replacing them with longer words or phrases where possible.
  • You can sometimes fine-tune the behavior of certain ASRs by looking at the individual recognition alternatives returned and their confidences. You may also be able to change how alternatives are produced through configuration parameters; check the ASR reference for details.
  • Try with a different speaker. Some voices are more difficult to recognize than others. Accents and dialects can make recognition harder as well.
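
The advice above on checking the signal for low volume and distortion can be backed by a quick measurement on a recording written to disk. The following is a minimal sketch, assuming a 16-bit mono PCM WAV file; the function names and the thresholds (-40 dBFS RMS as "too quiet", 0.1% of samples near full scale as "clipping") are illustrative assumptions, not values taken from Audio Manager or any ASR.

```python
import math
import sys
import wave
from array import array

FULL_SCALE = 32768.0  # magnitude of the most negative 16-bit sample


def read_pcm16_mono(path):
    """Load the samples of a 16-bit mono PCM WAV file as a list of ints."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return list(samples)


def to_dbfs(value):
    """Convert a linear amplitude (0..1) to dB relative to full scale."""
    return -math.inf if value <= 0 else 20.0 * math.log10(value)


def level_report(samples, clip_threshold=32700):
    """Return peak and RMS level in dBFS and the share of near-clipped samples."""
    if not samples:
        raise ValueError("no samples")
    peak = max(abs(s) for s in samples) / FULL_SCALE
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / FULL_SCALE
    clipped = sum(1 for s in samples if abs(s) >= clip_threshold) / len(samples)
    return {
        "peak_dbfs": to_dbfs(peak),
        "rms_dbfs": to_dbfs(rms),
        "clipped_ratio": clipped,
    }


def diagnose(report, quiet_rms_dbfs=-40.0, max_clipped_ratio=0.001):
    """Classify a report; both thresholds are assumptions to tune per setup."""
    if report["clipped_ratio"] > max_clipped_ratio:
        return "clipping"
    if report["rms_dbfs"] < quiet_rms_dbfs:
        return "too quiet"
    return "ok"
```

For a file on disk, `diagnose(level_report(read_pcm16_mono("recording.wav")))` gives a first hint (the file name is a placeholder). A "too quiet" result points towards microphone distance, gain settings, or Automatic Gain Control; a "clipping" result points towards excessive amplification. Listening to the recording remains necessary, since level figures alone do not reveal dropouts or background speech.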
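
The point about inspecting recognition alternatives and their confidences can be illustrated with a small filter that accepts a result only when it is both confident and clearly ahead of the runner-up. This is a sketch under assumptions: the `Alternative` type, the 0..1 confidence scale, and the threshold values are hypothetical and do not correspond to the result format of any particular ASR.

```python
from typing import NamedTuple, Optional, Sequence


class Alternative(NamedTuple):
    """Hypothetical recognition alternative: text plus a 0..1 confidence."""
    text: str
    confidence: float


def pick_result(
    alternatives: Sequence[Alternative],
    min_confidence: float = 0.6,  # assumed threshold, tune per ASR
    min_margin: float = 0.1,      # assumed required lead over the runner-up
) -> Optional[str]:
    """Return the best alternative's text, or None if the result is doubtful."""
    ranked = sorted(alternatives, key=lambda a: a.confidence, reverse=True)
    if not ranked:
        return None
    best = ranked[0]
    if best.confidence < min_confidence:
        return None  # recognizer itself is unsure
    if len(ranked) > 1 and best.confidence - ranked[1].confidence < min_margin:
        return None  # two candidates are too close to tell apart
    return best.text
```

Returning `None` lets the application ask the user to repeat the input instead of acting on a doubtful result. If rejections cluster around the same pairs of utterances, that suggests those grammar entries sound too similar and may need rewording.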

Category: Speech / Audio