Recognizer

The recognizer property is used in multiple verbs (gather, transcribe, dial). It selects and configures the speech recognizer.

It is an object containing the following properties:

Option Description Required
vendor Speech vendor to use (see list below, along with any others you add via the custom speech API). No
language Language code to use for speech detection. Defaults to the application-level setting. No
fallbackVendor Fallback speech vendor to use (see list below, along with any others you add via the custom speech API). No
fallbackLanguage Fallback language code to use for speech detection. Defaults to the application-level setting. No
interim If true, interim transcriptions are sent. Default: false. No
hints (Google, Microsoft, Deepgram, Nvidia, Soniox) Array of words or phrases to assist speech detection. See examples below. No
hintsBoost (Google, Nvidia) Number indicating the strength to assign to the configured hints. See examples below. No
profanityFilter (Google, Deepgram, Nuance, Nvidia) If true, filter profanity from speech transcription. Default: false. No
singleUtterance (Google) If true, return only a single utterance/transcript. Default: true for gather. No
vad.enable If true, delay connecting to the cloud recognizer until speech is detected. No
vad.voiceMs If VAD is enabled, the number of milliseconds of speech required before connecting to the cloud recognizer. No
vad.mode If VAD is enabled, this setting governs the sensitivity of the voice activity detector; the value must be between 0 and 3 inclusive, where lower numbers mean more sensitive. No
separateRecognitionPerChannel If true, recognize both caller and called party speech using separate recognition sessions. No
altLanguages (Google, Microsoft) An array of alternative languages that the speaker may be using. No
punctuation (Google) Enable automatic punctuation. No
model (Google) Speech recognition model to use. Default: phone_call. No
enhancedModel (Google) Use enhanced model. No
words (Google) Enable word offsets. No
diarization (Google) Enable speaker diarization. No
diarizationMinSpeakers (Google) Set the minimum speaker count. No
diarizationMaxSpeakers (Google) Set the maximum speaker count. No
interactionType (Google) Set the interaction type: discussion, presentation, phone_call, voicemail, professionally_produced, voice_search, voice_command, dictation. No
naicsCode (Google) Set an industry NAICS code that is relevant to the speech. No
vocabularyName (AWS) The name of a vocabulary to use when processing the speech. No
vocabularyFilterName (AWS) The name of a vocabulary filter to use when processing the speech. No
filterMethod (AWS) The method to use when filtering the speech: remove, mask, or tag. No
languageModelName (AWS) The name of the custom language model when processing speech. No
identifyChannels (AWS) Enable channel identification. No
profanityOption (Microsoft) masked, removed, or raw. Default: raw. No
outputFormat (Microsoft) simple or detailed. Default: simple. No
requestSnr (Microsoft) Request signal to noise information. No
initialSpeechTimeoutMs (Microsoft) Initial speech timeout in milliseconds. No
minConfidence If provided, final transcripts with a confidence lower than this value return a reason of 'stt-low-confidence' in the webhook. No
transcriptionHook Webhook to receive an HTTP POST when an interim or final transcription is received. Yes
asrTimeout Timeout value for continuous ASR feature. No
asrDtmfTerminationDigit DTMF key that terminates continuous ASR feature. No
azureServiceEndpoint Custom service endpoint to connect to, instead of hosted Microsoft regional endpoints. No
azureOptions Azure-specific speech recognition options (see below). No
deepgramOptions Deepgram-specific speech recognition options (see below). No
ibmOptions IBM Watson-specific speech recognition options (see below). No
nuanceOptions Nuance-specific speech recognition options (see below). No
nvidiaOptions Nvidia-specific speech recognition options (see below). No
sonioxOptions Soniox-specific speech recognition options (see below). No
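
For example, a recognizer configuration attached to a gather or transcribe verb might look like the sketch below. The vendor identifier (assumed here to be the lowercase vendor name), language code, hints, and webhook path are all illustrative values, and any of the options above can be added as needed:

"recognizer": {
  "vendor": "google",
  "language": "en-US",
  "interim": true,
  "hints": ["benign", "malignant", "biopsy"],
  "transcriptionHook": "/transcription"
}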

Speech-to-Text Vendors

Zerpia natively supports the following speech-to-text services:

  • AssemblyAI
  • AWS
  • Azure
  • Cobalt
  • Deepgram
  • Google
  • IBM
  • Nuance
  • NVIDIA
  • Soniox

Note: Microsoft supports on-prem and private link options for deploying the speech service in addition to the hosted Microsoft service.

Google, Microsoft, Deepgram, and Nvidia all support the ability to provide a dynamic list of words or phrases that should be “boosted” by the recognizer, meaning the recognizer should be more likely to detect these terms and return them in the transcript. A boost factor can also be applied. In the most basic implementation, it would look like this:

"hints": ["benign", "malignant", "biopsy"],
"hintsBoost": 50

Additionally, Google and Nvidia allow a boost factor to be specified at the phrase level, e.g.:

"hints": [
  {"phrase": "benign", "boost": 50},
  {"phrase": "malignant", "boost": 10},
  {"phrase": "biopsy", "boost": 20}
]

Azure Options

azureOptions is an object with the following properties.

Option Description Required
speechSegmentationSilenceTimeoutMs Duration (in ms) of non-speech audio within a phrase that’s currently being spoken before that phrase is considered “done.” See the Microsoft documentation for details. No
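
As a sketch, azureOptions is passed as a nested object inside recognizer; the vendor identifier and timeout value below are illustrative:

"recognizer": {
  "vendor": "microsoft",
  "language": "en-US",
  "azureOptions": {
    "speechSegmentationSilenceTimeoutMs": 800
  }
}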

Nuance Options

nuanceOptions is an object with the following properties. Please refer to the Nuance Documentation for detailed descriptions.

Option Description Required
clientId Nuance client ID to authenticate with (overrides setting in Zerpia portal). No
secret Nuance secret to authenticate with (overrides setting in Zerpia portal). No
kryptonEndpoint Endpoint of on-prem Krypton endpoint to connect to. Defaults to hosted service. No
topic Specialized language model. No
utteranceDetectionMode How many sentences (utterances) within the audio stream are processed ('single', 'multiple', 'disabled'). Default: single. No
punctuation Whether to enable auto punctuation. No
includeTokenization Whether to include tokenized recognition result. No
discardSpeakerAdaptation If speaker profiles are used, whether to discard updated speaker data. By default, data is stored. No
suppressCallRecording Whether to disable call logging and audio capture. By default, call logs, audio, and metadata are collected. No
maskLoadFailures Whether to terminate recognition when failing to load external resources. No
suppressInitialCapitalization When true, the first word in a sentence is not automatically capitalized. No
allowZeroBaseLmWeight When true, custom resources (DLMs, wordsets, etc.) can use the entire weight range. No
filterWakeupWord Whether to remove the wakeup word from the final result. No
resultType The level of recognition results ('final', 'partial', 'immutable_partial'). Default: final. No
noInputTimeoutMs Maximum silence, in milliseconds, allowed while waiting for user input after recognition timers are started. No
recognitionTimeoutMs Maximum duration, in milliseconds, of recognition turn. No
utteranceEndSilenceMs Minimum silence, in milliseconds, that determines the end of a sentence. No
maxHypotheses Maximum number of n-best hypotheses to return. No
speechDomain Mapping to internal weight sets for language models in the data pack. No
userId Identifies a specific user within the application. No
speechDetectionSensitivity A balance between detecting speech and noise (breathing, etc.), 0 to 1. 0 means ignore all noise, 1 means interpret all noise as speech. Default: 0.5. No
clientData An object containing arbitrary key-value pairs to inject into the call log. No
formatting.scheme Keyword for a formatting type defined in the data pack. No
formatting.options Object containing key-value pairs of formatting options and values defined in the data pack. No
resource An array of zero or more recognition resources (domain LMs, wordsets, etc.) to improve recognition. No
resource[].inlineWordset Inline wordset JSON resource. See Wordsets for details. No
resource[].builtin Name of a built-in resource in the data pack. No
resource[].inlineGrammar Inline grammar, SRGS XML format. No
resource[].wakeupWord Array of wakeup words. No
resource[].weightName Input field setting the weight of the domain LM or built-in relative to the data pack ('defaultWeight', 'lowest', 'low', 'medium', 'high', 'highest'). Default: medium. No
resource[].weightValue Weight of DLM or built-in as a numeric value from 0 to 1. Default: 0.25. No
resource[].reuse Whether the resource will be used multiple times ('undefined_reuse', 'low_reuse', 'high_reuse'). Default: low_reuse. No
resource[].externalReference An external DLM or settings file for creating or updating a speaker profile. No
resource[].externalReference.type Resource type ('undefined_resource_type', 'wordset', 'compiled_wordset', 'domain_lm', 'speaker_profile', 'grammar', 'settings'). No
resource[].externalReference.uri Location of the resource as a URN reference. No
resource[].externalReference.maxLoadFailures When true, allow transcription to proceed if resource loading fails. No
resource[].externalReference.requestTimeoutMs Time to wait when downloading resources. No
resource[].externalReference.headers An object containing HTTP cache-control directives (e.g., max-age). No
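
To illustrate how these properties nest, here is a hedged sketch of a nuanceOptions block containing a single inline wordset resource. The topic, category name, and wordset entries are invented for illustration; see the Nuance documentation and the Wordsets reference for the exact wordset format:

"recognizer": {
  "vendor": "nuance",
  "nuanceOptions": {
    "topic": "GEN",
    "punctuation": true,
    "resultType": "final",
    "resource": [
      {
        "inlineWordset": {
          "PLACES": [
            {"literal": "La Jolla", "spoken": ["la hoya"]}
          ]
        },
        "weightName": "medium"
      }
    ]
  }
}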

Deepgram Options

deepgramOptions is an object with the following properties. Please refer to the Deepgram Documentation for detailed descriptions.

Option Description Required
apiKey Deepgram API key to authenticate with (overrides setting in Zerpia portal). No
tier Deepgram tier you would like to use ('enhanced', 'base'). Default: base. No
model Deepgram model used to process submitted audio ('general', 'meeting', 'phonecall', 'voicemail', 'finance', 'conversationalai', 'video', 'custom'). Default: general. No
endpointing Number of milliseconds of silence Deepgram uses to determine that a speaker has finished saying a word or phrase. The value must be either a number of milliseconds or 'false' to disable the feature entirely. Note: Deepgram's default endpointing value is 10 milliseconds. You can set this value higher to require more silence before a final transcript is returned, but we suggest a value of 1000 (one second) or less, as we have observed strange behavior with higher values. If you wish to allow longer pauses during a conversation before a transcript is returned, we suggest using the utteranceEndMs property described below instead. No (default: 10ms)
customModel ID of custom model. No
version Deepgram version of model to use. Default: latest. No
punctuate Indicates whether to add punctuation and capitalization to the transcript. No
profanityFilter Indicates whether to remove profanity from the transcript. No
redact Whether to redact information from transcripts ('pci', 'numbers', 'true', 'ssn'). No
diarize Whether to assign a speaker to each word in the transcript. No
diarizeVersion If set to '2021-07-14.0', the legacy diarization feature will be used. No
multichannel Indicates whether to transcribe each audio channel independently. No
alternatives Number of alternative transcripts to return. No
numerals Indicates whether to convert numbers from written format (e.g., “one”) to numerical format (e.g., “1”). No
search An array of terms or phrases to search for in the submitted audio. No
replace An array of terms or phrases to search for in the submitted audio and replace. No
keywords An array of keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. No
tag A tag to associate with the request. Tags appear in usage reports. No
utteranceEndMs (added in 0.8.5) A number of milliseconds of silence that Deepgram will wait after the last word was spoken before returning an UtteranceEnd event, which is used by Zerpia to trigger the transcript webhook if this property is supplied. This is essentially Deepgram’s version of continuous ASR (and in fact if you enable continuous ASR on Deepgram, it will work by enabling this property). No
shortUtterance (added in 0.8.5) Causes a transcript to be returned as soon as Deepgram sets the is_final property. This should only be used in scenarios where you are expecting a very short confirmation or directed command and you want minimal latency. No
smartFormatting (added in 0.8.5) Indicates whether to enable Deepgram’s Smart Formatting feature. No
fillerWords (added in 0.9.3) Indicates if Deepgram should transcribe filler words. No
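
As an illustrative sketch, a deepgramOptions block that uses utteranceEndMs for the continuous-ASR-style behavior described above might look like this (all values are examples only):

"recognizer": {
  "vendor": "deepgram",
  "deepgramOptions": {
    "model": "phonecall",
    "punctuate": true,
    "smartFormatting": true,
    "endpointing": 10,
    "utteranceEndMs": 1000
  }
}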

IBM Options

ibmOptions is an object with the following properties. Please refer to the IBM Watson Documentation for detailed descriptions.

Option Description Required
sttApiKey IBM API key to authenticate with (overrides setting in Zerpia portal). No
sttRegion IBM region (overrides setting in Zerpia portal). No
instanceId IBM speech instance ID (overrides setting in Zerpia portal). No
model The model to use for speech recognition. No
languageCustomizationId ID of a custom language model. No
acousticCustomizationId ID of a custom acoustic model. No
baseModelVersion Base model to be used. No
watsonMetadata A tag value to apply to the request data provided. No
watsonLearningOptOut Set to true to prevent IBM from using your API request data to improve their service. No
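
A minimal sketch of ibmOptions; the model name and customization ID are placeholders, and credentials supplied here override the portal settings:

"recognizer": {
  "vendor": "ibm",
  "ibmOptions": {
    "model": "en-US_Telephony",
    "languageCustomizationId": "your-custom-model-id",
    "watsonLearningOptOut": true
  }
}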

Nvidia Options

nvidiaOptions is an object with the following properties. Please refer to the Nvidia Riva Documentation for detailed descriptions.

Option Description Required
rivaUri gRPC endpoint (IP:port) that Nvidia Riva is listening on. No
maxAlternatives Number of alternatives to return. No
profanityFilter Indicates whether to remove profanity from the transcript. No
punctuation Indicates whether to provide punctuation in the transcripts. No
wordTimeOffsets Indicates whether to provide word-level detail. No
verbatimTranscripts Indicates whether to provide verbatim transcripts. No
customConfiguration An object of key-value pairs that can be sent to Nvidia for custom configuration. No
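
A sketch of nvidiaOptions, assuming a self-hosted Riva server; the gRPC address is a placeholder:

"recognizer": {
  "vendor": "nvidia",
  "nvidiaOptions": {
    "rivaUri": "10.0.0.5:50051",
    "punctuation": true,
    "profanityFilter": true
  }
}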

Soniox Options

sonioxOptions is an object with the following properties. Please refer to the Soniox Documentation for detailed descriptions.

Option Description Required
api_key Soniox API key. No
model Soniox model to use. Default: precision_ivr. No
profanityFilter Indicates whether to remove profanity from the transcript. No
storage Properties that dictate whether to store audio and/or transcripts. Can be useful for debugging purposes. No
storage.id Storage identifier. No
storage.title Storage title. No
storage.disableStoreAudio If true, do not store audio. Default: false. No
storage.disableStoreTranscript If true, do not store transcript. Default: false. No
storage.disableSearch If true, do not allow search. Default: false. No
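
A sketch of sonioxOptions showing the nested storage object; the id and title are placeholders:

"recognizer": {
  "vendor": "soniox",
  "sonioxOptions": {
    "model": "precision_ivr",
    "storage": {
      "id": "call-1234",
      "title": "support call",
      "disableStoreAudio": true
    }
  }
}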
