Recognizer

The recognizer property is used in multiple verbs (gather, transcribe, dial). It selects and configures the speech recognizer.

It is an object containing the following properties:

Option Description Required
vendor Speech vendor to use (see list below, along with any others you add via the custom speech API). No
language Language code to use for speech detection. Defaults to the application-level setting. No
fallbackVendor Fallback speech vendor to use (see list below, along with any others you add via the custom speech API). No
fallbackLanguage Fallback language code to use for speech detection. Defaults to the application-level setting. No
interim If true, interim transcriptions are sent. Default: false. No
hints (Google, Microsoft, Deepgram, Nvidia, Soniox) Array of words or phrases to assist speech detection. See examples below. No
hintsBoost (Google, Nvidia) Number indicating the strength to assign to the configured hints. See examples below. No
profanityFilter (Google, Deepgram, Nuance, Nvidia) If true, filter profanity from speech transcription. Default: false. No
singleUtterance (Google) If true, return only a single utterance/transcript. Default: true for gather. No
vad.enable If true, delay connecting to the cloud recognizer until speech is detected. No
vad.voiceMs If VAD is enabled, the number of milliseconds of speech required before connecting to the cloud recognizer. No
vad.mode If VAD is enabled, this setting governs the sensitivity of the voice activity detector; the value must be between 0 and 3 inclusive, where lower numbers mean more sensitive. No
separateRecognitionPerChannel If true, recognize both caller and called party speech using separate recognition sessions. No
altLanguages (Google, Microsoft) An array of alternative languages that the speaker may be using. No
punctuation (Google) Enable automatic punctuation. No
model (Google) Speech recognition model to use. Default: phone_call. No
enhancedModel (Google) Use enhanced model. No
words (Google) Enable word offsets. No
diarization (Google) Enable speaker diarization. No
diarizationMinSpeakers (Google) Set the minimum speaker count. No
diarizationMaxSpeakers (Google) Set the maximum speaker count. No
interactionType (Google) Set the interaction type: discussion, presentation, phone_call, voicemail, professionally_produced, voice_search, voice_command, dictation. No
naicsCode (Google) Set an industry NAICS code that is relevant to the speech. No
vocabularyName (AWS) The name of a vocabulary to use when processing the speech. No
vocabularyFilterName (AWS) The name of a vocabulary filter to use when processing the speech. No
filterMethod (AWS) The method to use when filtering the speech: remove, mask, or tag. No
languageModelName (AWS) The name of the custom language model when processing speech. No
identifyChannels (AWS) Enable channel identification. No
profanityOption (Microsoft) masked, removed, or raw. Default: raw. No
outputFormat (Microsoft) simple or detailed. Default: simple. No
requestSnr (Microsoft) Request signal to noise information. No
initialSpeechTimeoutMs (Microsoft) Initial speech timeout in milliseconds. No
minConfidence If provided, final transcripts with a confidence lower than this value return a reason of 'stt-low-confidence' in the webhook. No
transcriptionHook Webhook to receive an HTTP POST when an interim or final transcription is received. Yes
asrTimeout Timeout value for continuous ASR feature. No
asrDtmfTerminationDigit DTMF key that terminates continuous ASR feature. No
azureServiceEndpoint Custom service endpoint to connect to, instead of hosted Microsoft regional endpoints. No
azureOptions Azure-specific speech recognition options (see below). No
deepgramOptions Deepgram-specific speech recognition options (see below). No
ibmOptions IBM Watson-specific speech recognition options (see below). No
nuanceOptions Nuance-specific speech recognition options (see below). No
nvidiaOptions Nvidia-specific speech recognition options (see below). No
sonioxOptions Soniox-specific speech recognition options (see below). No
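
For example, a recognizer configuration attached to a gather or transcribe verb might look like the sketch below. The vendor identifier (assumed here to be the lowercase vendor name), language code, hints, and webhook path are all illustrative values, and any of the options above can be added as needed:

"recognizer": {
  "vendor": "google",
  "language": "en-US",
  "interim": true,
  "hints": ["benign", "malignant", "biopsy"],
  "transcriptionHook": "/transcription"
}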

Speech-to-Text Vendors

Zerpia natively supports the following speech-to-text services:

  • AssemblyAI
  • AWS
  • Azure
  • Cobalt
  • Deepgram
  • Google
  • IBM
  • Nuance
  • NVIDIA
  • Soniox

Note: Microsoft supports on-prem and private link options for deploying the speech service in addition to the hosted Microsoft service.

Google, Microsoft, Deepgram, and Nvidia all support the ability to provide a dynamic list of words or phrases that should be “boosted” by the recognizer, meaning the recognizer should be more likely to detect these terms and return them in the transcript. A boost factor can also be applied. In the most basic implementation, it would look like this:

"hints": ["benign", "malignant", "biopsy"],
"hintsBoost": 50

Additionally, Google and Nvidia allow a boost factor to be specified at the phrase level, e.g.:

"hints": [
  {"phrase": "benign", "boost": 50},
  {"phrase": "malignant", "boost": 10},
  {"phrase": "biopsy", "boost": 20}
]

Azure Options

azureOptions is an object with the following properties.

Option Description Required
speechSegmentationSilenceTimeoutMs Duration (in ms) of non-speech audio within a phrase that’s currently being spoken before that phrase is considered “done.” See the Microsoft documentation for details. No
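
As a sketch, azureOptions is passed as a nested object inside recognizer; the vendor identifier and timeout value below are illustrative:

"recognizer": {
  "vendor": "microsoft",
  "language": "en-US",
  "azureOptions": {
    "speechSegmentationSilenceTimeoutMs": 800
  }
}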

Nuance Options

nuanceOptions is an object with the following properties. Please refer to the Nuance Documentation for detailed descriptions.

Option Description Required
clientId Nuance client ID to authenticate with (overrides setting in Zerpia portal). No
secret Nuance secret to authenticate with (overrides setting in Zerpia portal). No
kryptonEndpoint Endpoint of on-prem Krypton endpoint to connect to. Defaults to hosted service. No
topic Specialized language model. No
utteranceDetectionMode How many sentences (utterances) within the audio stream are processed ('single', 'multiple', 'disabled'). Default: single. No
punctuation Whether to enable auto punctuation. No
includeTokenization Whether to include tokenized recognition result. No
discardSpeakerAdaptation If speaker profiles are used, whether to discard updated speaker data. By default, data is stored. No
suppressCallRecording Whether to disable call logging and audio capture. By default, call logs, audio, and metadata are collected. No
maskLoadFailures Whether to terminate recognition when failing to load external resources. No
suppressInitialCapitalization When true, the first word in a sentence is not automatically capitalized. No
allowZeroBaseLmWeight When true, custom resources (DLMs, wordsets, etc.) can use the entire weight range. No
filterWakeupWord Whether to remove the wakeup word from the final result. No
resultType The level of recognition results ('final', 'partial', 'immutable_partial'). Default: final. No
noInputTimeoutMs Maximum silence, in milliseconds, allowed while waiting for user input after recognition timers are started. No
recognitionTimeoutMs Maximum duration, in milliseconds, of recognition turn. No
utteranceEndSilenceMs Minimum silence, in milliseconds, that determines the end of a sentence. No
maxHypotheses Maximum number of n-best hypotheses to return. No
speechDomain Mapping to internal weight sets for language models in the data pack. No
userId Identifies a specific user within the application. No
speechDetectionSensitivity A balance between detecting speech and noise (breathing, etc.), 0 to 1. 0 means ignore all noise, 1 means interpret all noise as speech. Default: 0.5. No
clientData An object containing arbitrary key-value pairs to inject into the call log. No
formatting.scheme Keyword for a formatting type defined in the data pack. No
formatting.options Object containing key-value pairs of formatting options and values defined in the data pack. No
resource An array of zero or more recognition resources (domain LMs, wordsets, etc.) to improve recognition. No
resource[].inlineWordset Inline wordset JSON resource. See Wordsets for details. No
resource[].builtin Name of a built-in resource in the data pack. No
resource[].inlineGrammar Inline grammar, SRGS XML format. No
resource[].wakeupWord Array of wakeup words. No
resource[].weightName Input field setting the weight of the domain LM or built-in relative to the data pack ('defaultWeight', 'lowest', 'low', 'medium', 'high', 'highest'). Default: medium. No
resource[].weightValue Weight of DLM or built-in as a numeric value from 0 to 1. Default: 0.25. No
resource[].reuse Whether the resource will be used multiple times ('undefined_reuse', 'low_reuse', 'high_reuse'). Default: low_reuse. No
resource[].externalReference An external DLM or settings file for creating or updating a speaker profile. No
resource[].externalReference.type Resource type ('undefined_resource_type', 'wordset', 'compiled_wordset', 'domain_lm', 'speaker_profile', 'grammar', 'settings'). No
resource[].externalReference.uri Location of the resource as a URN reference. No
resource[].externalReference.maxLoadFailures When true, allow transcription to proceed if resource loading fails. No
resource[].externalReference.requestTimeoutMs Time to wait when downloading resources. No
resource[].externalReference.headers An object containing HTTP cache-control directives (e.g., max-age). No
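
To illustrate how these properties nest, here is a hedged sketch of a nuanceOptions block containing a single inline wordset resource. The topic, category name, and wordset entries are invented for illustration; see the Nuance documentation and the Wordsets reference for the exact wordset format:

"recognizer": {
  "vendor": "nuance",
  "nuanceOptions": {
    "topic": "GEN",
    "punctuation": true,
    "resultType": "final",
    "resource": [
      {
        "inlineWordset": {
          "PLACES": [
            {"literal": "La Jolla", "spoken": ["la hoya"]}
          ]
        },
        "weightName": "medium"
      }
    ]
  }
}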

Deepgram Options

deepgramOptions is an object with the following properties. Please refer to the Deepgram Documentation for detailed descriptions.

Option Description Required
apiKey Deepgram API key to authenticate with (overrides setting in Zerpia portal). No
tier Deepgram tier you would like to use ('enhanced', 'base'). Default: base. No
model Deepgram model used to process submitted audio ('general', 'meeting', 'phonecall', 'voicemail', 'finance', 'conversationalai', 'video', 'custom'). Default: general. No
endpointing Number of milliseconds of silence Deepgram uses to determine that a speaker has finished saying a word or phrase. The value must be either a number of milliseconds or 'false' to disable the feature entirely. Note: Deepgram's default endpointing value is 10 milliseconds. You can set this value higher to require more silence before a final transcript is returned, but we suggest a value of 1000 (one second) or less, as we have observed strange behavior with higher values. If you wish to allow longer pauses during a conversation before a transcript is returned, we suggest using the utteranceEndMs property described below instead. No (default: 10ms)
customModel ID of custom model. No
version Deepgram version of model to use. Default: latest. No
punctuate Indicates whether to add punctuation and capitalization to the transcript. No
profanityFilter Indicates whether to remove profanity from the transcript. No
redact Whether to redact information from transcripts ('pci', 'numbers', 'true', 'ssn'). No
diarize Whether to assign a speaker to each word in the transcript. No
diarizeVersion If set to '2021-07-14.0', the legacy diarization feature will be used. No
multichannel Indicates whether to transcribe each audio channel independently. No
alternatives Number of alternative transcripts to return. No
numerals Indicates whether to convert numbers from written format (e.g., “one”) to numerical format (e.g., “1”). No
search An array of terms or phrases to search for in the submitted audio. No
replace An array of terms or phrases to search for in the submitted audio and replace. No
keywords An array of keywords to which the model should pay particular attention, boosting or suppressing them to help it understand context. No
tag A tag to associate with the request. Tags appear in usage reports. No
utteranceEndMs (added in 0.8.5) A number of milliseconds of silence that Deepgram will wait after the last word was spoken before returning an UtteranceEnd event, which is used by Zerpia to trigger the transcript webhook if this property is supplied. This is essentially Deepgram’s version of continuous ASR (and in fact if you enable continuous ASR on Deepgram, it will work by enabling this property). No
shortUtterance (added in 0.8.5) Causes a transcript to be returned as soon as Deepgram sets the is_final property. This should only be used in scenarios where you are expecting a very short confirmation or directed command and you want minimal latency. No
smartFormatting (added in 0.8.5) Indicates whether to enable Deepgram’s Smart Formatting feature. No
fillerWords (added in 0.9.3) Indicates if Deepgram should transcribe filler words. No
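
As an illustrative sketch, a deepgramOptions block that uses utteranceEndMs for the continuous-ASR-style behavior described above might look like this (all values are examples only):

"recognizer": {
  "vendor": "deepgram",
  "deepgramOptions": {
    "model": "phonecall",
    "punctuate": true,
    "smartFormatting": true,
    "endpointing": 10,
    "utteranceEndMs": 1000
  }
}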

IBM Options

ibmOptions is an object with the following properties. Please refer to the IBM Watson Documentation for detailed descriptions.

Option Description Required
sttApiKey IBM API key to authenticate with (overrides setting in Zerpia portal). No
sttRegion IBM region (overrides setting in Zerpia portal). No
instanceId IBM speech instance ID (overrides setting in Zerpia portal). No
model The model to use for speech recognition. No
languageCustomizationId ID of a custom language model. No
acousticCustomizationId ID of a custom acoustic model. No
baseModelVersion Base model to be used. No
watsonMetadata A tag value to apply to the request data provided. No
watsonLearningOptOut Set to true to prevent IBM from using your API request data to improve their service. No
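
A minimal sketch of ibmOptions; the model name and customization ID are placeholders, and credentials supplied here override the portal settings:

"recognizer": {
  "vendor": "ibm",
  "ibmOptions": {
    "model": "en-US_Telephony",
    "languageCustomizationId": "your-custom-model-id",
    "watsonLearningOptOut": true
  }
}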

Nvidia Options

nvidiaOptions is an object with the following properties. Please refer to the Nvidia Riva Documentation for detailed descriptions.

Option Description Required
rivaUri gRPC endpoint (IP:port) that Nvidia Riva is listening on. No
maxAlternatives Number of alternatives to return. No
profanityFilter Indicates whether to remove profanity from the transcript. No
punctuation Indicates whether to provide punctuation in the transcripts. No
wordTimeOffsets Indicates whether to provide word-level detail. No
verbatimTranscripts Indicates whether to provide verbatim transcripts. No
customConfiguration An object of key-value pairs that can be sent to Nvidia for custom configuration. No
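
A sketch of nvidiaOptions, assuming a self-hosted Riva server; the gRPC address is a placeholder:

"recognizer": {
  "vendor": "nvidia",
  "nvidiaOptions": {
    "rivaUri": "10.0.0.5:50051",
    "punctuation": true,
    "profanityFilter": true
  }
}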

Soniox Options

sonioxOptions is an object with the following properties. Please refer to the Soniox Documentation for detailed descriptions.

Option Description Required
api_key Soniox API key. No
model Soniox model to use. Default: precision_ivr. No
profanityFilter Indicates whether to remove profanity from the transcript. No
storage Properties that dictate whether to store audio and/or transcripts. Can be useful for debugging purposes. No
storage.id Storage identifier. No
storage.title Storage title. No
storage.disableStoreAudio If true, do not store audio. Default: false. No
storage.disableStoreTranscript If true, do not store transcript. Default: false. No
storage.disableSearch If true, do not allow search. Default: false. No
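
A sketch of sonioxOptions showing the nested storage object; the id and title are placeholders:

"recognizer": {
  "vendor": "soniox",
  "sonioxOptions": {
    "model": "precision_ivr",
    "storage": {
      "id": "call-1234",
      "title": "support call",
      "disableStoreAudio": true
    }
  }
}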
