Listen
Zerpia does not have a ‘record’ verb. This is by design, for data privacy reasons:
Recordings can contain sensitive and confidential information about your customers, and such data is never stored at rest in the Zerpia core.
Instead, Zerpia provides the **listen** verb, with which one or more audio streams can be forked and sent in real time to your application for processing.
The **listen** verb can also be nested in a **dial** or **config** verb, which allows the audio for a call between two parties to be sent to a remote WebSocket server.
To use the **listen** verb, you must implement a WebSocket server to receive and process the audio. The endpoint should be prepared to accept WebSocket connections with a subprotocol name of ‘audio.zerpia.com’.
The audio sent over the WebSocket is 16-bit linear PCM, with a user-specified sample rate, delivered as binary frames over the WebSocket connection.
Additionally, one text frame is sent immediately after the WebSocket connection is established. This text frame contains a JSON string with all of the call attributes normally sent on an HTTP request (e.g., **callSid**, etc.), plus **sampleRate** and **mixType** properties describing the audio sample rate and stream(s). You can also add additional metadata to this payload using the **metadata** property, as described in the table below. Once the initial text frame containing the metadata has been sent, the remote side should expect to receive only binary frames containing audio.
Note that the remote side can optionally send messages and audio back over the WebSocket connection, as described below in Bidirectional Audio.
```json
{
  "verb": "listen",
  "url": "wss://myrecorder.example.com/calls",
  "mixType": "stereo"
}
```
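A minimal receiving server, sketched in Python with the `websockets` package (a recent release that supports single-argument handlers is assumed; the port and the choice to append audio to a local file are illustrative assumptions, not part of the Zerpia protocol):

```python
import asyncio
import json

import websockets  # pip install websockets


async def handle_call(websocket):
    # First frame: a JSON text frame with the call metadata,
    # including sampleRate and mixType.
    metadata = json.loads(await websocket.recv())
    print("call started:", metadata.get("callSid"), metadata)

    # All subsequent frames are binary 16-bit PCM audio; here we simply
    # append them to a raw file named after the call (illustrative only).
    with open(f"{metadata.get('callSid', 'call')}.raw", "ab") as f:
        async for frame in websocket:
            if isinstance(frame, bytes):
                f.write(frame)


async def main():
    # Accept connections that offer the 'audio.zerpia.com' subprotocol.
    async with websockets.serve(
        handle_call, "0.0.0.0", 8080, subprotocols=["audio.zerpia.com"]
    ):
        await asyncio.Future()  # run forever


if __name__ == "__main__":
    asyncio.run(main())
```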
You can use the following attributes with the **listen** verb:
| Option | Description | Required |
|---|---|---|
| actionHook | Webhook to invoke when the listen operation ends. The information will include the duration of the audio stream, and also a `digits` property if the listen was terminated by a DTMF key. | Yes |
| bidirectionalAudio.enabled | If `true`, enable bidirectional audio. | No (default: `true`) |
| bidirectionalAudio.streaming | If `true`, enable streaming of audio from your application to Zerpia (and the remote caller). | No (default: `false`) |
| bidirectionalAudio.sampleRate | Sample rate of the audio your application streams back. | No (required when streaming) |
| disableBidirectionalAudio | (Deprecated) If `true`, disable bidirectional audio (same as setting `bidirectionalAudio.enabled` to `false`). | No |
| finishOnKey | The set of digits that can end the listen action. | No |
| maxLength | The maximum length of the listened audio stream, in seconds. | No |
| metadata | Arbitrary data to add to the JSON payload sent to the remote server when the WebSocket connection is first established. | No |
| mixType | `"mono"` (send a single channel), `"stereo"` (send dual channels of both calls in a bridge), or `"mixed"` (send audio from both calls in a bridge in a single mixed audio stream). | No (default: `mono`) |
| passDtmf | If `true`, any DTMF digits detected from the caller will be passed over the WebSocket as text frames in JSON format. | No (default: `false`) |
| playBeep | Whether to play a beep at the start of the listen operation (`true` or `false`). | No (default: `false`) |
| sampleRate | Sample rate of the audio to send (allowable values: 8000, 16000, 24000, 48000, or 64000). | No (default: 8000) |
| timeout | The number of seconds of silence that terminates the listen operation. | No |
| transcribe | A nested transcribe verb. | No |
| url | URL of the remote WebSocket server to connect to. | Yes |
| wsAuth.username | HTTP basic auth username to use on the WebSocket connection. | No |
| wsAuth.password | HTTP basic auth password to use on the WebSocket connection. | No |
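For example, a **listen** that sends stereo 16 kHz audio, plays a beep, ends on the ‘#’ key, and passes DTMF might look like this (the URL, webhook path, and metadata values are placeholders):

```json
{
  "verb": "listen",
  "url": "wss://myrecorder.example.com/calls",
  "actionHook": "/listen-done",
  "mixType": "stereo",
  "sampleRate": 16000,
  "playBeep": true,
  "finishOnKey": "#",
  "passDtmf": true,
  "metadata": {
    "accountId": "abc123"
  }
}
```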
---
Passing DTMF
Any DTMF digits entered by the far-end party on the call can optionally be passed to the WebSocket server as JSON text frames by setting the **passDtmf** property to `true`. Each DTMF entry is reported separately in a payload that contains the specific DTMF key that was entered, as well as the duration, which is reported in RTP timestamp units. The payload sent will look like this:
```json
{
  "event": "dtmf",
  "dtmf": "2",
  "duration": "1600"
}
```
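In a receive loop like the server sketch above, these frames can be told apart from audio by their frame type; a sketch:

```python
import json


def handle_frame(frame, process_audio):
    """Dispatch one incoming frame (a sketch; assumes passDtmf is true).
    process_audio is a hypothetical callback for binary audio frames."""
    if isinstance(frame, bytes):
        process_audio(frame)  # binary frames are always audio
        return
    msg = json.loads(frame)
    if msg.get("event") == "dtmf":
        # duration is in RTP timestamp units: 1600 at 8000 Hz is ~200 ms
        print(f"caller pressed {msg['dtmf']} (duration {msg['duration']})")
```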
---
Bidirectional Audio
Audio can also be sent back over the WebSocket to Zerpia. This audio, if supplied, will be played out to the caller. (Note: bidirectional audio is not supported when the **listen** verb is nested in the context of a **dial** verb.)
There are two separate modes for bidirectional audio:
- **Non-streaming**, where you provide a full base64-encoded audio file as JSON text frames.
- **Streaming**, where you stream audio as L16 PCM raw audio as binary frames.
Non-streaming
The far-end WebSocket server supplies bidirectional audio by sending a JSON text frame over the WebSocket connection:
```json
{
  "type": "playAudio",
  "data": {
    "audioContent": "base64-encoded content..",
    "audioContentType": "raw",
    "sampleRate": "16000"
  }
}
```
In the example above, raw (headerless) audio is sent. The audio must be 16-bit PCM encoded, with a configurable sample rate of 8000, 16000, 24000, 32000, 48000, or 64000 Hz. Alternatively, a WAV file can be supplied by setting **audioContentType** to "wav" (or "wave"), in which case no **sampleRate** property is needed. In all cases, the audio must be base64 encoded when sent over the socket.
If multiple **playAudio** commands are sent before the first has finished playing, they will be queued and played in order. You may have up to 10 queued **playAudio** commands at any time.
Once a **playAudio** command has finished playing out the audio, a **playDone** JSON text frame will be sent over the WebSocket connection:
```json
{
  "type": "playDone"
}
```
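As a sketch of the round trip, assuming the Python server from earlier in this section (the file name is a placeholder), the helper below queues a WAV prompt and waits for the matching **playDone**:

```python
import base64
import json


async def play_wav(websocket, path):
    """Send a playAudio command with base64-encoded WAV content, then
    wait for Zerpia's playDone frame (a sketch)."""
    with open(path, "rb") as f:
        content = base64.b64encode(f.read()).decode("ascii")
    await websocket.send(json.dumps({
        "type": "playAudio",
        "data": {
            "audioContent": content,
            "audioContentType": "wav",  # WAV needs no sampleRate property
        },
    }))
    # Incoming binary audio frames keep arriving; skip them while
    # waiting for the playDone text frame.
    async for frame in websocket:
        if isinstance(frame, str) and json.loads(frame).get("type") == "playDone":
            break
```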
A **killAudio** command can also be sent by the WebSocket server to stop the playout of audio that was started via a previous **playAudio** command:
```json
{
  "type": "killAudio"
}
```
And finally, if the WebSocket server wishes to end the **listen**, it can send a **disconnect** command:
```json
{
  "type": "disconnect"
}
```
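Both **killAudio** and **disconnect** are plain JSON text frames. A minimal helper, sketched in Python on the assumption that `websocket` is the connection object from the earlier server sketch:

```python
import json


async def end_listen(websocket):
    """Stop any in-progress or queued playout, then ask Zerpia to end
    the listen (a sketch)."""
    await websocket.send(json.dumps({"type": "killAudio"}))
    await websocket.send(json.dumps({"type": "disconnect"}))
```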
Streaming Bidirectional Audio
Streaming bidirectional audio must be explicitly enabled in the **listen** verb via the **bidirectionalAudio.streaming** property, as shown below:
```json
{
  "verb": "listen",
  "bidirectionalAudio": {
    "enabled": true,
    "streaming": true,
    "sampleRate": 8000
  }
}
```
Your application should then send binary frames of linear-16 PCM raw data with the specified sample rate over the WebSocket connection.
Note that you can specify both the sample rate you want to receive over the WebSocket and the sample rate of the audio you send back, and they do not need to be the same. In the example below, we choose to receive 8 kHz audio but send back 16 kHz audio.
```json
{
  "verb": "listen",
  "sampleRate": 8000,
  "bidirectionalAudio": {
    "enabled": true,
    "streaming": true,
    "sampleRate": 16000
  }
}
```
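A sketch of the sending side, assuming the verb above (16 kHz return audio) and a raw PCM file as the source; the file name and the 20 ms pacing are illustrative choices:

```python
import asyncio


async def stream_pcm(websocket, path, sample_rate=16000):
    """Stream raw linear-16 PCM to Zerpia as binary frames (a sketch)."""
    chunk_bytes = int(sample_rate * 0.02) * 2  # 20 ms of 16-bit mono samples
    with open(path, "rb") as f:
        while data := f.read(chunk_bytes):
            await websocket.send(data)  # binary frame = raw audio
            await asyncio.sleep(0.02)   # pace at roughly real time
```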
Commands
You can send the following commands over the WebSocket as JSON text frames:

- **disconnect**
- **killAudio**
- **mark**
- **clearMarks**
Disconnect
```json
{
  "type": "disconnect"
}
```

This causes the WebSocket to be closed from the Zerpia side, and the associated **listen** verb to end.
KillAudio
```json
{
  "type": "killAudio"
}
```
This causes any audio that is playing out from the bidirectional socket, as well as any buffered audio, to be flushed.
Mark
```json
{
  "type": "mark",
  "data": {
    "name": "my-mark-1"
  }
}
```
You can send a **mark** command if you want to synchronize activities on your end with the playout of the audio stream you have provided. Because the audio you provide will usually be buffered before it is streamed, if you want to know when a specific piece of audio has started or completed, send a **mark** command with a **name** property at the point in the stream you want to sync with. When that point in the audio stream is later reached during playback, you will get a matching JSON frame back over the WebSocket:
```json
{
  "type": "mark",
  "data": {
    "name": "my-mark-1",
    "event": "playout"
  }
}
```
Note that **event** will contain either **playout** or **cleared**, depending on whether the audio stream reached the mark during playout or the mark was never played out due to a **killAudio** command.
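A sketch of using **mark** to learn when a queued prompt actually finished (or was killed), reusing the connection object from the earlier sketches; `play_audio_msg` is assumed to be a playAudio dict like the one shown above:

```python
import json


async def queue_with_mark(websocket, play_audio_msg, name):
    """Queue audio followed by a mark, then wait for the matching mark
    event: "playout" if it played, "cleared" if killAudio removed it."""
    await websocket.send(json.dumps(play_audio_msg))
    await websocket.send(json.dumps({"type": "mark", "data": {"name": name}}))
    async for frame in websocket:
        if isinstance(frame, bytes):
            continue  # ignore incoming call audio while waiting
        msg = json.loads(frame)
        if msg.get("type") == "mark" and msg["data"].get("name") == name:
            return msg["data"]["event"]
```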
ClearMarks
```json
{
  "type": "clearMarks"
}
```
This command clears (removes) any audio marks that are being tracked. When you remove marks in this way, you will not receive **mark** events for the removed marks.