WebSocket Protocol

The SFVoPI WebSocket protocol defines 8 event types for bidirectional audio streaming: 5 incoming events from Superfone and 3 outgoing commands from your server.

Protocol Overview

| Direction | Event Type | Purpose |
|---|---|---|
| Incoming (Superfone → Your Server) | start | Stream established, ready for media |
| Incoming | media | Audio chunk from caller |
| Incoming | dtmf | DTMF key pressed by caller |
| Incoming | clearedAudio | Acknowledgment of clearAudio command |
| Incoming | playedStream | Acknowledgment of checkpoint command |
| Outgoing (Your Server → Superfone) | playAudio | Send audio to caller |
| Outgoing | clearAudio | Clear audio playback buffer |
| Outgoing | checkpoint | Mark a named checkpoint in stream |

All messages are JSON-encoded strings sent over the WebSocket connection.
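
Because every message is a JSON string with an event field, it can help to parse and validate incoming messages in one place before routing them. The sketch below is one way to do that; the function name parseIncomingEvent and the choice to return null for malformed or unknown messages (instead of throwing) are illustrative assumptions, not part of the protocol.

```javascript
// The five event types Superfone sends to your server (from the table above)
const INCOMING_EVENTS = new Set(['start', 'media', 'dtmf', 'clearedAudio', 'playedStream']);

// Hypothetical helper: parse one WebSocket message into an event object.
// Returns null for invalid JSON, non-object values, or unknown event types,
// so a single bad frame does not throw inside your 'message' handler.
function parseIncomingEvent(raw) {
  let msg;
  try {
    // ws delivers a Buffer by default; strings are accepted too
    msg = JSON.parse(typeof raw === 'string' ? raw : raw.toString('utf8'));
  } catch {
    return null; // not valid JSON
  }
  if (msg === null || typeof msg !== 'object' || typeof msg.event !== 'string') {
    return null; // missing or malformed "event" field
  }
  return INCOMING_EVENTS.has(msg.event) ? msg : null;
}
```

A message handler can then `switch` on the returned object's event field and simply skip null results.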


Incoming Events (Superfone → Your Server)

1. start Event

Sent when the WebSocket connection is established and the audio stream is ready. This is the first event you'll receive.

When: Immediately after the WebSocket connection is established (including after a reconnection).

Fields:

| Field | Type | Description |
|---|---|---|
| event | string | Always "start" |
| streamId | string | Unique identifier for this audio stream. Format: `SFV_STRM_{IN\|OUT\|BI}_<ULID>`, where the middle segment reflects stream direction (IN inbound, OUT outbound, BI bidirectional). Treat as opaque. |
| callId | string | UUID of the call |

Example:

```json
{
  "event": "start",
  "streamId": "SFV_STRM_IN_01JJXYZ123ABC456DEF789GHI",
  "callId": "550e8400-e29b-41d4-a716-446655440000"
}
```

What to do:

  • Store streamId and callId for this connection
  • Initialize your audio processor (transcription, AI, recording)
  • Start any timers or background tasks
  • If this is a reconnection, resume processing from where you left off
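
One way to act on these steps is to keep per-call state in a map, so that a start event for a callId you have already seen resumes the existing session instead of starting from scratch. This is a minimal sketch that assumes callId stays the same across a reconnection while streamId may change; the section above does not state either, so verify this against real traffic. The names sessions and onStart are illustrative.

```javascript
// Hypothetical in-memory session store keyed by callId.
// ASSUMPTION: a reconnection reuses the same callId; streamId may differ.
const sessions = new Map();

function onStart(event) {
  const existing = sessions.get(event.callId);
  if (existing) {
    // Reconnection: keep accumulated state, update the stream identifier
    existing.streamId = event.streamId;
    existing.reconnects += 1;
    return existing;
  }
  // First start for this call: initialize fresh state
  const session = {
    callId: event.callId,
    streamId: event.streamId,
    startedAt: Date.now(),
    reconnects: 0,
    transcript: [], // placeholder for whatever your processor accumulates
  };
  sessions.set(event.callId, session);
  return session;
}
```

Delete the entry when your application considers the call finished; otherwise the map grows for the lifetime of the process.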

TypeScript Interface:

```typescript
interface StartEvent {
  event: "start";
  streamId: string;
  callId: string;
}
```

2. media Event

Sent continuously during the call, containing audio chunks from the caller.

When: Typically every 20 ms for the duration of the call, whether the caller is speaking or silent.

Fields:

| Field | Type | Description |
|---|---|---|
| event | string | Always "media" |
| streamId | string | Stream identifier from the start event |
| media.payload | string | Base64-encoded audio data, in the codec given by media.contentType |
| media.contentType | string | Audio codec: "audio/PCMA" (recommended), "audio/PCMU", or "audio/x-l16" |
| media.sampleRate | number | Sample rate in Hz: 8000 or 16000 |

Example:

```json
{
  "event": "media",
  "streamId": "SFV_STRM_IN_01JJXYZ123ABC456DEF789GHI",
  "media": {
    "payload": "//7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+...",
    "contentType": "audio/PCMA",
    "sampleRate": 8000
  }
}
```

What to do:

  • Decode the base64 payload to get the raw audio bytes (encoded as specified by media.contentType)
  • Process audio (transcribe, analyze, record)
  • Optionally send audio back using the playAudio command
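
The byte size of a media frame depends on media.contentType: the G.711 codecs (audio/PCMA, audio/PCMU) use one byte per sample, while audio/x-l16 uses two. A helper like the one below converts a frame into a duration, which is handy for timestamping recordings or sanity-checking the roughly 20 ms cadence. It is a sketch; frameDurationMs is a hypothetical name.

```javascript
// Bytes per sample for each contentType listed in the media event table.
// G.711 A-law / mu-law: 8 bits per sample; x-l16: 16-bit linear PCM.
const BYTES_PER_SAMPLE = {
  'audio/PCMA': 1,
  'audio/PCMU': 1,
  'audio/x-l16': 2,
};

// Hypothetical helper: duration in milliseconds of one media payload.
function frameDurationMs(media) {
  const bytesPerSample = BYTES_PER_SAMPLE[media.contentType];
  if (!bytesPerSample) {
    throw new Error(`Unsupported contentType: ${media.contentType}`);
  }
  const byteLength = Buffer.from(media.payload, 'base64').length;
  const samples = byteLength / bytesPerSample;
  return (samples * 1000) / media.sampleRate;
}
```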

Decoding Audio:

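```javascript
const fs = require('fs');

async function handleMediaEvent(mediaEvent) {
  // Decode base64 to a Buffer of raw audio bytes
  const audioBuffer = Buffer.from(mediaEvent.media.payload, 'base64');

  // The byte layout depends on media.contentType:
  //   audio/PCMA, audio/PCMU: G.711, 1 byte per sample
  //     8000 Hz: 160 samples = 160 bytes = 20ms of audio
  //   audio/x-l16: 16-bit signed little-endian PCM, 2 bytes per sample
  //     8000 Hz: 160 samples = 320 bytes = 20ms of audio
  //     16000 Hz: 320 samples = 640 bytes = 20ms of audio

  // Example: Save to file
  fs.appendFileSync('recording.pcm', audioBuffer);

  // Example: Send to transcription service (transcriptionService is your own client)
  await transcriptionService.processAudio(audioBuffer);
}
```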
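
If a downstream service expects 16-bit linear samples but the stream uses audio/PCMA, the G.711 A-law bytes need to be expanded first. The sketch below follows the widely used reference A-law expansion from ITU-T G.711 (as in the classic public-domain g711.c); it is not a Superfone API, and audio/PCMU would need the corresponding mu-law routine instead.

```javascript
// Expand one G.711 A-law byte to a 16-bit linear sample.
function alawToLinear(aVal) {
  aVal ^= 0x55; // undo the even-bit inversion applied by A-law
  let t = (aVal & 0x0f) << 4; // 4 quantization bits
  const seg = (aVal & 0x70) >> 4; // 3-bit segment (exponent)
  switch (seg) {
    case 0:
      t += 8;
      break;
    case 1:
      t += 0x108;
      break;
    default:
      t += 0x108;
      t <<= seg - 1;
  }
  return aVal & 0x80 ? t : -t; // bit 7 set means a positive sample
}

// Decode a base64 audio/PCMA payload into 16-bit linear samples.
function decodePcmaPayload(payload) {
  const bytes = Buffer.from(payload, 'base64');
  const samples = new Int16Array(bytes.length); // one sample per A-law byte
  for (let i = 0; i < bytes.length; i++) {
    samples[i] = alawToLinear(bytes[i]);
  }
  return samples;
}
```

`Buffer.from(samples.buffer)` then yields the samples as bytes in the host's byte order, which is little-endian on common x86 and ARM servers.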

TypeScript Interface:

```typescript
interface MediaEvent {
  event: "media";
  streamId: string;
  media: {
    payload: string;
    contentType: "audio/PCMA" | "audio/PCMU" | "audio/x-l16";
    sampleRate: 8000 | 16000;
  };
}
```

3. dtmf Event

Sent when the caller presses a key on their phone's keypad.

When: Immediately after DTMF tone is detected by Superfone.

Fields:

| Field | Type | Description |
|---|---|---|
| event | string | Always "dtmf" |
| streamId | string | Stream identifier from the start event |
| digit | string | Key pressed: "0" to "9", "*", "#", or "A" to "D" |

Example:

```json
{
  "event": "dtmf",
  "streamId": "SFV_STRM_IN_01JJXYZ123ABC456DEF789GHI",
  "digit": "5"
}
```

What to do:

  • Handle IVR menu navigation ("Press 1 for sales, 2 for support")
  • Collect input (account number, PIN, confirmation)
  • Trigger actions (skip, replay, transfer)
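
Collecting multi-digit input such as an account number or PIN usually means buffering digits until a terminator key. A minimal sketch, assuming "#" ends the entry and "*" clears it; those key assignments and the maxDigits cap are application choices, not protocol rules, and createDigitCollector is a hypothetical name.

```javascript
// Valid DTMF digits per the dtmf event table
const VALID_DIGITS = new Set(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '*', '#', 'A', 'B', 'C', 'D']);

// Hypothetical collector: returns the completed entry when the terminator
// is pressed (or maxDigits is reached), otherwise null.
function createDigitCollector({ terminator = '#', clearKey = '*', maxDigits = 16 } = {}) {
  let buffer = '';

  // Return the collected entry and start a new one
  function flush() {
    const entry = buffer;
    buffer = '';
    return entry;
  }

  return function onDigit(digit) {
    if (!VALID_DIGITS.has(digit)) return null; // ignore unexpected values
    if (digit === clearKey) {
      buffer = ''; // caller asked to start over
      return null;
    }
    if (digit === terminator) return flush();
    buffer += digit;
    return buffer.length >= maxDigits ? flush() : null;
  };
}
```

In a dtmf handler, call the collector with event.digit and act only when it returns a string (for example, pass it to your own account-lookup function).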

Example Handler:

```javascript
// playAudio() and processInput() are your own helpers; salesGreeting,
// supportGreeting, mainMenu, and invalidInputMessage hold pre-encoded audio.
function handleDtmfEvent(event) {
  console.log(`Caller pressed: ${event.digit}`);

  switch (event.digit) {
    case '1':
      // Transfer to sales
      playAudio(salesGreeting);
      break;
    case '2':
      // Transfer to support
      playAudio(supportGreeting);
      break;
    case '*':
      // Go back to main menu
      playAudio(mainMenu);
      break;
    case '#':
      // Confirm input
      processInput();
      break;
    default:
      // Invalid input
      playAudio(invalidInputMessage);
  }
}
```

TypeScript Interface:

```typescript
interface DtmfEvent {
  event: "dtmf";
  streamId: string;
  digit: string;
}
```

4. clearedAudio Event

Acknowledgment that the audio playback buffer has been cleared in response to your clearAudio command.

When: Immediately after Superfone processes your clearAudio command.

Fields:

| Field | Type | Description |
|---|---|---|
| event | string | Always "clearedAudio" |
| streamId | string | Stream identifier from the start event |
| sequenceNumber | number | Sequence number from your clearAudio command |

Example:

```json
{
  "event": "clearedAudio",
  "streamId": "SFV_STRM_IN_01JJXYZ123ABC456DEF789GHI",
  "sequenceNumber": 42
}
```

What to do:

  • Confirm that the audio buffer was cleared
  • Resume sending new audio (if needed)
  • Update UI or logs

Use Case: You sent a long audio message, but the caller interrupted. You send clearAudio to stop playback, then send new audio based on the interruption.
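
Because each clearAudio carries a sequence number that the acknowledgment echoes, you can tell whether your most recent clear has been confirmed and ignore acknowledgments for older clears if you sent several in quick succession. The sketch below is one way to track this; whether to hold back new audio until the acknowledgment arrives is an application choice, and createClearTracker is a hypothetical name.

```javascript
// Hypothetical tracker for clearAudio / clearedAudio pairs on one stream.
function createClearTracker() {
  let lastSent = 0; // sequence number of the most recent clearAudio
  let lastAcked = 0; // highest sequence number acknowledged so far

  return {
    // Build the next clearAudio command with a fresh sequence number
    nextCommand() {
      lastSent += 1;
      return { event: 'clearAudio', sequenceNumber: lastSent };
    },
    // Record a clearedAudio event; true if it confirms the latest clear
    acknowledge(event) {
      lastAcked = Math.max(lastAcked, event.sequenceNumber);
      return event.sequenceNumber === lastSent;
    },
    // True while the most recent clearAudio is still unacknowledged
    isPending() {
      return lastAcked < lastSent;
    },
  };
}
```

Usage: `ws.send(JSON.stringify(tracker.nextCommand()))` when interrupting, and `tracker.acknowledge(event)` in the clearedAudio handler.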

TypeScript Interface:

```typescript
interface ClearedAudioEvent {
  event: "clearedAudio";
  streamId: string;
  sequenceNumber: number;
}
```

5. playedStream Event

Acknowledgment that a checkpoint you marked with the checkpoint command has been reached (all audio before the checkpoint has finished playing).

When: After all audio sent before the checkpoint command has been played to the caller.

Fields:

| Field | Type | Description |
|---|---|---|
| event | string | Always "playedStream" |
| streamId | string | Stream identifier from the start event |
| name | string | Checkpoint name from your checkpoint command |

Example:

```json
{
  "event": "playedStream",
  "streamId": "SFV_STRM_IN_01JJXYZ123ABC456DEF789GHI",
  "name": "greeting_complete"
}
```

What to do:

  • Trigger next action (e.g., start listening for response)
  • Update state machine (e.g., move from "greeting" to "listening")
  • Log timing metrics
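
For timing metrics, one approach is to record when each checkpoint is sent and compute the elapsed time when its playedStream arrives. If you send the checkpoint right after the audio it follows, the result approximates how long that audio took to play, plus network latency. A sketch with an injectable clock so it can be tested; createCheckpointTimer is a hypothetical name.

```javascript
// Hypothetical timer: time from sending a checkpoint to its playedStream.
function createCheckpointTimer(now = () => Date.now()) {
  const sentAt = new Map(); // checkpoint name -> timestamp in ms

  return {
    // Call right after sending a checkpoint command
    markSent(name) {
      sentAt.set(name, now());
    },
    // Call on playedStream; returns elapsed ms, or null for an unknown name
    onPlayed(event) {
      const start = sentAt.get(event.name);
      if (start === undefined) return null;
      sentAt.delete(event.name);
      return now() - start;
    },
  };
}
```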

Use Case: You send a greeting message, then send a checkpoint named "greeting_complete". When you receive playedStream with name: "greeting_complete", you know the greeting finished playing and can now listen for the caller's response.

TypeScript Interface:

```typescript
interface PlayedStreamEvent {
  event: "playedStream";
  streamId: string;
  name: string;
}
```

Outgoing Commands (Your Server → Superfone)

6. playAudio Command

Send audio to the caller. Audio is queued and played in the order received.

When: Whenever you want to play audio to the caller (TTS, pre-recorded message, AI-generated speech).

Fields:

| Field | Type | Required | Description |
|---|---|---|---|
| event | string | Yes | Always "playAudio" |
| media.payload | string | Yes | Base64-encoded audio data, in the codec given by media.contentType |
| media.contentType | string | Yes | Audio codec: "audio/PCMA" (recommended), "audio/PCMU", or "audio/x-l16" |
| media.sampleRate | number | Yes | Sample rate in Hz: 8000 or 16000 |

Example:

```json
{
  "event": "playAudio",
  "media": {
    "payload": "//7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+...",
    "contentType": "audio/PCMA",
    "sampleRate": 8000
  }
}
```

Encoding Audio:

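```javascript
const fs = require('fs');

// greeting.pcm must already be encoded in the codec and sample rate
// declared below (here G.711 A-law at 8000 Hz).
// For long recordings, send small frames instead (see Payload Size Limit).
const audioBuffer = fs.readFileSync('greeting.pcm');
const base64Audio = audioBuffer.toString('base64');

// Build the playAudio command
const command = {
  event: 'playAudio',
  media: {
    payload: base64Audio,
    contentType: 'audio/PCMA',
    sampleRate: 8000,
  },
};

// ws is the open WebSocket connection from Superfone
ws.send(JSON.stringify(command));
```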

Notes:

  • Audio is queued and played in FIFO order
  • Use the same codec and sample rate as specified in your Stream JSON response
  • Audio chunks are typically 20ms (160 samples at 8000 Hz, 320 samples at 16000 Hz)
  • You can send multiple playAudio commands in rapid succession (they will be queued)

Payload Size Limit

Each playAudio frame's base64 payload must be at most ~1.5 MB (1,500,000 characters). Frames exceeding this limit are silently dropped. In practice this is far larger than any real-time voice chunk — if you're approaching it, you're almost certainly buffering too much audio per frame. Send audio in small chunks (20–200 ms) instead of large blocks.
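
Following that advice, a long buffer such as a pre-recorded greeting can be split into fixed-duration frames before sending. For scale: audio/PCMA at 8000 Hz is 8,000 bytes per second, or about 10,667 base64 characters per second, so one frame at the 1,500,000-character limit would hold roughly 140 seconds of audio; 20 ms frames stay far below it. This sketch assumes the buffer is already encoded in the declared contentType and sampleRate; the 20 ms default and the name buildPlayAudioFrames are illustrative choices.

```javascript
// Bytes per sample for each supported contentType
const BYTES_PER_SAMPLE = { 'audio/PCMA': 1, 'audio/PCMU': 1, 'audio/x-l16': 2 };
const MAX_PAYLOAD_CHARS = 1500000; // per-frame base64 limit stated above

// Hypothetical helper: split pre-encoded audio into playAudio frames of chunkMs each.
function buildPlayAudioFrames(audio, contentType, sampleRate, chunkMs = 20) {
  const bytesPerSample = BYTES_PER_SAMPLE[contentType];
  if (!bytesPerSample) throw new Error(`Unsupported contentType: ${contentType}`);
  if (sampleRate !== 8000 && sampleRate !== 16000) {
    throw new Error(`Unsupported sampleRate: ${sampleRate}`);
  }

  // Whole samples per chunk, e.g. 20 ms at 8000 Hz = 160 samples
  const samplesPerChunk = Math.floor((sampleRate * chunkMs) / 1000);
  const chunkBytes = samplesPerChunk * bytesPerSample;
  if (chunkBytes <= 0) throw new Error('chunkMs is too small');

  const frames = [];
  for (let offset = 0; offset < audio.length; offset += chunkBytes) {
    const payload = audio.subarray(offset, offset + chunkBytes).toString('base64');
    if (payload.length > MAX_PAYLOAD_CHARS) {
      throw new Error('Frame exceeds the playAudio payload limit; use a smaller chunkMs');
    }
    frames.push(
      JSON.stringify({ event: 'playAudio', media: { payload, contentType, sampleRate } })
    );
  }
  return frames;
}
```

Each returned string can be passed straight to `ws.send()`; since playAudio commands are queued in FIFO order, sending the frames back to back plays them in sequence.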

TypeScript Interface:

```typescript
interface PlayAudioCommand {
  event: "playAudio";
  media: {
    payload: string;
    contentType: "audio/PCMA" | "audio/PCMU" | "audio/x-l16";
    sampleRate: 8000 | 16000;
  };
}
```

7. clearAudio Command

Clear the audio playback buffer. All queued audio is discarded immediately.

When: Whenever you need to interrupt playback (e.g., the caller interrupted, an error occurred, or the conversation context changed).

Fields:

| Field | Type | Required | Description |
|---|---|---|---|
| event | string | Yes | Always "clearAudio" |
| sequenceNumber | number | Yes | Unique sequence number for tracking acknowledgment |

Example:

```json
{
  "event": "clearAudio",
  "sequenceNumber": 42
}
```

What happens:

  1. Superfone immediately stops playing audio to the caller
  2. All queued playAudio commands are discarded
  3. Superfone sends clearedAudio event with the same sequenceNumber

Example:

```javascript
let clearSequence = 0;

function clearAudioBuffer(ws) {
  clearSequence++;

  const command = {
    event: 'clearAudio',
    sequenceNumber: clearSequence,
  };

  ws.send(JSON.stringify(command));
  console.log(`Sent clearAudio with sequence ${clearSequence}`);
}

// Handle acknowledgment
function handleClearedAudioEvent(event) {
  console.log(`Audio cleared, sequence ${event.sequenceNumber}`);
  // Now safe to send new audio
}
```

Use Case: A voice AI bot is speaking a long response, but the caller interrupts. You send clearAudio to stop playback, then send new audio based on the interruption.
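
Detecting the interruption itself is up to your server. One simple approach is an energy threshold on incoming media: if several consecutive caller frames are loud while your bot is speaking, send clearAudio. This sketch assumes audio/x-l16 frames (16-bit signed little-endian, per the media section); the threshold of 1000 and the 3-frame requirement are arbitrary starting values to tune, and a dedicated voice activity detector is usually more robust.

```javascript
// RMS energy of a 16-bit signed little-endian PCM frame (audio/x-l16)
function rms16le(frame) {
  const samples = Math.floor(frame.length / 2);
  if (samples === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples; i++) {
    const s = frame.readInt16LE(i * 2);
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples);
}

// Hypothetical barge-in detector: returns true once when the caller has been
// loud for minFrames consecutive frames; call reset() after handling it.
function createBargeInDetector({ threshold = 1000, minFrames = 3 } = {}) {
  let loudFrames = 0;
  let fired = false;
  return {
    onFrame(frame) {
      loudFrames = rms16le(frame) >= threshold ? loudFrames + 1 : 0;
      if (!fired && loudFrames >= minFrames) {
        fired = true;
        return true;
      }
      return false;
    },
    reset() {
      loudFrames = 0;
      fired = false;
    },
  };
}
```

Feed decoded frames to onFrame only while your bot is speaking (for example, between sending playAudio and receiving the playedStream for the checkpoint after it), and send clearAudio when it returns true.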

TypeScript Interface:

```typescript
interface ClearAudioCommand {
  event: "clearAudio";
  sequenceNumber: number;
}
```

8. checkpoint Command

Mark a named checkpoint in the audio stream. When all audio sent before this checkpoint has finished playing, Superfone sends a playedStream event.

When: Whenever you need to know when specific audio has finished playing (e.g., greeting complete, question asked).

Fields:

| Field | Type | Required | Description |
|---|---|---|---|
| event | string | Yes | Always "checkpoint" |
| streamId | string | Yes | Stream identifier from the start event |
| name | string | Yes | Unique name for this checkpoint |

Example:

```json
{
  "event": "checkpoint",
  "streamId": "SFV_STRM_IN_01JJXYZ123ABC456DEF789GHI",
  "name": "greeting_complete"
}
```

What happens:

  1. Superfone marks the current position in the audio queue
  2. When all audio before this checkpoint finishes playing, Superfone sends a playedStream event with the same name

Pending Checkpoint Queue Limit

Superfone holds up to 100 pending checkpoints per stream. If you send a 101st checkpoint while 100 are still awaiting their playedStream acknowledgments, the new checkpoint is silently dropped. A checkpoint stays pending until the audio queued before it has played, so track how many checkpoints are outstanding and avoid marking new ones while 100 are pending.
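
To stay under that limit, you can count checkpoints that have been sent but not yet acknowledged and refuse to mark new ones at 100. A minimal sketch; the name createCheckpointGuard and the choice to return false (rather than queue locally) are assumptions. It also assumes every checkpoint eventually produces a playedStream; the section above does not say how clearAudio affects pending checkpoints.

```javascript
const MAX_PENDING_CHECKPOINTS = 100; // per-stream limit stated above

// Hypothetical guard around checkpoint commands for one stream.
// send is a function that transmits a string, e.g. (msg) => ws.send(msg).
function createCheckpointGuard(send, streamId) {
  const pending = new Map(); // checkpoint name -> outstanding count
  let total = 0;

  return {
    // Sends a checkpoint if under the limit; returns false if refused
    mark(name) {
      if (total >= MAX_PENDING_CHECKPOINTS) return false;
      pending.set(name, (pending.get(name) || 0) + 1);
      total += 1;
      send(JSON.stringify({ event: 'checkpoint', streamId, name }));
      return true;
    },
    // Call for every playedStream event
    onPlayed(event) {
      const count = pending.get(event.name);
      if (!count) return; // not one of ours
      if (count === 1) pending.delete(event.name);
      else pending.set(event.name, count - 1);
      total -= 1;
    },
    pendingCount() {
      return total;
    },
  };
}
```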

Example:

```javascript
// Send greeting audio (sendAudioToCallee is your own playAudio helper)
sendAudioToCallee(greetingAudio, ws);

// Mark checkpoint (currentStreamId was stored from the start event)
const checkpoint = {
  event: 'checkpoint',
  streamId: currentStreamId,
  name: 'greeting_complete',
};
ws.send(JSON.stringify(checkpoint));

// Later, when you receive a playedStream event:
function handlePlayedStreamEvent(event) {
  if (event.name === 'greeting_complete') {
    console.log('Greeting finished playing, now listening for response');
    startListening();
  }
}
```

Use Cases:

  • Turn-taking: Know when to stop speaking and start listening
  • State machine: Transition between states (greeting → listening → responding)
  • Timing metrics: Measure how long audio takes to play
  • Synchronization: Coordinate audio playback with other actions

TypeScript Interface:

```typescript
interface CheckpointCommand {
  event: "checkpoint";
  streamId: string;
  name: string;
}
```

Complete Example

Here's a complete WebSocket handler that processes all 5 incoming event types and includes helpers for the 3 outgoing commands:

```javascript
const WebSocket = require('ws');

// Placeholders: replace these with your own application logic
function processAudio(audioBuffer) {}
function handleDtmfInput(digit) {}
function handleCheckpoint(name) {}

const wss = new WebSocket.Server({ port: 3000 });

wss.on('connection', (ws) => {
  let streamId = null;
  let callId = null;
  let clearSequence = 0;

  ws.on('message', (data) => {
    // Guard against malformed frames so one bad message cannot crash the server
    let event;
    try {
      event = JSON.parse(data.toString());
    } catch (err) {
      console.error('Ignoring non-JSON message:', err.message);
      return;
    }
    if (!event || typeof event !== 'object') return;

    switch (event.event) {
      case 'start':
        streamId = event.streamId;
        callId = event.callId;
        console.log(`Stream started: ${streamId} (call ${callId})`);
        break;

      case 'media': {
        // Decode and process audio
        const audioBuffer = Buffer.from(event.media.payload, 'base64');
        processAudio(audioBuffer);
        break;
      }

      case 'dtmf':
        console.log(`DTMF pressed: ${event.digit}`);
        handleDtmfInput(event.digit);
        break;

      case 'clearedAudio':
        console.log(`Audio cleared: sequence ${event.sequenceNumber}`);
        break;

      case 'playedStream':
        console.log(`Checkpoint reached: ${event.name}`);
        handleCheckpoint(event.name);
        break;

      default:
        console.warn(`Unknown event type: ${event.event}`);
    }
  });

  ws.on('error', (err) => console.error('WebSocket error:', err.message));

  // Send a command only while the connection is open
  function send(command) {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify(command));
    }
  }

  // Outgoing command helpers: call these from your application logic

  // Send audio to caller
  function sendAudio(audioBuffer) {
    send({
      event: 'playAudio',
      media: {
        payload: audioBuffer.toString('base64'),
        contentType: 'audio/PCMA',
        sampleRate: 8000,
      },
    });
  }

  // Clear audio buffer
  function clearAudio() {
    clearSequence++;
    send({ event: 'clearAudio', sequenceNumber: clearSequence });
  }

  // Mark checkpoint
  function markCheckpoint(name) {
    send({ event: 'checkpoint', streamId, name });
  }
});
```

Next Steps

  • Audio Processing — Build audio processors with examples
  • Overview — Learn about codecs, sample rates, and architecture
  • Answer Webhook — Configure Stream JSON response
  • Examples — See complete working examples