
Audio Streaming Overview

SFVoPI provides real-time bidirectional audio streaming over WebSocket, enabling you to build voice AI bots, call recording systems, live transcription services, and interactive voice response (IVR) applications.

What is Audio Streaming?

Audio streaming allows your server to:

  • Receive audio from the caller in real time as they speak
  • Send audio back to the caller (TTS, pre-recorded messages, AI-generated speech)
  • Process audio on the fly (transcription, sentiment analysis, voice commands)
  • Control playback (clear the audio buffer, mark checkpoints) and detect DTMF tones

All audio is transmitted as base64-encoded chunks (G.711 or linear PCM, depending on the codec) over a WebSocket connection.

Architecture

Connection Lifecycle

  1. Call Answered → Superfone calls your answer_url webhook
  2. Webhook Response → You return Stream JSON with your WebSocket URL
  3. WebSocket Connect → Superfone connects to your WebSocket server
  4. start Event → Superfone sends the start event with streamId and callId
  5. Audio Streaming → Bidirectional audio exchange begins
    • Superfone sends media events (caller audio)
    • You send playAudio commands (audio to caller)
    • Superfone sends dtmf events (keypad presses)
  6. Call End → WebSocket disconnects, Superfone calls your hangup_url webhook
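The lifecycle above could be handled on your side with a small dispatcher along these lines. This is a minimal sketch, not a reference implementation: `StreamSession` and `handle_message` are hypothetical names, and it assumes `streamId` and `callId` are top-level fields of the `start` event. The WebSocket server itself is left out; the function takes each received text frame as a string.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StreamSession:
    """Per-connection state (hypothetical helper, not part of SFVoPI)."""
    stream_id: Optional[str] = None
    call_id: Optional[str] = None
    audio: bytearray = field(default_factory=bytearray)
    dtmf_events: list = field(default_factory=list)


def handle_message(session: StreamSession, raw: str) -> str:
    """Dispatch one text frame received from Superfone; return the event name."""
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "start":
        # Assumption: streamId and callId sit at the top level of the event.
        session.stream_id = msg.get("streamId")
        session.call_id = msg.get("callId")
    elif event == "media":
        # Caller audio arrives base64-encoded in media.payload.
        session.audio.extend(base64.b64decode(msg["media"]["payload"]))
    elif event == "dtmf":
        # Keep the raw event; its exact fields are not described on this page.
        session.dtmf_events.append(msg)
    return event
```

Unknown event names fall through untouched, so the handler keeps working if further event types arrive.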

Supported Codecs

SFVoPI supports three codecs:

| Codec | Full Name | Description | Sample Rates | Content-Type |
|-------|-----------|-------------|--------------|--------------|
| PCMA | G.711 A-law | Default and recommended. Native telephony codec on SFVoPI; zero transcoding. | 8000 Hz | audio/PCMA |
| PCMU | G.711 μ-law | Accepted but forces A-law ↔ μ-law transcoding on every frame. Prefer PCMA. | 8000 Hz | audio/PCMU |
| L16 | Linear 16-bit PCM | Raw signed little-endian PCM. SFVoPI transcodes A-law ↔ PCM per frame. | 8000, 16000, 24000, 32000, 44100, 48000 Hz | audio/x-l16 |

Recommended: PCMA @ 8000 Hz

The SFVoPI telephony stack runs A-law at 8 kHz end-to-end. Picking PCMA + 8000 makes SFVoPI a pure byte passthrough — lowest latency, zero CPU overhead. All other choices add transcoding on the SFVoPI side.

G.711 codecs (PCMA, PCMU) provide:

  • 8-bit resolution (256 quantization levels)
  • 64 kbps bitrate at 8000 Hz
  • Low latency (no compression delay)
  • High compatibility (supported by virtually all VoIP systems)

L16 provides:

  • 16-bit resolution (65 536 quantization levels)
  • Raw linear PCM — easiest format to process in pipelines that expect PCM
  • Higher bandwidth (128 kbps @ 8 kHz, 768 kbps @ 48 kHz)
  • Requires SFVoPI to transcode each direction since the trunk is A-law
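The bitrate figures above follow directly from bits per sample times sample rate for mono audio. A minimal sketch of that arithmetic (the `bitrate_kbps` helper is a hypothetical name used only for illustration):

```python
# Bits per sample for each codec listed on this page.
BITS_PER_SAMPLE = {"PCMA": 8, "PCMU": 8, "L16": 16}


def bitrate_kbps(codec: str, sample_rate: int) -> float:
    """Raw mono audio bitrate in kbps, before base64 and JSON overhead."""
    return BITS_PER_SAMPLE[codec] * sample_rate / 1000


print(bitrate_kbps("PCMA", 8000))   # 64.0
print(bitrate_kbps("L16", 8000))    # 128.0
print(bitrate_kbps("L16", 48000))   # 768.0
```

These are raw audio rates. Base64 encoding makes the payload roughly one third larger on the wire, and the JSON envelope adds a little more.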

Codec Selection

Specify your preferred codec in the Stream JSON response to the answer webhook:

{
  "stream": {
    "url": "wss://your-server.com/ws",
    "codec": "PCMA",
    "sample_rate": 8000
  }
}
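One way to produce this response from an answer webhook handler is to build it from validated inputs, so an unsupported codec and sample rate pair is caught before it reaches SFVoPI. The sketch below is illustrative only: `build_stream_response` is a hypothetical helper, and the allowed combinations are copied from the codec table on this page.

```python
import json

# Codec -> allowed sample rates, as listed in the codec table on this page.
ALLOWED_RATES = {
    "PCMA": {8000},
    "PCMU": {8000},
    "L16": {8000, 16000, 24000, 32000, 44100, 48000},
}


def build_stream_response(ws_url: str, codec: str = "PCMA", sample_rate: int = 8000) -> str:
    """Return the Stream JSON body for the answer webhook as a string."""
    if codec not in ALLOWED_RATES:
        raise ValueError(f"unsupported codec: {codec}")
    if sample_rate not in ALLOWED_RATES[codec]:
        raise ValueError(f"{codec} does not support {sample_rate} Hz")
    body = {"stream": {"url": ws_url, "codec": codec, "sample_rate": sample_rate}}
    return json.dumps(body)
```

Calling `build_stream_response("wss://your-server.com/ws")` yields the recommended PCMA @ 8000 Hz configuration shown above.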

Sample Rates

| Sample Rate | Applies To | Use Case |
|-------------|------------|----------|
| 8000 Hz | PCMA, PCMU, L16 | Matches SFVoPI trunk. Use this. |
| 16000 Hz | L16 only | Wideband PCM in your pipeline. Note: source audio is still 8 kHz; SFVoPI upsamples. |
| 24000, 32000, 44100, 48000 Hz | L16 only | Accepted, with the same caveat: the trunk is 8 kHz, so the extra samples carry no new information. |

Trunk is 8 kHz

The SFVoPI telephony network is 8 kHz A-law. Requesting higher sample rates causes SFVoPI to upsample 8 kHz source audio before sending to you and downsample your TTS output before sending to the caller. Zero real quality gain, extra CPU cost. If your pipeline wants 16 kHz PCM internally, do the resample in your own infra and stream 8 kHz to SFVoPI.
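If you do resample in your own infrastructure, the 8 kHz to 16 kHz step can be as simple as the sketch below, which works on lists of signed 16-bit sample values. It is deliberately naive (linear interpolation up, pair averaging down) and the function names are hypothetical; a production pipeline would more likely use a dedicated resampler with proper anti-aliasing filtering.

```python
def upsample_2x(samples: list[int]) -> list[int]:
    """8 kHz -> 16 kHz: insert the midpoint between neighbouring samples."""
    out = []
    for i, s in enumerate(samples):
        # The last sample has no right-hand neighbour, so it is repeated.
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)
    return out


def downsample_2x(samples: list[int]) -> list[int]:
    """16 kHz -> 8 kHz: average each pair of samples (crude low-pass + decimate)."""
    return [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)]
```

A 20 ms frame of 160 samples at 8 kHz becomes 320 samples at 16 kHz, and `downsample_2x` brings TTS output back to 160 samples per frame before it is streamed to SFVoPI.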

Stream Directions

Control which direction audio flows:

| Direction | Description | Use Case |
|-----------|-------------|----------|
| BOTH | Bidirectional (default) | Voice AI bot, IVR with TTS |
| INBOUND | Caller → Your Server only | Call recording, transcription |
| OUTBOUND | Your Server → Caller only | Announcement playback |

Bandwidth Optimization

Use INBOUND for recording-only applications: no audio is streamed back to the caller, which roughly halves bandwidth compared with BOTH.

Audio Format

Audio is transmitted as base64-encoded bytes. The byte format depends on the codec selected in the answer webhook response:

  • PCMA / PCMU: G.711-encoded 8-bit samples @ 8000 Hz (160 bytes per 20 ms frame)
  • L16: 16-bit signed little-endian PCM at your chosen sample rate (320 bytes per 20 ms frame @ 8 kHz)
  • Channels: Mono (1 channel) in all cases
  • Frame Size: typically 20 ms chunks
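The frame sizes above come from bytes per sample times sample rate times frame duration. The sketch below computes that size and splits a raw mono buffer into base64-encoded 20 ms frames, as you might do before sending audio back. The helper names are hypothetical, and whether SFVoPI requires outbound audio to arrive in exact 20 ms frames is not stated here, so treat the chunking as an assumption.

```python
import base64

# Bytes per sample for each codec listed on this page.
BYTES_PER_SAMPLE = {"PCMA": 1, "PCMU": 1, "L16": 2}


def frame_size_bytes(codec: str, sample_rate: int, frame_ms: int = 20) -> int:
    """Number of raw bytes in one mono frame of the given duration."""
    return BYTES_PER_SAMPLE[codec] * sample_rate * frame_ms // 1000


def split_into_frames(audio: bytes, codec: str, sample_rate: int, frame_ms: int = 20) -> list[str]:
    """Split raw audio into base64-encoded frames; the last one may be shorter."""
    size = frame_size_bytes(codec, sample_rate, frame_ms)
    return [
        base64.b64encode(audio[i:i + size]).decode("ascii")
        for i in range(0, len(audio), size)
    ]
```

For example, `frame_size_bytes("PCMA", 8000)` gives 160 and `frame_size_bytes("L16", 8000)` gives 320, matching the figures in the list above.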

Example media event (PCMA, recommended):

{
  "event": "media",
  "streamId": "SFV_STRM_IN_01JJXYZ123ABC456DEF789GHI",
  "media": {
    "payload": "//7+/v7+/v7+/v7+/v7+/v7+/v7+...",
    "contentType": "audio/PCMA",
    "sampleRate": 8000
  }
}

Stream ID format

streamId follows the pattern SFV_STRM_{IN|OUT|BI}_<ULID> — the middle segment reflects stream direction (IN inbound, OUT outbound, BI bidirectional). Use it as an opaque identifier.
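Many speech pipelines expect linear PCM rather than A-law, so a PCMA `media` event like the example above usually needs two steps: base64-decode the payload, then expand each A-law byte to a 16-bit sample. The sketch below uses the standard G.711 A-law expansion; `decode_media_event` is a hypothetical helper name and it handles only `audio/PCMA`.

```python
import base64
import json


def alaw_to_linear(a_val: int) -> int:
    """Expand one G.711 A-law byte to a signed 16-bit linear PCM value."""
    a_val ^= 0x55                      # undo the even-bit inversion
    t = (a_val & 0x0F) << 4            # 4-bit mantissa
    seg = (a_val & 0x70) >> 4          # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    else:
        t += 0x108
        if seg > 1:
            t <<= seg - 1
    return t if a_val & 0x80 else -t   # sign bit set means positive


def decode_media_event(raw: str) -> list[int]:
    """Return the PCM samples carried by one PCMA media event."""
    media = json.loads(raw)["media"]
    if media["contentType"] != "audio/PCMA":
        raise ValueError("this sketch only handles audio/PCMA")
    return [alaw_to_linear(b) for b in base64.b64decode(media["payload"])]
```

A full 20 ms PCMA frame decodes to 160 samples, which can then be packed as little-endian 16-bit values for downstream tools.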

Use Cases

1. Voice AI Bot

Build conversational AI agents that can:

  • Transcribe caller speech in real time (Whisper, Deepgram)
  • Generate responses using LLMs (OpenAI, Anthropic)
  • Synthesize speech (ElevenLabs, Google TTS)
  • Handle interruptions and turn-taking

Example: Customer support bot that answers FAQs, schedules appointments, and escalates to human agents.

2. Call Recording

Record all calls for:

  • Quality assurance
  • Compliance (financial, healthcare)
  • Training and coaching
  • Dispute resolution

Example: Save audio chunks to S3, generate transcripts, and store metadata in your database.
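As a rough illustration of the recording step, the sketch below packs received frames into an in-memory WAV file using Python's standard `wave` module. It assumes the stream was negotiated as L16 @ 8000 Hz, so the decoded payload bytes are already 16-bit little-endian PCM; a PCMA stream would need decoding to linear PCM first. `frames_to_wav_bytes` is a hypothetical helper, and uploading the result to S3 is left out.

```python
import base64
import io
import wave


def frames_to_wav_bytes(frames_b64: list[str], sample_rate: int = 8000) -> bytes:
    """Build a mono 16-bit WAV file from base64-encoded L16 frames."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # streams are mono
        wav.setsampwidth(2)            # L16 = 2 bytes per sample
        wav.setframerate(sample_rate)
        for frame in frames_b64:
            wav.writeframes(base64.b64decode(frame))
    return buf.getvalue()
```

The returned bytes can be written to disk or handed to whatever storage client you use once the call ends.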

3. Live Transcription

Transcribe calls in real time for:

  • Call center agent assist (show live transcript + suggestions)
  • Accessibility (deaf/hard-of-hearing users)
  • Real-time analytics (sentiment, keywords)

Example: Stream audio to Deepgram, display live transcript to agent, highlight action items.

4. Interactive Voice Response (IVR)

Build dynamic IVR systems with:

  • DTMF detection (keypad input)
  • Speech recognition (voice commands)
  • Context-aware menus
  • Database lookups (account balance, order status)

Example: "Press 1 for sales, 2 for support, or say your account number."

5. Call Analytics

Analyze calls for:

  • Sentiment analysis (detect frustrated callers)
  • Keyword spotting (compliance violations)
  • Speaker diarization (who said what)
  • Call quality metrics (silence, crosstalk)

Example: Flag calls with negative sentiment for manager review.

Reconnection Handling

If your WebSocket server disconnects unexpectedly, Superfone automatically attempts to reconnect:

  • Retry Count: 3 attempts
  • Backoff Strategy: Exponential (1s, 2s, 4s)
  • Behavior: On successful reconnect, Superfone re-sends the start event with the same streamId

Handle Reconnections

Your server should:

  1. Accept reconnections for an existing streamId
  2. Resume audio processing from the reconnection point
  3. Not duplicate processing (use streamId to track state)
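One way to satisfy these three points is to key all per-stream state by `streamId`, so a repeated `start` event resumes the existing entry instead of creating a new one. This is a minimal in-memory sketch with hypothetical names (`StreamRegistry`, `StreamState`); a multi-process deployment would need a shared store instead.

```python
from dataclasses import dataclass


@dataclass
class StreamState:
    call_id: str
    starts_seen: int = 1        # 1 = initial connect, >1 = reconnects
    frames_processed: int = 0


class StreamRegistry:
    """Tracks stream state by streamId so reconnects resume, not restart."""

    def __init__(self) -> None:
        self._streams: dict[str, StreamState] = {}

    def on_start(self, stream_id: str, call_id: str) -> bool:
        """Register a start event; return True if it is a reconnect."""
        state = self._streams.get(stream_id)
        if state is None:
            self._streams[stream_id] = StreamState(call_id=call_id)
            return False
        state.starts_seen += 1   # same streamId again: keep existing state
        return True

    def on_media(self, stream_id: str) -> int:
        """Count one processed frame; return the running total."""
        state = self._streams[stream_id]
        state.frames_processed += 1
        return state.frames_processed

    def on_hangup(self, stream_id: str) -> None:
        """Drop state once the call has ended."""
        self._streams.pop(stream_id, None)
```

Because the frame counter survives the reconnect, processing continues from where it left off and earlier audio is not handled twice.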

Next Steps