
Audio Streaming Overview

SFVoPI provides real-time bidirectional audio streaming over WebSocket, enabling you to build voice AI bots, call recording systems, live transcription services, and interactive voice response (IVR) applications.

What is Audio Streaming?

Audio streaming allows your server to:

  • Receive audio from the caller in real time as they speak
  • Send audio back to the caller (TTS, pre-recorded messages, AI-generated speech)
  • Process audio on the fly (transcription, sentiment analysis, voice commands)
  • Control playback (clear audio buffer, mark checkpoints, detect DTMF tones)

All audio is transmitted as base64-encoded PCM audio chunks over a WebSocket connection.

Architecture

Connection Lifecycle

  1. Call Answered → Superfone calls your answer_url webhook
  2. Webhook Response → You return Stream JSON with your WebSocket URL
  3. WebSocket Connect → Superfone connects to your WebSocket server
  4. start Event → Superfone sends the start event with streamId and callId
  5. Audio Streaming → Bidirectional audio exchange begins
    • Superfone sends media events (caller audio)
    • You send playAudio commands (audio to caller)
    • Superfone sends dtmf events (keypad presses)
  6. Call End → WebSocket disconnects, Superfone calls your hangup_url webhook
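
The lifecycle above can be sketched as a small dispatcher that your WebSocket handler calls for each incoming text frame. This is a minimal sketch, not Superfone's reference implementation: the `StreamSession` class and `handle_message` function are hypothetical names, the position of `streamId` and `callId` inside the start event is assumed to be top-level, and the shape of the dtmf event is an assumption made for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StreamSession:
    """State for one call, filled in as events arrive."""
    stream_id: str = ""
    call_id: str = ""
    media_chunks: int = 0
    digits: list = field(default_factory=list)


def handle_message(session: StreamSession, raw: str) -> str:
    """Dispatch one JSON text frame and return the event name."""
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "start":
        # The lifecycle names streamId and callId; reading them from the
        # top level of the message is an assumption.
        session.stream_id = msg.get("streamId", "")
        session.call_id = msg.get("callId", "")
    elif event == "media":
        # Caller audio; here we only count chunks.
        session.media_chunks += 1
    elif event == "dtmf":
        # Hypothetical shape: {"event": "dtmf", "dtmf": {"digit": "5"}}
        digit = msg.get("dtmf", {}).get("digit")
        if digit is not None:
            session.digits.append(digit)
    return event
```

Keeping the dispatcher free of any WebSocket library makes it easy to unit-test; the server loop then only needs to pass each received frame to `handle_message`.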

Supported Codecs

SFVoPI supports two G.711 codecs for audio encoding:

Codec | Full Name   | Description                          | Sample Rates      | Content-Type
PCMU  | G.711 μ-law | Standard in North America and Japan  | 8000 Hz, 16000 Hz | audio/PCMU
PCMA  | G.711 A-law | Standard in Europe and rest of world | 8000 Hz, 16000 Hz | audio/PCMA

Both codecs provide:

  • 8-bit resolution (256 quantization levels)
  • 64 kbps bitrate at 8000 Hz
  • 128 kbps bitrate at 16000 Hz
  • Low latency (no compression delay)
  • High compatibility (supported by virtually all VoIP systems)
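
The bitrates listed above follow directly from the sample rate and the 8-bit resolution. A minimal sketch of the arithmetic, with `g711_bitrate_kbps` as a hypothetical helper name:

```python
def g711_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 8) -> float:
    """Bitrate of a mono stream: samples per second times bits per sample."""
    return sample_rate_hz * bits_per_sample / 1000


for rate in (8000, 16000):
    print(rate, "Hz ->", g711_bitrate_kbps(rate), "kbps")
# 8000 Hz -> 64.0 kbps
# 16000 Hz -> 128.0 kbps
```

These figures cover the audio samples only; base64 encoding and the JSON envelope add overhead on the wire.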

Codec Selection

Specify your preferred codec in the Stream JSON response to the answer webhook:

{
  "stream": {
    "url": "wss://your-server.com/ws",
    "codec": "PCMU",
    "sample_rate": 8000
  }
}
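
One way to produce this response from an answer webhook is a small builder that rejects unsupported values before they reach Superfone. This is a sketch under assumptions: `build_stream_response` is a hypothetical helper, and the validation is based only on the codecs and sample rates listed on this page.

```python
import json

SUPPORTED_CODECS = {"PCMU", "PCMA"}
SUPPORTED_SAMPLE_RATES = {8000, 16000}


def build_stream_response(ws_url: str, codec: str = "PCMU",
                          sample_rate: int = 8000) -> str:
    """Return the Stream JSON body for the answer webhook."""
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    return json.dumps({
        "stream": {
            "url": ws_url,
            "codec": codec,
            "sample_rate": sample_rate,
        }
    })


print(build_stream_response("wss://your-server.com/ws", "PCMA", 16000))
```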

Sample Rates

SFVoPI supports two sample rates:

Sample Rate | Description                    | Use Case                                | Audio Quality
8000 Hz     | Narrowband (telephone quality) | Standard telephony, IVR, basic voice AI | Good for speech
16000 Hz    | Wideband (HD voice)            | High-quality voice AI, transcription    | Better clarity

Recommendation:

  • Use 8000 Hz for standard telephony applications (lower bandwidth, faster processing)
  • Use 16000 Hz for voice AI and transcription (better accuracy, higher quality)

Stream Directions

Control which direction audio flows:

Direction | Description               | Use Case
BOTH      | Bidirectional (default)   | Voice AI bot, IVR with TTS
INBOUND   | Caller → Your Server only | Call recording, transcription
OUTBOUND  | Your Server → Caller only | Announcement playback

Bandwidth Optimization

Use INBOUND for recording-only applications: with audio flowing in one direction only, bandwidth drops by roughly 50% compared with BOTH.

Audio Format

All audio is transmitted as base64-encoded PCM (Pulse Code Modulation):

  • Encoding: Base64 string
  • Sample Format: 16-bit signed little-endian PCM
  • Channels: Mono (1 channel)
  • Frame Size: Variable (typically 20 ms chunks, i.e. 160 samples at 8000 Hz)

Example media event:

{
  "event": "media",
  "streamId": "01JJXYZ...",
  "media": {
    "payload": "//7+/v7+/v7+/v7+/v7+/v7+/v7+...",
    "contentType": "audio/PCMU",
    "sampleRate": 8000
  }
}
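
A media event like the one above can be unpacked with the standard library alone. The sketch below is illustrative: `decode_media` and `chunk_duration_ms` are hypothetical helpers, and the default of one byte per sample is an assumption based on the 8-bit G.711 resolution listed under Supported Codecs. If your stream carries 16-bit samples as described under Audio Format, pass `bytes_per_sample=2` instead.

```python
import base64
import json


def decode_media(raw: str) -> tuple[bytes, str, int]:
    """Return (audio bytes, content type, sample rate) from a media event."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        raise ValueError("not a media event")
    media = msg["media"]
    audio = base64.b64decode(media["payload"])
    return audio, media["contentType"], media["sampleRate"]


def chunk_duration_ms(num_bytes: int, sample_rate: int,
                      bytes_per_sample: int = 1) -> float:
    """Duration of a mono chunk; bytes_per_sample=1 is an assumption."""
    return num_bytes * 1000 / (bytes_per_sample * sample_rate)


# Synthetic payload for illustration: 160 bytes of 0xFF.
event = json.dumps({
    "event": "media",
    "streamId": "example-stream",
    "media": {
        "payload": base64.b64encode(b"\xff" * 160).decode("ascii"),
        "contentType": "audio/PCMU",
        "sampleRate": 8000,
    },
})
audio, content_type, rate = decode_media(event)
print(len(audio), content_type, chunk_duration_ms(len(audio), rate))
# 160 audio/PCMU 20.0
```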

Use Cases

1. Voice AI Bot

Build conversational AI agents that can:

  • Transcribe caller speech in real time (Whisper, Deepgram)
  • Generate responses using LLMs (OpenAI, Anthropic)
  • Synthesize speech (ElevenLabs, Google TTS)
  • Handle interruptions and turn-taking

Example: Customer support bot that answers FAQs, schedules appointments, and escalates to human agents.

2. Call Recording

Record all calls for:

  • Quality assurance
  • Compliance (financial, healthcare)
  • Training and coaching
  • Dispute resolution

Example: Save audio chunks to S3, generate transcripts, and store metadata in your database.

3. Live Transcription

Transcribe calls in real time for:

  • Call center agent assist (show live transcript + suggestions)
  • Accessibility (deaf/hard-of-hearing users)
  • Real-time analytics (sentiment, keywords)

Example: Stream audio to Deepgram, display live transcript to agent, highlight action items.

4. Interactive Voice Response (IVR)

Build dynamic IVR systems with:

  • DTMF detection (keypad input)
  • Speech recognition (voice commands)
  • Context-aware menus
  • Database lookups (account balance, order status)

Example: "Press 1 for sales, 2 for support, or say your account number."
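
The menu in this example could be routed with a simple lookup once a digit arrives in a dtmf event. This is only a sketch: the `MENU` table, the `route_digit` function, and the `repeat_menu` fallback are hypothetical names, and extracting the digit from the event is left out because the dtmf event shape is not shown here.

```python
# Keypad digit -> destination, mirroring "Press 1 for sales, 2 for support".
MENU = {"1": "sales", "2": "support"}


def route_digit(digit: str) -> str:
    """Return the destination for a keypad digit, or replay the menu."""
    return MENU.get(digit, "repeat_menu")


print(route_digit("1"))  # sales
print(route_digit("9"))  # repeat_menu
```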

5. Call Analytics

Analyze calls for:

  • Sentiment analysis (detect frustrated callers)
  • Keyword spotting (compliance violations)
  • Speaker diarization (who said what)
  • Call quality metrics (silence, crosstalk)

Example: Flag calls with negative sentiment for manager review.

Reconnection Handling

If your WebSocket server disconnects unexpectedly, Superfone automatically attempts to reconnect:

  • Retry Count: 3 attempts
  • Backoff Strategy: Exponential (1s, 2s, 4s)
  • Behavior: On successful reconnect, Superfone re-sends the start event with the same streamId

Handle Reconnections

Your server should:

  1. Accept reconnections for an existing streamId
  2. Resume audio processing from the reconnection point
  3. Avoid duplicate processing (use streamId to track state)
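
One way to meet these three requirements is to keep per-stream state in a registry keyed by streamId, so that a repeated start event reuses the existing state instead of creating a new one. The sketch below is an assumption-laden illustration: `StreamState` and `StreamRegistry` are hypothetical names, and state is held in memory only, so it would not survive a restart of your own server.

```python
from dataclasses import dataclass


@dataclass
class StreamState:
    stream_id: str
    chunks_processed: int = 0
    reconnects: int = 0


class StreamRegistry:
    """Tracks active streams so a re-sent start event resumes, not restarts."""

    def __init__(self) -> None:
        self._streams: dict[str, StreamState] = {}

    def on_start(self, stream_id: str) -> StreamState:
        state = self._streams.get(stream_id)
        if state is None:
            # First start event for this call.
            state = StreamState(stream_id)
            self._streams[stream_id] = state
        else:
            # Same streamId seen again: treat it as a reconnection.
            state.reconnects += 1
        return state

    def on_media(self, stream_id: str) -> None:
        self._streams[stream_id].chunks_processed += 1

    def on_call_end(self, stream_id: str) -> None:
        # Drop state once the call is over.
        self._streams.pop(stream_id, None)


registry = StreamRegistry()
first = registry.on_start("stream-1")
registry.on_media("stream-1")
second = registry.on_start("stream-1")  # reconnect
print(second is first, second.reconnects, second.chunks_processed)
# True 1 1
```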

Next Steps