Audio Streaming Overview

SFVoPI provides real-time bidirectional audio streaming over WebSocket, enabling you to build voice AI bots, call recording systems, live transcription services, and interactive voice response (IVR) applications.

What is Audio Streaming?

Audio streaming allows your server to:

Receive audio from the caller in real time as they speak
Send audio back to the caller (TTS, pre-recorded messages, AI-generated speech)
Process audio on the fly (transcription, sentiment analysis, voice commands)
Control playback and handle input (clear the audio buffer, mark checkpoints, detect DTMF tones)

All audio is transmitted as base64-encoded PCM audio chunks over a WebSocket connection.

Architecture

Connection Lifecycle

1. Call Answered → Superfone calls your answer_url webhook
2. Webhook Response → You return Stream JSON with your WebSocket URL
3. WebSocket Connect → Superfone connects to your WebSocket server
4. start Event → Superfone sends the start event with streamId and callId
5. Audio Streaming → Bidirectional audio exchange begins
   - Superfone sends media events (caller audio)
   - You send playAudio commands (audio to caller)
   - Superfone sends dtmf events (keypad presses)
6. Call End → WebSocket disconnects, Superfone calls your hangup_url webhook
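
The lifecycle above can be sketched as a small message dispatcher. This is a minimal sketch under stated assumptions, not a reference implementation. Only the event names (start, media, dtmf), the streamId and callId fields, and media.payload come from this page. The top-level placement of callId, the "digit" field on dtmf events, and the StreamSession class are hypothetical. The WebSocket transport is left out: pass each received text frame to handle_message.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class StreamSession:
    """Per-call state collected from stream events (hypothetical helper)."""
    stream_id: str = ""
    call_id: str = ""
    audio: bytearray = field(default_factory=bytearray)
    digits: list = field(default_factory=list)


def handle_message(session: StreamSession, raw: str) -> str:
    """Apply one JSON text frame to the session and return its event name."""
    msg = json.loads(raw)
    event = msg.get("event", "")
    if event == "start":
        # Assumption: streamId and callId sit at the top level of the event.
        session.stream_id = msg.get("streamId", "")
        session.call_id = msg.get("callId", "")
    elif event == "media":
        # Caller audio arrives base64-encoded in media.payload.
        session.audio.extend(base64.b64decode(msg["media"]["payload"]))
    elif event == "dtmf":
        # Assumption: the pressed key arrives in a "digit" field.
        session.digits.append(msg.get("digit"))
    return event
```

Unknown event names fall through untouched, so the dispatcher keeps working if the protocol defines events that this sketch does not handle.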

Supported Codecs

SFVoPI supports two G.711 codecs for audio encoding:

| Codec | Full Name | Description | Sample Rates | Content-Type |
| --- | --- | --- | --- | --- |
| PCMU | G.711 μ-law | Standard in North America and Japan | 8000 Hz, 16000 Hz | `audio/PCMU` |
| PCMA | G.711 A-law | Standard in Europe and rest of world | 8000 Hz, 16000 Hz | `audio/PCMA` |

Both codecs provide:

8-bit resolution (256 quantization levels)
64 kbps bitrate at 8000 Hz
128 kbps bitrate at 16000 Hz
Low latency (no compression delay)
High compatibility (supported by virtually all VoIP systems)
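
The bitrates listed above follow directly from the sample rate and the 8-bit sample size. As a quick check, the arithmetic can be written out as below. The function names are illustrative, and the 20 ms default frame length is the typical chunk size mentioned under Audio Format, not a guaranteed value.

```python
def g711_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int = 8) -> float:
    """Raw codec bitrate in kbps: samples per second times bits per sample."""
    return sample_rate_hz * bits_per_sample / 1000


def g711_frame_bytes(sample_rate_hz: int, frame_ms: int = 20) -> int:
    """Bytes in one frame, at one byte per G.711 sample."""
    return sample_rate_hz * frame_ms // 1000


print(g711_bitrate_kbps(8000))    # 64.0
print(g711_bitrate_kbps(16000))   # 128.0
print(g711_frame_bytes(8000))     # 160
```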

Codec Selection

Specify your preferred codec in the Stream JSON response to the answer webhook:

{
  "stream": {
    "url": "wss://your-server.com/ws",
    "codec": "PCMU",
    "sample_rate": 8000
  }
}
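
A webhook handler might build this response with a small helper that rejects unsupported values before they reach Superfone. This is a sketch: build_stream_response is a hypothetical name, the wss:// scheme check is an assumption based on the example URL, and the web framework that serves the answer_url endpoint is left out.

```python
import json

SUPPORTED_CODECS = {"PCMU", "PCMA"}
SUPPORTED_SAMPLE_RATES = {8000, 16000}


def build_stream_response(ws_url: str, codec: str = "PCMU", sample_rate: int = 8000) -> str:
    """Return the Stream JSON body for the answer webhook as a string."""
    # Assumption: a secure WebSocket URL is expected, as in the example above.
    if not ws_url.startswith("wss://"):
        raise ValueError("url should be a wss:// WebSocket URL")
    if codec not in SUPPORTED_CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    return json.dumps(
        {"stream": {"url": ws_url, "codec": codec, "sample_rate": sample_rate}}
    )
```

Calling build_stream_response("wss://your-server.com/ws") produces the same JSON object shown above.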

Sample Rates

SFVoPI supports two sample rates:

| Sample Rate | Description | Use Case | Audio Quality |
| --- | --- | --- | --- |
| 8000 Hz | Narrowband (telephone quality) | Standard telephony, IVR, basic voice AI | Good for speech |
| 16000 Hz | Wideband (HD voice) | High-quality voice AI, transcription | Better clarity |

Recommendation:

Use 8000 Hz for standard telephony applications (lower bandwidth, faster processing)
Use 16000 Hz for voice AI and transcription (better accuracy, higher quality)

Stream Directions

Control which direction audio flows:

| Direction | Description | Use Case |
| --- | --- | --- |
| BOTH | Bidirectional (default) | Voice AI bot, IVR with TTS |
| INBOUND | Caller → Your Server only | Call recording, transcription |
| OUTBOUND | Your Server → Caller only | Announcement playback |

Bandwidth Optimization

Use INBOUND for recording-only applications. With no audio flowing back to the caller, stream bandwidth drops by roughly 50% compared with BOTH.
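
The saving can be estimated with a rough calculation. The sketch below counts only base64 payload characters and ignores the JSON envelope and WebSocket framing. It assumes one byte per G.711 sample, 20 ms frames, and that in BOTH mode your server sends audio at the same rate it receives it. The function names are illustrative.

```python
import math


def base64_length(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes bytes."""
    return 4 * math.ceil(n_bytes / 3)


def payload_chars_per_second(sample_rate_hz: int, direction: str = "BOTH",
                             frame_ms: int = 20) -> int:
    """Approximate base64 payload characters per second on the stream."""
    if direction not in {"BOTH", "INBOUND", "OUTBOUND"}:
        raise ValueError(f"unknown direction: {direction}")
    frame_bytes = sample_rate_hz * frame_ms // 1000   # one byte per sample
    frames_per_second = 1000 // frame_ms
    per_direction = base64_length(frame_bytes) * frames_per_second
    flows = 2 if direction == "BOTH" else 1
    return per_direction * flows


print(payload_chars_per_second(8000, "BOTH"))     # 21600
print(payload_chars_per_second(8000, "INBOUND"))  # 10800
```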

Audio Format

All audio is transmitted as base64-encoded PCM (Pulse Code Modulation):

Encoding: Base64 string
Sample Format: 8-bit G.711 codewords (μ-law or A-law, matching contentType); expand to 16-bit signed linear PCM if your pipeline requires it
Channels: Mono (1 channel)
Frame Size: Variable (typically 20 ms chunks = 160 samples at 8000 Hz)

Example media event:

{
  "event": "media",
  "streamId": "01JJXYZ...",
  "media": {
    "payload": "//7+/v7+/v7+/v7+/v7+/v7+/v7+...",
    "contentType": "audio/PCMU",
    "sampleRate": 8000
  }
}
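
To work with the audio in a media event, decode the base64 payload and, for most speech tools, expand it to linear samples. The sketch below assumes each payload byte is one G.711 μ-law codeword, which matches contentType audio/PCMU and the 8-bit resolution listed under Supported Codecs. The function names are illustrative, and the expansion is the standard G.711 μ-law decoding rule rather than anything specific to SFVoPI.

```python
import base64
from typing import List


def ulaw_to_linear(codeword: int) -> int:
    """Expand one 8-bit G.711 mu-law codeword to a signed linear sample."""
    u = ~codeword & 0xFF             # codewords are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_event(event: dict) -> List[int]:
    """Return linear samples for a PCMU media event."""
    media = event["media"]
    if media["contentType"] != "audio/PCMU":
        raise ValueError("this sketch only decodes audio/PCMU")
    raw = base64.b64decode(media["payload"])
    return [ulaw_to_linear(b) for b in raw]


def chunk_duration_ms(samples: List[int], sample_rate_hz: int) -> float:
    """Playback time of a chunk in milliseconds."""
    return len(samples) * 1000 / sample_rate_hz
```

Under these assumptions a 160-byte payload at 8000 Hz decodes to 160 samples, which is 20 ms of audio. A PCMA stream needs the A-law expansion rule instead.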

Use Cases

1. Voice AI Bot

Build conversational AI agents that can:

Transcribe caller speech in real time (Whisper, Deepgram)
Generate responses using LLMs (OpenAI, Anthropic)
Synthesize speech (ElevenLabs, Google TTS)
Handle interruptions and turn-taking

Example: Customer support bot that answers FAQs, schedules appointments, and escalates to human agents.

2. Call Recording

Record all calls for:

Quality assurance
Compliance (financial, healthcare)
Training and coaching
Dispute resolution

Example: Save audio chunks to S3, generate transcripts, and store metadata in your database.

3. Live Transcription

Transcribe calls in real time for:

Call center agent assist (show live transcript + suggestions)
Accessibility (deaf/hard-of-hearing users)
Real-time analytics (sentiment, keywords)

Example: Stream audio to Deepgram, display live transcript to agent, highlight action items.

4. Interactive Voice Response (IVR)

Build dynamic IVR systems with:

DTMF detection (keypad input)
Speech recognition (voice commands)
Context-aware menus
Database lookups (account balance, order status)

Example: "Press 1 for sales, 2 for support, or say your account number."

5. Call Analytics

Analyze calls for:

Sentiment analysis (detect frustrated callers)
Keyword spotting (compliance violations)
Speaker diarization (who said what)
Call quality metrics (silence, crosstalk)

Example: Flag calls with negative sentiment for manager review.

Reconnection Handling

If your WebSocket server disconnects unexpectedly, Superfone automatically attempts to reconnect:

Retry Count: 3 attempts
Backoff Strategy: Exponential (1s, 2s, 4s)
Behavior: On successful reconnect, Superfone re-sends the start event with the same streamId

Handle Reconnections

Your server should:

Accept reconnections for an existing streamId
Resume audio processing from the reconnection point
Avoid duplicate processing (use streamId to track state)
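
One way to meet these requirements is to key all per-call state on streamId, so that a repeated start event resumes the existing session instead of creating a second one. The sketch below is an in-memory illustration only. StreamRegistry and its method names are hypothetical, and a production server would likely need persistence and expiry for abandoned sessions.

```python
class StreamRegistry:
    """Tracks per-stream state so a reconnect resumes rather than restarts."""

    def __init__(self):
        self._sessions = {}

    def on_start(self, stream_id: str, call_id: str) -> bool:
        """Register a start event. Returns True if it is a reconnect."""
        session = self._sessions.get(stream_id)
        if session is not None:
            # Same streamId seen before: keep existing state, count the reconnect.
            session["reconnects"] += 1
            return True
        self._sessions[stream_id] = {"call_id": call_id, "reconnects": 0, "chunks": 0}
        return False

    def on_media(self, stream_id: str) -> int:
        """Count one audio chunk and return the running total for the stream."""
        session = self._sessions[stream_id]
        session["chunks"] += 1
        return session["chunks"]

    def on_hangup(self, stream_id: str) -> None:
        """Drop state once the call has ended."""
        self._sessions.pop(stream_id, None)
```

Because the chunk counter survives a reconnect, downstream steps such as transcription can continue from where they stopped rather than starting over.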

Next Steps

WebSocket Protocol — Learn all 8 event types and commands
Audio Processing — Build your first audio processor
Answer Webhook — Configure Stream JSON response
Examples — See complete working examples
