Audio Streaming Overview
SFVoPI provides real-time bidirectional audio streaming over WebSocket, enabling you to build voice AI bots, call recording systems, live transcription services, and interactive voice response (IVR) applications.
What is Audio Streaming?
Audio streaming allows your server to:
- Receive audio from the caller in real-time as they speak
- Send audio back to the caller (TTS, pre-recorded messages, AI-generated speech)
- Process audio on-the-fly (transcription, sentiment analysis, voice commands)
- Control playback (clear audio buffer, mark checkpoints, detect DTMF tones)
All audio is transmitted as base64-encoded PCM audio chunks over a WebSocket connection.
Architecture
Connection Lifecycle
- Call Answered → Superfone calls your `answer_url` webhook
- Webhook Response → You return Stream JSON with your WebSocket URL
- WebSocket Connect → Superfone connects to your WebSocket server
- `start` Event → Superfone sends the `start` event with `streamId` and `callId`
- Audio Streaming → Bidirectional audio exchange begins
  - Superfone sends `media` events (caller audio)
  - You send `playAudio` commands (audio to caller)
  - Superfone sends `dtmf` events (keypad presses)
- Call End → WebSocket disconnects, Superfone calls your `hangup_url` webhook
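The sketch below shows a minimal WebSocket server that follows this lifecycle, using the Node.js `ws` package. The event names (`start`, `media`, `dtmf`) come from this page; the placement of `callId` on the `start` event and the `handleCallerAudio` helper are illustrative assumptions — see the WebSocket Protocol reference for exact payloads.

```ts
import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', (raw) => {
    const msg = JSON.parse(raw.toString());

    switch (msg.event) {
      case 'start':
        // First event on a new stream: record the streamId (and callId).
        console.log(`stream ${msg.streamId} started for call ${msg.callId}`);
        break;
      case 'media':
        // Caller audio arrives as base64-encoded chunks.
        handleCallerAudio(msg.streamId, msg.media.payload);
        break;
      case 'dtmf':
        // Keypad press from the caller.
        console.log(`dtmf on stream ${msg.streamId}`);
        break;
    }
  });

  ws.on('close', () => {
    // Superfone disconnects when the call ends, then calls your hangup_url.
    console.log('stream closed');
  });
});

// Placeholder for your audio pipeline (recording, transcription, voice AI...).
function handleCallerAudio(streamId: string, base64Payload: string): void {
  /* process the chunk */
}
```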
Supported Codecs
SFVoPI supports two G.711 codecs for audio encoding:
| Codec | Full Name | Description | Sample Rates | Content-Type |
|---|---|---|---|---|
| PCMU | G.711 μ-law | Standard in North America and Japan | 8000 Hz, 16000 Hz | audio/PCMU |
| PCMA | G.711 A-law | Standard in Europe and rest of world | 8000 Hz, 16000 Hz | audio/PCMA |
Both codecs provide:
- 8-bit resolution (256 quantization levels)
- 64 kbps bitrate at 8000 Hz
- 128 kbps bitrate at 16000 Hz
- Low latency (no compression delay)
- High compatibility (supported by all VoIP systems)
Specify your preferred codec in the Stream JSON response to the answer webhook:
```json
{
  "stream": {
    "url": "wss://your-server.com/ws",
    "codec": "PCMU",
    "sample_rate": 8000
  }
}
```
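For context, a minimal `answer_url` handler that returns this Stream JSON might look like the Express sketch below. The endpoint path `/answer` and the Express setup are illustrative; only the `stream` object mirrors the example above.

```ts
import express from 'express';

const app = express();
app.use(express.json());

// Superfone calls this webhook when the call is answered.
app.post('/answer', (_req, res) => {
  res.json({
    stream: {
      url: 'wss://your-server.com/ws', // your WebSocket endpoint
      codec: 'PCMU',                   // or 'PCMA'
      sample_rate: 8000,               // or 16000
    },
  });
});

app.listen(3000);
```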
Sample Rates
SFVoPI supports two sample rates:
| Sample Rate | Description | Use Case | Audio Quality |
|---|---|---|---|
| 8000 Hz | Narrowband (telephone quality) | Standard telephony, IVR, basic voice AI | Good for speech |
| 16000 Hz | Wideband (HD voice) | High-quality voice AI, transcription | Better clarity |
Recommendation:
- Use 8000 Hz for standard telephony applications (lower bandwidth, faster processing)
- Use 16000 Hz for voice AI and transcription (better accuracy, higher quality)
Stream Directions
Control which direction audio flows:
| Direction | Description | Use Case |
|---|---|---|
| BOTH | Bidirectional (default) | Voice AI bot, IVR with TTS |
| INBOUND | Caller → Your Server only | Call recording, transcription |
| OUTBOUND | Your Server → Caller only | Announcement playback |
Use INBOUND for recording-only applications to reduce bandwidth by 50%.
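Assuming the direction is set alongside the codec in the Stream JSON (the field name `direction` here is an assumption — confirm the exact key in the Answer Webhook reference), a recording-only stream might be requested like this:

```json
{
  "stream": {
    "url": "wss://your-server.com/ws",
    "codec": "PCMU",
    "sample_rate": 8000,
    "direction": "INBOUND"
  }
}
```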
Audio Format
All audio is transmitted as base64-encoded PCM (Pulse Code Modulation):
- Encoding: Base64 string
- Sample Format: 16-bit signed little-endian PCM
- Channels: Mono (1 channel)
- Frame Size: Variable (typically 20ms chunks = 160 samples at 8000 Hz)
Example media event:
```json
{
  "event": "media",
  "streamId": "01JJXYZ...",
  "media": {
    "payload": "//7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+/v7+...",
    "contentType": "audio/PCMU",
    "sampleRate": 8000
  }
}
```
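Decoding a payload is a base64 decode followed by interpreting the bytes in the sample format described above (16-bit signed little-endian, mono). A minimal Node.js helper:

```ts
// Decode a base64 media payload into 16-bit signed little-endian PCM samples.
function decodePayload(base64Payload: string): Int16Array {
  const buf = Buffer.from(base64Payload, 'base64');
  const samples = new Int16Array(buf.length / 2);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = buf.readInt16LE(i * 2);
  }
  return samples;
}

// A 20 ms chunk at 8000 Hz decodes to 160 samples.
```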
Use Cases
1. Voice AI Bot
Build conversational AI agents that can:
- Transcribe caller speech in real-time (Whisper, Deepgram)
- Generate responses using LLMs (OpenAI, Anthropic)
- Synthesize speech (ElevenLabs, Google TTS)
- Handle interruptions and turn-taking
Example: Customer support bot that answers FAQs, schedules appointments, and escalates to human agents.
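A high-level sketch of one conversational turn is shown below. The `transcribe`, `generateReply`, and `synthesize` helpers are hypothetical placeholders for whichever STT, LLM, and TTS providers you use, and the exact `playAudio` command shape is an assumption — the WebSocket Protocol page documents the real format.

```ts
import WebSocket from 'ws';

// Hypothetical provider integrations — replace with real STT/LLM/TTS calls.
async function transcribe(audio: Buffer): Promise<string> { /* STT */ return ''; }
async function generateReply(text: string): Promise<string> { /* LLM */ return ''; }
async function synthesize(text: string): Promise<Buffer> { /* TTS */ return Buffer.alloc(0); }

// One turn: hear the caller, think, speak back.
async function handleTurn(ws: WebSocket, streamId: string, callerAudio: Buffer) {
  const text = await transcribe(callerAudio);
  const reply = await generateReply(text);
  const speech = await synthesize(reply);

  // Send synthesized audio back to the caller (illustrative command shape).
  ws.send(JSON.stringify({
    event: 'playAudio',
    streamId,
    media: {
      payload: speech.toString('base64'),
      contentType: 'audio/PCMU',
      sampleRate: 8000,
    },
  }));
}
```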
2. Call Recording
Record all calls for:
- Quality assurance
- Compliance (financial, healthcare)
- Training and coaching
- Dispute resolution
Example: Save audio chunks to S3, generate transcripts, and store metadata in your database.
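A minimal recording sketch, assuming inbound chunks are appended to a raw PCM file keyed by `streamId` and post-processed after the call ends (the file path and post-call steps are illustrative):

```ts
import { createWriteStream, WriteStream } from 'node:fs';

const recordings = new Map<string, WriteStream>();

// Call for every inbound media event.
function recordChunk(streamId: string, base64Payload: string): void {
  let file = recordings.get(streamId);
  if (!file) {
    file = createWriteStream(`/tmp/${streamId}.raw`); // 16-bit LE mono PCM
    recordings.set(streamId, file);
  }
  file.write(Buffer.from(base64Payload, 'base64'));
}

// Call when the WebSocket closes.
function finishRecording(streamId: string): void {
  recordings.get(streamId)?.end();
  recordings.delete(streamId);
  // From here: wrap in a WAV header, upload to S3, generate a transcript,
  // and store metadata in your database.
}
```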
3. Live Transcription
Transcribe calls in real-time for:
- Call center agent assist (show live transcript + suggestions)
- Accessibility (deaf/hard-of-hearing users)
- Real-time analytics (sentiment, keywords)
Example: Stream audio to Deepgram, display live transcript to agent, highlight action items.
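One way to wire this up is to open a second WebSocket to your transcription provider and forward each inbound chunk. The endpoint URL and message handling below are hypothetical — follow your provider's streaming API (Deepgram, for example) for the real protocol.

```ts
import WebSocket from 'ws';

const sttConnections = new Map<string, WebSocket>();

// Open a streaming STT connection when the stream starts.
function startTranscription(streamId: string): void {
  const stt = new WebSocket('wss://stt.example.com/stream'); // hypothetical endpoint
  stt.on('message', (data) => {
    // Show the live transcript to the agent, run keyword/sentiment checks, etc.
    console.log(`[${streamId}]`, data.toString());
  });
  sttConnections.set(streamId, stt);
}

// Forward each inbound media payload as raw binary audio.
function forwardAudio(streamId: string, base64Payload: string): void {
  sttConnections.get(streamId)?.send(Buffer.from(base64Payload, 'base64'));
}
```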
4. Interactive Voice Response (IVR)
Build dynamic IVR systems with:
- DTMF detection (keypad input)
- Speech recognition (voice commands)
- Context-aware menus
- Database lookups (account balance, order status)
Example: "Press 1 for sales, 2 for support, or say your account number."
5. Call Analytics
Analyze calls for:
- Sentiment analysis (detect frustrated callers)
- Keyword spotting (compliance violations)
- Speaker diarization (who said what)
- Call quality metrics (silence, crosstalk)
Example: Flag calls with negative sentiment for manager review.
Reconnection Handling
If your WebSocket server disconnects unexpectedly, Superfone automatically attempts to reconnect:
- Retry Count: 3 attempts
- Backoff Strategy: Exponential (1s, 2s, 4s)
- Behavior: On successful reconnect, Superfone re-sends the `start` event with the same `streamId`
Your server should:
- Accept reconnections for an existing `streamId`
- Resume audio processing from the reconnection point
- Not duplicate processing (use `streamId` to track state)
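A minimal way to make `start` handling idempotent is to key all session state by `streamId`, as sketched below (the specific state fields tracked are illustrative):

```ts
interface StreamState {
  callId: string;
  bytesReceived: number;
}

const streams = new Map<string, StreamState>();

// Called for every start event — including one re-sent after a reconnect.
function onStart(streamId: string, callId: string): void {
  if (streams.has(streamId)) {
    console.log(`resuming stream ${streamId} after reconnect`);
    return; // keep existing state; do not restart processing
  }
  streams.set(streamId, { callId, bytesReceived: 0 });
}

// Called for every media event.
function trackChunk(streamId: string, base64Payload: string): void {
  const state = streams.get(streamId);
  if (!state) return;
  state.bytesReceived += Buffer.byteLength(base64Payload, 'base64');
}
```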
Next Steps
- WebSocket Protocol — Learn all 8 event types and commands
- Audio Processing — Build your first audio processor
- Answer Webhook — Configure Stream JSON response
- Examples — See complete working examples