Audio Streaming Overview
SFVoPI provides real-time bidirectional audio streaming over WebSocket, enabling you to build voice AI bots, call recording systems, live transcription services, and interactive voice response (IVR) applications.
What is Audio Streaming?
Audio streaming allows your server to:
- Receive audio from the caller in real-time as they speak
- Send audio back to the caller (TTS, pre-recorded messages, AI-generated speech)
- Process audio on-the-fly (transcription, sentiment analysis, voice commands)
- Control playback (clear audio buffer, mark checkpoints, detect DTMF tones)
All audio is transmitted as base64-encoded audio chunks (G.711 or linear PCM, depending on the selected codec) over a WebSocket connection.
Architecture
Connection Lifecycle
- Call Answered → Superfone calls your answer_url webhook
- Webhook Response → You return Stream JSON with your WebSocket URL
- WebSocket Connect → Superfone connects to your WebSocket server
- start Event → Superfone sends the start event with streamId and callId
- Audio Streaming → Bidirectional audio exchange begins
  - Superfone sends media events (caller audio)
  - You send playAudio commands (audio to caller)
  - Superfone sends dtmf events (keypad presses)
- Call End → WebSocket disconnects, Superfone calls your hangup_url webhook
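The lifecycle above maps onto a small event dispatcher on your server. The following is a minimal sketch, assuming each WebSocket text message is a JSON object with a top-level event field, as in the media example on this page. StreamSession and handle_message are hypothetical names, and the assumption that callId sits at the top level of the start event is not confirmed by this page.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StreamSession:
    """Hypothetical per-connection state; not part of the SFVoPI API."""
    stream_id: Optional[str] = None
    call_id: Optional[str] = None
    media_events: int = 0
    other_events: List[str] = field(default_factory=list)


def handle_message(session: StreamSession, raw: str) -> str:
    """Route one incoming WebSocket text message and return its event name."""
    message = json.loads(raw)
    event = message.get("event", "")
    if event == "start":
        # Assumption: streamId and callId are top-level fields of the start event.
        session.stream_id = message.get("streamId")
        session.call_id = message.get("callId")
    elif event == "media":
        # Caller audio; decoding of the payload would happen here.
        session.media_events += 1
    else:
        # dtmf and any other event types are only recorded in this sketch.
        session.other_events.append(event)
    return event
```

A real server would call handle_message from its WebSocket receive loop and keep one StreamSession per connection.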
Supported Codecs
SFVoPI supports three codecs:
| Codec | Full Name | Description | Sample Rates | Content-Type |
|---|---|---|---|---|
| PCMA | G.711 A-law | Default & recommended. Native telephony codec on SFVoPI — zero transcoding. | 8000 Hz | audio/PCMA |
| PCMU | G.711 μ-law | Accepted but forces A-law ↔ μ-law transcoding on every frame. Prefer PCMA. | 8000 Hz | audio/PCMU |
| L16 | Linear 16-bit PCM | Raw signed little-endian PCM. SFVoPI transcodes A-law ↔ PCM per frame. | 8000, 16000, 24000, 32000, 44100, 48000 Hz | audio/x-l16 |
The SFVoPI telephony stack runs A-law at 8 kHz end-to-end. Picking PCMA + 8000 makes SFVoPI a pure byte passthrough — lowest latency, zero CPU overhead. All other choices add transcoding on the SFVoPI side.
G.711 codecs (PCMA, PCMU) provide:
- 8-bit resolution (256 quantization levels)
- 64 kbps bitrate at 8000 Hz
- Low latency (no compression delay)
- High compatibility (supported by all VoIP systems)
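Because each G.711 sample is a single byte, a pipeline that expects linear PCM can expand PCMA on its own side while still streaming the recommended codec. The sketch below follows the standard G.711 A-law expansion to 16-bit samples; the function names are illustrative and not part of SFVoPI.

```python
import struct


def alaw_to_linear(byte: int) -> int:
    """Expand one G.711 A-law byte to a signed 16-bit linear sample."""
    byte ^= 0x55                      # undo the alternate-bit inversion used by A-law
    magnitude = (byte & 0x0F) << 4    # 4-bit mantissa
    segment = (byte & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # After the XOR, a set top bit means a positive sample.
    return magnitude if byte & 0x80 else -magnitude


def alaw_frame_to_pcm16(frame: bytes) -> bytes:
    """Convert a frame of A-law bytes to little-endian 16-bit PCM."""
    samples = [alaw_to_linear(b) for b in frame]
    return struct.pack(f"<{len(samples)}h", *samples)
```

With this approach a 160-byte PCMA frame becomes 320 bytes of PCM, and no transcoding is needed on the SFVoPI side.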
L16 provides:
- 16-bit resolution (65 536 quantization levels)
- Raw linear PCM — easiest format to process in pipelines that expect PCM
- Higher bandwidth (128 kbps @ 8 kHz, 768 kbps @ 48 kHz)
- Requires SFVoPI to transcode each direction since the trunk is A-law
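The bandwidth figures above follow from bits per sample multiplied by the sample rate for mono audio. As a quick check (the helper name is illustrative, and the result excludes base64 and JSON overhead on the wire):

```python
def raw_bitrate_kbps(bits_per_sample: int, sample_rate_hz: int) -> float:
    """Raw mono audio bitrate in kbps, before base64 and JSON overhead."""
    return bits_per_sample * sample_rate_hz / 1000


print(raw_bitrate_kbps(8, 8000))    # G.711 (PCMA/PCMU): 64.0
print(raw_bitrate_kbps(16, 8000))   # L16 at 8 kHz: 128.0
print(raw_bitrate_kbps(16, 48000))  # L16 at 48 kHz: 768.0
```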
Specify your preferred codec in the Stream JSON response to the answer webhook:
{
  "stream": {
    "url": "wss://your-server.com/ws",
    "codec": "PCMA",
    "sample_rate": 8000
  }
}
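One way to produce this response from an answer webhook is to validate the codec and sample rate against the tables on this page before serializing. This is a sketch only: build_stream_response and ALLOWED_RATES are hypothetical names, and the JSON keys are copied from the example above.

```python
import json

# Codec and sample-rate combinations taken from the tables on this page.
ALLOWED_RATES = {
    "PCMA": {8000},
    "PCMU": {8000},
    "L16": {8000, 16000, 24000, 32000, 44100, 48000},
}


def build_stream_response(url: str, codec: str = "PCMA", sample_rate: int = 8000) -> str:
    """Return a Stream JSON body for the answer webhook."""
    if codec not in ALLOWED_RATES:
        raise ValueError(f"unsupported codec: {codec}")
    if sample_rate not in ALLOWED_RATES[codec]:
        raise ValueError(f"{codec} does not support {sample_rate} Hz")
    body = {"stream": {"url": url, "codec": codec, "sample_rate": sample_rate}}
    return json.dumps(body)
```

Defaulting to PCMA at 8000 Hz mirrors the recommendation to match the trunk and avoid transcoding.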
Sample Rates
| Sample Rate | Applies To | Use Case |
|---|---|---|
| 8000 Hz | PCMA, PCMU, L16 | Matches SFVoPI trunk — use this. |
| 16000 Hz | L16 only | Wideband PCM in your pipeline. Note: source audio is still 8 kHz — SFVoPI upsamples. |
| 24000, 32000, 44100, 48000 Hz | L16 only | Accepted but identical caveat — trunk is 8 kHz, extra samples carry no new information. |
The SFVoPI telephony network is 8 kHz A-law. Requesting a higher sample rate causes SFVoPI to upsample the 8 kHz source audio before sending it to you and to downsample your TTS output before sending it to the caller. The result is no real quality gain at extra CPU cost. If your pipeline wants 16 kHz PCM internally, do the resampling in your own infrastructure and stream 8 kHz to SFVoPI.
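If your pipeline works at 16 kHz internally, the conversion to 8 kHz can happen on your side before audio is sent. The sketch below averages adjacent sample pairs, which is a deliberately crude filter-and-decimate step for illustration; a production system would more likely use a dedicated resampling library. The function name is hypothetical.

```python
import struct


def downsample_16k_to_8k(pcm16: bytes) -> bytes:
    """Halve the sample rate of little-endian 16-bit mono PCM by pair averaging."""
    count = len(pcm16) // 2
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    # Average each pair of neighbours; a trailing unpaired sample is dropped.
    halved = [(samples[i] + samples[i + 1]) // 2 for i in range(0, count - 1, 2)]
    return struct.pack(f"<{len(halved)}h", *halved)
```

A 20 ms frame at 16 kHz (640 bytes) comes out as a 20 ms frame at 8 kHz (320 bytes).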
Stream Directions
Control which direction audio flows:
| Direction | Description | Use Case |
|---|---|---|
| BOTH | Bidirectional (default) | Voice AI bot, IVR with TTS |
| INBOUND | Caller → Your Server only | Call recording, transcription |
| OUTBOUND | Your Server → Caller only | Announcement playback |
Use INBOUND for recording-only applications: with no audio flowing from your server to the caller, bandwidth drops by about 50% compared with BOTH.
Audio Format
Audio is transmitted as base64-encoded bytes. The byte format depends on the codec selected in the answer-webhook:
- PCMA / PCMU: G.711-encoded 8-bit samples @ 8000 Hz (160 bytes per 20 ms frame)
- L16: 16-bit signed little-endian PCM at your chosen sample rate (320 bytes per 20 ms frame @ 8 kHz)
- Channels: Mono (1 channel) in all cases
- Frame Size: typically 20 ms chunks
Example media event (PCMA, recommended):
{
  "event": "media",
  "streamId": "SFV_STRM_IN_01JJXYZ123ABC456DEF789GHI",
  "media": {
    "payload": "//7+/v7+/v7+/v7+/v7+/v7+/v7+...",
    "contentType": "audio/PCMA",
    "sampleRate": 8000
  }
}
streamId follows the pattern SFV_STRM_{IN|OUT|BI}_<ULID> — the middle segment reflects stream direction (IN inbound, OUT outbound, BI bidirectional). Use it as an opaque identifier.
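To work with a media event, decode the base64 payload and derive the chunk duration from the content type and sample rate. The following sketch assumes the field names shown in the example above and the content types listed in the codec table; decode_media and BYTES_PER_SAMPLE are illustrative names.

```python
import base64
from typing import Tuple

# Bytes per mono sample for each content type in the codec table.
BYTES_PER_SAMPLE = {"audio/PCMA": 1, "audio/PCMU": 1, "audio/x-l16": 2}


def decode_media(event: dict) -> Tuple[bytes, float]:
    """Return (raw audio bytes, duration in ms) for one media event."""
    media = event["media"]
    audio = base64.b64decode(media["payload"])
    samples = len(audio) / BYTES_PER_SAMPLE[media["contentType"]]
    duration_ms = samples * 1000 / media["sampleRate"]
    return audio, duration_ms
```

For a 160-byte PCMA payload at 8000 Hz this yields a 20 ms duration, matching the frame size described above.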
Use Cases
1. Voice AI Bot
Build conversational AI agents that can:
- Transcribe caller speech in real-time (Whisper, Deepgram)
- Generate responses using LLMs (OpenAI, Anthropic)
- Synthesize speech (ElevenLabs, Google TTS)
- Handle interruptions and turn-taking
Example: Customer support bot that answers FAQs, schedules appointments, and escalates to human agents.
2. Call Recording
Record all calls for:
- Quality assurance
- Compliance (financial, healthcare)
- Training and coaching
- Dispute resolution
Example: Save audio chunks to S3, generate transcripts, and store metadata in your database.
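As a rough illustration of the recording step, decoded audio chunks can be wrapped in a WAV container before upload. This sketch assumes the chunks are already 16-bit little-endian mono PCM (an L16 stream, or PCMA expanded to PCM on your side); the function name is hypothetical, and the upload to storage is left out.

```python
import io
import wave


def pcm16_chunks_to_wav(chunks: list, sample_rate: int = 8000) -> bytes:
    """Wrap little-endian 16-bit mono PCM chunks in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # streams are mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()
```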
3. Live Transcription
Transcribe calls in real-time for:
- Call center agent assist (show live transcript + suggestions)
- Accessibility (deaf/hard-of-hearing users)
- Real-time analytics (sentiment, keywords)
Example: Stream audio to Deepgram, display live transcript to agent, highlight action items.
4. Interactive Voice Response (IVR)
Build dynamic IVR systems with:
- DTMF detection (keypad input)
- Speech recognition (voice commands)
- Context-aware menus
- Database lookups (account balance, order status)
Example: "Press 1 for sales, 2 for support, or say your account number."
5. Call Analytics
Analyze calls for:
- Sentiment analysis (detect frustrated callers)
- Keyword spotting (compliance violations)
- Speaker diarization (who said what)
- Call quality metrics (silence, crosstalk)
Example: Flag calls with negative sentiment for manager review.
Reconnection Handling
If your WebSocket server disconnects unexpectedly, Superfone automatically attempts to reconnect:
- Retry Count: 3 attempts
- Backoff Strategy: Exponential (1s, 2s, 4s)
- Behavior: On successful reconnect, Superfone re-sends the start event with the same streamId
Your server should:
- Accept reconnections for an existing streamId
- Resume audio processing from the reconnection point
- Not duplicate processing (use streamId to track state)
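One way to meet these requirements is to keep per-stream state in a registry keyed by streamId, so that a re-sent start event resumes existing state instead of creating a duplicate. The classes below are a hypothetical sketch, not part of SFVoPI.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StreamState:
    """Hypothetical per-stream bookkeeping kept across reconnects."""
    stream_id: str
    connections: int = 0
    frames_processed: int = 0


class StreamRegistry:
    def __init__(self) -> None:
        self._streams: Dict[str, StreamState] = {}

    def on_start(self, stream_id: str) -> StreamState:
        """Create state for a new streamId, or resume it on a re-sent start."""
        state = self._streams.get(stream_id)
        if state is None:
            state = StreamState(stream_id)
            self._streams[stream_id] = state
        state.connections += 1
        return state

    def on_call_end(self, stream_id: str) -> None:
        """Drop state once the call is over, for example from the hangup webhook."""
        self._streams.pop(stream_id, None)
```

Because the state object survives the reconnect, counters such as frames_processed continue from where they left off.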
Next Steps
- WebSocket Protocol — Learn all 8 event types and commands
- Audio Processing — Build your first audio processor
- Answer Webhook — Configure Stream JSON response
- Examples — See complete working examples