AI Voice Agent for Real Estate: Build One with OpenAI Realtime + LiveKit
Build an AI real estate voice agent with LiveKit + OpenAI Realtime. Answer calls, qualify leads, and book showings. Optional Twilio SIP.
By James Le
This step‑by‑step tutorial guides you from zero to a phone‑call‑capable AI assistant. You’ll learn the building blocks—LiveKit rooms and agents, realtime speech with OpenAI Realtime, telephony via Twilio SIP routing into LiveKit, and traces in Langfuse—then assemble them into a production-ready flow.
You will build:
- a minimal LiveKit Agent that connects to a room,
- the same agent speaking & listening via OpenAI Realtime,
- a telephony entry point (Twilio → LiveKit SIP inbound trunk + dispatch rule),
- a clean hang‑up tool, BVC noise‑cancellation, and Langfuse telemetry.
LiveKit Agents is the realtime framework we’ll use to build, run, and dispatch voice agents. OpenAI Realtime API gives low‑latency “speech‑in/speech‑out” with turn detection. Twilio provides the PSTN phone number and SIP routing into LiveKit. Langfuse ingests OTLP traces so you can inspect sessions.
0) Prerequisites
Why this matters: you’ll use uv to manage the Python project, LiveKit Cloud to host rooms & SIP, Twilio for PSTN, OpenAI for realtime speech, and Langfuse for observability.
- Python 3.11+
- uv package & project manager (install and basics)
- LiveKit Cloud account + lk CLI (auth, projects)
- Twilio account with a phone number.
- OpenAI API key (Realtime).
- Langfuse project (OTLP ingestion).
1) Concepts & Architecture (the mental model)
Goal: understand the moving parts before writing code.
- LiveKit room: a realtime space where participants (our agent and a SIP “caller” participant) exchange audio.
- Agent worker: a Python process running LiveKit Agents that joins rooms and speaks for your app.
- OpenAI Realtime: turns your agent into “ears + brain + voice” in one model with built‑in turn detection (no manual STT/TTS).
- SIP inbound + dispatch rule: Twilio routes calls to LiveKit SIP; a dispatch rule decides which room and which agent join.
- Hang‑up: end the call by deleting the room—disconnects everyone.
- Noise cancellation: enable BVC (background voice cancellation) to keep telephony audio clean.
- Tracing: send OTLP traces to Langfuse for session‑level observability.
Call flow in words:
Caller dials Twilio number → Twilio SIP routes INVITE → LiveKit SIP inbound trunk → dispatch rule creates room (e.g., call‑abc) and dispatches agent → agent joins room and speaks via OpenAI Realtime. Hang‑up deletes the room.
2) Project Scaffold with uv
Purpose: create a clean, reproducible Python project.
Directory layout
phone-ai
├── .python-version
├── .env.local # secrets (DO NOT COMMIT)
├── pyproject.toml
├── livekit.toml # created later by CLI or provided manually
└── src
├── __init__.py
├── agent.py
└── tools.py
uv is a fast package & project manager; we’ll use uv sync to install and uv run to execute. If you’re starting from scratch, the scaffold commands below are one way to set up.
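A minimal scaffold sketch using uv commands (the directory and file names simply mirror the layout above):

mkdir phone-ai && cd phone-ai
uv init .              # writes pyproject.toml and .python-version
uv python pin 3.11     # pin the interpreter version
mkdir -p src && touch src/__init__.py src/agent.py src/tools.py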
pyproject.toml
[project]
name = "phone-ai"
version = "0.1.0"
description = "Phone AI: LiveKit + Twilio + OpenAI Realtime + Langfuse"
requires-python = ">=3.11"
dependencies = [
"python-dotenv>=1.0",
"livekit-agents[mcp,openai,silero,turn-detector", # core agent framework
"livekit-plugins-noise-cancellation", # BVC & telephony-optimized filters
"twilio>=9", # useful for phone-side utilities (optional)
]
[tool.uv] # optional; uv works without this section too
Install deps:
uv sync
LiveKit Agents & the OpenAI plugin are published on PyPI. The noise‑cancellation plugin is separate and supports BVC. The OTLP HTTP exporter is the standard way to send traces.
3) Configure local secrets
Create .env.local (never commit):
# LiveKit (Cloud)
LIVEKIT_URL=wss://<your-project-subdomain>.livekit.cloud
LIVEKIT_API_KEY=REDACTED
LIVEKIT_API_SECRET=REDACTED
# OpenAI
OPENAI_API_KEY=REDACTED
# Langfuse (OTLP HTTP endpoint)
LANGFUSE_HOST=https://us.cloud.langfuse.com
LANGFUSE_PUBLIC_KEY=REDACTED
LANGFUSE_SECRET_KEY=REDACTED
We’ll programmatically set OTEL_EXPORTER_OTLP_* to point at Langfuse’s OTLP endpoint (/api/public/otel) with Basic Auth.
4) Hello, Room: the smallest possible Agent
Purpose: prove we can start a worker and join a room—no speech yet.
Concepts: Agent, AgentSession, cli.run_app, and the worker entrypoint.
Create src/agent.py (step 1):
# src/agent.py
from dotenv import load_dotenv
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, RoomInputOptions, cli
load_dotenv(".env.local", override=True)
INSTRUCTIONS = "You are a polite assistant. Keep replies short."
class HelloAgent(Agent):
async def on_enter(self) -> None:
# Generate a first reply when the agent joins the room.
await self.session.generate_reply()
async def entrypoint(ctx: JobContext):
session = AgentSession() # No speech yet
await session.start(
room=ctx.room,
agent=HelloAgent(),
room_input_options=RoomInputOptions(), # defaults
)
if __name__ == "__main__":
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, agent_name="phone-ai"))
Run it:
uv run src/agent.py start
This mirrors the official “build a voice agent” pattern (we’ll add audio shortly).
5) Give the agent a voice with OpenAI Realtime
Purpose: add realtime speech comprehension + synthesis (one model), plus turn detection.
Why Realtime: keeps latency low and handles endpointing/interruptions for you. In LiveKit Agents, you pass a RealtimeModel instead of a text‑only LLM or separate STT/TTS.
Replace entrypoint() in src/agent.py (step 2):
from livekit.plugins import openai # add this import at top
async def entrypoint(ctx: JobContext):
# 1) Create a Realtime model with a built-in voice.
# (You can change "marin" to any available Realtime voice.)
session = AgentSession(
llm=openai.realtime.RealtimeModel(voice="marin")
# If you want fine-grained control, see docs for configuring
# input_audio_transcription and turn_detection parameters.
)
# 2) Start the session with your agent in the current room.
await session.start(room=ctx.room, agent=HelloAgent())
LiveKit’s OpenAI Realtime plugin exposes a Python RealtimeModel. Use it directly in the session config to get speech‑in/speech‑out. For deeper control (semantic VAD turn detection, transcription model, and more), consult the Resources section; a configuration sketch follows below. OpenAI Realtime overview & API reference are listed in Resources.
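For example, here is a minimal sketch of passing turn-detection options through the plugin. It assumes a plugin version that accepts OpenAI’s TurnDetection session type; the import path and parameter names may differ in yours:

from livekit.agents import AgentSession
from livekit.plugins import openai
from openai.types.beta.realtime.session import TurnDetection  # assumption: path as in current openai SDK

session = AgentSession(
    llm=openai.realtime.RealtimeModel(
        voice="marin",
        # Semantic VAD lets the model decide when the caller has finished a turn.
        turn_detection=TurnDetection(
            type="semantic_vad",
            eagerness="medium",       # how eagerly the model ends a turn
            create_response=True,     # reply automatically at end of turn
            interrupt_response=True,  # allow the caller to barge in
        ),
    ),
)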
6) Add a tool to hang up cleanly + telephony noise cancellation
Why: when the conversation ends, delete the room so the phone leg drops immediately; enable BVC (telephony‑optimized) noise cancellation for clarity.
Create src/tools.py:
# src/tools.py
from livekit import api
from livekit.agents import get_job_context
async def hangup_call():
"""
Ends the call by deleting the room (disconnects all participants).
Requires LIVEKIT_* env vars.
"""
ctx = get_job_context()
if ctx and ctx.room:
await ctx.api.room.delete_room(api.DeleteRoomRequest(room=ctx.room.name))
Update src/agent.py (step 3):
from livekit.agents.llm import function_tool
from livekit.plugins import noise_cancellation # add
class PhoneAgent(Agent):
def __init__(self) -> None:
super().__init__(instructions=INSTRUCTIONS)
@function_tool
async def end_call(self, ctx):
"""Politely end the call and hang up."""
from tools import hangup_call
await ctx.wait_for_playout()
await hangup_call()
async def on_enter(self) -> None:
await self.session.generate_reply()
async def entrypoint(ctx: JobContext):
session = AgentSession(
llm=openai.realtime.RealtimeModel(voice="marin"),
)
await session.start(
room=ctx.room,
agent=PhoneAgent(),
room_input_options=RoomInputOptions(
# Enable Telephony-optimized Background Voice Cancellation
noise_cancellation=noise_cancellation.BVCTelephony(),
),
)
Deleting the room is the recommended way to hang up for all participants. BVC is LiveKit Cloud’s enhanced noise/voice cancellation (Krisp), ideal for voice AI; a SIP‑telephony setting is available both on trunks and as an in‑agent filter.
7) Run locally (no phones yet)
# Start the worker
uv run src/agent.py start
You can join a test room from another client (e.g., the LiveKit Agents Playground) to see the agent respond. A terminal-only alternative is sketched below.
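If you prefer to stay in the terminal, recent livekit-agents releases also ship a console mode that uses your local microphone and speakers instead of a room client (verify your installed version supports it):

uv run src/agent.py console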
8) Connect a phone number: Twilio → LiveKit SIP → Dispatch your agent
You have two common inbound paths. Choose one:
A) Twilio Programmable Voice (TwiML <Sip> to LiveKit)
- Create a TwiML Bin in Twilio:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<Dial>
<Sip>
sip:<your-project-subdomain>.sip.livekit.cloud
</Sip>
</Dial>
</Response>
- Point your Twilio phone number’s Voice URL to this TwiML Bin.
- In LiveKit, create an inbound trunk and a dispatch rule (below).
Twilio <Sip> dials any SIP endpoint; see Resources for a full TwiML path for inbound voice.
B) Twilio Elastic SIP Trunking (scalable, SIP‑native)
Set your Origination SIP URI to your LiveKit SIP domain (visible in LiveKit UI or project settings) and proceed with the same LiveKit trunk/dispatch setup.
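For orientation, the origination URI generally has this shape; confirm the exact value in your LiveKit project settings, and note that the transport parameter shown here is an assumption based on common Twilio trunking setups:

sip:<your-project-subdomain>.sip.livekit.cloud;transport=tcp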
Create LiveKit inbound trunk (with Krisp enabled) and dispatch rule
inbound-trunk.json:
{
"trunk": {
"name": "Inbound Trunk",
"numbers": ["+1YOURNUMBER"],
"krispEnabled": true
}
}
lk sip inbound create inbound-trunk.json
The inbound trunk accepts calls from your provider and can enable Krisp noise cancellation globally.
dispatch-rule.json (create per‑caller rooms and explicitly dispatch your agent):
{
"dispatch_rule": {
"rule": {
"dispatchRuleIndividual": { "roomPrefix": "call-" }
},
"roomConfig": {
"agents": [{ "agentName": "phone-ai" }]
}
}
}
lk sip dispatch-rule create dispatch-rule.json
Dispatch rules control how inbound calls land in rooms and which agents join (explicit dispatch is recommended for SIP).
Test: start your worker (uv run src/agent.py start), call your Twilio number, and confirm a LiveKit room like call-xyz appears with your agent joined.
Notes
- Twilio <Sip> TwiML reference (if using Programmable Voice).
- Twilio SIP interface docs (general).
- LiveKit inbound calls & Twilio flow.
9) Add Langfuse observability via OpenTelemetry
Purpose: ship traces for each session to Langfuse’s OTLP endpoint—debug audio turns, latencies, and errors.
Add this helper to src/agent.py (near imports):
import base64, os
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
def setup_langfuse_otel():
host = os.getenv("LANGFUSE_HOST", "").rstrip("/")
pub = os.getenv("LANGFUSE_PUBLIC_KEY")
sec = os.getenv("LANGFUSE_SECRET_KEY")
if not (host and pub and sec):
return None
# Langfuse OTLP endpoint + Basic Auth
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = f"{host}/api/public/otel/v1/traces"
os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = "http/protobuf"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Basic " + base64.b64encode(f"{pub}:{sec}".encode()).decode()
tp = TracerProvider()
tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
return tp
Initialize it in entrypoint():
tp = setup_langfuse_otel()
# (Optionally register a shutdown callback to flush; see the sketch below.)
Langfuse provides an OTLP HTTP ingestion endpoint; the standard Python exporter is opentelemetry-exporter-otlp-proto-http. The OTLP/HTTP protocol and exporter environment variables are standard OpenTelemetry.
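One gap in the helper above: the TracerProvider is created but never registered, so spans may not be exported. A minimal sketch of wiring it up in entrypoint(), assuming JobContext.add_shutdown_callback accepts an async callable (as in LiveKit’s examples); recent livekit-agents versions may also expose their own telemetry registration helper, so check your version’s docs:

from opentelemetry import trace

async def entrypoint(ctx: JobContext):
    tp = setup_langfuse_otel()
    if tp:
        # Register globally so instrumented spans are exported via OTLP.
        trace.set_tracer_provider(tp)

        async def flush_traces():
            # Push any buffered spans to Langfuse before the job exits.
            tp.force_flush()

        ctx.add_shutdown_callback(flush_traces)
    # ... continue with AgentSession setup as in the previous sections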
10) Make it truly conversational (prompt & behavior)
Refine the agent’s instructions:
INSTRUCTIONS = """
You are a friendly receptionist for ACME.
- Greet the caller briefly.
- Collect name and reason for calling.
- Offer to schedule a follow-up if appropriate.
- Keep turns short; don't overtalk the caller.
- When the caller is finished, call the `end_call` tool.
"""11) Optional: Scheduling via an MCP server (Cal.com)
If you want the agent to book meetings, you can run a Cal.com Model Context Protocol server and expose scheduling tools. Start one via NPX and point it to your Cal.com API key:
npx -y @calcom/cal-mcp
Then, in LiveKit Agents, you can attach MCP servers (stdio) to your session if you choose to expand beyond basic voice; a hedged sketch follows. (Full MCP integration is out of scope for the core telephony build, but the server exists and is easy to pilot.)
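A minimal sketch of attaching the server. Assumptions: livekit-agents’ mcp module exposes an MCPServerStdio with command/args/env parameters, and the Cal.com environment variable name is illustrative; check the cal-mcp README and your plugin version before relying on either:

import os
from livekit.agents import AgentSession, mcp
from livekit.plugins import openai

session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="marin"),
    mcp_servers=[
        mcp.MCPServerStdio(
            command="npx",
            args=["-y", "@calcom/cal-mcp"],
            env={"CALCOM_API_KEY": os.environ["CALCOM_API_KEY"]},  # hypothetical variable name
        )
    ],
)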
12) Deploy to LiveKit Cloud (production path)
While local is great for development, you’ll likely deploy the worker and let LiveKit Cloud handle dispatch at scale:
# Authenticate once
lk cloud auth
# Create & deploy your agent from the working dir
lk agent create
lk agent deploy
The CLI will generate/update a livekit.toml in your project, track your agent ID, and build a container image for you.
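For reference only, a generated livekit.toml has roughly this shape; the CLI writes the authoritative file, and the exact field names may vary by CLI version:

[project]
  subdomain = "<your-project-subdomain>"

[agent]
  id = "<agent-id created by lk agent create>"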
13) End‑to‑end test checklist
- Worker running locally or deployed.
- Call your Twilio number; LiveKit SIP should create a room (e.g., call-...) and dispatch your agent_name.
- Speak and interrupt the agent; Realtime handles turn detection.
- Ask the agent to end the call → verifies the end_call tool (room deletion).
- Confirm noise cancellation is effective (BVC telephony).
- See Langfuse traces arriving via OTLP.
14) Troubleshooting
- Agent joins but call doesn’t end: ensure you’re calling delete_room(...) on the LiveKit API; merely stopping the agent leaves the caller in silence.
- Dispatch not working: verify your dispatch-rule.json schema and that agentName matches the agent_name you passed to WorkerOptions.
- No audio / choppy audio: enable BVC (either trunk‑level krispEnabled or the RoomInputOptions filter) and test again.
- Realtime voice not speaking: confirm OPENAI_API_KEY is set and you’re using the Realtime plugin (not a text‑only LLM).
- uv environment issues: re‑run uv sync and verify Python 3.11+ is active.
15) Full, self‑contained file listing (final state)
pyproject.toml — see §2.
.env.local — see §3 (Sensitive keys omitted).
src/tools.py — see §6.
src/agent.py — consolidated:
from dotenv import load_dotenv
import base64, os
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, RoomInputOptions, cli
from livekit.agents.llm import function_tool
from livekit.plugins import openai, noise_cancellation
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
load_dotenv(".env.local", override=True)
def setup_langfuse_otel():
host = os.getenv("LANGFUSE_HOST", "").rstrip("/")
pub = os.getenv("LANGFUSE_PUBLIC_KEY")
sec = os.getenv("LANGFUSE_SECRET_KEY")
if not (host and pub and sec):
return None
os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = f"{host}/api/public/otel/v1/traces"
os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = "http/protobuf"
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "Authorization=Basic " + base64.b64encode(f"{pub}:{sec}".encode()).decode()
tp = TracerProvider()
tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
return tp
INSTRUCTIONS = """
You are a friendly receptionist. Keep turns short.
Collect caller name & reason. Offer a follow-up. Use the end_call tool to hang up.
"""
class PhoneAgent(Agent):
def __init__(self) -> None:
super().__init__(instructions=INSTRUCTIONS)
@function_tool
async def end_call(self, ctx):
"""Politely end the call and hang up for everyone."""
from tools import hangup_call
await ctx.wait_for_playout()
await hangup_call()
async def on_enter(self) -> None:
await self.session.generate_reply()
async def entrypoint(ctx: JobContext):
setup_langfuse_otel()
session = AgentSession(
llm=openai.realtime.RealtimeModel(voice="marin"),
)
await session.start(
room=ctx.room,
agent=PhoneAgent(),
room_input_options=RoomInputOptions(
noise_cancellation=noise_cancellation.BVCTelephony(),
),
)
if __name__ == "__main__":
cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, agent_name="phone-ai"))
16) What you learned
- Core LiveKit building blocks: rooms, agents, dispatch, and SIP inbound trunks.
- Realtime voice with OpenAI Realtime via LiveKit’s plugin (low latency, turn detection).
- Telephony routing (Twilio → LiveKit SIP) with explicit agent dispatch.
- Clean termination (delete room) and noise cancellation (BVC).
- Observability with OTLP → Langfuse.
17) Extensions & next steps
- Separate TTS while keeping Realtime STT: configure the Realtime model for text output (modalities=["text"]) and pair it with your chosen TTS (see the sketch after this list).
- Outbound calling (dial out to users) with SIP participants.
- Advanced dispatch and scaling in LiveKit Cloud via lk agent tooling.
- Add MCP tools (e.g., Cal.com) for scheduling workflows.
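A minimal sketch of the first extension, assuming your plugin version accepts modalities=["text"] on RealtimeModel and ships an openai.TTS wrapper:

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    # Realtime still listens and reasons, but returns text only...
    llm=openai.realtime.RealtimeModel(modalities=["text"]),
    # ...while a separate TTS voice speaks the replies.
    tts=openai.TTS(voice="alloy"),
)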
Resources
- LiveKit Agents (Python) docs & reference: building agents, sessions, worker lifecycle. (LiveKit docs)
- OpenAI Realtime: conceptual overview & API reference. (OpenAI Platform)
- LiveKit ↔ OpenAI Realtime plugin (Python): how to pass RealtimeModel. (LiveKit docs)
- SIP inbound & Twilio Voice: trunks, dispatch rules, TwiML <Sip>. (LiveKit docs)
- Noise cancellation (Krisp/BVC): feature overview & trunk options. (LiveKit docs)
- Langfuse OTLP ingestion (getting started). (Langfuse)
- Example repos
Redactions / sensitive items
- API keys and secrets are omitted by design. Use .env.local locally and LiveKit secrets for production deployments.
That’s it! You’ve built a phone‑call voice AI from fundamentals, not just copy‑pasting: LiveKit room/agent basics → Realtime speech → Twilio SIP → clean hang‑up → Langfuse traces. From here, iterate on prompts, add tools (search, calendar), and move to LiveKit Cloud for resilient, auto‑scaled deployments.