AI Calling Agent API Architecture for Real Estate

Zappio Team

AI & Real Estate Experts · 4 July 2026 · 14 min read

This Is a Distributed Systems Problem, Not a Prompt Engineering Exercise

Building a production-grade AI Calling Agent for real estate is not a prompt engineering exercise. It is a distributed systems problem — one that spans telephony infrastructure, speech processing pipelines, LLM inference, CRM integration, concurrent call management, and real-time data synchronization. Developers and solution architects evaluating or building AI Calling systems for real estate need to understand the full architectural stack before making infrastructure commitments that are expensive to reverse.

This article details the complete API architecture for a scalable real estate AI Calling pipeline — from lead ingestion through call orchestration, speech processing, LLM inference, disposition logic, and CRM write-back — with emphasis on the components that determine production performance versus the components that look identical in demos but fail at scale.

The Seven-Layer Architecture

A production AI Calling Agent for real estate operates across seven distinct architectural layers. Each layer has specific performance requirements, failure modes, and scaling constraints that must be understood before the system handles real lead volume.

┌─────────────────────────────────────────────────────┐
│  Layer 7: Analytics & Monitoring                    │
├─────────────────────────────────────────────────────┤
│  Layer 6: CRM Integration & Data Sync               │
├─────────────────────────────────────────────────────┤
│  Layer 5: Disposition Logic & Outcome Classification│
├─────────────────────────────────────────────────────┤
│  Layer 4: LLM Inference & Response Generation       │
├─────────────────────────────────────────────────────┤
│  Layer 3: Speech Processing (ASR → NLU → TTS)       │
├─────────────────────────────────────────────────────┤
│  Layer 2: Call Orchestration                        │
├─────────────────────────────────────────────────────┤
│  Layer 1: Lead Ingestion & Queue Management         │
└─────────────────────────────────────────────────────┘

Layer 1: Lead Ingestion & Queue Management

The ingestion layer receives lead data from multiple sources simultaneously — portal webhooks (99acres, MagicBricks, Housing.com), Meta Lead Ads webhooks, Google Lead Form webhooks, CRM-native triggers, and manual upload batches. It must handle burst traffic during project launches (2,000+ leads/hour) without dropping events.

Webhook receiver — a stateless HTTP endpoint (Node.js Express or FastAPI) that accepts POST requests, validates HMAC signatures, and immediately acknowledges receipt with HTTP 200 to prevent portal retry storms; processing happens asynchronously
Message queue — Apache Kafka or AWS SQS/SNS for durable lead event buffering; all webhook payloads are written to the queue as immutable events before any processing occurs, ensuring leads are never lost during downstream processing failures
Deduplication service — checks the phone number and submission timestamp against a Redis cache (TTL: 4 hours) to suppress duplicate submissions from the same buyer across multiple portals; this check runs before the call trigger so that the same buyer never receives multiple concurrent AI calls
Priority queue router — assigns each lead to a priority tier (A/B/C/D) based on lead source, portal lead tier, intent score, and time of submission, determining queue position for call trigger dispatch
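
The signature check and duplicate suppression described above can be sketched as follows. This is a minimal illustration, not any portal's documented scheme: the HMAC-SHA256 hex signature format, the function and class names, and keying the cache on phone number alone are all assumptions, and the in-memory dictionary stands in for a Redis key written with a 4-hour expiry.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of body under secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing
    return hmac.compare_digest(expected, signature_hex)


class DedupCache:
    """In-memory stand-in for a Redis key with a TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds: int = 4 * 60 * 60):
        self.ttl = ttl_seconds
        self._first_seen: Dict[str, float] = {}

    def is_duplicate(self, phone: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        seen_at = self._first_seen.get(phone)
        if seen_at is not None and now - seen_at < self.ttl:
            return True
        # First sighting, or the previous entry has expired
        self._first_seen[phone] = now
        return False
```

In a real receiver the signature would be checked before the payload is written to the queue, and the duplicate check would run in the consumer before a call trigger is dispatched.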

def assign_priority_tier(lead: Lead) -> str:
    score = 0

    # Source weight
    if lead.source in ["99acres_premium_assured", "housing_express"]:
        score += 40
    elif lead.source in ["google_search", "magicbricks_tier1"]:
        score += 30
    elif lead.source in ["meta_facebook", "instagram"]:
        score += 20
    else:
        score += 10

    # Intent score (if available)
    if lead.intent_score:
        score += min(lead.intent_score // 2, 30)

    # After-hours bonus (peak intent windows)
    submission_hour = lead.submitted_at.hour  # assumes submitted_at is already in IST
    if 20 <= submission_hour <= 23:  # 8:00 PM - 11:59 PM
        score += 15

    # Tier assignment
    if score >= 70: return "A"   # < 45 sec trigger
    if score >= 50: return "B"   # < 90 sec trigger
    if score >= 30: return "C"   # < 150 sec trigger
    return "D"                   # < 300 sec trigger

💡

Target latency — Layer 1 end-to-end: under 8 seconds from webhook receipt to queue entry with priority assignment.

Layer 2: Call Orchestration

The orchestration layer manages the lifecycle of every AI call — from trigger to termination — including concurrent call limits, retry logic, human transfer routing, and carrier-level telephony management. Indian real estate AI Calling requires Indian telephony infrastructure for regulatory compliance, latency, and call quality; the orchestration layer connects to one or more telephony providers via SIP trunk or REST API.

Exotel — REST API, WebSocket support for real-time audio streaming, Indian DID numbers, established compliance posture
Tata Communications — enterprise SIP trunks, highest concurrent call capacity for large deployments, requires enterprise contract
Plivo — REST API with WebSocket streaming, competitive per-minute pricing, good Indian carrier interconnect

QUEUED → DIALING → RINGING → CONNECTED → IN_PROGRESS
                                              ↓
                              HUMAN_TRANSFER / COMPLETED / NO_ANSWER / FAILED
                                              ↓
                                         DISPOSITION_WRITTEN
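
One way to enforce this lifecycle is an explicit transition table. The sketch below is illustrative only: the `CallLifecycle` and `advance` names are hypothetical, and allowing FAILED and NO_ANSWER to be reached from the earlier dialing and ringing states is an assumption that goes slightly beyond the diagram.

```python
from enum import Enum


class CallLifecycle(Enum):
    QUEUED = "queued"
    DIALING = "dialing"
    RINGING = "ringing"
    CONNECTED = "connected"
    IN_PROGRESS = "in_progress"
    HUMAN_TRANSFER = "human_transfer"
    COMPLETED = "completed"
    NO_ANSWER = "no_answer"
    FAILED = "failed"
    DISPOSITION_WRITTEN = "disposition_written"


S = CallLifecycle

# Allowed next states for each state (early exits are an assumption)
TRANSITIONS = {
    S.QUEUED: {S.DIALING, S.FAILED},
    S.DIALING: {S.RINGING, S.FAILED},
    S.RINGING: {S.CONNECTED, S.NO_ANSWER, S.FAILED},
    S.CONNECTED: {S.IN_PROGRESS, S.FAILED},
    S.IN_PROGRESS: {S.HUMAN_TRANSFER, S.COMPLETED, S.FAILED},
    S.HUMAN_TRANSFER: {S.DISPOSITION_WRITTEN},
    S.COMPLETED: {S.DISPOSITION_WRITTEN},
    S.NO_ANSWER: {S.DISPOSITION_WRITTEN},
    S.FAILED: {S.DISPOSITION_WRITTEN},
    S.DISPOSITION_WRITTEN: set(),
}


def advance(current: CallLifecycle, target: CallLifecycle) -> CallLifecycle:
    """Return target if the transition is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting illegal transitions at this level makes out-of-order provider callbacks visible instead of silently corrupting session state.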

Concurrent call management: each telephony provider has concurrent call limits tied to trunk capacity. The orchestration layer maintains a semaphore counter per provider and queues call triggers when the limit is reached rather than attempting over-capacity calls that fail silently.

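import asyncio

class CallOrchestrator:
    def __init__(self, provider: TelephonyProvider, max_concurrent: int):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.provider = provider

    async def initiate_call(self, lead: Lead, script_context: dict) -> CallSession:
        # Hold the slot for the whole call, not just call setup; otherwise the
        # semaphore limits concurrent setups rather than concurrent calls.
        await self.semaphore.acquire()
        try:
            return await self.provider.create_call(
                to=lead.phone,
                from_=self._select_caller_id(lead.city),  # City-matched DID
                websocket_url=f"wss://audio.platform.io/session/{lead.lead_id}",
                timeout_seconds=30,
                record=True,
                record_format="mp3"
            )
        except Exception:
            self.semaphore.release()  # Setup failed, so free the slot immediately
            raise

    def release_slot(self) -> None:
        """Call exactly once when a session reaches a terminal state."""
        self.semaphore.release()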

Caller ID selection: use city-matched local DID numbers — Gurgaon leads receive a 0124 or matching mobile prefix number; Hyderabad leads receive a 040 or matching prefix number. Local caller IDs improve answer rates by 18–24% over national or unfamiliar area codes.
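
A minimal sketch of city-matched caller ID selection follows. The numbers are placeholders, and the pool structure, fallback number, and round-robin rotation are assumptions for illustration; real DIDs are provisioned through the telephony provider.

```python
from itertools import cycle

# Placeholder DIDs for illustration only
DID_POOLS = {
    "gurgaon": ["+911240000001", "+911240000002"],  # 0124 STD code
    "hyderabad": ["+914000000001"],                  # 040 STD code
}
FALLBACK_DID = "+918000000000"  # hypothetical national number

# One endless round-robin iterator per city spreads calls across its DIDs
_rotations = {city: cycle(numbers) for city, numbers in DID_POOLS.items()}


def select_caller_id(city: str) -> str:
    """Return the next local DID for the lead's city, or the fallback number."""
    rotation = _rotations.get(city.strip().lower())
    return next(rotation) if rotation else FALLBACK_DID
```

Rotating across several DIDs per city is one way to avoid concentrating all outbound volume on a single number.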

Retry schedule:

Attempt	Delay After Previous	Time Window
Attempt 1	Immediate (< 90 sec from lead receipt)	9 AM – 9 PM IST (first-contact window)
Attempt 2	2 hours after Attempt 1	9 AM – 9 PM IST
Attempt 3	24 hours after Attempt 2	10 AM – 7 PM IST (conservative)
No further AI attempts	—	Handoff to human BDR queue or long-cycle nurture
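
The retry schedule can be expressed as a small scheduling function. This is a sketch under stated assumptions: timestamps are already in IST, the function name and rule table are hypothetical, and pushing an out-of-window attempt to the next window opening is one reasonable policy that the schedule itself does not specify.

```python
from datetime import datetime, time, timedelta
from typing import Optional

# attempt number -> (delay after previous attempt, window opens, window closes)
RETRY_RULES = {
    2: (timedelta(hours=2), time(9, 0), time(21, 0)),
    3: (timedelta(hours=24), time(10, 0), time(19, 0)),
}


def next_attempt_at(attempt_number: int, previous_attempt: datetime) -> Optional[datetime]:
    """Return when the given attempt should run, or None if AI attempts are exhausted."""
    rule = RETRY_RULES.get(attempt_number)
    if rule is None:
        return None  # hand off to the human BDR queue or long-cycle nurture
    delay, opens, closes = rule
    candidate = previous_attempt + delay
    if candidate.time() < opens:
        # Too early: wait for the window to open the same day
        return datetime.combine(candidate.date(), opens, tzinfo=candidate.tzinfo)
    if candidate.time() >= closes:
        # Too late: defer to the window opening on the next day
        next_day = candidate.date() + timedelta(days=1)
        return datetime.combine(next_day, opens, tzinfo=candidate.tzinfo)
    return candidate
```

For example, a first attempt at 8 PM produces a second attempt at 9 AM the next morning rather than at 10 PM.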

Layer 3: Speech Processing — ASR → NLU → TTS

The speech processing layer converts bidirectional audio into structured intent data and generates natural-sounding AI responses. This is the layer most responsible for perceived call quality — and the layer most developers underestimate in complexity.

Automatic Speech Recognition (ASR)

The ASR engine receives raw audio from the telephony stream (8 kHz G.711, the standard telephony codec; Indian carriers generally use the A-law variant rather than μ-law, so confirm the encoding your provider streams) and transcribes it to text in real time with under 300ms latency. Critical Indian real estate ASR requirements include Hinglish code-switching, Indian English accent variation across regions, real estate vocabulary (RERA, HARERA, BHK, PLC, super built-up, carpet area, possession), and proper noun recognition for project names, sectors, and road names.

ASR Provider	Hinglish Support	Indian RE Vocabulary	Latency	Cost
Google Cloud Speech-to-Text v2	Strong	Requires custom vocabulary	180–280ms	$0.016/min
OpenAI Whisper (self-hosted)	Strong	Requires fine-tuning	200–400ms (GPU)	Infrastructure cost
Sarvam AI STT	Native Indian	RE terminology included	150–250ms	₹0.80–₹1.20/min
Azure Speech (custom model)	Moderate	With custom model training	160–240ms	$0.016/min
Deepgram Nova-3 India	Good	Custom vocabulary needed	120–200ms	$0.0043/min

💡

Recommendation for Indian real estate: Sarvam AI STT for Tier 1 deployment (native Hinglish, lowest latency for Indian accents, real estate vocabulary pre-trained) with Google Cloud Speech as fallback.

Natural Language Understanding (NLU)

NLU sits between ASR output and LLM inference, extracting structured intent data from transcribed speech before the full LLM processes it — reducing LLM inference cost and latency by handling simple entity extraction at the NLU layer.

import re

def extract_entities(transcript: str) -> dict:
    """Fast regex + NER extraction before LLM inference."""
    entities = {}
    lowered = transcript.lower()

    # Budget extraction; the trailing \b stops the short unit "l" from
    # matching the first letter of an unrelated word (e.g. "3 large rooms")
    budget_pattern = r'(\d+(?:\.\d+)?)\s*(lakhs?|crores?|cr|l)\b'
    budget_match = re.search(budget_pattern, transcript, re.IGNORECASE)
    if budget_match:
        # normalize_amount is assumed to be defined elsewhere
        entities['budget_stated'] = normalize_amount(
            budget_match.group(1), budget_match.group(2)
        )

    # BHK extraction (case-insensitive so "Bhk" is also caught)
    bhk_pattern = r'(\d)\s*(?:bhk|bedroom)'
    bhk_match = re.search(bhk_pattern, transcript, re.IGNORECASE)
    if bhk_match:
        entities['bhk_preference'] = int(bhk_match.group(1))

    # Possession timeline; whole-phrase matching so "ready" does not fire on "already"
    possession_keywords = {
        'ready': 'immediate',
        'ready to move': 'immediate',
        'under construction': 'under_construction',
        '6 months': '6_months',
        '1 year': '12_months',
        '2 years': '24_months'
    }
    for keyword, value in possession_keywords.items():
        if re.search(rf'\b{re.escape(keyword)}\b', lowered):
            entities['possession_preference'] = value
            break

    return entities
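
The extraction code calls `normalize_amount`, which the article does not define. A plausible version, assuming it converts a spoken amount and its unit into whole rupees, might look like this (the unit table and rounding choice are assumptions):

```python
# Indian numbering units expressed in rupees
_UNIT_MULTIPLIERS = {
    "lakh": 100_000,
    "lakhs": 100_000,
    "l": 100_000,
    "crore": 10_000_000,
    "crores": 10_000_000,
    "cr": 10_000_000,
}


def normalize_amount(value: str, unit: str) -> int:
    """Convert e.g. ("1.2", "Cr") to 12000000 rupees."""
    multiplier = _UNIT_MULTIPLIERS[unit.lower()]  # KeyError on an unknown unit
    # round() absorbs floating-point noise from decimal inputs
    return round(float(value) * multiplier)
```

Storing the budget in rupees keeps it directly comparable with a price floor held in the same unit.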

Text-to-Speech (TTS)

The TTS engine converts LLM-generated text responses to natural-sounding speech. For Indian real estate, TTS quality directly impacts buyer engagement — robotic or monotone voices trigger immediate hang-ups. Requirements: natural Hindi/English/Hinglish prosody, under 200ms first-byte latency for streaming, correct pronunciation of Indian names and place names, and an appropriate formal register for real estate context.

💡

Recommended: ElevenLabs Turbo v3 (lowest latency, highest naturalness) for English/Hinglish; Sarvam AI TTS for Hindi-primary conversations. Stream audio in 50ms chunks to the telephony layer to minimize perceived response latency.
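
As a rough illustration of the 50ms chunking, the sketch below slices already-encoded audio into fixed-duration frames. It assumes the TTS output has been transcoded to 8 kHz, 8-bit G.711 before it reaches this step, so 50ms corresponds to 400 bytes; the function name is hypothetical.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1  # G.711 carries 8 bits per sample


def chunk_audio(audio: bytes, chunk_ms: int = 50) -> Iterator[bytes]:
    """Yield fixed-duration slices of audio; the last slice may be shorter."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]
```

With a streaming TTS provider the same slicing would be applied to each audio fragment as it arrives rather than to a complete utterance.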

Layer 4: LLM Inference & Response Generation

The LLM layer is the conversational intelligence of the system — it receives conversation history, extracted entities, and project knowledge base context, and generates the next AI turn. Every inference call receives a structured prompt composed of four context blocks: system instructions with project data and qualification objectives, known entities from the current call, call state, and detected language.

import json

def build_inference_prompt(
    conversation_history: list,
    entities: dict,
    project_kb: dict,
    call_state: CallState
) -> list:  # chat messages (system prompt + history), not a single string

    system_prompt = f"""
You are a professional real estate qualification specialist for {project_kb['developer_name']}.
Project: {project_kb['project_name']} | Location: {project_kb['location']}
RERA/HARERA Registration: {project_kb['rera_number']}
Available inventory: {json.dumps(project_kb['inventory'])}
Pricing: {json.dumps(project_kb['pricing'])}
Possession date: {project_kb['possession_date']}

QUALIFICATION OBJECTIVES (collect in order):
1. Confirm budget range (project starts at {project_kb['price_floor']})
2. Confirm BHK requirement ({', '.join(project_kb['available_bhk'])} available)
3. Confirm possession timeline preference
4. Confirm end-use vs investment purpose
5. Book site visit if qualified (available slots: {project_kb['site_visit_slots']})

KNOWN ENTITIES FROM THIS CALL:
{json.dumps(entities, indent=2)}

CALL STATE: {call_state.value}
LANGUAGE: {call_state.detected_language}

Respond in {call_state.detected_language}. Keep responses under 40 words.
If buyer switches language, match immediately.
Never make up project data not in the context above.
"""

    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(conversation_history)

    return messages

Real-time conversation requires LLM response generation in under 800ms end-to-end (from ASR output to TTS audio start) — a constraint that eliminates large model inference without optimization.

LLM	Avg. Latency (India routing)	Quality	Cost/1K input tokens (USD)
GPT-4o mini	280–450ms	High	$0.00015
Claude Haiku 4.5	250–400ms	High	$0.001
Gemini 1.5 Flash	200–380ms	High	$0.000075
GPT-4o (full)	600–1,200ms	Very High	$0.0025
Llama 3.3 70B (self-hosted, A100)	180–320ms	Good	Infrastructure

💡

Recommendation: GPT-4o mini or Gemini 1.5 Flash for qualification turns (high speed, sufficient quality for structured qualification). Reserve the full GPT-4o model for escalation detection and complex objection turns where response quality is worth the latency cost. Use streaming API responses for all LLM providers — begin TTS generation as the first tokens arrive rather than waiting for the complete response, reducing perceived latency by 200–350ms.
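
Starting TTS before the LLM finishes usually means buffering streamed tokens and flushing at sentence boundaries. The generator below is a simplified sketch of that idea; the function name is hypothetical, and the naive punctuation check would need refinement for decimals such as "1.2 crore" if the tokenizer splits them around the period.

```python
from typing import Iterable, Iterator

# Sentence terminators, including the Devanagari danda used in Hindi
SENTENCE_END = (".", "?", "!", "।")


def speakable_segments(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences that can be sent to TTS early."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence
```

Each yielded segment can be handed to the TTS engine while the model is still generating the next one.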

Layer 5: Disposition Logic & Outcome Classification

When a call ends, the disposition layer classifies the call outcome into structured CRM-ready data. This classification must be deterministic — the same call outcome should always produce the same disposition — and comprehensive enough to drive all downstream workflows.

from enum import Enum

class Disposition(Enum):
    SITE_VISIT_BOOKED = "site_visit_booked"
    QUALIFIED_CALLBACK = "qualified_callback_requested"
    QUALIFIED_CONSIDERING = "qualified_considering"
    BUDGET_MISMATCH = "budget_mismatch"
    TIMELINE_MISMATCH = "timeline_mismatch"
    NOT_IN_MARKET = "not_in_market"
    ALREADY_PURCHASED = "already_purchased"
    LANGUAGE_BARRIER = "language_barrier_escalate"
    REQUESTED_HUMAN = "human_escalation_requested"
    NO_ANSWER = "no_answer"
    INVALID_NUMBER = "invalid_number"
    CALL_DROPPED = "call_dropped_retry"
    DNC_REQUESTED = "do_not_call"

def classify_disposition(
    entities: dict,
    conversation_history: list,
    call_outcome: str,
    project_kb: dict
) -> Disposition:

    # Hard outcomes (deterministic)
    if call_outcome == "no_answer": return Disposition.NO_ANSWER
    if call_outcome == "invalid": return Disposition.INVALID_NUMBER

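    # Check DNC signal in buyer turns only, so the AI's own wording cannot
    # trigger it (assumes chat-style {"role", "content"} history entries)
    buyer_text = " ".join(
        str(turn.get("content", "")) for turn in conversation_history
        if turn.get("role") == "user"
    ).lower()
    if any(phrase in buyer_text
           for phrase in ["don't call", "remove number", "not interested ever"]):
        return Disposition.DNC_REQUESTED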

    # Site visit booked (highest priority positive)
    if entities.get("site_visit_confirmed"):
        return Disposition.SITE_VISIT_BOOKED

    # Budget qualification
    stated_budget = entities.get("budget_stated")
    if stated_budget:
        if stated_budget < project_kb["price_floor"] * 0.80:
            return Disposition.BUDGET_MISMATCH

    # Continue classification logic...
    return Disposition.QUALIFIED_CONSIDERING

Layer 6: CRM Integration & Data Sync

The CRM integration layer writes structured call outcomes to the developer's CRM within 30 seconds of call completion. For real estate, this typically means Sell.Do, LeadSquared, Salesforce Real Estate Cloud, Kylas, or Freshsales. Every call produces a standardized disposition payload written to CRM regardless of which CRM is deployed:

{
  "lead_id": "CRM_XXXXXXXX",
  "external_lead_id": "99A_XXXXXXXXXX",
  "call_session_id": "CALL_UUID_XXXXXXXX",
  "call_timestamp": "2026-07-04T14:23:17+05:30",
  "call_duration_seconds": 247,
  "recording_url": "https://storage.platform.io/recordings/CALL_UUID.mp3",
  "transcript_url": "https://storage.platform.io/transcripts/CALL_UUID.txt",
  "disposition": "site_visit_booked",
  "intent_score": 84,
  "qualification_data": {
    "budget_stated_min": 11000000,
    "budget_stated_max": 14000000,
    "bhk_preference": 3,
    "possession_preference": "within_18_months",
    "purpose": "end_use",
    "loan_required": true,
    "primary_objection": null,
    "language_used": "hinglish"
  },
  "site_visit": {
    "booked": true,
    "slot_date": "2026-07-06",
    "slot_time": "11:00",
    "confirmation_sent": true
  },
  "next_action": {
    "type": "site_visit_reminder",
    "scheduled_for": "2026-07-05T10:00:00+05:30"
  }
}

Idempotency: all CRM write-back calls include an X-Idempotency-Key header (value: call_session_id). If the write-back fails and retries, the CRM deduplicates based on this key — preventing duplicate lead records or double-booked site visits.

Retry policy for CRM write-back: Attempt 1 immediate post-call; Attempt 2 at 30 seconds if Attempt 1 fails; Attempt 3 at 5 minutes if Attempt 2 fails; dead letter queue if all 3 fail, for manual review — never silently drop CRM updates.
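
The idempotency key and retry policy can be combined in one write-back helper. The sketch below is illustrative: `send` is a hypothetical callable wrapping whichever CRM client is in use, the dead letter queue is modelled as a plain list, and the 30-second and 5-minute delays are interpreted as waits after the previous failed attempt.

```python
import time
from typing import Callable, List

RETRY_DELAYS_SECONDS = [0, 30, 300]  # immediate, then 30 s, then 5 min


def write_back(
    payload: dict,
    send: Callable[[dict, dict], bool],
    dead_letter: List[dict],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the CRM write up to three times; park the payload if all attempts fail."""
    # The same key on every attempt lets the CRM deduplicate retried writes
    headers = {"X-Idempotency-Key": payload["call_session_id"]}
    for delay in RETRY_DELAYS_SECONDS:
        sleep(delay)
        try:
            if send(payload, headers):
                return True
        except Exception:
            pass  # treat transport errors like any other failed attempt
    dead_letter.append(payload)  # never drop the update silently
    return False
```

In production the waits would normally be scheduled through the queue rather than by blocking a worker; the injectable `sleep` keeps the example testable.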

Layer 7: Analytics & Monitoring

Production AI Calling systems require real-time observability across all seven layers. The key metrics fall into two groups: infrastructure metrics that indicate system health, and business metrics that indicate whether the pipeline is converting leads.

System Health Metrics (Infrastructure)

Metric	Alert Threshold	Measurement
ASR transcription latency	> 350ms p95	Per-utterance timestamp
LLM inference latency	> 900ms p95	Inference request duration
TTS first-byte latency	> 250ms p95	Streaming start time
Call setup latency (trigger → ring)	> 120 seconds	Lead receipt to outbound ring
CRM write-back failure rate	> 2%	Failed writes / total calls
Concurrent call capacity utilization	> 85%	Active calls / trunk capacity
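
A p95 threshold check for the latency metrics can be sketched as below. The nearest-rank percentile method and the metric keys are assumptions; a production system would typically rely on the percentile support in its monitoring stack.

```python
from typing import Sequence

# p95 alert thresholds in milliseconds, taken from the table above
THRESHOLDS_MS = {"asr": 350, "llm": 900, "tts_first_byte": 250}


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def breached(metric: str, samples: Sequence[float]) -> bool:
    """True if the p95 of samples exceeds the alert threshold for metric."""
    return p95(samples) > THRESHOLDS_MS[metric]
```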

Business Performance Metrics

Metric	Target	Frequency
Lead contact rate	> 90%	Daily
Qualification completion rate	> 50%	Daily
Site visit booking rate (per qualified)	> 20%	Weekly
Call drop rate (< 30 seconds)	< 15%	Daily
DNC opt-out rate	< 3%	Weekly
CRM field completion rate	> 94%	Weekly
After-hours lead coverage rate	> 95%	Daily

Scaling Considerations: From 100 to 10,000 Concurrent Calls

The architecture described above handles 100–500 concurrent calls adequately on modest infrastructure. Scaling to 10,000 concurrent calls (realistic for enterprise developers during major project launches) requires specific architectural decisions at each layer.

Layer 1 (Ingestion) — Kafka partitioning by lead source; auto-scaling consumer groups; Redis Cluster for deduplication at high volume
Layer 2 (Orchestration) — horizontal scaling of call orchestrators; multiple SIP trunks from different providers for capacity redundancy; session state stored in Redis (not in-process) to allow stateless orchestrator scaling
Layer 3 (Speech) — ASR/TTS capacity must scale with concurrent calls since each call requires a dedicated streaming connection; use managed cloud ASR/TTS with auto-scaling, as self-hosted Whisper requires GPU cluster management
Layer 4 (LLM) — LLM API rate limits are the most common bottleneck at scale; maintain accounts across multiple API providers with a load-balancing proxy that routes based on current rate limit headroom
Layer 6 (CRM) — CRM APIs have rate limits (Sell.Do: 100 req/min; LeadSquared: 50 req/min); buffer CRM write-backs in a queue and write at the CRM's maximum rate, never dropping writes but accepting that CRM sync lags by 2–5 minutes at peak volume
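
For the Layer 4 point, routing by rate limit headroom can be as simple as choosing the provider with the largest unused share of its per-minute quota. The names and the per-minute accounting in this sketch are hypothetical; a real proxy would read the remaining quota from each provider's response headers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProviderQuota:
    name: str
    limit_per_minute: int
    used_this_minute: int

    @property
    def headroom(self) -> float:
        """Fraction of the per-minute quota still unused."""
        return 1 - self.used_this_minute / self.limit_per_minute


def pick_provider(quotas: List[ProviderQuota]) -> Optional[ProviderQuota]:
    """Return the provider with the most headroom, or None if all are exhausted."""
    available = [q for q in quotas if q.used_this_minute < q.limit_per_minute]
    if not available:
        return None  # caller should queue the request or shed load
    return max(available, key=lambda q: q.headroom)
```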

Frequently Asked Questions

Build vs. buy is a function of three variables: your lead volume — at under 5,000 leads/month, build cost is rarely justified versus a managed platform's deployment timeline advantage; your custom requirements — if you need India-specific telephony, HARERA data integration, and Sell.Do/LeadSquared native connectivity out of the box, a real-estate-native managed platform deploys in 7–14 days versus 90–180 days for a custom build; and your engineering capacity — maintaining a production voice AI pipeline requires dedicated ML/infrastructure engineering that most real estate developers lack internally. This architecture is most relevant for proptech companies building AI Calling as a product offering, or enterprise developer groups with dedicated technology teams.

Each call session maintains its own isolated context object — conversation history, entity state, call state machine, and project knowledge base reference. The LLM inference is stateless: it receives the full context on every turn and does not share state between sessions. Redis stores session state keyed by call_session_id; the orchestration layer fetches and updates this state atomically on each turn. There is no global LLM state that can be contaminated by concurrent sessions.

At 1,000 leads/month with an average of 3 call attempts per lead and a 4-minute average call duration: approximately 3,000 calls and 12,000 call-minutes/month. Infrastructure cost estimate: ASR (Sarvam/Google) ₹12,000–₹18,000; TTS (ElevenLabs Turbo) ₹8,000–₹14,000; LLM inference (GPT-4o mini, ~500 tokens/turn × 8 turns × 3,000 calls) ₹3,500–₹5,500; telephony (Exotel, ₹0.80–₹1.20/min) ₹9,600–₹14,400; infrastructure (compute, Redis, Kafka managed) ₹8,000–₹14,000. Total: ₹41,100–₹65,900/month, comparable to managed platform pricing before accounting for the engineering time to build and maintain the system.
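
The arithmetic behind this estimate can be reproduced in a few lines. The component ranges are copied from the figures above; only the telephony line is derived from a per-minute rate, and the variable names are illustrative.

```python
LEADS_PER_MONTH = 1000
ATTEMPTS_PER_LEAD = 3
MINUTES_PER_CALL = 4

calls = LEADS_PER_MONTH * ATTEMPTS_PER_LEAD   # 3,000 calls
call_minutes = calls * MINUTES_PER_CALL       # 12,000 call-minutes

# Telephony at ₹0.80–₹1.20 per minute
telephony = (round(0.80 * call_minutes), round(1.20 * call_minutes))

# Monthly (low, high) ranges in rupees
components = {
    "asr": (12_000, 18_000),
    "tts": (8_000, 14_000),
    "llm": (3_500, 5_500),
    "telephony": telephony,
    "infrastructure": (8_000, 14_000),
}

total = (
    sum(low for low, _ in components.values()),
    sum(high for _, high in components.values()),
)
```

Changing the lead volume or call duration at the top shows how quickly the telephony line moves with call-minutes; the other ranges would need to be re-estimated separately.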

Latency and cost. A production real-time voice conversation requires under 800ms end-to-end response generation to feel natural — routing every turn through a full frontier model (600–1,200ms typical) breaks this constraint and produces noticeable, unnatural pauses. Handling simple entity extraction (budget, BHK, possession keywords) with fast regex/NER before LLM inference reduces the tokens the LLM needs to process and allows a smaller, faster model to handle the bulk of qualification turns, reserving the larger model only for complex objection handling or escalation detection where the latency cost is justified by response quality.

Constrain the system prompt explicitly — include only verified project knowledge base fields (pricing, inventory, RERA number, possession date) in the context window, and add an explicit instruction never to state information not present in that context. Pair this with a post-response validation layer that flags any numeric claims (prices, dates, unit counts) in the AI's output against the knowledge base before the response is spoken — if a mismatch is detected, regenerate the response or fall back to a safe deflection ('let me confirm that exact detail and follow up'). This two-layer approach — constrained prompting plus output validation — is significantly more reliable than prompting alone.
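
A very small version of the output validation step is sketched below. It is an assumption-heavy illustration: it compares numbers as strings, so "1.20" and "1.2" would not match, and it ignores numbers spelled out in words. The function name and the flat list of knowledge base facts are hypothetical.

```python
import re
from typing import List

_NUMBER = re.compile(r"\d+(?:\.\d+)?")
_DIGIT_GROUP_COMMA = re.compile(r"(?<=\d),(?=\d)")


def _numbers_in(text: str) -> List[str]:
    # Remove digit-grouping commas so "1,20,00,000" is read as one number
    return _NUMBER.findall(_DIGIT_GROUP_COMMA.sub("", text))


def unverified_numbers(response: str, kb_facts: List[str]) -> List[str]:
    """Return numbers in the response that appear nowhere in the knowledge base facts."""
    allowed = set()
    for fact in kb_facts:
        allowed.update(_numbers_in(fact))
    return [n for n in _numbers_in(response) if n not in allowed]
```

If the returned list is non-empty, the response would be regenerated or replaced with the safe deflection before anything is spoken.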

Disclaimer: API specifications, latency benchmarks, pricing figures, and provider recommendations in this article reflect market conditions and documented API capabilities as of Q2 2026. Cloud provider pricing, API rate limits, and service specifications change frequently. Verify current pricing and capabilities directly with each provider before architectural commitments. Infrastructure cost estimates are indicative ranges — actual costs depend on call volume, call duration, model selection, self-hosted vs. managed infrastructure choices, and regional data residency requirements. This content is for technical evaluation purposes only.
