Zappio Team
AI & Real Estate Experts · 4 July 2026 · 14 min read
Building a production-grade AI Calling Agent for real estate is not a prompt engineering exercise. It is a distributed systems problem — one that spans telephony infrastructure, speech processing pipelines, LLM inference, CRM integration, concurrent call management, and real-time data synchronization. Developers and solution architects evaluating or building AI Calling systems for real estate need to understand the full architectural stack before making infrastructure commitments that are expensive to reverse.
This article details the complete API architecture for a scalable real estate AI Calling pipeline — from lead ingestion through call orchestration, speech processing, LLM inference, disposition logic, and CRM write-back — with emphasis on the components that determine production performance versus the components that look identical in demos but fail at scale.
A production AI Calling Agent for real estate operates across seven distinct architectural layers. Each layer has specific performance requirements, failure modes, and scaling constraints that must be understood before the system handles real lead volume.
┌─────────────────────────────────────────────────────┐
│  Layer 7: Analytics & Monitoring                    │
├─────────────────────────────────────────────────────┤
│  Layer 6: CRM Integration & Data Sync               │
├─────────────────────────────────────────────────────┤
│  Layer 5: Disposition Logic & Outcome Classification│
├─────────────────────────────────────────────────────┤
│  Layer 4: LLM Inference & Response Generation       │
├─────────────────────────────────────────────────────┤
│  Layer 3: Speech Processing (ASR → NLU → TTS)       │
├─────────────────────────────────────────────────────┤
│  Layer 2: Call Orchestration                        │
├─────────────────────────────────────────────────────┤
│  Layer 1: Lead Ingestion & Queue Management         │
└─────────────────────────────────────────────────────┘
The ingestion layer receives lead data from multiple sources simultaneously — portal webhooks (99acres, MagicBricks, Housing.com), Meta Lead Ads webhooks, Google Lead Form webhooks, CRM-native triggers, and manual upload batches. It must handle burst traffic during project launches (2,000+ leads/hour) without dropping events.
def assign_priority_tier(lead: Lead) -> str:
    score = 0

    # Source weight
    if lead.source in ["99acres_premium_assured", "housing_express"]:
        score += 40
    elif lead.source in ["google_search", "magicbricks_tier1"]:
        score += 30
    elif lead.source in ["meta_facebook", "instagram"]:
        score += 20
    else:
        score += 10

    # Intent score (if available)
    if lead.intent_score:
        score += min(lead.intent_score // 2, 30)

    # After-hours bonus (peak intent windows)
    submission_hour = lead.submitted_at.hour  # IST
    if 20 <= submission_hour <= 23:  # 8 PM - 11 PM
        score += 15

    # Tier assignment
    if score >= 70: return "A"  # < 45 sec trigger
    if score >= 50: return "B"  # < 90 sec trigger
    if score >= 30: return "C"  # < 150 sec trigger
    return "D"                  # < 300 sec trigger

Target latency for Layer 1, end to end: under 8 seconds from webhook receipt to queue entry with priority assignment.
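One way to turn the tier into a dispatch order is a heap keyed on each lead's trigger deadline. The following is a minimal sketch, not the platform's actual queue: `LeadQueue`, `QueuedLead`, and `TIER_DEADLINE_SECONDS` are hypothetical names, and the deadlines are taken from the tier comments in the scoring function.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Trigger deadlines in seconds per tier, taken from the tier comments above.
TIER_DEADLINE_SECONDS = {"A": 45, "B": 90, "C": 150, "D": 300}


@dataclass(order=True)
class QueuedLead:
    deadline: float                      # absolute time by which the call should start
    seq: int                             # tie-breaker so equal deadlines stay first-in, first-out
    lead_id: str = field(compare=False)
    tier: str = field(compare=False)


class LeadQueue:
    """Hypothetical in-memory queue ordered by call deadline (earliest first)."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()

    def push(self, lead_id: str, tier: str, received_at: float) -> None:
        deadline = received_at + TIER_DEADLINE_SECONDS[tier]
        heapq.heappush(self._heap, QueuedLead(deadline, next(self._counter), lead_id, tier))

    def pop(self) -> QueuedLead:
        return heapq.heappop(self._heap)

    def __len__(self) -> int:
        return len(self._heap)
```

Ordering by deadline rather than by tier alone means an old Tier D lead is not starved by a steady stream of Tier A leads. A production deployment would likely back this with a durable queue rather than process memory.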
The orchestration layer manages the lifecycle of every AI call — from trigger to termination — including concurrent call limits, retry logic, human transfer routing, and carrier-level telephony management. Indian real estate AI Calling requires Indian telephony infrastructure for regulatory compliance, latency, and call quality; the orchestration layer connects to one or more telephony providers via SIP trunk or REST API.
QUEUED → DIALING → RINGING → CONNECTED → IN_PROGRESS
                                              ↓
                       HUMAN_TRANSFER / COMPLETED / NO_ANSWER / FAILED
                                              ↓
                                     DISPOSITION_WRITTEN

Concurrent call management: each telephony provider has concurrent call limits tied to trunk capacity. The orchestration layer maintains a semaphore counter per provider and queues call triggers when the limit is reached rather than attempting over-capacity calls that fail silently.
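The call lifecycle can be enforced with an explicit transition table so that an out-of-order provider event is rejected instead of silently corrupting call state. This is a sketch under assumptions: `CallLifecycleState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical names, and the choice that NO_ANSWER is reached from RINGING and FAILED from any pre-terminal state is an interpretation of the diagram, not something the article specifies.

```python
from enum import Enum


class CallLifecycleState(Enum):
    QUEUED = "queued"
    DIALING = "dialing"
    RINGING = "ringing"
    CONNECTED = "connected"
    IN_PROGRESS = "in_progress"
    HUMAN_TRANSFER = "human_transfer"
    COMPLETED = "completed"
    NO_ANSWER = "no_answer"
    FAILED = "failed"
    DISPOSITION_WRITTEN = "disposition_written"


S = CallLifecycleState

# Assumed transition table; adjust to the events the telephony provider actually emits.
ALLOWED_TRANSITIONS = {
    S.QUEUED: {S.DIALING, S.FAILED},
    S.DIALING: {S.RINGING, S.FAILED},
    S.RINGING: {S.CONNECTED, S.NO_ANSWER, S.FAILED},
    S.CONNECTED: {S.IN_PROGRESS, S.FAILED},
    S.IN_PROGRESS: {S.HUMAN_TRANSFER, S.COMPLETED, S.FAILED},
    S.HUMAN_TRANSFER: {S.DISPOSITION_WRITTEN},
    S.COMPLETED: {S.DISPOSITION_WRITTEN},
    S.NO_ANSWER: {S.DISPOSITION_WRITTEN},
    S.FAILED: {S.DISPOSITION_WRITTEN},
    S.DISPOSITION_WRITTEN: set(),
}


def transition(current: CallLifecycleState, target: CallLifecycleState) -> CallLifecycleState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```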
import asyncio


class CallOrchestrator:
    def __init__(self, provider: TelephonyProvider, max_concurrent: int):
        # One semaphore slot per concurrent call the provider's trunk can carry
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.provider = provider

    async def initiate_call(self, lead: Lead, script_context: dict) -> CallSession:
        # Waits here when the provider is at capacity instead of over-dialing.
        # Note: the slot is released as soon as create_call returns. If the provider
        # returns when dialing starts rather than when the call ends, hold the slot
        # until the call-termination event so the limit covers active calls.
        async with self.semaphore:
            session = await self.provider.create_call(
                to=lead.phone,
                from_=self._select_caller_id(lead.city),  # City-matched DID
                websocket_url=f"wss://audio.platform.io/session/{lead.lead_id}",
                timeout_seconds=30,
                record=True,
                record_format="mp3"
            )
            return session

Caller ID selection: use city-matched local DID numbers. Gurgaon leads receive a 0124 or matching mobile prefix number; Hyderabad leads receive a 040 or matching prefix number. Local caller IDs improve answer rates by 18–24% over national or unfamiliar area codes.
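The `_select_caller_id` method referenced in the orchestrator is not shown. A minimal sketch of one possible implementation follows, assuming a per-city pool of DIDs rotated round-robin with a national fallback. `CallerIdSelector`, `DID_POOLS`, and `FALLBACK_DID` are hypothetical names, and the phone numbers are made-up placeholders rather than real provisioned DIDs.

```python
import itertools

# Placeholder DID pools; real numbers would come from the telephony provider.
DID_POOLS = {
    "gurgaon": ["+911240000001", "+911240000002"],   # 0124 prefix
    "hyderabad": ["+914000000001"],                   # 040 prefix
}
FALLBACK_DID = "+918000000000"  # placeholder national number


class CallerIdSelector:
    """Hypothetical city-matched caller ID picker with round-robin rotation."""

    def __init__(self, pools: dict, fallback: str) -> None:
        self._cycles = {city: itertools.cycle(numbers) for city, numbers in pools.items()}
        self._fallback = fallback

    def select(self, city: str) -> str:
        cycle = self._cycles.get(city.strip().lower())
        # Unknown city: fall back to the national number rather than failing the call
        return next(cycle) if cycle is not None else self._fallback
```

Rotating across a pool spreads call volume over several numbers, which may help keep any single DID from being flagged as spam. Whether that matters in practice depends on the carrier.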
Retry schedule for unanswered calls:

| Attempt | Delay After Previous | Time Window / Next Step |
|---|---|---|
| Attempt 1 | Immediate (< 90 sec from lead receipt) | Any time within 9 AM – 9 PM IST |
| Attempt 2 | 2 hours after Attempt 1 | 9 AM – 9 PM IST |
| Attempt 3 | 24 hours after Attempt 2 | 10 AM – 7 PM IST (conservative) |
| After Attempt 3 | No further AI attempts | Handoff to human BDR queue or long-cycle nurture |
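The retry table can be expressed as a small scheduling function. This sketch makes one assumption the table does not state: when the delayed time falls outside the permitted window, the attempt rolls forward to the next window opening. `RETRY_RULES` and `next_attempt_at` are hypothetical names.

```python
from datetime import datetime, time, timedelta, timezone
from typing import Optional

IST = timezone(timedelta(hours=5, minutes=30))  # India observes no daylight saving time

# attempt number -> (delay after previous attempt, window start, window end), from the table
RETRY_RULES = {
    2: (timedelta(hours=2), time(9, 0), time(21, 0)),
    3: (timedelta(hours=24), time(10, 0), time(19, 0)),
}


def next_attempt_at(attempt: int, previous_attempt: datetime) -> Optional[datetime]:
    """Earliest allowed time for `attempt`, or None when AI attempts are exhausted."""
    rule = RETRY_RULES.get(attempt)
    if rule is None:
        return None  # hand off to the human BDR queue or long-cycle nurture
    delay, start, end = rule
    candidate = (previous_attempt + delay).astimezone(IST)
    if candidate.time() < start:
        # Too early: wait for the window to open the same day
        return candidate.replace(hour=start.hour, minute=start.minute, second=0, microsecond=0)
    if candidate.time() >= end:
        # Too late: roll to the next day's window opening (assumption)
        next_day = candidate + timedelta(days=1)
        return next_day.replace(hour=start.hour, minute=start.minute, second=0, microsecond=0)
    return candidate
```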
The speech processing layer converts bidirectional audio into structured intent data and generates natural-sounding AI responses. This is the layer most responsible for perceived call quality — and the layer most developers underestimate in complexity.
The ASR engine receives raw audio from the telephony stream (8kHz, μ-law G.711 codec — standard telephony) and transcribes it to text in real time with under 300ms latency. Critical Indian real estate ASR requirements include Hinglish code-switching, Indian English accent variation across regions, real estate vocabulary (RERA, HARERA, BHK, PLC, super built-up, carpet area, possession), and proper noun recognition for project names, sectors, and road names.
| ASR Provider | Hinglish Support | Indian RE Vocabulary | Latency | Cost |
|---|---|---|---|---|
| Google Cloud Speech-to-Text v2 | Strong | Requires custom vocabulary | 180–280ms | $0.016/min |
| OpenAI Whisper (self-hosted) | Strong | Requires fine-tuning | 200–400ms (GPU) | Infrastructure cost |
| Sarvam AI STT | Native Indian | RE terminology included | 150–250ms | ₹0.80–₹1.20/min |
| Azure Speech (custom model) | Moderate | With custom model training | 160–240ms | $0.016/min |
| Deepgram Nova-3 India | Good | Custom vocabulary needed | 120–200ms | $0.0043/min |
Recommendation for Indian real estate: Sarvam AI STT for Tier 1 deployment (native Hinglish, lowest latency for Indian accents, real estate vocabulary pre-trained) with Google Cloud Speech as fallback.
NLU sits between ASR output and LLM inference, extracting structured intent data from transcribed speech before the full LLM processes it — reducing LLM inference cost and latency by handling simple entity extraction at the NLU layer.
import re

# Assumed helper (not shown in the original): converts a spoken amount to rupees.
UNIT_MULTIPLIER = {"lakh": 100_000, "l": 100_000, "crore": 10_000_000, "cr": 10_000_000}


def normalize_amount(value: str, unit: str) -> int:
    key = unit.lower().rstrip("s")  # "lakhs" -> "lakh", "crores" -> "crore"
    return int(round(float(value) * UNIT_MULTIPLIER[key]))


def extract_entities(transcript: str) -> dict:
    """Fast regex + NER extraction before LLM inference."""
    entities = {}
    lowered = transcript.lower()

    # Budget extraction; the trailing \b stops "3 large" or "2 crates" matching as units
    budget_pattern = r'(\d+(?:\.\d+)?)\s*(lakhs?|crores?|cr|l)\b'
    budget_match = re.search(budget_pattern, transcript, re.IGNORECASE)
    if budget_match:
        entities['budget_stated'] = normalize_amount(
            budget_match.group(1), budget_match.group(2)
        )

    # BHK extraction (case-insensitive so "Bhk" and "Bedroom" also match)
    bhk_pattern = r'(\d)\s*(?:bhk|bedroom)'
    bhk_match = re.search(bhk_pattern, transcript, re.IGNORECASE)
    if bhk_match:
        entities['bhk_preference'] = int(bhk_match.group(1))

    # Possession timeline
    possession_keywords = {
        'ready': 'immediate',
        'ready to move': 'immediate',
        'under construction': 'under_construction',
        '6 months': '6_months',
        '1 year': '12_months',
        '2 years': '24_months'
    }
    for keyword, value in possession_keywords.items():
        # Word boundaries keep "ready" from matching inside "already"
        if re.search(r'\b' + re.escape(keyword) + r'\b', lowered):
            entities['possession_preference'] = value
            break

    return entities

The TTS engine converts LLM-generated text responses to natural-sounding speech. For Indian real estate, TTS quality directly impacts buyer engagement: robotic or monotone voices trigger immediate hang-ups. Requirements: natural Hindi/English/Hinglish prosody, under 200ms first-byte latency for streaming, correct pronunciation of Indian names and place names, and an appropriate formal register for real estate context.
Recommended: ElevenLabs Turbo v3 (lowest latency, highest naturalness) for English/Hinglish; Sarvam AI TTS for Hindi-primary conversations. Stream audio in 50ms chunks to the telephony layer to minimize perceived response latency.
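The 50ms chunk size translates into a fixed byte count on a telephony stream. The sketch below works through that arithmetic, assuming the TTS output has already been transcoded to the 8 kHz G.711 format described for the telephony stream (most TTS engines emit higher sample rates natively). `chunk_audio` is a hypothetical helper.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 8000   # narrowband telephony rate
BYTES_PER_SAMPLE = 1    # G.711 encodes each sample as one byte
CHUNK_MS = 50

# 8000 samples/s * 1 byte * 0.050 s = 400 bytes per chunk
CHUNK_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHUNK_MS // 1000


def chunk_audio(audio: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Yield fixed-size slices; the final slice may be shorter."""
    for offset in range(0, len(audio), chunk_bytes):
        yield audio[offset:offset + chunk_bytes]
```

For example, one second of audio (8,000 bytes) becomes twenty 400-byte chunks, so the first audio reaches the caller after roughly one chunk duration instead of after the whole utterance is synthesized.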
The LLM layer is the conversational intelligence of the system — it receives conversation history, extracted entities, and project knowledge base context, and generates the next AI turn. Every inference call receives a structured prompt composed of four context blocks: system instructions with project data and qualification objectives, known entities from the current call, call state, and detected language.
import json


def build_inference_prompt(
    conversation_history: list,
    entities: dict,
    project_kb: dict,
    call_state: CallState
) -> list:
    """Return the chat messages (system prompt plus history) for one inference call."""
    system_prompt = f"""
    You are a professional real estate qualification specialist for {project_kb['developer_name']}.
    Project: {project_kb['project_name']} | Location: {project_kb['location']}
    RERA/HARERA Registration: {project_kb['rera_number']}
    Available inventory: {json.dumps(project_kb['inventory'])}
    Pricing: {json.dumps(project_kb['pricing'])}
    Possession date: {project_kb['possession_date']}

    QUALIFICATION OBJECTIVES (collect in order):
    1. Confirm budget range (project starts at {project_kb['price_floor']})
    2. Confirm BHK requirement ({', '.join(project_kb['available_bhk'])} available)
    3. Confirm possession timeline preference
    4. Confirm end-use vs investment purpose
    5. Book site visit if qualified (available slots: {project_kb['site_visit_slots']})

    KNOWN ENTITIES FROM THIS CALL:
    {json.dumps(entities, indent=2)}

    CALL STATE: {call_state.value}
    LANGUAGE: {call_state.detected_language}

    Respond in {call_state.detected_language}. Keep responses under 40 words.
    If buyer switches language, match immediately.
    Never make up project data not in the context above.
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(conversation_history)
    return messages

Real-time conversation requires LLM response generation in under 800ms end-to-end (from ASR output to TTS audio start). That constraint rules out large-model inference unless it is optimized.
| LLM | Typical Latency (India routing) | Quality | Cost per 1K input tokens (USD) |
|---|---|---|---|
| GPT-4o mini | 280–450ms | High | $0.00015 |
| Claude Haiku 4.5 | 250–400ms | High | $0.00025 |
| Gemini 1.5 Flash | 200–380ms | High | $0.000075 |
| GPT-4o (full) | 600–1,200ms | Very High | $0.0025 |
| Llama 3.3 70B (self-hosted, A100) | 180–320ms | Good | Infrastructure |
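The 800ms budget can be checked against the worst-case latencies in the table. This is a rough sketch that assumes the 200ms TTS first-byte target from the speech layer and ignores NLU and network overhead, so real headroom is smaller. `fits_budget` and the constant names are hypothetical.

```python
BUDGET_MS = 800            # ASR output to TTS audio start
TTS_FIRST_BYTE_MS = 200    # TTS streaming requirement from the speech layer

# (best case, worst case) latency in ms, copied from the table
LLM_LATENCY_MS = {
    "GPT-4o mini": (280, 450),
    "Claude Haiku 4.5": (250, 400),
    "Gemini 1.5 Flash": (200, 380),
    "GPT-4o (full)": (600, 1200),
    "Llama 3.3 70B (self-hosted, A100)": (180, 320),
}


def fits_budget(worst_llm_ms: int) -> bool:
    """True when worst-case LLM latency plus TTS first byte stays under the budget."""
    return worst_llm_ms + TTS_FIRST_BYTE_MS < BUDGET_MS


within_budget = [name for name, (_, worst) in LLM_LATENCY_MS.items() if fits_budget(worst)]
```

Under these assumptions only the full GPT-4o model exceeds the budget in the worst case (1,200 + 200 = 1,400ms); even its best case (600 + 200 = 800ms) leaves no margin.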
Recommendation: GPT-4o mini or Gemini 1.5 Flash for qualification turns (high speed, sufficient quality for structured qualification). Reserve the full GPT-4o model for escalation detection and complex objection turns where response quality is worth the latency cost. Use streaming API responses for all LLM providers — begin TTS generation as the first tokens arrive rather than waiting for the complete response, reducing perceived latency by 200–350ms.
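Streaming into TTS usually means buffering tokens only until a sentence boundary and handing each sentence off immediately. The sketch below shows that buffering step in isolation. `sentences_from_tokens` is a hypothetical helper, and the token iterable stands in for whatever streaming interface the chosen LLM provider exposes.

```python
from typing import Iterable, Iterator

# Sentence terminators, including the Devanagari danda used in Hindi
SENTENCE_END = (".", "?", "!", "।")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its terminator arrives in the token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment with no terminator
```

This simple boundary check will split early if a token ends on a decimal point (for example "1." followed by "2 crore"), so a production version would need a guard for numbers and abbreviations.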
When a call ends, the disposition layer classifies the call outcome into structured CRM-ready data. This classification must be deterministic — the same call outcome should always produce the same disposition — and comprehensive enough to drive all downstream workflows.
from enum import Enum


class Disposition(Enum):
    SITE_VISIT_BOOKED = "site_visit_booked"
    QUALIFIED_CALLBACK = "qualified_callback_requested"
    QUALIFIED_CONSIDERING = "qualified_considering"
    BUDGET_MISMATCH = "budget_mismatch"
    TIMELINE_MISMATCH = "timeline_mismatch"
    NOT_IN_MARKET = "not_in_market"
    ALREADY_PURCHASED = "already_purchased"
    LANGUAGE_BARRIER = "language_barrier_escalate"
    REQUESTED_HUMAN = "human_escalation_requested"
    NO_ANSWER = "no_answer"
    INVALID_NUMBER = "invalid_number"
    CALL_DROPPED = "call_dropped_retry"
    DNC_REQUESTED = "do_not_call"


def classify_disposition(
    entities: dict,
    conversation_history: list,
    call_outcome: str,
    project_kb: dict
) -> Disposition:
    # Hard outcomes (deterministic)
    if call_outcome == "no_answer": return Disposition.NO_ANSWER
    if call_outcome == "invalid": return Disposition.INVALID_NUMBER

    # Check DNC signal. Join the message contents instead of str(list), whose
    # repr can escape apostrophes and hide phrases such as "don't call".
    spoken_text = " ".join(
        turn.get("content", "") for turn in conversation_history
    ).lower()
    if any(phrase in spoken_text
           for phrase in ["don't call", "remove number", "not interested ever"]):
        return Disposition.DNC_REQUESTED

    # Site visit booked (highest priority positive)
    if entities.get("site_visit_confirmed"):
        return Disposition.SITE_VISIT_BOOKED

    # Budget qualification: more than 20% below the price floor is a mismatch
    stated_budget = entities.get("budget_stated")
    if stated_budget:
        if stated_budget < project_kb["price_floor"] * 0.80:
            return Disposition.BUDGET_MISMATCH

    # Continue classification logic...
    return Disposition.QUALIFIED_CONSIDERING

The CRM integration layer writes structured call outcomes to the developer's CRM within 30 seconds of call completion. For real estate, this typically means Sell.Do, LeadSquared, Salesforce Real Estate Cloud, Kylas, or Freshsales. Every call produces a standardized disposition payload written to CRM regardless of which CRM is deployed:
{
  "lead_id": "CRM_XXXXXXXX",
  "external_lead_id": "99A_XXXXXXXXXX",
  "call_session_id": "CALL_UUID_XXXXXXXX",
  "call_timestamp": "2026-07-04T14:23:17+05:30",
  "call_duration_seconds": 247,
  "recording_url": "https://storage.platform.io/recordings/CALL_UUID.mp3",
  "transcript_url": "https://storage.platform.io/transcripts/CALL_UUID.txt",
  "disposition": "site_visit_booked",
  "intent_score": 84,
  "qualification_data": {
    "budget_stated_min": 11000000,
    "budget_stated_max": 14000000,
    "bhk_preference": 3,
    "possession_preference": "within_18_months",
    "purpose": "end_use",
    "loan_required": true,
    "primary_objection": null,
    "language_used": "hinglish"
  },
  "site_visit": {
    "booked": true,
    "slot_date": "2026-07-06",
    "slot_time": "11:00",
    "confirmation_sent": true
  },
  "next_action": {
    "type": "site_visit_reminder",
    "scheduled_for": "2026-07-05T10:00:00+05:30"
  }
}

Idempotency: all CRM write-back calls include an X-Idempotency-Key header (value: call_session_id). If the write-back fails and retries, the CRM deduplicates based on this key, preventing duplicate lead records or double-booked site visits.
Retry policy for CRM write-back: Attempt 1 immediate post-call; Attempt 2 at 30 seconds if Attempt 1 fails; Attempt 3 at 5 minutes if Attempt 2 fails; dead letter queue if all 3 fail, for manual review — never silently drop CRM updates.
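The idempotency key and retry policy can be combined in one write-back routine. This is a sketch rather than any CRM's actual client: `send` and `dead_letter` are hypothetical callables standing in for the HTTP call and the dead letter queue, and the 30-second and 5-minute figures are interpreted here as the wait before each retry.

```python
import time
from typing import Callable

# Wait before each attempt: immediate, 30 s, 5 min (interpretation of the retry policy)
RETRY_DELAYS_SECONDS = [0, 30, 300]


def write_back_with_retry(
    payload: dict,
    send: Callable[[dict, dict], None],
    dead_letter: Callable[[dict], None],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try the CRM write up to three times; park the payload for manual review on failure."""
    # Same key on every attempt so the CRM can deduplicate retried writes
    headers = {"X-Idempotency-Key": payload["call_session_id"]}
    for delay in RETRY_DELAYS_SECONDS:
        sleep(delay)
        try:
            send(payload, headers)
            return True
        except Exception:
            # Broad catch for brevity; real code would catch transport and 5xx errors only
            continue
    dead_letter(payload)  # never silently drop a CRM update
    return False
```

Injecting `sleep` keeps the function testable without real waits. In a live system the delays would more likely be scheduled through a task queue than by blocking a worker for five minutes.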
Production AI Calling systems require real-time observability across all seven layers. The key metrics fall into two groups: system health and business performance.

System health metrics:
| Metric | Alert Threshold | Measurement |
|---|---|---|
| ASR transcription latency | > 350ms p95 | Per-utterance timestamp |
| LLM inference latency | > 900ms p95 | Inference request duration |
| TTS first-byte latency | > 250ms p95 | Streaming start time |
| Call setup latency (trigger → ring) | > 120 seconds | Lead receipt to outbound ring |
| CRM write-back failure rate | > 2% | Failed writes / total calls |
| Concurrent call capacity utilization | > 85% | Active calls / trunk capacity |
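The p95 thresholds in the system health table can be evaluated with a nearest-rank percentile over a window of latency samples. This is a minimal sketch. `p95`, `breached`, and the metric keys are hypothetical names, and a monitoring stack would normally compute percentiles itself.

```python
# Alert thresholds in ms for the three p95 latency metrics in the table
P95_THRESHOLDS_MS = {"asr": 350, "llm": 900, "tts_first_byte": 250}


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def breached(metric: str, samples_ms: list) -> bool:
    """True when the window's p95 exceeds the alert threshold for `metric`."""
    return p95(samples_ms) > P95_THRESHOLDS_MS[metric]
```

With 20 samples the rank is 19, so a single slow outlier does not trip the alert but two do.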

Business performance metrics:

| Metric | Target | Frequency |
|---|---|---|
| Lead contact rate | > 90% | Daily |
| Qualification completion rate | > 50% | Daily |
| Site visit booking rate (per qualified) | > 20% | Weekly |
| Call drop rate (< 30 seconds) | < 15% | Daily |
| DNC opt-out rate | < 3% | Weekly |
| CRM field completion rate | > 94% | Weekly |
| After-hours lead coverage rate | > 95% | Daily |
The architecture described above handles 100–500 concurrent calls adequately on modest infrastructure. Scaling to 10,000 concurrent calls (realistic for enterprise developers during major project launches) requires specific architectural decisions at each layer.
Disclaimer: API specifications, latency benchmarks, pricing figures, and provider recommendations in this article reflect market conditions and documented API capabilities as of Q2 2026. Cloud provider pricing, API rate limits, and service specifications change frequently. Verify current pricing and capabilities directly with each provider before architectural commitments. Infrastructure cost estimates are indicative ranges — actual costs depend on call volume, call duration, model selection, self-hosted vs. managed infrastructure choices, and regional data residency requirements. This content is for technical evaluation purposes only.