VMware Cloud Director Availability
- sathyahraj


What Is VMware Cloud Director Availability (VCDA)?
VCDA is a Disaster Recovery-as-a-Service (DRaaS) solution provided through VMware's Partner Connect ecosystem. It lets tenants and cloud providers replicate, migrate, fail over, and reverse-failover vApps and VMs between on-premises vCenter and cloud environments, or between Cloud Director sites, supporting both migration and disaster recovery workflows.
Key operations include:
Onboarding and migration: Facilitates seamless movement of workloads without modifications to vApps or VMs.
Self-service DR and recovery: Tenants can manage protection, failover, reverse-failover, and test recovery via UI or portal.
Flexible workflows: Supports cloud-to-cloud, on-prem-to-cloud, cloud-to-on-prem, and even vCenter-to-vCenter topologies.
Version enhancements (e.g., in version 4.6):
1-minute RPOs
Tunnel appliance HA (load-balanced redundancy)
Recovery Plans, bandwidth throttling, recovery priorities, and public APIs.
High Availability Architecture in VCDA
Cloud Director Cloud Site (Provider Side)
Core components per HA instance:
Cloud Director Replication Management Appliance
One or more Replicator Appliance(s)
One Tunnel Appliance, or optionally two configured in active-active mode for HA.
Deploying multiple appliances across sites allows for scalable, multi-tenant DR services.
vCenter-Only (No Cloud Director)
Deployments include:
vCenter Replication Management Appliance
Optionally, Replicator Appliance(s)
This configuration still supports HA for migration/DR without needing Cloud Director.
On-Premises Tenant Site
Tenants deploy an On-Premises Appliance (OVA-based) in vCenter:
Serves as Replication Engine, Tunnel Node, and UI portal
Paired to a provider-side cloud or vCenter site
Once paired, tenants can replicate and perform DR operations securely.
Cloud Director Core HA Mechanisms
Setting VCDA aside, the base Cloud Director platform maintains availability through:
Multi-cell architecture:
Deploy multiple Cloud Director cells (instances) that connect to a shared database and transfer server storage.
Cells are stateless, allowing restarts without data loss.
Database HA cluster:
One primary cell plus two standby cells—automated failover can be configured.
Cluster health states: Healthy (primary + ≥2 standbys), Degraded (primary + 1 standby); if degraded, the system is vulnerable until a standby is restored (see the health-check sketch after this list).

Load balancing:
A session-aware load balancer ensures continuity by routing requests across cells.
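
A quick way to check the database cluster state described above is the appliance management API exposed on each cell. A minimal sketch, assuming a recent appliance release; the port 5480 endpoint, the /api/1.0.0/nodes path, the auth model, and the response field names are assumptions to verify against your version's API reference:

```python
# Sketch: query a Cloud Director cell's appliance management API for DB cluster
# health. The 5480 port, /api/1.0.0/nodes path, auth model, and response fields
# are assumptions; verify them against your appliance version.
import requests

APPLIANCE = "https://vcd-cell-01.example.com:5480"   # placeholder cell address
resp = requests.get(f"{APPLIANCE}/api/1.0.0/nodes", auth=("root", "***"), verify=True)
resp.raise_for_status()
nodes = resp.json().get("nodes", [])

# Map the node list onto the Healthy / Degraded states described above.
healthy_standbys = [n for n in nodes
                    if n.get("role") == "standby" and n.get("status") == "healthy"]
state = "Healthy" if len(healthy_standbys) >= 2 else "Degraded"
print(f"{len(healthy_standbys)} healthy standby cell(s): cluster is {state}")
```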
| Layer | HA Mechanisms |
| --- | --- |
| VCDA (DRaaS) | Redundant Tunnel Appliances (active-active), multiple Replicators, management appliances |
| vCenter DR/Migration Only | Replicator appliances (scalable layering) |
| On-Prem Tenant Site | Single appliance; high availability relies on provider site redundancy |
| Cloud Director Platform | Multi-cell deployment + DB HA cluster + load balancer |
Reference diagrams (images not included here):
Modern vSphere DR/migration setup from the VCDA 4.4 release, covering cloud-to-cloud and vCenter-to-vSphere flows
Ransomware recovery flow showing the integration between tenant, VCDA, and recovery services
VCDA 4.1 port mapping and component interaction, a classic diagram useful for understanding the initial setup
Multi-NIC appliance deployment layout, illustrating network interface configurations for complex environments
Scaling Limits & Deployment Sizing
Scale-Out via Replicator Appliances
For large-scale environments (managing over 6,000 replications), VCDA recommends scaling out by deploying more Cloud Replicator appliances to distribute load and prevent out-of-memory issues on the Cloud Replication Management, Replicator, or Tunnel appliances. This scaling is applied on the cloud/provider side only—no changes required on the on-premises appliances.
Minimum Hardware Specs & Appliance Roles
According to official deployment requirements (VCD Availability 4.5+):
Replicator services can now be deployed alongside the management services in the vCenter Replication Management Appliance.
Typical specs:
Replicator Appliance: 8 vCPUs, 8 GB RAM, 10 GB storage
Tunnel Appliance: 4 vCPUs, 4 GB RAM, 10 GB storage
HA: Starting in VCDA 4.6, a second Tunnel Appliance can be deployed for high availability.
A minimal combined appliance (for testing only) includes all services (Manager, Replicator, Tunnel, Tenant UI) and is sized at 4 vCPUs, 6 GB RAM, 10 GB storage—but is not recommended for production.
Deployment Topologies
Production deployments typically involve multiple VCDA instances per provider Cloud Director site, each mapped to different Provider VDCs. This enables granular control and scalability across diverse cloud environments.
On the on-premises side, multiple “On-Premises Replication Appliance” instances can be deployed and paired with the same cloud organization for load distribution and redundancy.
Recommended Sizing & Resilience Strategy
Cloud Replication Management (Manager) Appliance
This serves as the management entrypoint (UI/API), coordinating replication activities. Its sizing should be aligned with expected workloads and tenant count.
Scale-Out Strategy
Start with a robust base deployment: one Manager Appliance, one Tunnel Appliance, and at least one Replicator Appliance.
Add more Replicator Appliances as replication volume and tenant count grow.
Add a second Tunnel Appliance in VCDA 4.6+ environments for HA.
Use Cases and Deployment Guides
An official VMware white paper—"Architecting VMware Cloud Director Availability Solution in a Multi-Cloud Environment"—offers detailed topology options, traffic flows, and port-level design rationale.

| Component | Role | Minimum Specs | Scalability / HA Notes |
| --- | --- | --- | --- |
| Manager Appliance | UI, API, orchestrator | (Not specified)* | Scale with load |
| Replicator Appliance | Handles replication streams | 8 vCPU, 8 GB RAM, 10 GB | Deploy multiple for scale-out |
| Tunnel Appliance | Secures replication traffic | 4 vCPU, 4 GB RAM, 10 GB | Add second for HA (VCDA 4.6+) |
| Combined Appliance (test only) | All-in-one single-node deployment | 4 vCPU, 6 GB RAM, 10 GB | Only for labs/evaluations |
| On-Prem Replication Appliance | Tenant-side replication initiation | (Standard per role) | Multiple instances supported |
| Provider VDC Instances (Cloud) | Logical isolation for multi-tenancy | N/A | Each may require its own VCDA instance |
The rest of this post goes deeper into three areas:
1) Recovery workflows — step-by-step (with checks & gotchas)
2) Sizing & capacity guidance (how many appliances, appliance specs, scale-out guidance)
3) Detailed port / network mapping for complex topologies (multi-NIC, tunnel, DNAT notes)
1) Recovery workflows — deep dive (concept → actions → validation)
This covers the whole life cycle: pairing → protect (seed) → ongoing replication → recovery plan (test/failover) → failback / reverse protection → cleanup.
a. Pairing & onboarding (pre-reqs)
Pair the sites (cloud ↔ cloud or on-prem ↔ cloud) to establish trust (X.509) so appliances discover each other and authenticated sessions work. The pairing step is administrative and requires accepting certificates from the remote endpoint. Pairing enables discovery of workloads and destination resources.
b. Protection (initial seed + ongoing sync)
Create a replication for a VM/vApp from the tenant UI (or vCenter plugin). The first time you replicate, VCDA may require a seed copy / initial full copy (over LAN or shipped seed) then subsequent changes are transferred as deltas.
Configure SLA profile / retention / RPO per replication (recovery frequency). VCDA supports low RPOs (minutes) depending on infrastructure. Monitor the first sync closely — it’s the most network/storage intensive step.
c. Recovery Plans (orchestration)
Recovery Plans let you group replications and define the order and steps (power-on order, network maps, pre/post scripts, prompt steps). They are the DR playbooks you run for test or actual failover. Plans can be validated before execution. You can schedule, test, suspend, and manage plans via UI or API. Test operations are first-class API calls (there’s a POST /recovery-plans/{id}/test endpoint that returns a task).
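
Because the test operation is an API call, rehearsals can be scripted. A minimal sketch using the POST /recovery-plans/{id}/test endpoint mentioned above; the base URL, auth header, plan ID, and task field names are placeholders and assumptions, so check the VCDA API reference for your version:

```python
# Sketch: trigger a Recovery Plan test and poll the returned task. Only the
# /recovery-plans/{id}/test path comes from the text above; base URL, auth
# header, plan ID, and task fields are placeholders/assumptions.
import time
import requests

BASE = "https://vcda-manager.provider.example.com"   # placeholder manager endpoint
HEADERS = {"Accept": "application/json",
           "Authorization": "Bearer <token>"}         # auth scheme varies by version
PLAN_ID = "<recovery-plan-id>"                        # placeholder

resp = requests.post(f"{BASE}/recovery-plans/{PLAN_ID}/test", headers=HEADERS)
resp.raise_for_status()
task = resp.json()

# Poll until the task leaves its running state (state/id field names assumed).
while task.get("state") in ("QUEUED", "RUNNING"):
    time.sleep(10)
    task = requests.get(f"{BASE}/tasks/{task['id']}", headers=HEADERS).json()

print("Test failover task finished with state:", task.get("state"))
```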
d. Test failover (safe rehearsal)
Test failover spins up replicas in an isolated network (so you don’t disturb production). VCDA’s test validates preconditions (availability of target resources, network mappings, missing artifacts) and creates a sandbox recovery so you can run app-level verification. Use this often during runbooks and after infra changes.
e. Planned failover (controlled migration)
For planned events (maintenance/migration) you can perform a planned failover which attempts to minimize data loss: final sync, cutover, and power on at target. This is similar to migration flows but uses the same protection stack.
f. Unplanned failover (disaster)
When source is down: run the Recovery Plan failover — VCDA will bring up VMs at the DR site following plan steps (networking, IP mappings, power sequence, pre/post scripts). Confirm DNS, load balancer entries, and app connectivity. After failover, you may need to re-protect the recovered VMs (reverse replication) to enable failback.
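
A post-failover smoke check can be scripted into the runbook. A minimal sketch, assuming a list of recovered services; hostnames, ports, and the health path are placeholders for your environment:

```python
# Sketch: post-failover smoke test. For each recovered service, resolve DNS,
# open a TCP connection, and optionally hit an HTTP health endpoint. All
# hostnames, ports, and paths below are placeholders.
import socket
import requests

SERVICES = [
    {"host": "app01.dr.example.com", "port": 443,  "health": "/healthz"},
    {"host": "db01.dr.example.com",  "port": 5432, "health": None},
]

for svc in SERVICES:
    try:
        addr = socket.getaddrinfo(svc["host"], svc["port"])[0][4][0]  # DNS now points at DR?
        with socket.create_connection((svc["host"], svc["port"]), timeout=5):
            pass                                                      # TCP port reachable
        if svc["health"]:
            requests.get(f"https://{svc['host']}{svc['health']}", timeout=5).raise_for_status()
        print(f"OK   {svc['host']} ({addr}):{svc['port']}")
    except Exception as exc:
        print(f"FAIL {svc['host']}:{svc['port']} -> {exc}")
```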
g. Failback / reverse protection
Failback usually follows: re-establish replication from DR site back to primary (reverse protection), resync any deltas, perform a final cutover back to primary, and clean up the temporary recovered resources. Plans support migrating and re-protect steps to automate parts of this. Always test failback during non-production windows.
h. Operational checks / runbook items (practical)
Validate Recovery Plan preconditions (storage space, networks, availability of target VMs). VCDA 4.6+ enhances validation checks.
Monitor RPO violations and investigate bottlenecks: tunnel CPU, replicator CPU, vSAN/datastore saturation, network uplink. If RPOs slip, scale replicators or tune network/storage.
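
One simple way to watch for RPO slippage is to compare each replication's last completed sync against its RPO target. A minimal sketch over already-fetched replication records; the field names and sample values are illustrative, not the real VCDA schema:

```python
# Sketch: flag replications whose last completed sync is older than the RPO
# target. Records are assumed to be fetched already (e.g. via the VCDA API);
# the field names and sample values are illustrative, not the real schema.
from datetime import datetime, timedelta, timezone

replications = [
    {"vm": "erp-db-01", "rpo_minutes": 5,  "last_sync": "2024-05-01T10:02:00+00:00"},
    {"vm": "web-fe-02", "rpo_minutes": 15, "last_sync": "2024-05-01T09:58:00+00:00"},
]

now = datetime.now(timezone.utc)
for rep in replications:
    lag = now - datetime.fromisoformat(rep["last_sync"])
    if lag > timedelta(minutes=rep["rpo_minutes"]):
        # RPO violated: check tunnel/replicator CPU, datastore, and uplink next.
        print(f"RPO VIOLATION {rep['vm']}: {lag} behind, target {rep['rpo_minutes']} min")
```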
2) Sizing & capacity guidance — recommended VM sizing per tenant load
Two key scaling levers: add replicator appliances (scale-out) and right-size existing appliances (avoid OOM/CPU starvation). VCDA is designed to scale horizontally by adding replicators.
Official appliance specs (typical / minimum)
(these are explicit appliance sizes from VMware docs / whitepapers for VCDA components)
Cloud / vCenter Replication Management Appliance (Manager) — coordinates UI/API and orchestrates replication flows. (Sizing: scale by expected concurrent API/UI sessions; exact cores depend on tenant load — monitor and scale).
Cloud Replicator Appliance — handles replication streams (recommended baseline in docs): ~8 vCPU / 8 GB RAM (older docs and cloud deployments show 4–8 vCPU / 6–8 GB depending on version). VMware recommends deploying at least two Replicator appliances per VCDA instance and scaling out replicators based on the number of active protections.
Cloud Tunnel Appliance — proxies/tunnels replication traffic and supports HA (VCDA 4.6+ supports active-active tunnel appliances). Typical spec: 4 vCPU / 4 GB RAM (some docs show smaller lab specs but production customers use larger sizes).
Combined appliance (lab/test only) — all services in one VM (small labs only): ~4 vCPU / 6 GB RAM (not for production).
How many replicators / how to size for tenants
VMware’s guidance is behavioural (scale-out based on active protections) rather than a fixed “N VMs per replicator” number because I/O profile, RPO, VM size, and network impact capacity. Practical approach:
Start template (provider cloud):
Manager: 1 appliance (monitor load)
Tunnel: 1 appliance (add second for HA)
Replicator: 2 appliances (baseline for production) — distribute active protections across them.
Measure & threshold: pick KPI thresholds (per replicator): CPU > 70% sustained, memory > 75%, increasing replication latency, or RPO violations. When thresholds are hit, add another replicator (see the sketch after the example ranges below). VMware explicitly recommends adding replicators to scale beyond thousands of replications.
Example (illustrative, adapt to your workload):
Light environment: 1–2 replicators → up to ~200–500 active replications (depends on VM I/O & RPO).
Medium environment: 3–5 replicators → 500–2000 active replications.
Large environment: scale to many replicators where each handles a “bucket” of protections; aim for distribution across ESXi hosts (DRS rule: Separate Virtual Machines).
Note: these example ranges are indicative — run pilot tests and measure.
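
A minimal sketch of the threshold-driven decision from the "Measure & threshold" step above; KPI collection (from vCenter or your monitoring stack) is not shown, and the thresholds mirror the rough values in the text and should be tuned to your own measurements:

```python
# Sketch: threshold-driven scale-out decision. KPI collection is not shown;
# the thresholds mirror the rough values above and should be tuned.
CPU_LIMIT = 0.70      # sustained CPU utilisation
MEM_LIMIT = 0.75      # memory utilisation

replicator_kpis = {
    "vcda-replicator-01": {"cpu": 0.82, "mem": 0.71, "rpo_violations": 3},
    "vcda-replicator-02": {"cpu": 0.55, "mem": 0.60, "rpo_violations": 0},
}

overloaded = [name for name, k in replicator_kpis.items()
              if k["cpu"] > CPU_LIMIT or k["mem"] > MEM_LIMIT or k["rpo_violations"] > 0]

if overloaded:
    print("Consider adding a Replicator appliance; overloaded:", ", ".join(overloaded))
else:
    print("Current replicator count looks sufficient.")
```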
Tenant VM sizing guidance (for provider operators)
Per-tenant recommendations: limit number of concurrently protected/high-IO VMs per tenant SLA tier. Offer tiers like Bronze / Silver / Gold with fixed RPOs and max concurrent protected VMs — this avoids noisy-neighbor effects. Base the max on your measured throughput per replicator. (This is an ops pattern recommended in the architecting guides).
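
One way to encode such tiers in provider tooling, purely illustrative; the RPO and concurrency numbers are placeholders that should come from your measured per-replicator throughput:

```python
# Sketch: SLA tiers as data so provider tooling can enforce RPO and concurrency
# limits per tenant. The numbers are placeholders; derive real values from the
# measured throughput of your replicators.
from dataclasses import dataclass

@dataclass(frozen=True)
class ProtectionTier:
    name: str
    rpo_minutes: int
    max_protected_vms: int       # concurrently protected VMs allowed per tenant

TIERS = {
    "bronze": ProtectionTier("Bronze", rpo_minutes=240, max_protected_vms=25),
    "silver": ProtectionTier("Silver", rpo_minutes=60,  max_protected_vms=50),
    "gold":   ProtectionTier("Gold",   rpo_minutes=5,   max_protected_vms=100),
}

def can_protect(tier_key: str, currently_protected: int) -> bool:
    """Admission check before accepting a new replication for a tenant."""
    return currently_protected < TIERS[tier_key].max_protected_vms

print(can_protect("gold", currently_protected=99))   # True, one slot left
```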
Sizing checklist
Deploy at least two replicators per VCDA instance.
Keep replicator VMs on different ESXi hosts (a "Separate Virtual Machines" DRS rule; see the sketch after this checklist) and on resource clusters with access to the replication vmkernel network.
Add second Tunnel appliance for HA (VCDA 4.6+).
Monitor RPOs, latency, CPU ready, vSAN/datastore throughput, and add replicators when replication latency/RPO violations increase.
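
For the "Separate Virtual Machines" item above, the DRS rule can be created programmatically. A minimal pyVmomi sketch, assuming two Replicator VMs in one cluster; the vCenter address, credentials, and inventory names are placeholders:

```python
# Sketch: create a "Separate Virtual Machines" (anti-affinity) DRS rule so the
# Replicator VMs never share an ESXi host. vCenter address, credentials, and
# inventory names are placeholders; error handling and task waiting are omitted.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

ctx = ssl._create_unverified_context()        # lab only; validate certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()

cluster = find_by_name(content, vim.ClusterComputeResource, "Provider-Cluster")
replicators = [find_by_name(content, vim.VirtualMachine, n)
               for n in ("vcda-replicator-01", "vcda-replicator-02")]

rule = vim.cluster.AntiAffinityRuleSpec(name="vcda-replicators-separate",
                                        enabled=True, vm=replicators)
spec = vim.cluster.ConfigSpecEx(
    rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)   # returns a vCenter task
Disconnect(si)
```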
3) Detailed port / network mappings and complex multi-NIC topologies
The ports below are the key network mapping points you will need when designing firewall, DNAT, and NSX/Edge rules; a quick port-reachability sketch follows after the multi-NIC notes.
| Port | Protocol | Source → Destination | Purpose / Service |
| --- | --- | --- | --- |
| 8047/tcp | TCP | Manager/Cloud → Tunnel | Tunnel API (management), used by the cloud manager to enable and maintain the tunnel |
| 8048/tcp | TCP | Remote tunnel / on-prem replicator → Tunnel | Tunnel data endpoint; actual replication data flows to this port. Use DNAT to forward external 8048 to the tunnel IP |
| 8043/tcp | TCP | Tunnel ↔ Replicator | Management traffic used by the Tunnel service to talk to the Replicator |
| 8044/tcp | TCP | Replication Manager ↔ Tunnel / Manager service | Management endpoint used by Manager/Replicator (seen in multi-NIC configs) |
| 8443/tcp | TCP | Admin UI/API → Manager appliance | Manager UI/API (the Manager service UI appears on 8441/8443 in some configs) |
| 443/tcp | TCP | External (optional) → Tunnel (DNAT) | Public publishing of tunnel endpoints; can be translated to 8048 internally |
| NFC / ESXi vmkernel (typically 902) | TCP/UDP | Replicator ↔ ESXi vmkernel | VM data transfer (NFC protocol) from ESXi; the replicator must be on a network that reaches the ESXi vmkernel (configure LWD/NFC addresses) |
DNAT / Edge NAT: When exposing a tunnel to the internet, it’s common to DNAT external port (443 or custom) to tunnel internal 8048; pairing and replication traffic uses 8048 for data and 8047 for control.
Multi-NIC appliances: Appliances often have several NICs (management, NFC, replication). Only one tunnel interface is used to talk to local VCDA components — you must choose which interface the Tunnel uses for local service and set static routes as necessary. Reconfigure tunsetendpoints / rtrsetendpoints / c4 setendpoints if you change NICs. The docs give exact CLI/UI steps.
Replicator NIC mapping: The replicator needs a Mgmt address (to talk to manager/tunnel) and an LWD/NFC address to talk to ESXi vmkernel for data-plane traffic — configure these intentionally to keep replication traffic on the fastest path.
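
Given the mappings above, it helps to confirm the control and data ports are reachable from the site that initiates pairing before attempting it. A minimal sketch; the hostnames are placeholders, and the port list should match your topology (for example, 443 instead of 8048 when published behind a DNAT):

```python
# Sketch: confirm the control and data ports from the table above are reachable
# from the pairing site. Hostnames are placeholders; swap 8048 for 443 if the
# tunnel is published behind a DNAT.
import socket

CHECKS = [
    ("vcda-tunnel.provider.example.com", 8048),    # tunnel data endpoint
    ("vcda-tunnel.provider.example.com", 8047),    # tunnel management API
    ("vcda-manager.provider.example.com", 8443),   # manager UI/API
]

for host, port in CHECKS:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"open   {host}:{port}")
    except OSError as exc:
        print(f"closed {host}:{port} ({exc})")
```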
Example complex topology patterns
Public internet pairing (tenant on-prem → provider cloud)
Expose Tunnel appliances to public IP (DNAT 443 → 8048) and ensure 8047/8048 and management ports are reachable. Pair from on-prem replicator to Tunnel public FQDN.
Private peered network between data centers (no DNAT)
Use tunnel interfaces on private subnets; pair using private interface IP and port 8048. No DNAT required, lower latency, simpler firewall rules.
Multi-NIC for high throughput
Use a dedicated L2/L3 segment for replication traffic (the replicator's LWD/NFC interface facing the replication vmkernel network), and a separate management NIC for the control plane (Manager/Tunnel traffic). Configure static routes and set the mgmtAddress/lwdAddress parameters as recommended.




