
VMware Cloud Director Availability


What Is VMware Cloud Director Availability (VCDA)?

 

VCDA is a Disaster Recovery-as-a-Service (DRaaS) solution provided through VMware's Partner Connect ecosystem. It empowers tenants and cloud providers to replicate, migrate, fail over, and reverse failover vApps and VMs between on-premises vCenter and cloud environments, or between Cloud Director sites—supporting both migration and disaster recovery workflows.

 

Key operations include:

 

Onboarding and migration: Facilitates seamless movement of workloads without modifications to vApps or VMs.

 

Self-service DR and recovery: Tenants can manage protection, failover, reverse-failover, and test recovery via UI or portal.

 

Flexible workflows: Supports cloud-to-cloud, on-prem-to-cloud, cloud-to-on-prem, and even vCenter-to-vCenter topologies.

 

Version enhancements (e.g., in version 4.6):

 

1-minute RPOs

 

Tunnel appliance HA (load-balanced redundancy)

 

Recovery Plans, bandwidth throttling, recovery priorities, and public APIs.

 

High Availability Architecture in VCDA

 

Cloud Director Cloud Site (Provider Side)

 

Core components per HA instance:

 

Cloud Director Replication Management Appliance

 

One or more Replicator Appliance(s)

 

One Tunnel Appliance, or optionally two configured in active-active mode for HA.

 

Deploying multiple appliances across sites allows for scalable, multi-tenant DR services.

 

vCenter-Only (No Cloud Director)

 

Deployments include:

 

vCenter Replication Management Appliance

 

Optionally, Replicator Appliance(s)

 

This configuration still supports HA for migration/DR without needing VMware Cloud Director.

 

On-Premises Tenant Site

 

Tenants deploy an On-Premises Appliance (OVA-based) in vCenter:

 

Serves as Replication Engine, Tunnel Node, and UI portal

 

Paired to a provider-side cloud or vCenter site

 

Once paired, tenants can replicate and perform DR operations securely.

 

Cloud Director Core HA Mechanisms

 

Even setting VCDA aside, the base VMware Cloud Director platform maintains availability via:

 

Multi-cell architecture:

 

Deploy multiple Cloud Director cells (instances) that connect to a shared database and transfer server storage.

 

Cells are stateless, allowing restarts without data loss.

 


 

Database HA cluster:

 

One primary cell plus two standby cells—automated failover can be configured.

 

Cluster health states: Healthy (primary + ≥2 standbys), Degraded (primary + 1 standby); if degraded, the system is vulnerable until a standby is restored.

 


Load balancing:

 

A session-aware load balancer ensures continuity by routing requests across cells.
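
As a practical illustration of the multi-cell pattern, here is a minimal Python health probe of the kind a load balancer performs against each cell. The cell hostnames are hypothetical; the /api/server_status endpoint is the health-check URL commonly used in Cloud Director load-balancer setups, so treat this as a sketch to adapt, not an official tool.

```python
import ssl
import urllib.request

# Hypothetical cell hostnames behind the load balancer (adjust to your site).
CELLS = ["vcd-cell-01.example.com", "vcd-cell-02.example.com", "vcd-cell-03.example.com"]

# Lab-only: skip certificate verification; in production, trust your CA chain instead.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def cell_is_healthy(host: str, timeout: float = 5.0) -> bool:
    """Probe the health endpoint commonly used for Cloud Director LB checks."""
    url = f"https://{host}/api/server_status"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except OSError:
        return False

for cell in CELLS:
    print(f"{cell}: {'UP' if cell_is_healthy(cell) else 'DOWN'}")
```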

 

 

| Layer | HA Mechanisms |
| --- | --- |
| VCDA (DRaaS) | Redundant Tunnel Appliances (active-active), multiple Replicators, management appliances |
| vCenter DR/Migration Only | Replicator appliances (scalable layering) |
| On-Prem Tenant Site | Single appliance; high availability relies on provider-site redundancy |
| Cloud Director Platform | Multi-cell deployment + DB HA cluster + load balancer |

  

Reference diagrams (not reproduced here):

- Modern vSphere DR/migration setup from the VCDA 4.4 release, covering cloud-to-cloud and vCenter-to-vSphere flows
- Ransomware recovery flow showing the integration between tenant, VCDA, and recovery services
- VCDA 4.1 port mapping and component interaction—a classic diagram useful for initial setup understanding
- Multi-NIC appliance deployment layout, illustrating network interface configurations for complex environments

 

Scaling Limits & Deployment Sizing

 

Scale-Out via Replicator Appliances

For large-scale environments (managing over 6,000 replications), VCDA recommends scaling out by deploying more Cloud Replicator appliances to distribute load and prevent out-of-memory issues on the Cloud Replication Management, Replicator, or Tunnel appliances. This scaling is applied on the cloud/provider side only—no changes required on the on-premises appliances.
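
As a rough planning aid (an illustration, not an official VMware formula), the sketch below estimates the replicator count from a per-replicator capacity you have measured in a pilot:

```python
import math

def replicators_needed(active_replications: int,
                       per_replicator_capacity: int,
                       minimum: int = 2) -> int:
    """Estimate how many Cloud Replicator appliances to deploy.

    per_replicator_capacity must come from your own pilot measurements;
    it varies with VM I/O profile, RPO, and network/storage throughput.
    The floor of 2 reflects the common guidance to run at least two
    replicators per VCDA instance for redundancy.
    """
    return max(minimum, math.ceil(active_replications / per_replicator_capacity))

# Example: 6,500 active replications, measured capacity ~800 per replicator.
print(replicators_needed(6500, 800))  # -> 9
```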

 

Minimum Hardware Specs & Appliance Roles

According to official deployment requirements (VCD Availability 4.5+):

 

Replicator appliances can now be deployed alongside the replication services that run in the vCenter Replication Management Appliance.

 

Typical specs:

 

Replicator Appliance: 8 vCPUs, 8 GB RAM, 10 GB storage 

Tunnel Appliance: 4 vCPUs, 4 GB RAM, 10 GB storage

 

HA: Starting in VCDA 4.6, a second Tunnel Appliance can be deployed for high availability.

 

A minimal combined appliance (for testing only) includes all services (Manager, Replicator, Tunnel, Tenant UI) and is sized at 4 vCPUs, 6 GB RAM, 10 GB storage—but is not recommended for production.

 

Deployment Topologies


Production deployments typically involve multiple VCDA instances per provider Cloud Director site, each mapped to different Provider VDCs. This enables granular control and scalability across diverse cloud environments.

 

On the on-premises side, multiple “On-Premises Replication Appliance” instances can be deployed and paired with the same cloud organization for load distribution and redundancy.

 

Recommended Sizing & Resilience Strategy

 

Manager (Cloud Replication Management) Appliance

 

This serves as the management entrypoint (UI/API), coordinating replication activities. Its sizing should be aligned with expected workloads and tenant count.

 

Scale-Out Strategy


Start with a robust base deployment: one Manager Appliance, one Tunnel Appliance, and at least one Replicator Appliance.

 

Add more Replicator Appliances as replication volume and tenant count grow.

 

Add a second Tunnel Appliance in VCDA 4.6+ environments for HA.

 

Use Cases and Deployment Guides

 

An official VMware white paper—"Architecting VMware Cloud Director Availability Solution in a Multi-Cloud Environment"—offers detailed topology options, traffic flows, and port-level design rationale. 

 

 


 

| Component | Role | Minimum Specs | Scalability / HA Notes |
| --- | --- | --- | --- |
| Manager Appliance | UI, API, orchestrator | (not specified) | Scale with load |
| Replicator Appliance | Handles replication streams | 8 vCPU, 8 GB RAM, 10 GB | Deploy multiple for scale-out |
| Tunnel Appliance | Secures replication traffic | 4 vCPU, 4 GB RAM, 10 GB | Add a second for HA (VCDA 4.6+) |
| Combined Appliance (test only) | All-in-one single-node deployment | 4 vCPU, 6 GB RAM, 10 GB | Labs/evaluations only |
| On-Prem Replication Appliance | Tenant-side replication initiation | (standard per role) | Multiple instances supported |
| Provider VDC Instances (cloud) | Logical isolation for multi-tenancy | N/A | Each may require its own VCDA instance |

 

The rest of this post drills into three areas:

1) Recovery workflows — step-by-step (with checks & gotchas)

2) Sizing & capacity guidance (how many appliances, appliance specs, scale-out guidance)

3) Detailed port / network mapping for complex topologies (multi-NIC, tunnel, DNAT notes)

 

 

1) Recovery workflows — deep dive (concept → actions → validation)

This covers the whole life cycle: pairing → protect (seed) → ongoing replication → recovery plan (test/failover) → failback / reverse protection → cleanup.

 

a. Pairing & onboarding (pre-reqs)

Pair the sites (cloud ↔ cloud or on-prem ↔ cloud) to establish trust (X.509) so appliances discover each other and authenticated sessions work. The pairing step is administrative and requires accepting certificates from the remote endpoint. Pairing enables discovery of workloads and destination resources.
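
Because pairing means accepting the remote endpoint's certificate, it is worth checking the fingerprint out of band first. A minimal standard-library sketch; the tunnel FQDN is a hypothetical placeholder:

```python
import hashlib
import ssl

def remote_cert_sha256(host: str, port: int = 8048) -> str:
    """Fetch the remote endpoint's certificate and return its SHA-256 fingerprint."""
    # get_server_certificate retrieves the PEM cert without validating trust,
    # which is what we want for an out-of-band fingerprint check.
    pem = ssl.get_server_certificate((host, port))
    der = ssl.PEM_cert_to_DER_cert(pem)
    digest = hashlib.sha256(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))

# Hypothetical provider tunnel endpoint; compare the output with the
# fingerprint shown in the VCDA pairing dialog before accepting it.
print(remote_cert_sha256("vcda-tunnel.provider.example.com"))
```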

 

 

b. Protection (initial seed + ongoing sync)

Create a replication for a VM/vApp from the tenant UI (or vCenter plugin). The first time you replicate, VCDA may require a seed copy / initial full copy (over LAN or via a shipped seed); subsequent changes are then transferred as deltas.

Configure SLA profile / retention / RPO per replication (recovery frequency). VCDA supports low RPOs (minutes) depending on infrastructure. Monitor the first sync closely — it’s the most network/storage intensive step.

 

c. Recovery Plans (orchestration)

Recovery Plans let you group replications and define the order and steps (power-on order, network maps, pre/post scripts, prompt steps). They are the DR playbooks you run for test or actual failover. Plans can be validated before execution. You can schedule, test, suspend, and manage plans via UI or API. Test operations are first-class API calls (there’s a POST /recovery-plans/{id}/test endpoint that returns a task).
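
Since test is a first-class API call, rehearsals can be scripted. Below is a hedged sketch against the POST /recovery-plans/{id}/test endpoint mentioned above; the base URL, the auth header name, and the task-polling shape are assumptions to verify against the VCDA API reference for your version.

```python
import time
import requests  # third-party: pip install requests

BASE = "https://vcda.provider.example.com"  # hypothetical VCDA endpoint
HEADERS = {
    "Accept": "application/json",
    "X-VCAV-Auth": "<session-token>",  # assumption: obtain via your site's auth flow
}

def test_recovery_plan(plan_id: str) -> dict:
    """Kick off a test failover for a recovery plan; returns the task object."""
    resp = requests.post(f"{BASE}/recovery-plans/{plan_id}/test",
                         headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

def wait_for_task(task: dict, poll_seconds: int = 10) -> dict:
    """Poll the returned task until it leaves the running state (shape assumed)."""
    while task.get("state") == "RUNNING":
        time.sleep(poll_seconds)
        task = requests.get(f"{BASE}/tasks/{task['id']}",
                            headers=HEADERS, timeout=30).json()
    return task

# Usage: task = test_recovery_plan("<plan-id>"); print(wait_for_task(task))
```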

 

d. Test failover (safe rehearsal)

Test failover spins up replicas in an isolated network (so you don’t disturb production). VCDA’s test validates preconditions (availability of target resources, network mappings, missing artifacts) and creates a sandbox recovery so you can run app-level verification. Use this often during runbooks and after infra changes.

 

e. Planned failover (controlled migration)

For planned events (maintenance/migration) you can perform a planned failover, which minimizes data loss through a final sync, cutover, and power-on at the target. This mirrors the migration flow and reuses the same protection stack.

 

f. Unplanned failover (disaster)

When source is down: run the Recovery Plan failover — VCDA will bring up VMs at the DR site following plan steps (networking, IP mappings, power sequence, pre/post scripts). Confirm DNS, load balancer entries, and app connectivity. After failover, you may need to re-protect the recovered VMs (reverse replication) to enable failback.  

 

g. Failback / reverse protection

Failback usually follows: re-establish replication from DR site back to primary (reverse protection), resync any deltas, perform a final cutover back to primary, and clean up the temporary recovered resources. Plans support migrating and re-protect steps to automate parts of this. Always test failback during non-production windows.

 

h. Operational checks / runbook items (practical)

Validate Recovery Plan preconditions (storage space, networks, availability of target VMs). VCDA 4.6+ enhances validation checks.  

 

Monitor RPO violations and investigate bottlenecks: tunnel CPU, replicator CPU, vSAN/datastore saturation, network uplink. If RPOs slip, scale replicators or tune network/storage.  

 

2) Sizing & capacity guidance — recommended VM sizing per tenant load

Two key scaling levers: add replicator appliances (scale-out) and right-size existing appliances (avoid OOM/CPU starvation). VCDA is designed to scale horizontally by adding replicators.

 

Official appliance specs (typical / minimum)

(these are explicit appliance sizes from VMware docs / whitepapers for VCDA components)

 

Cloud / vCenter Replication Management Appliance (Manager) — coordinates UI/API and orchestrates replication flows. (Sizing: scale by expected concurrent API/UI sessions; exact cores depend on tenant load — monitor and scale).  

 

Cloud Replicator Appliance — handles replication streams (recommended baseline in docs): ~8 vCPU / 8 GB RAM (older docs and cloud deployments show 4–8 vCPU / 6–8 GB depending on version). VMware recommends deploying at least two Replicator appliances per VCDA instance and scaling out replicators based on the number of active protections.  

 

Cloud Tunnel Appliance — proxies/tunnels replication traffic and supports HA (VCDA 4.6+ supports active-active tunnel appliances). Typical spec: 4 vCPU / 4 GB RAM (some docs show smaller lab specs but production customers use larger sizes).  

 

Combined appliance (lab/test only) — all services in one VM (small labs only): ~4 vCPU / 6 GB RAM (not for production).  

 

How many replicators / how to size for tenants

 

VMware’s guidance is behavioural (scale out based on active protections) rather than a fixed “N VMs per replicator” number, because I/O profile, RPO, VM size, and network throughput all affect capacity. Practical approach:

 

Start template (provider cloud):

 

Manager: 1 appliance (monitor load)

 

Tunnel: 1 appliance (add second for HA)

 

Replicator: 2 appliances (baseline for production) — distribute active protections across them.

 

Measure & threshold: pick KPI thresholds (per replicator): CPU > 70% sustained, memory > 75%, increase in replication latency, or RPO violations. When thresholds hit, add another replicator. VMware explicitly recommends adding replicators to scale beyond thousands of replications.
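
That threshold logic is straightforward to encode. A minimal sketch, assuming you already collect per-replicator metrics from your monitoring stack (the metric source and field names are placeholders):

```python
from dataclasses import dataclass

@dataclass
class ReplicatorMetrics:
    name: str
    cpu_pct: float        # sustained CPU utilization
    mem_pct: float        # memory utilization
    rpo_violations: int   # count over the observation window

def needs_scale_out(m: ReplicatorMetrics) -> bool:
    """Apply the KPI thresholds from the text: CPU > 70% sustained,
    memory > 75%, or any RPO violations means add another replicator."""
    return m.cpu_pct > 70.0 or m.mem_pct > 75.0 or m.rpo_violations > 0

fleet = [
    ReplicatorMetrics("replicator-01", cpu_pct=82.0, mem_pct=60.0, rpo_violations=0),
    ReplicatorMetrics("replicator-02", cpu_pct=55.0, mem_pct=50.0, rpo_violations=0),
]

for m in fleet:
    if needs_scale_out(m):
        print(f"{m.name}: threshold breached — plan an additional Replicator appliance.")
```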

 

Example (illustrative, adapt to your workload):

 

Light environment: 1–2 replicators → up to ~200–500 active replications (depends on VM I/O & RPO).

 

Medium environment: 3–5 replicators → 500–2000 active replications.

 

Large environment: scale to many replicators where each handles a “bucket” of protections; aim for distribution across ESXi hosts (DRS rule: Separate Virtual Machines).

Note: these example ranges are indicative — run pilot tests and measure.

 

Tenant VM sizing guidance (for provider operators)

Per-tenant recommendations: limit the number of concurrently protected / high-I/O VMs per tenant SLA tier. Offer tiers like Bronze / Silver / Gold with fixed RPOs and a maximum number of concurrently protected VMs — this avoids noisy-neighbor effects, as sketched below. Base the maximums on your measured throughput per replicator. (This is an ops pattern recommended in the architecting guides.)
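
One way to make those tiers concrete is a small admission-control structure. A minimal sketch in Python; the tier numbers are placeholders, not VMware guidance:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SlaTier:
    name: str
    rpo_minutes: int        # committed recovery point objective
    max_protected_vms: int  # concurrently protected VMs per tenant

# Illustrative tiers; derive real limits from pilot measurements.
TIERS = {
    "bronze": SlaTier("Bronze", rpo_minutes=240, max_protected_vms=25),
    "silver": SlaTier("Silver", rpo_minutes=60, max_protected_vms=50),
    "gold":   SlaTier("Gold",   rpo_minutes=5,  max_protected_vms=100),
}

def admission_ok(tier_key: str, currently_protected: int) -> bool:
    """Reject new protections that would exceed the tenant's tier limit."""
    return currently_protected < TIERS[tier_key].max_protected_vms

print(admission_ok("silver", currently_protected=49))  # True: one slot left
```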

 

Sizing checklist

Deploy at least two replicators per VCDA instance.  

 

Keep replicator VMs on different ESXi hosts (Separate VMs DRS rule) and on resource clusters with access to the replication vmkernel network.  

 

Add second Tunnel appliance for HA (VCDA 4.6+).  

 

Monitor RPOs, latency, CPU ready, vSAN/datastore throughput, and add replicators when replication latency/RPO violations increase.  

 

3) Detailed port / network mappings and complex multi-NIC topologies

The ports below are the load-bearing network mapping points you’ll need when designing firewall, DNAT, and NSX/Edge rules.

| Port | Protocol | Source → Destination | Purpose / Service |
| --- | --- | --- | --- |
| 8047/tcp | TCP | Manager/Cloud → Tunnel | Tunnel API (management) — used by the cloud Manager to enable and maintain the tunnel |
| 8048/tcp | TCP | Remote tunnel / on-prem replicator → Tunnel | Tunnel data endpoint — actual replication data flows to this port; use DNAT to forward external 8048 to the tunnel IP |
| 8043/tcp | TCP | Tunnel ↔ Replicator | Management traffic used by the Tunnel service to talk to the Replicator |
| 8044/tcp | TCP | Replication Manager ↔ Tunnel / Manager service | Management endpoint used by Manager/Replicator (seen in multi-NIC configs) |
| 8443/tcp | TCP | Admin UI/API → Manager appliance | Manager UI/API access (some configs use 8441/8443) |
| 443/tcp | TCP | External (optional) → Tunnel (DNAT) | Public publishing of tunnel endpoints; can be translated to 8048 internally |
| 902 (NFC) | TCP/UDP | Replicator ↔ ESXi vmkernel | VM data transfer via the NFC protocol; the Replicator must reach the ESXi vmkernel network (configure LWD/NFC addresses) |

 

DNAT / Edge NAT: When exposing a tunnel to the internet, it’s common to DNAT external port (443 or custom) to tunnel internal 8048; pairing and replication traffic uses 8048 for data and 8047 for control.
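
Before pairing across a DNAT boundary, it helps to confirm that the public front door and the tunnel ports actually answer. A quick TCP reachability sketch using only the standard library; the endpoint FQDN is hypothetical:

```python
import socket

TUNNEL = "vcda-tunnel.provider.example.com"  # hypothetical public endpoint

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 443 is the public DNAT front door; 8048 carries data and 8047 control.
for port in (443, 8047, 8048):
    print(f"{TUNNEL}:{port} -> {'reachable' if port_open(TUNNEL, port) else 'blocked'}")
```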

 

Multi-NIC appliances: Appliances often have several NICs (management, NFC, replication). Only one tunnel interface is used to talk to local VCDA components — you must choose which interface the Tunnel uses for local service and set static routes as necessary. Reconfigure tunsetendpoints / rtrsetendpoints / c4 setendpoints if you change NICs. The docs give exact CLI/UI steps. 

Replicator NIC mapping: The replicator needs a Mgmt address (to talk to manager/tunnel) and an LWD/NFC address to talk to ESXi vmkernel for data-plane traffic — configure these intentionally to keep replication traffic on the fastest path. 

Example complex topology patterns:

Public internet pairing (tenant on-prem → provider cloud): expose the Tunnel appliances on a public IP (DNAT 443 → 8048) and ensure 8047/8048 and the management ports are reachable. Pair from the on-prem replicator to the Tunnel’s public FQDN.

Private peered network between data centers (no DNAT): use tunnel interfaces on private subnets and pair using the private interface IP and port 8048. No DNAT required, lower latency, simpler firewall rules.

Multi-NIC for high throughput: use a dedicated L2/L3 segment for replication traffic (the replicator LWD/NFC interface reaching the replication vmkernel network) and a separate management NIC for the control plane (Manager/Tunnel traffic). Configure static routes and set the mgmtAddress/lwdAddress parameters as recommended.

 

 

 
 
 
