Site Recovery Manager

sathyahraj
5 days ago

What Is SRM?

VMware Site Recovery Manager (SRM) is an enterprise-grade disaster recovery orchestration tool. It automates the planning, testing, failover, and failback of virtual machine workloads between two vCenter-managed sites: a primary (protected) site and a secondary (recovery) site.

Core Components and Architecture

vCenter & SRM Servers at Each Site: SRM runs alongside vCenter on both the protected and recovery sites (as appliances), coordinating actions during recovery events.
Replication Mechanisms: Supports VMware vSphere Replication (hypervisor-level) and array-based replication via Storage Replication Adapters (SRAs) or Virtual Volumes.

✅ Key Features

1. Automated Orchestration

Executes recovery plans with minimal manual steps.
Controls shutdown, storage sync, virtual machine startup, and network configuration in a predefined order to meet RTOs reliably.

2. Non‑Disruptive Testing

Runs recovery plan validation in an isolated environment without impacting production.
Uses temporary copies of replicated data to simulate failover scenarios safely.

3. Failback / Planned Migration

Returns operations to the original site once the primary site is restored.
Can be automated and treated as a planned migration rather than an ad hoc manual restoration.

4. Flexible Recovery Planning

Allows for granular recovery plans including sequencing, IP/network settings, resource mappings, and custom scripting for post-failover automation.

5. Scalable & Application-Agnostic

Designed to protect thousands of VMs across multiple sites.
Works with any application running in the VMware environment without needing app-specific plugins.

6. Support for Advanced Topologies

Supports active‑passive and bi‑directional (A ↔ B) configurations.
The shared recovery site model enables multiple protected sites to fail over into one recovery hub.

7. Integration Ecosystem

Integrates with VMware NSX for network virtualization, vSAN for storage, and VMware Cloud Foundation. It can also be used in public-cloud DRaaS setups (e.g., VMware Cloud on AWS, Azure VMware Solution).

🧰 Deployment & Licensing

Installation: Deploy SRM as a Photon OS-based appliance on both sites. Use the HTML5 Clarity UI for configuration and management.
Licensing: SRM is licensed per protected VM, not per CPU. Term and perpetual licenses are supported, and vSphere Replication is included at no extra cost with vSphere Essentials Plus and above.

⚖️ Benefits and Use Cases

Reduced Downtime & Errors: Automated execution ensures consistency and speed.
Compliance Support: Non‑disruptive testing and documented recovery plans meet audit and regulatory needs.
Lower TCO: Leverages existing VMware integrations and automation to cut manual labor and overhead.

Typical Use Cases:

Critical site-level DR and failover
Planned datacenter migrations
Maintenance-based application testing
Hybrid cloud and multi-cloud DR strategies

🧠 Deployment Workflow Summary

Step	Description
1. Site Pairing	Establish SRM connection between protected and recovery vCenter servers
2. Inventory Mapping	Map resources like folders, networks, datastores and resource pools
3. Protection Groups	Group VMs based on policies: array-based, datastore, or vSphere replication
4. Recovery Plan Configuration	Define VM order, network customization, scripts, and priorities
5. Testing	Run non‑disruptive failover tests using isolated copies of replicated data
6. Execution	Initiate planned migration or disaster recovery failover
7. Reprotect / Failback	After recovery, reprotect VMs and optionally move them back to primary site

1. Designing the SRM Recovery Plan ✅

🔄 Recovery Plan Structure & Workflow

Protection Groups: Organize VMs into groups based on application tiers or data dependencies (e.g. web servers, DB servers). These map to replication sets and should be aligned with each application's RTO/RPO requirements.
Recovery Plans: A plan is essentially an automated runbook controlling VM shutdown, replication sync, startup sequence, IP/network customization, and scripts. Multiple plans can reference the same protection groups.
Dependencies & Sequencing: Define inter-VM dependencies (e.g. domain controllers first, then application servers). VMs without a dependency between them start in parallel within their priority group, which speeds up recovery.
Pre/Post Actions & IP Customization: Customize VM IP addresses and gateways, run in-guest scripts (DNS updates, service starts), and display operator prompts during recovery execution.
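
To make the sequencing idea concrete, here is a minimal sketch of how a startup order could be derived from priority groups and dependencies. It is an illustration, not SRM's implementation: the function name `startup_waves`, the VM names, and the rule that only dependencies inside the same priority group are considered are all assumptions made for this example.

```python
from __future__ import annotations

from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def startup_waves(priority: dict[str, int],
                  depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Return 'waves' of VM names; VMs in the same wave can start in parallel.

    priority:   VM name -> priority group number (lower starts first).
    depends_on: VM name -> VMs that must be running before it starts.
    """
    by_group: dict[int, list[str]] = defaultdict(list)
    for vm, group in priority.items():
        by_group[group].append(vm)

    waves: list[list[str]] = []
    for group in sorted(by_group):
        members = set(by_group[group])
        # Assumption: only dependencies inside the same priority group matter
        # here, because ordering across groups is given by the group number.
        graph = {vm: depends_on.get(vm, set()) & members for vm in members}
        sorter = TopologicalSorter(graph)
        sorter.prepare()  # raises graphlib.CycleError on circular dependencies
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            waves.append(ready)
            sorter.done(*ready)
    return waves


# Hypothetical three-tier application plus a domain controller.
priority = {"dc01": 1, "db01": 2, "app01": 2, "app02": 2, "web01": 3}
depends_on = {"app01": {"db01"}, "app02": {"db01"}, "web01": {"app01"}}
print(startup_waves(priority, depends_on))
# [['dc01'], ['db01'], ['app01', 'app02'], ['web01']]
```

The two application servers land in the same wave because neither depends on the other, which is the parallel startup the text describes.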

🧪 Testing

Use non-disruptive test recovery: it runs against isolated snapshots of the replicated data, so production is not impacted and replication continues during the test. Test networks can be fully isolated or can duplicate the production layout, depending on needs.
Run Cleanup afterward to power off the test VMs, restore the placeholder VMs, and delete the temporary snapshots.

🛠️ Execution & Failback

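Unplanned Failover: Run the recovery plan in disaster recovery mode after a disaster. SRM attempts a shutdown and final sync if the protected site is still reachable, then orchestrates startup and IP reconfiguration at the recovery site.
Planned Migration / Failback: If the protected site is still operational, a planned migration moves workloads in an orderly way: VMs are shut down and a final sync runs first, so no data is lost. Afterward, Reprotect reverses the replication direction, and running the plan again fails back to the original site.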

2. Topology Mapping & Resource Alignment

Site Pairing: Pair protected and recovery vCenter servers and their SRM instances via site-pair configuration.
Inventory Mapping: Map folders, resource pools, networks, and datastores between sites to ensure smooth migration. NSX universal logical switches can span L2 networks for seamless failover.
Resource Considerations: Ensure sufficient compute, storage, and network resources at the recovery site. Use fewer, larger datastores and group VMs to minimize recovery latency.
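
A quick way to sanity-check the resource point is to compare what the protected workloads need against what the recovery site can offer. The sketch below assumes you already have those totals from your own inventory tooling; the `Capacity` type, its fields, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capacity:
    cpu_ghz: float
    memory_gb: float
    storage_tb: float


def shortfall(required: Capacity, available: Capacity) -> dict[str, float]:
    """Return each resource the recovery site lacks, with the missing amount."""
    gaps = {}
    for name in ("cpu_ghz", "memory_gb", "storage_tb"):
        missing = getattr(required, name) - getattr(available, name)
        if missing > 0:
            gaps[name] = missing
    return gaps


# Hypothetical totals: protected workloads vs. free capacity at the recovery site.
required = Capacity(cpu_ghz=100, memory_gb=512, storage_tb=20)
available = Capacity(cpu_ghz=120, memory_gb=384, storage_tb=25)
print(shortfall(required, available))  # {'memory_gb': 128}
```

An empty result means the recovery site covers the stated demand; it says nothing about network bandwidth, which needs its own check.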

3. Choosing Replication Methods: vSphere vs. Array-Based

Feature	Array-Based Replication (ABR)	vSphere Replication (VR)
Replication Layer	Storage array (LUN/volume level)	Hypervisor (VM-level replication)
RPO	As low as sub‑minute (vendor dependent)	5 minutes to 24 hours (5 min with vSAN/vVols)
Write-order fidelity	Maintained across multiple VMs in a group	Maintained only across the disks of a single VM
Scale	Up to thousands of VMs	~2,000 VMs per SRM instance
Storage dependency	Requires same vendor array at both sites	Storage‑agnostic (VMware‑supported)
Cost	Higher licensing and vendor-specific setup	Included with many vSphere licenses (Essentials Plus+)

🟢 When to Choose:

Array-Based: Ideal for enterprise use cases requiring sub-minute RPOs, write-order consistency across multiple VMs, large VM scale, and tight SLAs.
vSphere Replication: Best suited for smaller environments, mixed storage, budget-conscious setups, or non-critical workloads.

You can even mix both: use VR for lower-tier VMs, and ABR for critical workloads within the same SRM deployment—just don’t protect the same VM by both mechanisms.
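
The selection guidance above can be sketched as a small rule plus a check for the "not both" constraint. This is a simplification under stated assumptions: the 5-minute lower bound for VR comes from the comparison table, the function names are made up for this example, and real decisions also weigh cost, scale, and array support.

```python
from datetime import timedelta

# Lower RPO bound for vSphere Replication, taken from the comparison table.
VR_MIN_RPO = timedelta(minutes=5)


def choose_replication(rpo: timedelta, needs_cross_vm_consistency: bool) -> str:
    """Pick 'ABR' or 'VR' for a workload from two of the table's criteria."""
    if rpo < VR_MIN_RPO or needs_cross_vm_consistency:
        return "ABR"
    return "VR"


def doubly_protected(abr_vms: set[str], vr_vms: set[str]) -> set[str]:
    """Return VMs assigned to both mechanisms, which must be avoided."""
    return abr_vms & vr_vms


print(choose_replication(timedelta(seconds=30), False))   # ABR
print(choose_replication(timedelta(minutes=15), False))   # VR
print(doubly_protected({"db01", "app01"}, {"app01", "web01"}))  # {'app01'}
```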

4. Putting It All Together: Architecture Planning

Map required RTO/RPO per application/tier → choose replication accordingly.
Design protection groups aligned to workloads, dependencies, and replication capabilities.
Configure inventory mappings of compute, network, folders, and storage to match planned failover topologies.
Build and test recovery plans:
- set sequence, customize IPs
- include pre/post scripts
- test non-disruptively
Plan execution strategy:
- scheduled migrations
- failover vs planned migration scenarios
- reprotect and failback workflows
Baseline performance with recommended settings (e.g. fewer, larger datastores and grouped VM startups) to reduce latency.
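
As a planning aid, the first two steps can be sketched as a grouping exercise: tag each VM with its tier and chosen replication method, then propose one protection group per combination. The `Vm` type and the naming scheme are hypothetical, and the sketch ignores that array-based groups in practice follow datastore or consistency-group boundaries.

```python
from collections import defaultdict
from typing import NamedTuple


class Vm(NamedTuple):
    name: str
    tier: str         # e.g. "tier1" for the tightest RTO/RPO
    replication: str  # "ABR" or "VR", chosen per the section above


def propose_protection_groups(vms: list[Vm]) -> dict[str, list[str]]:
    """Group VM names by (tier, replication) as candidate protection groups."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for vm in vms:
        groups[(vm.tier, vm.replication)].append(vm.name)
    return {f"{tier}-{repl}": sorted(names)
            for (tier, repl), names in groups.items()}


vms = [
    Vm("db01", "tier1", "ABR"),
    Vm("db02", "tier1", "ABR"),
    Vm("web01", "tier2", "VR"),
]
print(propose_protection_groups(vms))
# {'tier1-ABR': ['db01', 'db02'], 'tier2-VR': ['web01']}
```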

5. Best Practices & Practical Tips

Keep swap and page files on separate, non-replicated disks, and plan very large VMs separately, to avoid unnecessary replication load.
Tune bandwidth and replication settings: per-VM RPO, compression, and network latency between sites. Note that VR tracks changed blocks with its own filter rather than vSphere CBT.
Use parallel startup within priority groups and keep the number of protection groups small to improve RTO.
Integrate SRM with NSX, vSAN, and VMware Cloud if using hybrid or multi‑site deployments for better automation.
Document recovery plan history, run test reports, and align with compliance or audit requirements.

SRM Topology Explanation

1. Protected (Primary) Site

vCenter Server and SRM appliance manage production workloads.
vSphere Replication appliances or storage arrays (with SRAs) handle replication.
Protected VMs reside in clusters and storage datastores ready for replication.

2. Recovery (Secondary) Site

Mirrored setup with vCenter + SRM appliance.
Placeholder VMs are created in advance to reserve inventory slots.
Replication targets: VR receives VM-level blocks; array-based replication mirrors LUNs/volumes.

3. Replication & Network Links

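Network links connect the SRM servers, the vSphere Replication services, and (for ABR) the SRAs and array management interfaces across sites. Firewalls must allow the required ports, e.g. 31031 and 44046 for vSphere Replication traffic.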
Storage replication occurs either via the hypervisor (vSphere replication) or directly between arrays (ABR).
Replication traffic uses dedicated replication networks for isolation and performance.
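
Before pairing sites, it can help to confirm that the required ports are reachable through the firewalls. The sketch below is a generic TCP reachability check using only the Python standard library; the function names are made up for this example, and any host you pass in is your own assumption about where the service listens.

```python
from __future__ import annotations

import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False


def check_ports(host: str, ports: list[int],
                timeout: float = 3.0) -> dict[int, bool]:
    """Check several ports on one host and report which ones are reachable."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, `check_ports("vr.recovery.example", [31031, 44046])` would test the two ports mentioned above against a hypothetical host name. A successful TCP connect only shows that the path is open; it does not prove the service behind the port is healthy.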

4. Inventory & Resource Mapping

Folders, resource pools, datastores, and networks are mapped from the protected site to the recovery site.
NSX or inventory-based network mappings ensure consistent virtual networking and, if used, universal logical switches can allow seamless L2 failover.
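
A missing mapping is easy to overlook until a test recovery fails, so it can be useful to check coverage ahead of time. The sketch below assumes you have exported, by whatever means you prefer, the objects each protected VM uses and the configured mappings; the data layout and the function name `unmapped_objects` are hypothetical.

```python
from __future__ import annotations


def unmapped_objects(vm_inventory: dict[str, dict[str, str]],
                     mappings: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Report, per VM, the protected-site objects that have no mapping.

    vm_inventory: VM name -> {kind: protected-site object name}
    mappings:     kind -> {protected-site object name: recovery-site object name}
    """
    problems: dict[str, list[str]] = {}
    for vm, objects in vm_inventory.items():
        missing = [f"{kind}:{obj}"
                   for kind, obj in sorted(objects.items())
                   if obj not in mappings.get(kind, {})]
        if missing:
            problems[vm] = missing
    return problems


vm_inventory = {
    "web01": {"network": "prod-web", "folder": "Web", "resource_pool": "rp-prod"},
    "db01": {"network": "prod-db", "folder": "DB", "resource_pool": "rp-prod"},
}
mappings = {
    "network": {"prod-web": "dr-web"},
    "folder": {"Web": "DR-Web", "DB": "DR-DB"},
    "resource_pool": {"rp-prod": "rp-dr"},
}
print(unmapped_objects(vm_inventory, mappings))  # {'db01': ['network:prod-db']}
```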

5. Recovery Plan Execution Flow

Initiate recovery or planned migration.
If the protected site is still online, shut down the protected VMs and perform a final sync.
The recovery site replaces the placeholder VMs with the recovered VMs and powers them on by priority group, honoring dependencies.
Launch post‑power-on scripts, apply IP changes, reconfigure services.
After recovery or test, cleanup and optionally reprotect and fail back.
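
The flow above can be summarized as a function that builds the step list for a given run. This is a simplified model for discussion, not SRM's actual step sequence: the mode names, step wording, and the rule that a planned migration requires the protected site to be online are assumptions of this sketch.

```python
from __future__ import annotations


def recovery_steps(mode: str, protected_site_online: bool) -> list[str]:
    """Build an ordered, simplified step list for a recovery run.

    mode: "planned_migration" or "disaster_recovery".
    """
    if mode not in {"planned_migration", "disaster_recovery"}:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "planned_migration" and not protected_site_online:
        raise ValueError("a planned migration needs the protected site online")

    steps: list[str] = []
    if protected_site_online:
        # Only possible while the source site can still be reached.
        steps += ["shut down protected VMs", "final replication sync"]
    steps += [
        "power on recovered VMs by priority group",
        "apply IP customization",
        "run post-power-on scripts",
        "reprotect (optional)",
    ]
    return steps


for step in recovery_steps("disaster_recovery", protected_site_online=False):
    print(step)
```

In the disaster case with the protected site down, the list starts directly with powering on the recovered VMs, which is where data loss up to the last completed replication can occur.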

🔍 Key Takeaways from the Diagram

Provides a holistic view of components: vCenter servers, SRM appliances, replication layer, VM inventory, networks.
Shows dual replication modes: vSphere Replication (VM-level) vs. Array-Based Replication using SRAs.
Illustrates bi‑directional topology, supporting both planned migrations and failbacks.
Includes network port/service mapping, which is especially useful for firewall and compliance planning.

🧩 Enhancing for Your Environment

You can tailor this layout to various SRM topologies:

Shared Recovery Site: Multiple protected sites map into one recovery site (multi-pair SRM).
Stretched Cluster Integration: Combine SRM with vSAN stretched clusters, protecting workloads across metro sites to a third site for additional resiliency.
NSX-Aware Deployment: Use Cross-VC NSX logical networks and automated mapping to keep IP addressing and security policies identical across sites, which suits both test and DR networks.

📝 How to Create Your Own Topology Diagram

Consider the following when building your custom diagram:

Clearly mark vCenter + SRM pairs at each site.
Show replication components: VR appliances and/or array replication adapters.
Annotate network connectivity: control, replication, and VM traffic.
Indicate inventory mappings: network, resource pools, datastores, folder names.
Define placeholder VM logic, recovery priority groups, and sequencing.
Include pre/post script stages, IP customization steps.
Layer in optional components like NSX, stretched clusters, or F5 BIG-IP for routing and DNS failover.