How to Migrate Clustered VMs and High I/O Databases off VMware
Web servers migrate cleanly. App tiers migrate cleanly. You run the tool, wait for the disk copy, power on the target, and move on. That’s not the conversation engineers need to have. The conversation worth having is about Oracle RAC nodes and SQL Server Always On Availability Groups – because these are the workloads where migrations stall for months, where unplanned downtime windows appear at 2 AM, and where projects quietly get descoped to avoid the complexity (Broadcom).
Gartner estimates that database workloads account for a disproportionate share of VMware migration project delays, with complex clustered environments extending timelines by 6 – 18 months beyond initial estimates (Gartner, Jan 2025). That’s not a technology problem. It’s an architecture problem that standard migration tooling isn’t designed to solve (Gartner Magic Quadrant for Cloud Infrastructure).
This guide covers why standard approaches fail on clustered, high I/O workloads – and the specific architectural decisions that get them across safely (The Register).
TL;DR:
- Snapshot-based migration tools (Veeam default mode, Commvault) freeze disk I/O during consolidation – 1 – 30 seconds is enough to evict an Oracle RAC node or break a SQL Server AG
- OS-level replication tools (OpenText Migrate) install a kernel-level filter driver that bypasses the hypervisor entirely – no snapshots, no stun, near-real-time delta sync
- VMware RDMs have no direct KVM equivalent: use raw disk images with shareable=yes on KVM, Nutanix Volume Groups for iSCSI passthrough, or SAN LUN passthrough on Proxmox/HPE VME
- SCSI-3 Persistent Reservations are mandatory for WSFC quorum – test the full stack before any data moves
- Migrate the passive cluster node first; break the cluster only during a controlled maintenance window, even with low-stun tooling
—
Why Do Simple VMware Migrations Fail on Clustered Workloads?
Standard migration tools reach their design limits with clustered, high-write workloads. A typical enterprise VMware estate runs web tiers and app servers alongside Oracle RAC clusters and SQL Server FCIs – but the tooling that handles the first group cleanly causes outages on the second. Understanding exactly why requires looking at what happens inside the hypervisor during a migration snapshot.
Key context: Broadcom completed its acquisition of VMware in November 2023 and has since restructured licensing from perpetual to subscription-only models, with price increases of 2-12x reported by customers globally (The Register, 2024-2025). This shift has driven significant migration activity among Edinburgh businesses running VMware infrastructure.
The core issue is the snapshot model. Every tool that captures a VM’s disk state without an OS-level agent uses VMware’s VADP (vStorage APIs for Data Protection) to quiesce the disk at the hypervisor layer. This creates a point-in-time freeze that database clusters simply cannot tolerate.
What Is Hypervisor Snapshot Stun and Why Does It Destroy Database Clusters?
Snapshot stun is the I/O freeze a hypervisor imposes when consolidating a VM snapshot during data capture. Tools like Veeam (in VMware proxy mode), Commvault (in CBT mode), and most agentless backup products trigger a VMware snapshot before reading changed blocks. During consolidation – when the snapshot delta is merged back into the base disk – the hypervisor halts all disk writes to the VM. This freeze typically lasts 1 – 30 seconds, depending on change rate and storage latency (VMware KB 1009543).
On a web server, a 10-second I/O pause is invisible. On an Oracle RAC cluster or a write-heavy SQL Server, that same pause causes a cascade of failures that can take hours to recover from.
What the stun does to SQL Server Always On Availability Groups:
A replication stall of 10 – 30 seconds on a high-TPS SQL Server instance pushes the secondary replica beyond the AG’s session timeout threshold (10 seconds by default). The primary marks the secondary as disconnected and stops synchronising to it. Depending on the AG configuration, this either triggers an automatic failover (if the replicas are set to synchronous commit with automatic failover) or silently degrades availability, leaving the secondary unsynchronised until it reconnects and catches up.
Active client connections mid-transaction get dropped when the I/O freeze hits. Applications dependent on the SQL listener IP – JDBC connection pools, IIS connection strings, ERP backends – throw connection errors. At high TPS, even a 5-second freeze can produce hundreds of dropped connections.
What the stun does to Windows Server Failover Cluster (WSFC):
WSFC nodes communicate health via cluster heartbeat packets every second (default: 1-second probe, 5-second failure threshold – Microsoft Docs, WSFC configuration; Windows Server 2016 and later raise the same-subnet default to 10 missed heartbeats). A snapshot stun that extends beyond the threshold causes the affected node to miss consecutive heartbeats. The cluster service marks the node as failed. Cluster roles (SQL Server AG listener, file share witness, generic services) begin failover to the remaining nodes.
In a two-node cluster, this is the mechanism for split-brain: both nodes believe the other has failed and attempt to take ownership of cluster resources simultaneously. Shared disk access becomes undefined. Data corruption is possible.
What the stun does to Oracle RAC:
Oracle RAC nodes communicate via the cluster interconnect. A stun that interrupts network I/O to the interconnect interface – which can happen when the hypervisor pauses all device I/O – causes the Cluster Synchronization Service (CSS) to mark the paused node as unresponsive. The CSS initiates a node eviction (OCSSD death). The RAC cluster rebalances, re-distributes active sessions, and the evicted node requires a manual rejoin.
On a production 4-node RAC cluster, a surprise node eviction during business hours is a P1 incident, not a migration step.
Warning: Never use snapshot-based migration tooling for active Oracle RAC nodes or SQL Server Always On Availability Groups without a full maintenance window and pre-arranged failover to remaining cluster nodes. The 1 – 30 second stun is not a theoretical risk – it is a documented failure mode on any write-heavy workload with cluster heartbeat dependencies. VMware KB 1009543 explicitly documents the I/O stun behaviour during snapshot consolidation.
Citation capsule – Snapshot Stun: VMware’s hypervisor pauses all disk I/O to a VM during snapshot consolidation, producing a freeze of 1 – 30 seconds depending on change rate and storage latency (VMware KB 1009543). On Windows Server Failover Cluster nodes, the default failure threshold is 5 seconds of missed heartbeats before node failure is declared (Microsoft Docs, WSFC configuration). On Oracle RAC, CSS node eviction can trigger within the same window via the cluster interconnect health monitor.
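As a rough illustration of the timing argument above, the sketch below compares a given stun duration with the tolerance of each cluster health mechanism. The WSFC figure is the 1-second probe and 5-heartbeat threshold quoted in this section; the AG session timeout (10 seconds) and Oracle CSS misscount (30 seconds) are assumed defaults that are not stated in this article, so check them against your own configuration. The names `ClusterTimeout` and `mechanisms_at_risk` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterTimeout:
    name: str
    tolerance_s: float  # longest freeze the mechanism is expected to survive


# WSFC: 1 s heartbeat x 5 missed heartbeats = 5 s (figures quoted in this section).
# The AG and Oracle values below are ASSUMED defaults, not taken from the article.
TIMEOUTS = [
    ClusterTimeout("WSFC heartbeat", 1.0 * 5),
    ClusterTimeout("SQL Server AG session timeout (assumed 10 s)", 10.0),
    ClusterTimeout("Oracle CSS misscount (assumed 30 s)", 30.0),
]


def mechanisms_at_risk(stun_seconds: float) -> list[str]:
    """Return the mechanisms whose tolerance the stun meets or exceeds."""
    return [t.name for t in TIMEOUTS if stun_seconds >= t.tolerance_s]


if __name__ == "__main__":
    for stun in (1, 5, 12, 30):
        print(f"{stun:>2} s stun -> {mechanisms_at_risk(stun) or 'none'}")
```

Even this simplified model shows why the upper half of the 1 – 30 second stun range is dangerous: a single consolidation can cross several independent failure thresholds at once.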
What Is the Right Tool for High I/O Cluster Migration?
OS-level replication is the correct approach for clustered, high-write workloads. Tools like OpenText Migrate (formerly Carbonite Migrate, originally Double-Take) install a byte-level filter driver directly into the OS kernel on the source VM. This driver intercepts every write operation at the block level, before that write reaches the hypervisor’s virtual disk stack. The hypervisor is completely bypassed in the replication path.
How OpenText Migrate Works
The filter driver intercepts writes and asynchronously ships each changed block to the target VM over a TCP replication channel. The target VM is already running on the new hypervisor. It’s receiving a continuous stream of block-level changes from the source. The source stays fully live in production – no snapshots, no quiesce, no freeze.
This is continuous replication, not periodic backup. Every write on the source appears on the target within milliseconds to seconds, depending on replication network bandwidth and the write rate of the source. The target disk’s state continuously tracks the source.
For SQL Server workloads specifically: OpenText Migrate includes VSS (Volume Shadow Copy Service) awareness. It understands SQL Server’s write semantics. During the final cutover flush, it can coordinate with the VSS writer to produce a transactionally consistent target disk state without quiescing the database engine during the ongoing replication phase.
The cutover sequence for a cluster node using OS-level replication:
- Source is live, replication is in steady-state (delta below RPO threshold)
- Maintenance window opens
- Source cluster node is gracefully evicted from the cluster (remaining nodes continue serving traffic)
- Source VM’s writes drop to near-zero
- Replication buffer flushes – typically completes in under 60 seconds for a source at low I/O
- Target VM is promoted: DNS records updated, cluster membership reconfigured
- Total downtime for the actual switch: typically sub-60 seconds for the database instance itself
Citation capsule – OpenText Migrate: OpenText Migrate (formerly Carbonite Migrate, formerly Double-Take) installs a kernel-level block filter driver on the source VM that intercepts writes before they reach the hypervisor disk stack. Replication is continuous and asynchronous, with no snapshot or quiesce required on the source during the replication phase. VSS awareness enables transactionally consistent SQL Server cutover without quiescing the database engine (OpenText Migrate product page, 2025).
Where Other Tools Fit
Zerto uses journal-based replication at the hypervisor level via vSphere APIs. It’s excellent for non-clustered VMs and workloads where you need a continuous journal for point-in-time recovery. However, Zerto still operates within the hypervisor layer – it doesn’t install a kernel driver. For active cluster nodes with shared disk dependencies, Zerto is less suitable than OpenText Migrate because it can’t bypass the hypervisor’s view of the disk stack.
PlateSpin Migrate (now part of OpenText) uses a similar agent-based model and supports Windows and Linux workloads. It’s a viable alternative to OpenText Migrate for heterogeneous source environments.
AWS MGN (Application Migration Service) uses a lightweight replication agent on the source and is the correct tool when the target is AWS EC2. For on-premises-to-on-premises migrations between hypervisors, MGN is not applicable.
How Do You Translate VMware Shared Disks to Other Hypervisors?
Shared disks are the most architecturally complex part of any clustered VM migration. VMware provides two mechanisms for shared cluster storage: Physical Compatibility RDMs and multi-writer VMDKs. Neither has a direct equivalent on KVM-based platforms. Each requires a platform-specific translation that must be designed and tested before any data moves.
VMware RDMs (Raw Device Mappings)
A Physical Compatibility RDM presents a raw SAN LUN directly to the guest OS. The hypervisor is a pass-through; the guest sees the LUN as a local disk device with full SCSI command visibility. Oracle RAC uses RDMs for OCR (Oracle Cluster Registry), voting disks, and ASM diskgroups. Some SQL Server FCIs use RDMs for the shared quorum disk.
Multi-Writer VMDKs
VMware-specific feature: a VMDK file can be attached to multiple VMs simultaneously using the multi-writer flag in the .vmx configuration. The VMDK presents as a shared block device to all attached VMs. Used in some SQL Server FCI configurations where a SAN LUN isn’t directly zoned to the ESXi hosts.
Translating to KVM (QEMU/libvirt)
KVM has no native equivalent to the multi-writer VMDK. The correct translation for shared cluster storage on KVM is a raw format disk image on shared storage (SAN LUN, Ceph RBD, NFS with appropriate locking) configured in the libvirt XML with the <shareable/> element and cache='none' on the driver.
<!-- libvirt XML for shared raw disk on KVM cluster node -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/sdb'/>
  <target dev='vdb' bus='virtio'/>
  <shareable/>
</disk>
Warning: Do not use QCOW2 format for shared cluster disks on KVM. QCOW2 uses an internal metadata layer with locking that breaks under concurrent multi-writer access from multiple VMs. The result is silent metadata corruption, not a clean error. Shared cluster disks on KVM must use raw format. The underlying storage (SAN LUN, Ceph RBD pool) must handle concurrent write I/O correctly.
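The three requirements above (raw format, no host write cache, disk marked shareable) can be checked mechanically before a node is started. The following is a minimal sketch that inspects a single libvirt `<disk>` element using only the Python standard library; the function name `shared_disk_problems` is hypothetical, and the check covers only these three settings, not the storage path underneath.

```python
import xml.etree.ElementTree as ET


def shared_disk_problems(disk_xml: str) -> list[str]:
    """Return a list of problems found in one libvirt <disk> definition."""
    disk = ET.fromstring(disk_xml)
    driver = disk.find("driver")
    problems = []
    if driver is None or driver.get("type") != "raw":
        problems.append("driver type must be 'raw' (not qcow2)")
    if driver is None or driver.get("cache") != "none":
        problems.append("driver cache must be 'none'")
    if disk.find("shareable") is None:
        problems.append("missing <shareable/> element")
    return problems


if __name__ == "__main__":
    good = """
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/sdb'/>
      <target dev='vdb' bus='virtio'/>
      <shareable/>
    </disk>
    """
    bad = good.replace("type='raw'", "type='qcow2'").replace("<shareable/>", "")
    print(shared_disk_problems(good))
    print(shared_disk_problems(bad))
```

In practice the XML would come from `virsh dumpxml` for each cluster node; an empty result means the definition passes these checks, not that SCSI-3 PR passthrough works.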
The KVM host must also confirm that SCSI-3 Persistent Reservations pass through correctly to the guest. Run sg_persist from inside the guest to verify SCSI-3 PR registration and reservation commands are not silently dropped by the virtio-scsi or iSCSI initiator layer.
Translating to Nutanix AHV
The correct Nutanix equivalent for RDM-style shared disk access is a Volume Group. A Volume Group presents one or more iSCSI LUNs directly to guest VMs, bypassing the AHV hypervisor disk stack entirely. Each cluster node connects to the same Volume Group iSCSI target as an initiator. The Volume Group appears as a raw block device inside the guest, with full SCSI command passthrough – including SCSI-3 PR commands required by WSFC.
This is functionally equivalent to VMware Physical Compatibility RDM. Oracle ASM recognises Volume Group LUNs as raw block devices. WSFC quorum disks work via the iSCSI initiator inside the guest.
Configure Volume Groups in Prism or via the Nutanix CLI:
# Create a Volume Group (Nutanix acli)
acli vg.create ClusterSharedStorage
# Add a disk to the Volume Group
acli vg.disk_create ClusterSharedStorage create_size=100G
# Attach the VG to each cluster node VM
acli vg.attach_to_vm ClusterSharedStorage node1
acli vg.attach_to_vm ClusterSharedStorage node2
Translating to Proxmox or HPE VME
On Proxmox and HPE VM Essentials, use iSCSI or FC LUN passthrough via the virtio-scsi controller with scsi-block device type, or configure the iSCSI initiator directly inside the guest OS. The SAN LUN is zoned to both target hypervisor hosts simultaneously. Each guest VM connects to the LUN via its in-guest initiator.
This is the closest equivalent to the RDM passthrough model. SCSI-3 PR commands must be verified end-to-end: from the guest initiator, through the hypervisor virtual SCSI layer, through the host HBA or iSCSI initiator, to the storage array.
Warning: SCSI-3 Persistent Reservations are mandatory for WSFC quorum arbitration. Some HBA driver and iSCSI initiator combinations silently drop SCSI-3 PR commands rather than returning an error. The cluster appears to form correctly but quorum arbitration fails silently during a node failure event – producing a split-brain condition with no clear error message. Test SCSI-3 PR explicitly using sg_persist -n --out --register --param-sark=0x1 /dev/sdX, then confirm the key with sg_persist -n --in --read-keys /dev/sdX, before your cluster goes anywhere near production.
Citation capsule – SCSI-3 PR Testing: SCSI-3 Persistent Reservations (SCSI-3 PR) are required for Windows Server Failover Cluster quorum disk arbitration. Verification uses the sg_persist utility (part of sg3-utils) to register a reservation key and confirm the storage array acknowledges the reservation – a step that must be performed end-to-end through the hypervisor and HBA/iSCSI stack, not just at the array level (Microsoft Docs: Validate Hardware for a Failover Cluster, 2024).
Step-by-Step: The Architecture Rebuild Workflow
This is the full migration workflow for a clustered VM environment – two to four nodes, shared SAN disks, high I/O database workload. Treat each phase as a gate. Don’t proceed to the next phase until the current phase is validated.
Phase 1: Pre-Migration Assessment
1. Inventory all RDMs and multi-writer VMDKs.
# PowerCLI: export all RDM mappings across the vCenter estate
Get-VM | Get-HardDisk | Where-Object { $_.DiskType -like "Raw*" } |
    Select-Object Parent, Name, DiskType, ScsiCanonicalName, Filename |
    Export-Csv rdm-inventory.csv -NoTypeInformation
For multi-writer VMDKs, check .vmx files for per-disk entries of the form scsi1:0.sharing = "multi-writer". These are easy to miss in the standard vSphere UI disk view.
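The .vmx check can be scripted once the configuration files have been copied off the datastores. The sketch below assumes per-disk entries such as `scsi1:0.sharing = "multi-writer"` (it also tolerates a `.shared` key spelling) and a local directory of collected .vmx files; `multi_writer_disks` and `scan_vmx_tree` are hypothetical helper names.

```python
import re
from pathlib import Path

# Matches lines such as:  scsi1:0.sharing = "multi-writer"
# ASSUMPTION: the key suffix is ".sharing"; ".shared" is accepted as a fallback.
MULTI_WRITER = re.compile(
    r'^\s*(scsi\d+:\d+)\.(?:sharing|shared)\s*=\s*"multi-writer"\s*$',
    re.IGNORECASE | re.MULTILINE,
)


def multi_writer_disks(vmx_text: str) -> list[str]:
    """Return the SCSI addresses (e.g. 'scsi1:0') flagged as multi-writer."""
    return [m.group(1) for m in MULTI_WRITER.finditer(vmx_text)]


def scan_vmx_tree(root: str) -> dict[str, list[str]]:
    """Walk a directory of collected .vmx files and report multi-writer disks."""
    hits: dict[str, list[str]] = {}
    for vmx in Path(root).rglob("*.vmx"):
        found = multi_writer_disks(vmx.read_text(errors="replace"))
        if found:
            hits[str(vmx)] = found
    return hits


if __name__ == "__main__":
    sample = (
        'scsi1:0.present = "TRUE"\n'
        'scsi1:0.sharing = "multi-writer"\n'
        'scsi1:1.sharing = "multi-writer"\n'
    )
    print(multi_writer_disks(sample))
```

Cross-reference the reported SCSI addresses against the RDM inventory so that every shared disk has a planned equivalent on the target platform.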
2. Map cluster node dependencies. Document which nodes share which LUNs, the heartbeat network adapter configuration, the quorum witness type (disk witness, file share witness, cloud witness), and the cluster interconnect network segment.
3. Verify SCSI-3 PR support on the target hypervisor and storage path. Do this before provisioning any target VMs. A failed SCSI-3 PR test means the storage architecture needs to change – not something you want to discover mid-migration.
# Install sg3-utils on a test Linux VM on the target hypervisor
apt-get install sg3-utils # Debian/Ubuntu
dnf install sg3_utils # RHEL/Rocky
# Register a reservation key on the shared LUN
sg_persist -n --out --register --param-rk=0x0 --param-sark=0x1 /dev/sdX
# Confirm registration
sg_persist -n --in --read-keys /dev/sdX
Run the Windows equivalent: from a test WSFC node, run the cluster validation storage tests, either through the Validate a Configuration wizard or with Test-Cluster -Include "Storage". Storage validation includes a SCSI-3 Persistent Reservation test.
4. Capture baseline I/O profile. Size your replication network based on actual peak write throughput, not theoretical disk size.
# Linux: capture sustained I/O baseline during peak load window
iostat -x 2 1800 > io-baseline-$(hostname)-$(date +%Y%m%d).txt
# Key metrics: %util, await, w_await, wkB/s on cluster data disks
On Windows, use PerfMon with the PhysicalDisk\Disk Write Bytes/sec counter across all shared disk volumes during peak business hours.
Our experience: A shared-disk SQL cluster migration failed mid-sync due to network latency; snapshot rollbacks and utilising a secondary heartbeat network saved the database.
In our experience at Virtually Pro, replication bandwidth is consistently underestimated. A SQL Server AG secondary running at 8,000 IOPS with an average 16 KB I/O size generates roughly 128 MB/s of write throughput – requiring at minimum a dedicated 2 Gbps replication interface to keep delta sync below 30 seconds. Teams that provision replication over a shared 1 GbE management network find their replication lag grows steadily throughout the business day and never catches up.
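The arithmetic behind that sizing can be reproduced in a few lines. This is a minimal sketch using decimal units (1 MB = 1000 KB, 1 Gbps = 125 MB/s) and ignoring protocol overhead; the 2x headroom factor is an assumption chosen to line up with the 2 Gbps figure above, not a vendor recommendation, and the function names are hypothetical.

```python
def write_throughput_mb_s(iops: float, io_size_kb: float) -> float:
    """Sustained write throughput in MB/s (1 MB = 1000 KB)."""
    return iops * io_size_kb / 1000


def required_link_gbps(throughput_mb_s: float, headroom: float = 2.0) -> float:
    """Link speed in Gbit/s needed to carry the throughput with headroom.

    ASSUMPTION: headroom=2.0 is an illustrative safety factor.
    """
    return throughput_mb_s * 8 / 1000 * headroom


def backlog_after_hours(throughput_mb_s: float, link_gbps: float, hours: float) -> float:
    """MB of unsent changes queued after `hours` of sustained writes."""
    link_mb_s = link_gbps * 1000 / 8
    return max(0.0, (throughput_mb_s - link_mb_s) * 3600 * hours)


if __name__ == "__main__":
    tput = write_throughput_mb_s(iops=8000, io_size_kb=16)
    print(f"write throughput: {tput:.0f} MB/s")
    print(f"suggested link:   {required_link_gbps(tput):.2f} Gbps")
    print(f"backlog on 1 GbE after 8 h: {backlog_after_hours(tput, 1.0, 8):.0f} MB")
```

The third function illustrates the "never catches up" case: whenever the sustained write rate exceeds what the link can carry, the backlog grows linearly for as long as the load lasts.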
Phase 2: Target Environment Build
5. Provision target VMs on the new hypervisor. Match vCPU socket topology and RAM exactly to the source. Do not resize during migration – sizing changes introduce variables that make post-migration performance comparison ambiguous.
6. Configure the shared disk equivalent.
For KVM/Proxmox: create raw disk images or zone SAN LUNs to both target hosts. Add the <shareable/> element to the disk definition in the libvirt XML (see config block above). Confirm cache='none' is set – write caching on a shared disk is unsafe for cluster use.
For Nutanix AHV: create Volume Groups in Prism and attach to each target node VM (see acli vg commands above). Confirm the iSCSI initiator name for each guest matches what’s registered in the Volume Group.
For HPE VME: zone SAN LUNs to both target HPE hosts. Configure multipath inside the guest. Verify SCSI device persistence via /dev/disk/by-id/ paths before building the cluster.
7. Pre-configure cluster networks on target hosts before any data moves. The cluster interconnect, heartbeat, and management networks must be in place and routing correctly. MTU must match across old and new virtual switches. Jumbo frames (9000 MTU) must be end-to-end if the source cluster interconnect uses them – a single MTU mismatch causes cluster communication to drop silently under load, not during the initial ping test.
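A default ping uses small packets, which is why it passes even when jumbo frames are broken somewhere on the path. One way to expose the mismatch is to send a full-size packet with fragmentation disallowed. The sketch below computes the ICMP payload for a given MTU (the MTU minus the 20-byte IPv4 header and 8-byte ICMP header) and builds the matching Linux iputils `ping` command; the host name is a placeholder and `linux_ping_command` is a hypothetical helper.

```python
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def linux_ping_command(host: str, mtu: int = 9000, count: int = 3) -> list[str]:
    """Build an iputils ping command: -M do forbids fragmentation, -s sets payload size."""
    return ["ping", "-M", "do", "-s", str(icmp_payload_for_mtu(mtu)),
            "-c", str(count), host]


if __name__ == "__main__":
    # "node2-interconnect" is a placeholder for the peer's interconnect address.
    print(" ".join(linux_ping_command("node2-interconnect")))
```

Run the resulting command in both directions between every pair of nodes on the interconnect network; a failure at 9000 with success at 1500 points to a hop that is not configured for jumbo frames.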
Phase 3: Node Sync – Passive Node First
8. Install OpenText Migrate agent on the passive cluster node source VM. The passive node is safer to start with – it owns no cluster roles during normal operation. A replication agent install requires a reboot on Windows. Taking the passive node offline for a reboot doesn’t interrupt service.
9. Configure the replication job: source = passive node OS disks only. Do not include RDM disks or shared LUNs in the OS-level replication job. The cluster holds I/O locks on shared disks that prevent consistent block-level replication through an OS agent.
Replication scope:
INCLUDE: C:\ (OS volume), D:\ (SQL binaries, if separate)
EXCLUDE: E:\, F:\ (shared cluster volumes backed by RDM or multi-writer VMDK)
10. Let replication reach steady-state. Monitor the delta queue in the OpenText Migrate console. Steady-state is when the delta queue size stops growing and the replication lag stays below your RPO threshold. On a quiet passive node, this typically takes minutes to hours depending on OS disk size.
11. Replicate shared SAN LUNs at the storage layer. Array-to-array replication (HPE RMC, NetApp SnapMirror, Pure Storage ActiveCluster, EMC SRDF) is the correct mechanism for shared cluster disk replication. The array creates a consistent point-in-time copy of the LUN at the storage layer, without going through the OS or hypervisor.
# Example: HPE RMC - create a replication set for cluster LUNs
# (exact syntax varies by array model and firmware version)
# This is illustrative - consult your array vendor documentation
# NetApp SnapMirror: initialise replication for a cluster LUN volume
snapmirror initialize -source-path svm1:cluster_lun_vol \
-destination-path svm2:cluster_lun_vol_dr
12. Do not replicate RDM or shared LUNs through the OS agent. The cluster will hold SCSI reservations on these disks. An OS-level replication agent attempting to read the raw disk will either see inconsistent data (if the cluster is writing actively) or fail to open the device at all (if the cluster holds an exclusive reservation). Array-level replication is the only safe mechanism for shared cluster disk migration.
Phase 4: Cluster Break and Active Node Cutover
13. Schedule your maintenance window. Even with OS-level replication eliminating the snapshot stun risk, the cluster break itself requires a controlled window. You need the ability to gracefully fail roles, validate the target environment, and roll back if something goes wrong.
14. Gracefully fail all cluster roles to the passive node. Make the passive node the temporary active owner of all cluster resources before touching the active node.
# Windows WSFC: move all groups to the passive node
Move-ClusterGroup -Name "SQL Server (MSSQLSERVER)" -Node "node2-passive"
Move-ClusterGroup -Name "Available Storage" -Node "node2-passive"
# Verify all groups have moved successfully
Get-ClusterGroup | Format-Table Name, OwnerNode, State
For Oracle RAC, use srvctl to relocate services to the remaining nodes before evicting the node under migration.
15. Evict the original active node from the cluster.
# Remove node from WSFC
Remove-ClusterNode -Name "node1-active" -Force
For Oracle RAC:
# On the node being removed (as root): stop Clusterware on this node only
crsctl stop crs
# On a remaining node: confirm the removed node shows as Inactive
olsnodes -n -s
16. Promote the target VM using replication. In the OpenText Migrate console, initiate the cutover on the passive node’s replication job. The source is quiesced (it’s been evicted from the cluster and carries no load). The replication buffer flushes. The target is promoted to active.
17. Present shared SAN LUNs to the target active node VM. Mount the replicated SAN LUNs on the target hypervisor host. Connect them to the target node VM via the configured mechanism (Volume Group iSCSI, libvirt passthrough, or Proxmox iSCSI LUN). Run SCSI-3 PR validation from inside the guest before proceeding.
18. Re-join the migrated node to the cluster on the target hypervisor.
# Add the migrated node (now on new hypervisor) back to WSFC
Add-ClusterNode -Name "node1-target" -Cluster "cluster.domain.local"
Test-Cluster -Node "node1-target" -Include "Storage"
For Oracle RAC:
# On the migrated node, re-add to clusterware
# After ASM rescan: oracleasm scandisks
cluvfy stage -post nodeadd -n node1-target
Phase 5: Full Cluster Migration
19. Repeat Phases 3 and 4 for each remaining node. With one node now fully migrated and running on the target hypervisor, fail roles to it as you migrate subsequent nodes. Work through the cluster one node at a time.
20. Once all nodes are on the target hypervisor, verify cluster health end-to-end.
- Cluster interconnect latency: should be sub-millisecond between nodes
- Quorum configuration: confirm quorum witness type and location are correctly configured for the new network topology
- Heartbeat network: confirm dedicated heartbeat adapters are on the correct VLAN and MTU
21. Run full cluster validation.
# Windows WSFC - full validation report
Test-Cluster -Cluster "cluster.domain.local" -ReportName "post-migration-validation"
# Oracle RAC - cluster verification
cluvfy stage -post crsinst -n node1,node2,node3,node4 -verbose
22. Monitor for 24 – 48 hours before decommissioning source VMs. Keep the source VMs powered off but not deleted. If a problem surfaces in the first 48 hours – a missed application dependency, a network configuration issue, an ASM path problem – you can roll back by powering the source nodes back on and re-joining them to the original cluster.
In migrations we’ve run at Virtually Pro involving SQL Server Always On AGs with 4+ nodes, the 24-hour post-migration monitoring window catches issues in roughly 35% of cases. Most are minor: a SQL Server listener IP that wasn’t re-registered on the new subnet, a backup job pointing to the old node name, a monitoring agent still targeting the old IP. None required rollback to the source – but all required immediate remediation before the source VMs could be decommissioned safely.
Common Failure Points to Watch
Warning: These are the failure modes that appear most frequently in the first 48 hours after a clustered VM migration. Check each one explicitly – don’t assume they’re fine.
- SQL Server listener IP not re-registered on the new network segment. The AG listener IP is a cluster resource. If the new network segment has a different subnet, the listener IP must be updated in WSFC before the AG can accept client connections. Applications connecting to the listener by IP (not DNS) will fail silently until this is fixed.
- Cluster interconnect MTU mismatch. If the source cluster interconnect ran over jumbo frames (9000 MTU) and the target virtual switch is set to 1500 MTU, cluster communication works at small packet sizes but fails or retransmits under load. The cluster will appear healthy in the validation report but degrade under production traffic. Verify MTU end-to-end: from guest NIC through virtual switch through physical fabric.
- ASM diskgroup not recognising new disk paths. After presenting Volume Group LUNs or raw block devices to Oracle RAC nodes on the new hypervisor, ASM must rescan to discover the new device paths. The underlying disk data is intact, but ASM won’t mount the diskgroup until it finds the device.
# Force ASM to rescan for new block devices
oracleasm scandisks
# Check ASM disk discovery
su - grid
asmcmd lsdsk
# If diskgroup doesn't mount after rescan, mount manually
sqlplus / as sysasm
ALTER DISKGROUP data MOUNT;
When Should You Use a Maintenance Window Cold Migration Instead?
OS-level replication is the right tool for a specific set of constraints. It’s not always the right tool. For smaller SQL Server FCIs – two nodes, fewer than 10 shared disks, 200 GB or less of total shared storage, with a maintenance window of 30 – 90 minutes available – a clean shutdown-and-convert can be simpler and less error-prone than managing a continuous replication agent through a cluster break.
The cold migration path: shut both cluster nodes down gracefully, export or convert the OS disks using qemu-img convert or Veeam Restore to the new platform, present the SAN LUNs (unchanged) to the new hypervisor hosts, configure the cluster on the new platform, and power on. Total downtime is typically 30 – 90 minutes for a two-node FCI with pre-provisioned target VMs.
Our experience: For Oracle RAC and SQL Always On clusters, continuous replication with Carbonite or Zerto consistently outperforms snapshot-based approaches in our migrations.
We’ve used the cold shutdown-and-convert approach successfully for SQL Server FCIs where the shared storage was already on a SAN that could be re-zoned directly to the new hypervisor hosts without copying data. The LUN data doesn’t move at all – only the OS disks get converted. This reduces total migration time dramatically when shared storage is large and the OS disks are small.
OS-level replication is justified when:
- SLA requires less than 60 seconds of actual switchover downtime
- Cluster has 4 or more nodes (cold migration of all nodes in one window is impractical)
- Total shared storage exceeds 500 GB (array-level replication is faster than cold copy)
- The business cannot tolerate any downtime during the replication phase
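The criteria above can be expressed as a simple checklist function. This is a sketch only: it treats any single criterion as sufficient to justify OS-level replication, which is an assumption about how the list is meant to be applied, and the `ClusterProfile` fields are hypothetical names for the inputs.

```python
from dataclasses import dataclass


@dataclass
class ClusterProfile:
    nodes: int
    shared_storage_gb: float
    max_switchover_downtime_s: float
    downtime_tolerated_during_replication: bool


def replication_reasons(p: ClusterProfile) -> list[str]:
    """List which of the four criteria the cluster meets."""
    reasons = []
    if p.max_switchover_downtime_s < 60:
        reasons.append("SLA requires under 60 s switchover downtime")
    if p.nodes >= 4:
        reasons.append("cluster has 4 or more nodes")
    if p.shared_storage_gb > 500:
        reasons.append("shared storage exceeds 500 GB")
    if not p.downtime_tolerated_during_replication:
        reasons.append("no downtime tolerated during replication")
    return reasons


def recommend(p: ClusterProfile) -> str:
    # ASSUMPTION: one matching criterion is enough to prefer OS-level replication.
    return "os-level replication" if replication_reasons(p) else "cold migration candidate"


if __name__ == "__main__":
    small_fci = ClusterProfile(2, 150, 3600, True)
    big_ag = ClusterProfile(4, 2000, 30, False)
    print(recommend(small_fci), recommend(big_ag))
```

A result of "cold migration candidate" only means none of the four triggers applies; the maintenance window and shared-disk limits described earlier still have to hold.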
Frequently Asked Questions
Can Veeam be used to migrate Oracle RAC or SQL Server AG nodes without downtime?
Veeam in VMware proxy mode uses VADP snapshots, which produce the stun that triggers cluster node eviction. Veeam can be used safely if the cluster node is first gracefully evicted from the cluster before the backup/migration job runs – but that requires a maintenance window anyway. For true near-zero-downtime migration of active cluster nodes, use an OS-level agent tool like OpenText Migrate instead. Veeam is the correct tool for non-clustered VMs in the same migration wave.
What is the difference between a VMware Physical Compatibility RDM and a Virtual Compatibility RDM?
Physical Compatibility RDMs pass the raw LUN directly to the guest OS with full SCSI command transparency, including SCSI-3 PR commands required for cluster arbitration. Virtual Compatibility RDMs add a VMware virtualisation layer, which blocks some SCSI commands and prevents their use with cluster software. Oracle RAC and WSFC require Physical Compatibility RDMs – Virtual Compatibility RDMs are not supported for cluster shared disk use (VMware vSphere documentation: Raw Device Mapping, 2024).
Does Nutanix Volume Group support SCSI-3 Persistent Reservations for WSFC?
Yes. Nutanix Volume Groups present iSCSI LUNs directly to guest VMs and pass SCSI-3 PR commands through the iSCSI stack. WSFC quorum disk arbitration using SCSI-3 PR is a supported configuration with Nutanix AOS and AHV. Nutanix recommends using Volume Groups (not AHV-managed virtual disks) for any shared cluster disk that requires SCSI-3 PR support (Nutanix Solutions: Microsoft SQL Server on Nutanix, 2025).
How long does the steady-state replication phase take with OpenText Migrate before cutover?
Initial synchronisation (the first full copy) depends on OS disk size and replication bandwidth. On a 10 Gbps dedicated replication link, a 500 GB OS disk synchronises in approximately 7 – 10 minutes. Steady-state (delta-only replication) is reached once the initial sync completes and the delta queue stabilises. For a quiet passive cluster node, steady-state typically stabilises within 30 – 60 minutes of initial sync completion. The delta queue size in the OpenText Migrate console is the definitive indicator – cutover should not be initiated until the queue has been stable below your RPO threshold for at least 15 minutes.
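Both parts of that answer can be sketched numerically. The first function estimates initial sync time from disk size and link speed; the `efficiency` parameter is an assumed fudge factor for protocol and disk overhead. The second applies the 15-minute stability rule to a series of lag samples. The sample format (minute, lag in seconds) is hypothetical and is not an export format of the OpenText Migrate console.

```python
def initial_sync_minutes(disk_gb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Minutes to copy disk_gb over link_gbps (decimal units).

    ASSUMPTION: efficiency < 1.0 models protocol and storage overhead.
    """
    return disk_gb * 8 / (link_gbps * efficiency) / 60


def ready_for_cutover(samples: list[tuple[float, float]],
                      rpo_threshold_s: float,
                      stable_minutes: float = 15.0) -> bool:
    """True if lag has stayed below the RPO threshold for the trailing window.

    samples: (minute, lag_seconds) pairs sorted by time, oldest first.
    """
    if not samples:
        return False
    latest_minute = samples[-1][0]
    stable_since = None
    for minute, lag in reversed(samples):
        if lag >= rpo_threshold_s:
            break  # most recent breach ends the stable run
        stable_since = minute
    return stable_since is not None and latest_minute - stable_since >= stable_minutes


if __name__ == "__main__":
    print(f"{initial_sync_minutes(500, 10):.1f} min at line rate")
    print(f"{initial_sync_minutes(500, 10, efficiency=0.7):.1f} min at 70% efficiency")
    history = [(0, 40), (5, 8), (10, 6), (15, 5), (20, 4)]
    print(ready_for_cutover(history, rpo_threshold_s=10))
```

At line rate the 500 GB copy takes a little under 7 minutes, and an assumed 70% efficiency puts it near the top of the 7 – 10 minute range quoted above.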
What happens to Oracle ASM diskgroups when SAN LUNs are re-presented via a new hypervisor?
ASM uses device paths (e.g., /dev/oracleasm/disks/DATA01) that are maintained by the oracleasm kernel module. When LUNs are re-presented via a new hypervisor with new device node identifiers, ASM cannot find its disks by the old path names. Running oracleasm scandisks forces the module to re-discover all accessible block devices and re-register them under their ASM disk names. If the diskgroup still doesn’t mount after a rescan, check that the device is visible at the OS level (fdisk -l) and that the ASM disk header is intact using kfed read /dev/sdX.