Proxmox 8 with Ceph for small-business HA: when it makes sense and when it does not

We get asked this a lot: “We have three servers and a 10-person office. Should we build a Proxmox cluster on Ceph for high availability?” The honest answer is: sometimes yes, often no. This post is the checklist we wish someone had given us before our first hyperconverged build.

What Proxmox + Ceph actually buys you

A Proxmox VE 8 cluster with Ceph RBD as the VM storage backend gives you three things that classic shared-storage setups struggle to match at the same price:

Node-loss tolerance. With a healthy 3-node Ceph cluster (size=3, min_size=2), one node can die and VMs keep running. Proxmox HA restarts the VMs that were on the dead node within roughly a minute or two.
No external SAN. Storage lives on the same boxes as the hypervisors. No separate iSCSI/NFS appliance to license, patch, or replace.
Live migration and rolling upgrades. You can drain a node, patch the kernel, and move on without scheduling downtime.

That is genuinely useful for a business that loses real money when the file server or the ERP database is down for an afternoon.
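
The drain-and-patch routine from the last point can be sketched as a pair of small shell functions. This is a minimal sketch, not a tested runbook: `run`, `maintenance_begin`, `maintenance_end` and the `DRY_RUN` variable are our own hypothetical names, `pve1` is a placeholder node name, and it assumes a Proxmox VE release where `ha-manager crm-command node-maintenance` is available (7.3 or newer, as far as we know). By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of a rolling-maintenance sequence for one node.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Call before patching: keep Ceph from rebalancing and move HA guests away.
maintenance_begin() {
  node="$1"
  # noout stops Ceph from marking the node's OSDs "out" while it reboots.
  run ceph osd set noout
  # HA-managed guests are migrated off; guests not under HA must be moved by hand.
  run ha-manager crm-command node-maintenance enable "$node"
}

# Call after the node is back and all PGs are active+clean again.
maintenance_end() {
  node="$1"
  run ha-manager crm-command node-maintenance disable "$node"
  run ceph osd unset noout
}

maintenance_begin pve1
maintenance_end pve1
```

Between the two calls you patch and reboot the node, then wait for `ceph -s` to show all placement groups as active+clean before moving on to the next one.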

When it makes sense

In our experience the sweet spot looks like this:

Three or more dedicated nodes, identical or near-identical. Two-node Ceph with an external monitor is possible but fragile; just don’t.
A real storage network. 10 GbE minimum for the Ceph cluster network, 25 GbE if you can afford it. Separate VLAN or, better, separate physical NICs from VM and management traffic.
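Enterprise NVMe or at least enterprise SATA SSDs with power-loss protection. Consumer drives look fine in the first days of testing, then fall over under Ceph’s sync write pattern: without PLP every flush hits the flash directly, so write latency climbs and the drives wear out fast.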
Workloads that justify the complexity: a Postgres or MSSQL instance, a Windows AD/file server people depend on, a line-of-business app like 1C or an ERP. Anything where one hour of downtime costs more than a few hundred euros.
Someone on staff (or on retainer) who can read ceph -s and not panic.
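
Before trusting the storage network, it is worth confirming that the Ceph NICs actually negotiated the speed you paid for. The following is a rough sketch, assuming a Linux host that exposes link speed in Mb/s under `/sys/class/net/<iface>/speed`; `link_speed_ok` and the `SYSFS_NET` override are hypothetical names of ours, and the interface name in the usage note is a placeholder.

```shell
#!/bin/sh
# Sketch: check that a NIC negotiated at least the expected link speed.
# SYSFS_NET is a hypothetical override so the function can be tried off-box.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

link_speed_ok() {
  iface="$1"
  min_mbps="${2:-10000}"          # 10 GbE unless told otherwise
  speed_file="$SYSFS_NET/$iface/speed"
  if [ ! -r "$speed_file" ]; then
    echo "$iface: cannot read $speed_file"
    return 2
  fi
  # The kernel reports -1 (or a read error) when the link is down or unknown.
  speed="$(cat "$speed_file" 2>/dev/null || echo -1)"
  if [ "$speed" -ge "$min_mbps" ]; then
    echo "$iface: ${speed} Mb/s OK"
  else
    echo "$iface: ${speed} Mb/s is below ${min_mbps} Mb/s"
    return 1
  fi
}
```

Usage would look like `link_speed_ok ens1f0 10000`. Link speed alone proves little, so follow it with a throughput run (`iperf3 -s` on one node, `iperf3 -c <peer>` on another) and, if you use jumbo frames, `ping -M do -s 8972 -c 3 <peer>` to confirm a 9000-byte MTU really passes end to end.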

When it does not make sense

We’ve seen these patterns end badly:

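Two nodes plus a Raspberry Pi as a tiebreaker. Quorum works, Ceph does not. With only two OSD hosts you cannot place three replicas on separate hosts, and once one node is gone, all I/O and all PG recovery land on the single surviving host, which is brutal.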
1 GbE “because the office switch has free ports.” Ceph will technically run. Latency on writes will be terrible, and a single failed OSD will saturate the link for hours during backfill.
Mixed consumer SSDs of various sizes. CRUSH balancing becomes a guessing game and write amplification kills the cheaper drives first.
One sysadmin who has never touched Ceph. When the cluster goes HEALTH_WARN at 22:00 on a Friday, Googling “PG inconsistent” is not a recovery plan.
Workloads that don’t need HA. A dev/test lab, a couple of internal tools, a print server — these are fine on a single Proxmox host with good backups to PBS.

Rule of thumb: if your RTO is “by tomorrow morning” and your RPO is “last night’s backup,” you do not need Ceph. A single well-maintained Proxmox host plus Proxmox Backup Server and ZFS replication to a second box will serve you better at a fraction of the cost.

A reasonable 3-node baseline

For a small business that genuinely needs HA, this is the minimum we’d quote without wincing:

3x server-grade nodes, each with:
- 1 CPU, 16+ cores, modern generation
- 128–256 GB ECC RAM
- 2x enterprise NVMe (1.92 TB+) for OSDs, with PLP
- 2x small SSDs mirrored for the Proxmox OS
- 2x 10/25 GbE for Ceph (LACP or separate public/cluster), 2x 1/10 GbE for VMs and management
A separate physical box (or a VM elsewhere) running Proxmox Backup Server
A managed switch that actually supports jumbo frames and LACP

Sizing the pool

With size=3 (three replicas), usable capacity is roughly one third of raw, and you should plan to stay under ~70% full to leave room for recovery. So 6x 1.92 TB NVMe across three nodes gives ~11.5 TB raw, ~3.8 TB usable, and you should treat ~2.7 TB (70% of 3.84 TB) as the practical ceiling.
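
The same arithmetic as a small helper, so you can plug in your own drive counts. This is only a sketch of the rule of thumb above: `ceph_capacity` is a hypothetical name, and it ignores filesystem and metadata overhead, so real numbers will come out a little lower.

```shell
#!/bin/sh
# Sketch: raw, usable and "practical ceiling" capacity for a replicated pool.
# Arguments: number of OSD drives, drive size in TB, replica count, max fill ratio.
ceph_capacity() {
  awk -v n="$1" -v tb="$2" -v size="$3" -v fill="$4" 'BEGIN {
    raw     = n * tb          # total raw capacity
    usable  = raw / size      # one copy of the data out of "size" replicas
    ceiling = usable * fill   # headroom left for recovery
    printf "raw=%.2f usable=%.2f ceiling=%.2f\n", raw, usable, ceiling
  }'
}

ceph_capacity 6 1.92 3 0.70
# prints: raw=11.52 usable=3.84 ceiling=2.69
```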

Quick check after install:

ceph -s
ceph osd df tree
ceph df

And for the pool used by Proxmox VMs:

ceph osd pool get <pool> size
ceph osd pool get <pool> min_size

Keep size=3, min_size=2. Setting min_size=1 to “survive two failures” is how people lose data.
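
If you want this check in a cron job or a monitoring hook, something like the following would do. It is a sketch under assumptions: `check_pool_replication` is a hypothetical helper, the pool name is a placeholder, and it relies on `ceph osd pool get` printing lines of the form `size: 3` and `min_size: 2`.

```shell
#!/bin/sh
# Sketch: warn when a pool is not running with size=3, min_size=2.
check_pool_replication() {
  pool="$1"
  # Output looks like "size: 3"; keep only the value after the colon.
  size="$(ceph osd pool get "$pool" size | awk '{print $2}')"
  min_size="$(ceph osd pool get "$pool" min_size | awk '{print $2}')"
  if [ "$size" = "3" ] && [ "$min_size" = "2" ]; then
    echo "$pool: size=$size min_size=$min_size OK"
  else
    echo "$pool: size=$size min_size=$min_size, expected size=3 min_size=2"
    return 1
  fi
}
```

Call it as `check_pool_replication <pool>`. If someone did lower the value, `ceph osd pool set <pool> min_size 2` puts it back.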

Failure modes we have actually hit

Single OSD disk fails. Ceph rebalances automatically. If the cluster network is 10 GbE+ and the pool isn’t full, users notice nothing. Replace the disk, re-create the OSD (pveceph osd create or the GUI, which drive ceph-volume underneath), done.
One node hard-down. VMs on that node are fenced and restarted on the survivors via Proxmox HA. Ceph runs degraded but min_size=2 is satisfied. Bring the node back, it re-joins, PGs recover.
Two nodes down at once. Ceph blocks I/O. This is correct behavior — the alternative is split-brain. Plan capacity and maintenance windows so this can’t happen by accident.
Network partition between nodes. This is the scariest one. Run Corosync on a dedicated link separate from Ceph and from VM traffic. We’ve seen a saturated Ceph backfill knock Corosync out of quorum and trigger an HA storm. Separate links, or at least separate VLANs with QoS.
Clock skew. Ceph monitors are unforgiving about NTP. Make sure chrony is healthy on every node before you blame anything else.
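
For the clock-skew case, a quick per-node check can be scripted around `chronyc tracking`. The sketch below is illustrative only: `clock_offset_ok` is a hypothetical name, it assumes the `System time` line has the offset in seconds as its fourth field, and the 0.05 s default mirrors what we understand to be Ceph's default monitor drift allowance (`mon_clock_drift_allowed`), so verify that value on your release.

```shell
#!/bin/sh
# Sketch: read `chronyc tracking` output on stdin and fail when the current
# offset from NTP time exceeds a limit in seconds (default 0.05).
clock_offset_ok() {
  limit="${1:-0.05}"
  awk -v limit="$limit" '
    /^System time/ { offset = $4; found = 1 }
    END {
      if (!found) exit 2            # no "System time" line in the input
      printf "offset=%ss limit=%ss\n", offset, limit
      exit ((offset + 0 > limit + 0) ? 1 : 0)
    }'
}
```

Run it on each node as `chronyc tracking | clock_offset_ok`. For the rest of the list, the commands we reach for first are `ceph health detail`, `ceph time-sync-status`, `pvecm status`, `corosync-cfgtool -s` and `ha-manager status`.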

Cheaper alternatives to consider first

Before committing to Ceph, ask whether one of these is enough:

Single Proxmox host + ZFS + PBS. Daily incremental backups, tested restore procedure. Hardware failure means hours of downtime, not days. For many small offices this is the right answer.
Two Proxmox hosts with ZFS replication (pvesr). Async replication every few minutes. Manual or scripted failover. RPO of minutes, RTO of “however fast your admin types.” No shared storage, no Ceph.
Proxmox cluster with an external NFS/iSCSI box. Simpler than Ceph, but the storage box becomes the single point of failure unless you pay for a dual-controller unit — at which point Ceph is usually cheaper.
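
For the two-host ZFS replication option, a job is created per guest with `pvesr`. The helper below only prints the command so you can review it first. It is a sketch: `replication_job_cmd` is a hypothetical name, VM ID 100 and node name `pve2` are placeholders, and it assumes both hosts are in the same Proxmox cluster with a ZFS storage of the same name on each.

```shell
#!/bin/sh
# Sketch: build the pvesr command for replicating one guest every N minutes.
replication_job_cmd() {
  vmid="$1"
  target="$2"
  minutes="${3:-15}"
  # Job IDs have the form <vmid>-<job number>; 0 is used here for the first job.
  echo "pvesr create-local-job ${vmid}-0 ${target} --schedule \"*/${minutes}\""
}

replication_job_cmd 100 pve2 5
# prints: pvesr create-local-job 100-0 pve2 --schedule "*/5"
```

The schedule interval is effectively your RPO, and `pvesr status` shows whether the last sync succeeded.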

The honest summary

Proxmox 8 with Ceph is a great fit when you have at least three proper nodes, a real 10/25 GbE storage network, enterprise SSDs, and an operator who is comfortable with the stack. In that configuration it is genuinely production-grade and, in our experience, significantly cheaper than the equivalent VMware + SAN build.

It is the wrong tool when the budget forces consumer SSDs, 1 GbE, or two-node compromises. In those cases a single well-backed-up Proxmox host — or a two-node ZFS-replication setup — will be more reliable than a fragile Ceph cluster, and far easier to recover when something goes wrong at 02:00.