
The Hidden Linux Defaults That Are Slowing Down Your Ceph Cluster

Five kernel settings that make the difference between mediocre and outstanding Ceph performance in 2026 — and why the defaults are the wrong choice for storage clusters.

Hostzero Team
February 2026

If you're running a Ceph cluster in 2026 — whether as a backend for Proxmox VMs, Kubernetes persistent volumes, or as a sovereign S3 replacement — you've invested in solid hardware: NVMe SSDs, 25 GbE networking, enough RAM per OSD. And yet the latency fluctuates — P99 values spike sporadically even though the hardware should be keeping up just fine.

The root cause often isn't Ceph itself, but one layer below: the Linux kernel. Every distribution ships defaults optimized for general-purpose workloads — web servers, databases, batch processing. Linux maximizes throughput, not latency. For Ceph, that's a problem: a storage cluster serving VMs, containers, and databases needs predictable, low response times on every single I/O operation.

This article covers five kernel settings that should be checked and adjusted on every Ceph node — including the background on why the defaults are problematic. The recommendations apply equally to current Ceph releases (Squid 19.2.x, Tentacle 20.2.x) on current Linux distributions (Debian 12, Ubuntu 24.04, Rocky Linux 9).

| Setting | Default | Impact on Ceph | Root Cause |
|---|---|---|---|
| vm.swappiness | 60 | 10–100 µs per page fault on NVMe swap; 1–5 ms on network storage | OSD heap evicted to disk |
| Transparent Huge Pages | always | 10–50 ms compaction stalls | khugepaged defragments in the background |
| CPU governor | powersave/ondemand | 10–50 µs frequency ramp | DVFS transitions under variable load |
| C-States | all enabled | 50–100 µs wake latency | CPU must restore voltage and clock |
| I/O scheduler | mq-deadline | Unnecessary overhead on NVMe | Scheduling logic NVMe doesn't need |

These defaults exist for good reason: they save power and maximize overall throughput. For 99% of workloads, that's the right call. A Ceph cluster that needs to deliver consistent IOPS at low latency belongs to the other 1%.

1. Set Swappiness to 0

Why This Is Critical for Ceph

With vm.swappiness = 60, the kernel treats file-backed pages (page cache) and anonymous pages (heap, stack) roughly equally when deciding what to evict under memory pressure. This means: the heap of an OSD process can be swapped to disk even though there are page cache entries that could safely be discarded instead.

Ceph OSD daemons hold BlueStore caches, RocksDB block caches, and various internal data structures in their heap. When that heap gets swapped to disk and then needs to be loaded back via page fault, the OSD thread stalls — right in the middle of a client I/O operation. On NVMe, a swap read takes 10–100 µs. On slower storage, significantly longer.

This becomes particularly critical during recovery operations: Ceph OSDs consume significantly more memory during backfill and recovery than in steady state. The Ceph documentation recommends provisioning at least 8 GB of RAM per OSD — precisely because of this overhead during peak periods. In hyper-converged setups with Proxmox VE 9 or Kubernetes on the same nodes, the risk of memory pressure is further increased.

The Fix

bash
# Check current value
sysctl vm.swappiness

# Set persistently
echo 'vm.swappiness = 0' | sudo tee /etc/sysctl.d/90-ceph.conf
sudo sysctl -p /etc/sysctl.d/90-ceph.conf

With swappiness=0, anonymous pages are only swapped when the system is critically low on memory. The page cache is evicted first — that's safe because file contents can always be re-read from disk.

Verification

bash
# Monitor swap activity (si/so columns should stay at 0)
vmstat 1 | awk 'NR > 2 {print $7, $8}'

# Check current swap usage
free -h

Important: Setting swappiness to 0 does not disable swap. It means the kernel keeps the heap in RAM as long as possible. Under severe memory pressure, swapping still occurs — better than an OOM kill.

2. Disable Transparent Huge Pages

Why THP and Ceph Don't Mix

Transparent Huge Pages (THP) automatically merge 4 KB pages into 2 MB pages to reduce TLB misses. Sounds good, but the mechanism comes at a cost: the kernel thread khugepaged continuously scans memory looking for pages that can be merged. To do this, it needs to perform memory compaction — moving physical pages around to create contiguous 2 MB regions.

During this compaction, the kernel holds locks that can block memory allocations. An OSD thread requesting memory at that moment waits 10–50 milliseconds. Not microseconds — milliseconds. These are the kind of spikes that turn a normal P99 value into an outlier.
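
To see whether compaction is actually stalling allocations on a node, the kernel's counters in /proc/vmstat can be sampled directly. A minimal sketch; the 5-second window is an arbitrary choice:

```shell
# Sample the direct-compaction stall counter over a short window.
# compact_stall increments each time an allocation had to wait for
# memory compaction - exactly the stall that THP provokes.
read_stalls() {
  awk '/^compact_stall / {print $2}' /proc/vmstat
}

window=5
before=$(read_stalls); before=${before:-0}
sleep "$window"
after=$(read_stalls); after=${after:-0}
delta=$(( after - before ))
echo "direct compaction stalls in ${window}s: $delta"
```

A steadily climbing counter during client I/O is a strong hint that THP is costing you tail latency.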

Ceph has officially addressed this issue: since 2019, a tracker entry identifies THP as problematic for Ceph daemons. The recommendation is to disable THP system-wide until selective usage via madvise is implemented. Both ceph-ansible and cephadm set disable_transparent_hugepage to True by default — this recommendation has not changed with Ceph Tentacle (2025) or current kernel versions.

An additional problem: THP interacts poorly with the memory allocators Ceph uses (tcmalloc, jemalloc). Reports from the Ceph community show that THP with tcmalloc can lead to uncontrolled memory growth — RSS usage rises above the configured osd_memory_target, in the worst case all the way to an OOM kill.
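
Whether an OSD is drifting above its memory budget can be checked by comparing its resident set size against the target. A small helper sketch; the pgrep pattern and the osd.3 id in the comments are placeholder examples, adjust them to your cluster:

```shell
# Compare a process's resident set size against a byte budget.
# For a real OSD you might use (illustrative, adjust the id):
#   pid=$(pgrep -o -f 'ceph-osd.*--id 3')
#   target=$(ceph config get osd.3 osd_memory_target)
check_rss() {
  pid=$1
  target_bytes=$2
  rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
  rss_bytes=$(( rss_kb * 1024 ))
  echo "PID $pid RSS: $rss_bytes bytes (budget: $target_bytes)"
  [ "$rss_bytes" -le "$target_bytes" ]
}

# Demo: check the current shell against an 8 GiB budget
check_rss "$$" $(( 8 * 1024 * 1024 * 1024 ))
```

Keep in mind that osd_memory_target is a best-effort target, so a modest overshoot is normal; the pathological case described above is RSS growing far beyond it.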

The Fix

bash
# Check status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never

# Disable immediately
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# Make persistent via systemd service
cat <<'EOF' | sudo tee /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable disable-thp.service

Verification

bash
# Monitor compaction activity (should not be increasing)
watch -n1 'grep -E "compact_|thp_" /proc/vmstat'

# Check huge page usage
grep AnonHugePages /proc/meminfo

Trade-off: Without THP, you lose automatic huge page optimization. For Ceph, that's no loss — BlueStore manages its own cache and does not benefit from transparent huge pages.

3. Set CPU Governor to Performance

Why Frequency Scaling Causes Ceph Latency

Modern CPUs use Dynamic Voltage and Frequency Scaling (DVFS) to save power. The ondemand governor (a common default) monitors CPU utilization and only increases frequency when load rises.

The problem for Ceph: OSD workloads are bursty. An OSD is briefly idle, then a client request arrives, then it's idle again. The governor sees low utilization and keeps the frequency low. The first instructions of an incoming I/O request run at reduced clock speed — the frequency ramp takes 10–50 µs.

This effect has been specifically measured in the Ceph community: in a benchmark with NVMe SSDs, performance with the powersave governor was significantly below that with the performance governor, because the CPU at moderate utilization (around 27%) never scaled up at all. Only switching to performance delivered the expected IOPS.

The Ceph blog also explicitly recommends the network-latency or latency-performance TuneD profile for all-flash deployments — which, among other things, sets the CPU governor to performance.

The Fix

bash
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set on all cores
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance | sudo tee "$cpu"
done

# Or via TuneD (recommended for Ceph nodes)
sudo tuned-adm profile latency-performance

Verification

bash
# Check frequency (should be constant at maximum)
grep MHz /proc/cpuinfo | sort -t: -k2 -n | tail -4

# Via turbostat (if installed)
turbostat --interval 1 --show Core,CPU,Bzy_MHz

Trade-off: Higher power consumption. In a data center where Ceph nodes run 24/7 anyway, that's an acceptable price for consistent latencies.

4. Restrict Deep C-States

Why Idle CPUs Become a Latency Problem

Even with the performance governor, idle CPU cores enter sleep states (C-states) to save power:

| C-State | What Happens | Wake Latency |
|---|---|---|
| C0 | Active | 0 |
| C1 | Clock stopped | 1–5 µs |
| C1E | Clock + voltage reduced | 5–10 µs |
| C3 | L1/L2 cache cold | 30–50 µs |
| C6 | Voltage cut, state saved to RAM | 50–100 µs |

A Ceph OSD is waiting for the next client request. The CPU drops into C6. The request arrives — and the CPU needs 50–100 µs to become fully active again. This latency adds up on every I/O operation.
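
Which idle states a node offers, and how expensive each is to wake from, can be read straight from sysfs. A small sketch; state names and latencies vary by CPU model and idle driver:

```shell
# Enumerate cpu0's idle states with the exit latency the driver
# advertises. On a tuned storage node, everything deeper than C1
# should show disabled=1.
list_cstates() {
  for d in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
    [ -d "$d" ] || continue
    printf '%s: %s, exit latency %s us, disabled=%s\n' \
      "$(basename "$d")" "$(cat "$d/name")" \
      "$(cat "$d/latency")" "$(cat "$d/disable")"
  done
}
list_cstates
```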

This effect is well documented for Ceph: a case on the Ceph mailing list reports that after setting cpu_dma_latency=0 on all OSD nodes, recovery throughput rose from ~16 GB/s to ~22 GB/s, an improvement of more than a third. The CPUs then ran in turbo instead of dropping into deep C-states.

The Fix

Option 1: Kernel boot parameters (recommended)

bash
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub:
processor.max_cstate=1 intel_idle.max_cstate=0

# Update GRUB
sudo update-grub  # Debian/Ubuntu
# or
sudo grub2-mkconfig -o /boot/grub2/grub.cfg  # RHEL/Rocky

Option 2: At runtime (temporary)

bash
# Disable C-states > C1
for state in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
  echo 1 | sudo tee "$state"
done

Option 3: Via /dev/cpu_dma_latency (as in the Ceph mailing list case)

bash
# Keeps CPUs in C0/C1 as long as the file descriptor is open
exec 3>/dev/cpu_dma_latency
echo -ne '\x00\x00\x00\x00' >&3

Verification

bash
# Check C-state residency
turbostat --interval 1 --show Core,CPU%c1,CPU%c3,CPU%c6
# CPU%c3 and CPU%c6 should be at 0

Note on AMD systems: AMD Rome processors (EPYC 7002 series) and newer appear to be less sensitive to C-state transitions. Nevertheless, the official Ceph benchmarks recommend restricting C-states.

5. Adjust I/O Scheduler for NVMe

Why NVMe Drives Don't Need a Scheduler

For traditional HDDs and SATA SSDs, the Linux I/O scheduler manages the order of requests to minimize seek times. The default mq-deadline prioritizes reads and arranges writes contiguously.

NVMe drives have their own controllers with deep internal queues and don't need this optimization. Every request the kernel reorders or delays is wasted CPU time and additional latency. For NVMe drives, the scheduler should be set to none — a pure pass-through.

This is also recommended in the Ceph community and by enterprise distributions like Red Hat: enterprise NVMe SSDs with their own power-safe caches and controllers perform best with none.

The Fix

bash
# Check current scheduler
cat /sys/block/nvme0n1/queue/scheduler

# Set to none (per NVMe device)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

# Make persistent via udev rule
cat <<'EOF' | sudo tee /etc/udev/rules.d/60-ceph-scheduler.rules
# NVMe: no scheduler
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
# SATA/SAS SSD: mq-deadline
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# HDD: mq-deadline
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
EOF

sudo udevadm control --reload-rules

Verification

bash
# Check for all block devices
for dev in /sys/block/*/queue/scheduler; do
  echo "$dev: $(cat $dev)"
done

Bonus: Ceph-Specific Settings

Beyond kernel tunings, there are some Ceph-internal knobs that become relevant in combination with the above changes:

BlueStore memory target: With the kernel tunings in place (no swap, no THP), BlueStore can use its cache more effectively. The default of 4 GB per OSD is a compromise; for NVMe deployments, it's worth going to 8 GB or more:

bash
ceph config set osd osd_memory_target 8589934592  # 8 GB
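
Whether 8 GB per OSD is actually affordable depends on what else the node runs. A back-of-the-envelope sketch for a hyper-converged node; all figures are illustrative assumptions, substitute your own:

```shell
# Rough RAM budget for a hyper-converged node.
# All values below are illustrative assumptions:
total_gb=256        # physical RAM in the node
vm_reserved_gb=128  # committed to VMs / pods
os_reserved_gb=16   # OS, page cache headroom, monitoring
osd_count=10

available_gb=$(( total_gb - vm_reserved_gb - os_reserved_gb ))
per_osd_gb=$(( available_gb / osd_count ))
echo "RAM available per OSD: ${per_osd_gb} GiB"

if [ "$per_osd_gb" -lt 8 ]; then
  echo "WARNING: below 8 GiB/OSD - expect pressure during recovery"
fi
```

Remember that OSDs exceed their steady-state footprint during backfill and recovery, so the budget should leave headroom beyond osd_memory_target.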

NUMA pinning: On dual-socket systems, OSD processes should be pinned to the same NUMA node as their associated NVMe drives and NICs. Every access across the QPI/UPI link between sockets adds latency.
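
As a sketch of how such pinning can be wired up: device names, the chosen NUMA node, and the drop-in path below are illustrative, and with drives attached to both sockets you would split the OSDs across nodes rather than bind them all to one:

```shell
# Find the NUMA node of an NVMe drive and a NIC
# (nvme0n1 and enp65s0 are placeholders for your hardware):
cat /sys/block/nvme0n1/device/numa_node
cat /sys/class/net/enp65s0/device/numa_node

# Bind the OSDs on this host to NUMA node 0 via a systemd drop-in
# (NUMAPolicy/NUMAMask require systemd >= 243):
sudo mkdir -p /etc/systemd/system/ceph-osd@.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ceph-osd@.service.d/numa.conf
[Service]
NUMAPolicy=bind
NUMAMask=0
EOF
sudo systemctl daemon-reload
```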

Networking: For Ceph clusters with 10 GbE or more, jumbo frames (MTU 9000) should be enabled, provided all switches in the storage network support it. With modern Linux kernels (6.x), the TCP defaults are already well tuned — many of the older tuning guides with adjusted buffer sizes are no longer useful or even counterproductive on current kernels.
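
A minimal sketch for enabling and, more importantly, verifying jumbo frames; the interface name and peer address are placeholders, and the MTU change should also be persisted in your distribution's network configuration:

```shell
# Raise the MTU on the storage interface (enp65s0 is a placeholder):
sudo ip link set dev enp65s0 mtu 9000

# Verify end-to-end: 8972 bytes of ICMP payload + 8 bytes ICMP header
# + 20 bytes IP header = 9000. -M do forbids fragmentation, so this
# ping only succeeds if every hop really forwards jumbo frames.
ping -M do -s 8972 -c 3 10.0.40.12   # example: peer OSD node's storage IP
```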

Quick Audit: Everything at a Glance

bash
#!/bin/bash
echo "=== Swappiness ==="
sysctl vm.swappiness

echo "=== Transparent Huge Pages ==="
cat /sys/kernel/mm/transparent_hugepage/enabled

echo "=== CPU Governor ==="
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null || echo "N/A"

echo "=== C-States (C3+) ==="
for state in /sys/devices/system/cpu/cpu0/cpuidle/state[2-9]/disable; do
  [ -f "$state" ] && echo "$(dirname $state | xargs basename): disabled=$(cat $state)"
done

echo "=== I/O Scheduler (NVMe) ==="
for dev in /sys/block/nvme*/queue/scheduler; do
  [ -f "$dev" ] && echo "$(echo $dev | grep -o 'nvme[^/]*'): $(cat $dev)"
done

Design Philosophy: When These Tunings Make Sense

Each of the optimizations described trades throughput or power consumption for latency:

| Optimization | What We Give Up | What We Gain |
|---|---|---|
| swappiness=0 | Page cache efficiency under memory pressure | OSD heap stays in RAM |
| THP deactivated | Automatic huge pages | No compaction stalls |
| Performance governor | Power savings | No frequency ramp delays |
| C-state limits | Idle power consumption | Predictable wake latency |
| I/O scheduler none | Request reordering | Direct NVMe access |

These tunings are not right for every use case. On a development system, a pure batch processing server, or a memory-constrained system, the defaults may be the better choice.

For a production Ceph cluster that needs to deliver consistent IOPS and low latencies for VMs, containers, or database backends, they are essential. Especially in all-flash deployments with NVMe, the hardware is often not the bottleneck — rather, it's the kernel, which was optimized for a different use case. If you run sovereign open-source infrastructure in your own data center, you have the advantage of controlling these knobs yourself — unlike proprietary storage appliances or cloud storage services, where this layer remains invisible.

Outlook 2026: What's Changing in Ceph — and What Isn't

Ceph Tentacle (20.2.x) — The Current Stable Release

With Ceph Tentacle, the current stable release since November 2025, there have been notable improvements at the storage engine level. BlueStore received a faster write-ahead log (WAL), and OMAP iteration was accelerated — which primarily improves RGW bucket listings and scrub operations. Additionally, the hybrid_btree2 allocator was backported from Squid, which delivers significantly better allocation times on fragmented storage compared to the older hybrid allocator.

For the kernel tunings described in this article, Tentacle changes nothing: BlueStore still relies on the Linux kernel for I/O, memory management, and CPU scheduling. The recommendations for swappiness, THP, CPU governor, C-states, and I/O scheduler remain unchanged.

Crimson and SeaStore — The OSD Rewrite

The most exciting project for Ceph performance in the coming years is Crimson — an OSD daemon rewritten from scratch that is intended to replace the classic ceph-osd long-term. Crimson is based on the Seastar framework and follows a consistent shared-nothing architecture: each CPU core runs in its own reactor, without locks and with minimal cross-core communication.

With Tentacle, the first practically usable tech preview is available: Crimson-OSD with SeaStore (the new, native object store for NVMe) can now be deployed via cephadm. Current benchmarks from the Ceph blog show that SeaStore already significantly outperforms the classic OSD on 4K random reads — achieving up to 400K IOPS compared to 130K IOPS on identical hardware. Sequential workloads are at comparable levels. Random writes are still an active area of optimization.

What does this mean for Linux kernel tuning? Crimson with SeaStore will ultimately depend less on the kernel: SeaStore can access NVMe drives directly via SPDK (kernel bypass), and the Seastar architecture avoids many of the thread synchronization issues that slow down the classic OSD during THP compaction and C-state transitions. However, Crimson is not yet production-ready — the Ceph project itself explicitly recommends it only for testing and experimentation. The community expects a timeline to GA similar to BlueStore: several years from tech preview to production readiness.

What This Means for 2026

For everyone running Ceph in production today — and that applies to the vast majority of installations on Squid or Tentacle with the classic OSD and BlueStore — the kernel tunings described in this article remain the most effective lever for consistent, low latencies. The optimizations cost nothing beyond a bit of extra power and can be applied on any node in minutes.

In parallel, it's worth keeping an eye on the Crimson development. If you're planning a new cluster and going NVMe-only, you can already evaluate Crimson in a test environment — with the knowledge that its architecture will eventually make many of the kernel limitations described here obsolete. The next Ceph release after Tentacle (codename: Umbrella) will show how far Crimson has come by then.
