The Hidden Linux Defaults That Are Slowing Down Your Ceph Cluster
Five kernel settings that make the difference between mediocre and outstanding Ceph performance in 2026 — and why the defaults are the wrong choice for storage clusters.
If you're running a Ceph cluster in 2026 — whether as a backend for Proxmox VMs, Kubernetes persistent volumes, or as a sovereign S3 replacement — you've invested in solid hardware: NVMe SSDs, 25 GbE networking, enough RAM per OSD. And yet the latency fluctuates — P99 values spike sporadically even though the hardware should be keeping up just fine.
The root cause often isn't Ceph itself, but one layer below: the Linux kernel. Every distribution ships defaults optimized for general-purpose workloads — web servers, databases, batch processing. Linux maximizes throughput, not latency. For Ceph, that's a problem: a storage cluster serving VMs, containers, and databases needs predictable, low response times on every single I/O operation.
This article covers five kernel settings that should be checked and adjusted on every Ceph node — including the background on why the defaults are problematic. The recommendations apply equally to current Ceph releases (Squid 19.2.x, Tentacle 20.2.x) on current Linux distributions (Debian 12, Ubuntu 24.04, Rocky Linux 9).
| Setting | Default | Impact on Ceph | Root Cause |
|---|---|---|---|
| Swappiness | 60 | 10–100 µs per page fault on NVMe swap; 1–5 ms on network storage | OSD heap evicted to disk |
| Transparent Huge Pages | enabled (`always`) | 10–50 ms compaction stalls | khugepaged memory compaction |
| CPU governor | `ondemand`/`powersave` | 10–50 µs frequency ramp | DVFS transitions under variable load |
| C-States | all enabled | 50–100 µs wake latency | CPU must restore voltage and clock |
| I/O scheduler | `mq-deadline` | unnecessary overhead on NVMe | scheduling logic NVMe doesn't need |
These defaults exist for good reason: they save power and maximize overall throughput. For 99% of workloads, that's the right call. A Ceph cluster that needs to deliver consistent IOPS at low latency belongs to the other 1%.
1. Set Swappiness to 0
Why This Is Critical for Ceph
With vm.swappiness = 60, the kernel treats file-backed pages (page cache) and anonymous pages (heap, stack) roughly equally when deciding what to evict under memory pressure. This means: the heap of an OSD process can be swapped to disk even though there are page cache entries that could safely be discarded instead.
Ceph OSD daemons hold BlueStore caches, RocksDB block caches, and various internal data structures in their heap. When that heap gets swapped to disk and then needs to be loaded back via page fault, the OSD thread stalls — right in the middle of a client I/O operation. On NVMe, a swap read takes 10–100 µs. On slower storage, significantly longer.
This becomes particularly critical during recovery operations: Ceph OSDs consume significantly more memory during backfill and recovery than in steady state. The Ceph documentation recommends provisioning at least 8 GB of RAM per OSD — precisely because of this overhead during peak periods. In hyper-converged setups with Proxmox VE 9 or Kubernetes on the same nodes, the risk of memory pressure is further increased.
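To see whether OSD heaps are already being paged out on a node, you can sum the VmSwap counters of the running ceph-osd processes. A quick diagnostic sketch (it assumes the standard `ceph-osd` process name):

```shell
# Sum swapped-out anonymous memory across all ceph-osd processes.
# VmSwap in /proc/<pid>/status counts this process's pages currently in swap.
total_kb=0
for pid in $(pgrep -x ceph-osd 2>/dev/null); do
    kb=$(awk '/^VmSwap:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    total_kb=$((total_kb + ${kb:-0}))
done
echo "ceph-osd swap usage: ${total_kb} kB"
```

A value persistently above zero during normal operation means OSD heap has already hit the swap path.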
The Fix
```shell
# Check current value
sysctl vm.swappiness

# Set persistently
echo 'vm.swappiness = 0' | sudo tee /etc/sysctl.d/90-ceph.conf
sudo sysctl -p /etc/sysctl.d/90-ceph.conf
```

With swappiness=0, anonymous pages are only swapped when the system is critically low on memory. The page cache is evicted first — that's safe because file contents can always be re-read from disk.
Verification
```shell
# Monitor swap activity (si/so should be at 0)
vmstat 1 | awk '{print $7, $8}'
# Check current swap usage
free -h
```

Important: Setting swappiness to 0 does not disable swap. It means the kernel keeps the heap in RAM as long as possible. Under severe memory pressure, swapping still occurs — better than an OOM kill.
2. Disable Transparent Huge Pages
Why THP and Ceph Don't Mix
Transparent Huge Pages (THP) automatically merge 4 KB pages into 2 MB pages to reduce TLB misses. Sounds good, but the mechanism comes at a cost: the kernel thread khugepaged continuously scans memory looking for pages that can be merged. To do this, it needs to perform memory compaction — moving physical pages around to create contiguous 2 MB regions.
During this compaction, the kernel holds locks that can block memory allocations. An OSD thread requesting memory at that moment waits 10–50 milliseconds. Not microseconds — milliseconds. These are the kind of spikes that turn a normal P99 value into an outlier.
Ceph has officially addressed this issue: since 2019, a tracker entry identifies THP as problematic for Ceph daemons. The recommendation is to disable THP system-wide until selective usage via madvise is implemented. Both ceph-ansible and cephadm set disable_transparent_hugepage to True by default — this recommendation has not changed with Ceph Tentacle (2025) or current kernel versions.
An additional problem: THP interacts poorly with the memory allocators Ceph uses (tcmalloc, jemalloc). Reports from the Ceph community show that THP with tcmalloc can lead to uncontrolled memory growth — RSS usage rises above the configured osd_memory_target, in the worst case all the way to an OOM kill.
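Whether THP is already in play for running daemons can be checked per process: each one reports its THP-backed anonymous memory in `/proc/<pid>/smaps_rollup`. A diagnostic sketch (assumes the standard `ceph-osd` process name and a kernel new enough to provide `smaps_rollup`, i.e. 4.14+):

```shell
# Report how much of each ceph-osd's anonymous memory is backed by
# transparent huge pages (AnonHugePages line in smaps_rollup).
count=0
for pid in $(pgrep -x ceph-osd 2>/dev/null); do
    thp_kb=$(awk '/^AnonHugePages:/ {print $2}' "/proc/$pid/smaps_rollup" 2>/dev/null)
    echo "PID $pid: ${thp_kb:-0} kB THP-backed"
    count=$((count + 1))
done
echo "checked $count ceph-osd processes"
```

Non-zero values after disabling THP are pages that were merged before the setting took effect; they disappear once the daemons are restarted.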
The Fix
```shell
# Check status
cat /sys/kernel/mm/transparent_hugepage/enabled
# Output: [always] madvise never

# Disable immediately
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```
```shell
# Make persistent via systemd service
cat <<'EOF' | sudo tee /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable disable-thp.service
```

Verification
```shell
# Monitor compaction activity (counters should not keep increasing)
watch -n1 'grep -E "compact_|thp_" /proc/vmstat'
# Check huge page usage
grep AnonHugePages /proc/meminfo
```

Trade-off: Without THP, you lose automatic huge page optimization. For Ceph, that's no loss — BlueStore manages its own cache and does not benefit from transparent huge pages.
3. Set CPU Governor to Performance
Why Frequency Scaling Causes Ceph Latency
Modern CPUs use Dynamic Voltage and Frequency Scaling (DVFS) to save power. The ondemand governor (a common default) monitors CPU utilization and only increases frequency when load rises.
The problem for Ceph: OSD workloads are bursty. An OSD is briefly idle, then a client request arrives, then it's idle again. The governor sees low utilization and keeps the frequency low. The first instructions of an incoming I/O request run at reduced clock speed — the frequency ramp takes 10–50 µs.
This effect has been specifically measured in the Ceph community: in a benchmark with NVMe SSDs, performance with the powersave governor was significantly below that with the performance governor, because the CPU at moderate utilization (around 27%) never scaled up at all. Only switching to performance delivered the expected IOPS.
The Ceph blog also explicitly recommends the network-latency or latency-performance TuneD profile for all-flash deployments — which, among other things, sets the CPU governor to performance.
The Fix
```shell
# Check current governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Set on all cores
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance | sudo tee "$cpu"
done

# Or via TuneD (recommended for Ceph nodes)
sudo tuned-adm profile latency-performance
```

Verification
```shell
# Check frequency (should be constant at maximum)
grep MHz /proc/cpuinfo | sort -t: -k2 -n | tail -4
# Via turbostat (if installed)
turbostat --interval 1 --show Core,CPU,Bzy_MHz
```

Trade-off: Higher power consumption. In a data center where Ceph nodes run 24/7 anyway, that's an acceptable price for consistent latencies.
4. Restrict Deep C-States
Why Idle CPUs Become a Latency Problem
Even with the performance governor, idle CPU cores enter sleep states (C-states) to save power:
| C-State | What Happens | Wake Latency |
|---|---|---|
| C0 | Active | 0 |
| C1 | Clock stopped | 1–5 µs |
| C1E | Clock + voltage reduced | 5–10 µs |
| C3 | L1/L2 cache cold | 30–50 µs |
| C6 | Voltage cut, state saved to RAM | 50–100 µs |
A Ceph OSD is waiting for the next client request. The CPU drops into C6. The request arrives — and the CPU needs 50–100 µs to become fully active again. This latency adds up on every I/O operation.
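The states a given machine actually offers, and the exit latency each driver advertises, can be read straight from sysfs. A quick sketch for cpu0 (these paths exist on most bare-metal x86 systems with cpuidle enabled; virtualized hosts may expose none):

```shell
# List cpu0's available C-states with their advertised exit latencies.
n=0
for dir in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
    [ -d "$dir" ] || continue
    printf '%-10s exit latency: %4s us  disabled: %s\n' \
        "$(cat "$dir/name")" "$(cat "$dir/latency")" "$(cat "$dir/disable")"
    n=$((n + 1))
done
echo "$n C-states exposed"
```

The `latency` value is what the kernel's PM QoS framework compares against `/dev/cpu_dma_latency` requests when deciding which states remain eligible.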
This effect is particularly relevant for Ceph because the Ceph mailing list documents a concrete case: after setting cpu_dma_latency=0 on all OSD nodes, recovery throughput increased by more than 30% — from ~16 GB/s to ~22 GB/s. The CPUs then ran in turbo instead of dropping into deep C-states.
The Fix
Option 1: Kernel boot parameters (recommended)

```shell
# Add to GRUB_CMDLINE_LINUX in /etc/default/grub:
#   processor.max_cstate=1 intel_idle.max_cstate=0

# Update GRUB
sudo update-grub                               # Debian/Ubuntu
# or
sudo grub2-mkconfig -o /boot/grub2/grub.cfg    # RHEL/Rocky
```

Option 2: At runtime (temporary)
```shell
# Disable C-states > C1
for state in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
    echo 1 | sudo tee "$state"
done
```

Option 3: Via /dev/cpu_dma_latency (as in the Ceph mailing list case)
```shell
# Keeps CPUs in C0/C1 as long as the file descriptor is open
exec 3>/dev/cpu_dma_latency
echo -ne '\x00\x00\x00\x00' >&3
```

Verification
```shell
# Check C-state residency (turbostat's column names are CPU%c1, CPU%c3, CPU%c6)
turbostat --interval 1 --show Core,CPU%c1,CPU%c3,CPU%c6
# CPU%c3 and CPU%c6 should be at 0
```

Note on AMD systems: AMD Rome processors (EPYC 7002 series) and newer appear to be less sensitive to C-state transitions. Nevertheless, the official Ceph benchmarks recommend restricting C-states.
5. Adjust I/O Scheduler for NVMe
Why NVMe Drives Don't Need a Scheduler
For traditional HDDs and SATA SSDs, the Linux I/O scheduler manages the order of requests to minimize seek times. The default mq-deadline prioritizes reads and arranges writes contiguously.
NVMe drives have their own controllers with deep internal queues and don't need this optimization. Every request the kernel reorders or delays is wasted CPU time and additional latency. For NVMe drives, the scheduler should be set to none — a pure pass-through.
This is also recommended in the Ceph community and by enterprise distributions like Red Hat: enterprise NVMe SSDs with their own power-safe caches and controllers perform best with none.
The Fix
```shell
# Check current scheduler
cat /sys/block/nvme0n1/queue/scheduler

# Set to none (per NVMe device)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler

# Make persistent via udev rule
cat <<'EOF' | sudo tee /etc/udev/rules.d/60-ceph-scheduler.rules
# NVMe: no scheduler
ACTION=="add|change", KERNEL=="nvme*", ATTR{queue/scheduler}="none"
# SATA/SAS SSD: mq-deadline
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# HDD: mq-deadline
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
EOF
sudo udevadm control --reload-rules
```

Verification
```shell
# Check for all block devices
for dev in /sys/block/*/queue/scheduler; do
    echo "$dev: $(cat "$dev")"
done
```

Bonus: Ceph-Specific Settings
Beyond kernel tunings, there are some Ceph-internal knobs that become relevant in combination with the above changes:
BlueStore memory target: With the kernel tunings in place (no swap, no THP), BlueStore can use its cache more effectively. The default of 4 GB per OSD is a compromise; for NVMe deployments, it's worth going to 8 GB or more:
```shell
ceph config set osd osd_memory_target 8589934592  # 8 GB
```

NUMA pinning: On dual-socket systems, OSD processes should be pinned to the same NUMA node as their associated NVMe drives and NICs. Every access across the QPI/UPI link between sockets adds latency.
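Which node a device actually sits on can be read from sysfs. A sketch (the sysfs paths are standard on Linux; `numa_node` reads -1 on single-socket or virtualized systems that report no affinity):

```shell
# Show the NUMA node of each NVMe controller and network interface.
# -1 means the platform reports no NUMA affinity for the device.
entries=0
for f in /sys/class/nvme/*/device/numa_node /sys/class/net/*/device/numa_node; do
    [ -e "$f" ] || continue
    dev=$(echo "$f" | cut -d/ -f5)
    echo "$dev: NUMA node $(cat "$f")"
    entries=$((entries + 1))
done
echo "$entries devices checked"
```

With the mapping known, an OSD can be bound to the matching node, for example with `numactl --cpunodebind=0 --membind=0`, or via Ceph's `osd_numa_node` option.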
Networking: For Ceph clusters with 10 GbE or more, jumbo frames (MTU 9000) should be enabled, provided all switches in the storage network support it. With modern Linux kernels (6.x), the TCP defaults are already well tuned — many of the older tuning guides with adjusted buffer sizes are no longer useful or even counterproductive on current kernels.
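Before rolling MTU 9000 out, it is worth probing the path end to end: a ping with the Don't Fragment bit set must carry the MTU minus the 20-byte IPv4 and 8-byte ICMP headers. A small sketch deriving the probe size (the peer address is a placeholder for another storage node):

```shell
# An MTU-9000 probe payload: 9000 - 20 (IPv4 header) - 8 (ICMP header) = 8972.
mtu=9000
payload=$((mtu - 20 - 8))
echo "probe payload: $payload bytes"
# With Don't Fragment set, this fails on any hop with a smaller MTU:
#   ping -M do -s $payload -c 3 <peer-storage-node>
```

If the probe fails while a standard-size ping succeeds, a switch or interface along the path is still at MTU 1500.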
Quick Audit: Everything at a Glance
```shell
#!/bin/bash
echo "=== Swappiness ==="
sysctl vm.swappiness

echo "=== Transparent Huge Pages ==="
cat /sys/kernel/mm/transparent_hugepage/enabled

echo "=== CPU Governor ==="
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null || echo "N/A"

echo "=== C-States (C3+) ==="
for state in /sys/devices/system/cpu/cpu0/cpuidle/state[2-9]/disable; do
    [ -f "$state" ] && echo "$(basename "$(dirname "$state")"): disabled=$(cat "$state")"
done

echo "=== I/O Scheduler (NVMe) ==="
for dev in /sys/block/nvme*/queue/scheduler; do
    [ -f "$dev" ] && echo "$(echo "$dev" | grep -o 'nvme[^/]*'): $(cat "$dev")"
done
```

Design Philosophy: When These Tunings Make Sense
Each of the optimizations described trades throughput or power consumption for latency:
| Optimization | What We Give Up | What We Gain |
|---|---|---|
| Swappiness = 0 | Page cache efficiency under memory pressure | OSD heap stays in RAM |
| THP disabled | Automatic huge pages | No compaction stalls |
| Performance governor | Power savings | No frequency ramp delays |
| C-state limits | Idle power consumption | Predictable wake latency |
| I/O scheduler `none` | Request reordering | Direct NVMe access |
These tunings are not right for every use case. On a development system, a pure batch processing server, or a memory-constrained system, the defaults may be the better choice.
For a production Ceph cluster that needs to deliver consistent IOPS and low latencies for VMs, containers, or database backends, they are essential. Especially in all-flash deployments with NVMe, the hardware is often not the bottleneck — rather, it's the kernel, which was optimized for a different use case. If you run sovereign open-source infrastructure in your own data center, you have the advantage of controlling these knobs yourself — unlike proprietary storage appliances or cloud storage services, where this layer remains invisible.
Outlook 2026: What's Changing in Ceph — and What Isn't
Ceph Tentacle (20.2.x) — The Current Stable Release
With Ceph Tentacle, the current stable release since November 2025, there have been notable improvements at the storage engine level. BlueStore received a faster write-ahead log (WAL), and OMAP iteration was accelerated — which primarily improves RGW bucket listings and scrub operations. Additionally, the hybrid_btree2 allocator was backported from Squid, which delivers significantly better allocation times on fragmented storage compared to the older hybrid allocator.
For the kernel tunings described in this article, Tentacle changes nothing: BlueStore still relies on the Linux kernel for I/O, memory management, and CPU scheduling. The recommendations for swappiness, THP, CPU governor, C-states, and I/O scheduler remain unchanged.
Crimson and SeaStore — The OSD Rewrite
The most exciting project for Ceph performance in the coming years is Crimson — an OSD daemon rewritten from scratch that is intended to replace the classic ceph-osd long-term. Crimson is based on the Seastar framework and follows a consistent shared-nothing architecture: each CPU core runs in its own reactor, without locks and with minimal cross-core communication.
With Tentacle, the first practically usable tech preview is available: Crimson-OSD with SeaStore (the new, native object store for NVMe) can now be deployed via cephadm. Current benchmarks from the Ceph blog show that SeaStore already significantly outperforms the classic OSD on 4K random reads — achieving up to 400K IOPS compared to 130K IOPS on identical hardware. Sequential workloads are at comparable levels. Random writes are still an active area of optimization.
What does this mean for Linux kernel tuning? Crimson with SeaStore will ultimately depend less on the kernel: SeaStore can access NVMe drives directly via SPDK (kernel bypass), and the Seastar architecture avoids many of the thread synchronization issues that slow down the classic OSD during THP compaction and C-state transitions. However, Crimson is not yet production-ready — the Ceph project itself explicitly recommends it only for testing and experimentation. The community expects a timeline to GA similar to BlueStore: several years from tech preview to production readiness.
What This Means for 2026
For everyone running Ceph in production today — and that applies to the vast majority of installations on Squid or Tentacle with the classic OSD and BlueStore — the kernel tunings described in this article remain the most effective lever for consistent, low latencies. The optimizations cost nothing beyond a bit of extra power and can be applied on any node in minutes.
In parallel, it's worth keeping an eye on the Crimson development. If you're planning a new cluster and going NVMe-only, you can already evaluate Crimson in a test environment — with the knowledge that its architecture will eventually make many of the kernel limitations described here obsolete. The next Ceph release after Tentacle (codename: Umbrella) will show how far Crimson has come by then.