The pool default.rgw.buckets.data has 501 GiB stored, but USED shows 3.5 TiB.
root@ceph-01:~# ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 196 TiB 193 TiB 3.5 TiB 3.6 TiB 1.85
TOTAL 196 TiB 193 TiB 3.5 TiB 3.6 TiB 1.85
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
device_health_metrics 1 1 19 KiB 12 56 KiB 0 61 TiB
.rgw.root 2 32 2.6 KiB 6 1.1 MiB 0 61 TiB
default.rgw.log 3 32 168 KiB 210 13 MiB 0 61 TiB
default.rgw.control 4 32 0 B 8 0 B 0 61 TiB
default.rgw.meta 5 8 4.8 KiB 11 1.9 MiB 0 61 TiB
default.rgw.buckets.index 6 8 1.6 GiB 211 4.7 GiB 0 61 TiB
default.rgw.buckets.data 10 128 501 GiB 5.36M 3.5 TiB 1.90 110 TiB
The default.rgw.buckets.data pool is using erasure coding:
root@ceph-01:~# ceph osd erasure-code-profile get EC_RGW_HOST
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=4
plugin=jerasure
technique=reed_sol_van
w=8
If anyone could help explain why it's using up 7 times more space, it would help a lot. Versioning is disabled. ceph version 15.2.13 (octopus stable).
This is related to bluestore_min_alloc_size_hdd=64K (default on Octopus).
If using Erasure Coding, data is broken up into smaller chunks, which each take 64K on disk.
One option would be to lower bluestore_min_alloc_size_hdd to 4K, which is good if your solution requires storing millions of tiny (16K) objects. In my case, I'm storing hundreds of millions of 3-4M photos, so I decide to skip Erasure Coding, stay on bluestore_min_alloc_size_hdd=64K and switch to replicated 3 (min 2). Which is much safer and faster in the long run.
Here is the reply from Josh Baergen on the mailing list: