ZFS scrub is a pain on one of our servers. It consumes all of the disk
IO, and any interactive work on the box becomes annoying. Our pool is a
mirror of two identical Samsung HD754JJ disks. We’re running FreeBSD
9.2-RELEASE; the box has 8GB RAM and default ZFS settings.
Here’s the IO load during scrub as shown by iostat(1):
And here’s our current pool status (yes, we also seem to have a
performance issue here; the scrub should go much faster):
% zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Wed Aug 13 04:59:44 2014
        247G scanned out of 569G at 10.1M/s, 9h5m to go
        0 repaired, 43.38% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/6f4e7b58-cdb7-11df-b6d7-xxxxxxxxxxxx  ONLINE       0     0     0
            gptid/6ffa234a-cdb7-11df-b6d7-yyyyyyyyyyyy  ONLINE       0     0     0

errors: No known data errors
Direct tuning can be done by adjusting some sysctls; the relevant ones
are below (with default values shown).
After some testing with different settings, we settled on the
following configuration. Note that our goal here is a server that stays
responsive during the scrub; we don’t care if the scrub takes a long
time to complete.
In summary, the above disables scrub prefetch; limits scrub to about 66
IOPS on each device (1000 / 15 ≈ 66.7); tells ZFS that the pool can be
considered idle 1000ms after the last activity; and sets the maximum
number of pending IO operations per device to 3.
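As a rough sketch, the settings summarized above could be written out as sysctl.conf-style lines like the ones this script prints. The tunable names are my assumption (they are the FreeBSD 9.x names as far as I know, and later releases renamed or replaced some of them), and so is a kern.hz of 1000, which makes one tick equal 1ms. The helper names scrub_iops and print_settings are made up for illustration. The script only prints text; it does not change any setting.

```sh
#!/bin/sh
# Sketch only: prints sysctl.conf-style lines, applies nothing.
# Tunable names are assumed FreeBSD 9.x names; values come from the summary.

HZ=1000          # assumed kern.hz, so one tick is 1 ms
SCRUB_DELAY=15   # ticks to wait between scrub IOs
SCAN_IDLE=1000   # ticks without activity before the pool counts as idle
MAX_PENDING=3    # max pending IO operations per device

# Approximate per-device scrub IOPS cap: ticks per second / delay in ticks.
# Shell arithmetic is integer division, so the result is rounded down.
scrub_iops() {
    echo $(( $1 / $2 ))
}

# Emit the settings in the name=value form used by /etc/sysctl.conf.
print_settings() {
    cat <<EOF
vfs.zfs.no_scrub_prefetch=1
vfs.zfs.scrub_delay=${SCRUB_DELAY}
vfs.zfs.scan_idle=${SCAN_IDLE}
vfs.zfs.vdev.max_pending=${MAX_PENDING}
EOF
}

print_settings
echo "# approx. scrub IOPS cap per device: $(scrub_iops "$HZ" "$SCRUB_DELAY")"
```

Each printed name=value pair could be handed to sysctl(8) to try it on a live system, or placed in /etc/sysctl.conf to persist across reboots, provided the tunable exists and is writable on the release in question.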
You can read excellent descriptions of these (and other ZFS tunables) in
this ZFS guide [1].
Now let’s see what iostat(1) looks like with these changes:
Scrub seems to be running just as fast (when the system isn’t doing any
other IO):
% zpool status
  pool: rpool
 state: ONLINE
  scan: scrub in progress since Wed Aug 13 04:59:44 2014
        265G scanned out of 569G at 10.3M/s, 8h22m to go
        0 repaired, 46.63% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        rpool                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/6f4e7b58-cdb7-11df-b6d7-xxxxxxxxxxxx  ONLINE       0     0     0
            gptid/6ffa234a-cdb7-11df-b6d7-yyyyyyyyyyyy  ONLINE       0     0     0

errors: No known data errors
And the server is much more responsive interactively, so that’s the
objective achieved.