If you're familiar with disk structure, you know that disks are broken down into sectors, which are normally 512 bytes in size; all read and write operations occur in multiples of the sector size. When you look closer, hard disks actually store a great deal of extra data between sectors. The disk's firmware uses these extra bytes to detect and correct errors within each sector. As hard disks grow larger, more and more data must be stored on each square centimeter of disk, which produces more low-level errors and strains the firmware's error-correction capabilities.
One way around this problem is to increase the sector size from 512 bytes to a larger value, enabling more powerful error-correction algorithms to be used. These algorithms can use less data on a per-byte basis to correct for more serious problems than is possible with 512-byte sectors. Thus, changing to a larger sector size has two practical benefits: improved reliability and greater disk capacity—at least in theory.
Why are there performance effects?
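Unfortunately, there is a catch. To stay compatible with older software, many drives with 4096-byte physical sectors still present 512-byte logical sectors to the operating system and translate between the two in firmware. Changing the apparent sector size in this way can degrade performance. To understand why, you need to know something about file system data structures and how partitions are placed on the hard disk.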
Most modern file systems use data structures that are 4096 bytes or larger in size. Thus, most disk I/O operations are in multiples of this amount. Consider what happens when Linux wants to read or write one of these data structures on a new disk with 4096-byte sectors. If the file system data structures happen to align perfectly with the underlying physical sectors, a read or write of a 4096-byte data structure results in a read or write of a single sector, and the hard disk's firmware doesn't need to do anything extraordinary. When the file system data structures do not align with the underlying physical sectors, however, a read or write operation must access two physical sectors. For a read operation, this takes little or no extra time because the read/write head on the disk most likely passes over both sectors in succession, and the firmware can simply discard the data it doesn't need. Writes of misaligned data structures, on the other hand, require the disk's firmware to first read two sectors, modify portions of both, and then write two sectors. This operation takes longer than when the 4096 bytes occupy a single sector. Thus, performance is degraded.
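The arithmetic behind the aligned and misaligned cases can be sketched in a few lines of POSIX shell. This is only an illustration under stated assumptions: the function name physical_sectors_touched is hypothetical, and the 4096-byte physical sector size is taken from the example above.

```sh
# physical_sectors_touched OFFSET LENGTH [PHYS_SIZE]
# Hypothetical helper: prints how many physical sectors are overlapped by
# LENGTH bytes starting at byte OFFSET. PHYS_SIZE defaults to 4096 bytes.
physical_sectors_touched() {
    offset=$1
    length=$2
    phys=${3:-4096}
    first=$(( offset / phys ))                  # index of first physical sector
    last=$(( (offset + length - 1) / phys ))    # index of last physical sector
    echo $(( last - first + 1 ))
}

physical_sectors_touched 0 4096     # aligned 4096-byte write: prints 1
physical_sectors_touched 512 4096   # shifted by one 512-byte sector: prints 2
```

In the second case the write covers only part of each of the two physical sectors, which is why the firmware has to read both before it can write them back.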
How can you tell if your data structures are properly aligned? Most file systems align their data structures to the beginning of the partitions that contain them. Thus, if a partition begins on a 4096-byte (8-sector) boundary, it's properly aligned. Unfortunately, until recently, most Linux partitioning tools did not create partitions aligned in this way. The IBM developerWorks article listed in the references describes how to do the job with common Linux partitioning software.
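As a rough check of the 8-sector rule, a partition's starting sector (counted in 512-byte units, as partitioning tools usually report it) can be tested for divisibility by 8. The helper name is_4k_aligned is hypothetical, and the start values are sample inputs, not output from a real disk.

```sh
# is_4k_aligned START_SECTOR
# Hypothetical helper: succeeds when a partition that starts at START_SECTOR
# (in 512-byte units) begins on a 4096-byte boundary, i.e. a multiple of 8.
is_4k_aligned() {
    [ $(( $1 % 8 )) -eq 0 ]
}

# 63 is the traditional DOS-style first partition start; 64 and 2048 are
# multiples of 8.
for start in 63 64 2048; do
    if is_4k_aligned "$start"; then
        echo "$start: aligned"
    else
        echo "$start: misaligned"
    fi
done
```

With these sample values, only the partition starting at sector 63 is reported as misaligned.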
How to fix in ZFS
So, what does this have to do with ZFS? ZFS doesn't have a pre-set block size. It uses variable-sized blocks depending on the amount of data it is writing. If it's writing 1000 bytes, it writes the minimum number of sectors necessary to fit that data, which with 512-byte sectors is 2. On a drive that has 4096-byte physical sectors but reports 512-byte sectors, that 1024-byte write forces the firmware to read and rewrite 4096 bytes, or 8192 bytes if the two sectors straddle a physical sector boundary. Properly aligning the partition therefore does not solve the problem for ZFS. Here we need a different solution.
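The 1000-byte example can be worked through numerically. The sketch below is a simplified model, not ZFS's actual allocator: it assumes ZFS rounds a write up to whole sectors of the size it believes the device has, and that the drive rewrites every 4096-byte physical sector the write touches. The function names allocated_bytes and physical_bytes_touched are hypothetical.

```sh
# allocated_bytes SIZE SECTOR_BITS
# Hypothetical helper: rounds SIZE up to a whole number of sectors of
# 2**SECTOR_BITS bytes (9 -> 512-byte sectors, 12 -> 4096-byte sectors).
allocated_bytes() {
    sector=$(( 1 << $2 ))
    echo $(( ($1 + sector - 1) / sector * sector ))
}

# physical_bytes_touched OFFSET LENGTH
# Hypothetical helper: bytes in all 4096-byte physical sectors overlapped
# by LENGTH bytes starting at byte OFFSET.
physical_bytes_touched() {
    first=$(( $1 / 4096 ))
    last=$(( ($1 + $2 - 1) / 4096 ))
    echo $(( (last - first + 1) * 4096 ))
}

allocated_bytes 1000 9              # 512-byte sectors: prints 1024
physical_bytes_touched 0 1024       # inside one physical sector: prints 4096
physical_bytes_touched 3584 1024    # straddles a boundary: prints 8192
allocated_bytes 1000 12             # 4096-byte sectors: prints 4096
```

When ZFS itself works in 4096-byte units, every write covers whole physical sectors, so the firmware no longer has to read a sector before rewriting it.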
On FreeBSD, there is a hack that forces zpool creation with a minimum sector size of 4k, using a temporary gnop device that reports 4096-byte sectors:
%gnop create -S 4096 ${DEV0}
%zpool create tank ${DEV0}.nop
%zpool export tank
%gnop destroy ${DEV0}.nop
%zpool import tank
A zpool created this way is much faster on problematic 4k-sector drives
that lie about their sector size (like the WD EARS). The hack works perfectly
fine while the system is running. The gnop layer is needed only for the
"zpool create" command, because ZFS stores the sector size in its metadata.
After creating the pool, you can export it, remove the gnop layer, and
reimport the pool. The difference can be seen in the output of the zdb command:
- on a device with 512-byte sectors (2**9 = 512):
%zdb tank |grep ashift
ashift=9
- on a device with 4096-byte sectors (2**12 = 4096):
%zdb tank |grep ashift
ashift=12
The ashift value is permanent. The only way to change it is to destroy
and re-create the pool and then restore its data from backup.
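Since ashift is just the base-2 logarithm of the sector size, the conversion can be sketched in shell. The helper names ashift_to_bytes and parse_ashift are hypothetical, and the parsing assumes a line of the form "ashift=12" or "ashift: 12"; the exact zdb output format may differ between ZFS versions. The sample line is fed in with printf rather than taken from a real pool.

```sh
# ashift_to_bytes ASHIFT
# Hypothetical helper: prints the sector size in bytes, 2**ASHIFT.
ashift_to_bytes() {
    echo $(( 1 << $1 ))
}

# parse_ashift
# Hypothetical helper: reads text on stdin and prints the number from the
# first line that looks like "ashift=N" or "ashift: N".
parse_ashift() {
    sed -n 's/.*ashift[=: ]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

ashift=$(printf 'ashift=12\n' | parse_ashift)
echo "ashift=$ashift means $(ashift_to_bytes "$ashift")-byte sectors"
```

On a real system, the sample printf could presumably be replaced with the output of "zdb tank", assuming the pool is named tank as in the commands above.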
References:
http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/
http://hardforum.com/showthread.php?t=1546137
http://www.nexentastor.org/boards/1/topics/1318
http://kerneltrap.org/mailarchive/freebsd-fs/2010/12/21/6885750
http://www.cod3r.com/2010/06/zfs-on-western-digital-ears-drives/