Recently, I bought a pair of those new Western Digital Caviar Green drives. These new drives represent a transitional point from 512-byte sectors to 4096-byte sectors. A number of articles have been published recently about this, explaining the benefits and some of the challenges that we'll be facing during this transition. Reportedly, Linux should be unaffected by some of the pitfalls of this transition, but my own experimentation has shown that Linux is just as vulnerable to the potential performance impact as Windows XP. Despite this issue having been known about for a long time, basic Linux tools for partitioning and formatting drives have not caught up.
The problem most likely to hit you with one of these drives is very slow write performance, caused by improper logical-to-physical sector alignment. OSes like Linux store data in 4K blocks (or multiples of 4K), which matches the new physical sector size. However, nothing stops you from creating a partition that starts on an odd-numbered 512-byte logical sector. Every 4K write then straddles two physical sectors, and the drive has to read both, merge in the changed 512-byte slices, and write them back.
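If you want to check an existing partition, recent kernels expose the start sector in sysfs; a partition is safe when its start LBA is divisible by 8. This is just a sketch (sdc1 is an example device):

start=$(cat /sys/block/sdc/sdc1/start)   # start LBA, in 512-byte units
if [ $((start % 8)) -eq 0 ]; then echo aligned; else echo misaligned; fi

A partition beginning at LBA 63 fails the check, since 63 * 512 = 32256 bytes is not a multiple of 4096.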
WD claims to have done some studies and found that Windows XP was hardest hit: by default, its first primary partition starts on LBA 63, which obviously is not a multiple of 8. They provide a utility to shift partitions by 512 bytes to line them up. WD also tested other OSes and declared both Mac OS X and Linux to be "unaffected". I don't know about Mac OS X, but with regard to Linux, they are not entirely correct. Following are the results of my experimentation.
The first thing I did was test the performance effect itself. It has been suggested that WD might internally offset block addresses by 1 so that LBA 63 maps to LBA 64. This way, Windows XP partitions would not really be misaligned. I performed a test that demonstrates that WD has not done this. I’ve included the source code to my test at the end of the article. This program does random 4K block writes to the drive at a selectable 512-byte alignment. So if I pass 0 to the program, it runs the test on 4K boundaries. If I pass 1, the test is on 4K boundaries plus 512. The effects of this test are amplified by the use of O_SYNC, which insists that all writes hit the disk immediately, but it demonstrates the problem. Note that I realize that all my testing is “quick and dirty,” but I’m just trying to demonstrate a point, not analyze it in painful detail.
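For reference, I run it roughly like this ("aligntest" is just my name for the binary; it writes to the raw device, so don't point it at a disk whose data you care about):

gcc -O2 -o aligntest aligntest.c
time ./aligntest 0    # 1000 random writes on 4K boundaries
time ./aligntest 1    # the same writes, shifted by 512 bytes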
1000 random aligned 4K writes consistently take between 7 and 8 seconds.
1000 random unaligned 4K writes consistently take between 22 and 24 seconds.
Now, this just demonstrates the problem we already knew about. What about how it affects filesystems? Next: formatting the drives.
I have two drives, /dev/sdc and /dev/sdd, both identical Green drives. I partitioned them as follows:
For /dev/sdd, I used fdisk to add a Linux (0x83) primary partition, taking up the whole disk, using fdisk defaults. By default, the partition starts at LBA 63.
For /dev/sdc, I used fdisk the same as with sdd, but after creating the partition, I realigned it. I did this by entering expert mode (“x”), then setting the start sector (“b”) to 64.
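From memory, the keystroke sequence goes something like this; the menus differ slightly between fdisk versions, so treat it as a sketch:

fdisk /dev/sdc
n    (new primary partition, accepting the defaults)
x    (enter expert mode)
b    (move beginning of data in partition 1; enter 64)
r    (return to the main menu)
w    (write the table and exit)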
Once that was finished, I formatted both drives using the command “time mke2fs /dev/sdc1” (and sdd1).
/dev/sdc, which was aligned, took 5m 45.716s to format.
/dev/sdd, which was not aligned, took 19m 53.609s to format.
That’s a difference of greater than a factor of three!
Now to the file tests. I ran two tests. The first was to copy one large file: a Windows XP disk image for qemu-kvm that takes up 18308968 KiB. I copied the file (from my much faster 7200 RPM drives in a RAID1 configuration) to one drive, then the other, then reran the first copy to rule out buffering effects.
$ time cp winxp.img /mnt/sdc    # ALIGNED
real    5m9.360s
user    0m0.090s
sys     0m20.420s

$ time cp winxp.img /mnt/sdd    # UNALIGNED
real    13m26.943s
user    0m0.110s
sys     0m19.350s
Pretty striking difference. I didn’t really expect this. Since this is one large file, and it can be written linearly to the disk, I expected that we would see a very slight performance hit. I think this is something that itself should be investigated. There’s no reason for long contiguous writes to get hit this hard, and it’s something that the kernel developers need to look into and fix. To complete the testing, I next tried random writes. I have some stuff I’ve been working on for school, lots of small files of all sorts of different sizes. So I decided to copy that stuff recursively.
$ time cp -r "Computer Architecture/" /mnt/sdc    # ALIGNED
real    42m9.602s
user    0m0.680s
sys     1m59.070s

$ time cp -r "Computer Architecture/" /mnt/sdd    # UNALIGNED
real    138m54.610s
user    0m0.660s
sys     2m15.630s
This performance hit, a factor of about 3.3, is surprisingly consistent across operations. And it is severe. I've read people guessing that there would be a 30% performance loss, but taking roughly 230% more time per operation (around 70% less throughput) is exceptionally bad.
In conclusion, these drives are on the market now. We've known about this issue for a LONG time, and now it's here, and we haven't fully prepared. Some distros, like Ubuntu, use parted, which has a very nice "--align optimal" option that will do the right thing. But parted doesn't cover every use case, and we still rely on tools like fdisk for everything else. Anyone manually formatting drives based on the popular how-tos at the top of Google searches is going to cause themselves a major performance hit, because mention of this alignment issue and how to fix it is conspicuously absent. I've done a lot of googling on this topic, and as far as I can tell, the issue has not been taken seriously: there's plenty of discussion on aligning partitions for SSDs and VMware volumes, but nothing about these new hard drives, and no fix for fdisk. Most of the drives still being sold today have 512-byte sectors, so lots of people will say "not my problem", but it will become your problem soon, because all the hard disk manufacturers have been very eager to make the switch. This time next year you may have trouble buying a drive without 4K sectors, and you're going to want all your Linux distros to handle them properly.
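For anyone partitioning by hand today, the parted invocation looks roughly like this; I believe the alignment options require a recent 2.x parted, so treat the exact syntax as a sketch:

parted --align optimal /dev/sdc mklabel msdos
parted --align optimal /dev/sdc mkpart primary ext2 0% 100%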
Evaluation setup and methodology:
- Gentoo ~amd64 system with 2.6.31-gentoo-r5 kernel
- fdisk version: fdisk (util-linux-ng 2.17)
- The drives are identical, but I did not try swapping configurations to make sure that one drive isn’t fundamentally slower than the other.
- Core 2 Quad at 2.33GHz (Q9450), 8GiB of RAM
- MSI X48 Platinum motherboard — Intel X48 Express + ICH9R
Related articles:
http://lwn.net/Articles/322777/
http://hardware.slashdot.org/article.pl?sid=06/03/24/0619231
http://bugs.gentoo.org/show_bug.cgi?id=304727
About the author
Timothy Miller is a Ph.D. student at The Ohio State University, specializing in Computer Architecture, and Artificial Intelligence. Prior to going back to school, he worked professionally as a chip designer. Tim is also the founder of the Open Graphics Project.
Random block write code:
#define _LARGEFILE64_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <fcntl.h>
char buffer[4096];
int main(int argc, char *argv[])
{
    int fd, i, off;
    long bk;
    long long byte;  /* 64-bit so offsets past 4 GiB don't overflow on 32-bit systems */

    /* Optional argument: alignment offset in 512-byte units (0 = 4K-aligned). */
    if (argc < 2) {
        off = 0;
    } else {
        off = atoi(argv[1]);
    }
    srandom(off);

    /* The device under test is hard-coded; O_SYNC forces every write to hit the disk. */
    fd = open("/dev/sdc", O_RDWR | O_SYNC);
    printf("fd=%d\n", fd);

    /* 1000 random 4K writes, each shifted by the requested number of 512-byte units. */
    for (i = 0; i < 1000; i++) {
        bk = random() % 200000000;
        byte = (long long)bk * 4096 + off * 512;
        lseek64(fd, byte, SEEK_SET);
        write(fd, buffer, 4096);
    }
    close(fd);
    return 0;
}
Thanks for the excellently prepared article, I didn’t have to do anything to it.
I really didn't know about this issue; thanks for investigating and presenting it so clearly. Does this affect only disks of a certain size and above, or any disk manufactured with 4K sectors? I certainly want to avoid this problem when replacing HDDs for WinXP machines.
Indeed an excellent article!
I think it would happen on all 4K-sector drives regardless of size. If you start at LBA 63, the drive is forced to update two 4K sectors for each write, because the virtual sectors lie across two physical sectors.
P=======P=======P=======
=V=======V=======V======
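To put numbers on the diagram: LBA 63 begins at byte 63 * 512 = 32256, while physical 4K sectors begin at multiples of 4096 (…, 28672, 32768, …). A 4K virtual block written at 32256 runs to byte 36351 and so straddles the physical sectors starting at 28672 and 32768; both must be read, modified, and rewritten.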
Reading the AnandTech article, it appears intended for 1TB+ drives, but as you say, it could apply to drives of any size if WD decided to adopt it across the board.
Hm, how are LBA addresses specified, exactly?
I thought they were specified in multiples of the drive's sector size, in which case addressing via LBA should always be aligned. But is an LBA in reality always a multiple of 512 bytes?
Is there a difference between how LBAs in the MBR are interpreted and how ATAPI interprets LBAs?
…but perhaps the 4K-sector WD drives are currently running in 512-byte-sector emulation mode? That would at least explain the "LBA weirdness".
My drives are 1TB, but I'm sure it'll affect any 4K-sector drive the same.
Did that include actually reading the article?
Dear God, who actually uses fdisk to partition a hard drive anymore? To get information about the partitions *ON* a hard drive, yes; to actually *PARTITION* a hard drive, hell no.
Hell, when it was widely used, everyone knew it was sort of buggy, and the program itself warned you about the DOS-based partitions it created.
Unless I missed something, the author of this "article" seemed to know nothing about GParted, which can be found on most live CDs/USB rescue distros or can be installed on RPM-based distros like Fedora with the "yum install gparted" command.
As if gparted (which is just a pretty gui + parted) can’t screw things up too. fdisk and the likes are for people that know what they are doing.
In case you don’t know there are distros that have text mode installs and use utilities like fdisk/cfdisk. Just because you can use gparted it doesn’t mean you have to.
I use it all the time. Furthermore, I actually use gdisk so that I can create hybrid MBR-GPT partitions. Parted (and gparted) totally trashes hybrid MBR-GPT setups.
I’ve never used anything but fdisk, and never had any bugs or problems with it. But then I run Gentoo and don’t use the installer.
It’s not so much Linux, it’s the DOS-compatible partition that fdisk creates.
If you don't need DOS compatibility, you won't have a problem.
It’s a DOS/Windows-compatibility thing you are trying to attribute to Linux. DOS/Windows has a problem, Linux just tries to be compatible.
As you said, parted does it just fine.
The fdisk man page (on my machine at least) contains a lengthy warning about how fdisk will quite happily create some pretty dodgy partition layouts and it recommends parted for doing anything even remotely unusual. I guess this falls into that category.
All the major noob-friendly distros use gparted for doing the partition editing, don’t they? Will that protect users from these kinds of problem, then?
I do consider Mandriva to be pretty newbie-friendly all in all; it's clear, consistent, and provides an extensive selection of documentation and loads of online help if needed. Also, it's really stable and has an excellent control center utility.
But alas, Mandriva doesn't actually use gparted. They use a tool of their own which apparently uses libparted as its backend. As far as I know, quite a few distros do it that way. But as the article states, you seemingly have to use the "--align optimal" option to get the right behavior; it doesn't automatically align the partitions properly without that. And I have no idea whether those custom partitioning tools employed by various distros pass such an option to libparted. If they don't, that will be a very important issue to fix immediately.
I’d actually prefer if distros rolled out an update of some sort which will check the currently installed system and its partitioning scheme and warn if they are misaligned and would provide a way of fixing it; not everyone re-installs their system all the time and as such could be using misaligned partitions for years before next re-install.
I think it will be mostly OK if they use it on new installs from now on; these WD drives are not widely available on the market yet (and when you do buy one, there is a BIG warning on the front).
gparted gets it right:
“When enabled, Round to cylinders aligns partition boundaries on the cylinder boundaries for the disk device. Enabled is the default setting.”
The Ubuntu installer uses Partman from Debian, which uses parted in the background.
So that’s a start.
But if parted will do it right when run from partman, I don’t know yet.
I just had a look at what partman does and what parted does. Parted aligns by default as well, just like gparted. And partman just passes sizes in MB to parted, if I looked in the right places. That means it will do the right thing by default, I think.
Aligning partitions to "cylinder" boundaries is BAD. The cylinders are fake, and they come in units of 63: with the usual fake geometry of 255 heads and 63 sectors per track, one cylinder is 255 * 63 = 16065 sectors, which is odd, so a cylinder boundary is never a multiple of 8.
You could be right.
I guess my brain is off because it's the weekend.
Yeah, Mandriva would certainly fall within the category of distros which should sort all of this stuff out automatically without the user having to worry. Distros like Arch, Gentoo & Slackware all generally expect their users to be aware of the technical issues. But for the mainstream distros this does need to be fixed.
Maybe we could take a look at our respective distros and file a bug report if there could be an issue.
Mandriva was the first distro to make partitioning the hard drive mortal-friendly.
Mandrake Linux 7.0 was when they first released it, and I recall thinking, "Holy crap, this thing needs to be sold separately."
I went so far as to use the install disk up to the partition step to repartition my hard drives for a while.
So, since you're using Linux, wouldn't it behoove you to use the GPT (GUID Partition Table) scheme, which handles, by design, the new block size?
ref: http://en.wikipedia.org/wiki/GUID_Partition_Table
Well apart from anything, fdisk doesn’t support GPT. Parted does, but as he said, you can get parted to automatically solve the problem anyway.
The problem as TFA describes it is that a bunch of the tools & tutorials out there today will give you bad results if you use them with the new block size. They will probably give you bad results if you use them with GPT, too.
Hi,
No.
Very old disk drives used "CHS" (Cylinders, Heads, Sectors) addressing instead of LBA. Due to limitations this didn't work for drives larger than about 500 MiB (1024 cylinders x 16 heads x 63 sectors x 512 bytes is roughly 504 MiB), so the industry shifted to LBA and created a CHS->LBA translation scheme.
Due to BIOS limits, this CHS->LBA translation scheme usually uses “63 sectors per track”, which is the highest number of sectors per track that the old BIOS disk interface can handle.
For performance reasons OSs make partitions that start/end on track boundaries (having a few sectors at the start or end of a partition that are on a track by themselves causes more disk head movement).
Basically what I’m saying is that the problem wasn’t caused by *any* OS. The problem is caused by 30 years of backward compatibility (and the lack of foresight, from BIOS, disk and OS designers).
The ironic part is that the original IBM design supported floppy disks and hard drives with different sector sizes. It’s unfortunate that this aspect of the original design was lost, and unfortunate that these new hard drives need to emulate 512-byte sectors to begin with.
-Brendan
Thanks for the interesting and informative article! Should be careful when installing my next pair of drives…
Just one minor observation: can you actually have a “230% performance loss?” That sounds as if the performance of the system turned negative… I think it would be clearer to say 230% overhead (in operation time) or 70% performance loss (in average throughput).
http://www.anandtech.com/storage/showdoc.aspx?i=3691
There recently was a discussion about this on the util-linux-ng mailing list:
http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2926
The posts on the fdisk list seem to imply that the version of fdisk you are using will do the right thing, provided you're using a 2.6.32 kernel that can properly report the disk topology. Could you test with that?
Here’s the post I’m referring to:
http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2926
Reporting disk topology requires the hardware to also communicate that topology. AFAIK many of these drives do not. Christoph Zimmermann summarized the same observation in the thread you point to:
It looks like the discussion in that thread is about aligning partitions by default on a sufficiently large granularity to "get by" (as Vista and above do). Note that other technologies (e.g. SSDs, RAID) benefit from alignment larger than 512 bytes or 4 KiB.
I was using a 2.6.32 kernel. It clearly did not “do the right thing.”
I did battle with this issue just this morning. I had to use parted to manually configure the partitions so that both their start sectors and their lengths are multiples of eight.
I didn't actually suffer a major read/write performance loss; only some random reads made the drive go crazy and very slow. Write speed topped out at 70 MB/s, and after aligning the partitions it has gone up to 80 MB/s.
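For the record, pinning the start and length down in sectors with parted looks roughly like this; the end sector here is a made-up example, the point being that the start (64) and the resulting length are both multiples of 8:

parted /dev/sdc unit s mkpart primary 64 2930272255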
You're both wrong. It's a hack mode that can be enabled by connecting pins 7-8 on the HDD. But of course it's not enabled by default, because it would totally screw up every other OS.
I think I might have the same problem. What model are you talking about?
Mine is a WD15EADS.
Perhaps it's not only Linux that's not ready; they are not ready even at WD support.
Currently the only “Advanced Format” disks (as WD calls it) are the Green models ending in “EARS”. WD15EADS is of the previous generation and therefore not affected.
The Advanced Format disks are called WD10EARS, WD15EARS and WD20EARS.
I read that every major HDD manufacturer has agreed to using a 512-byte-emulation mode until the end of 2014.
I also read that there is a way for software to ask the disk about its real physical layout but that Western Digital hasn’t implemented such feature in its current line of 4k disks. Therefore no software can detect them and take care to stay aligned.
I know it is not supposed to be an Advanced Format drive, but running the code at the end of the article indeed shows performance differences with alignments other than 0 and 8. (I ran it on a partition starting at sector 64.)
Also, the disk freezes a lot when doing I/O.
I am sending it back.
Where did you get this info? If you read WD’s site:
http://www.wdc.com/en/products/products.asp?driveid=773
Formatted Capacity 2,000,398 MB
Capacity 2 TB
Interface SATA 3 Gb/s
User Sectors Per Drive 3,907,029,168
That’s 512 byte sectors, model # WD20EARS
‘Parted Magic’ uses core programs of GParted and Parted to handle partitioning tasks with ease, while featuring other useful programs (e.g. Partimage, TestDisk, Truecrypt, G4L, SuperGrubDisk, ddrescue, etc…).
If you ever used PartitionMagic with Windows, 'Parted Magic' is a superior Linux partitioning tool that you can use from a CD or USB stick, or load from its own directory on the drive.
http://partedmagic.com/
I don't know if you have tried it or not, but it would be nice to know whether "fdisk -b 4096 /dev/sd?" makes any difference. It would also be good to know whether other similar utilities like cfdisk create unaligned partitions.
I confess I was expecting more from the XBOX 360..
LEARN
SOME
ENGLISH!!!
This is ridiculous:
“Reportedly, Linux should unaffected by some of the pitfalls of this transition, but my own experimentation has shown that Linux is just as vulnerable to the potential performance impact as Windows XP. Despite this issue being known about for a long time, basic Linux tools for partitioning and formatting drives have not caught up.”
You either start teaching your writers some proper international English or you might as well post news in Dutch (or Italian: “Ho una notizia (quasi inutile) da divulgare ma mi manca il vocabolario!!!”)
Grow up.
I made a mistake by leaving “be” out of my sentence. I think you’re smart enough to make the appropriate inference. This was not intended to be a professional publication, so I didn’t bother editing it 20 times and having it proof-read by lots of other people.
Aside from forgetting a word (“should /be/ unaffected”), I see no problems with what he wrote.
wow, he forgot one word, and therefore he’s incapable of communicating in English?
Please, drink a fifth of scotch and calm the hell down.
That seemed pretty understandable to me, perhaps you need to bone up on your reading comprehension.
I would like to apologize to you all.
I was angry for other reasons.
I just lost my job.
I love OSNews, and I hate grammar mistakes on the front page. A little more care by the authors, that’s all I’m asking for.
Cheers…
So using cp is about as braindead as "rm -rf /*" for testing disk I/O. It's all about the block size that's read/written, which in the case of cp is 1 character at a time. Something like dd or tar would provide a better metric for streaming writes: tar -cpf - some_path/ | tar -xpf - -C /path/to/final/destination
Or you can use dd which allows you to slice and dice and adjust block sizes trivially, then you can write to a raw block device and see what it can do sans filesystem crap.
An interesting test would use variable block sizes of 512, 768, 1024, 2048, 4096, 8192, 16384, which will show an odd output block size at 768, and the performance of 1 and 2 bit shifts above and below the new block size. Just to show how brain dead a block size of 1 is, I am throwing that in here too.
for BS in 1 512 768 1024 2048 4096 8192 16384 ; do
for SKIP in 0 1 2 4 8; do
dd if=/dev/zero of=/dev/sdc bs=${BS} seek=${SKIP} count=1024k
done
done
Just make sure to do that on a disk whose data you don’t care about. I only say this in case some naive user decides to test their primary hard-drive in this fashion and ends up destroying the first X megabytes of the drive (including bootloader and partition tables).
It's the duty of the block device driver and/or the filesystem to collect several such manipulations before writing them to the disk.
Well, no. cp doesn’t do 1 char at a time, it tries to minimize io. It might not be clear from the code though. Even stdio keeps an internal io buffer you can’t normally see.
A quick truss/strace on recent FreeBSD/CentOS/Solaris shows 64k buffers/4k buffers/mmap the whole darned file approaches.
You could get around this type of issue by partitioning your drive right?
I just got a WD15EARS (1.5 TB, SATA 3 Gb/s, 64 MB cache) a couple of days ago. I formatted it as ext3 and have filled it ~90% full at ~20 MB/sec. Is there any way (please-oh-please-oh-please) that I can fix this in place, i.e. without having to reformat? It must be technically possible.
http://www.gnu.org/software/parted/faq.shtml
Surely you should be using parted to partition drives, and not fdisk?
This drive does not in any way report its physical sector size. It says its sectors are 512 and that’s it.
Shouldn’t WD correct that then? How is a disk tool supposed to partition a drive properly if the drive itself is reporting incorrect data? If the physical sectors are 4096, the drive should report 4096 shouldn’t it?
I believe the whole point is to emulate the 512b sectors so that legacy OS’s will work. The drives can’t query the OS and then modify how they report themselves depending on what is supported. They do provide a jumper so you can manually turn the legacy emulation on or off, but most people aren’t going to mess with that.
We have knowledge of the problem. We can deal with it, regardless of how the drive lies.
Yes, it can be dealt with, yet at the same time it seems as though these drives should report the correct geometry when queried properly. That way the partitioning tools would be aware of it from the start, rather than having to deal with the problem manually. Most people I know, even ones with good technical knowledge, wouldn't have known how to handle this one, as they don't delve that deep into drive partitioning. For the sake of avoiding trouble whenever possible, the drive should report its geometry properly when queried by an OS that knows how to ask for the *real* geometry, and not that ridiculous LBA compatibility hack we've had to live with for so long thanks to the BIOS and Windows.
First of all it’s important to distinguish between logical block size which is used when sending commands to a device and the physical block size which is used by the device internally.
Linux has supported (SCSI) drives that present 4KB logical block sizes for a long time. For compatibility with legacy OSes, however, consumer-grade ATA drives with 4KB physical blocks continue to present a 512-byte logical block interface. The knob indicating that the drive has 4KB physical blocks is orthogonal to the logical block size reporting, allowing the information to be communicated without interfering with legacy OSes like XP that only know about 512-byte sectors.
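On a kernel with topology support (2.6.31 and later), the values a drive reports show up in sysfs, along these lines (sdc is an example device):

cat /sys/block/sdc/queue/logical_block_size    # 512 on these drives
cat /sys/block/sdc/queue/physical_block_size   # should be 4096; the current EARS drives say 512
cat /sys/block/sdc/alignment_offset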
We have worked closely with disk manufacturers for a long time to make sure we were ready. Western Digital have been instrumental in the ATA specification in terms of the alignment and physical block size parameters. The engineering sample drives I have received from WDC have all implemented the physical block size knobs correctly. Which makes it even more baffling that they end up shipping an advanced format drive that gets it wrong. I have no idea why they did that. The location of the block size information in IDENTIFY DEVICE is unlikely to be inspected by legacy systems, so I highly doubt it’s a compatibility thing. Brown paper bag time for Western Digital…
It is true that the effects of this particular drive reporting incorrect information could have been mitigated by a 1MB default alignment. However, that would still have caused misalignment for other drives that come wired with 1-alignment to compensate for the legacy DOS sector 63 offset. So blindly aligning to 1MB won’t cut it. Windows Vista/7 don’t do that either. Like Linux, they compensate based upon what the drive reports.
Linux 2.6.31 and beyond will report device alignment and physical block size for all block devices. It is then up to the userland partitioning utilities etc. to adjust start offsets accordingly. You’ll find that both parted and util-linux-ng have been updated to do this. And that modern fdisk will in fact align on a 1MB (+/- drive alignment) boundary by default.
Caveat being that Fedora is the only community distribution I know of that’s using the updated bits. I don’t think all of them made it into Fedora 12 but I’m sure Fedora 13 will do the right thing.
So I encourage you to work with your distribution vendor to ensure they start shipping recent partition tooling.
Martin K. Petersen
Kernel Developer, Oracle Linux Engineering
Because all the #include statements in the source code snippet are empty.
[code]
#define _LARGEFILE64_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define LIMIT 1000

char buffer[4096];

int main(int argc, char *argv[]) {
    int fd, i, off;
    long bk[LIMIT];
    long long byte;

    if (argc < 2) {
        off = 0;
    } else {
        off = atoi(argv[1]);
    }
    srandom(off);

    /* Pre-fill the array of random block numbers so the RNG isn't part of the timed loop. */
    for (i = 0; i < LIMIT; i++) {
        bk[i] = random() % 200000000;
    }
    bk[0] = 0; /* first write lands at the start of the device */

    fd = open("/dev/sds", O_RDWR | O_SYNC);
    printf("fd = %d\n", fd);
    for (i = 0; i < LIMIT; i++) {
        byte = (long long)bk[i] * 4096 + off * 512;
        lseek64(fd, byte, SEEK_SET);
        write(fd, buffer, 4096);
    }
    close(fd);
    return 0;
}
[/code]
Wouldn't simply setting the proper sectors/tracks options do the job? Like it's described here: http://www.ocztechnologyforum.com/forum/showthread.php?48309-Partit…
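If I understand the trick in that thread correctly, the idea is to feed fdisk a fake geometry whose cylinders are a multiple of 8 sectors, for example 32 heads and 32 sectors per track (1024 sectors per cylinder), so that even cylinder-aligned partitions land on 4K boundaries:

fdisk -H 32 -S 32 /dev/sdc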
I've heard that LVM does a bit of the alignment for you when creating a new logical volume. I'm not sure if that would help here, but it would argue for using logical volumes even in simple setups.
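One way to check what LVM actually did is to look at where the first physical extent starts; pe_start should come out as a multiple of 4 KiB. Note that this can't rescue you if the underlying partition itself starts misaligned:

pvs -o +pe_start /dev/sdc1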
Hi!
In the util-linux-ng mailing thread that some commenters have already mentioned here (thread started by myself, BTW), I did a test similar to yours, only fully automated and using a ready-made benchmark named PostMark which is quite well suited for exposing this particular performance problem:
http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2926/fo…
This benchmark script is able to automatically expose the optimal partition offset that offers the best performance.
In the same thread, you can read that the util-linux-ng guys have already committed a fix for this issue in fdisk:
http://thread.gmane.org/gmane.linux.utilities.util-linux-ng/2926/fo…
What’s left to be fixed now is parted:
http://parted.alioth.debian.org/cgi-bin/trac.cgi/ticket/251
Why didn't you use a 4K block size? That should also make a big difference! Just supply "-b 4096" to mkfs.
A 4K block size is the default value, not something that needs to be configured manually.
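If you'd rather verify than trust the default, forcing and checking the block size are both cheap (sdc1 as an example):

mke2fs -b 4096 /dev/sdc1
tune2fs -l /dev/sdc1 | grep 'Block size'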
Thank you!
Does this issue affect both GPT and MBR partition tables?