“Storage manufacturers are getting ready to start shipping solid state disks, and Linux-based devices like One Laptop per Child’s XO and Intel’s Classmate don’t contain standard hard disks. To improve performance on the wide array of flash memory storage devices now available, project leader Jörn Engel has announced LogFS, a scalable filesystem specifically for flash devices.”
I particularly like the fact that it’s optimized to increase the useful life of Flash disks. That is my main concern with Flash technology as a replacement for hard drives (especially for system disks, which are frequently written to).
As far as I know devices nowadays have on-chip wear-leveling. I just can’t imagine how doing this twice might improve anything. Do we really need yet another file system?
I know I don’t know much about file systems, but I was under the impression that most filesystems do allow files to be scattered in pieces across the filesystem, hence fragmentation and the problems it causes; unless avoiding that is how ext2 and ext3 get by without needing to defragment…
Or is he referring to Flash media needing to erase much larger data blocks possibly containing other data, such that you’d need to move pieces of the File Allocation Table itself to avoid losing things?
More likely the second point.
What I don’t understand is this: I thought flash now had an increased number of read/write cycles and a wear-levelling mechanism built into the flash controller itself, so why is it still necessary to include such a mechanism in the FS?
Most likely because on the software side it is a lot easier to change the method, upgrade the filesystem and so on, whereas a generic wear-leveling mechanism inside the firmware just won’t be as good when it has to suit a whole range of filesystems. A software-side mechanism can take even the slightest detail into account, but a firmware one has to work at the lowest level of detail it can see: the block level.
External flash devices, like USB sticks, have wear leveling and so forth implemented in the device.
Internal flash devices, especially on things like cell phones, have a simple raw i/o interface to the device.
If Linux is to be competitive on embedded devices, then it needs better support for those sorts of flash devices.
MTD + JFFS2 had severe performance problems, and while YAFFS is pretty good, it’s never had widespread acceptance.
Or is he referring to Flash media needing to erase much larger data blocks possibly containing other data, such that you’d need to move pieces of the File Allocation Table itself to avoid losing things?
Well, I can’t say I’m a professional on these matters, and I _might_ be talking just plain bulls*it, but I guess he means the fact that ext2/3, for example, usually uses a block size of a few kilobytes, whereas a flash drive’s erase blocks are anything from 16 KB to hundreds of kilobytes. So that’d mean a whole lot of unnecessary moving of data, I guess.
Problem is there seemed to be a lot of details missing. At the least I expected that a new file system would consolidate writes to different sectors into a single block write. There was also no mention of write caching to collect a large number of write operations before modifying a flash block. Done properly, you can reduce writes to the flash itself by a large amount.
“””
Problem is there seemed to be a lot of details missing.
“””
This may provide more info. I have not read it yet.
http://tinyurl.com/3ar6ht
Thank you a lot. I am reading it right now.
It doesn’t matter how many times you flush to a flash medium. It only matters how many times you erase each erase block. Write buffering (not caching!) only helps when writing to media with much faster sequential access than random access. Firing off write requests as they come has no significant throughput penalty on flash and reduces the chance of data loss due to a system failure.
Once a write block is written, it cannot be rewritten until the containing erase block is erased. Avoiding erases is more important than avoiding writes. This means allowing erase blocks to become sparse as modified filesystem blocks move elsewhere, only invoking garbage collection when we begin to run out of free erase blocks.
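As a concrete illustration, here is a small self-contained C simulation of that point; it is my own sketch with made-up block and page counts, not anything from LogFS. One logical block is overwritten a thousand times; every update is appended out of place, and an erase only happens when the medium runs out of empty erase blocks and the sparsest one is collected.

#include <stdio.h>
#include <string.h>

#define NBLOCKS 8           /* erase blocks on a hypothetical medium  */
#define NPAGES  16          /* write pages per erase block            */

enum { PFREE, PLIVE, PDEAD };

static unsigned char flash[NBLOCKS][NPAGES];
static int erases, page_writes;
static int active = 0, next_page = 0;   /* where new data is appended */

static int live(int b)
{
    int n = 0;
    for (int p = 0; p < NPAGES; p++)
        n += (flash[b][p] == PLIVE);
    return n;
}

static int is_empty(int b)
{
    for (int p = 0; p < NPAGES; p++)
        if (flash[b][p] != PFREE)
            return 0;
    return 1;
}

/* Erase the non-active block with the fewest live pages.  A real
 * filesystem would relocate those live pages and update its index
 * first; here every block we pick happens to hold only dead pages. */
static void garbage_collect(void)
{
    int victim = -1;
    for (int b = 0; b < NBLOCKS; b++)
        if (b != active && (victim < 0 || live(b) < live(victim)))
            victim = b;
    memset(flash[victim], PFREE, NPAGES);
    erases++;
}

/* Log-structured write of one filesystem block: append to the active
 * erase block, never rewrite a page in place. */
static void write_block(int *cur_b, int *cur_p)
{
    if (*cur_b >= 0)
        flash[*cur_b][*cur_p] = PDEAD;           /* old copy dies        */

    if (next_page == NPAGES) {                   /* active block is full */
        int b;
        for (b = 0; b < NBLOCKS && !is_empty(b); b++)
            ;
        if (b == NBLOCKS) {                      /* no empty block left  */
            garbage_collect();
            for (b = 0; !is_empty(b); b++)
                ;
        }
        active = b;
        next_page = 0;
    }

    flash[active][next_page] = PLIVE;
    *cur_b = active;
    *cur_p = next_page;
    next_page++;
    page_writes++;
}

int main(void)
{
    int b = -1, p = -1;
    for (int i = 0; i < 1000; i++)    /* overwrite one logical block 1000 times */
        write_block(&b, &p);
    printf("%d page writes cost only %d erases\n", page_writes, erases);
    return 0;
}

On this toy medium (8 erase blocks of 16 pages), a thousand overwrites end up costing only a few dozen erases, which is why the erase count, not the write count, is the number to worry about.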
Major brain fart on my part. I knew that and still somehow forgot. As a result, I was thinking in terms of only being able to write entire blocks at a time, which I know is wrong. Don’t worry, the moment I saw your reply, I gave myself a slap for being stupid.
Anyway, I am reading the logfs1.pdf to understand the features offered.
I know I don’t know much about file systems, but I was under the impression that most filesystems do allow files to be scattered in pieces across the filesystem, hence fragmentation and the problems it causes; unless avoiding that is how ext2 and ext3 get by without needing to defragment…
Actually, scattering files across the medium is what allows ext[2-4], other FFS-inspired filesystems, and most *nix filesystems in general to reduce fragmentation. If you scatter files, you can later expand them into adjacent free space. If you pack them tightly, then you need to find someplace else to store the additional data. This is the root cause of fragmentation on sequential-allocation filesystems like FAT and NTFS. However, fragmentation is only an important consideration if seeks are expensive. So on a hard disk, it’s important to know about free blocks and allocate them in a smart way.
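To see this in miniature, here is a toy C comparison of the two allocation policies; it is purely illustrative, with made-up numbers, and has nothing to do with any real allocator. Two files are grown alternately on a small volume: packed back-to-back they end up interleaved into many extents, while started far apart each stays in one contiguous run.

#include <stdio.h>

#define VOL 64    /* blocks on a toy volume */

/* Grow two files alternately, one block per round each, starting at the
 * given blocks and spilling to the next free block when they collide.
 * Then count how many contiguous extents each file occupies. */
static void layout(const char *name, int start0, int start1)
{
    int owner[VOL];
    int next[2] = { start0, start1 };

    for (int i = 0; i < VOL; i++)
        owner[i] = -1;

    for (int round = 0; round < 20; round++) {
        for (int f = 0; f < 2; f++) {
            int b = next[f];
            while (b < VOL && owner[b] != -1)   /* own space used up: spill */
                b++;
            if (b >= VOL)
                return;                         /* volume full (never hit here) */
            owner[b] = f;
            next[f] = b + 1;
        }
    }

    int extents[2] = { 0, 0 };
    for (int f = 0; f < 2; f++)
        for (int i = 0; i < VOL; i++)
            if (owner[i] == f && (i == 0 || owner[i - 1] != f))
                extents[f]++;
    printf("%-10s file0 in %2d extent(s), file1 in %2d extent(s)\n",
           name, extents[0], extents[1]);
}

int main(void)
{
    layout("packed:", 0, 1);       /* files allocated back to back      */
    layout("scattered:", 0, 32);   /* files given room to grow in place */
    return 0;
}

On the toy volume the back-to-back layout leaves both files chopped into twenty single-block extents, while the scattered layout keeps each file in a single run.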
On flash, seeks are fast, so FAT on an FTL-based flash medium isn’t actually that bad. Lookups are still just about as inefficient as practically conceivable, but it’s hard to design a filesystem that has bad read latency on flash media, even if that’s your primary goal. In addition, some FTL implementations understand enough about FAT to determine whether a given filesystem block is free, allowing for some limited garbage collection. So if your flash medium has an FTL, you may be best off using FAT regardless of whether you’re running Windows.
However, FTL is only a short-term solution to aid flash adoption by supporting legacy filesystems that don’t properly address the requirements imposed by flash. Only the filesystem fully understands the metadata and the relationships between data blocks. It’s far easier to teach the filesystem about flash than to teach the FTL about one or more filesystem types. This is a case where the abstraction layer gets in the way of the solution.
The principal reason why FFS-like filesystems won’t work on raw flash is that the mapping between an inode number and the offset of the inode on disk is a simple arithmetic function. Since flash requires a block to be moved somewhere else when it is updated, this breaks the assumption. Either they would need to change the inode number to reflect its new position on the medium, requiring updates to inodes and directory entries that refer to the inode, causing further relocations and a rippling series of inode number changes, or they would need to come up with some other way of locating inodes. Both of these changes are more than invasive enough to warrant a brand new filesystem.
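For reference, the arithmetic in question looks roughly like the ext2 scheme below; the structure and field names here are mine for illustration, not the real on-disk structures. The inode number alone determines the block group and the offset within that group’s inode table, so the inode can never be relocated without renumbering it.

#include <stdio.h>

/* Rough sketch of FFS/ext2-style inode placement; the numbers in main()
 * are made up for illustration. */
struct fs_geometry {
    unsigned inodes_per_group;   /* inodes in each block group */
    unsigned inode_size;         /* bytes per on-disk inode    */
};

/* The inode's location is a pure function of its number: a block group
 * plus a byte offset into that group's fixed inode table. */
static void locate_inode(const struct fs_geometry *g, unsigned ino,
                         unsigned *group, unsigned *offset)
{
    unsigned index = (ino - 1) % g->inodes_per_group;

    *group  = (ino - 1) / g->inodes_per_group;
    *offset = index * g->inode_size;
}

int main(void)
{
    struct fs_geometry g = { .inodes_per_group = 1920, .inode_size = 128 };
    unsigned group, offset;

    locate_inode(&g, 12, &group, &offset);
    printf("inode 12 -> group %u, byte offset %u in its inode table\n",
           group, offset);
    return 0;
}

Because there is no table mapping inode numbers to locations, moving an inode means changing its number, which is exactly the ripple effect described above.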
Not surprisingly, LogFS chose to decouple inode numbers from locations on the medium. They keep a pair of journals that flip-flop on every journal update. The journal contains the current location of the inode file. The inode file contains every inode in the filesystem, arranged in a tree structure. When any erase block is written, the only other erase blocks written are for the inode file and the journal. This prevents having to update an erase block for each inode up the tree when the inodes dance around in response to a write. It also makes the code for updating inodes the same as for updating data, since they are both the contents of a file.
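The sketch below is a conceptual C model of that lookup chain, not the real LogFS on-disk format; the flash is simulated as a byte array and the field names are invented. The journal records where the inode file currently lives, and an inode is simply a fixed offset within that file, so inode numbers stay stable even while the inode file itself moves around the medium.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define FLASH_SIZE 4096
static unsigned char flash[FLASH_SIZE];      /* simulated flash medium */

struct inode { uint32_t data_at; uint32_t size; };

/* The journal records where the inode file currently lives; real LogFS
 * keeps a pair of journals that flip-flop on each update. */
static uint32_t journal_inode_file_at;

/* Inode 'ino' sits at a fixed offset within the inode file, so its
 * number never changes even when the inode file is rewritten elsewhere. */
static struct inode read_inode(uint32_t ino)
{
    struct inode in;
    memcpy(&in, flash + journal_inode_file_at + ino * sizeof in, sizeof in);
    return in;
}

int main(void)
{
    struct inode in = { .data_at = 2048, .size = 17 };

    /* Put the inode file at offset 1024 and store inode 3 in it. */
    journal_inode_file_at = 1024;
    memcpy(flash + journal_inode_file_at + 3 * sizeof in, &in, sizeof in);

    /* Relocate the inode file: rewrite it at a new offset and update
     * only the journal; nothing that refers to inode 3 has to change. */
    memcpy(flash + 3072 + 3 * sizeof in, &in, sizeof in);
    journal_inode_file_at = 3072;

    struct inode found = read_inode(3);
    printf("inode 3: data at %u, size %u\n",
           (unsigned)found.data_at, (unsigned)found.size);
    return 0;
}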
LogFS isn’t the perfect solution, but it’s pretty good. It currently has to scan each erase block to find free filesystem blocks, so identifying erase blocks for garbage collection isn’t as efficient as it could be. Wear leveling is implemented in a rather simplistic way: keep a count of write cycles for each erase block, and stop using erase blocks when they exceed a certain count. There is no attempt to level the wear, just a guarantee not to exceed manufacturers’ specifications. I’m not sure the effect is really any different. Perhaps it could be made better by initially setting this limit quite low and then periodically increasing it in coarse increments when a certain percentage of allocation attempts are rejected for exceeding the cycle limit.
The more pressing problem is that most consumer flash products have an FTL that cannot be bypassed, and this will continue until Windows and consumer electronics devices have filesystems like LogFS that support raw flash. For flash vendors, selling raw flash is a liability because most consumers won’t know whether their OS or device supports it.
It’s no surprise that the primary target hardware for LogFS is the OLPC XO, since they’re not constrained by legacy technology. In fact, until recently, the lead developer of LogFS had never run it on an actual raw flash medium, developing only on a simulated medium. He first attempted it on an XO at a conference where he was presenting. He warned that he had no idea if it would actually work on real flash. It failed miserably, producing a light-hearted chuckle from the audience of other Linux developers, but I believe he fixed the bugs soon afterwards.
Serious question: Have you ever thought about writing for lwn.net? Your clear writing style would fit quite well on the “Kernel” page of the Weekly Edition.
Disclaimer: I’m not associated with LWN except as a subscriber.
I’m also an LWN subscriber and a fan of Jonathan Corbet’s writing. I hadn’t read his piece on LogFS until now. It’s well-written but inaccurate in some places. That’s understandable given the scarcity of good information on LogFS. You really need the skills of a kernel developer to write this kind of material, and although I don’t think Jonathan actually does any kernel hacking, he probably could. Also, if anybody deserves a full-time job documenting the Linux kernel, he’s the man for it.
To answer your question: not specifically, but I’m beginning to realize that I should start doing real pieces in addition to posting comments here. On the other hand, I’m also beginning to realize that I don’t have as much time for this stuff anymore…
Your description of wear leveling is wrong; the rest is surprisingly accurate.
Wear leveling currently isn’t done at all. There isn’t even a provision to stop using certain blocks when they get worn out. The magic word is “currently”.
Isn’t it based on LFS, the log-structured filesystem from Sprite OS and 4.4BSD?
No. MTD/JFFS and now MTD/LogFS use different data structures and approaches than Berkeley LFS. There’s an active implementation of LFS in NetBSD, but it doesn’t have the necessary optimizations for flash devices. PalmSource did a variant on LFS for the never-released Cobalt version of PalmOS, but I don’t believe anyone has ever asked them to open-source it.
It is good to see attention paid to a flash-based FS, since I expect no new disk-based FS makes much sense anymore; at least the OS and small-file stuff should be flash-based, leaving HDs for the media stuff.
In the latest EET there is an article on how Flash devices could be improved specifically for large arrays used in solid state drives.
It comes from Mosaid Technologies, one of the major DRAM IP design houses, which is now looking to bring some of the DRAM spiff to poor old flash: much faster I/O pins, better wear leveling, and a much-reduced ratio of erase-block size to write size.
So far there are no announced takers, but one can hope that Samsung and others will make over flash for the high-performance drive market rather than for junk USB key drives; at least they might get better prices for that.
It has bothered me forever that USB keychain drives are completely free of any meaningful specs save being USB2 compatible, which could mean anything at all in terms of speed numbers. The article gave some insight into exactly why they are so slow.
It is good to see attention paid to a flash-based FS, since I expect no new disk-based FS makes much sense anymore; at least the OS and small-file stuff should be flash-based, leaving HDs for the media stuff.
IMHO, a flash drive would do quite well if the “media stuff” stored there wasn’t rewritten often. Flash storage does not wear out when being read from, only when being written to, but a regular mechanical HD will eventually die even if you don’t use it at all!
Of course, media files are not usually written often unless the flash is used as a buffer between the HD and DRAM for cost reasons. We all know flash doesn’t wear on reads and that even light use of HDs leads to eventual death. The question is that if we start to invest in high-capacity flash drives, they had better last a lot longer than HDs, given they cost about 100x as much per MB.
HDs wear out from large amounts of excessive seeking over tiny files, which delivers relatively low performance whether reading or writing, but writing large files to HDs does not require much seeking.
When media files alone are accessed, there is a natural bandwidth cap, say a few MBytes/sec, needed to satisfy the codec, so for the most part storing and playing HiDef movies shouldn’t stress HDs at all; the blocks are very large and the buffers only need refilling occasionally. We can already see that in iPods, where the HD is only spun up infrequently to refill buffers, though all that spinning up and down probably isn’t too good an idea either.
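As a rough back-of-the-envelope check in C, with all three figures being my own assumptions rather than numbers from the post: a HiDef stream of around 10 Mbit/s only drains about 1.25 MB/s, so with a modest RAM buffer the disk can sleep almost all of the time.

#include <stdio.h>

int main(void)
{
    /* All three figures are assumptions for illustration only. */
    double stream_mbit_s = 10.0;    /* HiDef codec bitrate            */
    double buffer_mb     = 64.0;    /* playback buffer held in RAM    */
    double disk_mb_s     = 40.0;    /* sustained sequential disk rate */

    double drain_mb_s = stream_mbit_s / 8.0;     /* what playback needs     */
    double idle_secs  = buffer_mb / drain_mb_s;  /* how long a refill lasts */
    double busy_secs  = buffer_mb / disk_mb_s;   /* time spent refilling    */

    printf("drain %.2f MB/s: spin up roughly every %.0f s for %.1f s\n",
           drain_mb_s, idle_secs, busy_secs);
    return 0;
}

With those assumed numbers the disk spins up for under two seconds out of every fifty or so, which matches the iPod behaviour described above.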