Yet another new file-system being worked on for the Linux/open-source world is NVFS, spearheaded by a Red Hat engineer.
NVFS aims to be a speedy file-system for persistent memory like Intel Optane DCPMM. NVFS is geared for use on DAX-based (direct access) devices and maps the entire device into a linear address space that bypasses the Linux kernel’s block layer and buffer cache.
I understood some of those words.
I think the point is that today’s file-systems are still geared towards rotating media that are accessed in blocks of data. But with modern storage beginning to approximate RAM, it can just be accessed the same way we access memory – directly by address, with those addresses mapped as part of main memory. Of course, we’ve always had memory-mapped drive storage, but that sat on top of the underlying file-system’s block-based access. And even with RAM we still tend to read chunks and cache recently used data, but the move is towards there being no difference between “main” memory and “storage” memory. We are approaching the point where main memory may not “go away” when the power is removed.
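For the curious, the Linux mechanics for this already exist: with a DAX-capable file-system you can mmap() a file and write to persistent memory with ordinary CPU stores, no block layer involved. Here’s a rough sketch – the /mnt/pmem mount point and file name are assumptions, and it only works where the file-system actually supports MAP_SYNC:

/* Minimal sketch, assuming a DAX-capable file-system mounted at /mnt/pmem.
 * MAP_SHARED_VALIDATE + MAP_SYNC ask the kernel for a mapping whose page
 * tables stay in sync with the media, so stores go straight to persistent
 * memory instead of bouncing through the buffer cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_VALIDATE            /* fallbacks for older libc headers */
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

int main(void)
{
    int fd = open("/mnt/pmem/example.dat", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) { perror("mmap (no DAX/MAP_SYNC support?)"); return 1; }

    strcpy(p, "hello, persistent memory");   /* an ordinary store, no write() call */
    msync(p, 4096, MS_SYNC);                 /* portable way to make it durable */

    munmap(p, 4096);
    close(fd);
    return 0;
}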
It’s funny because many ’80s and ’90s devices – PDAs and the like – had no separate main and storage memory. Applications executed in place in memory, which didn’t erase when the device was turned off. Of course, it was slow. We may now be getting that same model back, only very fast.
However, I think there will always be some kind of long-term or offline storage, but that may be just for archival purposes, with an occasional fetch.
fretinator,
It’s not quite as simple as that. Granted, solid state media is significantly faster than rotating disks in all cases. But even when we’re talking strictly in the domain of solid state NAND media, random write access kills performance. Even though SSDs have overcome the bottlenecks of physical head speed and rotation speed, you still get a fairly hefty penalty with random access instead of sequential access due to how NAND flash works. Sequential writes provide the most bandwidth by a large margin. Journal-type file systems that are designed to only write out completed spans tend to be extremely fast on SSDs, which ironically can help optimize hard disks as well.
Do note that modern SSDs often provide several gigabytes of SLC cache to help mitigate the performance bottlenecks of the primary flash storage, which uses slower but more economical flash cells. So a short test of random writes is likely to expose the cache’s performance rather than the underlying media. This isn’t generally a negative, however it does throw a wrench into the works if the intention is to benchmark the characteristics of the underlying media. For example, putting a cache on top of a hard disk will make the hard disk appear to perform much better than it otherwise could. Here’s the conundrum though: if you are content just testing the performance of the cache, then the performance of the backing storage becomes almost irrelevant. You could switch between SSD and HDD and it wouldn’t even make a difference for trivial filesystem benchmarks.
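To make that concrete, here’s a rough sketch of the kind of test being described – sequential vs. random 4 KiB writes with the kernel’s page cache bypassed. The file name and the 8 GiB figure are made up; the point is that the total written has to exceed the drive’s SLC cache, otherwise you’re only benchmarking the cache:

/* Rough, hypothetical sketch: sequential vs. random 4 KiB writes.
 * Build with: cc -O2 bench.c -o bench */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK 4096
#define TOTAL (8LL * 1024 * 1024 * 1024)     /* 8 GiB: has to exceed the SLC cache */

static double mib_per_sec(int fd, int random_order)
{
    char *buf;
    if (posix_memalign((void **)&buf, BLOCK, BLOCK) != 0) exit(1);  /* O_DIRECT wants aligned buffers */
    memset(buf, 0xA5, BLOCK);

    long long blocks = TOTAL / BLOCK;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long long i = 0; i < blocks; i++) {
        long long blk = random_order ? (rand() % blocks) : i;
        if (pwrite(fd, buf, BLOCK, (off_t)blk * BLOCK) != BLOCK) { perror("pwrite"); exit(1); }
    }
    fsync(fd);                                /* make sure everything hit the device */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(buf);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (TOTAL / (1024.0 * 1024.0)) / secs;
}

int main(void)
{
    /* O_DIRECT bypasses the kernel's page cache so we measure the drive,
     * not RAM; the drive's own DRAM/SLC caches are still in play. */
    int fd = open("bench.dat", O_CREAT | O_RDWR | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, TOTAL) < 0) { perror("ftruncate"); return 1; }

    printf("sequential: %.1f MiB/s\n", mib_per_sec(fd, 0));
    printf("random:     %.1f MiB/s\n", mib_per_sec(fd, 1));
    close(fd);
    return 0;
}

On a hard disk you’d expect the two numbers to differ by orders of magnitude; on a NAND SSD the gap is smaller but still very real once the cache is exhausted.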
Thinking out loud: even with persistent memory solutions, I wonder if there are isolation problems. I can see such systems never needing to be rebooted (because they can survive powerdown), but when they finally do need an actual reboot (i.e. for updates or something), there’s less separation between runtime state and persistent storage.
You’re right, it’s not that simple. However, there is movement to peel back some of the layers of abstraction and allow the OS to manage the storage directly. I think it is mainly driven by hyperscalers who buy things like host-managed drives.
Anyway, fewer layers of abstraction on flash-based devices is where we’re heading, because it makes more sense for the OS to manage the drives directly: it has a better idea of the state of the system than a blind drive controller does.
Yes, and some multi-layer NAND will work in a different mode until it reaches a certain fill level, then compact to free up more space. The Intel 660p, for example.
I thought about this when I was tinkering with LISP. There are several options: hotpatching, which LISP does; starting from a base image with overlays, like many appliances do; or building config management into the system.
Interestingly, LISP was built for this environment. LISP programs aren’t “programs” in the sense they would be with something like C, but memory images. It’s very alien. With C and others, the program bootstraps itself whenever it’s started, but with LISP the program just starts existing. Running LISP on hardware designed for C, and on OSes written in C, doesn’t do it justice.
Updates are done by hotpatching the code while the program is running. The code is mutated in place, so there isn’t a concept of a reboot as we know and love. The program keeps running, and in this context, the fundamentals of functional programming and formal verification become very important.
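In the C world the closest everyday analogue is probably reloading a shared object while the process keeps running – nowhere near as elegant as a LISP image, but it shows the idea of swapping code under live state. A hypothetical sketch (plugin.so and its step() function are made-up names):

/* Hypothetical hot-reload sketch: the long-lived state stays in the process
 * while the code behind step() can be rebuilt and swapped at any time.
 * Build the host with: cc host.c -o host -ldl
 * Build the plugin with: cc -shared -fPIC plugin.c -o plugin.so */
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

typedef int (*step_fn)(int state);

int main(void)
{
    int state = 0;
    for (;;) {
        void *handle = dlopen("./plugin.so", RTLD_NOW);   /* load whatever version is on disk */
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        step_fn step = (step_fn)dlsym(handle, "step");
        if (!step) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        state = step(state);          /* state survives every code swap */
        printf("state = %d\n", state);

        dlclose(handle);              /* drop it so a rebuilt plugin.so is picked up next time */
        sleep(1);
    }
}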
The next option would be to start with a base image then overlay config settings and customizations on top. This is probably the most annoying and least flexible, but it’s the easiest to reason about. Docker works by applying layers to build the container, and OpenWRT works this way as well. This option means having to build tools to manage the system, and it means people need to use the management tools if they have any hope of updating the system without lots of work. I don’t particularly like this option.
Another option, which is gaining popularity, is to bake config management into the system management tools. NixOS and Guix take this approach. I have mixed feelings, since it introduces complexity and people end up relying on a project to do more for them.
It’s even funnier than that. The original computer memory technology, magnetic core, was persistent across power events. Persistent memory was something we lost early on. https://www.computerhistory.org/timeline/1953/#169ebbe2ad45559efbc6eb357208c601
There was an article, video, or something where an early operator talked about how they would power the machine down during a brownout, and it would pick up where it left off when it was powered back up. There were more specifics, but none of them left an impression like having persistent memory ~70 years ago did.
Yeah, there will always be storage tiers simply because of cost. If someone comes up with a storage medium which is durable, performant, and super cheap to produce, it could take over quickly, but it never works like that.
Even RAM is not O(1).
This has reminded me of the old but great write up on the subject:
http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html “The Myth of RAM”
We have layers of memory with different access characteristics, and our abstractions don’t make it easy to know how data is stored in the real, physical world.
Take the basic
int* value;
For example, variables can be stored in registers – as fast as it gets, provided no aliasing forces them out to memory.
Scratch that, the compiler can optimize them out entirely and make it O(zero?) time.
Then there is the cache with its own multiple levels (two or three are common, four is possible).
Then there is RAM directly attached to the CPU.
Even that is subject to block reads: an unaligned access that crosses a boundary causes two reads instead of one.
Then there is RAM on the other CPU in multi-socket systems. It can still be addressed by the same int* pointer, but access goes through a separate bus.
Then there is RAM on the PCI bus. A graphics card, for example, can be mapped into the same pointer space.
Then there is swap that goes to a disk. If that is an SSD, it also has a RAM cache, an SLC cache, block characteristics, and whatnot.
So even a simple int* pointer can have significantly different characteristics, almost completely hidden from the programmer. The algorithm you think is O(n log n) can run like O(n^2) or worse in real life.
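A quick way to see this is to walk the exact same int* buffer in two different orders. Here’s a rough sketch – the 64 MiB size is arbitrary, it just needs to be much larger than the last-level cache so the random walk misses constantly while the sequential walk benefits from cache lines and prefetching:

/* Rough sketch: same pointer, same data, same O(n) loop – very different
 * wall-clock time depending on access order. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)          /* 16M ints = 64 MiB */

static double walk(const int *value, const size_t *order)
{
    struct timespec t0, t1;
    long long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        sum += value[order[i]];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (sum == 42) puts("");          /* keep the compiler from deleting the loop */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int *value = malloc(N * sizeof *value);
    size_t *seq = malloc(N * sizeof *seq);
    size_t *rnd = malloc(N * sizeof *rnd);
    if (!value || !seq || !rnd) return 1;

    for (size_t i = 0; i < N; i++) {
        value[i] = (int)i;
        seq[i] = i;
        rnd[i] = i;
    }
    /* Fisher-Yates shuffle to build a random visiting order. */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = rnd[i]; rnd[i] = rnd[j]; rnd[j] = tmp;
    }

    printf("sequential walk: %.3f s\n", walk(value, seq));
    printf("random walk:     %.3f s\n", walk(value, rnd));
    return 0;
}

Both loops touch the same N elements through the same pointer, yet the random order is typically several times slower purely because of the memory layers underneath.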
Thanks for the link and info!
Which level of computer Hell is that? L3, running code on real systems? After customer support (L2), but before supporting proprietary/obsolete legacy systems (L4) for eternity.