I love files. I love renaming them, moving them, sorting them, changing how they’re displayed in a folder, backing them up, uploading them to the internet, restoring them, copying them, and hey, even defragging them. As a metaphor for a way of storing a piece of information, I think they’re great. I like the file as a unit of work. If I need to write an article, it goes in a file. If I need to produce an image, it’s in a file.
[…]I’ve had a love of files since I first started creating them in Windows 95. But I’ve noticed we are starting to move away from the file as a fundamental unit of work.
There are forces at work to create as large a distance between the user and her files as possible, because not only do files represent a certain amount of user agency and control, they also represent a massive data mine for companies to profit from.
Y’know, this exactly captures the creeping sense of horror I felt when I first heard of efforts to make database-based filesystems.
Where are my files? That depends on how you query them. How can I make sure they last and can be found when I need them? Trust us. It’ll work.
(Sure, I do have plans to store stuff in databases that was once in files… but primarily because it’s already been proven that I need more cross-references to be able to reliably *find* them again, which means a common, structured backing store for e-mail, RSS, TODOs, appointments, and anything else I might want to attach, annotate, reply to, use as a reply to something else, or “typecast” from a received message to a TODO entry.)
…and the things I’m putting in databases are things where the correspondence to visible files was never relevant to begin with.
How did Eudora or Netscape Communicator or Outlook Express store e-mails on disk? I never knew or cared. I only care about how Thunderbird does it now because Mork is garbage.
How did I store my TODOs? Well, I’d put them all in a file and, when it got too big, I’d start a new file for the more pressing ones. Sooner or later, I’d lose track of TODOs because of how manual it was and how it had no way to hang things like priorities or deadlines off of them or group them into hierarchies such as “Work on hobby projects -> DOS Installer Creator -> Port public domain unzipping code -> Strip out platform abstraction which bloats out final EXE”.
ssokolow
Well, keep in mind that a file system is a database too. It isn’t an SQL database and isn’t a relational database, but there’s no reason we can’t unify file systems and databases under similar abstractions. Concepts like ACID compliance and transactions could be applied to file systems, just as they apply to other kinds of databases.
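To make the transaction idea a little more concrete, here is a minimal sketch of what "rename a file and rewrite its contents as one atomic step" could look like if the store were transactional. It uses SQLite from Python's standard library as a stand-in; the `files` table, its columns, and the `move_and_update` helper are hypothetical names invented for illustration, not an existing file system API.

```python
import sqlite3

# In-memory database standing in for a hypothetical transactional file store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, content BLOB)")
conn.execute("INSERT INTO files VALUES ('notes/a.txt', ?)", (b"draft",))
conn.commit()


def move_and_update(conn, old_path, new_path, new_content, fail=False):
    """Rename a 'file' and rewrite its content as a single atomic unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE files SET path = ? WHERE path = ?", (new_path, old_path)
            )
            if fail:
                # Simulate a crash between the two steps.
                raise RuntimeError("simulated failure mid-operation")
            conn.execute(
                "UPDATE files SET content = ? WHERE path = ?",
                (new_content, new_path),
            )
        return True
    except RuntimeError:
        return False


# The failed attempt leaves no half-finished rename behind.
failed = move_and_update(conn, "notes/a.txt", "notes/b.txt", b"final", fail=True)
rows_after_failure = conn.execute("SELECT path, content FROM files").fetchall()

# The successful attempt applies both changes together.
ok = move_and_update(conn, "notes/a.txt", "notes/b.txt", b"final")
rows_after_success = conn.execute("SELECT path, content FROM files").fetchall()
```

With a plain file API, a crash between the rename and the rewrite can leave the renamed file with stale contents; in the sketch, the rollback restores the original row.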
I know we’ve discussed this before here on osnews, but alas the osnews comment search feature is long gone, so I’m at a loss to find old posts. External search engines are not nearly as effective.
I don’t really have a problem with data going into more modern databases, it could be beneficial even. The bigger problem for me is the question of who owns/controls those databases. We’re seeing a rapid growth in services cropping up that take possession of end user data and store it into their databases. This concerns me as it greatly increases the potential for abuse.
No argument there… that’s just something that was completely outside the Overton Window for me around the time that Windows Longhorn and the GNOME guys were dabbling in the idea of relational file storage.
What I was more concerned by was the idea that, if your root partition or C: has no single, authoritative hierarchy, as the GNOME and Microsoft efforts seemed to want as their end game, then discoverability of the OS’s underpinnings suffers greatly.
Sort of like how usability experts advise all websites to have a Site Map.
Even though what you said is true, there are important aspects of files as “isolated, contained” units that should not be disregarded. First, files are easily separated from the storage mechanics and can be easily moved between devices; that would not be the case if they were database objects as we use them today. Second, regular databases impose a level of “conformity” not found on regular file systems: with the former, your data is made to adhere to the structure of your db, while with the latter, the structures were made to adapt to your data.
There are also concerns about the lifetime of objects and their metadata that are not easily solved on regular databases; on a regular fs this is already solved by zero or pattern writes.
As much as I love a db for querying and finding data and properties, I prefer to have the current “isolation”, as it better fits my regular use case and makes it harder for data gatherers to collect useful, summarized information about our activities.
acobar,
Emphasis mine. The tooling that we use for databases today hasn’t evolved to accommodate these use cases, but in principle there’s no reason that it couldn’t. The main obstacle I see is that nearly all software in existence today is coded directly to the file system abstraction from unix beginnings. We would have to emulate that abstraction/API to retain backwards compatibility. The problem is that virtually 100% of software in the wild just continues to be developed in legacy compatibility mode and does not take advantage of higher level database features.
It’s related to the problem where new operating system designers can work on new innovative APIs yet POSIX-compatible APIs are the only ones that will ever get any serious use. It can be futile to develop better data abstractions without critical mass. Heck, even when we’re desperate for a solution, as was the case with the IPv6 transition, we’re still stuck with legacy tech many decades past its prime.
BTW oracle has one such virtual file system…
https://docs.oracle.com/cd/E11882_01/appdev.112/e18294/adlob_fs.htm
As you’d expect, it stores files as blobs, so it’s of limited use for querying information within those files.
Well, not all databases are relational SQL databases. NoSQL databases may be better for ad hoc data and can store arbitrary data graphs like JSON or XML. As a software developer, I use databases for my own work and I love their indexing, syncing, and querying capabilities. However, I’m still forced to spend so much time reinventing the wheel with buffers/parsing/caching/syncing/etc. for all the kinds of files that abound on a typical system, where every piece of software uses its own format. I concede it will never happen, but it would be nice if all the software I needed were coded to a higher level abstraction.
I will hold dearly to the ability to open my files in notepad, regardless of their type. And I am writing this on a win7 machine. I may be getting old and conservative.
But there are use cases which require direct access to files, and not just some list google docs provides me. Renaming docx files to zip and uncompressing them reveals otherwise inaccessible resources to change for instance. The same applies for the translation project files I handle in the translation software I use, SDL Trados. If I had been locked down to a web interface which only provided some symbolic links to a database in the translation software firm’s web site, I would be effectively locked down in a horribly inefficient system.
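The docx trick works because a .docx file is a ZIP container, so any ZIP library can look inside it without renaming anything. A minimal sketch in Python follows; the `list_parts` and `read_part` helpers are hypothetical names, and the archive built here is only a stand-in with a docx-like layout, not a valid Word document.

```python
import io
import zipfile


def list_parts(docx_file):
    """Return the names of the parts stored inside a .docx (a ZIP container)."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def read_part(docx_file, name):
    """Return the raw bytes of one part, e.g. an embedded image."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(name)


# Stand-in archive so the sketch runs without a real document on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/media/image1.png", b"not a real image")

parts = list_parts(buffer)
image_bytes = read_part(buffer, "word/media/image1.png")
```

Passing a path to a real .docx file instead of `buffer` should work the same way, which is exactly the kind of direct access a web-only interface takes away.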
Amen.
Your comment about interfaces actually reminds me of a blog post named “Don’t make me code in your text box!” which I believe I ran across on Planet Mozilla.
https://blog.harterrt.com/coding_in_textboxes.html
Gah! How long will it take before my habits adjust to not having an “Edit” button!?
Files in a huge db? I’m cringing already. How do I send a file to an external party? Nah! I say leave things as they are.
The same way you send them now. Nothing with transmission changes.
It would probably be metadata and a link to an inode in the database. That’s how object stores work anyway. S3, Ceph, Minio, Switch, etc. don’t store the actual file in the database. They store the location of the file in the database to keep things manageable.
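As a rough sketch of that pattern (not of how any of those products is actually implemented), the index below holds only metadata and a location, while the bytes live in ordinary files. The `objects` table, the `put` and `get` helpers, and the content-hash naming scheme are all assumptions made for illustration.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

# Blobs live as plain files; the database only records where they are.
blob_dir = Path(tempfile.mkdtemp())
index = sqlite3.connect(":memory:")
index.execute(
    "CREATE TABLE objects (key TEXT PRIMARY KEY, location TEXT, size INTEGER)"
)


def put(key, data):
    """Write the bytes to disk and record their location in the index."""
    location = blob_dir / hashlib.sha256(data).hexdigest()
    location.write_bytes(data)
    with index:
        index.execute(
            "INSERT OR REPLACE INTO objects VALUES (?, ?, ?)",
            (key, str(location), len(data)),
        )


def get(key):
    """Look up the location in the index, then read the bytes from disk."""
    row = index.execute(
        "SELECT location FROM objects WHERE key = ?", (key,)
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return Path(row[0]).read_bytes()


put("photos/cat.jpg", b"\xff\xd8 fake jpeg bytes")
stored = get("photos/cat.jpg")
size = index.execute(
    "SELECT size FROM objects WHERE key = 'photos/cat.jpg'"
).fetchone()[0]
```

Sending the object to someone else still means reading the bytes and transmitting them, which is why nothing about transmission has to change.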
Reminds me of the Data Fork / Resource Fork set up on MacOS or the structured storage in various OSes that UNIX supplanted.
Both were terrible for transmitting data without breaking it unless you used something like Binhex or MacBinary to bundle everything into one “bag of bits” file… and the modern solution (which I have no problem with) is to skip that transformation and just de facto standardize on SQLite or some other “database inside a file” solution for structured data that needs to be passed around as a robustly self-contained unit.
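A small sketch of what “database inside a file” buys you, using Python’s built-in sqlite3 module: the structured data ends up in one ordinary file that can be copied like any other. The file names and the `todo` table are made up for the example.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
original = workdir / "project.db"

# Create some structured data; after close() it is a single file on disk.
conn = sqlite3.connect(original)
conn.execute(
    "CREATE TABLE todo (id INTEGER PRIMARY KEY, title TEXT, parent INTEGER)"
)
conn.execute("INSERT INTO todo (title, parent) VALUES ('Port unzipping code', NULL)")
conn.commit()
conn.close()

# Passing it around is a plain file copy, with no bundling step needed.
copy = workdir / "sent-to-a-friend.db"
shutil.copy(original, copy)

# The recipient opens the copy and gets the structure back intact.
received = sqlite3.connect(copy)
titles = [row[0] for row in received.execute("SELECT title FROM todo")]
received.close()
```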
“Flatland_Spider
The same way you send them now. Nothing with transmission changes.
It would probably be metadata and a link to an inode in the database. That’s how object stores work anyway. S3, Ceph, Minio, Switch, etc. don’t store the actual file in the database. They store the location of the file in the database to keep things manageable.”
Yo, Script Kiddie…
Ever hear of the term GIGO? Your comments are a product of that sort of thinking.
When that database blows up, and it *WILL* blow up, you and those like you will be nowhere around to fix the problems you have caused.
yoko-t,
There seems to be bias against the term “database”, but a file system IS a kind of database. Your criticism against databases applies to file systems as well. Behind the scenes, you’ll find the same kinds of data structures like btrees and journals used to implement both.
Sure, we use different APIs on top of these data stores, but there’s nothing inherently more robust about a database that uses a file system API. In principle a database like ext3fs is not more robust just because it uses a file system API and SQL is not less robust just because it uses an SQL API. The truth is transactional SQL databases can be engineered to be even more robust than file system databases. It really depends on the underlying implementation and not the API.
I’ll be honest, I’ve experienced more file system corruption than database corruption. Granted, this is probably because I don’t use full journaling on the file system, whereas I do with the databases, but still the point is that robustness has less to do with the API and more to do with how it gets stored. So with this in mind, would you concede “databases” are not inherently less robust than “file systems”?
This article reminds me of the feeling I got when Google Picasa was replaced by Google Photos. With Picasa, I had a desktop tool to organize, edit, and view my files in a variety of cool ways (facial recognition, tagging, timestamps, etc.) However, with Google Photos, everything needs to be on Google’s cloud. I don’t even have a good way to automatically copy photos/videos from my phone to my computer using that tool.
I have nearly 200,000 photos on my computer. This requires both files stored in strategically named folder trees and a database driven application (Digikam) to store and view metadata and thumbnails in order for the collection to be manageable. I can see the appeal of storing the files and metadata all in a database, but at the same time I cling to the idea of backing up the photos while not necessarily caring as much about the metadata in the database – especially when I write the critical tags to EXIF data and use filenames to create subgroups within the folders.
Ahhh, a fellow Picasa user. I still use it. Saved the installer and install it on my new machines.
Google expects you to take photos on your Android phone, and upload them directly to the cloud, so that you will be locked in. You want to move somewhere else? “Sure, here’s a very clunky tool that will nominally let you download your files in a very disorganized structure.” You don’t want to deal with that mess, but are running out of space? “Sure, hand in a Benjamin and get 50 GB more for one year.”
This is very romantic, and ignores, or misses, the why of it all.
1) Just because a file exists doesn’t mean it’s useful.
Unless the format is documented, you don’t own your data. Outlook or Pages are just as user hostile as any SaaS, so let’s not get all misty eyed over “the files”.
Then there are programs which store data in idiotic ways. Like the log file which was space separated and allowed spaces in the data fields, but didn’t quote fields. Yeah, it was a mess trying to parse that thing to extract data.
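To illustrate why that kind of log is a mess, here is a small Python sketch. The log line is invented for the example (the real file’s layout isn’t known); it compares an unquoted line with the same line written using quoted fields.

```python
import shlex

# Hypothetical line: date, time, event, user name, status.
# The user name contains a space and is not quoted.
unquoted = "2019-10-01 12:00:01 login John Smith ok"
fields_naive = unquoted.split(" ")
# Six fields come back instead of five, and nothing in the line says
# whether "Smith" belongs to the name or to the status column.

# The same record with the field quoted parses unambiguously.
quoted = '2019-10-01 12:00:01 login "John Smith" ok'
fields_quoted = shlex.split(quoted)
```

Once spaces are allowed inside unquoted fields, no parser can recover the boundaries reliably; the best it can do is guess from the position or shape of the neighbouring fields.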
Things have gotten better. SQLite has become a standard way of storing data. File formats are more open now and many SaaS products have APIs which allow data to be exported.
There is still work to be done, but it’s better. If we take nothing else away from FOSS, I hope empowering users to control their data by documenting and opening file formats is the thing.
2) Files are a necessary evil.
iDevices standardizing on the app model makes a lot of sense. It reduces complexity, and it increases security.
For most people, the data and the application are tied together. Outlook is email. Pages is how people write things. They want to do a thing, and some tool enables them to do the thing they want to do. A very small portion of the world is tech savvy, and the rest just want something to let them do what they want to do. Building scripts around pandoc and LaTeX is not user friendly. I’ve had great success doing just that with both of those, but most people don’t want to do that.
Backups and restores are easy. People don’t store stuff in random places, like everything in the root of the C: drive for one example.
All of those files sitting around are attack vectors. Either people put lots of effort into hardening their parsers, or they don’t let people put random junk in there. As a side note, one of the funniest/scariest things I’ve heard lately is to write configs in code then just execute the code on start up.
3) We live in a connected world with multiple devices.
An incredibly first world problem here. We have network connected devices, and we have multiple devices. The service oriented style we have now is the best way to get data in sync between everything. I edit something on my desktop, it syncs with the cloud, and the changes show up on my phone. It’s awesome.
The alternative is having to sync everything manually like the first Palm Pilots. It was horrible and clunky.
This is of course possible because of APIs, which are one of the better ways to interact with things. As someone who can write code, this is really nice, and it means there is a consistent interface which I can plug multiple things into. The data is consistent, and I get exactly what I want. It’s the object oriented dream Alan Kay had. 🙂
Nothing is stopping people from running their own SaaS. It does take time, money, and knowledge though. Once again, most people aren’t tech literate and giving Google, Apple, MS, or whoever a couple of dollars a month, or living with the limitations of the free tiers, is worth it.