GLS³ is an open-source semantic storage solution for GNU/Linux. It indexes your data, extracts metadata and relevant information from it, and lets you organize it using queries and tags. It also offers an API that lets developers integrate search and organization capabilities into their applications, shared schemas between applications, a pseudo-filesystem for backward compatibility, a web interface, as-you-type searching, and more. Check their site for a demonstration.
As a programmer, the technology behind it looks interesting. As an OS enthusiast, however, I’m alarmed at what this might lead to: specifically, a WinFS-style database filesystem. Don’t get me wrong, fast searches and queries are definitely a good thing, but building an entire filesystem around that one concept is bound to degrade the ability of users to navigate their files easily and efficiently.
OK, so now we have two approaches to desktop search: one where metadata is extracted and cached (like Beagle, Spotlight, Vista), and another where the data itself is structured according to the metadata (GLS³, and possibly BeFS, though I’m not sure). MS canned their structured-data solution, and people seem to be happy with the desktop search functionality recently made possible on traditional flat filesystems.
This project looks really cool, and I’d like to see it integrated tightly into a Linux distribution, but they have to demonstrate compelling functionality that really exploits the structured nature of the filesystem (probably in the form of a significant performance advantage). Otherwise Beagle will become the de facto standard search and indexing solution on free software desktops.
I think Beagle tightly integrated with something like Reiser4 is really the best solution technically. For desktop use, the properties of Reiser4 (great performance/footprint for small-file storage) are just really appealing. The combo largely preserves the semantics of an unstructured underlying FS, while offering enhanced performance for whatever structured database index is layered on top of it. However, it may be the case that the design is simply too radical for the OSS community to implement.
There’s nothing fundamentally absurd about designing a storage system that supports both structured and unstructured data models without significant compromise. Storage is all about abstractions and references anyway. The hierarchical directory/file model is just one possible abstraction of a common on-disk format. Just as easily as we can recognize a mess of somewhat fragmented binary and text files as if they were arranged in a tree rooted at /, so can we recognize them as a set of structured collections.
In a traditional filesystem, a node has links to its children and a link to its parent (and usually to itself). In a structured filesystem, a node has, for each of its attributes, links to at least one other node (if any) that shares that particular attribute. There’s nothing, other than a little storage overhead, preventing filesystems from implementing nodes that contain both sets of links.
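The dual-link idea above can be sketched in a few lines. This is a hypothetical in-memory model of my own, not GLScube’s or Reiser4’s actual format: each node keeps the usual hierarchical parent/child links and, per attribute, links to the other nodes sharing that attribute value.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based hashing, so nodes can live in sets
class Node:
    name: str
    parent: "Node | None" = None                    # hierarchical link up
    children: dict = field(default_factory=dict)    # hierarchical links down
    attr_links: dict = field(default_factory=dict)  # attr -> set of peer nodes

class DualFS:
    """Keeps both views at once; the only cost is the extra link storage."""
    def __init__(self):
        self.root = Node("/")
        self._by_attr = {}  # (attr, value) -> set of nodes with that pair

    def create(self, parent, name, **attrs):
        node = Node(name, parent=parent)
        parent.children[name] = node
        for attr, value in attrs.items():
            peers = self._by_attr.setdefault((attr, value), set())
            for peer in peers:  # cross-link with every node sharing the attr
                peer.attr_links.setdefault(attr, set()).add(node)
                node.attr_links.setdefault(attr, set()).add(peer)
            peers.add(node)
        return node

fs = DualFS()
a = fs.create(fs.root, "report.odt", tag="work")
b = fs.create(fs.root, "todo.txt", tag="work")
# Both views coexist on the same nodes:
assert fs.root.children["todo.txt"] is b     # hierarchical view
assert b in a.attr_links["tag"] and a in b.attr_links["tag"]  # structured view
```

The point is that neither set of links invalidates the other; a real filesystem would persist both on disk, paying only the storage overhead mentioned above.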
With regard to Reiser4, I think it’s the second-most innovative filesystem around (props to ZFS), and its plugin architecture should allow third parties to add the plumbing necessary for a structured model. I’ve been expecting an RDB plugin for Reiser4 for some time now, and I wasn’t the first to suggest it, not by far.
I’m not so sure that Reiser4 is that appealing for desktop users in general, though. If it were, there would be more distributions featuring it already. There’s politics with the kernel devs over Reiser4 using its own plugin architecture instead of the Linux VFS for things the VFS can already handle, and Hans/Namesys isn’t treating the kernel community as if cooperation is necessary to deliver a Linux filesystem… but the primary reason distributions (and users like me) don’t like Reiser4 is that its speedups come along with regressions, and it doesn’t have a great track record for stability and coherency. That, and it has unprecedented CPU usage for a filesystem, which is at best a trade-off for servers and at worst a showstopper for mobile applications.
Do we not get a Sunday eve column this week?
Reiser4 is sadly more hype from uninformed potential users than anything else. Although Reiser4 does allow plugins and massive amounts of metadata, these come with a big performance penalty. If you don’t believe me, take a look at these numbers from a fairly recent kernel and tell me what you think:
http://linuxgazette.net/122/TWDT.html#piszcz
Also, a patch went into mainline for Linux kernel 2.6.15 that increased ext3 speed by around 30%, according to some unscientific benchmarks I did:
http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=…
As a Linux systems admin, I’ve seen problems with desktops and servers running ReiserFS where volumes corrupt themselves for no reason. I can’t say the same for ext3, which is why our server standard now says no to ReiserFS (and that’s on SLES9, where the default is Reiser).
I completely agree about the hype and the uninformed Reiser4 users.
See this thread from lkml:
http://lkml.org/lkml/2004/8/27/70
After being publicly called on it by a former employee, Hans admits outright to basically rigging the benchmarks on the Namesys site to make Reiser4 look good and ext3 look bad, by systematically dropping the phases in which ext3 beats Reiser4 (and very badly) from the published results. He mumbles something on the list about adding a note to the site explaining that this was done and why. It’s been two years. The benchmarks are still there. But there’s no note that the unfavorable test phases were dropped.
I would encourage Reiser4 fans to read the above LKML thread. However, here is one of the most interesting excerpts. Nikita Danilov is responding to Hans Reiser:
“””
> If you ask real users, they say that reiser4 is fast, and their
> experience matches our benchmark. You can criticize the benchmark if
They experience 90 second stalls. And please, do not tell me how fast
reiser4 is, I spent a lot of time working with it, and I know very well
when it’s fast, and when it’s deadly slow.
> you want, but then you should run your own and publish it.
I did, after which you told me to turn OVERWRITE and MODIFY phases off,
beause performance was horrible.
I not criticizing mongo benchmark per se. I think that it is
fundamentally wrong to use results that were deliberatly manipulated to
get best appearance to reiser4 (by omitting work-loads where it performs
poorly) as an argument. It’s not clear who will, according to your
colorful expression, `eat dust’ as a result of this. Or do you think
that users never overwrite of modify files in real life?
“””
Even the menu bar in their screenshots is copied from Apple…
There are only so many solutions to a problem, so why not pick the one that seems to work best? I mean, come on: when people say Linux/Mac OS/BeOS/etc. had feature X before Windows did, that’s usually because Microsoft is claiming to have invented something it didn’t, or is going further by trying to patent something with plenty of blatant prior art. Take tabbed browsing, for example: Opera and Firefox had it first, but Microsoft implemented it later and got the patent too. Then there’s the time MS had “Microsoft invents BASIC language” on a timeline they posted to their website.
These guys aren’t claiming that they invented what is essentially Spotlight; they just happened to pick what is probably the best design conceivable. If you ask me, it’s paying tribute to Spotlight rather than a malevolent attempt to rip it off and claim the innovation.
So why, when Apple added Dashboard to Tiger, was everybody shooting at them because they “copied” Konfabulator…
The fact is, when Linux developers copy others, it doesn’t matter (just like Exposé in SuSE 10), but when Microsoft or Apple copy others, it’s a scandal…
There is a big difference here. While Apple claims inventions, these guys do not. Neither does XGL.
On the other hand, I always appreciate a technology being improved, as long as they don’t start claiming innovation.
Yes, there is a big difference here.
Linux users have been claiming for years that Apple and Microsoft suck.
Years later, we find the same features in Linux, the very ones that supposedly sucked, but strangely, now that they are on Linux, they don’t suck anymore…
“While Apple claims inventions”
Apple claims innovation, not invention. They are not the same thing; read a dictionary if you do not understand the subtlety.
Back to our topic: GLScube has great potential. Its pseudo-filesystem allows compatibility with legacy applications; I must say I’m impressed. However, I would like each file to have a “primary tag” which would correspond to a traditional directory on the filesystem, so that when the database containing info about the files becomes corrupted, you still have the primary structure available.
I particularly like the idea of virtual folders… I’d personally like something like that integrated into Nautilus. Tagging should also be supported in Nautilus and the file dialog, so that every time you save a file in any application, you can choose to tag that file on save.
I hope this project gets picked up by the two big DEs: first because it offers some nice new things like semantic information, and second because it tries to be DE-agnostic. Strigi dev vandenoever had a chat with these guys, so who knows…
(strigi is in KDE-svn, and *might* become the search/indexing/cl engine for KDE 4)
Is it possible to install and use it on an existing DE, or does the DE need to be tweaked to recognize it? IOW is this something users could use regardless of whether it’s “officially” incorporated into either Gnome or KDE?
The DEs will have to work with them to get the full benefit, I think. But you can use it without integration, like you can now use Beagle in KDE.
I saw the videos and this indeed looks nice.
It’s great that there’s competition between various implementations of search engines so we are not just stuck with beagle 🙂
It would be cool to get a freedesktop.org standard for metadata so we’re not locked into one desktop.
For me this just popped out of nowhere today. Could anyone elaborate on the history of this project?
All my textual data from the last 10 years is just 70 megabytes: inmail, outmail, notes, and net clippings. My sequential desktop search scans that much in just over 4 seconds, with results highlighted and shown in context.
Other desktop search engines are complex indexing database monstrosities designed for millions of times that amount of data, and their results are an index mishmash of context and tiny preview screens. Search is always about the CONTEXT.
A lifetime’s reading can fit on a DVD (4.7 GB). For those with less than 200 MB of textual data, indexing is overkill.
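A sequential scan with in-context highlighting like the one described above is only a few lines; at sequential read rates of tens of MB/s, 70 MB takes a handful of seconds, which is why an index buys little at that scale. A minimal sketch (the function name is mine, not from any of the tools mentioned):

```python
import re

def context_search(text: str, term: str, width: int = 30):
    """Sequentially scan `text`, yielding each hit highlighted in context."""
    for m in re.finditer(re.escape(term), text, re.IGNORECASE):
        start = max(0, m.start() - width)
        snippet = text[start:m.end() + width]
        yield snippet.replace(m.group(0), f"**{m.group(0)}**")

# At ~20 MB/s sequential throughput, a 70 MB corpus takes ~3.5 s,
# in line with the "just over 4 seconds" figure above.
```

Feeding it each file in turn gives exactly the kind of highlighted, in-context result list the comment describes, with no index to build or corrupt.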