“A number of search engines are available for the Gnome and KDE desktop environments, many based around the open source Lucene search engine. It would be tremendous if we could adopt one of these search engines for the Gnome platform, so we can provide the type of integrated search experience for our users that they really need, irrespective of which distro they are using. So to help in this assessment we have carried out a comparison of four different Unix based indexers [.pdf].”
For some reason Beagle lost points for being written in C# and needing Mono, but JIndex didn’t lose points for being written in Java and requiring a jvm.
Preferences aside, that seems a bit odd.
well, it’s written by a sun (Java lovers, duh) guy. so it makes sense he doesn’t trash mono and JIndex too much, even though they clearly suck in terms of performance and memory usage…
if they made a smart choice, they would go for the best performance and fewest dependencies – strigi
so it makes sense he doesn’t trash mono and JIndex too much, even though they clearly suck in terms of performance and memory usage…
I could never for the life of me understand why the decision was made to use mono as a framework for beagle, something intended to run as a transparent background service. That’s not really a slag against mono per se, it’s just a question of whether it makes sense to use it for something that should by design be fast and light. In fact Novell compounded this mistake by using mono for Zen as well, which led to much of the griping and complaints about SL 10.1 and how slow and freaking unstable the package management was.
Not sure I understand why Sun is making the same mistake with Java either, aside from the obvious. I guess in a way it does provide a benchmark for optimizing the performance of an app like this to accommodate the overhead costs etc. but still…
I’ve played with Strigi, and I do like the fact that you never “feel” it running. That’s how it should be. I don’t care if it’s spiking my CPU when my system is idle, as long as I get my juice back the moment I need to do something else. The cool thing is that the core engine seems to be pretty much worked through, now it’s just further optimizing and building the hooks and plugins. Collaboration with the tracker team on unified interfaces would be fantastic.
well, novell wrote beagle to show how great mono is and promote it, and sun wrote their java indexer to show how great java is and promote it. of course, it’s not really good advertising in my mind – if you see those numbers, it’s clear there are some problems
yeah, tracker and strigi are the way to go – and as the latter seems to be much faster (30 times?), can index inside zip files etc, has fewer dependencies and is working with the Nepomuk project to add contextual information, i’m glad KDE 4 will be using strigi.
“best performance” in terms of speed and memory usage, but not “best performance” in terms of actual usefulness (i.e. it missed a bunch of results that the other tools got).
I’m not sure you can even compare performance all that usefully since the fixes required to get those extra hits may end up slowing down the app. Still, it was quite impressive for a young project and the developers seem to think the problems were rather minor.
For sure! The report is very useful in showing problems too. The main point that must be addressed is the low result count. This is something I was totally unaware of. The reason for that is that I’ve not gotten around to writing unit tests to test search reliability. This is the first thing to pay attention to after the KDE metainfo work is finished.
Nevertheless, the overall impression is very good. Most negative points are rather vague and revolve around smaller issues. Please forgive me for being overjoyed at the huge speed differences.
-oever
For some reason Beagle lost points for being written in C# and needing Mono, but JIndex didn’t lose points for being written in Java and requiring a jvm.
I guess the summaries have been written by different authors and haven’t been synchronized prior to publication.
For example, the Strigi summary lists “not clear ANSI C” as a con (being written in C++), which clearly also applies to the programs written in C# and Java
I found it odd that evidently only Strigi needs a build framework while all the other projects seem to ship hand-written makefiles. I would have expected Tracker to be using autotools and the Java application something like Ant.
i think this is a nice article, though it has some shortcomings. one is the remarks on cpu usage – the author seems to fail to realize that beagle not using 100% cpu is in fact a bad thing, for several reasons.
first, the linux kernel will think it’s an interactive process, increasing its priority, thus allowing it to pre-empt user processes, killing the interactivity of the system.
second, on a laptop, it’s better to use 100% cpu for 1 min than 50% for 2 mins, in terms of power usage…
thus the fact that Strigi uses max cpu is positive, not negative. and it makes for a good choice, being up to 40 times faster at indexing (4 min, vs 2 hours for beagle and 3 hours for tracker) – the author states most of the noted problems are rather trivial to fix.
http://www.kdedevelopers.org/node/2639
anyway, a common plugin engine and dbus-interface would be good, for sure.
I imagine most users would run Beagle as ‘nice -n 19 beagle’ or the like, so the kernel pushes the priority down and workflow isn’t interrupted.
Still a big fan of slocate, personally. Can force an update when I want (which can take under 10 minutes, if you have updated recently and are on a relatively new computer). The results are in a simple list format, etc. Not as advanced or user friendly, but it’s nice to have a simple version of a desktop indexer available still.
if you run beagle at nice -n 19, and it throttles, the kernel will increase its priority by… surprise, 19 points, thus it’ll run at prio 0. better than -19 (yeah, inverted, blablabla) which would be the case without the nice, but still not what you want.
and even if it does run at +19, it STILL uses cpu, even when you’re playing a game. a scheduler policy like SCHED_BATCH would ensure it NEVER interrupts another running process – that’s what you want.
and slocate, does that look inside files as well? anyway, i’d rather have incremental updates like beagle & friends have.
That’s not accurate – the most the linux kernel can adjust nice levels by is +/- 5
yes, that might be true. i don’t use the mainline cpu scheduler, but the staircase one, which might have more levels to change…
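The SCHED_BATCH policy mentioned above can actually be requested from the command line. A minimal sketch, assuming util-linux’s chrt and ionice are available; the indexer command name is just a placeholder, not a real binary:

```shell
# Launch an indexer under SCHED_BATCH so the kernel treats it as a
# non-interactive batch job (no interactivity bonus), and under the
# idle I/O class so its disk reads never compete with the foreground.
# "my-indexer" is a placeholder command, not a real program.
chrt --batch 0 ionice -c 3 my-indexer
```

With this, the daemon can happily burn 100% of an otherwise idle CPU without pre-empting anything interactive.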
While I think strigi is nice and has potential, it proved to be horribly unstable (0.3.11) in my own tests, left many zombies around etc.
It still has a lot of rough edges.
I also can’t confirm that tracker is 40x slower. It is a bit slower, but more in the region of 20% to 30% (*not* times).
Also, what good is being lightning fast when your search results are not good?
Again, strigi seems to be a very young project, so there is hope that these issues will be fixed.
well, indeed, all these projects are pretty young, so we’ll have to wait and see which one will stand out as the best solution. tracker and strigi of course have (imho) the best chance, being reasonably performant and not depending on controversial stuff like mono/java. if both happen to deliver the same d-bus interface, the better one will be used the most, and that’s the optimal solution.
btw strigi also delivers database services and is going to be the foundation of meta-data extraction and manipulation in KDE 4, in addition to having Nepomuk (contextual linking, labeling etc) integration, so i think it has the best cards right now… on the other hand, tracker is close to integration in gnome, and even though gnome mostly doesn’t integrate things very deeply (or at least, does so slowly), gnomes don’t like stuff smelling kde’ish. after all, they even rejected aRts, even though it was plain C, had a gnome-lib dependency and was the only technically reasonable solution back then…
but things can change.
they even rejected aRts, even though it was plain C
aRts is written in C++, not plain C
had a gnome-lib dependency
It didn’t, unless you are referring to using glib to some extent; however, I think it had a reduced copy of it inside its own source tree.
i stand corrected on the c++ issue, i see it’s indeed C++. but g(nome)lib is definitely a dependency, it’s the main reason apt-getting KDE pulls it in…
but g(nome)lib is definitely a dependency
As I said it depends on glib, but it definitely does not depend on gnome-lib, two very different libraries.
From someone as well informed as you I’d almost consider it flamebait
hmmm. i’m getting confused now. i thought for a long time glib wasn’t related to gnome, then something came up (don’t remember what, see my nick) which apparently made me think it WAS gnome-related…
now you say it isn’t. hmmm
and google says:
GLib is the low-level core library that forms the basis of GTK+ and GNOME.
so, who’s right? you or google?
so, who’s right? you or google?
I am, of course!
glib is definitely a core dependency of GNOME; it is the GTK+ platform abstraction library.
However, it moved to being a general-purpose C utility library a long time ago; several projects not related to GNOME use it as well, it is always packaged separately, etc.
well, i was more-or-less wrong, then, but at least it’s understandable
glib is basically a set of c functions to handle data structures plus a few things like event loops and basic threading (see http://developer.gnome.org/doc/API/2.2/glib/index.html ).
There’s nothing GNOME-specific about it and it’s extremely useful if you’re doing any C programming since it allows you not to re-invent the wheel (e.g. virtually every C programmer re-invents the linked list). The danger of this constant re-inventing is that you might accidentally re-invent it wrong or re-invent it non-optimally.
The danger of this constant re-inventing is that you might accidentally re-invent it wrong or re-invent it non-optimally.
The danger of constant re-inventing is that you’ll make one or both of the above mistakes 99% of the time!
aRts also sucked.
yeah, esound was great… now, after arts being unmaintained for 3 years, gstreamer is just a little better. ok, 0.10 is seriously better, but hey, it’s beating 5-year-old tech…
I also can’t confirm that tracker is 40x slower.
I can confirm it’s most definitely not!
The article in question tested the ancient 0.5.0 release of tracker, which was the first version to include our new indexer framework and was completely unoptimised.
The latest release, 0.5.3, is tons faster and should be much closer to strigi. We are doing some more optimisation work in the next version so it will be interesting to see how they compare then.
Note also that strigi does not currently do any string processing like stemming, so it lacks the ability to do accurate searches on plurals and stems. This might account for some of its impressive raw speed as well, so we really need to see strigi get features like this to do a more “fair” comparison.
but strigi does create an md5 (or something) hash for every file it indexes, to find duplicates. does tracker do that?
btw good to hear tracker is getting some performance improvements
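For what it’s worth, that kind of content-hash duplicate detection can be sketched with standard tools; a rough illustration assuming md5sum and GNU uniq (the directory path is just an example):

```shell
# Hash every file, sort by digest, and print groups of files whose
# first 32 characters (the md5 digest) repeat - i.e. duplicates.
find ~/Documents -type f -exec md5sum {} + \
    | sort \
    | uniq -w32 --all-repeated=separate
```

An indexer that stores the hash at index time can of course answer this instantly instead of rescanning the disk.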
so using 100% of CPU is now a feature and not a major bug?
I don’t think so.
When nothing else is competing for the CPU, why not use it all? With the appropriate nice setting, strigi will happily settle into the background and give up the CPU to interactive tasks, so there’s no impact on interactive users (or other, higher priority, batch tasks either). That’s why OSes have schedulers…
indeed. when indexing, you want 100% cpu – anything less is just wasting cpu cycles… and of course, when indexing is done, cpu usage and mem usage should be as low as possible (and afaik most indexers do that right…).
You don’t always want 100% CPU.
Here are examples:
http://mail.gnome.org/archives/tracker-list/2007-January/msg00177.h…
—
Migi
yeah, i read that, it’s my own discussion with their developers.
and i think you should read the rest of the discussion…
Here’s a commentary from strigi’s creator (strigi will be used in kde4, i think): http://www.kdedevelopers.org/node/2639
If there’s one thing that this report makes absolutely clear, it’s that managed environments like Mono and Java have absolutely no place in the world of low-level userspace daemons. Beagle’s memory usage was more than 10 times that of Tracker and Strigi, and JIndex was even worse. Beagle is good at what it does, but anyone who thinks that it is a long-term solution for metadata indexing is either mad, or working for Novell.
So given that we are left with a choice between Tracker and Strigi, the question is, of course: which one? Clearly both of them still have some deficiencies (which is why most distros are going for Beagle right now), and both are in heavy development. A clear winner is hard to pick.
Unfortunately I think it might be the case that Gnome goes for Tracker and KDE goes for Strigi, which would be a shame, as this is definitely an opportunity for the two desktops to work together to achieve a common goal. Even if they do go for separate solutions, I hope that they could work out a common search API, so that the user wouldn’t have to have two daemons running to be able to use Gnome and KDE apps at the same time.
It’s also worth noting that Tracker aims to be not “just” an indexer, but a complete metadata database. So, for example, Rhythmbox (or Amarok!) wouldn’t need to maintain its own database, but would instead just be able to query tracker for all the audio files on the system. I don’t know whether the author of Strigi has similar plans, or indeed whether such a thing is practical.
Lastly, I have to say that this is the least professionally-written report I’ve ever seen. I realise that it’s primarily for internal use, but if I were to hand my boss something like this (“yes, I do like cakes”) I would probably find myself in some quite hot water…
From what I’ve read (sort of hinted at in the review), Tracker and Strigi are working on a common API so that, in theory, one could use Tracker as the front end database (that’ll be used for more than just indexing, e.g. a bookmarks database) and Strigi as the backend indexer. Not sure if this will pan out.
If there’s one thing that this report makes absolutely clear, it’s that managed environments like Mono and Java have absolutely no place in the world of low-level userspace daemons
True, however I found it quite awesome how fast the Java index starts up after the first time (startup times diagram, page 10)
I hope that they could work out a common search API
There is a lively discussion about this on the main freedesktop.org mailing list, starting back in November and still continuing (subject “Simple search API”):
http://lists.freedesktop.org/archives/xdg/2006-November/thread.html
I don’t know whether the author of Strigi has similar plans
Well, Strigi already indexes metadata and I think there is a goal to make it collaborate with Nepomuk’s data relation framework
Well, Strigi already indexes metadata and I think there is a goal to make it collaborate with Nepomuk’s data relation framework
Not as good as having an all-in-one integrated database/indexer like tracker (and vista). The problem with a dedicated indexer is that it can’t efficiently be coupled with a database without duplicating all the metadata in both databases, and then which engine do you use for searching: Lucene or the DB?
Tracker was designed from the ground up to integrate both, using a tightly coupled sqlite and a custom inverted word index; this gives you tremendous power without duplicating metadata, and one interface for searching.
Not as good as having an all-in-one integrated database/indexer like tracker
Strigi does this as well (SQLite, if I remember correctly). There should be slides from the Strigi presentation at aKademy06 somewhere which have details about the Strigi architecture
…why would a search engine need to be desktop environment specific?
What if I run gnome, kde, and xfce?
Why should I have 3 indexes, one built for each environment?
yeah, 3 index tools would be bad in the long run. but they’ll all support d-bus, hopefully standardized, so you can swap ’em and use the same one in each environment.
the mere fact that we (now…) have several index daemons is NOT a bad thing, btw. it’s how Free Software works, and in the end, the one with the best design hopefully wins and gets used most. darwinism rules the FOSS world, and that’s a good thing.
Does the average user really need more than an occasional grok? Maintaining the db would seem to be a completely wasted effort. Something more like an Illustra image DataBlade would seem much more useful.
#!/bin/sh
# descend directory tree and search files for a pattern
if [ -z "$1" ]; then
    echo "
descend directory tree and search files
usage:
    grok '<pattern>' '<file mask>'
"
else
    find . -name "$2" -print | while read -r i
    do
        whack=$(grep -n "$1" "$i")
        if [ "$?" -eq 0 ]; then
            echo "
FOUND: $i"
            echo "$whack"
        fi
    done
fi
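The same grep-style approach can at least reach into compressed files with zgrep (gzip’s grep wrapper), though it is still a linear scan rather than an index lookup; the path here is just an example:

```shell
# zgrep transparently decompresses gzip files before searching;
# -l lists only the names of matching files.
zgrep -l 'invoice' ~/archive/*.gz
```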
doesn’t seem to give instant results on a 500 GB disk, don’t you think? and does it give results from compressed files (strigi does…), or meta-data from images and stuff? and does it tell you when and where you downloaded the file, or from whom you got it? does it give contacts if you search for a friend’s name? and his emails and the files he sent you?
sorry to put it so harshly, but it’s 2007.
Anyone tried Namazu? I used it when I needed to “crunch” 5 MB of documentation, mixed PDF, HTML and Word .doc, in a very short time. What I liked the most was the ability to execute a simple, Google-like query from the command line, and the results were displayed in the preferred browser, just like Yahoo or Google.
One can have different index sets with Namazu, too.
I have not tried any of the reviewed programs, so I can’t compare them to Namazu. Namazu can be useful as a search engine for a single website, because it can run as CGI, too.
DG
It may look strange that the reviewer went as far as using a November 8th cvs release for one indexer yet used a several-months-old release of beagle (when a beagle release happened on November 1st). It may not be malice, but just ignorance on his part.
Anyway, beagle had many fixes for memory usage in the releases following 0.2.7 and some more even after November 1st: the actual memory usage of beagle is much lower now.