“Desktop search engines are all the rage these days. While Beagle may be the most popular desktop search engine for Linux, there are alternatives. If you are looking for a lightweight and easy-to-use yet powerful desktop search engine, you might want to try Recoll. Unlike Beagle, Recoll doesn’t require Mono, it’s fast, and it’s highly configurable. Recoll is based on Xapian, a mature open source search engine library that supports advanced features such as phrase and proximity search, relevance feedback, document categorization, boolean queries, and wildcard search.”
If any of them provided support for most file formats, emails, contacts, RSS, a nice and well-integrated GUI, all without trashing your disk or eating precious amounts of memory or requiring lots of dependencies…
all without trashing your disk or eating precious amounts of memory or requiring lots of dependencies…
My feelings exactly. I installed Beagle on Fedora 7 and tried it out for a couple of days. I ended up shutting Beagle down because it was thrashing my hard drive at night and the noise was keeping me up. I was also annoyed at the amount of CPU Beagle was consuming all the time; I noticed it sometimes slowed my applications.
I think I will try Recoll. Maybe it is less of a resource hog than Beagle. In reality, I guess I don’t have much need for desktop searching; I am used to putting articles and files into organized folders, which makes them easy to find. As I get older and my memory fades, perhaps being able to search through a directory of OpenOffice files for a specific phrase could be a useful feature.
I just installed the RPM and Recoll is busy indexing my hard drive. The only issue I have noticed so far is that it is Qt-based, so the GUI looks a little funky with the GTK theme I am using. Recoll is not using nearly as much CPU as Beagle was, so maybe it is a keeper. I have always wondered whether the Mono dependencies were the culprit for Beagle. There are binaries for the most common desktops here:
http://www.lesbonscomptes.com/recoll/download.html#rpms
Let’s wait and see…
How does this compare with Tracker:
http://www.gnome.org/projects/tracker/
I see that it is Qt, as opposed to GTK, which Beagle and Tracker use. Is it a good alternative to those two for KDE users?
Is it a good alternative to those two for KDE users?
Maybe, but KDE 4 has Strigi, which is being integrated into everything: not just desktop search, but also the underpinnings for Nepomuk (yeah, I don’t know what that is either) and all sorts of metadata-extraction tasks that the file manager does.
But neither Beagle nor Tracker uses GTK. Strictly speaking, they don’t even use graphics; they just come bundled with a GTK graphical front-end. Nothing prevents anyone from writing a Qt front-end for either of them.
Actually, Novell did write a KDE-based front-end for Beagle called Kerry; it’s part of openSUSE and SLED. They also have a kio-slave for better integration into KDE, and include Beagle search as part of their updated Kickoff replacement for the K menu. I’ll admit they did a decent overall job of making Beagle feel a little more integrated into KDE.
Unfortunately, that doesn’t change the fact that its backend is Mono-based. It’s still an absolutely horrible drain on system resources, no matter how nicely it integrates into its host DE. Slapping lipstick on a pig will only get you so far.
What looks promising is some of the proposed collaboration between Tracker and Strigi on things like APIs and file plugins. Let the DE guys focus on their own DE integration while the search guys collaborate on development. Good stuff.
Mono is king!
Let’s see where this ends *haha*
I tried Beagle and Tracker at different times and they both caused too much chatter on my HD for my taste, so I removed them. Recoll seems to let you trigger reindexing manually, though, so it might be worth a try for me.
I don’t know anything about Tracker, but Beagle can be configured to index only on user request. It can also use inotify and the user_xattr option, so Beagle doesn’t have to rescan everything; the FS informs Beagle about any changes. I don’t know how old these changes are, but maybe you want to re-evaluate Beagle once more.
In my experience, a good directory arrangement, logical file naming and slocate suffice most of the time.
I don’t understand why those desktop search programs have such an impact on performance. Don’t they just (re)index modified or newly created files? Checking the timestamp of a file should not take many resources.
I would assume the difference is in searching the contents of various files – e.g., PDF, DOC, etc. I’m not sure how good a job something like slocate can do with these files.
I would assume the difference is in searching the contents of various files – e.g., PDF, DOC, etc. I’m not sure how good a job something like slocate can do with these files.
I can imagine why the CPU usage is so high while the desktop search engines are indexing. The engine opens an OpenOffice file, parses it, indexes keywords, closes it, and moves on to the next file. That is a serious amount of I/O and processing, so it is no wonder beagled thrashes hard drives during idle time. An interesting thing to speculate about is whether this function is really that useful. I usually just need the name of a file to remember what was in it. If you look at it in terms of costs and benefits, is it really giving you a significant benefit?
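For what it’s worth, the loop being described is roughly the following. This is only a toy sketch in Python, restricted to plain-text files and with a simple mtime check so unchanged files are skipped (real engines add format filters for PDF, ODF and so on, which is where most of the I/O and CPU actually goes); the paths and the ~/.toyindex.pickle file are just examples:

    import os, re, pickle

    # Toy indexing pass: walk a tree, skip files whose mtime hasn't changed
    # since the last run, and build an inverted index mapping word -> paths.
    INDEX_FILE = os.path.expanduser("~/.toyindex.pickle")

    def load_state():
        try:
            with open(INDEX_FILE, "rb") as f:
                return pickle.load(f)
        except (OSError, EOFError):
            return {"mtimes": {}, "index": {}}

    def index_tree(root):
        state = load_state()
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if not name.endswith(".txt"):
                    continue                      # real engines run format filters here
                path = os.path.join(dirpath, name)
                mtime = os.path.getmtime(path)
                if state["mtimes"].get(path) == mtime:
                    continue                      # unchanged since the last pass
                with open(path, errors="replace") as f:
                    words = re.findall(r"\w+", f.read().lower())
                for word in set(words):
                    state["index"].setdefault(word, set()).add(path)
                state["mtimes"][path] = mtime
        with open(INDEX_FILE, "wb") as f:
            pickle.dump(state, f)
        return state["index"]

    if __name__ == "__main__":
        idx = index_tree(os.path.expanduser("~/Documents"))
        print(sorted(idx.get("clark", set())))    # look up one keyword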
Edited 2007-04-25 17:08
A well-written indexer running in an independent process with an appropriate nice value should not take anything away from interactive desktop use or even higher-priority batch work. The kernel scheduler will only let it thrash the machine when resources can be spared, and will clamp down on it when higher priorities (i.e. the user) come along.
I actually think that, in the grander scheme of things, it is _more_ efficient to have an efficient, low-priority indexer quietly using otherwise unused resources in the background than it is to clobber the system with big scans at the moment a given file is being searched for.
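As a rough sketch of what “low priority” can mean in practice on Linux, an indexer could drop both its CPU and I/O priority at startup; this assumes the ionice utility from util-linux is installed, and leaves out error handling:

    import os, subprocess

    # Drop our own CPU priority to the minimum and ask the kernel's I/O
    # scheduler to treat this process as idle-class before indexing starts.
    os.nice(19)                                                    # lowest CPU priority
    subprocess.call(["ionice", "-c", "3", "-p", str(os.getpid())]) # idle I/O class

    # ... run the actual indexing pass here, at the reduced priority ...

Note that even an idle-class indexer still pulls data through the page cache, which is exactly the cache-eviction problem raised below.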
The problem is that it will trash all the caches, so when you do start using your machine again it’s as if you had just booted: everything has to be reloaded from the hard drive.
I think it would be of great use. I, for example, have a lot of academic papers stored on my hard drive, each with the file name paper_title.pdf. That works OK, but many times it would be very handy to pull up all PDFs that mention “Clark-Ocone” or whatever. Most papers that mention the Clark-Ocone theorem don’t necessarily mention it in the title.
Even less is needed than you think — they don’t have to check timestamps.
The kernel provides “inotify,” which lets applications listen for changes to the file system. A search engine can listen to this stream and re-index files on the fly as they get updated. Beagle supposedly uses this (the author of inotify, Robert Love, is also a Beagle dev!), but for some reason Beagle still statically indexes stuff _all the friggen time_. I don’t get it either.
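For the curious, a minimal sketch of that approach using the third-party pyinotify module: watch a tree recursively and get told exactly which files changed, instead of rescanning everything. The watched path is just an example, and reindex() stands in for whatever the engine would actually do with a changed file:

    import pyinotify

    def reindex(path):
        # placeholder for the real re-indexing work
        print("would re-index:", path)

    class Changes(pyinotify.ProcessEvent):
        def process_IN_CLOSE_WRITE(self, event):
            reindex(event.pathname)
        def process_IN_MOVED_TO(self, event):
            reindex(event.pathname)

    wm = pyinotify.WatchManager()
    mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO
    wm.add_watch("/home/me/Documents", mask, rec=True, auto_add=True)
    pyinotify.Notifier(wm, Changes()).loop()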
What about using the find command?
What about using the find command?
Find is always a good option, but your more GUI-centric user probably wouldn’t like it. Sometimes I pull up gnome-search-tool if I really can’t find a file.
Of course, there’s always kio-locate as an option for KDE: hierarchical file search within Konqueror, using locate as a backend so the results are near-instantaneous. Simply type locate:xxx in the address bar as you would type locate xxx on the CLI. It actually works pretty nicely, even with the file chooser.
Worth a shot. :)
Indexing, tagging and metadata collection open up a variety of other possibilities besides merely locating your files. Think about any time you aggregate files. The more you think about it, the more you’ll realize that your music player, photo viewer, video manager, documentation viewer, etc. all spend a lot of time, memory and CPU keeping track of your various documents.
With a single store for all this information, developers are saved all the hassle of having to implement search, tagging and metadata retrieval themselves. And the user gets a smaller, lighter, more responsive suite of apps. No more importing files or renaming/linking for consistency. Your music player automatically becomes aware of any music present, as well as being able to filter based on size, artist, album, genre, etc. For free.
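One way to picture that single store is a shared metadata database that any application can query instead of keeping its own catalogue. A purely hypothetical sketch with Python’s built-in sqlite3; the table name and fields are invented here for illustration, not taken from any real desktop-search project:

    import sqlite3

    # Hypothetical shared metadata store: one table of per-file metadata that
    # a music player, photo viewer, etc. could all query. Schema is invented.
    db = sqlite3.connect("metadata.db")
    db.execute("""CREATE TABLE IF NOT EXISTS files
                  (path TEXT PRIMARY KEY, mime TEXT, size INTEGER,
                   artist TEXT, album TEXT, genre TEXT)""")

    # A music player's "all jazz albums over 5 MB" view then becomes a query:
    rows = db.execute("""SELECT path, artist, album FROM files
                         WHERE mime LIKE 'audio/%' AND genre = ? AND size > ?""",
                      ("Jazz", 5 * 1024 * 1024)).fetchall()
    for path, artist, album in rows:
        print(artist, "-", album, "->", path)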
So, for those that say “slocate works for me!”, try to think about whether slocate (or anything other than a dedicated app) would ever be useful or efficient for, say, searching through 15,000 photos for that picture of you and your wife fishing in Argentina. All the naming conventions and careful nesting in the world are no match for a shutter-happy spouse and 15,000 files named DSCXXXX.
Are you arguing here to abandon file names and directories, throw all files together, and locate them through tags and indexing? I’ve had this idea once, and I especially liked how it removes a lot of artificial complexity (folder hierarchy, hard links, symlinks, files being located by exact name match, temporary files left lying around, …). I wonder how it would perform in reality…
Are you arguing here to abandon file names and directories, throw all files together, and locate them through tags and indexing?
I think that is the emerging view in file management now: you dump everything together and, depending on how it is categorized by metadata, a search engine finds it for you. It sounds nice in theory, but I don’t think it is very practical yet. It seems like the old garbage-in, garbage-out principle applies here: files like MP3s and JPEGs that are not tagged correctly with metadata will not be indexed properly.
An interesting experiment would be to take all your MPEG, MP3, doc, JPEG, PNG, txt, etc. files and dump them into one directory without any subdirectories. Then install a collection of the current desktop search engines, index your omni-folder and see how things go. Write up an article and publish how it worked out; it would be interesting to see how well all the indexing engines work. You might want to do a backup first, though. :)
All the naming conventions and careful nesting in the world are no match for a shutter-happy spouse and 15,000 files named DSCXXXX.
That would be a difficult thing for a desktop search engine to index. You would have to add metadata to the photos or give them meaningful names. A really advanced system with tons of CPU muscle could use image-recognition software to look for content in a photograph and tag it, but it would probably be pretty unreliable at categorizing photos.
Adding metadata is trivial for any halfway useful piece of software. I come back from Argentina and dump the 2,000 photos I took onto my hard drive, and tell my photo software to tag them all “Argentina 2007”. That takes no time at all. You then open up all the pictures in your photo-browsing software and start sub-tagging: all your landscape photos as “landscape”, also fairly trivial. You then tag all photos of both you and your wife as something, all photos taken in Buenos Aires as something else, and so on. Trivial to do with metadata, not so easy with file names, especially when you want to browse all landscape photos taken in 2007 no matter where they were taken.
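One possible way to attach such tags without touching the image data itself is Linux extended attributes. A sketch using Python’s os.setxattr/os.getxattr (Linux-only, and the filesystem must support user xattrs); the “user.tags” attribute name and the photo paths are conventions invented here, not a standard:

    import os, glob

    # Bulk tagging with extended attributes: attach a comma-separated
    # "user.tags" attribute to every JPEG in a dump directory, then filter.
    def add_tag(path, tag):
        try:
            tags = os.getxattr(path, "user.tags").decode().split(",")
        except OSError:                      # no tags yet
            tags = []
        if tag not in tags:
            tags.append(tag)
        os.setxattr(path, "user.tags", ",".join(t for t in tags if t).encode())

    def has_tag(path, tag):
        try:
            return tag in os.getxattr(path, "user.tags").decode().split(",")
        except OSError:
            return False

    photos = glob.glob(os.path.expanduser("~/photos/argentina/*.jpg"))
    for photo in photos:                     # tag the whole dump in one pass
        add_tag(photo, "argentina-2007")
    landscapes = [p for p in photos if has_tag(p, "landscape")]

The trade-off raised further down applies here too: xattrs travel with the file on filesystems that support them, but are easily lost when copying to FAT media, e-mailing the file, and so on.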
In general I agree with your statement. Please let me add a few comments:
“With a single store for all this information, developers are saved all the hassle of having to implement search, tagging and metadata retrieval themselves. And the user gets a smaller, lighter, more responsive suite of apps. No more importing files or renaming/linking for consistency.”
This is a nice utopia, but I think it really is just that. Applications tend to get bigger and heavier, which is the opposite of what you’re describing. Furthermore, applications would need to conform to the indexing system’s requirements. But I hope this will change in the future.
It’s still important to have fully working applications without desktop search, for those who don’t want or don’t need it.
These indexing systems can also be used to organize projects consisting of files of completely different formats that share a certain relation.
“Your music player automatically becomes aware of any music present, as well as being able to filter based on size, artist, album, genre, etc. For free.”
This would imply that the needed information is available. But what if you have MP3 files named 2977723123123.mp3 with no ID3 tag present?
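A quick sketch of how one might at least find those problem files, using the third-party mutagen library to flag MP3s with missing or empty artist tags (the music path is just an example):

    import glob, os
    import mutagen

    # Walk the music folder and report MP3s with no usable artist tag --
    # exactly the files a metadata-driven music view could never place correctly.
    untagged = []
    for path in glob.glob(os.path.expanduser("~/music/**/*.mp3"), recursive=True):
        audio = mutagen.File(path, easy=True)        # None if unreadable
        if audio is None or not audio.tags or not audio.tags.get("artist"):
            untagged.append(path)

    print(len(untagged), "files would need tagging by hand (or by guessing from filenames)")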
“So, for those that say “slocate works for me!”, try to think about whether slocate (or anything other than a dedicated app) would ever be useful or efficient for, say, searching through 15,000 photos for that picture of you and your wife fishing in Argentina. All the naming conventions and careful nesting in the world are no match for a shutter-happy spouse and 15,000 files named DSCXXXX.”
Correct. :) Sadly, even if a photo is called 2007_03_21_argentina_my_wife_and_me_003.jpg but has completely different content (say, a picture of my boss in Germany), the indexing tool would not return the correct results, because it cannot interpret the content. It relies on information about the content, which needs to be correct. But if it is not… you surely know what I want to say. Imagine what fun it must be to check picture, document or media file contents manually.
BTW, you can use “strings” to extract the creation date from some image files…
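For JPEGs with EXIF data there are friendlier options than grepping strings output; for instance, a small sketch with the third-party Pillow library (the filename is just an example, and 306 is the standard EXIF DateTime tag):

    from PIL import Image

    # Read the EXIF block of a JPEG and print its DateTime field, which is
    # more robust than fishing the date out of strings output.
    exif = Image.open("DSC0001.jpg").getexif()
    print("DateTime:", exif.get(306))        # tag 306 = EXIF DateTime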
But the drawback is that when the metadata is isolated from the physical file, you lose that metadata when the file is transferred from your system; in effect, the single store library of information becomes a component of the environment and not an aggregation of individual file tagging.
The developers are still going to have to think about tagging, categorizing and meta info, because much of that has to be automated to an extent by the applications generating the files. I think the biggest advantage of centralized search is the ability to use a unified search mechanism without having to rely on individual applications, but I think we still need the applications dealing with it as well.