Plugins I've Known and Loved #3: Ultrasphinx


Mar 03, 2008

If you've ever implemented searching in your rails app, you probably noticed that it's a major hassle (unless, of course, you already know about Ultrasphinx). Ferret is probably the most commonly used indexing solution, but, it isn't anywhere near production ready. acts_as_solr, another option, is so full of show stopper bugs that it really isn't even worth bothering with. The worst part about both of those solutions is that they usually work great in your development environment. Try to put them in to production, though, and you're in for big problems.

Enter Ultrasphinx, by Evan Weaver. Sphinx is a super high performance search daemon, designed to suck information out of a mysql or pgsql database (though, it isn't limited to that). Normally, one would configure Sphinx with SQL queries that it uses to fetch the data. Apparently, the configuration file can be a major hassle to work with — I wouldn't know. Ultrasphinx provides you with a declarative API that it ultimately uses to generate the Sphinx configuration file for you, making the process quick and painless. A few lines of code, a couple of rake commands, and you're searching.

The Sphinx approach has several advantages over ferret, and acts_as_solr. First of all, Sphinx is extremely stable. They're nearing a 1.0 release, and its stability certainly merits that. You don't even really need to worry about monitoring the search daemon, because it just isn't going to crash. Also, if you're running any rails apps in production, you know that long running ActiveRecord callbacks can lead to your app performing very poorly. The ferret and solr solutions both rely on active record callbacks to inform the indexer of new data. Moreover, if the search daemon goes down for any period of time, or, say, the index becomes corrupted (ferret, I'm looking at you), your entire app is going to be down for the count. You set the Sphinx indexer to run every half hour on a cron job, and since it works with the db directly, its performance and stability characteristics have absolutely zero impact on your app.

Trying it Out

So, let's implement a search for our blog. First, we'll want to edit the paths to the index and logs in the default config file. Note that Ultrasphinx uses two types of config files: one that the programmer edits, and one that it generates; both are necessary on all machines that are accessing the index (seriously, not having the generated config on a slave machine caused me some trouble). Then, we'll want to declare our post model as indexed (note that you must declare the fields as strings, not symbols):

class Post < ActiveRecord::Base
  is_indexed :fields => ['title', 'body']
end

Then, we'll need to ask Ultrasphinx to generate a configuration file. You've got to re-run this rake task any time you make changes to the definition of your models' is_indexed declaration. Make sure to .gitignore (svn:ignore for people still stuck in svn land) the generated file in development. Evan recommends checking the production file in to version control, but I have it set to generate automatically on deploy with a cap recipe. That way, if I make changes to the indexing, I won't forget to re-generate the config file. All you have to do is run:

$ rake ultrasphinx:configure

Then, since it's our first time, we've got to run the indexer:

$ rake ultrasphinx:index

Finally, we'll need to start the search daemon:

$ rake ultrasphinx:daemon:start

Now, we can start searching.

@search = Ultrasphinx::Search.new(:query => params[:query])
@search.run
@search.results

So, it's really easy to build a basic search engine using Ultrasphinx. There are some gotchas, though.

Gotchas & Notes

  • More complex indexing with Ultrasphinx can be slightly more verbose, and SQL-focused than it would be with a solution that relies on AR callbacks to do its indexing.
  • Transforming data with Ruby is impossible; the data you index must be in your database (unless you use a stored procedure, which can actually be written in ruby if you're using pgsql, but I digress).
  • Ultrasphinx preloads your indexed models when it is initialized. So, if your models depend on any monkey patches that live in your app's lib directory, they must be loaded before the Ultrasphinx (in my experience, this has meant pluginizing my monkey patches). Because of the way that exceptions are caught in the preloading routine, you won't see the actual error that is stopping your model from loading. Instead, you'll just get a constant loading error, or a name error, or something. If you see something like that, and your models load fine without Ultrasphinx installed, look for dependency issues.
  • Ultrasphinx attempts to preload your indexed models using a regex that doesn't respect commenting (at the time of writing). If you're struggling with issues mentioned in the last gotcha, you'll probably try commenting out the is_indexed call to see whether that's what's causing the problem. That won't work. You can either delete the is_indexed call entirely when debugging, or pull from my git repo, where I've modified the regex to respect #-style commenting (but not the =begin/=end style).
  • If you see DivideByZeroErrors in production, it's probably because you're missing the generated configuration file on one or more of your app server machines.

Check it Out

You'll need to grab Sphinx. Then, the plugin...

Get it from svn:

$ svn export svn://rubyforge.org/var/svn/fauna/ultrasphinx/trunk vendor/plugins/ultrasphinx

Or pull from my git repo (for the change described in the gotchas section):

git clone git://github.com/giraffesoft/ultrasphinx.git

To Sum Up

Ultrasphinx is by far the most effective rails searching solution I've come across. Unlike most of the other options, the search daemon is incredibly stable, and the index never seems to become corrupted (I'm running it in a relatively high load production environment with absolutely zero trouble so far). Also, since Ultrasphinx doesn't rely on AR callbacks for indexing, your application isn't quite as coupled to your search daemon; if it dies, search functionality will break, but the rest of your app will still function. It's not without problems, and complex indexing can be trickier, but Ultrasphinx's stability and performance make the choice a no-brainer.