Jesse Erlbaum scribbled on 3/19/07 11:07 AM:
Hi Peter --
As one of the Swish-e developers, I can second Michael's endorsement. ;)
I've used Swish-e many times over the years. Do you think it is lacking in any way compared to Xapian or Plucene? What about compared to commercial systems such as Verity or Autonomy?
I can't compare Swish-e to any commercial systems since I haven't used them.
As Tim just posted, I wouldn't even consider Plucene at this point in its life cycle, unless you're talking about a small, fairly static site. Performance is Not Good.
Compared to Xapian: Swish-e doesn't do UTF-8 or good incremental indexing. Swish-e is very fast at both indexing and search; I did see a benchmark once that showed Swish-e was faster than Xapian but that was some time ago.
Xapian svn has a UTF-8 version; it's due to be official with the 1.0 release, whenever that is.
Xapian does have good incremental indexing. It's also a library (unlike Swish-e) so it has some more flexibility.
Swish-e has a built-in HTML/XML parser, which is pretty good (especially if you use libxml2). Xapian has Omega, an additional package that does some HTML parsing iirc.
I find Swish-e "just works" a little more "out of the box" than Xapian, but the two big missing features (UTF-8 and incremental indexing) are show-stoppers if your project requires them.
See also my article here, comparing Lucene, Xapian and Swish-e: http://dewey.library.nd.edu/mylibrary/manual/ch/ch17.html
I've felt that Swish-e was a bit "long in the tooth" owing to its legacy. Do you disagree? If you were not deeply involved in the Swish-e project, would you choose it over Xapian or Plucene (or any other system)?
Swish-e is old, true. The 2.x versions added a lot of features, but those are getting on 6 years old now too. Still, things that Work don't need to be New, do they? ;)
Would I choose it over Plucene? Not a question. Xapian? Well, it would depend on a couple things:
(1) data set. Am I indexing data that is fairly static or mostly dynamic? Example: static HTML or PDF docs, vs providing fulltext search for a database. Swish-e is fast enough and has merging and multi-index search features that let you get around the lack of incremental indexing, so if your data doesn't change much, I'd go with Swish-e. If it does change a lot (e.g., you need to update your index every time you update your db), then I'd probably go with Xapian.
(2) i18n. Swish-e was first written back in the mid-90s so Unicode wasn't even a consideration. There are lots of optimizations in the C code that assume 1 byte = 1 character and so things like UTF-8 Just Don't Work.
You can get around that (as I do) with things like Search::Tools::Transliterate (shameless plug) but if you need Real I18N Support, I'd be going with Xapian.
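To make point (2) concrete, here is a minimal Python sketch. It is not Swish-e code and it does not reproduce the Search::Tools::Transliterate API. It only illustrates why a "1 byte = 1 character" assumption breaks on UTF-8, and what folding text to ASCII before indexing looks like in principle. The helper name fold_to_ascii is hypothetical, and it uses Python's standard unicodedata module as a stand-in for a real transliteration table.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Decompose accented characters and drop the combining marks.

    Anything that still has no ASCII form after decomposition is
    silently dropped by the final encode step.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")


word = "caf\u00e9"                 # "cafe" with an acute accent on the e
utf8_bytes = word.encode("utf-8")

# One character can occupy several bytes in UTF-8,
# so character count and byte count differ.
print(len(word), len(utf8_bytes))

# Byte-oriented truncation can cut a multi-byte character in half,
# leaving an invalid sequence behind.
broken = utf8_bytes[:4]
print(broken.decode("utf-8", errors="replace"))

# Folding before indexing hands a byte-oriented engine plain ASCII.
print(fold_to_ascii("caf\u00e9 Z\u00fcrich na\u00efve"))
```

The trade-off is that folding is lossy: searches match on the ASCII form only, which is fine for accented Latin text but is no substitute for real Unicode support.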
That said, I would suggest Xapian over Plucene, hands down.
Why do you say that? Any particular gripes?
See Tim's recent post.
And I would also check out KinoSearch, which is (along with Xapian) going to be one of the optional backend IR libraries for the next version of Swish-e.
I've heard of KinoSearch before, but I've not tried it out. It seemed that Plucene and Xapian had more established Perl interfaces, but I could be wrong.
Tim summarized it well. KinoSearch is new-ish, but Marvin is really cranking out some quality stuff. And it's all C and Perl, so if those are your primary languages, the barrier to hacking on it yourself is all the lower.
If I had to do a fully UTF-8 compatible, robust, incremental, highly scalable search application tomorrow, I'd be looking seriously at KS and Xapian. Which is why Swish-e will offer those 2 (among others) as backends for version 3. :)
pek
--
Peter Karman . http://peknet.com/ . suppressed