Zend_Search_Lucene Datasource for CakePHP

Major update January 22/10: much of the content of this article has been updated to reflect the changes to the datasource, the latest version of which you can download on Github.
Just out of the oven – a Zend_Search_Lucene datasource for CakePHP (built with 1.2 but probably works just fine in 1.3) that I originally wrote for an in-house CMS site search plugin. I can’t release the plugin itself (and there’s so much CMS-specific code that it would need a lot of work to make it generic anyway), but I thought that someone might find the datasource itself useful. It’s pretty basic at this point and doesn’t implement some of the fancier Zend_Search_Lucene features such as custom sorting (it just returns results sorted by score, which is probably what you want anyway).
Zend_Search_Lucene is a text-based search index system for developers who don’t want to (or can’t) use a database for search indexing.
Download the current version of the ZendSearchLuceneDatasource from my Github repository.
I won’t go into detail about how to add data to the Lucene index, since the Zend Framework documentation is so good (CakePHP should be jealous!). You’ll find all the info you need there. There are also a couple of older articles out there that show how you can integrate Zend_Search_Lucene into CakePHP.

Setup
First, copy zend_search_lucene.php to models/datasources.
Then, you’ll need to download the Zend_Search_Lucene library from the Zend Framework website and put some files into your /vendors directory:

  • Zend/Search (the directory and all of its contents)
  • Zend/Exception.php

You’ll also need to update your include path to include that vendors directory, since the Zend Framework loads a lot of classes on its own. I also made a little autoload function to make the loading of Zend Framework classes easier. Put the following code somewhere common, such as app/config/bootstrap.php:

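// Add the vendors directory to the include path so Zend's own includes resolve.
ini_set('include_path', ini_get('include_path') . PATH_SEPARATOR . CAKE_CORE_INCLUDE_PATH . DS . 'vendors');
// Load Zend Framework classes on demand, e.g. Zend_Search_Lucene -> Zend/Search/Lucene.php.
function __autoload($path) {
    if (substr($path, 0, 5) == 'Zend_') {
        include str_replace('_', '/', $path) . '.php';
    }
}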
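
Note that __autoload() was deprecated in PHP 7.2 and removed in PHP 8.0. On newer PHP versions the same idea can be expressed with spl_autoload_register(), which also coexists with other autoloaders. The following is only a sketch of that variation, not part of the datasource; the function names zendClassToPath and zendAutoload are made up for illustration, and it still assumes the vendors directory is on the include path.

```php
<?php
// Hypothetical helper: maps a Zend-style class name to a relative file path.
// Returns null for classes outside the Zend_ prefix so other loaders can handle them.
function zendClassToPath($class) {
    if (strpos($class, 'Zend_') !== 0) {
        return null;
    }
    return str_replace('_', '/', $class) . '.php';
}

// Hypothetical loader: includes the mapped file, relying on the include path.
function zendAutoload($class) {
    $path = zendClassToPath($class);
    if ($path !== null) {
        include $path;
    }
}

spl_autoload_register('zendAutoload');
```

Splitting the name mapping from the include keeps the mapping easy to check on its own.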

You also need to put the DB config for the datasource in config/database.php (updated Jan 20/2010 for better DebugKit compatibility):

var $zendSearchLucene = array(
    'datasource' => 'ZendSearchLucene',
    'indexFile' => 'lucene', // stored in the cache dir.
    'driver' => '',
    'source' => 'search_indices'
);

Then, in the model that’ll act as your search index (say, for example, SearchIndex), specify the DB config:

<?php
class SearchIndex extends AppModel {
    var $useDbConfig = 'zendSearchLucene';
}
?>

Saving/Indexing
I’ve tried to keep the datasource functions as simple and familiar as possible. When saving an item to the index, the datasource expects a multidimensional array for each item. For compatibility with CakePHP’s datasource code, the ‘meat’ of the data is nested in the third level of the array. Each sub-array contains information about a field to be stored. For example:

$saveData = array('SearchIndex' => array(
    'document' => array(
        array(
            'key' => 'name',
            'value' => $record[$Model->alias][$this->settings[$Model->alias]['name']],
            'type' => 'Text'
        ),
        array(
            'key' => 'description',
            'value' => $record[$Model->alias][$this->settings[$Model->alias]['description']],
            'type' => 'Text'
        ),
        array(
            'key' => 'url',
            'value' => $this->__constructUrl($Model, $record),
            'type' => 'Text'
        )
    )
));
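
If your source data is a flat list of fields, the nested structure above can be built with a small helper. This is a minimal sketch, not code from the datasource: buildSearchDocument is a hypothetical function, the sample values are invented, and it assumes every field uses the same type.

```php
<?php
// Hypothetical helper: turns a flat key => value map into the nested
// array shape described above. $alias is the search model's alias.
function buildSearchDocument($alias, $fields, $type = 'Text') {
    $document = array();
    foreach ($fields as $key => $value) {
        $document[] = array('key' => $key, 'value' => $value, 'type' => $type);
    }
    return array($alias => array('document' => $document));
}

// Invented sample values, for illustration only.
$saveData = buildSearchDocument('SearchIndex', array(
    'name' => 'Getting started',
    'description' => 'An introduction to the site.',
    'url' => '/pages/getting-started'
));
```

The resulting $saveData has the same shape as the example above and would be passed to the search model's save() in the same way.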

Passing that data in a Model::save() call will in turn execute the following Zend code (more or less – this is a very simplified version of the actual ZendSearchLuceneSource saving code):

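$index = Zend_Search_Lucene::open('/path/to/the/index/set/in/dbConfig');
$doc = new Zend_Search_Lucene_Document();
foreach ($data as $field) {
    // 'type' names a static factory method on Zend_Search_Lucene_Field (e.g. Text, Keyword).
    $luceneField = call_user_func(array('Zend_Search_Lucene_Field', $field['type']), $field['key'], $field['value']);
    $doc->addField($luceneField);
}
$index->addDocument($doc);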
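
Because each field's type is used as the name of a factory method on Zend_Search_Lucene_Field, a misspelled type only fails once the indexer runs. Zend Framework 1 documents five such factories: Keyword, UnIndexed, Binary, Text and UnStored. A guard along the following lines could catch typos before saving; it is only a sketch, and the function names are hypothetical rather than part of the datasource.

```php
<?php
// Factory method names documented for Zend_Search_Lucene_Field in Zend Framework 1.
function luceneFieldTypes() {
    return array('Keyword', 'UnIndexed', 'Binary', 'Text', 'UnStored');
}

// Hypothetical guard: returns the keys of all fields whose 'type' is missing
// or is not one of the known factory names (the comparison is case-sensitive).
function invalidFieldKeys($fields) {
    $invalid = array();
    foreach ($fields as $field) {
        $type = isset($field['type']) ? $field['type'] : null;
        if (!in_array($type, luceneFieldTypes(), true)) {
            $invalid[] = $field['key'];
        }
    }
    return $invalid;
}
```

An empty return value means every field names a known type.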

Obviously that’s a basic example; you’ll probably send a whole bunch of dynamic info to the indexer. But that’s the gist of it anyway.

Querying
You can search for records just as you would with a regular datasource. Pass the search terms as a 'query' condition. If you want the search terms to be highlighted in the returned results, pass 'highlight' => true in the array of options. Note that only indexed fields will be highlighted.
You can find all results:

function search($term) {
    $results = $this->SearchIndex->find('all', array('highlight' => true, 'conditions' => array('query' => 'best cakephp tutorials')));
}

You can mimic Google’s I’m Feeling Lucky with find('first'):

function search($term) {
    $topResult = $this->SearchIndex->find('first', array('conditions' => array('query' => 'best cakephp tutorials')));
}

You can even paginate:

function search($term) {
    $this->paginate = array(
        'limit' => 10,
        'conditions' => array('query' => 'best CakePHP tutorials'),
        'highlight' => true
    );

    $results = $this->paginate();
}

Results are returned in the expected CakePHP way, as a multidimensional array – $results[0]['MyModelAlias'] for multiple records, $results['MyModelAlias'] for one (i.e. with find('first')).
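
As a rough illustration of working with that shape, the sketch below pulls one field out of every row of a multi-record result. The helper name extractField and the sample data are assumptions; the fields you actually get back depend on what you stored in the index.

```php
<?php
// Hypothetical helper: collects one field from each row of a find('all') result.
// Rows that lack the field are skipped.
function extractField($results, $alias, $field) {
    $values = array();
    foreach ($results as $row) {
        if (isset($row[$alias][$field])) {
            $values[] = $row[$alias][$field];
        }
    }
    return $values;
}

// Invented sample data shaped like a multi-record result; field names are assumptions.
$results = array(
    array('SearchIndex' => array('name' => 'First', 'url' => '/first')),
    array('SearchIndex' => array('name' => 'Second', 'url' => '/second'))
);
$urls = extractField($results, 'SearchIndex', 'url');
```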
There you go – enjoy! As always, comments and suggestions are welcomed.
I used the RSS Feed datasource by Loadsys as a guide to good datasource design. I may have borrowed a function or two.
Neil Crookes’ Searchable plugin also helped.