Zend_Search_Lucene Datasource for CakePHP

  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 40.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 41.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 42.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 43.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 44.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 45.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 46.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 47.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 48.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 49.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 50.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 51.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 56.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 57.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 58.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 59.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 60.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 61.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 62.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 63.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 64.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 65.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 66.
  • warning: date() [function.date]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Denver' for 'MDT/-6.0/DST' instead in /home/lyniqne/public_html/planetcakephp.org/drupal/sites/all/modules/token/token_node.inc on line 67.

Major update January 22/10: much of the content of this article has been updated to reflect the changes to the datasource, the latest version of which you can download on Github.
Just out of the oven – a Zend_Search_Lucene datasource for CakePHP (built with 1.2 but probably works just fine in 1.3) that I originally wrote for an in-house CMS site search plugin. I can’t release the plugin itself (and there’s so much CMS-specific code that it would need a lot of work to make it generic anyway), but I thought that someone might find the datasource itself useful. It’s pretty basic at this point and doesn’t implement some of the fancier Zend_Search_Lucene features such as sorting (it just returns sorted in score order, which is probably what you want anyway).
Zend_Search_Lucene is a text-based search index system for developers who don’t want to (or can’t) use a database for search indexing.
Download the current version of the ZendSearchLuceneDatsource from my Github repository.
I won’t go into detail about how to add data into the Lucene database since the Zend Framework documention is so good (CakePHP should be jealous!). You’ll find all the info you need there. There are also a couple of older articles out there that show how you can integrate Zend_Search_Lucene into CakePHP:

Setup
First, copy zend_search_lucene.php to models/datasources.
Then, you’ll need to download the Zend_Search_Lucene library from the Zend Framework website and put some files into your /vendors directory:

  • Zend/Search (the directory and all of its contents)
  • Zend/Exception.php

You’ll also need to update your include path to include app/vendors, since the Zend Framework loads a lot of classes on its own. I also made a little autoload function to make the loading of Zend Framework classes easier. Put the following code somewhere common, such as app/bootstrap.php:

ini_set('include_path', ini_get('include_path') . ':' . CAKE_CORE_INCLUDE_PATH . DS . '/vendors');
function __autoload($path) {
if (substr($path, 0, 5) == 'Zend_') {
include str_replace('_', '/', $path) . '.php';
}
return $path;
}

You also need to put the DB config for the datasource in config/database.php (updated Jan 20/2010 for better DebugKit compatibility):

var $zendSearchLucene = array(
'datasource' => 'ZendSearchLucene',
'indexFile' => 'lucene', // stored in the cache dir.
'driver' => '',
'source' => 'search_indices'
);

Then, in the model that’ll act as your search index (say, for example, SearchIndex), specify the DB config:

<?php class SearchIndex extends AppModel {
var $useDbConfig = 'zendSearchLucene';
}
?>

Saving/Indexing
I’ve tried to keep the datasource functions as simple and familiar as possible. When saving an item to the index, the datasource expects a multidimensional array for each item. For compatibility with CakePHP’s datasource code, the ‘meat’ of the data is nested in the third level of the array. Each sub-array contains information about a field to be stored. For example:

$saveData = array('SearchIndex' => array(
'document' => array(
array(
'key' => 'name',
'value' => $record[$Model->alias][$this->settings[$Model->alias]['name']],
'type' => 'Text'
),
array(
'key' => 'description',
'value' => $record[$Model->alias][$this->settings[$Model->alias]['description']],
'type' => 'Text'
),
array(
'key' => 'url',
'value' => $this->__constructUrl($Model, $record),
'type' => 'Text'
)
)
));

Passing that data in a Model::save() call will in turn execute the following Zend code (more or less – this is a very simplified version of the actual ZendSearchLuceneSource saving code):

$index = Zend_Search_Lucene::open('/path/to/the/index/set/in/dbConfig');
$doc = new Zend_Search_Lucene_Document();
foreach ($data as $field) {
$doc->addField(Zend_Search_Lucene_Field::$field['type']($field['key'], $field['value']));
}
$index->addDocument($doc);

Obviously that’s a basic example; you’ll probably send a whole bunch of dynamic info to the indexer. But that’s the gist of it anyway.
Querying
You can search for records just like you would a regular datasource. Pass the search terms as a “query” condition. If you want the search terms to be highlighted in the returned results, pass ‘highlight’ => true in the array of options. Note that only indexed fields will be highlighted.
You can find all results:

function search($term) {
$results = $this->SearchIndex->find('all', array('highlight' => true, 'conditions' => array('query' => 'best cakephp tutorials')));
}

You can mimic Google’s I’m Feeling Lucky with find(‘first’):

function search($term) {
$topResult = $this->SearchIndex->find('first', array('conditions' => array('query' => 'best cakephp tutorials')));
}

You can even paginate:

function search($term) {
$this->paginate = array(
'limit' => 10,
'conditions' => array('query' => 'best CakePHP tutorials'),
'highlight' => true
);

$results = $this->paginate();
}

Results are returned in the expected CakePHP way, as a multidimensional array – $results[0]['MyModelAlias'] for multiple records, $results['MyModelAlias'] for one (i.e. with find(‘first’)).
There you go – enjoy! As always, comments and suggestions are welcomed.
I used the RSS Feed datasource by Loadsys as a guide to good datasource design. I may have borrowed a function or two.
Neil Crookes’ Searchable plugin also helped.