How to search for special characters like @ in Zend_Search_Lucene?


In Zend_Search_Lucene I am using the code below for indexing. I have changed the default analyzer so that numeric values are searchable.

public function executeIndexIT() {

   $path = '/home/project/mgh/lib/';
   set_include_path(get_include_path() . PATH_SEPARATOR . $path);       
   require_once '/home/project/mgh/lib/Zend/Search/Lucene.php';

   Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());

   $index = new Zend_Search_Lucene('/home/project/mgh/data/search_file/lucene.customer.index',true);

   $filenames1='/home/project/mgh/web/cvcollection/data8/ASBABranches10546.pdf';
   $filenames2='/home/project/mgh/web/cvcollection/data2/manoj_new10550.pdf';

   $fc1=htmlentities("'".$this->ConvertPDF($filenames1)."'");       
   $fc2=htmlentities("'".$this->ConvertPDF($filenames2)."'");

   $doc = new Zend_Search_Lucene_Document();
   $doc->addField(Zend_Search_Lucene_Field::unIndexed('URL', $filenames1));
   $doc->addField(Zend_Search_Lucene_Field::text('contents',$fc1));     
   $index->addDocument($doc);

   $doc = new Zend_Search_Lucene_Document();
   $doc->addField(Zend_Search_Lucene_Field::unIndexed('URL', $filenames2));
   $doc->addField(Zend_Search_Lucene_Field::text('contents',$fc2));     
   $index->addDocument($doc);

   $index->commit();
   exit;
}

After indexing, I search with the code below:

public function executeSearchLucene() {

    $path = '/home/project/mgh/lib/';
    set_include_path(get_include_path() . PATH_SEPARATOR . $path);
    require_once('Zend/Search/Lucene.php');

    Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());

    $txtSearch='@';
    try {
        $query = Zend_Search_Lucene_Search_QueryParser::parse($txtSearch);
    } catch (Zend_Search_Lucene_Search_QueryParserException $e) {
        echo "Query syntax error: " . $e->getMessage() . "\n";
        exit; // $query is undefined at this point, so do not go on to search
    }

    $index = new Zend_Search_Lucene('/home/project/mgh/data/search_file/lucene.customer.index');

    $results = $index->find($query);
    echo count($results);
    echo "<pre>";
    foreach ( $results as $result ) {
        var_dump($result->URL);
    }
    echo "</pre>";
    exit;
}

Here $fc2 contains a few email addresses and I need to search for them, but I am getting 0 hits.

How can I search for characters like @ or ! using Zend_Search_Lucene?

There is 1 solution below


This works only with keyword fields, because they are not tokenized. So you need to make sure the email address (or any other text containing special characters) is stored as a separate field, as in the example below. You also can't use the query parser, because it converts the input into a Zend_Search_Lucene_Search_Query_Preprocessing_Term object:

echo('<pre>');
var_dump(Zend_Search_Lucene_Search_QueryParser::parse("*@*"));
var_dump(Zend_Search_Lucene_Search_QueryParser::parse("@"));
echo('</pre>');
die();

According to the documentation, that object:

is not actually involved into query execution

So the working code is below:

$index = Zend_Search_Lucene::create('/tmp/index');

$doc1 = new Zend_Search_Lucene_Document;
$doc1->addField(Zend_Search_Lucene_Field::text('title', 'Some Title Here'))
    ->addField(Zend_Search_Lucene_Field::keyword('content', '[email protected]'));
$index->addDocument($doc1);

$doc2 = new Zend_Search_Lucene_Document;
$doc2->addField(Zend_Search_Lucene_Field::text('title', 'Another title Here'))
    ->addField(Zend_Search_Lucene_Field::keyword('content', 'test!test.com'));
$index->addDocument($doc2);

$index->commit();

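// Allow patterns that begin with a wildcard (no minimum non-wildcard prefix).
Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0);
// Build the query object directly instead of going through the query parser.
$term  = new Zend_Search_Lucene_Index_Term("*@*");
$query = new Zend_Search_Lucene_Search_Query_Wildcard($term);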

$hits = $index->find($query);
echo('<pre>');
var_dump(count($hits));
foreach($hits as $hit) {
    var_dump($hit->title);
    var_dump($hit->content);
}
echo('</pre>');

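// Same search for '!'. The minimum prefix length is a static setting, so it only needs to be 0 once per request.
Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0);
$term  = new Zend_Search_Lucene_Index_Term("*!*");
$query = new Zend_Search_Lucene_Search_Query_Wildcard($term);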

$hits = $index->find($query);
echo('<pre>');
var_dump(count($hits));
foreach($hits as $hit) {
    var_dump($hit->title);
    var_dump($hit->content);
}
echo('</pre>');

die();
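
To apply this to the PDF text from the question, the addresses first have to be pulled out of the extracted text so that each one can be stored in a keyword field. Here is a minimal sketch in plain PHP. extractEmails() is a hypothetical helper of mine, and the regular expression is a deliberately simple approximation, not a full address validator:

```php
<?php
// extractEmails() is a hypothetical helper, not part of Zend.
// The pattern is a rough approximation of an email address.
function extractEmails($text)
{
    preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $text, $matches);
    // Lowercase and de-duplicate so each address is indexed once.
    return array_values(array_unique(array_map('strtolower', $matches[0])));
}

$text   = "Reach Bob at bob@example.com or BOB@example.com, sales: sales@example.org.";
$emails = extractEmails($text);
print_r($emails); // two entries: bob@example.com and sales@example.org
```

Each extracted address can then be passed to Zend_Search_Lucene_Field::keyword() as in the code above. As far as I know a document keeps only one field per field name, so several addresses from one PDF would probably need distinct field names or one document per address. Keyword values are stored as-is, so if you lowercase them at index time, lowercase the search term as well.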

Hope it's clear now. The Zend Lucene implementation has a lot of limitations.
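
To see why the original search returned 0 hits, it helps to look at what a text field keeps after analysis. The sketch below is only a rough stand-in for the TextNum_CaseInsensitive analyzer, assuming it lowercases the input and keeps runs of letters and digits. simulateTextNumTokens() is a hypothetical helper and does not call Zend at all:

```php
<?php
// Rough stand-in for a letters-and-digits, case-insensitive analyzer.
// simulateTextNumTokens() is a hypothetical helper, not part of Zend.
function simulateTextNumTokens($text)
{
    // Lowercase, then keep only runs of ASCII letters and digits as tokens.
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);
    return $matches[0];
}

$tokens = simulateTextNumTokens('Contact: john.doe@example.com!');
print_r($tokens); // contact, john, doe, example, com

// '@', '.', ':' and '!' are never stored as tokens,
// so a term query for '@' has nothing to match.
var_dump(in_array('@', $tokens, true)); // bool(false)
```

A keyword field skips this step and stores the whole value as a single term, which is why the wildcard query on the keyword field can still see the '@'.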