Speed up Search with Lucy::Search::IndexSearcher and Lucy::Search::PolySearcher from multiple index folders


I'm creating indexes on multiple subfolders under one parent folder.

The indexes are created in separate folders because the log files are created in parallel, and I want to avoid segment locking between the indexers.

One of my applications creates the directory structure with lots of log files within different subfolders.

I'm indexing all those files in parallel as and when they are created.

The directory structure looks like this:

TopDir/00_log.log
      /01_log2.log
      /.lucyindexer/1/seg_1
                     /seg_2
      /03_log3.log
      /03_log3/log31.log
              /log32.log
              /.lucyindexer/1/seg_1
                             /seg_2
              /log32/log321.log
                    /log322.log
                    /.lucyindexer/1/seg_1
                                   /seg_2
                                 /2/seg_1

This works fine, and while my application is running all log files get indexed as well.

Search is a separate application that does the following:

  1. Walk the directory tree with File::Find and collect every folder that ends in .lucyindexer/1.
  2. Create a Lucy::Search::IndexSearcher for each of those folders in a loop and add all the searchers to a Lucy::Search::PolySearcher.
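For illustration, step 1 could look roughly like the following. It is a minimal sketch rather than the exact code in use: the helper name find_index_dirs is hypothetical, and it assumes every index folder ends in .lucyindexer/1 as in the tree above. Pruning at the index folder keeps File::Find from walking into the segment folders.

```perl
use strict;
use warnings;
use File::Find ();

# Hypothetical helper: return every directory under $top whose path
# ends in ".lucyindexer/1".
sub find_index_dirs {
    my ($top) = @_;
    my @index_dirs;

    File::Find::find(
        {
            no_chdir => 1,    # keep full paths in $File::Find::name
            wanted   => sub {
                return unless -d $File::Find::name;
                if ( $File::Find::name =~ m{/\.lucyindexer/1\z} ) {
                    push @index_dirs, $File::Find::name;
                    $File::Find::prune = 1;    # do not descend into seg_* folders
                }
            },
        },
        $top,
    );

    my @sorted = sort @index_dirs;
    return @sorted;
}
```

Usage would be something like `my @all_dirs = find_index_dirs('TopDir');`, which also removes the need to chomp each entry later.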

My code looks like this:

my @searchers;

my $schema;

for my $index ( @all_dirs ) {
    chomp $index;
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    push @searchers, $searcher;
    $schema = $searcher->get_schema;
}

# PolySearcher is the only way to get all search results combined.
my $poly_searcher = Lucy::Search::PolySearcher->new(
    schema    => $schema,
    searchers => \@searchers,
);

my $query_parser = Lucy::Search::QueryParser->new(
    schema => $poly_searcher->get_schema,
    fields => ['title'],
);

# Build up a Query.
my $q = "1 2 3 4 5 6 7 11 12 13 14 18";

my $query = $query_parser->parse( $q );

# Execute the Query and get a Hits object.
my $hits = $poly_searcher->hits(
    query      => $query,
    num_wanted => -1,       # -1 equivalent to all results

    # sort_spec => $sort_spec,
);

while ( my $hit = $hits->next ) {

    ## Do some operation
}

This runs and returns the expected results. However, the performance is really bad when the directory structure is deeply nested.

I did profiling using Devel::NYTProf and found two places where the maximum time was taken:

  1. Scanning the directory tree. (I plan to solve this by recording the list of index directories while the application is generating the indexes.)
  2. Creating the searchers with Lucy::Search::IndexSearcher. This takes the most time, since it runs in a loop over every indexed directory.
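The planned fix for item #1 could be as simple as a manifest file that the indexing side appends to and the search side reads back. The following is a minimal sketch under assumptions: the manifest file and the helper names record_index_dir and read_index_dirs are hypothetical and do not exist in the current application. The exclusive lock is there because several indexers may append at the same time.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Indexer side (hypothetical helper): append a newly created index folder.
sub record_index_dir {
    my ( $manifest, $index_dir ) = @_;
    open my $fh, '>>', $manifest or die "Cannot open $manifest: $!";
    flock $fh, LOCK_EX or die "Cannot lock $manifest: $!";
    print {$fh} "$index_dir\n";
    close $fh or die "Cannot close $manifest: $!";    # flushes, then unlocks
    return;
}

# Search side (hypothetical helper): read the list back, dropping blank
# lines, duplicates, and folders that no longer exist.
sub read_index_dirs {
    my ($manifest) = @_;
    open my $fh, '<', $manifest or die "Cannot open $manifest: $!";
    chomp( my @dirs = <$fh> );
    close $fh;

    my %seen;
    my @valid = grep { length($_) && !$seen{$_}++ && -d $_ } @dirs;
    return @valid;
}
```

With this in place the search application would replace the File::Find pass with a single file read.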

To address item #2, I tried creating the Lucy::Search::IndexSearcher objects for the different index folders in parallel with Parallel::ForkManager, but I got the following error:

The storable module was unable to store the child's data structure to the temp file "/tmp/Parallel-ForkManager-27339-27366.txt": Storable serialization not implemented for Lucy::Search::IndexSearcher at /usr/software/lib/perl5/site_perl/5.14.0/x86_64-linux-thread-multi/Clownfish.pm line 93

This is the code I used:

my $pm = Parallel::ForkManager->new( $max_procs );

$pm->run_on_finish(
    sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $index ) = @_;
        print Dumper $index;
        push( @searchers, $index );
    }
);

for my $index ( @all_dirs ) {
    chomp $index;
    my $forkpid = $pm->start( $index ) and next;    # fork
    my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
    $pm->finish( 0, \$searcher );
}

$pm->wait_all_children;
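As the error message says, Parallel::ForkManager hands the child's data back to the parent through Storable, and an IndexSearcher object cannot be serialized that way. One possible workaround, sketched here and untested against this data set, is to run the query inside each child and return only plain Perl data. The helper name search_in_parallel and the per-index cap of 1000 hits are assumptions made for illustration. Note that each child scores hits against its own index only, so the ordering may not match what PolySearcher would produce.

```perl
use strict;
use warnings;
use Parallel::ForkManager;
use Lucy::Search::IndexSearcher;
use Lucy::Search::QueryParser;

# Hypothetical helper: search each index in its own child process and
# return the hits as plain hashes, sorted by descending score.
sub search_in_parallel {
    my ( $query_string, $max_procs, @index_dirs ) = @_;
    my @results;

    my $pm = Parallel::ForkManager->new($max_procs);
    $pm->run_on_finish(
        sub {
            my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $data ) = @_;
            push @results, @$data if ref $data eq 'ARRAY';
        }
    );

    for my $index (@index_dirs) {
        $pm->start($index) and next;    # parent moves on to the next index

        # Child: open the searcher, run the query, copy out plain data.
        my $searcher = Lucy::Search::IndexSearcher->new( index => $index );
        my $parser   = Lucy::Search::QueryParser->new(
            schema => $searcher->get_schema,
            fields => ['title'],
        );
        my $hits = $searcher->hits(
            query      => $parser->parse($query_string),
            num_wanted => 1000,    # assumed cap per index
        );

        my @found;
        while ( my $hit = $hits->next ) {
            push @found,
              {
                index  => $index,
                score  => $hit->get_score,
                fields => { %{ $hit->get_fields } },    # shallow copy
              };
        }
        $pm->finish( 0, \@found );    # an array of hashes serializes fine
    }
    $pm->wait_all_children;

    my @sorted = sort { $b->{score} <=> $a->{score} } @results;
    return @sorted;
}
```

Whether this is faster depends on how the cost of forking and serializing the hits compares with the cost of opening each searcher, so it would need profiling in the same way as the original loop.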

The whole process takes 60 to 120 seconds for a large log directory. At the end, I build a nested JSON object from all the search results and display it using jQuery.

I'm looking for ideas to improve the performance. Is there a way to create the searchers in parallel, using Parallel::ForkManager or any other method, or some other way to speed up the search?

Also, is there any way I can merge all the indexes in one place?
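For context, Lucy::Index::Indexer has an add_index method that absorbs an existing index into the one being written, provided the schemas match. A minimal sketch of a separate merge step follows, assuming all the per-folder indexes share one schema; the helper name merge_indexes and the target path are hypothetical, and this has not been tried on the directory layout above.

```perl
use strict;
use warnings;
use Lucy::Index::Indexer;
use Lucy::Search::IndexSearcher;

# Hypothetical helper: absorb every per-folder index into one merged index.
sub merge_indexes {
    my ( $merged_path, @index_dirs ) = @_;
    return unless @index_dirs;

    # Borrow the schema from the first index; all indexes must share it.
    my $schema =
      Lucy::Search::IndexSearcher->new( index => $index_dirs[0] )->get_schema;

    my $indexer = Lucy::Index::Indexer->new(
        index    => $merged_path,
        schema   => $schema,
        create   => 1,
        truncate => 1,    # rebuild from scratch so reruns do not duplicate docs
    );
    $indexer->add_index($_) for @index_dirs;
    $indexer->commit;

    return $merged_path;
}
```

Search would then need only a single IndexSearcher on the merged folder. The merge itself takes the write lock on that folder, so it would have to run as its own step rather than inside the parallel indexers.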
