Context index generation for Meilisearch

I've been using all sorts of hacks to generate file indexes out of SMB shares, and that works fine for basic file path plus metadata indexing.

The next step I want to implement is an algorithm that combines some Unix-like utilities and PHP to index specific context from within files.

The first step in this context generation looks something like this:

while IFS= read -r p; do grep -ErH '^;|\(|^\(|\)$' "$p"; done < textual.txt > text_context_search.txt

The regex is specific to my purpose, which is indexing the contents of CNC program files: it extracts lines that are entirely comments or that contain a comment.

The resulting output looks something like this:

file_path:regex_hit
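
To make the output format concrete, here is a minimal sketch that runs the same regex over a small, made-up CNC program. The file name `O1001.nc`, its contents, and the helper name `extract_comments` are assumptions for illustration only; `grep -E` stands in for `egrep`.

```shell
# Hypothetical sample CNC program; name and contents are invented for this demo.
tmpdir=$(mktemp -d)
cat > "$tmpdir/O1001.nc" <<'EOF'
; FACE MILL TOP
G90 G54
T1 M6 (50MM FACE MILL)
G0 X0 Y0
(ROUGHING PASS)
M30
EOF

# List of files to scan, one path per line (plays the role of textual.txt).
printf '%s\n' "$tmpdir/O1001.nc" > "$tmpdir/textual.txt"

# Hypothetical helper: print "path:matching line" for every comment line
# in every file named in the list file given as $1.
# -E enables extended regex, -H always prefixes each hit with the file name.
extract_comments() {
    while IFS= read -r p; do
        grep -EH '^;|\(|^\(|\)$' "$p"
    done < "$1"
}

extract_comments "$tmpdir/textual.txt"
```

With this sample, only the three lines that start with `;` or contain a parenthesized comment are printed, each prefixed with the file path and a colon.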

Obviously, most programs have more than one comment, so the file path is repeated on every hit. There is a lot of redundancy, and an exhaustive context index comes to about a gigabyte in size.

I am now working on a script that would compact the redundancy in this pattern:

file_path_1:regex_hit_1
file_path_1:regex_hit_2
file_path_1:regex_hit_3
...

would become:

file_path_1:regex_hit_1,regex_hit_2,regex_hit_3

If I manage to do this efficiently, that part is solved.
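
One possible way to do this compaction with a Unix-like utility is a single streaming pass in awk. This is only a sketch under stated assumptions: the function name `merge_hits` is made up, lines for the same file are assumed to be adjacent (which is how grep emits them), and paths are assumed not to contain a colon, since the line is split at the first one. Hits that themselves contain commas would become ambiguous with this separator.

```shell
# Hypothetical helper: merge adjacent "path:hit" lines that share a path
# into one "path:hit1,hit2,..." line. Reads stdin or the files given as arguments.
merge_hits() {
    awk '
    {
        i = index($0, ":")               # split at the FIRST colon only
        if (i == 0) next                 # skip lines without a separator
        path = substr($0, 1, i - 1)
        hit  = substr($0, i + 1)
        if (started && path == prev) {
            printf ",%s", hit            # same file: append to current line
        } else {
            if (started) printf "\n"     # new file: finish the previous line
            printf "%s:%s", path, hit
            prev = path
            started = 1
        }
    }
    END { if (started) printf "\n" }' "$@"
}

# Usage with a few made-up lines:
printf '%s\n' \
    'a.nc:(TOOL 1)' \
    'a.nc:; SETUP' \
    'b.nc:(TOOL 2)' | merge_hits
```

Because it only remembers the previous path, this keeps memory use constant regardless of input size, which matters for an index of around a gigabyte. If the input is not grouped by path, it would need to be sorted first or merged with an in-memory array instead.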

My question is whether I'm going about this the right way. Should I be using different tools to generate this kind of context index in the first place?

EDIT

After more copying and pasting from Stack Overflow and some thought, I glued together a solution (not my own code) that almost entirely solves the issue described above.

<?php
// Adapted from:
// https://stackoverflow.com/questions/26238299/merging-csv-lines-where-column-value-is-the-same

// Parse each line as CSV. str_getcsv() splits on commas by default, so the
// input is expected to have the form: file_path,regex_hit
$lines = file('text_context_search2.1.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$rows = array_map('str_getcsv', $lines);
// print_r($rows); // uncomment to inspect the parsed rows

// Array for output, keyed by file path
$concatenated = array();

// Column to group by (the file path)
$sortKey = 0;

// Column to concatenate (the regex hit)
$concatenateKey = 1;

// Separator string placed between merged hits
$separator = ' ';

foreach ($rows as $row) {

    // Guard against invalid rows
    if (!isset($row[$sortKey]) || !isset($row[$concatenateKey])) {
        continue;
    }

    // Current identifier
    $identifier = $row[$sortKey];

    if (!isset($concatenated[$identifier])) {
        // If no matching row has been found yet, create a new item in the
        // concatenated output array
        $concatenated[$identifier] = $row;
    } else {
        // An entry already exists, so append the value to concatenate
        $concatenated[$identifier][$concatenateKey] .= $separator . $row[$concatenateKey];
    }
}

// Do something useful with the output
// var_dump($concatenated);
// echo json_encode($concatenated) . "\n";

// Write one merged row per file path
$fp = fopen('exemplar.csv', 'w');

foreach ($concatenated as $fields) {
    fputcsv($fp, $fields);
}

fclose($fp);