I've been using all sorts of hacks to generate file indexes out of SMB shares. And it's all cool with basic filepath plus metadata indexing.
The next step I want to implement is an algorithm combining some unix-like utilities and php, to index specific context from within files.
Now the first step in this context generation is something like this
while read p; do egrep -rH '^;|\(|^\(|\)$' "$p"; done <textual.txt > text_context_search.txt
This is specific regexing for my purpose for indexing contents of programs, this extracts lines that are whole comments or contains comments out of CNC program files.
resulting output is something like
file_path:regex_hit
now obviously most programs has more than one comment, so theres too much redundancy not only in repetition, but an exhaustive context index is about a gigabyte in size
I am now working towards script that would compact redudancy in such pattern
file_path_1:regex_hit_1
file_path_1:regex_hit_2
file_path_1:regex_hit_3
...
would become:
file_path_1:regex_hit1,regex_hit_2,regex_hit3
and if I succeed to do this in efficient manner its all ok.
The problem here is whether I'm doing this in a proper way. Maybe I should be using different tools to generate such context index in the first place ?
EDIT
After further copying and pasting from stack overflow and thinking about it I glued up solution using not my code, that nearly entirely solves my previously mentioned issue.
<?php
// https://stackoverflow.com/questions/26238299/merging-csv-lines-where-column-value-is-the-same
$rows = array_map('str_getcsv', file('text_context_search2.1.txt'));
//echo '<pre>';
print_r($csv);
//echo '</pre>';
// Array for output
$concatenated = array();
// Key to organize over
$sortKey = '0';
// Key to concatenate
$concatenateKey = '1';
// Separator string
$separator = ' ';
foreach($rows as $row) {
// Guard against invalid rows
if (!isset($row[$sortKey]) || !isset($row[$concatenateKey])) {
continue;
}
// Current identifier
$identifier = $row[$sortKey];
if (!isset($concatenated[$identifier])) {
// If no matching row has been found yet, create a new item in the
// concatenated output array
$concatenated[$identifier] = $row;
} else {
// An array has already been set, append the concatenate value
$concatenated[$identifier][$concatenateKey] .= $separator . $row[$concatenateKey];
}
}
// Do something useful with the output
//var_dump($concatenated);
//echo json_encode($concatenated)."\n";
$fp = fopen('exemplar.csv', 'w');
foreach ($concatenated as $fields) {
fputcsv($fp, $fields);
}
fclose($fp);