Using a Java-based Aho-Corasick string matching library from a PHP application


I have a piece of PHP code that searches for the $list keywords in the $post data and prints the matches where the similarity is roughly 80-90%. Here is the code:

$list = array(
    "Data" => "9",
    "Data Structure" => "10",
    "Database" => "11",
    "Creativity" => "12",
    "Forest" => "13",
    "Al Pacino" => "14",
    "Humans" => "15",
    "Technology" => "16"
    );

$post = array ('Database', 'Law', 'Tech', 'Creative');

$all_key_values = $all_keys = array();

foreach ($post as $keyword) {
    foreach ($list as $word => $num) {
        // Number of characters the two strings have in common.
        $sim_chars = similar_text($keyword, $word);
        if ($sim_chars / strlen($keyword) > .8 || $sim_chars / strlen($word) > .8) {
            $all_key_values[] = $num;
            $all_keys[] = $word;
        }
        // Otherwise fall back to a case-insensitive substring check in either direction.
        elseif (stripos($keyword, $word) !== false || stripos($word, $keyword) !== false) {
            $all_key_values[] = $num;
            $all_keys[] = $word;
        }
    }
}

echo implode(',', $all_key_values), PHP_EOL;
echo implode(',', $all_keys), PHP_EOL;

Now the problem is that I want to search for the $list keywords in $fulltext using an Aho-Corasick library that is written in Java. You can find the code here.

require_once("http://localhost:8080/JavaBridge/java/Java.inc");

$list = array(
    "Data" => "9",
    "Data Structure" => "10",
    "Database" => "11",
    "Creativity" => "12",
    "Forest" => "13",
    "Al Pacino" => "14",
    "Humans" => "15",
    "Technology" => "16"
    );

$fulltext = "A forest, also referred to as a wood or the woods, is an area with a high density of trees. As with cities, depending on various cultural definitions, what is considered a forest may vary significantly in size and have different classifications according to how and of what the forest is composed.[1] A forest is usually an area filled with trees but any tall densely packed area of vegetation may be considered a forest, even underwater vegetation such as kelp forests, or non-vegetation such as fungi,[2] and bacteria. Tree forests cover approximately 9.4 percent of the Earth's surface (or 30 percent of total land area), though they once covered much more (about 50 percent of total land area). They function as habitats for organisms, hydrologic flow modulators, and soil conservers, constituting one of the most important aspects of the biosphere. A typical tree forest is composed of the overstory (canopy or upper tree layer) and the understory. The understory is further subdivided into the shrub layer, herb layer, and also the moss layer and soil microbes. In some complex forests, there is also a well-defined lower tree layer. Forests are central to all human life because they provide a diverse range of resources: they store carbon, aid in regulating the planetary climate, purify water and mitigate natural hazards such as floods. Forests also contain roughly 90 percent of the worlds terrestrial biodiversity.";

So my question is: how do I call the Aho-Corasick library to search for the $list keywords in $fulltext and find only the keywords that match exactly (100% similarity)? Thanks a lot for your help and time.


There are 2 best solutions below


You cannot include a Java library directly in your PHP code. You could, however, write a server application in Java that accepts data from your PHP code. There are any number of ways to do this, from socket communication and web services to a simple command-line tool. As an alternative, you could of course reimplement the Java library in PHP, which would likely teach you a lot about both PHP and Java, as well as about the algorithm.
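To make the command-line option concrete, here is a rough sketch of the PHP side. Everything about the Java tool is an assumption on my part: the jar path is whatever you pass in, and the protocol (keywords as arguments, text on stdin, one matched keyword per output line) is hypothetical and would have to be written around the Aho-Corasick library yourself. The function names buildJavaCommand and searchViaJava are made up for this example.

```php
<?php
// Sketch only: the jar and its protocol (keywords as arguments, text on
// stdin, one matched keyword per output line) are hypothetical.

/** Build a shell-safe command line for the hypothetical Java tool. */
function buildJavaCommand(string $jarPath, array $keywords): string
{
    $parts = ['java', '-jar', escapeshellarg($jarPath)];
    foreach ($keywords as $keyword) {
        $parts[] = escapeshellarg($keyword);
    }
    return implode(' ', $parts);
}

/** Run the tool, feed it the text, and return the keywords it printed. */
function searchViaJava(string $jarPath, array $keywords, string $text): array
{
    $descriptors = [
        0 => ['pipe', 'r'],  // child's stdin: the text is written here
        1 => ['pipe', 'w'],  // child's stdout: the matches are read here
    ];
    $process = proc_open(buildJavaCommand($jarPath, $keywords), $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start the Java process.');
    }

    fwrite($pipes[0], $text);
    fclose($pipes[0]);                        // signal end of input
    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);
    proc_close($process);

    // Keep one keyword per non-empty output line.
    $lines = array_map('trim', explode("\n", $output));
    return array_values(array_filter($lines, 'strlen'));
}
```

With the arrays from the question, the call would look something like searchViaJava('/path/to/your-tool.jar', array_keys($list), $fulltext). This assumes the Java tool reads all of its input before it starts printing; for a tool that streams output while reading, the pipes would need non-blocking handling.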

0
On

The old php-Java bridge is defunct. There is still the PHP/Java Bridge, but it may require quite a bit of extra coding to get going.

But there are Aho-Corasick implementations in PHP that would solve your problem with minimal effort. And if you want to try something really cool, have a look at Caucho Quercus, a reimplementation of PHP in Java that runs inside a Java application server. It makes calling Java code from PHP a breeze.
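To give an idea of what a pure-PHP version involves, here is a minimal Aho-Corasick sketch. It is not taken from any particular library; the function names acBuild and acSearch are made up for this example, and the case-insensitive matching (via strtolower) is my assumption about what "100% similarity" should mean here. It builds a trie of the keywords, adds failure links with a breadth-first pass, and then scans the text once.

```php
<?php
// Minimal Aho-Corasick sketch; acBuild/acSearch are illustrative names only.

/** Build the keyword trie and its failure links. */
function acBuild(array $keywords): array
{
    // Node layout: 'next' => [char => nodeId], 'fail' => nodeId, 'out' => [keywords]
    $nodes = [['next' => [], 'fail' => 0, 'out' => []]];

    foreach ($keywords as $keyword) {
        $state = 0;
        $lower = strtolower($keyword);          // assumption: ignore case
        for ($i = 0, $len = strlen($lower); $i < $len; $i++) {
            $ch = $lower[$i];
            if (!isset($nodes[$state]['next'][$ch])) {
                $nodes[] = ['next' => [], 'fail' => 0, 'out' => []];
                $nodes[$state]['next'][$ch] = count($nodes) - 1;
            }
            $state = $nodes[$state]['next'][$ch];
        }
        $nodes[$state]['out'][] = $keyword;     // keyword ends at this node
    }

    // Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    $queue = array_values($nodes[0]['next']);
    for ($head = 0; $head < count($queue); $head++) {
        $current = $queue[$head];
        foreach ($nodes[$current]['next'] as $ch => $child) {
            // Follow the parent's failure chain until a node accepts $ch.
            $fail = $nodes[$current]['fail'];
            while ($fail !== 0 && !isset($nodes[$fail]['next'][$ch])) {
                $fail = $nodes[$fail]['fail'];
            }
            $target = $nodes[$fail]['next'][$ch] ?? 0;
            $nodes[$child]['fail'] = $target;
            // Inherit keywords that end at the failure target.
            $nodes[$child]['out'] = array_merge($nodes[$child]['out'], $nodes[$target]['out']);
            $queue[] = $child;
        }
    }
    return $nodes;
}

/** Scan the text once and return [keyword => number of occurrences]. */
function acSearch(array $nodes, string $text): array
{
    $found = [];
    $state = 0;
    $lower = strtolower($text);
    for ($i = 0, $len = strlen($lower); $i < $len; $i++) {
        $ch = $lower[$i];
        while ($state !== 0 && !isset($nodes[$state]['next'][$ch])) {
            $state = $nodes[$state]['fail'];
        }
        $state = $nodes[$state]['next'][$ch] ?? 0;
        foreach ($nodes[$state]['out'] as $keyword) {
            $found[$keyword] = ($found[$keyword] ?? 0) + 1;
        }
    }
    return $found;
}
```

With the arrays from the question you would call acSearch(acBuild(array_keys($list)), $fulltext) and then look up $list[$keyword] for each key of the result to get the numbers. Note that this is plain substring matching, so "Data" is also reported wherever "Database" occurs; if you only want whole words, you would need an extra boundary check on each match.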