box/spout - Massive memory usage when using hex color vs pre-defined colors

1.6k Views Asked by At

I'm using the Box/Spout library and it seems that using StyleBuilder with a custom hex color (e.g. 0000FF for blue) uses a ton of memory compared to using pre-defined colors such as Color::BLUE. Why would that be?

Relevant snippet:

//LOW MEMORY
$row[] = WriterEntityFactory::createCell('test', (new StyleBuilder())->setFontColor(Color::BLUE)->build());

//HIGH MEMORY
$row[] = WriterEntityFactory::createCell('test', (new StyleBuilder())->setFontColor($colorHex)->build());

Output:

setFontColor(Color::BLUE): Peak memory usage: 1666 KB

setFontColor($colorHex): Peak memory usage: 189436 KB

Full code:

(For demo purposes I'm loading a small 250x150 image to provide multiple color values)

<?php

    require_once 'Spout/Autoloader/autoload.php';
    use Box\Spout\Writer\Common\Creator\WriterEntityFactory;
    use Box\Spout\Common\Entity\Style\Color;
    use Box\Spout\Writer\Common\Creator\Style\StyleBuilder;

    //load an image
    $img = imagecreatefrompng('input/test250x150.png');

    $writer = WriterEntityFactory::createXLSXWriter();
    $writer->openToFile('output/MyExcel.xlsx');

    //height of the image
    for($y=0; $y<150; $y++) {

        //create or reset array to hold this row's cells
        $row = [];

        //width of the image
        for($x=0; $x<250; $x++) {

            //gets the pixel color
            $index = imagecolorat($img, $x, $y);
            $colorRGBArr = imagecolorsforindex($img, $index);
            $colorHex = sprintf("%02x%02x%02x", $colorRGBArr['red'], $colorRGBArr['green'], $colorRGBArr['blue']);

            //LOW MEMORY
            //$row[] = WriterEntityFactory::createCell('test', (new StyleBuilder())->setFontColor(Color::BLUE)->build());
            //HIGH MEMORY
            $row[] = WriterEntityFactory::createCell('test', (new StyleBuilder())->setFontColor($colorHex)->build());

        }
        $writer->addRow(WriterEntityFactory::createRow($row));
    }

    $writer->close();

    echo 'Peak memory usage: '.round(memory_get_peak_usage() / 1024).' KB';
?>
1

There are 1 best solutions below

3
On BEST ANSWER

tl;dr

While the Spout can be improved, Excel isn't designed for large quantities of styles, so this isn't really a flaw of the library (you might want to rescind the issue)

The Story

Alright, there's a few things at play here.. the code I used to test is at the bottom of my post - the key relevant color functions are jonEg and jonEgQuant (which is a $by variable inside that can be tuned)

  • Styling in Excel is a little like HTML+CSS. Styles are defined initially in the file (in a header block) and then referenced by name (class) in the sheets. Fundamentally this suggests the Excel storage format was not designed for a plethora of styles, but only a handful that are reused on many cells

  • Spout hides the style-definition complexity from the user - if the style's already defined (ie. an exact style already exists) then it reuses that definition, otherwise a new style is defined

  • Spout builds an index of all styles, by serializing the style object and storing it in an array. At approximately 1749 bytes per style this index chews up memory (in addition to the actual style objects it has in memory to generate the file) - this could be improved in Spout

  • With this code I'm guaranteeing each cell has a different style (using the row/column position to define the red and green components of the color) - which is more extreme than your image-color-picking, but perhaps not by much (I cannot tell without the sample image). There are 16M colors available with an RGB triplet.. but our eyes cannot always detect a difference of a few points. For example it'll be clear in a red gradient that 255,254,253...128 looks smooth, but a single block of 255,254,253 randomly distributed will probably look like a single color. For anything but a primary color, this variance now applies in three dimensions (r, g, & b). The JPEG format exploits this (a combination of compression, and noise in rebuilding) - so color picking what might look like a uniform block will still return slightly different values each time

  • When you run my code with jonEg you get (100*150+1) 15001 styles, which takes (15001*1749/1024/1024) ~25MB of memory for the index in Spout alone. Again this index is needed to prevent every single cell in Excel having it's own style (but of course, in this contrived example I've made sure it's pointless - every cell does have it's own style) - this uses ~100MB of memory

  • When you run my code with jonEgQuant (leaving $by=16;) only 71 styles are needed (but this is pretty extreme rounding, allowing only 4096 colors in total), - this uses ~2MB of memory

  • When you set $by=4 (inside ColorBuilder::jonEgQuant) - you now have up to quarter of a million colours, 989 styles are needed, using ~7MB of memory (more to the point, this looks similar to jonEg when opened in Excel)

  • Spout includes a method to convert RGB decimal values into a string Color::rgb($r,$g,$b) - but this doesn't change the outcome (spoutDoc method)

  • A style has many more dimensions than this - I've used background colour, but there's foreground (your example), underline, bold, italic, font face, font-size - all of these multiply the style quantity (reducing the color space here addresses your documented problem, but could be undone by altering other style components)

Take Aways

  • It's not the use of hex codes, RGB values or constants that's the issue, it's the quantity of styles you're using

  • Reduce the number of styles you use - Excel doesn't expect a large count, Spout likewise isn't optimized for this. Compressing the color (by rounding values as with jonEgQuant) is one avenue

  • Improve the Spout style index mechanism, but notice only some of the memory is consumed by the index - it's not small, but it's not the only contributor - each style object is needed to generate the result. Removing this style cache (which for this contrived example doesn't help) brings memory usage to 40MB (and generates the same Excel file). 60% savings isn't nothing, but more styles makes for more memory usage, larger Excel files, and slower opening/rendering of those files

Testing notes

  • I dropped the image reading/color harvesting while testing since it may add to memory usage but not to the meat of the problem

  • With $colCount=250; (from your example) you exceed the 128MB default memory limit in PHP, better to set it $colCount=100; and you can run all the tests without changing that setting

  • I switched to setting background-color, it's easier to see the difference when opened in Excel

  • For each color generator (jonEg, spoutDoc, fixedEg, jonEgQuant) a different XLSX is generated (helps to see the file size and rendering differences)

  • The file sizes match the Excel complexity - jonEg is a 179KB file (and Excel struggles to open), while jonEgQuant is 43KB

Code

<?php

require_once 'Spout/Autoloader/autoload.php';
use Box\Spout\Writer\Common\Creator\WriterEntityFactory;
use Box\Spout\Common\Entity\Style\Color;
use Box\Spout\Writer\Common\Creator\Style\StyleBuilder;

// -- -- Set this to one of the method names on ColorBuilder (that isn't a helper)
$choice='jonEg';
// -- -- 

class ColorBuilder {
    static $defaultBlue=255;
    static function jonEg($x,$y) {return sprintf("%02x%02x%02x", $x, $y, static::$defaultBlue);}
    static function spoutDoc($x,$y) {return Color::rgb($x, $y, static::$defaultBlue);}
    static function fixedEg($x,$y) {return Color::BLUE;}
    static function jonEgQuant($x,$y) {$by=16;return sprintf("%02x%02x%02x", static::_quantize($x,$by),static::_quantize($y,$by), static::_quantize(static::$defaultBlue,$by));}

    //Helpers - don't use these for choice
    static function validate(string $name):bool {
        if ($name==null) return false;//Null or empty
        if (substr($name,0,1)=='_') return false;//Private by convention
        if ($name==='validate') return false;//Not the function you seek
        return method_exists('ColorBuilder',$name);
    }
    private static function _quantize(int $i,int $by=16):int {return round($i/$by)*$by;}
}

function createRow($y,$color) {
    $colCount=100;
    $row = [];

    for($x=0; $x<$colCount; $x++) {
        $row[] = WriterEntityFactory::createCell('*', (new StyleBuilder())->setBackgroundColor(ColorBuilder::$color($x,$y))->build());
    }
    return $row;
}

function buildSheet($name) {
    if (!ColorBuilder::validate($name)) {throw new Error('Invalid color provider');}

    $writer = WriterEntityFactory::createXLSXWriter();
    $writer->openToFile('output/'.$name.'.xlsx');

    for($y=0; $y<150; $y++) {
        $writer->addRow(WriterEntityFactory::createRow(createRow($y,$name)));
    }

    $writer->close();
}

buildSheet($choice);
echo 'Peak memory usage: '.round(memory_get_peak_usage() / 1024).' KB';

Tech: PHP 7.4.2 CLI, Spout: 3.1.0, Win: 7 x64 (I know), Coffee: Venti Dark