Compression Discrepancy in BMEcat XML Files Generated via Talend vs. e-procat


Hello Stack Overflow community,

I am currently working on a project where I generate a BMEcat XML catalog (version 1.2) using Talend Studio. I have encountered a peculiar issue related to file compression, and I'm seeking insights on why there might be a significant difference in the compressed file sizes compared to another tool, e-procat.

Here's a brief overview of the problem:

  • Tool Used: Talend Studio
  • XML Schema: BMEcat version 1.2
  • Compression Algorithm: Deflate, normal compression level

Issue: The generated XML file has a raw size of approximately 3GB. When I compress it with Deflate at the normal compression level, the resulting zip file is around 400MB, i.e. only about 12% of the original size.

Interestingly, when I use e-procat to generate the same BMEcat XML file (with identical content and size) and compress it with the same Deflate algorithm at the normal compression level, the resulting zip file is significantly smaller: around 170MB, or about 5% of the original size.
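To make the two measurements reproducible outside of any particular zip tool, here is a minimal sketch that streams a file through raw Deflate and reports the compressed size. It assumes that "normal" corresponds to zlib level 6, which may not match what a given zip tool actually uses; the function name `deflate_ratio` is just illustrative.

```python
import sys
import zlib


def deflate_ratio(path, level=6, chunk_size=1 << 20):
    """Stream a file through raw Deflate; return (raw_bytes, compressed_bytes)."""
    # wbits=-15 selects a raw Deflate stream (no zlib/gzip header),
    # which is the stream format stored inside a zip entry.
    # level=6 is an assumption for "normal compression".
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    raw = compressed = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            raw += len(chunk)
            compressed += len(compressor.compress(chunk))
    compressed += len(compressor.flush())
    return raw, compressed


if __name__ == "__main__":
    # Pass the Talend and e-procat files as command-line arguments.
    for path in sys.argv[1:]:
        raw, compressed = deflate_ratio(path)
        share = compressed / raw if raw else 0.0
        print(f"{path}: {raw} -> {compressed} bytes ({share:.1%})")
```

Running this on both files with the same level takes the zip tool out of the equation, so any remaining gap has to come from the file contents themselves.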

Observations:

  1. The notation at the top of the files differs. The Talend-generated file starts with:

    <?xml version='1.0' encoding='UTF-8'?><BMECAT xmlns="http://www.bmecat.org/XMLSchema/1.2/bmecat_new_catalog" version="1.2">
    

    The e-procat-generated file starts with:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE BMECAT SYSTEM "bmecat_new_catalog_1_2.dtd">
    <BMECAT version="1.2">
    
  2. The Talend-generated file is written on a single line, whereas the e-procat-generated file is indented.

Questions:

  1. Why do the compression ratios differ? In my tests, changing the header notation and indenting the Talend-generated file made no difference; it still compresses to only about 12%.
  2. Are there any Talend Studio configurations or best practices related to XML generation and compression that I might be overlooking?

I appreciate any insights or suggestions on how to achieve a more efficient compression ratio using Talend Studio.

Thank you in advance!

I experimented with different approaches to improve the compression ratio of the Talend-generated BMEcat XML file. Specifically, I made the following attempts:

  1. Changed Notation: I altered the notation at the top of the Talend-generated file to match the e-procat-generated file, changing it from:

    <?xml version='1.0' encoding='UTF-8'?><BMECAT xmlns="http://www.bmecat.org/XMLSchema/1.2/bmecat_new_catalog" version="1.2">
    

    to:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE BMECAT SYSTEM "bmecat_new_catalog_1_2.dtd">
    <BMECAT version="1.2">
    

    Despite this change, the compression ratio remained around 12%.

  2. Indented the File: I formatted the Talend-generated file so that it is indented like the e-procat-generated file. Even with this adjustment, the compression ratio did not improve.

I expect to achieve a compression ratio comparable to e-procat's, which consistently reaches about 5%. I'm seeking guidance on Talend configurations, best practices, or other approaches that could improve the compression efficiency.
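Since the header and indentation changes had no effect, one further diagnostic would be to check whether the gap is spread evenly over the file or concentrated in certain regions. The sketch below is only that, a sketch: it compresses each fixed-size chunk independently and returns the per-chunk ratios. The function name `chunk_ratios`, the 64MB chunk size, and level 6 are my own assumptions, not settings from Talend or e-procat.

```python
import zlib


def chunk_ratios(path, chunk_size=64 << 20, level=6):
    """Compress each chunk on its own; return compressed/raw ratio per chunk."""
    ratios = []
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            # Each chunk is compressed independently, so a region that
            # compresses poorly shows up as a high ratio at its position.
            ratios.append(len(zlib.compress(chunk, level)) / len(chunk))
    return ratios
```

Comparing the two lists side by side (for example `chunk_ratios("talend.xml")` against `chunk_ratios("eprocat.xml")`, both file names hypothetical) would show whether the Talend output is uniformly less compressible or only in particular sections of the catalog.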
