Writing JSON data to a text file, but it's running out of heap space


I have this code, which works fine when the database is small with few records; it writes the JSON to the file properly. But when the data is huge, it just times out:

<cfloop list="t" index="k">
            <cfquery name="qry">
                select * from #k# 
            </cfquery>
            <cfset js= serializeJSON(qry,'struct')>         
            <cffile action="write" file="#k#" output="#js#">
        </cfloop>

I tried using threads, but they are not working either: it just creates empty table files with no values when I use cfthread with joins.

I thought of splitting the output into chunks of 1000 records per table, so a table with millions of records would produce files like table_1, table_2, table_3, while tables with fewer than 1000 records would get just one file.

But I'm not sure which approach is best, and I need a starting point.

3 Answers

Alex

First of all, let's split this up:

Resultset from database

<cfquery name="qry">
    select * from #k# 
</cfquery>
  • The database server retrieves the data and streams it over the network to the ColdFusion server
  • ColdFusion stores the data in a query object and keeps it on the heap

Serializing the resultset from database

<cfset js = serializeJSON(qry, 'struct')>
  • ColdFusion recursively serializes the whole query object
  • ColdFusion creates a string object that contains the serialized data and stores it in the heap

Writing the serialized resultset from memory onto the filesystem

<cffile action="write" file="#k#" output="#js#">
  • ColdFusion writes the string object into a file on the filesystem

Doing all of this within the same request/thread

<cfloop list="t" index="k">
    ...
 </cfloop>

Conclusion

Your code tortures the JVM heap, because references have to be kept until the end of each iteration. The GC can only clean up after a full table has been processed. Large tables (1,000,000+ rows) will likely kill the thread or even hang the JVM.


The Fix: Resultset from database

Retrieving large resultsets at once will always hurt performance. While streaming lots of data within a local network (assuming the database is in the same network) just takes a bit more time, the memory required to store the full resultset is going to be an issue for the JVM.

Instead of doing everything at once, consider splitting it up in smaller chunks of data. Use OFFSET and FETCH in the SQL statement to limit the number of rows per loop. Having multiple iterations will allow the Java GC to free up memory used by previous iterations, relieving the heap.
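One caveat for the code further below: on SQL Server, OFFSET/FETCH is only valid after an ORDER BY clause, and paging is only consistent when that ordering is stable (ideally the primary key). A minimal sketch, where some_table and its id key column are placeholders for your own table and key:

<cfquery name="qry">
    select *
    from some_table
    <!--- OFFSET/FETCH must follow an ORDER BY; 'id' is a placeholder for your key column --->
    order by id
    offset 50000 rows fetch next 50000 rows only
</cfquery>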

The Fix: Serializing the resultset from database

Same issue. Big datasets will hurt performance. Split the work up by serializing row by row instead of all rows at once.

The Fix: Writing the serialized resultset from memory onto the filesystem

While this one probably doesn't need a fix on its own, you will eventually have to switch to writing line after line anyway.
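If appending one row at a time ever becomes the bottleneck (every fileAppend call reopens the file), here is a minimal sketch using a file handle instead; it assumes the queryRowToStruct helper referenced at the end of this answer, and the output path is a placeholder:

<!--- open the target file once in append mode instead of reopening it per row --->
<cfset fh = fileOpen(expandPath("./out.json"), "append", "UTF-8")>
<cftry>
    <cfloop query="qry">
        <!--- write one JSON object per line --->
        <cfset fileWriteLine(fh, serializeJSON(queryRowToStruct(qry, qry.currentRow)))>
    </cfloop>
    <cffinally>
        <cfset fileClose(fh)>
    </cffinally>
</cftry>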


Some code

<cfset maxRowsPerIteration = 50000>

<cfloop list="t" index="k">

    <!--- create empty file to append lines later --->
    <cfset fileWrite(k, "")>

    <cfset rowsOffset = 0>

    <!--- NOTE: you might want to lock the table (prevent write access) here --->

    <!--- infinite loop will be terminated as soon the query no longer returns any rows --->
    <cfloop condition="true">
    
        <!--- fetch a slice of the full table --->
        <cfquery name="qry">
            select * from #k# OFFSET #rowsOffset# ROWS FETCH NEXT #maxRowsPerIteration# ROWS ONLY
        </cfquery>

        <cfif not qry.recordCount>
            <cfbreak>
        </cfif>

        <cfset rowsOffset += maxRowsPerIteration>

        <cfloop query="qry">

            <cfset rowJSON = serializeJSON(
                queryRowToStruct(qry, qry.currentRow)
            )>

            <!--- append this row's JSON; note the file ends up as a stream of JSON objects, not a single JSON array --->
            <cfset fileAppend(k, rowJSON, "UTF-8", true)>

        </cfloop>

    </cfloop>

    <!--- NOTE: if you locked the table previously, unlock it here --->

</cfloop>

For a reference implementation of queryRowToStruct, check CFLib.
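If you don't want to pull the CFLib version in, a minimal sketch of queryRowToStruct could look like the following (the CFLib implementation handles more edge cases):

<cffunction name="queryRowToStruct" output="false" returntype="struct">
    <cfargument name="qry" type="query" required="true">
    <cfargument name="row" type="numeric" required="true">
    <cfset var result = {}>
    <cfset var col = "">
    <!--- copy every column value of the given row into a struct --->
    <cfloop list="#arguments.qry.columnList#" index="col">
        <cfset result[col] = arguments.qry[col][arguments.row]>
    </cfloop>
    <cfreturn result>
</cffunction>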

James A Mohler

This is really a comment, but it is way too long.

SQL Server 2017 can create the JSON directly using its FOR JSON clause:

   <cfloop list="t" index="k">
        <cfquery name="qry">
            SELECT (
                SELECT * 
                FROM #k#
                FOR JSON AUTO
            ) AS data  
        </cfquery>

        <cffile action="write" file="#k#" output="#qry.data#">
    </cfloop>
Sev Roberts

Others have touched upon the JVM and Garbage Collection but have not followed up on the potential quick win due to how CF handles GC.

CF can GC after each function returns, as well as at the end of each request. So if you do something that uses a lot of memory a few times in a loop, or something that uses a moderate amount of memory many times in a loop, abstract that 'something' into a function and call the function inside the loop. That way the memory can be released per iteration if necessary, instead of being held until the end of the request and potentially maxing out the heap space before the end-of-request garbage collection.

E.g. refactoring your original code like this is much more GC-friendly:

<cffunction name="tableToFile" output="false">
    <cfargument name="tableName" type="variableName" required="true" />
    <cfquery name="local.qry">
        select * from #arguments.tableName#
    </cfquery>
    <cfset local.js = serializeJSON(local.qry,'struct') />
    <cffile action="write" file="#arguments.tableName#" output="#local.js#" />
</cffunction>

<cfloop list="t" index="k">
    <cfset tableToFile(tableName=k) />
</cfloop>

This approach won't solve your problem on its own, though, if any single iteration of that loop consumes too much memory because the query is too large. If that is your problem, then you should implement this in combination with an approach like Alex's to fetch your rows in batches, and, presuming your SQL Server is better up to the task than your Lucee server, James's approach of letting SQL Server do the JSON serialization. A sketch combining function-scoping with batching follows below.
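For illustration, here is a minimal sketch combining the function-per-iteration idea with Alex's OFFSET/FETCH batching. Treat it as a starting point, not a drop-in solution: batchSize, the order by 1 placeholder (use your primary key), and the #table#_#n# file naming the question suggested are all illustrative assumptions.

<cffunction name="batchToFile" output="false" returntype="numeric">
    <cfargument name="tableName" type="variableName" required="true" />
    <cfargument name="offset" type="numeric" required="true" />
    <cfargument name="batchSize" type="numeric" required="true" />
    <cfargument name="fileSuffix" type="numeric" required="true" />
    <cfquery name="local.qry">
        select * from #arguments.tableName#
        <!--- placeholder ordering; use the table's primary key for stable paging --->
        order by 1
        offset #arguments.offset# rows fetch next #arguments.batchSize# rows only
    </cfquery>
    <cfif local.qry.recordCount>
        <cffile action="write"
                file="#arguments.tableName#_#arguments.fileSuffix#"
                output="#serializeJSON(local.qry, 'struct')#" />
    </cfif>
    <!--- local.qry goes out of scope on return, so the GC can reclaim each batch --->
    <cfreturn local.qry.recordCount />
</cffunction>

<cfset batchSize = 50000>
<cfloop list="t" index="k">
    <cfset batchNo = 1>
    <!--- write numbered files (table_1, table_2, ...) until a batch comes back empty --->
    <cfloop condition="true">
        <cfset rows = batchToFile(k, (batchNo - 1) * batchSize, batchSize, batchNo)>
        <cfif not rows>
            <cfbreak>
        </cfif>
        <cfset batchNo += 1>
    </cfloop>
</cfloop>

Note that a table with fewer rows than batchSize still gets a single table_1 file; suppressing the suffix for the single-file case is left out to keep the sketch short.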