I am using PL/Container (Python) in Greenplum Database. It seems that when a PL/Container function returns a composite type with multiple fields, the function is executed once per field for each input row. For example, in the code below the return type has two fields, result_var1 and result_var2, and the function runs twice for every input row.

-- Create PL/Container(Python) Return Type
create type anal.plpy_func_01_type as (
    result_var1 text,
    result_var2 _text
);
-- Create PL/Container Function Example
create or replace function anal.fn_plpy_func_01(
    seq_str text,
    col1 text,
    col2 numeric
)
returns anal.plpy_func_01_type
language plcontainer
volatile
as $$

# container: plc_python_shared
plpy.notice('Input : {} - {} - {}'.format(seq_str, col1, col2))
col2_str = str(col2)

return {
    'result_var1' : col1,
    'result_var2' : [col1, col2_str]
}

$$
EXECUTE ON ANY;
-- Create Input Data Table of PL/Container Function
drop table if exists tb_00 ;
create temp table tb_00 (col1 text, col2 text, col3 numeric);
insert into tb_00 (col1, col2, col3) values
('id1', 'sample_100', 9.9),
('id2', 'sample_200', 99.99),
('id3', 'sample_300', 999.999);
-- Create Result Table of PL/Container Function
drop table if exists tb_01 ;
create temp table tb_01 with (appendonly='true', compresstype=zstd, compresslevel='1') as
select 
    t.result_var1,
    unnest(t.result_var2) result_var2
from (
    select
    (anal.fn_plpy_func_01(col1, col2, col3)).* from tb_00
) t
;

Running this produces 6 rows in tb_01, as expected, which a SELECT confirms:

select * from tb_01;

However, the messages logged by plpy.notice show that each input row was processed twice:

Input : id3 - sample_300 - 999.999 (seg6 slice1 ... pid=#####)
Input : id3 - sample_300 - 999.999 (seg6 slice1 ... pid=#####)
Input : id2 - sample_200 - 99.99 (seg7 slice1 ... pid=####)
Input : id2 - sample_200 - 99.99 (seg7 slice1 ... pid=####)
Input : id1 - sample_100 - 9.9 (seg1 slice1 ... pid=###)
Input : id1 - sample_100 - 9.9 (seg1 slice1 ... pid=###)

When I tested with a return type that has four fields (result_var1, result_var2, result_var3, result_var4), the notice was printed four times per input row.

As a result, the more fields the return type has, the more times the function runs, which increases load and processing time. Is there a way to make the PL/Container function execute only once per input row?
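
One possible cause, offered as an assumption rather than a confirmed diagnosis for PL/Container: in PostgreSQL, on which Greenplum is based, `(func(...)).*` is expanded at parse time into one `(func(...)).field` expression per field, so the function is evaluated once per field. A minimal sketch of the usual rewrite returns the composite value as a single column in the subquery and expands its fields only in the outer query. The alias `r` is made up for this example, and whether the Greenplum planner keeps the subquery unflattened in this case has not been verified.

```sql
-- Sketch: call the function once per row and keep the composite value
-- as a single column (the alias "r" is hypothetical).
select
    (t.r).result_var1,
    unnest((t.r).result_var2) as result_var2
from (
    -- No ".*" here, so only one function call appears in the target list.
    select anal.fn_plpy_func_01(col1, col2, col3) as r
    from tb_00
) t;
```

Checking the plpy.notice output again would show whether each input row is now logged once. If the notices still repeat, another option to test is materializing the composite column `r` into a temporary table first and expanding its fields in a second statement.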
