Note: I am looking for any hints on how to debug this issue, not necessarily a direct answer to this specific problem.
I am measuring the performance of PipelineDB for use in our system. I have defined a few continuous views (calculating sums, top-K and such), all feeding from a single stream (which has ~20 columns: some text, mostly integers and booleans). The test is written in Python, and I use the psycopg2 cursor.copy_from() function to achieve maximum insert performance. PipelineDB behaves nicely when the work specified by the continuous views is not too complicated. However, when I ask it to calculate many top-K results or many percentile_cont() values, the test hangs with the following symptoms:
- The (single) 'worker0' process starts eating 100% CPU
- The backend process handling the input shows that it is running the COPY command and never changes to IDLE (during normal operation it alternates between COPY and IDLE)
- The test hangs (i.e. the copy_from() call never returns)
Below is the output of a 'ps -ef' command showing all pipelinedb processes, after about a minute of running the test. Note that the worker0 process has been consuming 100% CPU since the beginning of the test and never resumes normal work ('top' shows it pinned at exactly 100% CPU).
The test logs show that it runs fine for the first ~1 second, inserting ~30,000 events (in batches of 100), and then it hangs because a call to copy_from() does not return.
When I reduce the amount of work PipelineDB has to do (by removing some of the continuous views), the test works fine, achieving up to 20,000 inserts per second, sustained for at least one minute.
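For reference, the insert loop of the test is essentially the following (a simplified sketch: the stream and column names are placeholders and the real stream has ~20 columns, but the batching and the copy_from() call are the same):

import io
import psycopg2

# Simplified sketch of the test's insert loop. "my_stream" and the column
# names are placeholders; the real stream has ~20 columns.
conn = psycopg2.connect("dbname=db1 user=ut_user host=localhost")
conn.autocommit = True
cur = conn.cursor()

BATCH_SIZE = 100

def insert_batch(rows):
    # Serialize the batch into the tab-separated format expected by
    # COPY ... FROM STDIN and push it to the stream in one call.
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(v) for v in row) + "\n")
    buf.seek(0)
    # This is the call that never returns once worker0 gets stuck.
    cur.copy_from(buf, "my_stream", columns=("ts", "user_id", "value"))

# All events in the test carry the same timestamp (placeholder value here).
batch = [("2019-01-01 12:00:00", i, i * 10) for i in range(BATCH_SIZE)]
for _ in range(300):   # ~30,000 events total
    insert_batch(batch)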
I would like to note that all events have the same timestamp, and all views have a "GROUP BY minute" clause, so only a single row should be created/updated in every continuous view during the test.
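To give an idea, the percentile views are variations of the sketch below (stream and column names are placeholders, and the "heavy" configurations simply have many such views; the top-K views have the same shape but use PipelineDB's top-K aggregate instead of percentile_cont):

import psycopg2

# Sketch of one of the "heavy" continuous views. It assumes the stream was
# created beforehand with typed columns (CREATE STREAM). Every view groups
# by the minute of the event timestamp, and since all test events carry the
# same timestamp, each view should only ever touch a single group/row.
conn = psycopg2.connect("dbname=db1 user=ut_user host=localhost")
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE CONTINUOUS VIEW cv_value_percentiles AS
    SELECT date_trunc('minute', ts) AS minute,
           percentile_cont(0.50) WITHIN GROUP (ORDER BY value) AS p50,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY value) AS p95,
           percentile_cont(0.99) WITHIN GROUP (ORDER BY value) AS p99,
           sum(value) AS total
    FROM my_stream
    GROUP BY minute
""")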
I have played with several configuration parameters, specifically those related to memory buffer sizes, sync methods, time intervals, max_wait and the number of workers, and could not find any combination that avoids the problem.
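In case exact values would help, this is how I dump the settings I have been varying between runs (the LIKE pattern is just my guess at the prefix this PipelineDB version uses for its own settings; adjust it if your build names them differently):

import psycopg2

# Report the current values of the settings being tuned between runs.
conn = psycopg2.connect("dbname=db1 user=ut_user host=localhost")
cur = conn.cursor()
cur.execute("""
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name LIKE 'continuous_query%'
       OR name IN ('shared_buffers', 'work_mem', 'wal_sync_method',
                   'synchronous_commit', 'checkpoint_timeout')
    ORDER BY name
""")
for name, setting, unit in cur.fetchall():
    print(name, setting, unit or "")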
I do not know whether I am hitting a PipelineDB issue or a PostgreSQL issue. It is certainly not expected behavior and cannot be tolerated in a real application. Any hints, guesses, gut feelings etc. are welcome.
[orens@rd10 ~]$ps -ef | grep pipelinedb
UID PID PPID C STIME TTY TIME CMD
orens 3005 3004 0 11:17 ? 00:00:00 pipelinedb: logger process
orens 3007 3004 0 11:17 ? 00:00:00 pipelinedb: checkpointer process
orens 3008 3004 0 11:17 ? 00:00:00 pipelinedb: writer process
orens 3009 3004 0 11:17 ? 00:00:00 pipelinedb: wal writer process
orens 3010 3004 0 11:17 ? 00:00:00 pipelinedb: autovacuum launcher process
orens 3011 3004 0 11:17 ? 00:00:00 pipelinedb: stats collector process
orens 3012 3004 0 11:17 ? 00:00:00 pipelinedb: pipelinedb scheduler process
orens 3014 3004 0 11:17 ? 00:00:00 pipelinedb: bgworker: reaper0 [pipeline]
orens 3015 3004 0 11:17 ? 00:00:00 pipelinedb: bgworker: queue0 [pipeline]
orens 3016 3004 0 11:17 ? 00:00:00 pipelinedb: bgworker: combiner1 [pipeline]
orens 3017 3004 0 11:17 ? 00:00:00 pipelinedb: bgworker: combiner0 [pipeline]
orens 3018 3004 0 11:17 ? 00:00:00 pipelinedb: bgworker: worker0 [pipeline]
orens 3046 3004 0 11:17 ? 00:00:00 pipelinedb: bgworker: reaper0 [db1]
orens 3050 3004 0 11:17 ? 00:00:00 pipelinedb: bgworker: queue0 [db1]
orens 3052 3004 0 11:17 ? 00:00:00 pipelinedb: bgworker: combiner0 [db1]
orens 3056 3004 90 11:17 ? 00:01:06 pipelinedb: bgworker: worker0 [db1]
orens 3132 3004 1 11:17 ? 00:00:01 pipelinedb: ut_user db1 ::1(58830) COPY
[orens@rd10 ~]$
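For what it's worth, this is roughly how I have been inspecting the server from a second connection while the test is hung (a sketch; I am not sure which catalog views are the most useful ones to look at here):

import psycopg2

# Open a second connection while the test is hung and look at what the
# backends are doing and whether anything is waiting on a lock.
conn = psycopg2.connect("dbname=db1 user=ut_user host=localhost")
cur = conn.cursor()

# What is each backend doing? The hung COPY backend shows up here.
cur.execute("SELECT pid, state, query FROM pg_stat_activity")
for pid, state, query in cur.fetchall():
    print(pid, state, (query or "")[:80])

# Is anything actually blocked on a lock, as opposed to spinning on CPU?
cur.execute("SELECT pid, locktype, mode, granted FROM pg_locks WHERE NOT granted")
print(cur.fetchall())

I have also considered attaching gdb to the worker0 PID to grab a backtrace, but I am not sure what to look for in it.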