Thread: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

Hi,

I am using an SQL queue for distributing work to massively parallel workers. Workers come in servers with 12 parallel threads each. One worker set handles 7 transactions per second. If I add a second one, for 24 parallel workers, it scales to 14/s. With a third, for 36 parallel workers, I reach 21/s. With a fourth set, 48 workers, I run into trouble. But that is not even my main problem; the real issue is that within a short time the performance deteriorates, and it looks like that is because the queue index deteriorates and needs a REINDEX. The queue table is essentially this:

CREATE TABLE Queue (
  jobId bigint,
  action text,
  pending boolean,
  result text
);

the dequeue operation is essentially this:

BEGIN

SELECT jobId, action
  FROM Queue
 WHERE pending
   FOR UPDATE SKIP LOCKED

which is a wonderful concept with the SKIP LOCKED.

Then I perform the action and finally:

UPDATE Queue
   SET pending = false,
       result = ?
 WHERE jobId = ?

COMMIT

I thought to keep my index tight, I would define it like this:

CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;

so that only pending jobs are in that index.

When a job is done, follow up work is often inserted into the Queue as pending, thus adding to that index.

Below is the performance chart.

The blue line at the bottom is the db server.


You can see the orange line is the first worker server with 12 threads. It settled into a steady state of 7/s at 90% CPU for some 30 min; then the database CPU% started climbing. I tried to rebuild the indexes on the queue but got stuck waiting on an exclusive lock: no jobs were processing, yet the exclusive lock could never be obtained for long enough. So I shut down the worker server. With the database quiet I could resolve the messed-up indexes and restart.

Soon I added a second worker server (green line), starting around 19:15. Once settled in, they were pulling 14/s together, but you can see that in just 15 min the db server CPU% climbed again to over 40%, and the performance of the workers dropped, their load falling to 30%. At around 19:30 I stopped them all, REINDEXed the queue table, and then started 3 worker servers simultaneously. They settled in at 21/s, but in just 10 min the deterioration happened again.

Again I stopped them all, REINDEXed, and started 4 worker servers (48 threads). This time 5 min was not enough to see them settle into a decent 28/s transaction rate; I guess they might have reached it for a minute or two before the index deteriorated again. I did another stop, then started only 2 servers, and again the index soon deteriorated.

Clearly that index is deteriorating quickly, in about 10,000 transactions.
BTW: when I said 7/s, it is in reality about 4 times as many transactions, because of the follow-up jobs that also get added to this queue. So 10,000 transactions may be 30 or 40 k transactions before the index deteriorates.

Do you have any suggestions for how I can avoid that index deterioration problem smoothly?

I figured I might just pause all workers briefly to run the REINDEX Queue command, but the problem with this is that while the transaction volume is large, some jobs may take minutes to process. In that case we would need to wait minutes to quiet the database, with 47 workers sitting as idle capacity while waiting for the 48th to finish, just so that the index can be rebuilt!

Of course I tried to resolve the issue with vacuumdb --analyze (just in case the autovacuum doesn't act in time), and that doesn't help. vacuumdb --full --analyze would probably help but can't work, because it requires an exclusive table lock. I tried to just create a new index of the same form:

CREATE UNIQUE INDEX Queue_idx2_pending ON Queue(jobId) WHERE pending;
DROP INDEX Queue_idx_pending;
ANALYZE Queue; 

but with that I got completely stuck with two indexes, where I could not remove either of them because of those locking issues. And REINDEX gives me a deadlock error outright.

I am looking for a way to manage that index so that it does not deteriorate.

Maybe if I did not define it with

... WHERE pending;

then it would only grow but never shrink. Maybe that helps somehow? I doubt it, though. Adding to an index also causes deterioration, and most of the rows would be irrelevant because they would be past work. It would be nicer if there were another smooth way.

regards,
-Gunther



On Sat, Feb 23, 2019 at 1:06 PM Gunther <raj@gusw.net> wrote:
> I thought to keep my index tight, I would define it like this:
>
> CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;
>
> so that only pending jobs are in that index.
>
> When a job is done, follow up work is often inserted into the Queue as pending, thus adding to that index.

How many distinct jobIds are there in play, roughly? Would you say
that there are many fewer distinct Jobs than distinct entries in the
index/table? Is the number of jobs fixed at a fairly low number, that
doesn't really grow as the workload needs to scale up?

-- 
Peter Geoghegan


On 2/23/2019 16:13, Peter Geoghegan wrote:
> On Sat, Feb 23, 2019 at 1:06 PM Gunther <raj@gusw.net> wrote:
>> I thought to keep my index tight, I would define it like this:
>>
>> CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;
>>
>> so that only pending jobs are in that index.
>>
>> When a job is done, follow up work is often inserted into the Queue as pending, thus adding to that index.
> How many distinct jobIds are there in play, roughly? Would you say
> that there are many fewer distinct Jobs than distinct entries in the
> index/table? Is the number of jobs fixed at a fairly low number, that
> doesn't really grow as the workload needs to scale up?

Jobs start on another, external queue, there were about 200,000 of them 
waiting when I started the run.

When the SQL Queue is empty, the workers pick one job from the external 
queue and add it to the SQL queue.

When that happens, 2 more jobs are immediately created on that queue.
Let's call them phases 1a and 1b.

When phase 1a has been worked off, another follow-up job is created.
Let's call it phase 2.

When phase 2 has been worked off, a final phase 3 job is created.

When that is worked off, nothing new is created, and the next item is 
pulled from the external queue and added to the SQL queue.

So this means each of the 200,000 items adds (up to) 4 jobs onto the
queue during its processing.

But since these 200,000 items are on an external queue, the SQL queue
itself is not stuffed full at all. It only grows slowly, and in the main
index, where we have only the pending jobs, there are probably only about
20 entries at any given point in time. When I said 7 jobs per second, I
meant 7/s combined across all these 3+1 phases, i.e., 28 jobs per second.
At that rate it takes a little less than 30 min for the index to
deteriorate; i.e., once about 50,000 queue entries have been processed
through that index, it has deteriorated to become nearly unusable until
it is rebuilt.

thanks,
-Gunther




On Sun, 24 Feb 2019 at 10:06, Gunther <raj@gusw.net> wrote:
> I am using an SQL queue for distributing work to massively parallel workers. Workers come in servers with 12 parallel
> threads. One of those worker sets handles 7 transactions per second. If I add a second one, for 24 parallel workers, it
> scales to 14 /s. Even a third, for 36 parallel workers, I can add to reach 21 /s. If I try a fourth set, 48 workers, I
> end up in trouble. But that isn't even so much my problem rather than the fact that in short time, the performance will
> deteriorate, and it looks like that is because the queue index deteriorates and needs a REINDEX.

It sounds very much like auto-vacuum is simply unable to keep up with
the rate at which the table is being updated. Please be aware that, by
default, auto-vacuum is configured to run fairly slowly so as not to
saturate low-end machines.

vacuum_cost_limit / autovacuum_vacuum_cost_limit control how many
"points" the vacuum process can accumulate before it pauses for
autovacuum_vacuum_cost_delay / vacuum_cost_delay.

Additionally, after an auto-vacuum run completes it will wait for
autovacuum_naptime before checking again if any tables require some
attention.

I think you should be monitoring how many auto-vacuum workers are
busy during your runs. If you find that the "queue" table is being
vacuumed almost constantly, then you'll likely want to increase
vacuum_cost_limit / autovacuum_vacuum_cost_limit. You can get an
idea of how often this table is being auto-vacuumed by setting
log_autovacuum_min_duration to 0 and checking the logs. Another way
to check is to periodically sample what SELECT query FROM
pg_stat_activity WHERE query LIKE 'autovacuum%'; returns. You may find
that all of the workers are busy most of the time. If so, that
indicates that the cost limits need to be raised.
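For example, something along these lines (the values are illustrative only, not recommendations; tune for your hardware):

```sql
-- Raise the global cost limit so (auto)vacuum does more work between sleeps.
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
SELECT pg_reload_conf();

-- Or target just the hot table: the default trigger of 20% of the table
-- changing is far too lazy for a small, constantly-updated queue table.
ALTER TABLE Queue SET (
  autovacuum_vacuum_scale_factor = 0.0,
  autovacuum_vacuum_threshold    = 1000,
  autovacuum_vacuum_cost_delay   = 0
);
```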

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Some ideas:

You could ALTER TABLE SET (fillfactor=50) to try to maximize use of HOT indices
during UPDATEs (check pg_stat_user_indexes).

You could also ALTER TABLE SET autovacuum parameters for more aggressive vacuuming.

You could recreate indices using the CONCURRENTLY trick
(CREATE INDEX CONCURRENTLY new; DROP old; ALTER .. RENAME;)
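Roughly like this (index definition taken from your post; an untested sketch, and note that the CONCURRENTLY commands cannot run inside a transaction block):

```sql
-- Build a replacement without blocking writers, then swap names.
CREATE UNIQUE INDEX CONCURRENTLY queue_idx_pending_new
    ON Queue (jobId) WHERE pending;
DROP INDEX CONCURRENTLY queue_idx_pending;
ALTER INDEX queue_idx_pending_new RENAME TO queue_idx_pending;
```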

On Sat, Feb 23, 2019 at 04:05:51PM -0500, Gunther wrote:
> I am using an SQL queue for distributing work to massively parallel workers.
> [...]
> I am looking for a way to manage that index so that it does not deteriorate.




On Sun, Feb 24, 2019 at 2:04 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
Some ideas:

You could ALTER TABLE SET (fillfactor=50) to try to maximize use of HOT indices
during UPDATEs (check pg_stat_user_indexes).

You could also ALTER TABLE SET autovacuum parameters for more aggressive vacuuming.

You could recreate indices using the CONCURRENTLY trick
(CREATE INDEX CONCURRENTLY new; DROP old; ALTER .. RENAME;)

I have basically the same issue with a table. Each new row enters the table with an active=true kind of flag. The row is updated a lot until a business condition expires; then it is updated to active=false, and after that the row is almost never updated again.

We also used a partial index, to good effect, but also had/have an issue where the index bloats and performs worse rather quickly, only to recover a bit after an autovacuum pass completes.

Lowering the fillfactor isn't a good solution because 99%+ of the table is "cold".

One manual VACUUM FREEZE, coupled with lowering the vacuum sensitivity on that one table, helps quite a bit by increasing the frequency and shortening the runtimes of autovacuums, but it's not a total solution.

My next step is to partition the table on the "active" boolean flag, which eliminates the need for the partial indexes, and allows for different fillfactor for each partition (50 for true, 100 for false). This should also aid vacuum speed and make re-indexing the hot partition much faster. However, we have to upgrade to v11 first to enable row migration, so I can't yet report on how much of a solution that is.
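In sketch form, the partitioning I have in mind looks like this (hypothetical names; the syntax assumes v11, where UPDATEs that change the partition key migrate the row between partitions):

```sql
-- Hypothetical queue-like table partitioned on its boolean flag.
CREATE TABLE queue (
    jobid   bigint,
    action  text,
    active  boolean NOT NULL,
    result  text
) PARTITION BY LIST (active);

-- Hot partition: stays small; low fillfactor leaves room for HOT updates.
CREATE TABLE queue_active PARTITION OF queue
    FOR VALUES IN (true) WITH (fillfactor = 50);

-- Cold partition: written once, almost never updated again.
CREATE TABLE queue_inactive PARTITION OF queue
    FOR VALUES IN (false);
```

With this layout, `UPDATE queue SET active = false WHERE jobid = ...` moves the row out of the hot partition automatically, and no partial index is needed to find active rows.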

Thank you all for responding so far.

David Rowley and Justin Pryzby suggested things about autovacuum, but I don't think autovacuum has any helpful role here. I am explicitly doing a vacuum on that table, and it doesn't help at all. Almost not at all.

I want to believe that

VACUUM FREEZE Queue;

will push the database CPU% down again once it is climbing, and I can do this maybe 3 or 4 times, but ultimately I will always have to rebuild the index. But also, none of these vacuum operations takes very long at all. It is just not efficacious.

Rebuilding the index by building a new index, removing the old one, renaming, and vacuuming again is prone to getting stuck.

I tried to do it in a transaction. But it says CREATE INDEX can't be done in a transaction.

Need to CREATE INDEX CONCURRENTLY ... and I can't even do that in a procedure.

If I do it manually by issuing first CREATE INDEX CONCURRENTLY new and then DROP INDEX CONCURRENTLY old, it might work once, but usually it just gets stuck with two indexes. Although I noticed that it does actually bring CPU back down and improve transaction throughput.

I also noticed that after I quit from DROP INDEX CONCURRENTLY old, that index is shown as INVALID:

\d Queue
...
Indexes:
    "queue_idx_pending"  UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending2" UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending3" UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending4" UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending5" UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending6" UNIQUE, btree (jobId, action) WHERE pending
...

and so I keep doing that same routine hands-on: every time the CPU% creeps above 50% I do

CREATE UNIQUE INDEX CONCURRENTLY Queue_idx_pending6 ON Queue(jobId, action) WHERE currentlyOpen;
DROP INDEX CONCURRENTLY Queue_idx_pending5;

at which point it hangs. I interrupt the DROP command, which leaves the old index behind as "INVALID". Then I run:

VACUUM FREEZE ANALYZE Queue;

At this point the db's CPU% drops below 20% again, now that the new index has been built.

Unfortunately this is a totally hands-on approach; I have to do this every 5 minutes or so, and possibly the time between these interventions decreases. It also leads to inefficiency over time, even though the CPU seemingly recovers.

So this isn't sustainable (worse because my Internet connection constantly drops).

What I am most puzzled by is that no matter how long I wait, the DROP INDEX CONCURRENTLY never completes. Why is that?

Also, the REINDEX command always fails with a deadlock because there is a row lock and a complete table lock involved.

I consider this ultimately a bug, or at the very least there is room for improvement. And I am on version 11.1.
regards,
-Gunther

On Sun, Feb 24, 2019 at 12:45:34PM -0500, Gunther wrote:
> What I am most puzzled by is that no matter how long I wait, the DROP INDEX
> CONCURRENTLY never completes. Why is that?

https://www.postgresql.org/docs/11/sql-dropindex.html
CONCURRENTLY
[...] With this option, the command instead waits until conflicting transactions have completed.

Do you have a longrunning txn ?  That's possibly the cause of the original
issue; vacuum cannot mark tuples as dead until txns have finished.

> I consider this ultimately a bug, or at the very least there is room for
> improvement. And I am on version 11.1.

Did you try upgrading to 11.2?  I suspect this doesn't affect the query plan, but it may be relevant:

https://www.postgresql.org/docs/11/release-11-2.html
|Improve ANALYZE's handling of concurrently-updated rows (Jeff Janes, Tom Lane)
|
|Previously, rows deleted by an in-progress transaction were omitted from ANALYZE's sample, but this has been found to
|lead to more inconsistency than including them would do. In effect, the sample now corresponds to an MVCC snapshot as of
|ANALYZE's start time.
 

Justin


On Sun, Feb 24, 2019 at 10:02 AM Gunther <raj@gusw.net> wrote:
> David Rowley and Justin Pryzby suggested things about autovacuum. But I don't think autovacuum has any helpful role
> here.

My suspicion is that this has something to do with the behavior of
B-Tree indexes with lots of duplicates. See also:


https://www.postgresql.org/message-id/flat/CAH2-Wznf1uVBguutwrvR%2B6NcXTKYhagvNOY3-dg9dzcYiu_vKw%40mail.gmail.com#993f152a41a1e2c257d12d118aa7ebfc

I am working on a patch to address the problem which is slated for
Postgres 12. I give an illustrative example of one of the problems
that my patch addresses here (it actually addresses a number of
distinct issues all at once):

https://postgr.es/m/CAH2-Wzmf0fvVhU+SSZpGW4Qe9t--j_DmXdX3it5JcdB8FF2EsA@mail.gmail.com

Do you think that that could be a significant factor here? I found
your response to my initial questions unclear.

Using a Postgres table as a queue is known to create particular
problems with bloat, especially index bloat. Fixing the underlying
behavior in the nbtree code would likely sharply limit the growth in
index bloat over time, though it still may not make queue-like tables
completely painless to operate. The problem is described in high level
terms from a user's perspective here:

https://brandur.org/postgres-queues

--
Peter Geoghegan


Also, the REINDEX command always fails with a deadlock because there is a row lock and a complete table lock involved.

I consider this ultimately a bug, or at the very least there is room for improvement. And I am on version 11.1.
regards,
-Gunther


REINDEX doesn't work concurrently yet (slated for v12).

I think your solution may be something like this:
1. Create a new table, same columns, partitioned on the pending column.
2. Rename your existing queue table to old_queue.
3. Rename the new table to queue.
4. Add old_queue as the default partition of queue.
5. Add a new partition for pending = true rows; set the fillfactor kind of low, maybe 50 (you can always change it). Now your pending = true rows can be in one of two places, but your pending = false rows are all in old_queue.
6. Add all existing old_queue indexes (except those that are partial indexes on pending) to queue; these will be created on the new (empty) partition and just matched to the existing indexes on old_queue.
7. If pending = true records all ultimately become pending = false, wait for normal attrition to reach a state where all rows in the default partition are pending = false. If that won't happen, you may need to manually migrate some with a DELETE-INSERT.
8. At this point, you can transactionally remove old_queue as a partition of queue and then immediately re-add it as the pending = false partition. There won't need to be a default partition.
9. Drop all remaining partial indexes on pending; they're no longer useful.

That's roughly my plan for my own hotspot table when we can upgrade to 11.

 
On Sun, Feb 24, 2019 at 04:34:34PM -0500, Corey Huinker wrote:
> I think your solution may be something like this:
> 1. Create a new table, same columns, partitioned on the pending column.
> 2. Rename your existing queue table old_queue to the partitioned table as a
> default partition.
> 3. Rename new table to queue
> 4. add old_queue as the default partition of queue
> 5. add a new partition for pending = true rows, set the fillfactor kind of

FYI, the "default partition" isn't just for various and sundry uncategorized
tuples (like a relkind='r' inheritance child without any constraint).  It's for
"tuples which are excluded by every other partition".  And "row migration"
doesn't happen during ALTER..ATTACH, only UPDATE.  So you'll be unable to
attach a partition for pending=true if the default partition includes any such
rows:

|ERROR:  updated partition constraint for default partition "t0" would be violated by some row

I think you'll need to schedule a maintenance window, create a new partitioned
hierarchy, and INSERT INTO queue SELECT * FROM old_queue, or similar.
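In sketch form (table and column names assumed from upthread; syntax assumes v11, run during the maintenance window with workers paused — DDL is transactional in Postgres, so the swap is atomic):

```sql
BEGIN;

ALTER TABLE Queue RENAME TO old_queue;

-- New partitioned hierarchy, split on the pending flag.
CREATE TABLE Queue (
    jobId   bigint,
    action  text,
    pending boolean NOT NULL,
    result  text
) PARTITION BY LIST (pending);

CREATE TABLE Queue_pending PARTITION OF Queue
    FOR VALUES IN (true) WITH (fillfactor = 50);
CREATE TABLE Queue_done PARTITION OF Queue
    FOR VALUES IN (false);

-- Rows are routed to the right partition on insert.
INSERT INTO Queue SELECT * FROM old_queue;

DROP TABLE old_queue;

COMMIT;
```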

Justin


On Sun, Feb 24, 2019 at 5:43 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> FYI, the "default partition" isn't just for various and sundry uncategorized
> tuples. [...] So you'll be unable to attach a partition for pending=true if
> the default partition includes any such rows.

Good point, I forgot about that. I had also considered making a partitioned table, adding a "true" partition to it, and then making the partitioned table an inheritance child of the existing table; then siphoning off rows from the original table until it has no more pending rows; and finally, in one transaction, de-inheriting the partitioned table and attaching the original table as the false partition. It's all a lot of acrobatics to try to minimize downtime. It could be done more simply with a longer maintenance window, but I got the impression from the OP that big windows were not to be had.


 
On Sat, Feb 23, 2019 at 4:06 PM Gunther <raj@gusw.net> wrote:

the dequeue operation is essentially this:

BEGIN

SELECT jobId, action
  FROM Queue
 WHERE pending
   FOR UPDATE SKIP LOCKED

There is no LIMIT shown.  Wouldn't the first thread to start up just lock all the rows, and everyone else would starve?

Cheers,

Jeff
On Sun, Feb 24, 2019 at 1:02 PM Gunther <raj@gusw.net> wrote:

Thank you all for responding so far.

David Rowley  and Justin Pryzby suggested things about autovacuum. But I don't think autovacuum has any helpful role here. I am explicitly doing a vacuum on that table. And it doesn't help at all. Almost not at all.

If you do a vacuum verbose, what stats do you get back?  What is the size of the index when the degradation starts to show, and immediately after a successful reindex?

Also, how is JobID assigned?  Are they from a sequence, or some other method where they are always added to the end of the index's keyspace?

When it starts to degrade, what is the EXPLAIN plan for the query?
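For example (index name assumed from your earlier post; note that EXPLAIN ANALYZE actually executes the query, so it will take the row locks):

```sql
-- Index size, once degradation sets in vs. right after a REINDEX.
SELECT pg_size_pretty(pg_relation_size('queue_idx_pending'));

-- The dequeue plan once things slow down.
EXPLAIN (ANALYZE, BUFFERS)
  SELECT jobId, action FROM Queue WHERE pending FOR UPDATE SKIP LOCKED;
```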

Cheers,

Jeff

Wow, yes, partition instead of index, that is interesting. Thanks Corey and Justin.

The index isn't required at all if all my pending jobs are in a partition of only pending jobs. In that case the plan can just be a sequential scan.

And Jeff Janes, sorry, I failed to show the LIMIT 1 clause on my dequeue query. That was an omission. My query is actually somewhat more complex; I just translated it down to the essentials but forgot the LIMIT 1 clause:

SELECT seqNo, action
  FROM Queue
 WHERE pending
   AND ... other criteria ...
 LIMIT 1
   FOR UPDATE SKIP LOCKED;

And sorry I didn't capture the stats for vacuum verbose. And they would be confusing because there are other things involved.

Anyway, I think the partitioned table is the right and brilliant solution, because an index really isn't required. The actual pending partition will always remain quite small. Being a queue, it doesn't even matter how big it might grow, as long as new rows are inserted at the end of the data file, not in the middle, and there is some way to skip quickly over the dead rows at the beginning that have already been processed and moved away.

The good thing is, I don't worry about a maintenance window. I have the leisure to simply tear down my design now and make a better one. What's 2 million transactions if I can re-process them at a rate of 80/s? 7 hours max. I am still in development, so no need to worry about migration/transition acrobatics. So if I take Corey's steps and envision the final result, without worrying about the transition steps, then I understand this:

1. Create the Queue table partitioned on the pending column, this creates the partition with the pending jobs (on which I set the fillfactor kind of low, maybe 50) and the default partition with all the rest. Of course that allows people with a constant transaction volume to also partition on jobId or completionTime and move chunks out to cold archive storage. But that's beside the current point.

2. Add all needed indexes on the partitioned table, except the main partial index that I used before and that required all that reindexing maintenance. Actually I won't need any other indexes really; why invite another similar problem again?

That's really simple.

One question I have though: I imagine our pending partition heap file to now be essentially sequentially organized as a queue. New jobs are appended at the end, old jobs are at the beginning. As pending jobs become completed (pending = false) these initial rows will be marked as dead. So, while the number of live rows will remain small in that pending partition, sequential scans will have to skip over the dead rows in the beginning.

Does PostgreSQL structure its files such that skipping over dead rows is fast? Or do the dead rows have to be read and discarded during a table scan?

Of course vacuum eliminates dead rows, but unless I do VACUUM FULL, it will not re-pack the live rows, and that requires an exclusive table lock. So, what is the benefit of vacuuming that pending partition? What I don't want is insertions of new jobs going into open slots at the beginning of the file; I want them to be appended (in Oracle there is an INSERT /*+APPEND*/ hint for that). How does that work in PostgreSQL?

Ultimately that partition will amass too many dead rows, then what do I do? I don't think that the OS has a way to truncate files physically from the head, does it? I guess it could set the file pointer from the first block to a later block. But I don't know of an IOCTL/FCNTL command for that. On some OS there is a way of making blocks sparse again, is that how PostgreSQL might do it? Just knock out blocks as sparse from the front of the file?

If not, the next thing I can think of is to partition the table further by time, maybe alternating even and odd days, such that on any given day one of the two pending partitions is quiet. Is that how it's done?

regards,
-Gunther






Wow, yes, partition instead of index, that is interesting. Thanks Corey and Justin.

The index isn't required at all if all my pending jobs are in a partition of only pending jobs. In that case the plan can just be a sequential scan.

And Jeff James, sorry, I failed to show the LIMIT 1 clause on my dequeue query. That was an omission. My query is actually somewhat more complex and I just translated it down to the essentials but forgot the LIMIT 1 clause.

SELECT seqNo, action FROM QueueWHERE pending  AND ... other criteria ...LIMIT 1FOR UPDATE SKIP LOCKED; 

And sorry I didn't capture the stats for vacuum verbose. And they would be confusing because there are other things involved.

Anyway, I think the partitioned table is the right and brilliant solution, because an index really isn't required. The actual pending partition will always remain quite small, and being a queue, it doesn't even matter how big it might grow, as long as new rows are inserted at the end and not in the middle of the data file and still there be some way of fast skip over the part of the dead rows at the beginning that have already been processed and moved away.

Good thing is, I don't worry about maintenance window.  I have the leisure to simply tear down my design now and make a better design. What's 2 million transactions if I can re-process them at a rate of 80/s? 7 hours max. I am still in development. So, no need to worry about migration / transition acrobatics. So if I take Corey's steps and envision the final result, not worrying about the transition steps, then I understand this:

1. Create the Queue table partitioned on the pending column. This creates the partition with the pending jobs (on which I set the fillfactor kind of low, maybe 50) and the default partition with all the rest. Of course this also allows people with a constant transaction volume to partition on jobId or completionTime and move chunks out to cold archive storage, but that's beside the current point.

2. Add all needed indexes on the partitioned table, except the main partial index that I used before and that required all that reindexing maintenance. Actually, I won't need any other indexes; why invite a similar problem again?

That's really simple.
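To make sure I understand the two steps, here is a minimal sketch of what I imagine the layout would look like (declarative partitioning, so PostgreSQL 11 or later for the DEFAULT partition and for UPDATEs that move rows between partitions; the partition names are my own invention):

```sql
-- Sketch only: list-partition the queue on the pending flag.
CREATE TABLE queue (
  jobid   bigint,
  action  text,
  pending boolean NOT NULL,
  result  text
) PARTITION BY LIST (pending);

-- Hot partition: only pending jobs live here;
-- fillfactor kept low as described in step 1.
CREATE TABLE queue_pending PARTITION OF queue
  FOR VALUES IN (true)
  WITH (fillfactor = 50);

-- Everything else (completed jobs) falls into the default partition.
CREATE TABLE queue_done PARTITION OF queue DEFAULT;
```

An UPDATE that flips pending to false would then automatically move the row out of the small hot partition into the default one.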

One question I have, though: I imagine our pending partition's heap file would now be organized essentially sequentially, as a queue. New jobs are appended at the end; old jobs are at the beginning. As pending jobs become completed (pending = false), these initial rows will be marked dead. So, while the number of live rows will remain small in that pending partition, sequential scans will have to skip over the dead rows at the beginning.

Does PostgreSQL structure its files such that skipping over dead rows is fast? Or do the dead rows have to be read and discarded during a table scan?

Of course vacuum eliminates dead rows, but unless I do VACUUM FULL, it will not re-pack the live rows, and that requires an exclusive table lock. So, what is the benefit of vacuuming that pending partition? What I don't want is for insertion of new jobs to go into open slots at the beginning of the file; I want them to be appended (in Oracle there is an INSERT /*+ APPEND */ hint for that). How does that work in PostgreSQL?

Ultimately that partition will amass too many dead rows, then what do I do? I don't think that the OS has a way to truncate files physically from the head, does it? I guess it could set the file pointer from the first block to a later block. But I don't know of an IOCTL/FCNTL command for that. On some OS there is a way of making blocks sparse again, is that how PostgreSQL might do it? Just knock out blocks as sparse from the front of the file?

If not, the next thing I can think of is to partition the table further by time, maybe alternating even and odd days, such that on any given day one of the two pending partitions is quiet. Is that how it's done?

regards,
-Gunther






On Sun, Feb 24, 2019 at 10:06:10PM -0500, Gunther wrote:
> The index isn't required at all if all my pending jobs are in a partition of
> only pending jobs. In that case the plan can just be a sequential scan.
..
> because an index really isn't required. The actual pending partition will
> always remain quite small, and being a queue, it doesn't even matter how big
> it might grow, as long as new rows are inserted at the end and not in the
> middle of the data file and still there be some way of fast skip over the
> part of the dead rows at the beginning that have already been processed and
> moved away.
..
> One question I have though: I imagine our pending partition heap file to now
> be essentially sequentially organized as a queue. New jobs are appended at
> the end, old jobs are at the beginning. As pending jobs become completed
> (pending = false) these initial rows will be marked as dead. So, while the
> number of live rows will remain small in that pending partition, sequential
> scans will have to skip over the dead rows in the beginning.
> 
> Does PostgreSQL structure its files such that skipping over dead rows is
> fast? Or do the dead rows have to be read and discarded during a table scan?
..
> Of course vacuum eliminates dead rows, but unless I do vacuum full, it will
> not re-pack the live rows, and that requires an exclusive table lock. So,
> what is the benefit of vacuuming that pending partition? What I _/don't/_
> want is insertion of new jobs to go into open slots at the beginning of the
> file. I want them to be appended (in Oracle there is an INSERT /*+APPEND*/
> hint for that. How does that work in PostgreSQL?
> 
> Ultimately that partition will amass too many dead rows, then what do I do?

Why don't you want to reuse free space in the table ?

When you UPDATE the tuples and set pending='f', the row will be moved to
another partition, and the "dead" tuple in the pending partition can be reused
for a future INSERT.  The table will remain small, as you said, only as large
as the number of items in the pending queue, plus tuples not yet vacuumed and
not yet available for reuse.

(BTW, index scans do intelligently skip over dead rows if the table/index are
vacuumed sufficiently often).
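To get vacuuming "sufficiently often" on a small, high-churn partition, the per-table autovacuum storage parameters can be tightened; a sketch with an assumed partition name and assumed values:

```sql
-- Sketch: trigger autovacuum on a fixed dead-row count rather than a
-- fraction of the (deliberately small) table, and remove the cost-based
-- throttling so each vacuum pass finishes quickly.
ALTER TABLE queue_pending SET (
  autovacuum_vacuum_scale_factor = 0.0,
  autovacuum_vacuum_threshold    = 1000,
  autovacuum_vacuum_cost_delay   = 0
);
```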

> 1. Create the Queue table partitioned on the pending column, this creates
> the partition with the pending jobs (on which I set the fillfactor kind of
> low, maybe 50) and the default partition with all the rest. Of course that
> allows people with a constant transaction volume to also partition on jobId
> or completionTime and move chunks out to cold archive storage. But that's
> beside the current point.

I suggest you might want to partition on something other than (or in addition
to) the boolean column.  For example, if you have a timestamp column for
"date_processed", then the active queue would be for "processed IS NULL" (which
I think would actually have to be the DEFAULT partition).  Or you could use
sub-partitioning, or partition on multiple columns (pending, date_processed) or
similar.
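A sketch of that layout (names are assumptions; requires PostgreSQL 11 for the DEFAULT partition). Rows whose range-partitioning key is NULL route to the DEFAULT partition, so the active queue is exactly the default partition:

```sql
CREATE TABLE queue (
  jobid          bigint,
  action         text,
  result         text,
  date_processed timestamptz
) PARTITION BY RANGE (date_processed);

-- Unprocessed rows (date_processed IS NULL) land in the default partition.
CREATE TABLE queue_active PARTITION OF queue DEFAULT;

-- Processed rows, one partition per month, easy to detach and archive.
CREATE TABLE queue_2019_02 PARTITION OF queue
  FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');
```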

Justin



Anyway, I think the partitioned table is the right and brilliant solution, because an index really isn't required. The pending partition will always remain quite small, and being a queue, it doesn't even matter how big the table grows, as long as new rows are appended at the end of the data file rather than inserted into the middle, and there is some fast way to skip over the dead rows at the beginning that have already been processed and moved away.

Good thing is, I don't worry about maintenance window.  I have the leisure to simply tear down my design now and make a better design. What's 2 million transactions if I can re-process them at a rate of 80/s? 7 hours max. I am still in development. So, no need to worry about migration / transition acrobatics. So if I take Corey's steps and envision the final result, not worrying about the transition steps, then I understand this:

1. Create the Queue table partitioned on the pending column, this creates the partition with the pending jobs (on which I set the fillfactor kind of low, maybe 50) and the default partition with all the rest. Of course that allows people with a constant transaction volume to also partition on jobId or completionTime and move chunks out to cold archive storage. But that's beside the current point.

I'm guessing there's a fairly insignificant performance difference between a true partition plus a false partition versus a true partition plus a default partition, but I don't have insight into which one is better.


One question I have though: I imagine our pending partition heap file to now be essentially sequentially organized as a queue. New jobs are appended at the end, old jobs are at the beginning. As pending jobs become completed (pending = false) these initial rows will be marked as dead. So, while the number of live rows will remain small in that pending partition, sequential scans will have to skip over the dead rows in the beginning.

That's basically true, but vacuum erases deleted rows, and that space gets re-used. So the table builds up to a working-set size, and I envision it looking like a clock sweep: your existing rows are at 11pm to 7pm, your new rows are inserting into space at 8pm that was vacuumed clean a while ago, and 9pm and 10pm hold deleted rows that haven't been vacuumed yet. The empty spot just keeps cycling through the table.
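One way to watch that working set cycle in practice is the standard statistics view; something like (table name assumed):

```sql
-- Live vs. dead tuples and the last autovacuum pass for the queue table.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
 WHERE relname = 'queue_pending';
```

If n_dead_tup stays bounded and last_autovacuum advances regularly, the space is being recycled as described.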

Of course vacuum eliminates dead rows, but unless I do VACUUM FULL, it will not re-pack the live rows, and that requires an exclusive table lock. So, what is the benefit of vacuuming that pending partition? What I don't want is for insertion of new jobs to go into open slots at the beginning of the file; I want them to be appended (in Oracle there is an INSERT /*+ APPEND */ hint for that). How does that work in PostgreSQL?

See above: the db (tries to) reuse existing free space before new space is allocated.

I don't know of an append equivalent for pgsql. If memory serves, the big win of /*+ APPEND */ was that raw data blocks were assembled out-of-band and then just written to disk.
 
Ultimately that partition will amass too many dead rows, then what do I do? I don't think that the OS has a way to truncate files physically from the head, does it? I guess it could set the file pointer from the first block to a later block. But I don't know of an IOCTL/FCNTL command for that. On some OS there is a way of making blocks sparse again, is that how PostgreSQL might do it? Just knock out blocks as sparse from the front of the file?

See clock sweep analogy above.
 
On Sat, Feb 23, 2019 at 4:06 PM Gunther <raj@gusw.net> wrote:

Hi,

I am using an SQL queue for distributing work to massively parallel workers.

You should look into specialized queueing software.

...

I figured I might just pause all workers briefly to schedule the REINDEX Queue command, but the problem with this is that while the transaction volume is large, some jobs may take minutes to process, and in that case we need to wait minutes to quiet the database, with 47 workers sitting as idle capacity waiting for the 48th to finish so that the index can be rebuilt!

The jobs that take minutes are themselves the problem.  They prevent tuples from being cleaned up, meaning all the other jobs need to grovel through the detritus every time they need to claim a new row.  If you got those long-running jobs to end, you probably wouldn't even need to reindex; the problem would go away on its own as the dead-to-all tuples get cleaned up.

Locking a tuple and leaving the transaction open for minutes is going to cause no end of trouble on a highly active system.  You should look at a three-state method where the tuple can be pending/claimed/finished, rather than pending/locked/finished.  That way the process commits immediately after claiming the tuple, and then records the outcome in another transaction once it is done processing.  You will need a way to detect processes that failed after claiming a row but before finishing, but implementing that is going to be easier than all of this re-indexing stuff you are trying to do now.  You would claim the row by updating a field in it to have something distinctive about the process, like its hostname and pid, so you can figure out if it is still running when it comes time to clean up apparently forgotten entries.
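A sketch of that three-state approach (the claimed_by/claimed_at columns, the worker identity string, and the timeout are all hypothetical choices, not part of the original schema):

```sql
-- Transaction 1: claim a job and commit immediately, so no lock is
-- held while the job runs.
BEGIN;
UPDATE queue
   SET claimed_by = 'worker-host:4711',   -- e.g. hostname:pid of this worker
       claimed_at = now()
 WHERE jobid = (SELECT jobid
                  FROM queue
                 WHERE pending
                   AND claimed_by IS NULL
                 LIMIT 1
                 FOR UPDATE SKIP LOCKED)
 RETURNING jobid, action;
COMMIT;

-- Transaction 2, after the work is done: record the outcome.
UPDATE queue
   SET pending = false, result = '...'
 WHERE jobid = ...;  -- the jobid returned by the claim step

-- Reaper: release claims whose worker apparently died.
UPDATE queue
   SET claimed_by = NULL, claimed_at = NULL
 WHERE pending
   AND claimed_by IS NOT NULL
   AND claimed_at < now() - interval '10 minutes';
```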

Cheers,

Jeff
On Mon, Feb 25, 2019 at 11:13 AM Gunther Schadow <gunther@pragmaticdata.com> wrote:

Anyway, I think the partitioned table is the right and brilliant solution, because an index really isn't required. The pending partition will always remain quite small, and being a queue, it doesn't even matter how big the table grows, as long as new rows are appended at the end of the data file rather than inserted into the middle, and there is some fast way to skip over the dead rows at the beginning that have already been processed and moved away.

Why do you want to do that?  If you are trying to force the queue to be handled in a "fair" order, then this isn't the way to do it, you would want to add an "ORDER BY" to your dequeueing query (in which case you are probably back to adding an index).  

Once the space in the beginning of the table has been reclaimed as free, then it will be reused for newly inserted tuples.  After the space is freed up but before it is reused, the seq scan can't skip those blocks entirely, but it can deal with the blocks quickly because they are empty.  If the blocks are full of dead but not freed tuples (because the long-running transactions are preventing them from being cleaned up) then it will have to go through each dead tuple to satisfy itself that it actually is dead.  This might not be as bad as it is for indexes, but certainly won't be good for performance.

 Cheers,

Jeff
Was wondering when that would come up: taking queueing logic outside the database. It can be overly painful architecting queueing logic in relational databases, imho.

Regards,
Michael Vitale
