Thread: Question about optimizing access to a table.

Question about optimizing access to a table.

From
Herouth Maoz
Date:

Hello.

I have one particular table with very specialized use. I am sending messages to some partner. The partner processes them asynchronously, and then returns the status report to me. The table is used to store a serialized version of the message object, together with a few identifiers, expiry date, and a reference from the partner. When the status report comes back from the partner, we:

  1. Select the record using the partner's reference number
  2. reconstruct and queue the object.
  3. Delete the record from database using the serial primary key ID.

Every once in a while we run a garbage collection process which selects expired messages, reconstructs and queues them for processing, and deletes the record using the primary key.
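
In simplified form, the statements involved look more or less like this (a sketch only; the actual SQL comes from the application and may differ slightly, and the garbage collector's selection condition is only approximated here):

    -- status report arrives: look the message up by the partner's reference
    SELECT id, sm FROM transient WHERE smsc_reference = ? AND msisdn = ?;
    -- once the object has been reconstructed and queued, drop the row by primary key
    DELETE FROM transient WHERE id = ?;

    -- periodic garbage collection: select expired messages, then delete each one by id
    SELECT id, sm FROM transient WHERE validity < now();
    DELETE FROM transient WHERE id = ?;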

This works wonderfully as long as the table remains small - a few thousand records waiting for their status report, and that's it. The table is set to have frequent auto-analyze runs.

The problem starts when our partner has some glitch, under high load, and fails to send back a few hundred thousand reports. In that case, the table grows to a few hundred thousand records, and they are not deleted until they hit their expiry date, at which point the "garbage collector" takes care of them and everything goes back to normal. When it contains hundreds of thousands of records, performance deteriorates considerably.

I am trying to figure out a solution that will keep the system working well even when there is a large number of records in the table. At first I thought of partitioning the table on the partner's reference field. But the DELETE would actually slow down if I do this, right?

Any ideas?

Here is the table structure (some irrelevant fields cut):

     Column      |            Type             |                       Modifiers                        
-----------------+-----------------------------+--------------------------------------------------------
 id              | bigint                      | not null default nextval('transient_id_seq'::regclass)
 time_submitted  | timestamp without time zone | not null
 validity        | timestamp without time zone | 
 msisdn          | character varying(16)       | not null
 sme_reference   | integer                     | not null
 smsc_reference  | numeric(21,0)               | not null
 sm              | text                        | 
Indexes:
    "transient_pkey" PRIMARY KEY, btree (id)
    "transient_msisdn_ref" btree (msisdn, smsc_reference)
    "transient_sm_vp" btree (validity)
Has OIDs: no
Options: autovacuum_enabled=true, autovacuum_analyze_threshold=200, autovacuum_analyze_scale_factor=0.001


TIA,
Herouth

Re: Question about optimizing access to a table.

From
Kevin Grittner
Date:
Herouth Maoz <herouth@unicell.co.il> wrote:

> The problem starts when our partner has some glitch, under high
> load, and fails to send back a few hundred thousand reports. In
> that case, the table grows to a few hundred thousand records, and they are
> not deleted until they hit their expiry date, at which point the
> "garbage collector" takes care of them and everything goes back
> to normal. When it contains hundreds of thousands of records,
> performance deteriorates considerably.

First, make sure that you are on the latest minor release of
whatever major release you are running.  There were some serious
problems with autovacuum's table truncation when a table was used
as a queue and size fluctuated.  These are fixed in the latest set
of minor releases.

If that doesn't clear up the problem, please post an actual slow
query to the pgsql-performance list, with its EXPLAIN ANALYZE
output and other details, as suggested here:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

People will be able to provide more useful and specific advice if
they have the additional detail.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Question about optimizing access to a table.

From
Jeff Janes
Date:

On Tue, Dec 10, 2013 at 8:23 AM, Herouth Maoz <herouth@unicell.co.il> wrote:

Hello.

I have one particular table with very specialized use. I am sending messages to some partner. The partner processes them asynchronously, and then returns the status report to me. The table is used to store a serialized version of the message object, together with a few identifiers, expiry date, and a reference from the partner. When the status report comes back from the partner, we:

  1. Select the record using the partner's reference number
  2. reconstruct and queue the object.
  3. Delete the record from database using the serial primary key ID.

Where do you "queue" the object?  Back into a different table within the same database?  Why not use "DELETE ... RETURNING"?
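
Something along these lines (just a sketch, using the column names from your schema) would fold the lookup and the delete into a single statement, and a single index traversal instead of two:

    -- remove the row and hand back the payload in the same statement
    DELETE FROM transient
     WHERE smsc_reference = ? AND msisdn = ?
    RETURNING id, sm;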
 

Every once in a while we run a garbage collection process which selects expired messages, reconstructs and queues them for processing, and deletes the record using the primary key.

This works wonderfully as long as the table remains small - a few thousand records waiting for their status report, and that's it. The table is set to have frequent auto-analyze runs.

You might want to turn 
 

The problem starts when our partner has some glitch, under high load, and fails to send back a few hundred thousand reports. In that case, the table grows to a few hundred thousand records, and they are not deleted until they hit their expiry date, at which point the "garbage collector" takes care of them and everything goes back to normal. When it contains hundreds of thousands of records, performance deteriorates considerably.

There is no inherent reason the performance needs to degrade.  Can you give the specific queries that perform worse?

 
I am trying to figure out a solution that will keep the system working well even when there is a large number of records in the table. At first I thought of partitioning the table on the partner's reference field. But the DELETE would actually slow down if I do this, right?

Not necessarily, but partitioning should be your last resort, not your first, and I don't see any reason it would be needed here.

Indexes:
    "transient_pkey" PRIMARY KEY, btree (id)
    "transient_msisdn_ref" btree (msisdn, smsc_reference)
    "transient_sm_vp" btree (validity)

What is the partner reference?  If it is smsc_reference, then you probably need an index in which that is the lead (or only) column.
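
If so, a sketch of such an index might look like this (the index name is made up, and CONCURRENTLY keeps the table writable while it builds):

    -- hypothetical index with the partner reference as the leading column
    CREATE INDEX CONCURRENTLY transient_smsc_ref
        ON transient (smsc_reference, msisdn);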

Cheers,

Jeff
 

Re: Question about optimizing access to a table.

From
Herouth Maoz
Date:

On 10/12/2013, at 20:55, Jeff Janes wrote:


On Tue, Dec 10, 2013 at 8:23 AM, Herouth Maoz <herouth@unicell.co.il> wrote:

Hello.

I have one particular table with very specialized use. I am sending messages to some partner. The partner processes them asynchronously, and then returns the status report to me. The table is used to store a serialized version of the message object, together with a few identifiers, expiry date, and a reference from the partner. When the status report comes back from the partner, we:

  1. Select the record using the partner's reference number
  2. reconstruct and queue the object.
  3. Delete the record from database using the serial primary key ID.

Where do you "queue" the object?  Back into a different table within the same database?  Why not use "DELETE ...RETURNING"?

The object is used in a Java application. Basically, it goes into a queue in memory. Some of its data is updated in two different tables in the same database at a later time, but only after some logic has been performed which may or may not change them relative to what they were when they came from the partner.

I'm not using DELETE...RETURNING because basically the application that's running is old code, originally written around 2002 or so, before we even had PostgreSQL in the organization. I will probably not get permission to change this code without very good reason as it is a sensitive bit of code.

 

Every once in a while we run a garbage collection process which selects expired messages, reconstructs and queues them for processing, and deletes the record using the primary key.

This works wonderfully as long as the table remains small - a few thousand records waiting for their status report, and that's it. The table is set to have frequent auto-analyze runs.

You might want to turn 

...yes?

 

The problem starts when our partner has some glitch, under high load, and fails to send back a few hundred thousand reports. In that case, the table grows to a few hundred thousand records, and they are not deleted until they hit their expiry date, at which point the "garbage collector" takes care of them and everything goes back to normal. When it contains hundreds of thousands of records, performance deteriorates considerably.

There is no inherent reason the performance needs to degrade.  Can you give the specific queries that perform worse?

They are the simplest queries possible. The performance deterioration is only fractions of a second, you see. But when we have thousands of status reports per minute, it builds up to a large delay after a few minutes.

The queries are:
select id, sm from transient where smsc_reference = ? and msisdn = ?
and then:
delete from transient where id = ?


 
I am trying to figure out a solution that will keep the system working well even when there is a large number of records in the table. At first I thought of partitioning the table on the partner's reference field. But the DELETE would actually slow down if I do this, right?

Not necessarily, but partitioning should be your last resort, not your first, and I don't see any reason it would be needed here.

Indexes:
    "transient_pkey" PRIMARY KEY, btree (id)
    "transient_msisdn_ref" btree (msisdn, smsc_reference)
    "transient_sm_vp" btree (validity)

What is the partner reference?  If it is smsc_reference, then you probably need an index in which that is the lead (or only) column.

Well, I simplified a bit, but as you can see from the query above, it queries on both the smsc_reference and the msisdn.


Thanks,
Herouth

Re: Question about optimizing access to a table.

From
Herouth Maoz
Date:
On 10/12/2013, at 20:55, Kevin Grittner wrote:

> Herouth Maoz <herouth@unicell.co.il> wrote:
>
>> The problem starts when our partner has some glitch, under high
>> load, and fails to send back a few hundred thousand reports. In
>> that case, the table grows to a few hundred thousand records, and they are
>> not deleted until they hit their expiry date, at which point the
>> "garbage collector" takes care of them and everything goes back
>> to normal. When it contains hundreds of thousands of records,
>> performance deteriorates considerably.
>
> First, make sure that you are on the latest minor release of
> whatever major release you are running.  There were some serious
> problems with autovacuum's table truncation when a table was used
> as a queue and size fluctuated.  These are fixed in the latest set
> of minor releases.

Thank you. Indeed, I failed to mention which version of PostgreSQL I was on. 9.1.2 in this case. Do you mean that I have to go to 9.3.x or simply to 9.1.11?

> If that doesn't clear up the problem, please post an actual slow
> query to the pgsql-performance list, with its EXPLAIN ANALYZE
> output and other details, as suggested here:
>
> http://wiki.postgresql.org/wiki/SlowQueryQuestions
>
> People will be able to provide more useful and specific advice if
> they have the additional detail.


Thank you. I think it's more a matter of design than an issue with the query. The queries themselves are the simplest form of SELECT and DELETE possible.

Herouth



Re: Question about optimizing access to a table.

From
Kevin Grittner
Date:
Herouth Maoz <herouth@unicell.co.il> wrote:
> On 10/12/2013, at 20:55, Kevin Grittner wrote:

>> First, make sure that you are on the latest minor release of
>> whatever major release you are running.  There were some serious
>> problems with autovacuum's table truncation when a table was
>> used as a queue and size fluctuated.  These are fixed in the
>> latest set of minor releases.
>
> I failed to mention which version of PostgreSQL I was on.
> 9.1.2 in this case. Do you mean that I have to go to 9.3.x or
> simply to 9.1.11?

In the PostgreSQL world, a minor release is when everything to the
left of the second dot stays the same.

http://www.postgresql.org/support/versioning/

9.1.11 contains the fix for the problems I was describing.  While
it's a good idea to stay current with minor releases in general,
for those using a table for a queue which may fluctuate in size,
there is an important set of fixes that make it particularly
important right now.  I have seen multiple sites using Slony where,
without the fixes, the database became unusable for normal
production after falling behind and subsequently catching up.
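
A quick way to confirm which minor release the server is actually running:

    SELECT version();
    -- or just the version string by itself
    SHOW server_version;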

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Question about optimizing access to a table.

From
Jeff Janes
Date:
On Wed, Dec 11, 2013 at 1:49 AM, Herouth Maoz <herouth@unicell.co.il> wrote:

On 10/12/2013, at 20:55, Jeff Janes wrote:


On Tue, Dec 10, 2013 at 8:23 AM, Herouth Maoz <herouth@unicell.co.il> wrote:

Hello.

I have one particular table with very specialized use. I am sending messages to some partner. The partner processes them asynchronously, and then returns the status report to me. The table is used to store a serialized version of the message object, together with a few identifiers, expiry date, and a reference from the partner. When the status report comes back from the partner, we:

  1. Select the record using the partner's reference number
  2. reconstruct and queue the object.
  3. Delete the record from database using the serial primary key ID.

Where do you "queue" the object?  Back into a different table within the same database?  Why not use "DELETE ...RETURNING"?

The object is used in a Java application. Basically, it goes into a queue in memory. Some of its data is updated in two different tables in the same database at a later time, but only after some logic has been performed which may or may not change them relative to what they were when they came from the partner.

What happens if your Java application crashes after the delete occurs but before the two different tables get updated?  If this risk is acceptable, perhaps it would also be acceptable to expire your data early when the partner glitches and your table gets too big.
 
 

Every once in a while we run a garbage collection process which selects expired messages, reconstructs and queues them for processing, and deletes the record using the primary key.

This works wonderfully as long as the table remains small - a few thousand records waiting for their status report, and that's it. The table is set to have frequent auto-analyze runs.

You might want to turn 

...yes?


Sorry, before I got distracted by a shiny object, I meant to say you might want to turn up vacuuming, not just analyze.
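
For example, per-table settings along these lines (the values are only illustrative and would need tuning for your workload) make autovacuum reclaim dead rows roughly as eagerly as your existing analyze settings:

    -- illustrative values only; the vacuum counterpart of the aggressive analyze settings
    ALTER TABLE transient SET (
        autovacuum_vacuum_threshold = 200,
        autovacuum_vacuum_scale_factor = 0.001,
        autovacuum_vacuum_cost_delay = 0
    );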
 

 

The problem starts when our partner has some glitch, under high load, and fails to send back a few hundred thousand reports. In that case, the table grows to a few hundred thousand records, and they are not deleted until they hit their expiry date, at which point the "garbage collector" takes care of them and everything goes back to normal. When it contains hundreds of thousands of records, performance deteriorates considerably.

There is no inherent reason the performance needs to degrade.  Can you give the specific queries that perform worse?

They are the simplest queries possible. The performance deterioration is only fractions of a second, you see. But when we have thousands of status reports per minute, it builds up to a large delay after a few minutes.

The queries are:
select id, sm from transient where smsc_reference = ? and msisdn = ?
and then:
delete from transient where id = ?

I thought the problem was that when the partner glitches, the above queries were not getting executed at all, because the stream driving them was no longer feeding them data.  But if they aren't executing, they can't be showing performance deterioration.  Or are those same queries also part of the garbage collection?  Or is it only when the partner stream resumes that the problem shows up?


Assuming the problem is just the log(N) cost of crawling through the bloated indexes, that is where DELETE...RETURNING would really help, by cutting the impact of the bloat roughly in half (one index traversal instead of two).  But if you can't change the code, more aggressive vacuuming might be your best option.  It won't help when the table is bloated with live rows (while the partner is glitching), but it should help it recover faster once the glitch is resolved.
 
If the problem is not the normal log(N) time of using the index, but something more pathological, then we probably need to see an explain (analyze, buffers) of the queries during the bad period.
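
That is, something like the following, run during the bad period with real values substituted for the placeholders (the literals below are made up):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, sm FROM transient WHERE smsc_reference = 12345 AND msisdn = '972501234567';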

Cheers,

Jeff