Thread: UPDATEs slowing SELECTs in a fully cached database

From:
lars
Date:

I am doing some research that will hopefully lead to replacing a big
Oracle installation with a set of PostgreSQL servers.

The current Oracle installation consists of multiple RAC clusters
with 8 RAC nodes each. Each RAC node has 256gb of
memory (to be doubled soon).
The nature of our service is such that over a reasonable time (a day or
so) the database *is* the working set.

So I am looking at Postgres in a context where (almost) all of the data
is cached and disk IO is only required for persistence.

Setup:
PostgreSQL 9.1beta2 on a high memory (~68gb, 12 cores) EC2 Linux
instance (kernel 2.6.35) with the database and
WAL residing on the same EBS volume with EXT4 (data=ordered, barriers=1)
- yes, that is not an ideal setup
(WAL should be on a separate drive, EBS is slow to begin with, etc.), but I am
mostly interested in read performance for a fully cached database.

shared_buffers: varied between 1gb and 20gb
checkpoint_segments/timeout: varied accordingly between 16-256 and
5-10m, resp.
bgwriter tweaked to get a good distribution of checkpoints, bg-writes,
and backend writes.
wal_sync_method: tried fdatasync and open_datasync.
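
For reference, here is one sample point from those ranges as a
postgresql.conf fragment (exact values varied between runs):

```
shared_buffers = 20GB            # varied between 1GB and 20GB
checkpoint_segments = 256        # varied between 16 and 256
checkpoint_timeout = 10min       # varied between 5 and 10 minutes
bgwriter_delay = 20ms            # tweaked along with the lru settings
bgwriter_lru_maxpages = 1000
wal_sync_method = fdatasync      # also tried open_datasync
```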

I read "PostgreSQL 9.0 High Performance", and have spent a
significant amount of time on this already.

PostgreSQL holds up extremely well, once things like "storing
hint bits", checkpoints vs. bgwriter vs. backend writes, etc.
are understood. I installed pg_buffercache and pgfincore to monitor how
and where the database is cached.
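
For example, a pg_buffercache query along these lines shows how much of
each relation sits in shared_buffers:

```sql
-- top relations by number of cached 8kB buffers
SELECT c.relname, count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS cached
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```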

There is one observation that I wasn't able to explain:
A SELECT-only client is severely slowed down by a concurrent client
performing UPDATEs on the same table the other
client selects from, even when the database resides 100% in the cache (I
tried with shared_buffers large enough to hold
the database, and also with a smaller setting relying on the OS cache;
the behavior is the same).

As long as only the reader is running I get great performance (20-30ms
for a query reading a random set of about 10000 rows
out of a 100m row table in a single SELECT). The backend is close to 100%
cpu, which is what I want in a cached database.

Once the writer starts the read performance drops almost immediately to
 >200ms.
The reading backend's cpu drops to <10%, and it is mostly waiting (D
state in top).
The UPDATE also touches a random set of about 10000 rows (in one update
statement); one of the columns touched is
indexed - and that is the same index used for the SELECTs.
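
A sketch of the workload (table and column names are made up; the real
schema differs):

```sql
-- reader: one SELECT over a ~10000-row random range via the indexed column
SELECT * FROM test_table
WHERE indexed_col BETWEEN 1000000 AND 1009999;

-- concurrent writer: one UPDATE over a different random ~10000-row range,
-- modifying the same indexed column (so index maintenance is involved)
UPDATE test_table
SET indexed_col = indexed_col + 1
WHERE indexed_col BETWEEN 5000000 AND 5009999;
```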

What I would have expected is that the SELECTs would just continue to
read from the cached buffers (whether dirtied
or not) and not be affected by concurrent updates. I could not find
anything explaining this.

The most interesting part:
this does not happen with an exact clone of that relation created
UNLOGGED. The same number of buffers get dirtied, and there is the same
amount of checkpointing, bgwriting, and vacuuming. The only difference
is WAL maintenance, as far as I can tell.
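
For reference, the clone can be created roughly like this on 9.1 (names
illustrative):

```sql
-- UNLOGGED tables skip WAL for their data; LIKE ... INCLUDING ALL
-- copies column definitions, defaults, and indexes
CREATE UNLOGGED TABLE test_table_unlogged (LIKE test_table INCLUDING ALL);
INSERT INTO test_table_unlogged SELECT * FROM test_table;
```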

Is there some (intentional or not) synchronization between backends when
the WAL is maintained? Are there times when a
read-only query needs to compete for disk IO when everything is cached? Or
are there any other explanations?

I am happy to provide more information, although I am mainly looking for
a qualitative answer that could explain this behavior.

Thanks.

-- Lars


From:
lars
Date:

I have since moved the WAL to its own EBS volume (ext4, data=writeback)
to make it easier to monitor IO.
The times where the SELECTs slow down coincide with heavy write traffic
to the WAL volume.

Maybe this has to do with WALInsertLock or WALWriteLock (or some other
lock).
Since the slowdown was less severe with WAL on its own volume, it seems
some exclusive lock on the pages in
shared_buffers is held while WAL IO is in progress(?) - that would be
"frustrating". (wal_buffers defaults to 16mb in my setup.)

Next I am going to have a look at the code. I would be thankful for any
further insights, though :)

Thanks.

-- Lars

On 07/07/2011 04:56 PM, lars wrote:


From:
Craig Ringer
Date:

On 11/07/2011 4:34 AM, lars wrote:
> I have since moved the WAL to its own EBS volume (ext4, data=writeback)
> to make it easier to monitor IO.
> The times where the SELECTs slow down coincide with heavy write traffic
> to the WAL volume.

In theory, UPDATEs shouldn't be blocking or slowing SELECTs. Whether
that holds up to the light of reality, real-world hardware, and software
implementation detail, I really don't know. I avoided responding to your
first mail because I generally work with smaller and less performance
critical databases so I haven't accumulated much experience with
fine-tuning.

If your SELECTs were slower *after* your UPDATEs I'd be wondering if
your SELECTs are setting hint bits on the pages touched by the UPDATEs.
See: http://wiki.postgresql.org/wiki/Hint_Bits . It doesn't sound like
that's the case if the SELECTs are slowed down *during* a big UPDATE
that hasn't yet committed, though.
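
If you wanted to rule hint bits out anyway, a rough way is to force them
to be set before the read test, e.g. (sketch, table name illustrative):

```sql
-- a full read of the table sets hint bits ahead of the timed SELECTs;
-- a manual VACUUM also sets them as a side effect
SELECT count(*) FROM test_table;
VACUUM test_table;
```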

Could it just be cache pressure - either on shm, or operating system
disk cache? All the dirty buffers that have to be flushed to WAL and to
the heap may be evicting cached data your SELECTs were benefitting from.
Unfortunately, diagnostics in this area are ... limited ... though some
of the pg_catalog views
(http://www.postgresql.org/docs/9.0/static/catalogs.html) may offer some
information.

--
Craig Ringer

POST Newspapers
276 Onslow Rd, Shenton Park
Ph: 08 9381 3088     Fax: 08 9388 2258
ABN: 50 008 917 717
http://www.postnewspapers.com.au/

From:
lars hofhansl
Date:

Thanks Craig.

Yep, I am not seeing the SELECTs slow down (measurably) during checkpoints
(i.e. when dirty pages are flushed to disk), but only during writing of the WAL
files. The shared buffers and OS cache are big enough to hold the entire database, so
evicting cached data should not be necessary. (The database, including indexes,
can fit into 16 or so GB, and I have 68GB on that machine.)
Interestingly, I initially thought there might be a correlation between
checkpointing and slower SELECTs, but it turns out that checkpointing just
slowed down IO to the WAL - until I moved it to its own drive, which then increased
the effect I was seeing.

I'll do more research and try to provide more useful details.

Thanks for the pg_catalog link, I'll have a look at it.

-- Lars



----- Original Message ----
From: Craig Ringer <>
To: 
Sent: Sun, July 10, 2011 4:11:39 PM
Subject: Re: [PERFORM] UPDATEs slowing SELECTs in a fully cached database



From:
"ktm@rice.edu"
Date:

Hi Lars,

I do not know if this makes sense in PostgreSQL, given that readers
do not block writers and writers do not block readers. Are your
UPDATEs to individual rows, each in a separate transaction, or
do you UPDATE multiple rows in the same transaction? If you
perform multiple updates in a single transaction, you are
synchronizing the changes to that set of rows and that constraint
is causing other readers that need to get the correct values post-
transaction to wait until the COMMIT completes. This means that
the WAL write must be completed.

Have you tried disabling synchronous_commit? If this scenario
holds, you should be able to reduce the slowdown by un-batching
your UPDATEs, as counter-intuitive as that is. This seems to
be similar to a problem that I have been looking at with using
PostgreSQL as the backend to a Bayesian engine. I am following
this thread with interest.
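
For reference, this can be tested per session with:

```sql
-- commits return before the WAL flush completes; a crash can lose the
-- last few transactions, but cannot corrupt the database
SET synchronous_commit = off;
```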

Regards,
Ken

On Thu, Jul 07, 2011 at 04:56:13PM -0700, lars wrote:

From:
Robert Klemme
Date:

On Mon, Jul 11, 2011 at 3:13 PM,  <> wrote:
> I do not know if this makes sense in PostgreSQL, given that readers
> do not block writers and writers do not block readers. Are your
> UPDATEs to individual rows, each in a separate transaction, or
> do you UPDATE multiple rows in the same transaction? If you
> perform multiple updates in a single transaction, you are
> synchronizing the changes to that set of rows and that constraint
> is causing other readers that need to get the correct values post-
> transaction to wait until the COMMIT completes. This means that
> the WAL write must be completed.

What readers should that be?  Docs explicitly state that readers are
never blocked by writers:
http://www.postgresql.org/docs/9.0/interactive/mvcc-intro.html
http://www.postgresql.org/docs/9.0/interactive/mvcc.html

From what I understand about this issue the observed effect must be
caused by the implementation and not by a conceptual issue with
transactions.

> Have you tried disabling synchronous_commit? If this scenario
> holds, you should be able to reduce the slowdown by un-batching
> your UPDATEs, as counter-intuitive as that is. This seems to
> be similar to a problem that I have been looking at with using
> PostgreSQL as the backend to a Bayesian engine. I am following
> this thread with interest.

I don't think this will help (see above).  Also, I would be very
cautious about doing this, because although the client might get a faster
acknowledgement, the DB still has to do the same work as with
synchronous_commit enabled (i.e. WAL, checkpointing etc.), and it also has
to do significantly more transactions than in the batched version.

Typically there is an optimum batch size: if the batch size is too small
(say, one row) the ratio of TX overhead to "work" is too high.  If the
batch size is too large (say, millions of rows) you hit resource
limitations (memory) which inevitably force the RDBMS to do additional
disk IO.
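
As an illustration (table and column names made up):

```sql
-- too small: one transaction per row, per-TX overhead dominates
UPDATE t SET val = val + 1 WHERE id = 42;

-- batched: one statement covering many rows amortizes the
-- per-transaction overhead (commit, WAL flush, snapshot setup)
UPDATE t SET val = val + 1 WHERE id BETWEEN 1 AND 10000;
```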

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From:
"ktm@rice.edu"
Date:

On Mon, Jul 11, 2011 at 05:26:49PM +0200, Robert Klemme wrote:
Okay,

If we assume that the current implementation of MVCC is preventing
readers from blocking writers and writers from blocking readers, then
the application may have some statements that are implicitly locking
the database and conflicting with the UPDATEs. Or maybe the
slowdown is caused by index updates triggered by the write activity.
Just throwing out some ideas.

Regards,
Ken
> --
> remember.guy do |as, often| as.you_can - without end
> http://blog.rubybestpractices.com/
>

From:
"Kevin Grittner"
Date:

lars hofhansl <> wrote:

> Yep, I am not seeing the SELECTs slow down (measurably) during
> checkpoints (i.e. when dirty pages are flushed to disk), but only
> during writing of the WAL files.

How about if you do a whole slew of the UPDATEs and then stop those
and run a bunch of SELECTs?  (I don't think I've seen anything
mentioned so far which rules out hint bit rewrites as an issue.)

I see you have tweaked things to balance the writes -- you might
want to try further adjustments to reduce backend writes and see
what happens.

-Kevin

From:
lars
Date:

On 07/11/2011 10:33 AM, Kevin Grittner wrote:
> lars hofhansl<>  wrote:
>
>> Yep, I am not seeing the SELECTs slow down (measurably) during
>> checkpoints (i.e. when dirty pages are flushed to disk), but only
>> during writing of the WAL files.
>
> How about if you do a whole slew of the UPDATEs and then stop those
> and run a bunch of SELECTs?  (I don't think I've seen anything
> mentioned so far which rules out hint bit rewrites as an issue.)
>
> I see you have tweaked things to balance the writes -- you might
> want to try further adjustments to reduce backend writes and see
> what happens.
>
> -Kevin

Hmm... You are right. Stopping the UPDATEs, waiting for any CHECKPOINTs
to finish,
and then running the SELECTs indeed shows a similar slowdown.

Interestingly I see very heavy WAL traffic while executing the SELECTs.
(So I was confused as to what caused the WAL traffic).

Why do changes to the hint bits need to be logged to the WAL? If we
lose them we can always get that information back from the commit log.
Maybe the backend does not know why the page is dirty and writes it
to the WAL anyway(?)
If that is the case there seems to be room to optimize this.

-- Lars


From:
"Kevin Grittner"
Date:

lars <> wrote:

> Stopping the UPDATEs, waiting for any CHECKPOINTs to finish,
> and then running the SELECTs indeed shows a similar slowdown.
>
> Interestingly I see very heavy WAL traffic while executing the
> SELECTs. (So I was confused as to what caused the WAL traffic).

Hint bit changes aren't logged, so if it was that you would be
seeing writes to the heap, but not to the WAL.  Clean-up of dead
tuples is logged -- this is probably the result of pruning dead
tuples.  You could probably reduce the impact on your SELECT
statements at least a little by making autovacuum more aggressive.

At some point you could see the overhead of autovacuum having some
impact on your SELECT statements, so you may need to experiment to
find the "sweet spot" for your load.
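
For example (values are illustrative starting points, not recommendations):

```
autovacuum_naptime = 10s                  # default 1min
autovacuum_vacuum_scale_factor = 0.05     # default 0.2: vacuum sooner
autovacuum_vacuum_cost_delay = 2ms        # default 20ms: vacuum faster
```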

-Kevin

From:
Merlin Moncure
Date:

On Mon, Jul 11, 2011 at 2:16 PM, Kevin Grittner
<> wrote:
> lars <> wrote:
>
>> Stopping the UPDATEs, waiting for any CHECKPOINTs to finish,
>> and then running the SELECTs indeed shows a similar slowdown.
>>
>> Interestingly I see very heavy WAL traffic while executing the
>> SELECTs. (So I was confused as to what caused the WAL traffic).
>
> Hint bit changes aren't logged, so if it was that you would be
> seeing writes to the heap, but not to the WAL.  Clean-up of dead
> tuples is logged -- this is probably the result of pruning dead
> tuples.  You could probably reduce the impact on your SELECT
> statements at least a little by making autovacuum more aggressive.

yeah.  In fact, I'd like to disable autovacuum completely just to
confirm this.  In particular I'd like to know if that removes WAL
traffic when only selects are going on.  Another way to check is to
throw some queries at pg_stat_activity during your select period and
see if there is any non-select activity (like an autovacuum vacuum).
Basically I'm suspicious there is more to this story.
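
Something like this (column names as of 9.1; they were renamed in later
releases):

```sql
-- anything here besides your SELECTs during the read-only phase?
SELECT procpid, waiting, current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>';
```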

hint bit flushing causing i/o during SELECT is a typical complaint
(especially on systems with underperforming i/o), but I'm suspicious
whether that's really the problem here. Since you are on a virtualized
platform, I can't help but wonder if you are running into some
bottleneck that you wouldn't see on native hardware.

What's iowait during the slow period?

merlin

From:
"Kevin Grittner"
Date:

Merlin Moncure <> wrote:
> Kevin Grittner <> wrote:
>> lars <> wrote:
>>
>>> Stopping the UPDATEs, waiting for any CHECKPOINTs to finish,
>>> and then running the SELECTs indeed shows a similar slowdown.
>>>
>>> Interestingly I see very heavy WAL traffic while executing the
>>> SELECTs. (So I was confused as to what caused the WAL traffic).
>>
>> Hint bit changes aren't logged, so if it was that you would be
>> seeing writes to the heap, but not to the WAL.  Clean-up of dead
>> tuples is logged -- this is probably the result of pruning dead
>> tuples.  You could probably reduce the impact on your SELECT
>> statements at least a little by making autovacuum more
>> aggressive.
>
> yeah.  In fact, I'd like to disable autovacuum completely just to
> confirm this.

If I'm right, disabling autovacuum would tend to make this *worse*.

> In particular I'd like to know if that removes wal traffic when
> only selects are going on.

My guess: no.

> Another way to check is to throw some queries to pg_stat_activity
> during your select period and see if any non-select activity (like
> autovacum vacuum).

That's not a bad thing to check, but be careful what causality you
assume based on a correlation here -- blaming autovacuum might be a
bit like blaming firefighters for fires, because you keep
seeing them at the same time.  You might actually want them to
respond faster and more aggressively, rather than keeping them away.

You do realize that just reading a page with dead tuples can cause
dead tuple pruning, right?  No autovacuum involved.  Your SELECT
statement waits for things to be tidied up and the page is marked
dirty.  I'm thinking that more aggressive autovacuum runs would
clean more of this up in background processes and let the SELECT
statement avoid some of this work -- speeding them up.

-Kevin

From:
Merlin Moncure
Date:

On Mon, Jul 11, 2011 at 4:55 PM, Kevin Grittner
<> wrote:
> That's not a bad thing to check, but be careful what causality you
> assume based on a correlation here -- blaming autovacuum might be a
> bit like like blaming firefighters for fires, because you keep
> seeing them at the same time.  You might actually want them to
> respond faster and more aggressively, rather than keeping them away.
>
> You do realize, that just reading a page with dead tuples can cause
> dead tuple pruning, right?  No autovacuum involved.  Your SELECT
> statement waits for things to be tidied up and the page is marked
> dirty.  I'm thinking that more aggressive autovacuum runs would
> clean more of this up in background processes and let the SELECT
> statement avoid some of this work -- speeding them up.

Yeah, but that doesn't jibe with the facts as I understand them -- 10k
records are being written at random places and 10k records are being
read at random places in some large table.  I'm assuming (!) that most
of the time the 'selects' are not hitting tuples that were recently
updated unless the table is small compared to the 10k window
size. If the select is reading a bunch of not-recently-updated tuples,
and this is not a sequential scan, and autovacuum is not running and
causing sideband i/o, WAL activity should be small.  What percentage
of the table is updated at the end of the test?  If it's high, or
greater than a 1.0 write ratio, then disabling AV would be a huge
negative.

Or maybe the random select/update index is synchronized -- but then
autovacuum wouldn't really be a player either way.

merlin

From:
lars
Date:

On 07/11/2011 02:43 PM, Merlin Moncure wrote:
> yeah.  In fact, I'd like to disable autovacuum completely just to
> confirm this.  In particular I'd like to know if that removes wal
> traffic when only selects are going on.  Another way to check is to
> throw some queries to pg_stat_activity during your select period and
> see if any non-select activity (like autovacum vacuum).  Basically I'm
> suspicious there is more to this story.
>
> hint bit flusing causing i/o during SELECT is a typical complaint
> (especially on systems with underperformant i/o), but I'm suspicious
> if that's really the problem here. Since you are on a virtualized
> platform, I can't help but wonder if you are running into some
> bottleneck that you wouldn't see on native hardware.
>
> What's iowait during the slow period?
>
> merlin
Thanks Kevin and Merlin, this is extremely helpful...

Ok, that makes much more sense (WAL-logging hint bits did not make sense).

I disabled autovacuum and did four tests:
1. Run a bunch of updates, stop that process, wait until checkpointing
is finished, and run the selects (as before).
2. Run VACUUM manually, then run the SELECTs.
3. Have the UPDATEs and SELECTs touch mutually exclusive, random sets
of rows (still in sets of 10000),
so the SELECTs are guaranteed not to select rows that were updated.
4. Lastly, change the UPDATEs to update a non-indexed column, to rule
out index maintenance. Still distinct sets of rows.

In the first case I see the same slowdown (from ~14ms to ~400-500ms).
pg_stat_activity shows no other load during that
time. I also see write activity only on the WAL volume.

In the 2nd case, after VACUUM is finished, the time is back to 14ms. As an
aside: if I run the SELECTs while VACUUM is
running, the slowdown is about the same as in the first case until
(apparently) VACUUM has cleaned up most of the table,
at which point the SELECTs become faster again (~50ms).

In the 3rd case I see exactly the same behavior, which is interesting --
both before VACUUM is run and after.
Of course there is no guarantee that distinct rows do not share the same
page, especially since the index is updated as part of this (but see the
4th case).

In case 4 I still see the same issue. Again both before and after VACUUM.

In all cases I see from pg_stat_bgwriter that no backend writes buffers
directly (but I think that only pertains to dirty buffers, and not the WAL).

So I think I have a partial answer to my initial question.

However, that brings me to some other questions:
Why do SELECTs cause dead tuples to be pruned (as Kevin suggests)?
It even happens when the UPDATEs did not touch the selected rows(?)
And why does that slow down the SELECTs? (Checkpointing activity on the
EBS volume holding the database, for example, does not slow down SELECTs
at all; only WAL activity does.) Does the selecting backend do that work
itself?

Lastly, is this documented somewhere? (I apologize if it is and I missed
it.) If not, I'd be happy to write a wiki entry for this.


From:
lars
Date:

On 07/11/2011 04:02 PM, lars wrote:
Oh, and iowait hovers around 20% when SELECTs are slow:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            1.54    0.00    0.98   18.49    0.07   78.92

When SELECTs are fast it looks like this:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            8.72    0.00    0.26    0.00    0.00   91.01

Note that this is a 12 core VM. So one core at 100% would show as 8.33% CPU.


From:
lars
Date:

On 07/11/2011 08:26 AM, Robert Klemme wrote:
> On Mon, Jul 11, 2011 at 3:13 PM, <>  wrote:
>> I do not know if this makes sense in PostgreSQL, given that readers
>> do not block writers and writers do not block readers. Are your
>> UPDATEs to individual rows, each in a separate transaction, or
>> do you UPDATE multiple rows in the same transaction? If you
>> perform multiple updates in a single transaction, you are
>> synchronizing the changes to that set of rows and that constraint
>> is causing other readers that need to get the correct values post-
>> transaction to wait until the COMMIT completes. This means that
>> the WAL write must be completed.
> What readers should that be?  Docs explicitly state that readers are
> never blocked by writers:
> http://www.postgresql.org/docs/9.0/interactive/mvcc-intro.html
> http://www.postgresql.org/docs/9.0/interactive/mvcc.html
>
>  From what I understand about this issue the observed effect must be
> caused by the implementation and not by a conceptual issue with
> transactions.
>
>> Have you tried disabling synchronous_commit? If this scenario
>> holds, you should be able to reduce the slowdown by un-batching
>> your UPDATEs, as counter-intuitive as that is. This seems to
>> be similar to a problem that I have been looking at with using
>> PostgreSQL as the backend to a Bayesian engine. I am following
>> this thread with interest.
> I don't think this will help (see above).  Also, I would be very
> cautious to do this because although the client might get a faster
> acknowledge the DB still has to do the same work as without
> synchronous_commit (i.e. WAL, checkpointing etc.) but it still has to
> do significantly more transactions than in the batched version.
>
> Typically there is an optimum batch size: if batch size is too small
> (say, one row) the ratio of TX overhead to "work" is too bad.  If
> batch size is too large (say, millions of rows) you hit resource
> limitations (memory) which inevitable force the RDBMS to do additional
> disk IO.
>
> Kind regards
>
> robert
>
Thanks Ken and Robert,

What I am observing is definitely not readers blocked by writers by
means of row-level locking.

This seems to be some implementation detail in Postgres about how dirty
pages (or dead versions of tuples) are
flushed to disk (see the other parts of this thread) when they are
accessed by a SELECT query.

The batch size in this case is one SELECT statement accessing 10000 rows
via an aggregate (such as COUNT) and
an UPDATE updating 10000 rows in a single statement.

I am not trying to optimize this particular use case, but rather to
understand what Postgres is doing, and why SELECT
queries are affected negatively (sometimes severely) by concurrent (or
even preceding) UPDATEs at all when the
database resides in the cache completely.

-- Lars


From:
Ivan Voras
Date:

On 08/07/2011 01:56, lars wrote:

> Setup:
> PostgreSQL 9.1beta2 on a high memory (~68gb, 12 cores) EC2 Linux
> instance (kernel 2.6.35) with the database and
> WAL residing on the same EBS volume with EXT4 (data=ordered, barriers=1)
> - yes that is not an ideal setup
> (WAL should be on separate drive, EBS is slow to begin, etc), but I am
> mostly interested in read performance for a fully cached database.

I know you said you know these things - but do you really know the
(huge) extent to which all your IO is slowed? Even context switches in a
virtualized environment are slowed down by a huge margin - which would
make practically all in-memory lock operations very slow - much slower
than they would be on "real" hardware, and EBS by definition is even
slower than regular private virtual storage environments. I regrettably
didn't bookmark the page which did exact measurements of EBS, but
http://www.google.com/search?q=how+slow+is+ec2 will illustrate my point.
(of course, you may already know all this :) ).


From:
Merlin Moncure
Date:

On Tue, Jul 12, 2011 at 8:22 AM, Ivan Voras <> wrote:
> I know you said you know these things - but do you really know the (huge)
> extent to which all your IO is slowed? Even context switches in a
> virtualized environment are slowed down by a huge margin - which would make
> practically all in-memory lock operations very slow - much slower than they
> would be on "real" hardware, and EBS by definition is even slower than
> regular private virtual storage environments. I regrettably didn't bookmark
> the page which did exact measurements of EBS, but
> http://www.google.com/search?q=how+slow+is+ec2 will illustrate my point. (of
> course, you may already know all this :) ).

sure, but the OP's question is valid: in postgres, readers don't block
writers, so why is the reader waiting?  I'd like to know definitively:
*) is the reader bottlenecked on disk i/o (it seems yes)
*) is that disk i/o heap or wal (it seems wal)
*) is that disk i/o reading/writing (it seems writing)
*) given the above, why is this happening (presumably disk page tidying)?

We need some more information here.  I'd like to see the table
information -- at least the average width of the record both pre/post
update, whether or not it is toasted, and the number and size of indexes
pre/post update.  I'm really suspicious of the virtualization tech as
well -- is it possible to run this test on at least semi-decent native
hardware?

merlin

From:
"Kevin Grittner"
Date:

lars <> wrote:

> I am not trying to optimize this particular use case, but rather
> to understand what Postgres is doing, and why SELECT queries are
> affected negatively (sometimes severely) by concurrent (or even
> preceding) UPDATEs at all when the database resides in the cache
> completely.

I think we're stuck in terms of trying to guess what is causing the
behavior you are seeing.  Things which would help get it "unstuck":

(1)  More information about your load process.  Looking at the code,
I could sort of see a possible path to this behavior if the load
process involves any adjustments beyond straight INSERTs or COPYs
in.

(2)  You could poke around with a profiler, a debugger, and/or
the contrib/pageinspect module to sort things out.

(3)  You could post a reproducible test case -- where you start with
CREATE TABLE, populate with something like the generate_series()
function, and go through a clearly described set of steps to see the
behavior.  With that, someone else could do the profiling, debugging,
and/or page inspection.
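A minimal skeleton along those lines might look like this (illustrative only -- row counts, tenant distribution, and column types would need to match the real workload):

```sql
CREATE TABLE test (tenant char(15), created_by char(15), created_date date);

-- ~1m rows spread over 10000 tenants
INSERT INTO test
SELECT lpad((i % 10000)::text, 15, '0'),
       lpad((i % 10000)::text, 15, '0'),
       DATE '2011-06-30'
  FROM generate_series(1, 1000000) AS s(i);

CREATE INDEX i1 ON test (tenant);
VACUUM ANALYZE test;
```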

From what you've said, it seems like either you're omitting some
detail as irrelevant which is actually significant, or you've found
a bug we should hunt down and fix.  I really don't know which it is.

-Kevin

From:
Ivan Voras
Date:

On 12/07/2011 16:18, Merlin Moncure wrote:
> sure, but the OP's question is valid: in postgres, readers don't block
> writers, so why is the reader waiting?

Yes, but I'm suggesting a different question: are we sure we are not
seeing the influences of the environment (EC2+EBS) instead of the
software system?

> We need some more information here.

Definitely.


From:
Ivan Voras
Date:

On 12/07/2011 02:09, lars wrote:

> Oh, and iowait hovers around 20% when SELECTs are slow:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 1.54 0.00 0.98 18.49 0.07 78.92
>
> When SELECTs are fast it looks like this:
> avg-cpu: %user %nice %system %iowait %steal %idle
> 8.72 0.00 0.26 0.00 0.00 91.01
>
> Note that this is a 12 core VM. So one core at 100% would show as 8.33%
> CPU.

Now if only you could do an "iostat -x" and show the output in both cases...

From:
Merlin Moncure
Date:

On Tue, Jul 12, 2011 at 9:36 AM, Kevin Grittner
<> wrote:
> lars <> wrote:
>
>> I am not trying to optimize this particular use case, but rather
>> to understand what Postgres is doing, and why SELECT queries are
>> affected negatively (sometimes severely) by concurrent (or even
>> preceding) UPDATEs at all when the database resides in the cache
>> completely.
>
> I think we're stuck in terms of trying to guess what is causing the
> behavior you are seeing.  Things which would help get it "unstuck":
>
> (1)  More information about your load process.  Looking at the code,
> I could sort of see a possible path to this behavior if the load
> process involves any adjustments beyond straight INSERTs or COPYs
> in.
>
> (2)  You could poke around with a profiler, a debugger, and/or
> the contrib/pageinspect module to sort things out.


hm, also strace on the 'select' process might give some clues.
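The backend to attach to can be looked up first (column names as of 9.1 -- `procpid` was later renamed to `pid`), and then straced from a root shell with something like `strace -p <procpid> -e trace=write,fsync`:

```sql
-- Find the backend running the SELECT workload:
SELECT procpid, current_query
  FROM pg_stat_activity
 WHERE current_query ILIKE 'select%';
```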

merlin

From:
lars
Date:

On 07/12/2011 08:13 AM, Ivan Voras wrote:
> Now if only you could do an "iostat -x" and show the output in both
> cases...
>
>
Sure (sorry for missing details):

iostat -x during selects when all's fine:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            8.25    0.00    0.00    0.00    0.00   91.75

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     1.00    0.00    2.00     0.00    24.00
12.00     0.00    0.00   0.00   0.00
xvdf              0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
xvdg              0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00

xvdap1 is the OS volume.
xvdf holds the database files.
xvdg holds the WAL.

No IO on database/WAL volumes, one core is pegged close to 100% CPU.

------------------------------------

iostat -x during update:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            1.05    0.00    0.58    4.00    0.00   94.37

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
xvdf              0.00     0.00    7.00    0.00   128.00     0.00
18.29     0.00    0.00   0.00   0.00
xvdg              0.00  7352.00    0.00  804.00     0.00 62368.00
77.57    66.07   68.83   0.86  69.20

Just updating the WAL.

-----------------------------------

and while it's checkpointing:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.64    0.00    0.32    8.88    0.00   90.16

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
xvdf              0.00  2548.00    2.00 1658.00    32.00 33408.00
20.14   144.18   86.69   0.60 100.00
xvdg              0.00  5428.00    0.00  778.00     0.00 58480.00
75.17    77.44  100.22   1.21  94.00

Updating the WAL, and database volume due to checkpointing.

----------------------------------

iostat -x after I stopped the update process and checkpointing is done:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0.00    0.00    0.00    0.00    0.00  100.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
xvdf              0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
xvdg              0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00

No activity at all.

---------------------------------

iostat -x after I started the select queries after the updates:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            2.09    0.00    1.49   12.15    0.00   84.26

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
xvdap1            0.00     0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
xvdf              0.00     8.00    0.00    2.00     0.00    80.00
40.00     0.00    2.00   2.00   0.40
xvdg              0.00  7844.00    1.00 1098.00     8.00 82336.00
74.93    58.27   59.39   0.70  77.20

Heavy writes to the WAL volume.

select * from pg_stat_activity;
  datid | datname | procpid | usesysid | usename  | application_name |
client_addr | client_hostname | client_port |
backend_start         |          xact_start           |          qu
ery_start          | waiting |                         current_query

-------+---------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+------------
-------------------+---------+---------------------------------------------------------------
  16385 | lars    |    2654 |    16384 | lars     |                  |
127.0.0.1   |                 |       44972 | 2011-07-12
18:44:09.479581+00 | 2011-07-12 18:50:32.629412+00 | 2011-07-12
18:50:32.629473+00 | f       | select count(*) from test where tenant =
$1 and created_date = $2
  16385 | lars    |    2658 |       10 | postgres | psql
|             |                 |          -1 | 2011-07-12
18:49:02.675436+00 | 2011-07-12 18:50:32.631013+00 | 2011-07-12
18:50:32.631013+00 | f       | select * from pg_stat_activity;
  16385 | lars    |    2660 |    16384 | lars     | psql
|             |                 |          -1 | 2011-07-12
18:49:45.711643+00 |                               |
                    | f       | <IDLE>
(3 rows)


From:
"Kevin Grittner"
Date:

lars <> wrote:

> select count(*) from test where tenant = $1 and created_date = $2

Ah, that might be a major clue -- prepared statements.

What sort of a plan do you get for that as a prepared statement?
(Note, it is very likely *not* to be the same plan as you get if you
run with literal values!)  It is not at all unlikely that it could
resort to a table scan if you have one tenant which is five or ten
percent of the table, which would likely trigger the pruning as it
passed over the modified pages.

-Kevin

From:
"Kevin Grittner"
Date:

lars <> wrote:

> select count(*) from test where tenant = $1 and created_date = $2

Thinking about this some more, it would be interesting to know your
PostgreSQL configuration.  I seem to remember you mentioning some
settings up-thread, but I'm not sure whether it was comprehensive.
Could you paste the result of running the query on this Wiki page?:

http://wiki.postgresql.org/wiki/Server_Configuration

It might be that if you generate your queries with literals rather
than using server-side prepared statements, and tweak a couple
configuration settings, this problem will evaporate.

-Kevin

From:
lars
Date:

On 07/12/2011 12:08 PM, Kevin Grittner wrote:
> lars<>  wrote:
>
>> select count(*) from test where tenant = $1 and created_date = $2
>
> Ah, that might be a major clue -- prepared statements.
>
> What sort of a plan do you get for that as a prepared statement?
> (Note, it is very likely *not* to be the same plan as you get if you
> run with literal values!)  It is not at all unlikely that it could
> resort to a table scan if you have one tenant which is five or ten
> percent of the table, which would likely trigger the pruning as it
> passed over the modified pages.
>
> -Kevin
So a read of a row *will* trigger dead tuple pruning, and that requires
WAL logging, and this is known/expected?
This is actually the only answer I am looking for. :) I have not seen
this documented anywhere.

I know that Postgres will generate generic plans for prepared statements
(how could it do otherwise?),
and I also know that it sometimes chooses a sequential scan.
This can always be tweaked to touch fewer rows and/or use a different
plan. That's not my objective, though!

The fact that a SELECT (maybe a big analytical query we'll run) touching
many rows will update the WAL and wait
(apparently) for that IO to complete makes a fully cached database
far less useful.
I just artificially created this scenario.

... Just dropped the table to test something so I can't get the plan
right now. Will send an update as soon as I get
it setup again.

Thanks again.

-- Lars


From:
"Kevin Grittner"
Date:

lars <> wrote:

> So a read of a row *will* trigger dead tuple pruning, and that
> requires WAL logging, and this is known/expected?

Yes, because pruning dead line pointers will make subsequent reads
faster.  It's intended to be an optimization.

> This is actually the only answer I am looking for. :) I have not
> seen this documented anywhere.

You would currently need to look at the README-HOT file or source
code, I think.  There probably should be some mention in the user
docs, but I haven't noticed any, and it is more technical than most
of the documentation gets.  Perhaps a "note" block somewhere...
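Short of reading README-HOT, the pruning left behind on a page can at least be observed directly with contrib/pageinspect (a sketch, assuming the `test` table from this thread; in the output, `lp_flags = 2` marks a redirect and `3` a dead line pointer left by pruning):

```sql
CREATE EXTENSION pageinspect;

-- Inspect line pointers on the first heap page of "test":
SELECT lp, lp_flags, t_xmin, t_xmax
  FROM heap_page_items(get_raw_page('test', 0));
```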

> The fact that a select (maybe a big analytical query we'll run)
> touching many rows will update the WAL and wait (apparently) for
> that IO to complete is making a fully cached database far less
> useful.

Well, I've never run into this because I have directly attached
storage through a RAID controller with a battery-backed cache
configured for write-back.  The pruning is pretty light on CPU
usage, and with a BBU controller, the WAL writes just move from one
cache to another.

If that's not an option for you, you could contrive to have the
update code reread the modified rows after COMMIT, or configure your
autovacuum to be very aggressive so that a background process
usually takes care of this before a SELECT gets to it.  And there's
a good chance that tuning your query and/or running with literal
values available to the planner would be a big net win even without
this issue; if this issue is a problem for you, it's just another
reason to do that tuning.
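The "reread the modified rows after COMMIT" workaround could look roughly like this in the update path (a sketch with illustrative literal values; the idea is that the writing session pays the pruning and hint-bit WAL cost itself instead of the next reader):

```sql
BEGIN;
UPDATE test SET created_by = '000000000000009'
 WHERE tenant = '000000000000001' AND created_date = '2011-6-30';
COMMIT;

-- Immediately touch the same rows again so this session, not a later
-- SELECT, triggers the page pruning and the associated WAL writes:
SELECT count(*) FROM test
 WHERE tenant = '000000000000001' AND created_date = '2011-6-30';
```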

> Just dropped the table to test something so I can't get the plan
> right now. Will send an update as soon as I get
> it setup again.

I'll be surprised if you don't see a seqscan.  The most interesting
bit at this point (at least to me) is working on tuning the cost
factors for the planner.

-Kevin

From:
lars
Date:

On 07/12/2011 01:04 PM, lars wrote:
Ok... slightly changed the indexes:
\d test
             Table "lars.test"
     Column    |     Type      | Modifiers
--------------+---------------+-----------
  tenant       | character(15) |
  created_by   | character(15) |
  created_date | date          |
Indexes:
     "i1" btree (tenant)

So just a simple index on tenant.

prepare x as select count(*) from test where tenant = $1 and
created_date = $2;
PREPARE
explain execute x('000000000000001','2011-6-30');
                                   QUERY PLAN
-------------------------------------------------------------------------------
  Aggregate  (cost=263301.40..263301.41 rows=1 width=0)
    ->  Bitmap Heap Scan on test  (cost=3895.99..263299.28 rows=847 width=0)
          Recheck Cond: (tenant = $1)
          Filter: (created_date = $2)
          ->  Bitmap Index Scan on i1  (cost=0.00..3895.77 rows=169372
width=0)
                Index Cond: (tenant = $1)
(6 rows)

-- this is when the WAL rows are written:
explain (analyze on, buffers on) execute x('000000000000001','2011-6-30');
                                                          QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=263301.40..263301.41 rows=1 width=0) (actual
time=191.150..191.151 rows=1 loops=1)
    Buffers: shared hit=3716
    ->  Bitmap Heap Scan on test  (cost=3895.99..263299.28 rows=847
width=0) (actual time=1.966..188.221 rows=3712 loops=1)
          Recheck Cond: (tenant = $1)
          Filter: (created_date = $2)
          Buffers: shared hit=3716
          ->  Bitmap Index Scan on i1  (cost=0.00..3895.77 rows=169372
width=0) (actual time=1.265..1.265 rows=3712 loops=1)
                Index Cond: (tenant = $1)
                Buffers: shared hit=20
  Total runtime: 191.243 ms
(10 rows)

-- this is when no WAL is written:
explain (analyze on, buffers on) execute x('000000000000001','2011-6-30');
                                                          QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=263301.40..263301.41 rows=1 width=0) (actual
time=11.529..11.530 rows=1 loops=1)
    Buffers: shared hit=3715
    ->  Bitmap Heap Scan on test  (cost=3895.99..263299.28 rows=847
width=0) (actual time=1.341..9.187 rows=3712 loops=1)
          Recheck Cond: (tenant = $1)
          Filter: (created_date = $2)
          Buffers: shared hit=3715
          ->  Bitmap Index Scan on i1  (cost=0.00..3895.77 rows=169372
width=0) (actual time=0.756..0.756 rows=3712 loops=1)
                Index Cond: (tenant = $1)
                Buffers: shared hit=19
  Total runtime: 11.580 ms
(10 rows)

In case you want to recreate this scenario, here is a simple script to
create the table:

create table test(tenant char(15), created_by char(15), created_date date);
insert into test values('x', 'y','2011-6-30');
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test; -- 256k rows
update test set tenant = lpad((random()*10000)::int::text,15,'0'),
created_by = lpad((random()*10000)::int::text,15,'0');
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test; -- 32m rows
create index i1 on test(tenant);
vacuum analyze;

I use JDBC to perform the updates and selects, but this will do too:

prepare x as select count(*) from test where tenant = $1 and
created_date = $2;
prepare y as update test set created_by = $1 where tenant = $2 and
created_date = $3;

execute y('000000000000001', '000000000000001','2011-6-30');
execute x('000000000000001','2011-6-30');

I'll probably compile Postgres with WAL debug on next, and try the
pageinspect module.
On the face of it, though, this looks like Postgres would not be that
useful as a database that resides (mostly) in the cache.
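Before going as far as a WAL-debug build, the WAL volume generated by a single SELECT can be measured from SQL (function name as of 9.1; it was later renamed to pg_current_wal_lsn()), using the prepared statement x from above:

```sql
SELECT pg_current_xlog_location();  -- note the WAL position

EXECUTE x('000000000000001','2011-6-30');

SELECT pg_current_xlog_location();  -- if this advanced, the SELECT wrote WAL
```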

Here is what the query mentioned here returns (note that shared_buffers is
very large, but I observed the same with smaller settings):

http://wiki.postgresql.org/wiki/Server_Configuration


              name             |                current_setting
------------------------------+------------------------------------------------------------------
  version                      | PostgreSQL 9.1beta2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.4 20100726 (Red Hat 4.4.4-13), 64-bit
  autovacuum                   | off
  bgwriter_delay               | 10ms
  bgwriter_lru_maxpages        | 1000
  checkpoint_completion_target | 0.9
  checkpoint_segments          | 128
  client_encoding              | UTF8
  effective_cache_size         | 64GB
  lc_collate                   | en_US.UTF-8
  lc_ctype                     | en_US.UTF-8
  log_checkpoints              | on
  log_line_prefix              | %m
  maintenance_work_mem         | 2GB
  max_connections              | 100
  max_stack_depth              | 2MB
  server_encoding              | UTF8
  shared_buffers               | 20GB
  TimeZone                     | UTC
  wal_buffers                  | 16MB
  work_mem                     | 1GB
(20 rows)

Thanks.

-- Lars


From:
Merlin Moncure
Date:

On Tue, Jul 12, 2011 at 2:08 PM, Kevin Grittner
<> wrote:
> lars <> wrote:
>
>> select count(*) from test where tenant = $1 and created_date = $2
>
> Ah, that might be a major clue -- prepared statements.

I'm really skeptical that this is the case -- the table is 100m rows,
and there is no way you are banging through 100m records in 500ms,
particularly if doing i/o.

Also regarding the page cleanup, IIRC the optimization only takes
place if the dead tuples are HOT -- when the OP opened the thread he
stated it was happening with an update on an indexed field (and later
tested it on an unindexed field).

Something is not adding up here.  Perhaps there is an alternate route
to WAL-logged activity from selects that I'm not thinking of.  Right
now I'm thinking to run the selects on table 'a' and the inserts
concurrently on table 'b' and see how that behaves.  Another way to
get to the bottom of it is to oprofile the selecting-during-load
backend to see where the time is getting spent.  An alternative is to
attach strace to the selecting-during-load backend to see if it's
really writing to the WAL (I'm really curious about this).

Another interesting test would be to try to reproduce the results on a
native machine.  It should be possible to do this on your workstation
with a more modestly sized scaling factor.

merlin

From:
"Kevin Grittner"
Date:

lars <> wrote:

> vacuum analyze;

I tried this out on a 16 core, 64 GB machine.  It was a replication
target for a few dozen source databases into a couple 2 TB reporting
databases, and had some light testing going on, but it was only at
about 50% capacity, so that shouldn't throw this off by *too* much,
I hope.  Since our data is long-lived enough to worry about
transaction ID freezing issues, I always follow a bulk load with
VACUUM FREEZE ANALYZE; so I did that here.  I also just threw this
into the 2 TB database without changing our configuration.  Among
other things, that means that autovacuum was on.

> prepare x as select count(*) from test where tenant = $1 and
> created_date = $2;
> prepare y as update test set created_by = $1 where tenant = $2 and
> created_date = $3;
>
> execute y('000000000000001', '000000000000001','2011-6-30');
> execute x('000000000000001','2011-6-30');

I ran x a bunch of times to get a baseline, then y once, then x a
bunch more times.  The results were a bit surprising:

cir=> \timing
Timing is on.
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 9.823 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 8.481 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 14.054 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 10.169 ms
cir=> execute y('000000000000001', '000000000000001','2011-6-30');
UPDATE 3456
Time: 404.244 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 128.643 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 2.657 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 5.883 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 2.645 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 2.753 ms
cir=> execute x('000000000000001','2011-6-30');
 count
-------
  3456
(1 row)

Time: 2.253 ms

Running the update made the next SELECT slow, then it was much
*faster*.  My best guess is that the data landed in a more
concentrated set of pages after the update, and once autovacuum
kicked in and cleaned things up it was able to get to that set of
data faster.

> On the face of it, though, this looks like Postgres would not be
> that useful as a database that resides (mostly) in the cache.

>   autovacuum                   | off

Well, certainly not while under modification without running
autovacuum.  That's disabling an integral part of what keeps
performance up.  There are very few, if any, situations where
running PostgreSQL in production without autovacuum makes any sense,
and benchmarks which disable it don't give a very accurate picture
of typical performance.  Now, if you're looking to artificially
create a worst-case scenario, then it makes sense, but I'm not clear
on the point of it.

I do understand the impulse, though.  When we first started using
PostgreSQL there were certain very small tables which were updated
very frequently and which got slow when autovacuum kicked in.  We made
autovacuum less aggressive, and found that things got worse!  So we
went the other way and made autovacuum much more aggressive than the
defaults, and everything was fine.
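
For reference, that kind of per-table tuning looks something like this
(the numbers here are only illustrative, not the ones we actually
settled on):

```sql
-- Make autovacuum much more aggressive on a hot table; values are
-- illustrative only.
ALTER TABLE test SET (
    autovacuum_vacuum_scale_factor = 0.0,  -- trigger on absolute row counts
    autovacuum_vacuum_threshold = 500,     -- vacuum after ~500 dead tuples
    autovacuum_vacuum_cost_delay = 0       -- don't throttle the vacuum
);
```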

-Kevin

From:
lars
Date:

On 07/12/2011 02:51 PM, Kevin Grittner wrote:
> I ran x a bunch of times to get a baseline, then y once, then x a
> bunch more times.  The results were a bit surprising:
>
> cir=>  \timing
> Timing is on.
> cir=>  execute x('000000000000001','2011-6-30');
> [...four runs, 8.481 - 14.054 ms...]
> cir=>  execute y('000000000000001', '000000000000001','2011-6-30');
> UPDATE 3456
> Time: 404.244 ms
> cir=>  execute x('000000000000001','2011-6-30');
> [...]
> Time: 128.643 ms
> cir=>  execute x('000000000000001','2011-6-30');
> [...subsequent runs, 2.253 - 5.883 ms...]
>
Interesting. When you did your test, did you also see WAL write activity
when running x the first time after y?
(It's very hard to catch in a single query, though.)

> Running the update made the next SELECT slow, then it was much
> *faster*.  My best guess is that the data landed in a more
> concentrated set of pages after the update, and once autovacuum
> kicked in and cleaned things up it was able to get to that set of
> data faster.
>
>>    autovacuum                   | off
> Well, certainly not while under modification without running
> autovacuum.  That's disabling an integral part of what keeps
> performance up.
Oh, it's just switched off for testing, so that I can control when
vacuum runs and make sure it's not skewing the results while I am
measuring something.
In a real database I would probably err on the side of vacuuming more,
not less.

For a fully cached database I would probably want to switch off HOT
pruning and compaction (which, from what we see, is done synchronously
with the select) and leave that to the asynchronous autovacuum. But
maybe I am still not quite understanding the performance implications.


From:
lars
Date:

On 07/12/2011 02:38 PM, Merlin Moncure wrote:
>
> Something is not adding up here.  Perhaps there is an alternate route
> to WAL logged activity from selects I'm not thinking of.  Right now
> I'm thinking to run the selects on table 'a' and the inserts
> concurrently on table 'b' and seeing how that behaves.  Another way to
> get to the bottom is to oprofile the selecting-during-load backend to
> see where the time is getting spent.   Alternative way to do this is
> to strace attach to the selecting-during-load backend to see if it's
> really writing to the WAL (I'm really curious about this).
>
> Another interesting test would be to try and reproduce the results on
> native machine. It should be possible to do this on your workstation
> with a more modestly sized scaling factor.
>
> merlin
>
Just tried with two of my test tables.
Updates on 'a' have no (measurable) effect on select from 'b'.

Back to the first case, here's an strace from the backend doing the
select right after the updates.

"Q\0\0\0`select count(*) from test1 "..., 8192, 0, NULL, NULL) = 97
gettimeofday({1310512219, 723762}, NULL) = 0
open("base/16385/33032", O_RDWR)        = 8
lseek(8, 0, SEEK_END)                   = 1073741824
open("base/16385/33032.1", O_RDWR|O_CREAT, 0600) = 9
lseek(9, 0, SEEK_END)                   = 1073741824
open("base/16385/33032.2", O_RDWR|O_CREAT, 0600) = 10
lseek(10, 0, SEEK_END)                  = 191348736
open("base/16385/33035", O_RDWR)        = 11
lseek(11, 0, SEEK_END)                  = 1073741824
open("base/16385/33035.1", O_RDWR|O_CREAT, 0600) = 12
lseek(12, 0, SEEK_END)                  = 3571712
lseek(10, 0, SEEK_END)                  = 191348736
brk(0x28ad000)                          = 0x28ad000
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f5f28ca0000
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x7f5f28c5f000
munmap(0x7f5f28c5f000, 266240)          = 0
munmap(0x7f5f28ca0000, 135168)          = 0
open("pg_xlog/00000001000000BB00000012", O_RDWR) = 13
lseek(13, 1564672, SEEK_SET)            = 1564672
write(13,
"f\320\1\0\1\0\0\0\273\0\0\0\0\340\27\22`\32\0\00000002833!000"...,
2400256) = 2400256
fdatasync(13)                           = 0
semop(229383, {{9, 1, 0}}, 1)           = 0
gettimeofday({1310512219, 885287}, NULL) = 0
sendto(5,
"\2\0\0\0\300\3\0\0\1@\0\0\t\0\0\0\1\0\0\0\0\0\0\0\353\4\0\0@\0\2\0"...,
960, 0, NULL, 0) = 960
sendto(5,
"\2\0\0\0\300\3\0\0\1@\0\0\t\0\0\0\0\0\0\0\0\0\0\0009\n\0\0@\0\2\0"...,
960, 0, NULL, 0) = 960
sendto(5,
"\2\0\0\0\300\3\0\0\1@\0\0\t\0\0\0\0\0\0\0\0\0\0\0v\n\0\0@\0\2\0"...,
960, 0, NULL, 0) = 960
sendto(5,
"\2\0\0\0\270\1\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\275\4\0\0\377\177\0\0"...,
440, 0, NULL, 0) = 440
sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66

So the backend is definitely writing to the WAL, directly and synchronously.

Selecting the same set of rows again:
"Q\0\0\0`select count(*) from test1 "..., 8192, 0, NULL, NULL) = 97
gettimeofday({1310512344, 823728}, NULL) = 0
lseek(10, 0, SEEK_END)                  = 191348736
lseek(12, 0, SEEK_END)                  = 3571712
lseek(10, 0, SEEK_END)                  = 191348736
brk(0x28d5000)                          = 0x28d5000
brk(0x2915000)                          = 0x2915000
brk(0x2897000)                          = 0x2897000
gettimeofday({1310512344, 831043}, NULL) = 0
sendto(5,
"\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\1\0\0\0\0\0\0\0\10\201\0\0?\0\2\0"..., 232,
0, NULL, 0) = 232
sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66

No writing to the WAL.

-- Lars


From:
Jeff Janes
Date:

On 7/12/11, lars <> wrote:
>
>
> The fact that a select (maybe a big analytical query we'll run) touching
> many rows will update the WAL and wait
> (apparently) for that IO to complete is making a fully cached database
> far less useful.
> I just artificially created this scenario.

I can't think of any reason that that WAL would have to be flushed
synchronously.

There is already code that allows transactions that only write certain
kinds of "maintenance" WAL to skip the flush. Could this pruning WAL
be added to that group?

From:
Tom Lane
Date:

Jeff Janes <> writes:
> On 7/12/11, lars <> wrote:
>> The fact that a select (maybe a big analytical query we'll run) touching
>> many rows will update the WAL and wait
>> (apparently) for that IO to complete is making a fully cached database
>> far less useful.
>> I just artificially created this scenario.

> I can't think of any reason that that WAL would have to be flushed
> synchronously.

Maybe he's running low on shared_buffers?  We would have to flush WAL
before writing a dirty buffer out, so maybe excessive pressure on
available buffers is part of the issue here.

            regards, tom lane

From:
Lars
Date:

shared_buffers is big enough to hold the entire database, and there is plenty of extra space (verified with
pg_buffercache).
So I don't think that is the reason.


Tom Lane <> schrieb:

>Jeff Janes <> writes:
>> On 7/12/11, lars <> wrote:
>>> The fact that a select (maybe a big analytical query we'll run) touching
>>> many rows will update the WAL and wait
>>> (apparently) for that IO to complete is making a fully cached database
>>> far less useful.
>>> I just artificially created this scenario.
>
>> I can't think of any reason that that WAL would have to be flushed
>> synchronously.
>
>Maybe he's running low on shared_buffers?  We would have to flush WAL
>before writing a dirty buffer out, so maybe excessive pressure on
>available buffers is part of the issue here.
>
>            regards, tom lane

From:
"Kevin Grittner"
Date:

[combining responses to two posts on this thread by lars]

lars <> wrote:

> On the face of it, though, this looks like Postgres would not be
> that useful as a database that resides (mostly) in the cache.

I've mentioned this in a hand-wavy general sense, but I should have
mentioned specifics ages ago: for a database where the active
portion of the database is fully cached, it is best to set
seq_page_cost and random_page_cost to the same value, somewhere in
the 0.1 to 0.05 range.  (In your case I would use 0.05.)  In highly
cached databases I have sometimes also found it necessary to
increase cpu_tuple_cost.  (In your case I might try 0.02.)

This won't directly address the specific issue you've been
investigating in this thread, but it will tend to give you faster
plans for your actual environment without needing to fuss with
things on a query-by-query basis.  It may indirectly mitigate the
problem at hand through heavier use of indexes which would reduce
pruning and hint-bit setting by readers.
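
Concretely, that comes down to something like this (shown as
session-level SETs for experimentation; in practice you would put the
values in postgresql.conf):

```sql
-- Costs tuned for a fully cached active set, using the ranges
-- suggested above; the exact values are a starting point, not gospel.
SET seq_page_cost = 0.05;
SET random_page_cost = 0.05;   -- same as seq_page_cost when cached
SET cpu_tuple_cost = 0.02;
```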

> Interesting. When you did you test, did you also find WAL write
> activity when running x the first time after y?

I wasn't able to check that in this quick, ad hoc run.

> Oh, it's just switched off for testing, so that I can control
> when vacuum runs and make sure that it's not skewing the results
> while I am measuring something.

Again, that's an impulse I can certainly understand, but the problem
is that turning autovacuum off skews results in other ways, such as
forcing readers to do maintenance work which might otherwise be done
in a cost-limited background process.  Or if that didn't happen you
would be constantly chasing through lots of dead line pointers which
would hurt performance in another way.  It's probably best to
consider autovacuum an integral part of normal database operations
and run benchmarks like this with it operational.  This will also
give you an opportunity to tune thresholds and costing factors to
evaluate the impact that such adjustments have on potential
workloads.

> For a fully cached database I would probably want to switch off
> HOT pruning and compaction (which from what we see is done
> synchronously with the select) and leave it up to the asynchronous
> auto vacuum to do that. But maybe I am still not quite
> understanding the performance implications.

Code comments indicate that they expect the pruning to be a pretty
clear win on multiple reads, although I don't know how much that was
benchmarked.  Jeff does raise a good point, though -- it seems odd
that WAL-logging of this pruning would need to be synchronous.  We
support asynchronous commits -- why not use that feature
automatically for transactions where the only writes are this sort
of thing?  Which raises an interesting question -- what happens to
the timings if your SELECTs are done with synchronous_commit = off?

I wonder if it would make any sense to implicitly use async commit
for a transaction which is declared READ ONLY or which never
acquires an XID?
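
That is, something like this, reusing the prepared statements x and y
from earlier in the thread in the same session:

```sql
-- Does the post-update SELECT still pay for a synchronous WAL flush
-- when the session isn't requesting synchronous commits?
SET synchronous_commit = off;
EXECUTE y('000000000000001', '000000000000001', '2011-6-30');
EXECUTE x('000000000000001', '2011-6-30');   -- time this one
```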

-Kevin

From:
Merlin Moncure
Date:

On Tue, Jul 12, 2011 at 6:15 PM, lars <> wrote:
> Back to the first case, here's an strace from the backend doing the select
> right after the updates.
> write(13,
> "f\320\1\0\1\0\0\0\273\0\0\0\0\340\27\22`\32\0\00000002833!000"..., 2400256)
> = 2400256

On Wed, Jul 13, 2011 at 9:46 AM, Kevin Grittner
<> wrote:
> Code comments indicate that they expect the pruning to be a pretty
> clear win on multiple reads, although I don't know how much that was
> benchmarked.  Jeff does raise a good point, though -- it seems odd
> that WAL-logging of this pruning would need to be synchronous.  We
> support asynchronous commits -- why not use that feature

Right -- here are my thoughts.  Notice the above is writing out 293
pages.  This suggests to me that Kevin is right and you've identified
a pattern that aggravates the page cleanup facilities of HOT.  What
threw me off here (and perhaps bears some additional investigation) is
that early on in the report you were claiming an update to an indexed
field, which effectively disables HOT.
The fairly lousy I/O performance of EBS is further hurting you here:
you have a very fast computer with lots of memory and a late-90s disk
system in terms of performance.  This means that AWS is not all that
great for simulating load profiles unless you are also highly
underweight on I/O in your servers.  Postgres, btw, demands (as does
Oracle) a decent I/O system for many workloads, which might be
surprising.
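
(Here's where the 293 comes from, btw -- a quick sanity check on the
strace numbers, assuming the default 8 kB WAL block size:)

```python
# Sanity check on the strace output quoted above: a single write() of
# 2,400,256 bytes to the WAL file, at the default 8 kB WAL block size.
WAL_BLOCK_SIZE = 8192       # XLOG_BLCKSZ default
bytes_written = 2400256     # from: write(13, "...", 2400256) = 2400256
print(bytes_written // WAL_BLOCK_SIZE)  # prints 293
```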

A note about HOT: there is no way to disable it (other than updating
an indexed field to bypass it) -- HOT was a performance revolution for
Postgres, and numerous benchmarks as well as anecdotal reports have
confirmed this.  HOT mitigates the impact of dead tuples by 1. greatly
reducing index bloat under certain conditions and 2. allowing dead
tuples to be cleaned up more aggressively -- a 'page-level vacuum', if
you will.  HOT is an especially huge win when updates are frequent and
transactions are small and short... but maybe in your simulated case
it's not helping. hm.

merlin

From:
Tom Lane
Date:

"Kevin Grittner" <> writes:
> ...  Jeff does raise a good point, though -- it seems odd
> that WAL-logging of this pruning would need to be synchronous.

Yeah, we need to get to the bottom of that.  If there's enough
shared_buffer space then it shouldn't be.

> We
> support asynchronous commits -- why not use that feature
> automatically for transactions where the only writes are this sort
> of thing.  Which raises an interesting question -- what happens to
> the timings if your SELECTs are done with synchronous_commit = off?
> I wonder if it would make any sense to implicitly use async commit
> for a transaction which is declared READ ONLY or which never
> acquires an XID?

Huh?  If there was never an XID, there's no commit WAL record, hence
nothing to make asynchronous.

            regards, tom lane

From:
lars
Date:

On 07/13/2011 08:17 AM, Tom Lane wrote:
> "Kevin Grittner"<>  writes:
>> ...  Jeff does raise a good point, though -- it seems odd
>> that WAL-logging of this pruning would need to be synchronous.
> Yeah, we need to get to the bottom of that.  If there's enough
> shared_buffer space then it shouldn't be.
This thread has gotten long, let me try to compile all the relevant
information in one email.

\d test
             Table "lars.test"
     Column    |     Type      | Modifiers
--------------+---------------+-----------
  tenant       | character(15) |
  created_by   | character(15) |
  created_date | date          |
Indexes:
     "i1" btree (tenant)
     "i11" btree (created_by)

-- Table is populated like this:
------------------------------------
create table test(tenant char(15), created_by char(15), created_date date);
insert into test values('x', 'y','2011-6-30');
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test; -- 256k rows
update test set tenant = lpad((random()*10000)::int::text,15,'0'),
created_by = lpad((random()*10000)::int::text,15,'0');
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test;
insert into test select * from test; -- 32m rows
create index i1 on test(tenant);
create index i11 on test(created_by);
vacuum analyze;

-- I doubt it needs that many rows.

=> SELECT
    'version'::text AS "name",
    version() AS "current_setting"
  UNION ALL
  SELECT
    name,current_setting(name)
  FROM pg_settings
  WHERE NOT source='default' AND NOT name IN
    ('config_file','data_directory','hba_file','ident_file',
    'log_timezone','DateStyle','lc_messages','lc_monetary',
    'lc_numeric','lc_time','timezone_abbreviations',
    'default_text_search_config','application_name',
    'transaction_deferrable','transaction_isolation',
    'transaction_read_only');

              name             |                current_setting
------------------------------+------------------------------------------------------------------
  version                      | PostgreSQL 9.1beta2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.4 20100726 (Red Hat 4.4.4-13), 64-bit
  bgwriter_delay               | 10ms
  bgwriter_lru_maxpages        | 1000
  checkpoint_completion_target | 0.9
  checkpoint_segments          | 128
  client_encoding              | UTF8
  effective_cache_size         | 64GB
  lc_collate                   | en_US.UTF-8
  lc_ctype                     | en_US.UTF-8
  log_checkpoints              | on
  log_line_prefix              | %m
  maintenance_work_mem         | 2GB
  max_connections              | 100
  max_stack_depth              | 2MB
  server_encoding              | UTF8
  shared_buffers               | 20GB
  TimeZone                     | UTC
  wal_buffers                  | 16MB
  work_mem                     | 1GB
(19 rows)


-- Now:
----------
=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
  count
-------
   3712
(1 row)

=> SELECT c.relname, isdirty, count(*) * 8192 / 1024/1024 AS buffers
FROM pg_buffercache b, pg_class c
WHERE b.relfilenode = pg_relation_filenode(c.oid)
AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname =
current_database()))
GROUP BY c.relname,isdirty
ORDER BY 3 DESC
LIMIT 6;

             relname            | isdirty | buffers
-------------------------------+---------+---------
  test                          | t       |      14
  pg_opclass_oid_index          | f       |       0
  pg_rewrite                    | f       |       0
  i11                           | t       |       0
  pg_rewrite_rel_rulename_index | f       |       0
  pg_constraint                 | f       |       0

-- Just started the server, so nothing else is cached yet

-- it doesn't matter if that update is executed by the same or another
backend.
=> update test set created_by = '000000000000001' where tenant =
'000000000000001';
UPDATE 3712
=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
  count
-------
   3712
(1 row)

strace now shows:
-------------------------
Q\0\0\0_select count(*) from test w"..., 8192, 0, NULL, NULL) = 96
gettimeofday({1310579341, 854669}, NULL) = 0
lseek(38, 0, SEEK_END)                  = 187465728
lseek(41, 0, SEEK_END)                  = 115564544
lseek(43, 0, SEEK_END)                  = 101040128
lseek(38, 0, SEEK_END)                  = 187465728
brk(0x1a85000)                          = 0x1a85000
brk(0x1a5d000)                          = 0x1a5d000
lseek(52, 12443648, SEEK_SET)           = 12443648
write(52,
"f\320\1\0\1\0\0\0\276\0\0\0\0\340\275\0016\0\0\0S\0\0\0\4\0cY\3\0\0\0"...,
122880) = 122880
fdatasync(52)                           = 0
semop(688135, {{9, 1, 0}}, 1)           = 0
gettimeofday({1310579341, 889163}, NULL) = 0
sendto(5,
"\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\1\0\0\0\0\0\0\0\376\200\0\0<\0\2\0"...,
232, 0, NULL, 0) = 232
sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66

-- fd 52 was opened earlier to the WAL file in pg_xlog.

=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
   count
-------
   3712
(1 row)

now strace shows:
-------------------------
Q\0\0\0_select count(*) from test w"..., 8192, 0, NULL, NULL) = 96
gettimeofday({1310579380, 862637}, NULL) = 0
lseek(38, 0, SEEK_END)                  = 187465728
lseek(41, 0, SEEK_END)                  = 115564544
lseek(43, 0, SEEK_END)                  = 101040128
lseek(38, 0, SEEK_END)                  = 187465728
gettimeofday({1310579380, 868149}, NULL) = 0
sendto(5,
"\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\1\0\0\0\0\0\0\0\376\200\0\0@\0\2\0"...,
232, 0, NULL, 0) = 232
sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66

=> update test set created_by = '000000000000001' where tenant =
'000000000000001';
UPDATE 3712
=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
   count
-------
   3712
(1 row)

strace again indicates that a WAL log is written by the backend during
the select:
---------------------------------------------------------------------------------------
Q\0\0\0_select count(*) from test w"..., 8192, 0, NULL, NULL) = 96
gettimeofday({1310579663, 890641}, NULL) = 0
lseek(38, 0, SEEK_END)                  = 187596800
lseek(41, 0, SEEK_END)                  = 115638272
lseek(43, 0, SEEK_END)                  = 101122048
lseek(38, 0, SEEK_END)                  = 187596800
brk(0x1a85000)                          = 0x1a85000
brk(0x1a5d000)                          = 0x1a5d000
semop(688135, {{3, -1, 0}}, 1)          = 0
lseek(52, 10223616, SEEK_SET)           = 10223616
write(52,
"f\320\1\0\1\0\0\0\276\0\0\0\0\0\234\2\16\0\0\0C(\0\0\1\0\0\0~\0\177\0"...,
16384) = 16384
fdatasync(52)                           = 0
gettimeofday({1310579663, 921932}, NULL) = 0
sendto(5,
"\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\1\0\0\0\0\0\0\0\376\200\0\0?\0\2\0"...,
232, 0, NULL, 0) = 232
sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66

=> explain (analyze on, buffers on) select count(*) from test where
tenant = '000000000000001' and created_date = '2011-6-30';
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=12284.65..12284.66 rows=1 width=0) (actual time=9.150..9.150 rows=1 loops=1)
    Buffers: shared hit=1976
    ->  Bitmap Heap Scan on test  (cost=91.78..12276.35 rows=3319 width=0) (actual time=2.338..6.866 rows=3712 loops=1)
          Recheck Cond: (tenant = '000000000000001'::bpchar)
          Filter: (created_date = '2011-06-30'::date)
          Buffers: shared hit=1976
          ->  Bitmap Index Scan on i1  (cost=0.00..90.95 rows=3319 width=0) (actual time=2.063..2.063 rows=15179 loops=1)
                Index Cond: (tenant = '000000000000001'::bpchar)
                Buffers: shared hit=98
  Total runtime: 9.206 ms
(10 rows)

=> update test set created_by = '000000000000001' where tenant =
'000000000000001';
UPDATE 3712
=> explain (analyze on, buffers on) select count(*) from test where
tenant = '000000000000001' and created_date = '2011-6-30';
                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=12284.68..12284.69 rows=1 width=0) (actual time=30.738..30.739 rows=1 loops=1)
    Buffers: shared hit=2001
    ->  Bitmap Heap Scan on test  (cost=91.78..12276.38 rows=3319 width=0) (actual time=2.589..28.361 rows=3712 loops=1)
          Recheck Cond: (tenant = '000000000000001'::bpchar)
          Filter: (created_date = '2011-06-30'::date)
          Buffers: shared hit=2001
          ->  Bitmap Index Scan on i1  (cost=0.00..90.95 rows=3319 width=0) (actual time=2.301..2.301 rows=17123 loops=1)
                Index Cond: (tenant = '000000000000001'::bpchar)
                Buffers: shared hit=107
  Total runtime: 30.785 ms
(10 rows)

----

There definitely seems to be something funky going on. Since created_by
is indexed, the update shouldn't be using the HOT logic at all.
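
One way I could check that directly is pg_stat_user_tables (sketch;
the counters are cumulative, so they'd need to be compared before and
after the update):

```sql
-- If the update is HOT, n_tup_hot_upd advances along with n_tup_upd;
-- if the indexed column changed (non-HOT), only n_tup_upd moves.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'test';
```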

Is there any other information that I can provide? I'm happy to
recompile with a patch applied, etc.

Thanks.

-- Lars


From:
lars
Date:

On 07/13/2011 07:46 AM, Kevin Grittner wrote:
>
> I've mentioned this in a hand-wavy general sense, but I should have
> mentioned specifics ages ago: for a database where the active
> portion of the database is fully cached, it is best to set
> seq_page_cost and random_page_cost to the same value, somewhere in
> the 0.1 to 0.05 range.  (In your case I would use 0.05.)  In highly
> cached databases I have sometimes also found it necessary to
> increase cpu_tuple_cost.  (In your case I might try 0.02.)
>
I've been doing that for other tests already (I didn't want to add too
many variations here).
The Bitmap Heap Scans through the table are only useful for spinning
media, not for the cache (just to state the obvious).

As an aside: I found that queries in a cold database take almost twice
as long when I make that change, so for spinning media this is very
important.

> Which raises an interesting question -- what happens to
> the timings if your SELECTs are done with synchronous_commit = off?

Just tried that...
In that case the WAL is still written (as seen via iostat), but not
synchronously by the transaction (as seen by strace).

-- Lars


From:
"Kevin Grittner"
Date:

lars <> wrote:
> On 07/13/2011 07:46 AM, Kevin Grittner wrote:
>>
>> I've mentioned this in a hand-wavy general sense, but I should
>> have mentioned specifics ages ago: for a database where the
>> active portion of the database is fully cached, it is best to set
>> seq_page_cost and random_page_cost to the same value, somewhere
>> in the 0.1 to 0.05 range.  (In your case I would use 0.05.)  In
>> highly cached databases I have sometimes also found it necessary
>> to increase cpu_tuple_cost.  (In your case I might try 0.02.)
>>
> I've been doing that for other tests already (I didn't want to add
> too many variations here).
> The Bitmap Heap scans through the table are only useful for
> spinning media and not the cache (just to state the obvious).
>
> As an aside: I found that queries in a cold database take almost
> twice as long when I make that change,
> so for spinning media this is very important.

No doubt.  We normally run months to years between reboots, with
most of our cache at the OS level.  We don't have much reason to
ever restart PostgreSQL except to install new versions.  So we don't
worry overly much about the cold cache scenario.

>> Which raises an interesting question -- what happens to the
>> timings if your SELECTs are done with synchronous_commit = off?
>
> Just tried that...
> In that case the WAL is still written (as seen via iostat), but
> not synchronously by the transaction (as seen by strace).

So transactions without an XID *are* sensitive to
synchronous_commit.  That's likely a useful clue.

How much did it help the run time of the SELECT which followed the
UPDATE?

-Kevin

From:
Merlin Moncure
Date:

On Wed, Jul 13, 2011 at 1:10 PM, lars <> wrote:
> On 07/13/2011 08:17 AM, Tom Lane wrote:
>>
>> "Kevin Grittner"<>  writes:
>>>
>>> ...  Jeff does raise a good point, though -- it seems odd
>>> that WAL-logging of this pruning would need to be synchronous.
>>
>> Yeah, we need to get to the bottom of that.  If there's enough
>> shared_buffer space then it shouldn't be.
>
> This thread has gotten long, let me try to compile all the relevant
> information in one email.
>
> \d test
>            Table "lars.test"
>    Column    |     Type      | Modifiers
> --------------+---------------+-----------
>  tenant       | character(15) |
>  created_by   | character(15) |
>  created_date | date          |

small aside here: try to avoid use of the character(n) type -- varchar(n)
is superior in every way, including performance (although that has
nothing to do with your WAL issues on this thread).

merlin

From:
"Kevin Grittner"
Date:

Tom Lane <> wrote:

> Huh?  If there was never an XID, there's no commit WAL record,
> hence nothing to make asynchronous.

If you look at the RecordTransactionCommit() function in xact.c
you'll see that's not correct.  Currently the commit record has
nothing to do with whether it synchronizes on WAL writes.  In
particular, this section around line 1096 is where the choice is
made:

    if ((wrote_xlog && synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
        || forceSyncCommit || nrels > 0)
    {
        /*
         * Synchronous commit case:

In the OP's test case, wrote_xlog is true, while forceSyncCommit is
false and nrels == 0.

It doesn't seem like commit of a read-only transaction should be a
magical time for pruning WAL entries to hit the disk, so it probably
would work to modify that "if" to not drop into the synchronous
commit code if the transaction is explicitly declared READ ONLY or
if it never acquired an XID; although there's likely some better way
to deal with it.
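In rough terms the change could be sketched like this (a standalone
illustration only: the function name, the simplified enum, and the
have_xid parameter are stand-ins for the real xact.c state, not the
actual interface):

```c
#include <stdbool.h>

/* Simplified model of the flush decision in RecordTransactionCommit().
 * SyncCommitLevel and the parameter list are illustrative stand-ins. */
typedef enum { SYNC_COMMIT_OFF, SYNC_COMMIT_ON } SyncCommitLevel;

static bool
must_flush_wal_at_commit(bool wrote_xlog, SyncCommitLevel sync_commit,
                         bool force_sync_commit, int nrels, bool have_xid)
{
    /* Proposed tweak: a transaction that never acquired an XID only
     * wrote cleanup records (e.g. HOT pruning), so it need not wait
     * for the WAL flush at commit. */
    if (!have_xid && !force_sync_commit && nrels == 0)
        return false;

    /* Current logic, per the excerpt around line 1096 of xact.c. */
    return (wrote_xlog && sync_commit > SYNC_COMMIT_OFF)
           || force_sync_commit || nrels > 0;
}
```

With the OP's values (wrote_xlog true, forceSyncCommit false, nrels 0,
no XID), the sketch skips the synchronous path that the current code
takes.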

-Kevin

From:
Tom Lane
Date:

"Kevin Grittner" <> writes:
> Tom Lane <> wrote:
>> Huh?  If there was never an XID, there's no commit WAL record,
>> hence nothing to make asynchronous.

> If you look at the RecordTransactionCommit() function in xact.c
> you'll see that's not correct.

Oh, hmmm ... that code was written with the idea that things like
sequence XLOG_SEQ_LOG records ought to be flushed to disk before
reporting commit; otherwise you don't have a guarantee that the same
sequence value couldn't be handed out again after crash/restart,
in a transaction that just does something like
    SELECT nextval('myseq');
without any updates of regular tables.

It seems like we ought to distinguish heap cleanup activities from
user-visible semantics (IOW, users shouldn't care if a HOT cleanup has
to be done over after restart, so if the transaction only wrote such
records there's no need to flush).  This'd require more process-global
state than we keep now, I'm afraid.

Another approach we could take (also nontrivial) is to prevent
select-only queries from doing HOT cleanups.  You said upthread that
there were alleged performance benefits from aggressive cleanup, but
IMO that can charitably be described as unproven.  The real reason it
happens is that we didn't see a simple way for page fetches to know soon
enough whether a tuple update would be likely to happen later, so they
just do cleanups unconditionally.

            regards, tom lane

From:
lars
Date:

On 07/13/2011 11:42 AM, Kevin Grittner wrote:
> So transactions without an XID *are* sensitive to
> synchronous_commit.  That's likely a useful clue.
>
> How much did it help the run time of the SELECT which followed the
> UPDATE?

It has surprisingly little impact on the SELECT side:

=> set synchronous_commit = on;
=> update test set created_by = '000000000000001' where tenant =
'000000000000001';
UPDATE 3712
Time: 384.702 ms
lars=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
  count
-------
   3712
(1 row)

Time: 36.571 ms
=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
  count
-------
   3712
(1 row)

Time: 5.702 ms
=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
  count
-------
   3712
(1 row)

Time: 5.822 ms
=> set synchronous_commit = off;
SET
Time: 0.145 ms
=> update test set created_by = '000000000000001' where tenant =
'000000000000001';
UPDATE 3712
Time: 96.227 ms
=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
  count
-------
   3712
(1 row)

Time: 32.422 ms
=> select count(*) from test where tenant = '000000000000001' and
created_date = '2011-6-30';
  count
-------
   3712
(1 row)

Time: 6.080 ms

I tried it multiple times, and while the numbers change by 5-10ms the
relationship is the same.

The same results show up when I use my JDBC code to run updates/selects
as fast as possible. When synchronous_commit is
off for the SELECTing process it seems to be slightly faster.

-- Lars


From:
"Kevin Grittner"
Date:

Tom Lane <> wrote:

> It seems like we ought to distinguish heap cleanup activities from
> user-visible semantics (IOW, users shouldn't care if a HOT cleanup
> has to be done over after restart, so if the transaction only
> wrote such records there's no need to flush).  This'd require more
> process-global state than we keep now, I'm afraid.

That makes sense, and seems like the right long-term fix.  It seems
like a boolean might do it; the trick would be setting it (or not)
in all the right places.

> Another approach we could take (also nontrivial) is to prevent
> select-only queries from doing HOT cleanups.  You said upthread
> that there were alleged performance benefits from aggressive
> cleanup, but IMO that can charitably be described as unproven.
> The real reason it happens is that we didn't see a simple way for
> page fetches to know soon enough whether a tuple update would be
> likely to happen later, so they just do cleanups unconditionally.

Hmm.  One trivial change could be to skip it when the top level
transaction is declared to be READ ONLY.  At least that would give
people a way to work around it for now.  Of course, that can't be
back-patched before 9.1 because subtransactions could override READ
ONLY before that.
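The workaround would amount to one extra gate in the pruning decision;
a minimal standalone sketch (the function and parameter names here are
invented for illustration, not the actual pruneheap.c code):

```c
#include <stdbool.h>

#define BLCKSZ 8192  /* default PostgreSQL block size */

static int imax(int a, int b) { return a > b ? a : b; }

/* Sketch of the pruning gate with the proposed READ ONLY check
 * bolted on in front of the existing heuristic. */
static bool
should_attempt_prune(bool xact_read_only, bool page_is_full,
                     int page_free_space, int fillfactor_target)
{
    /* Proposed workaround: a top-level READ ONLY transaction skips
     * HOT cleanup entirely, leaving it to a later writer or vacuum. */
    if (xact_read_only)
        return false;

    /* Existing heuristic: prune when a prior UPDATE ran out of room,
     * or free space fell below the fill-factor target (min 10%). */
    int minfree = imax(fillfactor_target, BLCKSZ / 10);
    return page_is_full || page_free_space < minfree;
}
```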

-Kevin

From:
Robert Klemme
Date:

On Thu, Jul 14, 2011 at 4:05 PM, Kevin Grittner
<> wrote:
> Tom Lane <> wrote:
>
>> It seems like we ought to distinguish heap cleanup activities from
>> user-visible semantics (IOW, users shouldn't care if a HOT cleanup
>> has to be done over after restart, so if the transaction only
>> wrote such records there's no need to flush).  This'd require more
>> process-global state than we keep now, I'm afraid.
>
> That makes sense, and seems like the right long-term fix.  It seems
> like a boolean might do it; the trick would be setting it (or not)
> in all the right places.

I also believe this is the right way to go.  I think the crucial thing
is in "distinguish heap cleanup activities from user-visible
semantics" - basically this is what auto vacuum does: it performs
work concurrently that you do not want to burden user transactions with.

>> Another approach we could take (also nontrivial) is to prevent
>> select-only queries from doing HOT cleanups.  You said upthread
>> that there were alleged performance benefits from aggressive
>> cleanup, but IMO that can charitably be described as unproven.
>> The real reason it happens is that we didn't see a simple way for
>> page fetches to know soon enough whether a tuple update would be
>> likely to happen later, so they just do cleanups unconditionally.
>
> Hmm.  One trivial change could be to skip it when the top level
> transaction is declared to be READ ONLY.  At least that would give
> people a way to work around it for now.  Of course, that can't be
> back-patched before 9.1 because subtransactions could override READ
> ONLY before that.

What I don't like about this approach is that it a) increases
complexity for the user, b) might not work for everyone (i.e. tools
like OR mappers which do not allow such a setting on the TX, or cases
where you do not know what type of TX it is when you start it), and c)
it still keeps the performance penalty around to suddenly come haunt a
different TX.

I can only speculate whether the latter might actually cause other
people to run into issues because their usage patterns currently force
the cleanout activities into an unimportant TX while the workaround
would suddenly have the cleanout delay show up in an important TX
which used to be fast.  This is also hard to debug since you would
normally only look at the slow TX before you realize you need to look
elsewhere (the length of this thread is kind of proof of this already
:-)).

My 0.01 EUR...

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

From:
Tom Lane
Date:

"Kevin Grittner" <> writes:
> Tom Lane <> wrote:
>> It seems like we ought to distinguish heap cleanup activities from
>> user-visible semantics (IOW, users shouldn't care if a HOT cleanup
>> has to be done over after restart, so if the transaction only
>> wrote such records there's no need to flush).  This'd require more
>> process-global state than we keep now, I'm afraid.

> That makes sense, and seems like the right long-term fix.  It seems
> like a boolean might do it; the trick would be setting it (or not)
> in all the right places.

The implementation I was imagining was to define another bit in the info
parameter for XLogInsert, say XLOG_NON_TRANSACTIONAL.  This could be a
high-order bit that would not go to disk.  Anytime it was *not* set,
XLogInsert would set a global boolean that would remember that the
current transaction wrote a transactional WAL record.  This is the
right default since the vast majority of call sites are writing records
that we would want to have flushed at commit.  There are just a couple
of places that would need to be changed to add this flag to their calls.
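A toy model of that scheme (the flag value and names are invented for
illustration; the real info-byte layout in XLogInsert differs):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: a high-order info bit that is stripped before the record
 * goes to disk; its *absence* marks the transaction as having written
 * a transactional WAL record that must be flushed at commit. */
#define XLOG_NON_TRANSACTIONAL 0x80  /* invented value for illustration */

static bool wrote_transactional_wal = false;  /* per-transaction global */

static uint8_t
xlog_insert_info(uint8_t info)
{
    if ((info & XLOG_NON_TRANSACTIONAL) == 0)
        wrote_transactional_wal = true;  /* the default, per Tom's note */

    /* The flag bit itself never reaches disk. */
    return info & (uint8_t) ~XLOG_NON_TRANSACTIONAL;
}
```

At commit time, the flush decision would then consult
wrote_transactional_wal instead of a bare wrote_xlog.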

            regards, tom lane

From:
Jeff Janes
Date:

On Wed, Jul 13, 2011 at 11:10 AM, lars <> wrote:

...

> => update test set created_by = '000000000000001' where tenant =
> '000000000000001';
> UPDATE 3712
...
>
> There seems to be definitely something funky going on. Since created_by is
> indexed it shouldn't do any HOT logic.

Once the update has been run once, further executions are degenerate
(they replace the updated indexed column with the same value it
already holds).  The HOT code detects this and uses a HOT update in
this case despite the apparent update of an indexed column.
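That check can be modeled like this (a deliberate simplification: the
real code compares datums with type-aware equality inside heapam.c,
and the names below are invented):

```c
#include <stdbool.h>
#include <string.h>

/* Simplified model of the HOT applicability test: an UPDATE can be
 * HOT as long as every indexed column's new value equals its old
 * value -- which is exactly what a degenerate "SET col = same value"
 * update produces. */
static bool
hot_update_possible(const char *old_indexed[], const char *new_indexed[],
                    int nindexed)
{
    for (int i = 0; i < nindexed; i++)
        if (strcmp(old_indexed[i], new_indexed[i]) != 0)
            return false;  /* an indexed column really changed */
    return true;
}
```

So the first run of the UPDATE is a regular update, but every repeat
with the same literal value goes through the HOT path.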


Cheers,

Jeff

From:
Jeff Janes
Date:

On Wed, Jul 13, 2011 at 3:41 PM, lars <> wrote:
> On 07/13/2011 11:42 AM, Kevin Grittner wrote:
>>
>> So transactions without an XID *are* sensitive to
>> synchronous_commit.  That's likely a useful clue.
>>
>> How much did it help the run time of the SELECT which followed the
>> UPDATE?
>
> It has surprisingly little impact on the SELECT side:

If your fsync is truly fsyncing, it seems like it should have
considerable effect.

Could you strace with both -ttt and -T, with and without synchronous commit?

Cheers,

Jeff

From:
lars
Date:

On 07/14/2011 04:03 PM, Jeff Janes wrote:
> On Wed, Jul 13, 2011 at 3:41 PM, lars<>  wrote:
>> On 07/13/2011 11:42 AM, Kevin Grittner wrote:
>>> So transactions without an XID *are* sensitive to
>>> synchronous_commit.  That's likely a useful clue.
>>>
>>> How much did it help the run time of the SELECT which followed the
>>> UPDATE?
>> It has surprisingly little impact on the SELECT side:
> If your fsync is truly fsyncing, it seems like it should have
> considerable effect.
>
> Could you strace with both -ttt and -T, with and without synchronous commit?
>
> Cheers,
>
> Jeff
Ok, here we go:

"Q\0\0\0_select count(*) from test w"..., 8192, 0, NULL, NULL) = 96
<5.357152>
1310774187.750791 gettimeofday({1310774187, 750809}, NULL) = 0 <0.000022>
1310774187.751023 lseek(12, 0, SEEK_END) = 329908224 <0.000023>
1310774187.751109 lseek(15, 0, SEEK_END) = 396607488 <0.000022>
1310774187.751186 lseek(18, 0, SEEK_END) = 534175744 <0.000022>
1310774187.751360 lseek(12, 0, SEEK_END) = 329908224 <0.000023>
1310774187.753389 brk(0x248e000)        = 0x248e000 <0.000026>
1310774187.753953 brk(0x24ce000)        = 0x24ce000 <0.000023>
1310774187.755158 brk(0x254e000)        = 0x254e000 <0.000024>
1310774187.766605 brk(0x2450000)        = 0x2450000 <0.000170>
1310774187.766852 lseek(23, 4513792, SEEK_SET) = 4513792 <0.000023>
1310774187.766927 write(23,
"f\320\1\0\1\0\0\0\320\0\0\0\0\340D-\22\0\0\0\30@!000000000"..., 32768)
= 32768 <0.000075>
1310774187.767071 fdatasync(23)         = 0 <0.002618>
1310774187.769760 gettimeofday({1310774187, 769778}, NULL) = 0 <0.000022>
1310774187.769848 sendto(5,
"\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\1\0\0\0\0\0\0\0\r\201\0\0>\0\2\0"...,
232, 0, NULL, 0) = 232 <0.000064>
1310774187.769993 sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66 <0.000199>

(23 is the WAL fd) vs.

"Q\0\0\0_select count(*) from test w"..., 8192, 0, NULL, NULL) = 96
<7.343720>
1310774306.978767 gettimeofday({1310774306, 978785}, NULL) = 0 <0.000021>
1310774306.978994 lseek(12, 0, SEEK_END) = 330883072 <0.000024>
1310774306.979080 lseek(15, 0, SEEK_END) = 397131776 <0.000021>
1310774306.979157 lseek(18, 0, SEEK_END) = 534732800 <0.000022>
1310774306.979332 lseek(12, 0, SEEK_END) = 330883072 <0.000022>
1310774306.983096 brk(0x248e000)        = 0x248e000 <0.000026>
1310774306.983653 brk(0x24ce000)        = 0x24ce000 <0.000023>
1310774306.984667 brk(0x254e000)        = 0x254e000 <0.000023>
1310774306.996040 brk(0x2450000)        = 0x2450000 <0.000168>
1310774306.996298 gettimeofday({1310774306, 996317}, NULL) = 0 <0.000021>
1310774306.996388 sendto(5,
"\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\1\0\0\0\0\0\0\0\r\201\0\0>\0\2\0"...,
232, 0, NULL, 0) = 232 <0.000078>
1310774306.996550 sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66 <0.000202>

So the difference is only 2ms, and the size of the WAL write was only
32k.

Here's an example with more dirty rows (I basically let the updater run
for a while dirtying very many rows).

"Q\0\0\0_select count(*) from test w"..., 8192, 0, NULL, NULL) = 96
<23.690018>
1310775141.398780 gettimeofday({1310775141, 398801}, NULL) = 0 <0.000028>
1310775141.399018 lseek(12, 0, SEEK_END) = 372514816 <0.000023>
1310775141.399105 lseek(15, 0, SEEK_END) = 436232192 <0.000022>
1310775141.399185 lseek(18, 0, SEEK_END) = 573620224 <0.000023>
1310775141.399362 lseek(12, 0, SEEK_END) = 372514816 <0.000024>
1310775141.414017 brk(0x2490000)        = 0x2490000 <0.000028>
1310775141.414575 brk(0x24d0000)        = 0x24d0000 <0.000025>
1310775141.415600 brk(0x2550000)        = 0x2550000 <0.000024>
1310775141.417757 semop(229383, {{0, -1, 0}}, 1) = 0 <0.000024>
...
1310775141.448998 semop(229383, {{0, -1, 0}}, 1) = 0 <0.000025>
1310775141.453134 brk(0x2452000)        = 0x2452000 <0.000167>
1310775141.453377 fadvise64(22, 0, 0, POSIX_FADV_DONTNEED) = 0 <0.000025>
1310775141.453451 close(22)             = 0 <0.000032>
1310775141.453537 open("pg_xlog/00000001000000D1000000C2", O_RDWR) = 22
<0.000059>
1310775141.453696 write(22,
"f\320\3\0\1\0\0\0\321\0\0\0\0\0\0\3023\356\17N\23l\vN\0\0\0\1\0
\0\0"..., 5365760) = 5365760 <0.005991>
1310775141.459798 write(22,
"f\320\1\0\1\0\0\0\321\0\0\0\0\340Q\302`\5\0\00000000915!000"...,
9019392) = 9019392 <0.010062>
1310775141.469965 fdatasync(22)         = 0 <0.231385>
1310775141.701424 semop(229383, {{2, 1, 0}}, 1) = 0 <0.000031>
1310775141.702657 gettimeofday({1310775141, 702682}, NULL) = 0 <0.000028>
1310775141.702765 sendto(5,
"\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\1\0\0\0\0\0\0\0\r\201\0\0>\0\2\0"...,
232, 0, NULL, 0) = 232 <0.000071>
1310775141.702942 sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66 <0.000220>

vs

"Q\0\0\0_select count(*) from test w"..., 8192, 0, NULL, NULL) = 96
<55.595425>
1310775406.842823 gettimeofday({1310775406, 842842}, NULL) = 0 <0.000026>
1310775406.843092 lseek(12, 0, SEEK_END) = 382787584 <0.000023>
1310775406.843179 lseek(15, 0, SEEK_END) = 457596928 <0.000042>
1310775406.843280 lseek(18, 0, SEEK_END) = 594968576 <0.000023>
1310775406.843459 lseek(12, 0, SEEK_END) = 382787584 <0.000022>
1310775406.860266 brk(0x2490000)        = 0x2490000 <0.000046>
1310775406.860968 brk(0x24d0000)        = 0x24d0000 <0.000095>
1310775406.862449 brk(0x2550000)        = 0x2550000 <0.000112>
1310775406.865095 semop(229383, {{2, -1, 0}}, 1) = 0 <0.111698>
...
1310775407.027235 semop(229383, {{2, -1, 0}}, 1) = 0 <0.000039>
1310775407.027503 semop(229383, {{2, -1, 0}}, 1) = 0 <2.215137>
1310775409.243291 semop(229383, {{1, 1, 0}}, 1) = 0 <0.000029>
...
1310775409.246963 semop(229383, {{2, -1, 0}}, 1) = 0 <0.000024>
1310775409.252029 brk(0x2452000)        = 0x2452000 <0.000168>
1310775409.252288 gettimeofday({1310775409, 252307}, NULL) = 0 <0.000021>
1310775409.252393 sendto(5,
"\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\2\0\0\0\0\0\0\0\r\201\0\0>\0\2\0"...,
232, 0, NULL, 0) = 232 <0.000078>
1310775409.252557 sendto(6,
"T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
66, 0, NULL, 0) = 66 <0.000201>

No WAL, but check out that one expensive semop! 2s!!

-- Lars


From:
Jeff Janes
Date:

On Fri, Jul 15, 2011 at 5:21 PM, lars <> wrote:
> On 07/14/2011 04:03 PM, Jeff Janes wrote:
>>
>> On Wed, Jul 13, 2011 at 3:41 PM, lars<>  wrote:
>>>
>>> On 07/13/2011 11:42 AM, Kevin Grittner wrote:
>>>>
>>>> So transactions without an XID *are* sensitive to
>>>> synchronous_commit.  That's likely a useful clue.
>>>>
>>>> How much did it help the run time of the SELECT which followed the
>>>> UPDATE?
>>>
>>> It has surprisingly little impact on the SELECT side:
>>
>> If your fsync is truly fsyncing, it seems like it should have
>> considerable effect.
>>
>> Could you strace with both -ttt and -T, with and without synchronous
>> commit?
>>
>> Cheers,
>>
>> Jeff
>
> Ok, here we go:
>
> "Q\0\0\0_select count(*) from test w"..., 8192, 0, NULL, NULL) = 96
> <5.357152>
> 1310774187.750791 gettimeofday({1310774187, 750809}, NULL) = 0 <0.000022>
> 1310774187.751023 lseek(12, 0, SEEK_END) = 329908224 <0.000023>
> 1310774187.751109 lseek(15, 0, SEEK_END) = 396607488 <0.000022>
> 1310774187.751186 lseek(18, 0, SEEK_END) = 534175744 <0.000022>
> 1310774187.751360 lseek(12, 0, SEEK_END) = 329908224 <0.000023>
> 1310774187.753389 brk(0x248e000)        = 0x248e000 <0.000026>
> 1310774187.753953 brk(0x24ce000)        = 0x24ce000 <0.000023>
> 1310774187.755158 brk(0x254e000)        = 0x254e000 <0.000024>
> 1310774187.766605 brk(0x2450000)        = 0x2450000 <0.000170>
> 1310774187.766852 lseek(23, 4513792, SEEK_SET) = 4513792 <0.000023>
> 1310774187.766927 write(23,
> "f\320\1\0\1\0\0\0\320\0\0\0\0\340D-\22\0\0\0\30@!000000000"..., 32768) =
> 32768 <0.000075>
> 1310774187.767071 fdatasync(23)         = 0 <0.002618>
> 1310774187.769760 gettimeofday({1310774187, 769778}, NULL) = 0 <0.000022>
> 1310774187.769848 sendto(5,
> "\2\0\0\0\350\0\0\0\1@\0\0\2\0\0\0\1\0\0\0\0\0\0\0\r\201\0\0>\0\2\0"...,
> 232, 0, NULL, 0) = 232 <0.000064>
> 1310774187.769993 sendto(6,
> "T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"...,
> 66, 0, NULL, 0) = 66 <0.000199>

The total time for this is about 19 ms, but your previous example was
around 35 ms.  Is this reproducible?  A change of setup between then
and now?

2.6 ms for an fsync seems awfully quick.  I wonder if EBS uses
nonvolatile/battery-backed write cache, or if it just lies about fsync
actually hitting disk.

But anyway it looks like you aren't blocking much in system calls, and
I don't think there are non-system-call ways to block, so the time is
probably being spent in something CPU intensive.

On my (physical) computer, synchronous_commit=off does eliminate the
timing differences between the select immediately after the update and
subsequent selects.  So while I could reproduce timing differences
that were superficially similar to yours, they seem to have some
fundamentally different cause.

Maybe the best way to figure out what is going on is to loop the
update and the select in different processes, and use perf or oprof to
profile just the select process (with and without the update running).
 It would also be good to know the timings without profiling turned on
as well, to know how much the profiling is disturbing the timing.

...

> Here's an example with more dirty rows (I basically let the updater run for
> a while dirtying very many rows).

I'm surprised that makes that much of a difference.  The select should
only clean up blocks it actually visits, and updating more rows
shouldn't change that very much.

...
> 1310775407.027503 semop(229383, {{2, -1, 0}}, 1) = 0 <2.215137>
...
> No WAL, but checkout that one expensive semop! 2s!!

Is that reproducible, or is it just a one time anomaly?

Cheers,

Jeff

From:
lars
Date:

On 07/14/2011 08:47 AM, Tom Lane wrote:
>
> The implementation I was imagining was to define another bit in the info
> parameter for XLogInsert, say XLOG_NON_TRANSACTIONAL.  This could be a
> high-order bit that would not go to disk.  Anytime it was *not* set,
> XLogInsert would set a global boolean that would remember that the
> current transaction wrote a transactional WAL record.  This is the
> right default since the vast majority of call sites are writing records
> that we would want to have flushed at commit.  There are just a couple
> of places that would need to be changed to add this flag to their calls.
>
>             regards, tom lane
>

If you have a patch in mind I'm happy to test it on my setup and report
back.

-- Lars


From:
Greg Smith
Date:

On 07/16/2011 06:33 PM, Jeff Janes wrote:
> 2.6 ms for an fsync seems awfully quick.  I wonder if EBS uses
> nonvolatile/battery-backed write cache, or if it just lies about fsync
> actually hitting disk.
>

They have the right type of cache in there to make fsync quick, when you
happen to be the lucky one to find it free of a write backlog.  So the
best case is much better than a typical spinning drive with no such
cache.  The worst case is in the 100ms+ range though on EBS.

--
Greg Smith   2ndQuadrant US       Baltimore, MD



From:
Pavan Deolasee
Date:

On Wed, Jul 13, 2011 at 10:52 AM, Merlin Moncure <> wrote:
> On Tue, Jul 12, 2011 at 6:15 PM, lars <> wrote:
>> Back to the first case, here's an strace from the backend doing the select
>> right after the updates.
>> write(13,
>> "f\320\1\0\1\0\0\0\273\0\0\0\0\340\27\22`\32\0\00000002833!000"..., 2400256)
>> = 2400256
>
> On Wed, Jul 13, 2011 at 9:46 AM, Kevin Grittner
> <> wrote:
>> Code comments indicate that they expect the pruning to be a pretty
>> clear win on multiple reads, although I don't know how much that was
>> benchmarked.  Jeff does raise a good point, though -- it seems odd
>> that WAL-logging of this pruning would need to be synchronous.  We
>> support asynchronous commits -- why not use that feature
>
> Right -- here are my thoughts.  notice the above is writing out 293
> pages. this is suggesting to me that Kevin is right and you've
> identified a pattern where you are aggravating the page cleanup
> facilities of HOT.   What threw me off here (and perhaps bears some
> additional investigation) is that early on in the report you were
> claiming an update to an indexed field which effectively disables HOT.

There are a couple of other (very important) things that HOT does, but
they are probably not advertised a lot. Even for non-HOT updates (which
means either indexed columns were changed or the page ran out of free
space) or deletes, HOT prunes those tuples and instead marks the line
pointer as DEAD. The page is defragmented and the dead space is recovered.
Each such dead tuple then consumes only two bytes in the page until
vacuum removes the dead line pointers. That's the reason the OP is
seeing this behavior even when indexed columns are being updated.

We made a few adjustments to ensure that a page is not pruned too
early. So we track the oldest XID that did any updates/deletes to the
page and attempt pruning only when the RecentXmin is past the XID. We
also mark the page as "full" if some previous update did not find
enough free space to do in-block update and use that hint to decide if
we should attempt to prune the page. Finally, we prune only if we get
the cleanup lock without blocking.

What might be worth looking at is this condition in pruneheap.c:

/*
     * We prune when a previous UPDATE failed to find enough space on the page
     * for a new tuple version, or when free space falls below the relation's
     * fill-factor target (but not less than 10%).
     *
     * Checking free space here is questionable since we aren't holding any
     * lock on the buffer; in the worst case we could get a bogus answer. It's
     * unlikely to be *seriously* wrong, though, since reading either pd_lower
     * or pd_upper is probably atomic.  Avoiding taking a lock seems more
     * important than sometimes getting a wrong answer in what is after all
     * just a heuristic estimate.
     */
    minfree = RelationGetTargetPageFreeSpace(relation,
                                             HEAP_DEFAULT_FILLFACTOR);
    minfree = Max(minfree, BLCKSZ / 10);

    if (PageIsFull(page) || PageGetHeapFreeSpace(page) < minfree)
    {


So if the free space in a page falls below the fill-factor or 10% of
the block size, we would try to prune the page. We probably need to
revisit this area and see if we can tune HOT even better. One
option could be to see how much space we are going to free and carry
out the operation only if it's significant enough to justify the cost.
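That idea could look roughly like this (an illustrative sketch only;
estimated_reclaim and reclaim_threshold are invented names, not part
of pruneheap.c):

```c
#include <stdbool.h>

#define BLCKSZ 8192  /* default PostgreSQL block size */

static int max_i(int a, int b) { return a > b ? a : b; }

/* Sketch: keep the existing minfree trigger, but additionally require
 * that pruning would actually recover a meaningful amount of space
 * before paying the cost (defragmentation plus a WAL record). */
static bool
should_prune(bool page_is_full, int free_space, int fillfactor_target,
             int estimated_reclaim, int reclaim_threshold)
{
    int minfree = max_i(fillfactor_target, BLCKSZ / 10);

    if (!(page_is_full || free_space < minfree))
        return false;                   /* existing heuristic says no */

    /* Proposed extra gate: skip cheap-to-skip, low-payoff prunes. */
    return estimated_reclaim >= reclaim_threshold;
}
```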

I know we did several benchmarking tests during HOT development,
but the tuning mechanism still may not be perfect for all kinds of
workloads, and it probably never will be.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

From:
lars hofhansl
Date:

Thanks Pavan!


I think the most important points are still that:
1. The WAL write should happen asynchronously (if that is possible).
2. There should be an option to not perform these compactions if the page is only touched by reads.

(Assuming that when most of the database resides in the cache these optimizations are less important.)


-- Lars


----- Original Message -----
From: Pavan Deolasee <>
To: Merlin Moncure <>
Cc: lars <>; Kevin Grittner <>; Ivan Voras <>;

Sent: Wednesday, July 27, 2011 7:15 AM
Subject: Re: [PERFORM] UPDATEDs slowing SELECTs in a fully cached database

On Wed, Jul 13, 2011 at 10:52 AM, Merlin Moncure <> wrote:
...



From:
Jeff Janes
Date:

On 7/29/11, lars hofhansl <> wrote:
> Thanks Pavan!
>
>
> I think the most important points are still that:
> 1. The WAL write should be happening asynchronously (if that is possible)

I think it is agreed that this is a "todo"; but since you reported
that turning off synchronous commit did not improve performance, it is
not directly relevant for you.


> 2. There should be an option do not perform these compactions if the page is
> only touched by reads.

If the page is only touched by reads, there would be nothing to compact.

Cheers,

Jeff