Thread: I/O on select count(*)

I/O on select count(*)

From
Doug Eck
Date:
I have a large table (~ 2B rows) that contains an indexed timestamp column.  I am attempting to run a query to determine the number of rows for a given day using something like "select count(*) from tbl1 where ts between '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'".  Explain tells me that the query will be done using an index scan (as I would expect), and I realize that it is going to take a while.  My question concerns some unusual I/O activity on the box (SUSE)  when I run the query.

For the first couple of minutes I see reads only.  After that vmstat shows mixed reads and writes in a ratio of about 1 block read to 5 blocks written.  We have determined that files in our data and log partitions are being hit, but the file system itself is not growing during this time (it appears to be writing over the same chunk of space over and over again).  Memory on the box is not being swapped while all of this is happening.  I would have guessed that a "select count(*)" would not require a bunch of writes, and I can't begin to figure out why the number of blocks written are so much higher than the blocks read.  If I modify the where clause to only count the rows for a given minute or two, I see the reads but I never see the unusual write behavior.

Any thoughts into what could be going on?  Thanks in advance for your help.

Doug

Re: I/O on select count(*)

From
"Merlin Moncure"
Date:
On Wed, May 14, 2008 at 4:09 PM, Doug Eck <deck1@yahoo.com> wrote:
> I have a large table (~ 2B rows) that contains an indexed timestamp column.
> I am attempting to run a query to determine the number of rows for a given
> day using something like "select count(*) from tbl1 where ts between
> '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'".  Explain tells me
> that the query will be done using an index scan (as I would expect), and I
> realize that it is going to take a while.  My question concerns some unusual
> I/O activity on the box (SUSE)  when I run the query.
>
> For the first couple of minutes I see reads only.  After that vmstat shows
> mixed reads and writes in a ratio of about 1 block read to 5 blocks
> written.  We have determined that files in our data and log partitions are
> being hit, but the file system itself is not growing during this time (it
> appears to be writing over the same chunk of space over and over again).
> Memory on the box is not being swapped while all of this is happening.  I
> would have guessed that a "select count(*)" would not require a bunch of
> writes, and I can't begin to figure out why the number of blocks written are
> so much higher than the blocks read.  If I modify the where clause to only
> count the rows for a given minute or two, I see the reads but I never see
> the unusual write behavior.
>
> Any thoughts into what could be going on?  Thanks in advance for your help.

can you post the exact output of explain analyze? (or, at least,
explain if the query takes too long)

merlin

Re: I/O on select count(*)

From
Doug Eck
Date:


----- Original Message ----
From: Merlin Moncure <mmoncure@gmail.com>
To: Doug Eck <deck1@yahoo.com>
Cc: pgsql-performance@postgresql.org
Sent: Wednesday, May 14, 2008 3:38:23 PM
Subject: Re: [PERFORM] I/O on select count(*)

On Wed, May 14, 2008 at 4:09 PM, Doug Eck <deck1@yahoo.com> wrote:
> I have a large table (~ 2B rows) that contains an indexed timestamp column.
> I am attempting to run a query to determine the number of rows for a given
> day using something like "select count(*) from tbl1 where ts between
> '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'".  Explain tells me
> that the query will be done using an index scan (as I would expect), and I
> realize that it is going to take a while.  My question concerns some unusual
> I/O activity on the box (SUSE)  when I run the query.
>
> For the first couple of minutes I see reads only.  After that vmstat shows
> mixed reads and writes in a ratio of about 1 block read to 5 blocks
> written.  We have determined that files in our data and log partitions are
> being hit, but the file system itself is not growing during this time (it
> appears to be writing over the same chunk of space over and over again).
> Memory on the box is not being swapped while all of this is happening.  I
> would have guessed that a "select count(*)" would not require a bunch of
> writes, and I can't begin to figure out why the number of blocks written are
> so much higher than the blocks read.  If I modify the where clause to only
> count the rows for a given minute or two, I see the reads but I never see
> the unusual write behavior.
>
> Any thoughts into what could be going on?  Thanks in advance for your help.

can you post the exact output of explain analyze? (or, at least,
explain if the query takes too long)

merlin

The query takes a long time to run, so I'll start with the explain output.  I can run explain analyze (given enough time) if you believe its output could hold some clues.

db_2008=> explain select count(*) from ot_2008_05 where transact_time between '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999';
                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10368613.47..10368613.48 rows=1 width=0)
   ->  Index Scan using ot_2008_05_ak2 on ot_2008_05  (cost=0.00..10011333.27 rows=142912078 width=0)
         Index Cond: ((transact_time >= '2008-05-12 00:00:00-04'::timestamp with time zone) AND (transact_time <= '2008-05-12 23:59:59.999-04'::timestamp with time zone))
(3 rows)

db_2008=>

Doug


Re: I/O on select count(*)

From
"Kevin Grittner"
Date:
>>> Doug Eck <deck1@yahoo.com> wrote:

> I am attempting to run a query to determine the number of rows for a
given
> day using something like "select count(*) from tbl1 where ts between

> '2008-05-12 00:00:00.000' and '2008-05-12 23:59:59.999'".  Explain
tells me
> that the query will be done using an index scan (as I would expect),
and I
> realize that it is going to take a while.  My question concerns some
unusual
> I/O activity on the box (SUSE)  when I run the query.
>
> For the first couple of minutes I see reads only.  After that vmstat
shows
> mixed reads and writes in a ratio of about 1 block read to 5 blocks
written.

> Any thoughts into what could be going on?  Thanks in advance for your
help.

Odd as it may seem, a SELECT can cause a page to be rewritten.

If this is the first time that the rows are being read since they were
inserted (or since the database was loaded, including from backup), it
may be rewriting the rows to set hint bits, which can make subsequent
access faster.

The best solution may be to vacuum more often.

http://archives.postgresql.org/pgsql-performance/2007-12/msg00206.php

-Kevin



Re: I/O on select count(*)

From
Greg Smith
Date:
On Wed, 14 May 2008, Kevin Grittner wrote:

> If this is the first time that the rows are being read since they were
> inserted (or since the database was loaded, including from backup), it
> may be rewriting the rows to set hint bits, which can make subsequent
> access faster.

This is the second time this has come up recently, and I know it used to
puzzle me too.  This is a particularly relevant area to document better
for people doing benchmarking.  As close I've found to a useful commentary
on this subject is the thread at
http://archives.postgresql.org/pgsql-patches/2005-07/msg00390.php

I still don't completely understand this myself though, if I did I'd add a
FAQ on it.  Anyone want to lecture for a minute on the birth and care of
hint bits?  I'll make sure any comments here get onto the wiki.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: I/O on select count(*)

From
Alvaro Herrera
Date:
Greg Smith wrote:
> On Wed, 14 May 2008, Kevin Grittner wrote:
>
>> If this is the first time that the rows are being read since they were
>> inserted (or since the database was loaded, including from backup), it
>> may be rewriting the rows to set hint bits, which can make subsequent
>> access faster.
>
> This is the second time this has come up recently, and I know it used to
> puzzle me too.  This is a particularly relevant area to document better
> for people doing benchmarking.  As close I've found to a useful
> commentary on this subject is the thread at
> http://archives.postgresql.org/pgsql-patches/2005-07/msg00390.php
>
> I still don't completely understand this myself though, if I did I'd add
> a FAQ on it.  Anyone want to lecture for a minute on the birth and care
> of hint bits?  I'll make sure any comments here get onto the wiki.

Hint bits are used to mark tuples as created and/or deleted by
transactions that are know committed or aborted.  To determine the
visibility of a tuple without such bits set, you need to consult pg_clog
and possibly pg_subtrans, so it is an expensive check.  On the other
hand, if the tuple has the bits set, then it's state is known (or, at
worst, it can be calculated easily from your current snapshot, without
looking at pg_clog.)

There are four hint bits:
XMIN_COMMITTED -- creating transaction is known committed
XMIN_ABORTED   -- creating transaction is known aborted
XMAX_COMMITTED -- same, for the deleting transaction
XMAX_ABORTED   -- ditto

If neither of the bits is set, then the transaction is either in
progress (which you can check by examining the list of running
transactions in shared memory) or your process is the first one to check
(in which case, you need to consult pg_clog to know the status, and you
can update the hint bits if you find out a permanent state).


Regarding FAQs, I'm having trouble imagining putting this in the user
FAQ; I think it belongs into the developer's FAQ.  However, a
benchmarker is not going to look there.  Maybe we should start "a
benchmarker's FAQ"?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: I/O on select count(*)

From
Greg Smith
Date:
On Wed, 14 May 2008, Alvaro Herrera wrote:

> If neither of the bits is set, then the transaction is either in
> progress (which you can check by examining the list of running
> transactions in shared memory) or your process is the first one to check
> (in which case, you need to consult pg_clog to know the status, and you
> can update the hint bits if you find out a permanent state).

So is vacuum helpful here because it will force all that to happen in one
batch?  To put that another way:  if I've run a manual vacuum, is it true
that it will have updated all the hint bits to XMIN_COMMITTED for all the
tuples that were all done when the vacuum started?

> Regarding FAQs, I'm having trouble imagining putting this in the user
> FAQ; I think it belongs into the developer's FAQ.  However, a
> benchmarker is not going to look there.  Maybe we should start "a
> benchmarker's FAQ"?

On the wiki I've started adding a series of things that are
performance-related FAQs.  There's three of them mixed in the bottom of
http://wiki.postgresql.org/wiki/Frequently_Asked_Questions right now,
about slow count(*) and dealing with slow queries.

Here the FAQ would be "Why am I seeing all these writes when I'm just
doing selects on my table?", and if it's mixed in with a lot of other
performance related notes people should be able to find it.  The answer
and suggestions should be simple enough to be useful to a user who just
noticed this behavior, while perhaps going into developer land for those
who want to know more about the internals.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: I/O on select count(*)

From
"Jan de Visser"
Date:
On 5/14/08, Greg Smith <gsmith@gregsmith.com> wrote:
> On Wed, 14 May 2008, Alvaro Herrera wrote:
>
>
> > If neither of the bits is set, then the transaction is either in progress
> (which you can check by examining the list of running transactions in shared
> memory) or your process is the first one to check (in which case, you need
> to consult pg_clog to know the status, and you can update the hint bits if
> you find out a permanent state).
> >
>
>  So is vacuum helpful here because it will force all that to happen in one
> batch?  To put that another way:  if I've run a manual vacuum, is it true
> that it will have updated all the hint bits to XMIN_COMMITTED for all the
> tuples that were all done when the vacuum started?

From my benchmarking experience: Yes, vacuum helps. See also below.

>
>
> > Regarding FAQs, I'm having trouble imagining putting this in the user
> > FAQ; I think it belongs into the developer's FAQ.  However, a
> > benchmarker is not going to look there.  Maybe we should start "a
> > benchmarker's FAQ"?
> >
>
>  On the wiki I've started adding a series of things that are
> performance-related FAQs.  There's three of them mixed in the bottom of
> http://wiki.postgresql.org/wiki/Frequently_Asked_Questions
> right now, about slow count(*) and dealing with slow queries.
>
>  Here the FAQ would be "Why am I seeing all these writes when I'm just doing
> selects on my table?", and if it's mixed in with a lot of other performance
> related notes people should be able to find it.  The answer and suggestions
> should be simple enough to be useful to a user who just noticed this
> behavior, while perhaps going into developer land for those who want to know
> more about the internals.

Obviously, this issue is tied to the slow count(*) one, as I found out
the hard way. Consider the following scenario:
* Insert row
* Update that row a couple of times
* Rinse and repeat many times

Now somewhere during that cycle, do a select count(*) just to see
where you are. You will be appalled by how slow that is, due to not
only the usual 'slow count(*)' reasons. This whole hint bit business
makes it even worse, as demonstrated by the fact that running a vacuum
before the count(*) makes the latter noticably faster.

jan

Re: I/O on select count(*)

From
"Pavan Deolasee"
Date:
On Thu, May 15, 2008 at 7:51 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>
>
> So is vacuum helpful here because it will force all that to happen in one
> batch?  To put that another way:  if I've run a manual vacuum, is it true
> that it will have updated all the hint bits to XMIN_COMMITTED for all the
> tuples that were all done when the vacuum started?
>

Yes. For that matter, even a plain SELECT or count(*) on the entire
table is good enough. That will check every tuple for visibility and
set it's hint bits.

Another point to note is that the hint bits are checked and set on a
per tuple basis. So especially during index scan, the same heap page
may get rewritten many times. I had suggested in the past that
whenever we set hint bits for a tuple, we should check all other
tuples in the page and set their hint bits too to avoid multiple
writes of the same page. I guess the idea got rejected because of lack
of benchmarks to prove the benefit.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

Re: I/O on select count(*)

From
Luke Lonergan
Date:
BTW – we’ve removed HINT bit checking in Greenplum DB and improved the visibility caching which was enough to provide performance at the same level as with the HINT bit optimization, but avoids this whole “write the data, write it to the log also, then write it again just for good measure” behavior.

For people doing data warehousing work like the poster, this Postgres behavior is miserable.  It should be fixed for 8.4 for sure (volunteers?)

BTW – for the poster’s benefit, you should implement partitioning by date, then load each partition and VACUUM ANALYZE after each load.  You probably won’t need the date index anymore – so your load times will vastly improve (no indexes), you’ll store less data (no indexes) and you’ll be able to do simpler data management with the partitions.

You may also want to partition AND index if you do a lot of short range selective date predicates.  Example would be: partition by day, index on date field, queries selective on date ranges by hour will then select out only the day needed, then index scan to get the hourly values.  Typically time-oriented data is nearly time sorted anyway, so you’ll also get the benefit of a clustered index.

- Luke


On 5/15/08 10:40 AM, "Pavan Deolasee" <pavan.deolasee@gmail.com> wrote:

On Thu, May 15, 2008 at 7:51 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>
>
> So is vacuum helpful here because it will force all that to happen in one
> batch?  To put that another way:  if I've run a manual vacuum, is it true
> that it will have updated all the hint bits to XMIN_COMMITTED for all the
> tuples that were all done when the vacuum started?
>

Yes. For that matter, even a plain SELECT or count(*) on the entire
table is good enough. That will check every tuple for visibility and
set it's hint bits.

Another point to note is that the hint bits are checked and set on a
per tuple basis. So especially during index scan, the same heap page
may get rewritten many times. I had suggested in the past that
whenever we set hint bits for a tuple, we should check all other
tuples in the page and set their hint bits too to avoid multiple
writes of the same page. I guess the idea got rejected because of lack
of benchmarks to prove the benefit.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: I/O on select count(*)

From
Greg Smith
Date:
On Thu, 15 May 2008, Pavan Deolasee wrote:

> I had suggested in the past that whenever we set hint bits for a tuple,
> we should check all other tuples in the page and set their hint bits too
> to avoid multiple writes of the same page. I guess the idea got rejected
> because of lack of benchmarks to prove the benefit.

From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the
impression the idea was to have the background writer get involved to help
with this particular situation.  The way things are setup right now, I
would guess it's impractical for an individual client to be forced to wait
for all the tuples in a block to be checked just because it ran into one
tuple that needed its hint bits refreshed.

If the pages that had any hint bit updates since they were read/created
were made easy to identify (maybe they already are), the writer could do
the kind of scan you suggest anytime it was about to evict that page.
That wouldn't be in the client's critical path and it would maximize the
possible improvement here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: I/O on select count(*)

From
"Joshua D. Drake"
Date:
On Thu, 15 May 2008 10:52:01 +0800
Luke Lonergan <llonergan@greenplum.com> wrote:

> BTW ­ we¹ve removed HINT bit checking in Greenplum DB and improved the
> visibility caching which was enough to provide performance at the
> same level as with the HINT bit optimization, but avoids this whole
> ³write the data, write it to the log also, then write it again just
> for good measure² behavior.
>
> For people doing data warehousing work like the poster, this Postgres
> behavior is miserable.  It should be fixed for 8.4 for sure
> (volunteers?)

Donations? You have the code Luke :)

Sincerely,

Joshua D. Drake

P.S. Sorry for the almost bad Star Wars pun.

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate



Attachment

Re: I/O on select count(*)

From
Tino Wildenhain
Date:
Hi,

Luke Lonergan wrote:
> BTW – we’ve removed HINT bit checking in Greenplum DB and improved the
> visibility caching which was enough to provide performance at the same
> level as with the HINT bit optimization, but avoids this whole “write
> the data, write it to the log also, then write it again just for good
> measure” behavior.

can you go a bit deeper into how you implemented this or is it some IP
of greenplum you cannot reveal?

Btw, is there something with your eyes:
<FONT SIZE="4"><FONT FACE="Verdana, Helvetica, Arial"><SPAN
STYLE='font-size:14pt'> ? :-))

Cheers
Tino


Attachment

Re: I/O on select count(*)

From
Tom Lane
Date:
"Jan de Visser" <jdevisser@digitalfairway.com> writes:
> Obviously, this issue is tied to the slow count(*) one, as I found out
> the hard way. Consider the following scenario:
> * Insert row
> * Update that row a couple of times
> * Rinse and repeat many times

> Now somewhere during that cycle, do a select count(*) just to see
> where you are. You will be appalled by how slow that is, due to not
> only the usual 'slow count(*)' reasons. This whole hint bit business
> makes it even worse, as demonstrated by the fact that running a vacuum
> before the count(*) makes the latter noticably faster.

Uh, well, you can't blame that entirely on hint-bit updates.  The vacuum
has simply *removed* two-thirds of the rows in the system, resulting in
a large drop in the number of rows that the select even has to look at.

It's certainly true that hint-bit updates cost something, but
quantifying how much isn't easy.  The off-the-cuff answer is to do the
select count(*) twice and see how much cheaper the second one is.  But
there are two big holes in that answer: the first is the possible cache
effects from having already read in the pages, and the second is that
the follow-up scan gets to avoid the visits to pg_clog that the first
scan had to make (which after all is the point of the hint bits).

I don't know any easy way to disambiguate the three effects that are at
work here.  But blaming it all on the costs of writing out hint-bit
updates is wrong.

            regards, tom lane

Re: I/O on select count(*)

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> ... To put that another way:  if I've run a manual vacuum, is it true
> that it will have updated all the hint bits to XMIN_COMMITTED for all the
> tuples that were all done when the vacuum started?

Any examination whatsoever of a tuple --- whether by vacuum or any
ordinary DML operation --- will update its hint bits to match the
commit/abort status of the inserting/deleting transaction(s) as of
the instant of the examination.  Your statement above is true but
is weaker than reality.

            regards, tom lane

Re: I/O on select count(*)

From
Matthew Wakeling
Date:
On Wed, 14 May 2008, Alvaro Herrera wrote:
> Hint bits are used to mark tuples as created and/or deleted by
> transactions that are know committed or aborted.  To determine the
> visibility of a tuple without such bits set, you need to consult pg_clog
> and possibly pg_subtrans, so it is an expensive check.

So, as I understand it, Postgres works like this:

1. You begin a transaction. Postgres writes an entry into pg_clog.
2. You write some tuples. Postgres writes them to the WAL, but doesn't
     bother fsyncing.
3. At some point, the bgwriter or a checkpoint may write the tuples to the
     database tables, and fsync the lot.
4. You commit the transaction. Postgres alters pg_clog again, writes that
     to the WAL, and fsyncs the WAL.
5. If the tuples hadn't already made it to the database tables, then a
     checkpoint or bgwriter will do it later on, and fsync the lot.
6. You read the tuples. Postgres reads them from the database table, looks
     in pg_clog, notices that the transaction has been committed, and
     writes the tuples to the database table again with the hint bits set.
     This write is not WAL protected, and is not fsynced.

This seems like a good architecture, with some cool characteristics,
mainly that at no point does Postgres have to hold vast quantities of data
in memory. I have two questions though:

Is it really safe to update the hint bits in place? If there is a power
cut in the middle of writing a block, is there a guarantee from the disc
that the block will never be garbled?

Is there a way to make a shortcut and have the hint bits written the first
time the data is written to the table? One piece of obvious low-hanging
fruit would be to enhance step five above, so that the bgwriter or
checkpoint that writes the data to the database table checks the pg_clog
and writes the correct hint bits. In fact, if the tuple's creating
transaction has aborted, then the tuple can be vacuumed right there and
then before it is even written. For OLTP, almost all the hint bits will be
written first time, and also the set of transactions that will be looked
up in the pg_clog will be small (the set of transactions that were active
since the last checkpoint), so its cache coherency will be good.

However, this idea does not deal well with bulk data loads, where the data
is checkpointed before transaction is committed or aborted.

Matthew

--
Now, you would have thought these coefficients would be integers, given that
we're working out integer results. Using a fraction would seem really
stupid. Well, I'm quite willing to be stupid here - in fact, I'm going to
use complex numbers.                    -- Computer Science Lecturer

Re: I/O on select count(*)

From
"Heikki Linnakangas"
Date:
Matthew Wakeling wrote:
> Is it really safe to update the hint bits in place? If there is a power
> cut in the middle of writing a block, is there a guarantee from the disc
> that the block will never be garbled?

Don't know, to be honest. We've never seen any reports of corrupted data
that would suggest such a problem, but it doesn't seem impossible to me
that some exotic storage system might do that.

> Is there a way to make a shortcut and have the hint bits written the
> first time the data is written to the table? One piece of obvious
> low-hanging fruit would be to enhance step five above, so that the
> bgwriter or checkpoint that writes the data to the database table checks
> the pg_clog and writes the correct hint bits.

Yep, that's an idea that's been suggested before. In fact, I seem to
remember a patch to do just that. Don't remember what happened to it,

> In fact, if the tuple's
> creating transaction has aborted, then the tuple can be vacuumed right
> there and then before it is even written.

Not if you have any indexes on the table. To vacuum, you'll have to scan
all indexes to remove pointers to the tuple.

> However, this idea does not deal well with bulk data loads, where the
> data is checkpointed before transaction is committed or aborted.

Yep, that's the killer :-(.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: I/O on select count(*)

From
Matthew Wakeling
Date:
On Thu, 15 May 2008, Luke Lonergan wrote:
> BTW ­ we¹ve removed HINT bit checking in Greenplum DB and improved the
> visibility caching which was enough to provide performance at the same level
> as with the HINT bit optimization, but avoids this whole ³write the data,
> write it to the log also, then write it again just for good measure²
> behavior.

This sounds like a good option. I believe I suggested this a few months
ago, however it was rejected because in the worst case (when the hints are
not cached), if you're doing an index scan, you can do twice the number of
seeks as before.

http://archives.postgresql.org/pgsql-performance/2007-12/msg00217.php

The hint data will be four bits per tuple plus overheads, so it could be
made very compact, and therefore likely to stay in the cache fairly well.
Each tuple fetched would have to be spaced really far apart in the
database table in order to exhibit the worst case, because fetching a page
of hint cache will cause 64kB or so of disc to appear in the disc's
read-ahead buffer, which will be equivalent to 128MB worth of database
table (assuming eight tuples per block and no overhead). As soon as you
access another tuple in the same 128MB bracket, you'll hit the disc
read-ahead buffer for the hints.

On balance, to me it still seems like a good option.

Matthew

--
Those who do not understand Unix are condemned to reinvent it, poorly.
                -- Henry Spencer

Re: I/O on select count(*)

From
Jan de Visser
Date:
On Thursday 15 May 2008 03:02:19 Tom Lane wrote:
> "Jan de Visser" <jdevisser@digitalfairway.com> writes:
> > Obviously, this issue is tied to the slow count(*) one, as I found out
> > the hard way. Consider the following scenario:
> > * Insert row
> > * Update that row a couple of times
> > * Rinse and repeat many times
> >
> > Now somewhere during that cycle, do a select count(*) just to see
> > where you are. You will be appalled by how slow that is, due to not
> > only the usual 'slow count(*)' reasons. This whole hint bit business
> > makes it even worse, as demonstrated by the fact that running a vacuum
> > before the count(*) makes the latter noticably faster.
>
> Uh, well, you can't blame that entirely on hint-bit updates.  The vacuum
> has simply *removed* two-thirds of the rows in the system, resulting in
> a large drop in the number of rows that the select even has to look at.
>
> It's certainly true that hint-bit updates cost something, but
> quantifying how much isn't easy.  The off-the-cuff answer is to do the
> select count(*) twice and see how much cheaper the second one is.  But
> there are two big holes in that answer: the first is the possible cache
> effects from having already read in the pages, and the second is that
> the follow-up scan gets to avoid the visits to pg_clog that the first
> scan had to make (which after all is the point of the hint bits).
>
> I don't know any easy way to disambiguate the three effects that are at
> work here.  But blaming it all on the costs of writing out hint-bit
> updates is wrong.
>
>             regards, tom lane

True. But it still contributes to the fact that queries sometimes behave in a
non-deterministic way, which IMHO is the major annoyance when starting to
work with pgsql. And contrary to other causes (vacuum, checkpoints) this is
woefully underdocumented.

jan

Re: I/O on select count(*)

From
Matthew Wakeling
Date:
On Thu, 15 May 2008, Heikki Linnakangas wrote:
> > Is it really safe to update the hint bits in place? If there is a
> > power cut in the middle of writing a block, is there a guarantee from
> > the disc that the block will never be garbled?
>
> Don't know, to be honest. We've never seen any reports of corrupted data
> that would suggest such a problem, but it doesn't seem impossible to me
> that some exotic storage system might do that.

Hmm. That problem is what WAL full-page-writes is meant to handle, isn't
it? So basically, if you're telling people that WAL full-page-writes is
safer than partial WAL, because it avoids updating pages in-place, then
you shouldn't be updating pages in-place for the hint bits either. You
can't win!

>> In fact, if the tuple's creating transaction has aborted, then the tuple
>> can be vacuumed right there and then before it is even written.
>
> Not if you have any indexes on the table. To vacuum, you'll have to scan all
> indexes to remove pointers to the tuple.

Ah. Well, would that be so expensive? After all, someone has to do it
eventually, and these are index entries that have only just been added
anyway.

I can understand index updating being a bit messy in the middle of a
checkpoint though, as you would have to write the update to the WAL, which
you are checkpointing...

So, I don't know exactly how the WAL updates to indexes work, but my guess
is that it has been implemented as "write the blocks that we would change
to the WAL". The problem with this is that all the changes to the index
are done individually, so there's no easy way to "undo" one of them later
on when you find out that the transaction has been aborted during the
checkpoint.

An alternative would be to build a "list of changes" in the WAL without
actually changing the underlying index at all. When reading the index, you
would read the "list" first (which would be in memory, and in an
efficient-to-search structure), then read the original index and add the
two. Then when checkpointing, vet all the changes against known aborted
transactions before making all the changes to the index together. This is
likely to speed up index writes quite a bit, and also allow you to
effectively vacuum aborted tuples before they get written to the disc.

Matthew

--
Vacuums are nothings. We only mention them to let them know we know
they're there.

Re: I/O on select count(*)

From
Ron Mayer
Date:
Matthew Wakeling wrote:
> On Thu, 15 May 2008, Luke Lonergan wrote:
>> ...HINT bit optimization, but avoids this whole ³write the data,
>> write it to the log also, then write it again just for good measure²
> ...
> The hint data will be four bits per tuple plus overheads, so it could be
> made very compact, and therefore likely to stay in the cache fairly
> well.

Does it seem like these HINT bits would be good candidates to move
off to map forks similar to how the visibility map stuff will be handled?

Since (if I understand right) only the hint bits change during the
select(*) it seems a lot less write-IO would happen if such a map
were updated rather than the data pages themselves.

Re: I/O on select count(*)

From
Alvaro Herrera
Date:
Greg Smith escribió:
> On Thu, 15 May 2008, Pavan Deolasee wrote:
>
>> I had suggested in the past that whenever we set hint bits for a tuple,
>> we should check all other tuples in the page and set their hint bits
>> too to avoid multiple writes of the same page. I guess the idea got
>> rejected because of lack of benchmarks to prove the benefit.
>
> From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the
> impression the idea was to have the background writer get involved to
> help with this particular situation.

The problem is that the bgwriter does not understand about the content
of the pages it is writing -- they're opaque pages for all it knows.  So
it cannot touch the hint bits.

I agree with Pavan that it's likely that setting hint bits in batches
instead of just for the tuple being examined is a benefit.  However,
it's perhaps not so good to be doing it in a foreground process, because
you're imposing extra cost to the client queries which we want to be as
fast as possible.  Perhaps the thing to do is have a "database-local
bgwriter" which would scan pages and do this kind of change ...
a different kind of process to be launched by autovacuum perhaps.

If we had the bitmask in a separate map fork, this could be cheap.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: I/O on select count(*)

From
Tom Lane
Date:
Matthew Wakeling <matthew@flymine.org> writes:
> Hmm. That problem is what WAL full-page-writes is meant to handle, isn't
> it? So basically, if you're telling people that WAL full-page-writes is
> safer than partial WAL, because it avoids updating pages in-place, then
> you shouldn't be updating pages in-place for the hint bits either. You
> can't win!

This argument ignores the nature of the data change.  With a hint-bit
update, no data is being shuffled around, so there is no danger from a
partial page write.

A disk that leaves an individual sector corrupt would be a problem,
but I don't think that's a huge risk.  Keep in mind that disks aren't
designed to just stop dead when power dies --- they are made to be able
to park their heads before the juice is entirely gone.  I think it's
reasonable to assume they'll finish writing the sector in progress
before they start parking.

            regards, tom lane

Re: I/O on select count(*)

From
"Heikki Linnakangas"
Date:
Matthew Wakeling wrote:
> On Thu, 15 May 2008, Heikki Linnakangas wrote:
>> > Is it really safe to update the hint bits in place? If there is a >
>> power cut in the middle of writing a block, is there a guarantee from
>> > the disc that the block will never be garbled?
>>
>> Don't know, to be honest. We've never seen any reports of corrupted
>> data that would suggest such a problem, but it doesn't seem impossible
>> to me that some exotic storage system might do that.
>
> Hmm. That problem is what WAL full-page-writes is meant to handle, isn't
> it? So basically, if you're telling people that WAL full-page-writes is
> safer than partial WAL, because it avoids updating pages in-place, then
> you shouldn't be updating pages in-place for the hint bits either. You
> can't win!

Full-page-writes protect from torn pages, that is, when one half of an
update hits the disk but the other one doesn't. In particular, if the
beginning of the page where the WAL pointer (XLogRecPtr) is flushed to
disk, but the actual changes elsewhere in the page aren't, you're in
trouble. WAL replay will look at the WAL pointer, and think that the
page doesn't need to be replayed, while other half of the update is
still missing.

Hint bits are different. We're only updating a single bit, and it
doesn't matter from correctness point of view whether the hint bit
update hits the disk or not. But what would spell trouble is if the disk
controller/whatever garbles the whole sector, IOW changes something else
than the changed bit, while doing the update.

>>> In fact, if the tuple's creating transaction has aborted, then the
>>> tuple can be vacuumed right there and then before it is even written.
>>
>> Not if you have any indexes on the table. To vacuum, you'll have to
>> scan all indexes to remove pointers to the tuple.
>
> Ah. Well, would that be so expensive? After all, someone has to do it
> eventually, and these are index entries that have only just been added
> anyway.

Scanning all indexes? Depends on your table of course, but yes it would
be expensive in general.

> An alternative would be to build a "list of changes" in the WAL without
> actually changing the underlying index at all. When reading the index,
> you would read the "list" first (which would be in memory, and in an
> efficient-to-search structure), then read the original index and add the
> two. Then when checkpointing, vet all the changes against known aborted
> transactions before making all the changes to the index together. This
> is likely to speed up index writes quite a bit, and also allow you to
> effectively vacuum aborted tuples before they get written to the disc.

There's not much point optimizing something that only helps with aborted
transactions.

The general problem with any idea that involves keeping a list of
changes made in a transaction is that that list will grow big during
bulk loads, so you'll have to overflow to disk or abandon the list
approach. Which means that it won't help with bulk loads.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: I/O on select count(*)

From
"Heikki Linnakangas"
Date:
Alvaro Herrera wrote:
> Greg Smith escribió:
>> On Thu, 15 May 2008, Pavan Deolasee wrote:
>>
>>> I had suggested in the past that whenever we set hint bits for a tuple,
>>> we should check all other tuples in the page and set their hint bits
>>> too to avoid multiple writes of the same page. I guess the idea got
>>> rejected because of lack of benchmarks to prove the benefit.
>>
>> From glancing at http://www.postgresql.org/docs/faqs.TODO.html I got the
>> impression the idea was to have the background writer get involved to
>> help with this particular situation.
>
> The problem is that the bgwriter does not understand about the content
> of the pages it is writing -- they're opaque pages for all it knows.  So
> it cannot touch the hint bits.

We know what kind of a relation we're dealing with in ReadBuffer, so we
could add a flag to BufferDesc to mark heap pages.

> If we had the bitmask in a separate map fork, this could be cheap.

I don't buy that. The point of a hint bit is that it's right there along
with the tuple you're looking at. If you have to look at a separate
buffer, you might as well just look at clog.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: I/O on select count(*)

From
Matthew Wakeling
Date:
On Thu, 15 May 2008, Heikki Linnakangas wrote:
> There's not much point optimizing something that only helps with aborted
> transactions.

That's fair enough, but this list method is likely to speed up index
writes anyway.

> The general problem with any idea that involves keeping a list of changes
> made in a transaction is that that list will grow big during bulk loads, so
> you'll have to overflow to disk or abandon the list approach. Which means
> that it won't help with bulk loads.

Yeah, it wouldn't be a list of changes for the transaction, it would be a
list of changes since the last checkpoint. Keeping data in memory for the
length of the transaction is doomed to failure, because there is no bound
on its size, so bulk loads are still going to miss out on hint
optimisation.

Matthew

--
for a in past present future; do
  for b in clients employers associates relatives neighbours pets; do
  echo "The opinions here in no way reflect the opinions of my $a $b."
done; done

Re: I/O on select count(*)

From
Alvaro Herrera
Date:
Heikki Linnakangas escribió:
> Alvaro Herrera wrote:

>> The problem is that the bgwriter does not understand about the content
>> of the pages it is writing -- they're opaque pages for all it knows.  So
>> it cannot touch the hint bits.
>
> We know what kind of a relation we're dealing with in ReadBuffer, so we
> could add a flag to BufferDesc to mark heap pages.

Hmm, I was thinking that it would need access to the catalogs to know
where the tuples are, but that's certainly not true, so perhaps it could
be made to work.

>> If we had the bitmask in a separate map fork, this could be cheap.
>
> I don't buy that. The point of a hint bit is that it's right there along
> with the tuple you're looking at. If you have to look at a separate
> buffer, you might as well just look at clog.

True -- I was confusing this with the idea of having the tuple MVCC
header (xmin, xmax, etc) in a separate fork, which would make the idea
of index-only scans more feasible at the expense of seqscans.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: I/O on select count(*)

From
Alvaro Herrera
Date:
Matthew Wakeling wrote:

Aside from the rest of commentary, a slight clarification:

> So, as I understand it, Postgres works like this:
>
> 1. You begin a transaction. Postgres writes an entry into pg_clog.

Starting a transaction does not write anything to pg_clog.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: I/O on select count(*)

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Heikki Linnakangas escribi�:
>> We know what kind of a relation we're dealing with in ReadBuffer, so we
>> could add a flag to BufferDesc to mark heap pages.

> Hmm, I was thinking that it would need access to the catalogs to know
> where the tuples are, but that's certainly not true, so perhaps it could
> be made to work.

The issue in my mind is not so much could bgwriter physically do it
as that it's a violation of module layering.  That has real
consequences, like potential for deadlocks.  It'll become particularly
pressing if we go forward with the plans to get rid of the separate
dedicated buffers for pg_clog etc and have them work in the main
shared-buffer pool.

            regards, tom lane

Re: I/O on select count(*)

From
Robert Lor
Date:
Tom Lane wrote:
> It's certainly true that hint-bit updates cost something, but
> quantifying how much isn't easy.
Maybe we can instrument the code with DTrace probes to quantify the
actual costs.  I'm not familiar with the code, but if I know where to
place the probes, I can easily do a quick test and provide the data.
>  The off-the-cuff answer is to do the
> select count(*) twice and see how much cheaper the second one is.
Doesn't seem the second run is cheaper as shown in the results below.
The data came from the probes I've added recently.

*************** Run #1 **********************
SQL Statement  : select count(*) from accounts;
Execution time : 1086.58 (ms)

============ Buffer Read Counts ============
Tablespace   Database      Table      Count
      1663      16384       1247          1
      1663      16384       2600          1
      1663      16384       2703          1
      1663      16384       1255          2
      1663      16384       2650          2
      1663      16384       2690          3
      1663      16384       2691          3
      1663      16384      16397       8390

======== Dirty Buffer Write Counts =========
Tablespace   Database      Table      Count
      1663      16384      16397       2865

Total buffer cache hits      :      1932
Total buffer cache misses    :      6471
Average read time from cache :      5638 (ns)
Average read time from disk  :    143371 (ns)
Average write time to disk   :     20368 (ns)


*************** Run #2 **********************
SQL Statement  : select count(*) from accounts;
Execution time : 1115.94 (ms)

============ Buffer Read Counts ============
Tablespace   Database      Table      Count
      1663      16384      16397       8390

======== Dirty Buffer Write Counts =========
Tablespace   Database      Table      Count
      1663      16384      16397       2865

Total buffer cache hits      :      1931
Total buffer cache misses    :      6459
Average read time from cache :      4357 (ns)
Average read time from disk  :    154127 (ns)
Average write time to disk   :     20368 (ns)


-Robert

Re: I/O on select count(*)

From
Tom Lane
Date:
Robert Lor <Robert.Lor@Sun.COM> writes:
> Tom Lane wrote:
>> It's certainly true that hint-bit updates cost something, but
>> quantifying how much isn't easy.

> Maybe we can instrument the code with DTrace probes to quantify the
> actual costs.

Hmm, the problem would be trying to figure out what percentage of writes
could be blamed solely on hint-bit updates and not any other change to
the page.  I don't think that the bufmgr currently keeps enough state to
know that, but you could probably modify it easily enough, since callers
distinguish MarkBufferDirty from SetBufferCommitInfoNeedsSave.  Define
another flag bit that's set only by the first, and test it during
write-out.

            regards, tom lane

Re: I/O on select count(*)

From
Robert Lor
Date:
Tom Lane wrote:
> Hmm, the problem would be trying to figure out what percentage of writes
> could be blamed solely on hint-bit updates and not any other change to
> the page.  I don't think that the bufmgr currently keeps enough state to
> know that, but you could probably modify it easily enough, since callers
> distinguish MarkBufferDirty from SetBufferCommitInfoNeedsSave.  Define
> another flag bit that's set only by the first, and test it during
> write-out.
>
Ok, I made a few changes to bufmgr per my understanding of your
description above and with my limited understanding of the code. Patch
is attached.

Assuming the patch is correct, the effect of writes due to hint bits is
quite significant. I collected the data below by runing pgbench in one
terminal and psql on another to run the query.

Is the data plausible?

-Robert

--------------


Backend PID    : 16189
SQL Statement  : select count(*) from accounts;
Execution time : 17.33 sec

============ Buffer Read Counts ============
Tablespace   Database      Table      Count
      1663      16384       2600          1
      1663      16384       2601          1
      1663      16384       2615          1
      1663      16384       1255          2
      1663      16384       2602          2
      1663      16384       2603          2
      1663      16384       2616          2
      1663      16384       2650          2
      1663      16384       2678          2
      1663      16384       1247          3
      1663      16384       1249          3
      1663      16384       2610          3
      1663      16384       2655          3
      1663      16384       2679          3
      1663      16384       2684          3
      1663      16384       2687          3
      1663      16384       2690          3
      1663      16384       2691          3
      1663      16384       2703          4
      1663      16384       1259          5
      1663      16384       2653          5
      1663      16384       2662          5
      1663      16384       2663          5
      1663      16384       2659          7
      1663      16384      16397       8390

======== Dirty Buffer Write Counts =========
Tablespace   Database      Table      Count
      1663      16384      16402          2
      1663      16384      16394         11
      1663      16384      16397       4771

========== Hint Bits Write Counts ==========
Tablespace   Database      Table      Count
      1663      16384      16397       4508

Total buffer cache hits      :       732
Total buffer cache misses    :      7731
Average read time from cache :      9136 (ns)
Average read time from disk  :    384201 (ns)
Average write time to disk   :    210709 (ns)


Backend PID    : 16189
SQL Statement  : select count(*) from accounts;
Execution time : 12.72 sec

============ Buffer Read Counts ============
Tablespace   Database      Table      Count
      1663      16384      16397       8392

======== Dirty Buffer Write Counts =========
Tablespace   Database      Table      Count
      1663      16384      16394          6
      1663      16384      16402          7
      1663      16384      16397       2870

========== Hint Bits Write Counts ==========
Tablespace   Database      Table      Count
      1663      16384      16402          2
      1663      16384      16397       2010

Total buffer cache hits      :       606
Total buffer cache misses    :      7786
Average read time from cache :      6949 (ns)
Average read time from disk  :    706288 (ns)
Average write time to disk   :     90426 (ns)

Index: bufmgr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.228
diff -u -3 -p -r1.228 bufmgr.c
--- bufmgr.c    1 Jan 2008 19:45:51 -0000    1.228
+++ bufmgr.c    15 May 2008 20:56:38 -0000
@@ -42,6 +42,7 @@
 #include "storage/smgr.h"
 #include "utils/resowner.h"
 #include "pgstat.h"
+#include "pg_trace.h"


 /* Note: these two macros only work on shared buffers, not local ones! */
@@ -171,6 +172,7 @@ ReadBuffer_common(Relation reln, BlockNu
     if (isExtend)
         blockNum = smgrnblocks(reln->rd_smgr);

+    TRACE_POSTGRESQL_BUFFER_READ_START(blockNum, reln->rd_node.spcNode, reln->rd_node.dbNode, reln->rd_node.relNode,
isLocalBuf);
     pgstat_count_buffer_read(reln);

     if (isLocalBuf)
@@ -200,12 +202,16 @@ ReadBuffer_common(Relation reln, BlockNu
     {
         if (!isExtend)
         {
+            TRACE_POSTGRESQL_BUFFER_HIT();
             /* Just need to update stats before we exit */
             pgstat_count_buffer_hit(reln);

             if (VacuumCostActive)
                 VacuumCostBalance += VacuumCostPageHit;

+            TRACE_POSTGRESQL_BUFFER_READ_DONE(blockNum,
+                 reln->rd_node.spcNode, reln->rd_node.dbNode,
+                 reln->rd_node.relNode, isLocalBuf, found);
             return BufferDescriptorGetBuffer(bufHdr);
         }

@@ -257,6 +263,7 @@ ReadBuffer_common(Relation reln, BlockNu
             } while (!StartBufferIO(bufHdr, true));
         }
     }
+    TRACE_POSTGRESQL_BUFFER_MISS();

     /*
      * if we have gotten to this point, we have allocated a buffer for the
@@ -324,6 +331,9 @@ ReadBuffer_common(Relation reln, BlockNu
     if (VacuumCostActive)
         VacuumCostBalance += VacuumCostPageMiss;

+    TRACE_POSTGRESQL_BUFFER_READ_DONE(blockNum, reln->rd_node.spcNode,
+        reln->rd_node.dbNode, reln->rd_node.relNode, isLocalBuf, found);
+
     return BufferDescriptorGetBuffer(bufHdr);
 }

@@ -466,6 +476,11 @@ BufferAlloc(Relation reln,
              * happens to be trying to split the page the first one got from
              * StrategyGetBuffer.)
              */
+
+            TRACE_POSTGRESQL_DIRTY_BUFFER_WRITE_START(blockNum,
+              reln->rd_node.spcNode, reln->rd_node.dbNode,
+              reln->rd_node.relNode);
+
             if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
             {
                 /*
@@ -488,6 +503,11 @@ BufferAlloc(Relation reln,
                 /* OK, do the I/O */
                 FlushBuffer(buf, NULL);
                 LWLockRelease(buf->content_lock);
+
+                TRACE_POSTGRESQL_DIRTY_BUFFER_WRITE_DONE(
+                  blockNum, reln->rd_node.spcNode,
+                  reln->rd_node.dbNode, reln->rd_node.relNode,
+                  (oldFlags & BM_HINT_BITS_TEST));
             }
             else
             {
@@ -1734,6 +1754,10 @@ FlushBuffer(volatile BufferDesc *buf, SM
     buf->flags &= ~BM_JUST_DIRTIED;
     UnlockBufHdr(buf);

+    TRACE_POSTGRESQL_BUFFER_WRITE_START(buf->tag.blockNum,
+         reln->smgr_rnode.spcNode, reln->smgr_rnode.dbNode,
+         reln->smgr_rnode.relNode);
+
     smgrwrite(reln,
               buf->tag.blockNum,
               (char *) BufHdrGetBlock(buf),
@@ -1741,6 +1765,10 @@ FlushBuffer(volatile BufferDesc *buf, SM

     BufferFlushCount++;

+    TRACE_POSTGRESQL_BUFFER_WRITE_DONE(buf->tag.blockNum,
+         reln->smgr_rnode.spcNode, reln->smgr_rnode.dbNode,
+         reln->smgr_rnode.relNode, (buf->flags & BM_HINT_BITS_TEST));
+
     /*
      * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
      * end the io_in_progress state.
@@ -2155,7 +2183,7 @@ SetBufferCommitInfoNeedsSave(Buffer buff
         Assert(bufHdr->refcount > 0);
         if (!(bufHdr->flags & BM_DIRTY) && VacuumCostActive)
             VacuumCostBalance += VacuumCostPageDirty;
-        bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
+        bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED | BM_HINT_BITS_TEST);
         UnlockBufHdr(bufHdr);
     }
 }
@@ -2492,7 +2520,7 @@ TerminateBufferIO(volatile BufferDesc *b
     Assert(buf->flags & BM_IO_IN_PROGRESS);
     buf->flags &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
     if (clear_dirty && !(buf->flags & BM_JUST_DIRTIED))
-        buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);
+        buf->flags &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED | BM_HINT_BITS_TEST);
     buf->flags |= set_flag_bits;

     UnlockBufHdr(buf);
Index: buf_internals.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/storage/buf_internals.h,v
retrieving revision 1.95
diff -u -3 -p -r1.95 buf_internals.h
--- buf_internals.h    1 Jan 2008 19:45:58 -0000    1.95
+++ buf_internals.h    15 May 2008 20:59:42 -0000
@@ -36,6 +36,7 @@
 #define BM_JUST_DIRTIED            (1 << 5)        /* dirtied since write started */
 #define BM_PIN_COUNT_WAITER        (1 << 6)        /* have waiter for sole pin */
 #define BM_CHECKPOINT_NEEDED    (1 << 7)        /* must write for checkpoint */
+#define BM_HINT_BITS_TEST    (1 << 8)        /* test effect of writes due to hint bits */

 typedef bits16 BufFlags;


Re: I/O on select count(*)

From
James Mansion
Date:
Alvaro Herrera wrote:
> Hint bits are used to mark tuples as created and/or deleted by
> transactions that are know committed or aborted.  To determine the
> visibility of a tuple without such bits set, you need to consult pg_clog
> and possibly pg_subtrans, so it is an expensive check.  On the other
>
So, how come there is this outstanding work to do, which will inevitably
be done, and it
hasn't been done until it is 'just too late' to avoid getting in the way
of the query?

The OP didn't suggest that he had just loaded the data.

Also - is it the case that this only affects the case where updated
pages were spilled
during the transaction that changed them?  ie, if we commit a
transaction and there
are changed rows still in the cache since their pages are not evicted
yet, are the hint
bits set immediately so that page is written just once?  Seems this
would be common
in most OLTP systems.

Heikki points out that the list might get big and need to be abandoned,
but then you
fall back to scheduling a clog scan that can apply the bits, which does
what you have
now, though hopefully in a way that fills slack disk IO rather than
waiting for the
read.

Matthew says: 'it would be a list of changes since the last checkpoint'
but I don't
see why you can't start writing hints to in-memory pages as soon as the
transaction
ends.  You might fall behind, but I doubt it with modern CPU speeds.

I can't see why Pavan's suggestion to try to update as many of the bits
as possible
when a dirty page is evicted would be contentious.

I do think this is something of interest to users, not just developers,
since it
may influence the way updates are processed where it is reasonable to do
so in 'bite sized chunks' as a multipart workflow.



Re: I/O on select count(*)

From
Gregory Stark
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:

> BTW ­ we¹ve removed HINT bit checking in Greenplum DB and improved the
> visibility caching which was enough to provide performance at the same level
> as with the HINT bit optimization, but avoids this whole ³write the data,
> write it to the log also, then write it again just for good measure²
> behavior.
>
> For people doing data warehousing work like the poster, this Postgres
> behavior is miserable.  It should be fixed for 8.4 for sure (volunteers?)

For people doing data warehousing I would think the trick would be to do
something like what we do to avoid WAL logging for tables created in the same
transaction.

That is, if you're loading a lot of data at the same time then all of that
data is going to be aborted or committed and that will happen at the same
time. Ideally we would find a way to insert the data with the hint bits
already set to committed and mark the section of the table as being only
provisionally extended so other transactions wouldn't even look at those pages
until the transaction commits.

This is similar to the abortive attempt to have the abovementioned WAL logging
trick insert the records pre-frozen. I recall there were problems with that
idea though but I don't recall if they were insurmountable or just required
more work.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

Re: I/O on select count(*)

From
"Kevin Grittner"
Date:
>>> On Thu, May 15, 2008 at  5:11 PM, in message
<482CB528.9000600@mansionfamily.plus.com>, James Mansion
<james@mansionfamily.plus.com> wrote:
> Alvaro Herrera wrote:
>> Hint bits are used to mark tuples as created and/or deleted by
>> transactions that are know committed or aborted.  To determine the
>> visibility of a tuple without such bits set, you need to consult
pg_clog
>> and possibly pg_subtrans, so it is an expensive check.  On the
other
>>
> So, how come there is this outstanding work to do, which will
inevitably
> be done, and it
> hasn't been done until it is 'just too late' to avoid getting in the
way
> of the query?

There has been discussion from time to time about setting the hint
bits for tuple inserts which occur within the same database
transaction as the creation of the table into which they're being
inserted.  That would allow people to cover many of the bulk load
situations.  I don't see it on the task list.  (I would also argue
that there is little information lost, even from a forensic
perspective, to writing such rows as "frozen".)  Is this idea done,
dead, or is someone working on it?

If we could set hint bits on dirty buffer pages after the commit, we'd
cover the OLTP situation.  In many situations, there is a bigger OS
cache than PostgreSQL shared memory, and an attempt to set the bits
soon after the commit would coalesce the two writes into one physical
write using RAM-based access, which would be almost as good.  I don't
know if it's feasible to try to do that after the pages have moved
from the PostgreSQL cache to the OS cache, but it would likely be a
performance win.

If we are going to burden any requester process with the job of
setting the hint bits, it would typically be better to burden the one
doing the data modification rather than some random thread later
trying to read data from the table.  Of course, getting work off the
requester processes onto some background worker process is generally
even better.

-Kevin


Re: I/O on select count(*)

From
Greg Smith
Date:
On Thu, 15 May 2008, Alvaro Herrera wrote:

> Starting a transaction does not write anything to pg_clog.

For Matt and others, some details here are in
src/backend/access/transam/README:

"pg_clog records the commit status for each transaction that has been
assigned an XID."

"Transactions and subtransactions are assigned permanent XIDs only when/if
they first do something that requires one --- typically,
insert/update/delete a tuple, though there are a few other places that
need an XID assigned."

After reading the code and that documentation a bit, the part I'm still
not sure about is whether the CLOG entry is created when the XID is
assigned and then kept current as the state changes, or whether that isn't
even in CLOG until the transaction is committed.  It seems like the
latter, but there's some ambiguity in the wording and too many code paths
for me to map right now.

From there, it doesn't make its way out to disk until the internal CLOG
buffers are filled, at which point the least recently used buffer there is
evicted to permanent storage.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: I/O on select count(*)

From
Simon Riggs
Date:
On Fri, 2008-05-16 at 14:05 -0400, Greg Smith wrote:
> After reading the code and that documentation a bit, the part I'm
> still not sure about is whether the CLOG entry is created when the XID
> is assigned and then kept current as the state changes, or whether
> that isn't even in CLOG until the transaction is committed.  It seems
> like the latter, but there's some ambiguity in the wording and too
> many code paths for me to map right now.

Alvaro already said this, I thought? The clog is updated only at sub or
main transaction end, thank goodness. When the transactionid is assigned
the page of the clog that contains that transactionid is checked to see
if it already exists and if not, it is initialised.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: I/O on select count(*)

From
Alvaro Herrera
Date:
Greg Smith wrote:

> After reading the code and that documentation a bit, the part I'm still
> not sure about is whether the CLOG entry is created when the XID is
> assigned and then kept current as the state changes, or whether that
> isn't even in CLOG until the transaction is committed.  It seems like the
> latter, but there's some ambiguity in the wording and too many code paths
> for me to map right now.

pg_clog is allocated in pages of 8kB apiece(*).  On allocation, pages are
zeroed, which is the bit pattern for "transaction in progress".  So when
a transaction starts, it only needs to ensure that the pg_clog page that
corresponds to it is allocated, but it need not write anything to it.

(*) Each transaction needs 2 bits, so on a 8 kB page there is space for
4 transactions/byte * 8 pages * 1kB/page = 32k transactions.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: I/O on select count(*)

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> pg_clog is allocated in pages of 8kB apiece(*).  On allocation, pages are
> zeroed, which is the bit pattern for "transaction in progress".  So when
> a transaction starts, it only needs to ensure that the pg_clog page that
> corresponds to it is allocated, but it need not write anything to it.

Of course, in 8.3 it's not when the transaction starts, but when the Xid
is assigned (i.e. when the transaction first calls a read-write
command).  In previous versions it happens when the first snapshot is
taken (i.e. normally on the first command of any type, with very few
exceptions.)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: I/O on select count(*)

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Greg Smith wrote:
>> After reading the code and that documentation a bit, the part I'm still
>> not sure about is whether the CLOG entry is created when the XID is
>> assigned and then kept current as the state changes, or whether that
>> isn't even in CLOG until the transaction is committed.

> pg_clog is allocated in pages of 8kB apiece(*).  On allocation, pages are
> zeroed, which is the bit pattern for "transaction in progress".  So when
> a transaction starts, it only needs to ensure that the pg_clog page that
> corresponds to it is allocated, but it need not write anything to it.

One additional point: this means that one transaction in every 32K
writing transactions *does* have to do extra work when it assigns itself
an XID, namely create and zero out the next page of pg_clog.  And that
doesn't just slow down the transaction in question, but the next few
guys that would like an XID but arrive on the scene while the
zeroing-out is still in progress.

This probably contributes to the behavior that Simon and Josh regularly
complain about, that our transaction execution time is subject to
unpredictable spikes.  I'm not sure how to get rid of it though.

            regards, tom lane

Re: I/O on select count(*)

From
Jeremy Harris
Date:
Tom Lane wrote:
> One additional point: this means that one transaction in every 32K
> writing transactions *does* have to do extra work when it assigns itself
> an XID, namely create and zero out the next page of pg_clog.  And that
> doesn't just slow down the transaction in question, but the next few
> guys that would like an XID but arrive on the scene while the
> zeroing-out is still in progress.
>
> This probably contributes to the behavior that Simon and Josh regularly
> complain about, that our transaction execution time is subject to
> unpredictable spikes.  I'm not sure how to get rid of it though.

A thread maintaining a pool of assigned and cleared pg_clog pages, ahead
of the immediate need?    Possibly another job for an existing daemon
thread.

- Jeremy

Re: I/O on select count(*)

From
Greg Smith
Date:
I just collected all the good internals information included in this
thread and popped it onto http://wiki.postgresql.org/wiki/Hint_Bits where
I'll continue to hack away at the text until it's readable.  Thanks to
everyone who answered my questions here, that's good progress toward
clearing up a very underdocumented area.

I note a couple of potential TODO items not on the official list yet that
came up during this discussion:

-Smooth latency spikes when switching commit log pages by preallocating
cleared pages before they are needed

-Improve bulk loading by setting "frozen" hint bits for tuple inserts
which occur within the same database transaction as the creation of the
table into which they're being inserted

Did I miss anything?  I think everything brought up falls either into one
of those two or the existing "Consider having the background writer update
the transaction status hint bits..." TODO.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: I/O on select count(*)

From
Matthew Wakeling
Date:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> pg_clog is allocated in pages of 8kB apiece(*).  On allocation, pages are
>> zeroed, which is the bit pattern for "transaction in progress".  So when
>> a transaction starts, it only needs to ensure that the pg_clog page that
>> corresponds to it is allocated, but it need not write anything to it.

Yeah, that's pretty-much how I imagined it would be. Nice and compact. I
would imagine that if there are only a few transactions, doing the pg_clog
lookup would be remarkably quick. However, once there have been a
bazillion transactions, with tuples pointing to the whole range of them,
it would degenerate into having to perform an extra seek for each tuple,
and that's why you added the hint bits.

On Fri, 16 May 2008, Tom Lane wrote:
> One additional point: this means that one transaction in every 32K
> writing transactions *does* have to do extra work when it assigns itself
> an XID, namely create and zero out the next page of pg_clog.  And that
> doesn't just slow down the transaction in question, but the next few
> guys that would like an XID but arrive on the scene while the
> zeroing-out is still in progress.
>
> This probably contributes to the behavior that Simon and Josh regularly
> complain about, that our transaction execution time is subject to
> unpredictable spikes.  I'm not sure how to get rid of it though.

Does it really take that long to zero out 8kB of RAM? I thought CPUs were
really quick at doing that!

Anyway, the main thing you need to avoid is all the rest of the
transactions waiting for the new pg_clog page. The trick is to generate
the new page early, outside any locks on existing pages. It doesn't
necessarily need to be done by a daemon thread at all.

Matthew

--
I'm NOT paranoid!  Which of my enemies told you this?

Re: I/O on select count(*)

From
Bruce Momjian
Date:
Matthew Wakeling wrote:
> On Fri, 16 May 2008, Tom Lane wrote:
> > One additional point: this means that one transaction in every 32K
> > writing transactions *does* have to do extra work when it assigns itself
> > an XID, namely create and zero out the next page of pg_clog.  And that
> > doesn't just slow down the transaction in question, but the next few
> > guys that would like an XID but arrive on the scene while the
> > zeroing-out is still in progress.
> >
> > This probably contributes to the behavior that Simon and Josh regularly
> > complain about, that our transaction execution time is subject to
> > unpredictable spikes.  I'm not sure how to get rid of it though.
>
> Does it really take that long to zero out 8kB of RAM? I thought CPUs were
> really quick at doing that!

Yea, that was my assumption too.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: I/O on select count(*)

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Matthew Wakeling wrote:
>> Does it really take that long to zero out 8kB of RAM? I thought CPUs were
>> really quick at doing that!

> Yea, that was my assumption too.

You have to write the page (to be sure there is space for it on disk)
not only zero it.

This design is kind of a holdover, though, from back when we had one
ever-growing clog file.  Today I'd be inclined to think about managing
it more like pg_xlog, ie, have some background process pre-create a
whole segment file at a time.

            regards, tom lane

Re: I/O on select count(*)

From
Greg Smith
Date:
On Mon, 19 May 2008, Matthew Wakeling wrote:

> Does it really take that long to zero out 8kB of RAM? I thought CPUs were
> really quick at doing that!

You don't get the whole CPU--you get time slices of one.  Some of the
cases complaints have come in about have over a thousand connections all
fighting for CPU time, and making every one of them block for one guy who
needs to fiddle with memory for a while can be a problem.  If you're
unlucky you won't even be on the same CPU you started on each time you get
a little check of time, and you'll run at the speed of RAM rather than
that of the CPU--again, fighting for RAM access with every other process
on the server.

The real question in my mind is why this turns into a bottleneck before
the similar task of cleaning the 16MB XLOG segment does.  I expected that
one would need to be cracked before the CLOG switch time could possibly be
an issue, but reports from the field seem to suggest otherwise.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: I/O on select count(*)

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> The real question in my mind is why this turns into a bottleneck before
> the similar task of cleaning the 16MB XLOG segment does.

Because we do the latter off-line, or at least try to.

            regards, tom lane

Re: I/O on select count(*)

From
PFC
Date:
> The real question in my mind is why this turns into a bottleneck before
> the similar task of cleaning the 16MB XLOG segment does.  I expected
> that one would need to be cracked before the CLOG switch time could
> possibly be an issue, but reports from the field seem to suggest
> otherwise.

    Hm, on current CPUs zeroing 8kB of RAM should take less than 2 us... now
if it has to be written to disk, that's another story !

Re: I/O on select count(*)

From
"Luke Lonergan"
Date:

Hi Hannu,

Interesting suggestion on the partial index!

I'll find out if we can extract our code that did the work.  It was simple but scattered in a few routines.

In concept it worked like this:

1 - Ignore if hint bits are unset, use them if set.  This affects heapam and vacuum I think.
2 - implement a cache for clog lookups based on the optimistic assumption that the data was inserted in bulk.  Put the cache one call away from heapgetnext()

I forget the details of (2).  As I recall, if we fall off of the assumption, the penalty for long scans get large-ish (maybe 2X), but since when do people full table scan when they're updates/inserts are so scattered across TIDs?  It's an obvious big win for DW work.

We also have a GUC to turn it off if needed, in which case a vacuum will write the hint bits.

- Luke

----- Original Message -----
From: Hannu Krosing <hannu@krosing.net>
To: Luke Lonergan
Cc: Pavan Deolasee <pavan.deolasee@gmail.com>; Greg Smith <gsmith@gregsmith.com>; Alvaro Herrera <alvherre@commandprompt.com>; pgsql-performance@postgresql.org <pgsql-performance@postgresql.org>
Sent: Thu May 22 12:10:02 2008
Subject: Re: [PERFORM] I/O on select count(*)

On Thu, 2008-05-15 at 10:52 +0800, Luke Lonergan wrote:
> BTW – we’ve removed HINT bit checking in Greenplum DB and improved the
> visibility caching which was enough to provide performance at the same
> level as with the HINT bit optimization, but avoids this whole “write
> the data, write it to the log also, then write it again just for good
> measure” behavior.
>
> For people doing data warehousing work like the poster, this Postgres
> behavior is miserable.  It should be fixed for 8.4 for sure
> (volunteers?)

I might try it. I think I have told you about my ideas ;)
I plan to first do "cacheing" (for being able to doi index only scans
among other things) and then if the cache works reliably, use the
"cacheing" code as the main visibility / MVCC mechanism.

Is Greenplums code available, or should I roll my own ?

> BTW – for the poster’s benefit, you should implement partitioning by
> date, then load each partition and VACUUM ANALYZE after each load.
>  You probably won’t need the date index anymore – so your load times
> will vastly improve (no indexes), you’ll store less data (no indexes)
> and you’ll be able to do simpler data management with the partitions.
>
> You may also want to partition AND index if you do a lot of short
> range selective date predicates.  Example would be: partition by day,
> index on date field, queries selective on date ranges by hour will
> then select out only the day needed, then index scan to get the
> hourly values.

If your queries allow it, you may try indexing on
int2::extract('HOUR' from date)
so the index may be smaller

storing the date as type abstime is another way to reduce index size.

> Typically time-oriented data is nearly time sorted anyway, so you’ll
> also get the benefit of a clustered index.

----------------
Hannu


Re: I/O on select count(*)

From
Decibel!
Date:
On May 18, 2008, at 1:28 AM, Greg Smith wrote:
> I just collected all the good internals information included in
> this thread and popped it onto http://wiki.postgresql.org/wiki/
> Hint_Bits where I'll continue to hack away at the text until it's
> readable.  Thanks to everyone who answered my questions here,
> that's good progress toward clearing up a very underdocumented area.
>
> I note a couple of potential TODO items not on the official list
> yet that came up during this discussion:
>
> -Smooth latency spikes when switching commit log pages by
> preallocating cleared pages before they are needed
>
> -Improve bulk loading by setting "frozen" hint bits for tuple
> inserts which occur within the same database transaction as the
> creation of the table into which they're being inserted
>
> Did I miss anything?  I think everything brought up falls either
> into one of those two or the existing "Consider having the
> background writer update the transaction status hint bits..." TODO.

-Evaluate impact of improved caching of CLOG per Greenplum:

Per Luke Longergan:
I'll find out if we can extract our code that did the work. It was
simple but scattered in a few routines. In concept it worked like this:

1 - Ignore if hint bits are unset, use them if set.  This affects
heapam and vacuum I think.
2 - implement a cache for clog lookups based on the optimistic
assumption that the data was inserted in bulk.  Put the cache one
call away from heapgetnext()

I forget the details of (2).  As I recall, if we fall off of the
assumption, the penalty for long scans get large-ish (maybe 2X), but
since when do people full table scan when they're updates/inserts are
so scattered across TIDs?  It's an obvious big win for DW work.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Attachment

Re: I/O on select count(*)

From
Decibel!
Date:
On May 18, 2008, at 1:28 AM, Greg Smith wrote:
> I just collected all the good internals information included in
> this thread and popped it onto http://wiki.postgresql.org/wiki/
> Hint_Bits where I'll continue to hack away at the text until it's
> readable.  Thanks to everyone who answered my questions here,
> that's good progress toward clearing up a very underdocumented area.
>
> I note a couple of potential TODO items not on the official list
> yet that came up during this discussion:
>
> -Smooth latency spikes when switching commit log pages by
> preallocating cleared pages before they are needed
>
> -Improve bulk loading by setting "frozen" hint bits for tuple
> inserts which occur within the same database transaction as the
> creation of the table into which they're being inserted
>
> Did I miss anything?  I think everything brought up falls either
> into one of those two or the existing "Consider having the
> background writer update the transaction status hint bits..." TODO.

Blah, sorry for the double-post, but I just remembered a few things...

Did we completely kill the idea of the bg_writer *or some other
background process* being responsible for setting all hint-bits on
dirty pages before they're written out?

Also, Simon and Tom had an idea at PGCon: Don't set hint-bits in the
back-end if the page isn't already dirty. We'd likely need some
heuristics on this... based on Luke's comments about improved CLOG
caching maybe we want to set the bits anyway if the tuples without
them set are from old transactions (idea being that pulling those
CLOG pages would be pretty expensive). Or better yet; if we have to
read a CLOG page off disk, set the bits.

This could still potentially be a big disadvantage for data
warehouses; though perhaps the way to fix that is recommend a
backgrounded vacuum after data load.
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Attachment

Re: I/O on select count(*)

From
"Heikki Linnakangas"
Date:
Decibel! wrote:
> Also, Simon and Tom had an idea at PGCon: Don't set hint-bits in the
> back-end if the page isn't already dirty.

Or even better: set the hint-bits, but don't dirty the page.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: I/O on select count(*)

From
Tom Lane
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Decibel! wrote:
>> Also, Simon and Tom had an idea at PGCon: Don't set hint-bits in the
>> back-end if the page isn't already dirty.

> Or even better: set the hint-bits, but don't dirty the page.

Which in fact is what Simon suggested, not the other thing.

            regards, tom lane

Re: I/O on select count(*)

From
Simon Riggs
Date:
On Mon, 2008-05-26 at 11:36 -0400, Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> > Decibel! wrote:
> >> Also, Simon and Tom had an idea at PGCon: Don't set hint-bits in the
> >> back-end if the page isn't already dirty.
>
> > Or even better: set the hint-bits, but don't dirty the page.
>
> Which in fact is what Simon suggested, not the other thing.

Just raised this on -hackers, BTW.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support