Thread: reindex/vacuum locking/performance?

reindex/vacuum locking/performance?

From: Rob Nagler
I've read some posts that say vacuum doesn't lock, but my experience
today indicates the opposite.  It seemed that "vacuum full analyze"
was locked waiting and so were other postmaster processes.  It
appeared to be deadlock, because all were in "WAITING" state according
to ps.  I let this go for about a 1/2 hour, and then killed the vacuum
at which point all other processes completed normally.

The same thing seemed to be happening with reindex on a table.  It
seems that the reindex locks the table and some other resource which
then causes deadlock with other active processes.

Another issue seems to be performance.  A reindex on some indexes is
taking 12 minutes or so.  Vacuum seems to be slow, too.  Way longer
than the time it takes to reimport the entire database (30 mins).

In summary, I suspect that it is better from a UI perspective to bring
down the app on Sat at 3 a.m and reimport with a fixed time period
than to live through reindexing/vacuuming which may deadlock.  Am I
missing something?

Thanks,
Rob



Re: reindex/vacuum locking/performance?

From: Tom Lane
Rob Nagler <nagler@bivio.biz> writes:
> I've read some posts that say vacuum doesn't lock, but my experience
> today indicates the opposite.  It seemed that "vacuum full analyze"
> was locked waiting and so were other postmaster processes.

vacuum full does require exclusive lock, plain vacuum does not.

> It
> appeared to be deadlock, because all were in "WAITING" state according
> to ps.  I let this go for about a 1/2 hour, and then killed the vacuum
> at which point all other processes completed normally.

It's considerably more likely that the vacuum was waiting for an open
client transaction (that had a read or write lock on some table) to
finish than that there was an undetected deadlock.  I suggest looking at
your client code.  Also, in 7.3 or later you could look at the pg_locks
view to work out exactly who has the lock that's blocking vacuum.
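
For instance, something along these lines (an untested libpq sketch; the
conninfo string is a placeholder) will dump the lock table so you can see
who is waiting on what:

    /* showlocks.c -- dump pg_locks; build with: cc showlocks.c -lpq */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn   *conn = PQconnectdb("dbname=mydb");    /* placeholder */
        PGresult *res;
        int       i;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }
        res = PQexec(conn,
                     "SELECT pid, relation, mode, granted FROM pg_locks");
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            for (i = 0; i < PQntuples(res); i++)
                printf("pid=%s relation=%s mode=%s granted=%s\n",
                       PQgetvalue(res, i, 0), PQgetvalue(res, i, 1),
                       PQgetvalue(res, i, 2), PQgetvalue(res, i, 3));
        PQclear(res);
        PQfinish(conn);
        return 0;
    }

A row with granted = f is the waiter; the granted rows on the same
relation tell you who is holding it up.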

> Another issue seems to be performance.  A reindex on some indexes is
> taking 12 minutes or so.  Vacuum seems to be slow, too.  Way longer
> than the time it takes to reimport the entire database (30 mins).

vacuum full is indeed slow.  That's why we do not recommend it as a
routine maintenance procedure.  The better approach is to do plain
vacuums often enough that you don't need vacuum full.  In pre-7.4
releases you might need periodic reindexes too, depending on whether
your usage patterns tickle the index-bloat problem.  But it is easily
demonstrable that reindexing is cheaper than rebuilding the database.

> In summary, I suspect that it is better from a UI perspective to bring
> down the app on Sat at 3 a.m and reimport with a fixed time period
> than to live through reindexing/vacuuming which may deadlock.  Am I
> missing something?

Almost certainly, though you've not provided enough detail to determine
what.

            regards, tom lane

Re: reindex/vacuum locking/performance?

From: Rob Nagler
> vacuum full does require exclusive lock, plain vacuum does not.

I think I need full, because there are updates on the table.  As I
understand it, an update in pg is an insert/delete, so it needs
to be garbage collected.

> It's considerably more likely that the vacuum was waiting for an open
> client transaction (that had a read or write lock on some table) to
> finish than that there was an undetected deadlock.  I suggest looking at
> your client code.  Also, in 7.3 or later you could look at the pg_locks
> view to work out exactly who has the lock that's blocking vacuum.

My client code does a lot.  I look at it more often than I'd like to. :-)

I don't understand why the client transaction would block if vacuum
was waiting.  Does vacuum lock the table and then try to get some
other "open transaction" resource?  Free space?  I guess I don't
understand what other resources would be required of vacuum.  The
client transactions are short (< 1s).  They don't deadlock normally,
only with reindex and vacuum did I see this behavior.

> vacuum full is indeed slow.  That's why we do not recommend it as a
> routine maintenance procedure.  The better approach is to do plain
> vacuums often enough that you don't need vacuum full.

The description of vacuum full implies that it is required if the db
is updated frequently.  This db gets about 1 txn a second, possibly
more at peak load.

> In pre-7.4
> releases you might need periodic reindexes too, depending on whether
> your usage patterns tickle the index-bloat problem.

7.3, and yes, we have date indexes as well as sequences for primary
keys.

> But it is easily
> demonstrable that reindexing is cheaper than rebuilding the database.

IOW, vacuum+reindex is faster than dump+restore?  I didn't see this,
then again, I had this locking problem, so the stats are distorted.

One other question: The reindex seems to lock the table for the entire
process as opposed to freeing the lock between index rebuilds.  It was
hard to see, but it seemed like the clients were locked for the entire
"reindex table bla" command.

Sorry for lack of detail, but I didn't expect these issues so I wasn't
keeping track of the system state as closely as I should have.  Next
time. :-)

Thanks,
Rob

Re: reindex/vacuum locking/performance?

From: Bruno Wolff III
On Fri, Oct 03, 2003 at 15:47:01 -0600,
  Rob Nagler <nagler@bivio.biz> wrote:
> > vacuum full does require exclusive lock, plain vacuum does not.
>
> I think I need full, because there are updates on the table.  As I
> understand it, an update in pg is an insert/delete, so it needs
> to be garbage collected.

Plain vacuum will mark the space used by deleted tuples as reusable.
Most of the time this is good enough and you don't need to run vacuum full.

Re: reindex/vacuum locking/performance?

From: Neil Conway
On Fri, 2003-10-03 at 17:47, Rob Nagler wrote:
> They don't deadlock normally,
> only with reindex and vacuum did I see this behavior.

If you can provide a reproducible example of a deadlock induced by
REINDEX + VACUUM, that would be interesting.

(FWIW, I remember noticing a potential deadlock in the REINDEX code and
posting to -hackers about it, but I've never seen it occur in a
real-world situation...)

> One other question: The reindex seems to lock the table for the entire
> process as opposed to freeing the lock between index rebuilds.

Yeah, I wouldn't be surprised if there is some room for optimizing the
locks that are acquired by REINDEX.

-Neil



Re: reindex/vacuum locking/performance?

From: Christopher Browne
nagler@bivio.biz (Rob Nagler) writes:
> I've read some posts that say vacuum doesn't lock, but my experience
> today indicates the opposite.  It seemed that "vacuum full analyze"
> was locked waiting and so were other postmaster processes.  It
> appeared to be deadlock, because all were in "WAITING" state according
> to ps.  I let this go for about a 1/2 hour, and then killed the vacuum
> at which point all other processes completed normally.

VACUUM FULL certainly does lock.

See the man page:

   INPUTS
       FULL   Selects ``full'' vacuum, which may reclaim more space, but takes
              much longer and exclusively locks the table.

The usual answer is that you probably _didn't_ want to VACUUM FULL.

VACUUM ('no full') does NOT block updates.

> The same thing seemed to be happening with reindex on a table.  It
> seems that the reindex locks the table and some other resource which
> then causes deadlock with other active processes.

Not surprising either.  While the reindex takes place, updates to that
table have to be deferred.

> Another issue seems to be performance.  A reindex on some indexes is
> taking 12 minutes or so.  Vacuum seems to be slow, too.  Way longer
> than the time it takes to reimport the entire database (30 mins).

That seems a little surprising.

> In summary, I suspect that it is better from a UI perspective to
> bring down the app on Sat at 3 a.m and reimport with a fixed time
> period than to live through reindexing/vacuuming which may deadlock.
> Am I missing something?

Consider running pg_autovacuum, and thereby do a little bit of
vacuuming here and there all the time.  It DOESN'T block, so unless
your system is really busy, it shouldn't slow things down to a major
degree.
--
"cbbrowne","@","libertyrms.info"
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)

Re: reindex/vacuum locking/performance?

From: "Matt Clark"
> > vacuum full does require exclusive lock, plain vacuum does not.
>
> I think I need full, because there are updates on the table.  As I
> understand it, an update in pg is an insert/delete, so it needs
> to be garbage collected.

Yes and no.  You only need a plain VACUUM that is run often enough to
recover space as fast as you need to grab it.  For heavily updated tables
run it often - I run it every 5 minutes on some tables.  A VACUUM FULL is
only needed if you haven't been running VACUUM often enough in the first
place.
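
The dumbest possible version of that is sketched below as a libpq client
(the table name and conninfo are made up; pg_autovacuum or a cron'd
vacuumdb does the same job with more smarts):

    /* vacuum_loop.c -- VACUUM ANALYZE one hot table every five minutes */
    #include <stdio.h>
    #include <unistd.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn *conn = PQconnectdb("dbname=mydb");      /* placeholder */

        if (PQstatus(conn) != CONNECTION_OK)
            return 1;
        for (;;)
        {
            PGresult *res = PQexec(conn, "VACUUM ANALYZE hot_table");

            if (PQresultStatus(res) != PGRES_COMMAND_OK)
                fprintf(stderr, "vacuum failed: %s", PQerrorMessage(conn));
            PQclear(res);
            sleep(300);     /* plain VACUUM, so nothing blocks meanwhile */
        }
    }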

> The description of vacuum full implies that it is required if the db
> is updated frequently.   This db gets about 1 txn a second, possibly
> more at peak load.

Assuming you mean 1 update/insert per second that is an absolutely _trivial_
load on any reasonable hardware.  You can do thousands of updates/second on
hardware costing less than $2000.  If you vacuum every hour then you will be
fine.

> IOW, vacuum+reindex is faster than dump+restore?  I didn't see this,
> then again, I had this locking problem, so the stats are distorted.

REINDEX also locks tables like VACUUM FULL.  Either is terribly slow, but
unless you turn off fsync during the restore it's unlikely to be slower than
dump & restore.

Matt


Re: reindex/vacuum locking/performance?

From: Josh Berkus
Rob,

> > I think I need full, because there are updates on the table.  As I
> > understand it, an update in pg is an insert/delete, so it needs
> > to be garbage collected.
>
> Yes and no.  You only need a plain VACUUM that is run often enough to
> recover space as fast as you need to grab it.  For heavily updated tables
> run it often - I run it every 5 minutes on some tables.  A VACUUM FULL is
> only needed if you haven't been running VACUUM often enough in the first
> place.

Also, if you find that you need to run VACUUM FULL often, then you need to
raise your max_fsm_pages.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: reindex/vacuum locking/performance?

From: "Matt Clark"
> > In summary, I suspect that it is better from a UI perspective to
> > bring down the app on Sat at 3 a.m and reimport with a fixed time
> > period than to live through reindexing/vacuuming which may deadlock.
> > Am I missing something?
>
> Consider running pg_autovacuum, and thereby do a little bit of
> vacuuming here and there all the time.  It DOESN'T block, so unless
> your system is really busy, it shouldn't slow things down to a major
> degree.

My real world experience on a *very* heavily updated OLTP type DB, following
advice from this list (thanks guys!), is that there is essentially zero cost
to going ahead and vacuuming as often as you feel like it.  Go crazy, and
speed up your DB!

OK, that's on a quad CPU box with goodish IO, so maybe there are issues on
very slow boxen, but in a heavy-update environment the advantages seem to
easily wipe out the costs.

Matt

p.s.  Sorry to sound like a "Shake'n'Vac" advert.


Re: reindex/vacuum locking/performance?

From: "Matt Clark"
> Also, if you find that you need to run VACUUM FULL often, then you need
> to raise your max_fsm_pages.

Yes and no.  If it's run often enough then the number of tracked pages
shouldn't need to be raised, but then again...

...max_fsm_pages should be raised anyway.  I'm about to reclaim a Pentium
166 w/ 64MB of RAM from a friend I lent it to _many_ years ago, and I
suspect PG would run happily on it as configured by default.  Set it to at
least 50,000, I say.  What do you have to lose?  I mean, if they're not
free then they're not tracked in the FSM, right?

Of course if anyone knows a reason _not_ to raise it then I'm all ears!

Matt




Re: reindex/vacuum locking/performance?

From: Neil Conway
On Fri, 2003-10-03 at 17:34, Christopher Browne wrote:
> Not surprising either.  While the reindex takes place, updates to that
> table have to be deferred.

Right, but that's no reason not to let SELECTs proceed, for example.
(Whether that would actually be *useful* is another question...)

-Neil



Re: reindex/vacuum locking/performance?

From
"Matt Clark"
Date:
> > Also, if you find that you need to run VACUUM FULL often, then you
> > need to raise your max_fsm_pages.
>
> Yes and no.  If it's run often enough then the number of tracked pages
> shouldn't need to be raised, but then again...

Oops, sorry, didn't pay attention and missed the mention of FULL.  My bad,
ignore my OT useless response.


Re: reindex/vacuum locking/performance?

From: Tom Lane
Neil Conway <neilc@samurai.com> writes:
> On Fri, 2003-10-03 at 17:34, Christopher Browne wrote:
>> Not surprising either.  While the reindex takes place, updates to that
>> table have to be deferred.

> Right, but that's no reason not to let SELECTs proceed, for example.

What if said SELECTs are using the index in question?

I suspect it is true that REINDEX locks more than it needs to, but we
should tread carefully about loosening it.

            regards, tom lane

Re: reindex/vacuum locking/performance?

From: Andrew Sullivan
On Fri, Oct 03, 2003 at 02:24:42PM -0600, Rob Nagler wrote:
> I've read some posts that say vacuum doesn't lock, but my experience
> today indicates the opposite.  It seemed that "vacuum full analyze"

VACUUM doesn't.  VACUUM FULL does.

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: reindex/vacuum locking/performance?

From: Andrew Sullivan
On Sat, Oct 04, 2003 at 12:29:55AM +0100, Matt Clark wrote:
> My real world experience on a *very* heavily updated OLTP type DB, following
> advice from this list (thanks guys!), is that there is essentially zero cost
> to going ahead and vacuuming as often as you feel like it.  Go crazy, and
> speed up your DB!

That's not quite true.  If vacuums start running into each other, you
can very easily start eating up all your I/O bandwidth.  Even if you've
got lots of it.

Also, a vacuum pretty much destroys your shared buffers, so you have
to be aware of that trade-off too.  Vacuum is not free.  It's _way_
cheaper than it used to be, though.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: reindex/vacuum locking/performance?

From: Andrew Sullivan
On Fri, Oct 03, 2003 at 11:49:03PM -0400, Tom Lane wrote:
>
> What if said SELECTs are using the index in question?

That's a good reason to build a new index and, when it's done, drop
the old one.  It still prevents writes, of course.
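
That is, roughly (an untested sketch; the index and table names are made
up):

    /* swapindex.c -- replace an index without REINDEX */
    #include <libpq-fe.h>

    static void
    run(PGconn *conn, const char *sql)
    {
        PGresult *res = PQexec(conn, sql);

        /* real code would check PQresultStatus(res) here */
        PQclear(res);
    }

    int
    main(void)
    {
        PGconn *conn = PQconnectdb("dbname=mydb");      /* placeholder */

        run(conn, "CREATE INDEX foo_col_idx2 ON foo (col)");   /* slow part */
        run(conn, "DROP INDEX foo_col_idx");                   /* quick */
        PQfinish(conn);
        return 0;
    }

The planner picks indexes by their definition, not their name, so the
replacement doesn't need to inherit the old name.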

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: reindex/vacuum locking/performance?

From: "Matt Clark"
> On Sat, Oct 04, 2003 at 12:29:55AM +0100, Matt Clark wrote:
> > My real world experience on a *very* heavily updated OLTP type DB,
> > following advice from this list (thanks guys!), is that there is
> > essentially zero cost to going ahead and vacuuming as often as you
> > feel like it.  Go crazy, and speed up your DB!
>
> That's not quite true.  If vacuums start running into each other, you
> can very easily start eating up all your I/O bandwidth.  Even if you've
> got lots of it.

Very true, which is why all my scripts write a lockfile and delete it when
they're finished, to prevent that happening.  I should have mentioned that.

> Also, a vacuum pretty much destroys your shared buffers, so you have
> to be aware of that trade-off too.  Vacuum is not free.  It's _way_
> cheaper than it used to be, though.

That's _very_ interesting.  I've never been quite clear what's in shared
buffers apart from scratch space for currently running transactions.  Also
the docs imply that vacuum uses its own space for working in.  Do you have
more info on how it clobbers shared_buffers?

M


Re: reindex/vacuum locking/performance?

From: Andrew Sullivan
On Sun, Oct 05, 2003 at 12:14:24PM +0100, Matt Clark wrote:
> more info on how it clobbers shared_buffers?

Vacuum is like a seqscan.  It touches everything on a table.  So it
doesn't clobber them outright, but afterward they hold whatever vacuum
read last.  It's unlikely your buffers are big enough to hold your
database, unless your database is small.  So you'll end up expiring
potentially useful data in the buffer.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: reindex/vacuum locking/performance?

From: "Matt Clark"
> On Sun, Oct 05, 2003 at 12:14:24PM +0100, Matt Clark wrote:
> > more info on how it clobbers shared_buffers?
>
> Vacuum is like a seqscan.  It touches everything on a table.  So it
> doesn't clobber them outright, but afterward they hold whatever vacuum
> read last.  It's unlikely your buffers are big enough to hold your
> database, unless your database is small.  So you'll end up expiring
> potentially useful data in the buffer.

OK I'm definitely missing something here.  I thought that the FSM was there
to keep track of potentially free pages, and that all VACUUM did was double
check and then write that info out for all to see?  The promise being that a
VACUUM FULL will walk all pages on disk and do a soft-shoe-shuffle to
aggressively recover space, but a simple VACUUM won't (merely confirming
pages as available for reuse).

As for buffers, my understanding is that they are *not* meant to be big
enough to hold the DB, as PG explicitly leaves caching up to the underlying
OS.  'buffers' here meaning shared memory between PG processes, and 'cache'
meaning OS cache.  'buffers' only need to be big enough to hold the
intermediate calcs and the results for any current transactions?

M


Re: reindex/vacuum locking/performance?

From: Tom Lane
"Matt Clark" <matt@ymogen.net> writes:
> OK I'm definitely missing something here.

The point is that a big seqscan (either VACUUM or a plain table scan)
hits a lot of pages, and thereby tends to fill your cache with pages
that aren't actually likely to get hit again soon, perhaps pushing out
pages that will be needed again soon.  This happens at both the
shared-buffer and kernel-disk-cache levels of caching.

It would be good to find some way to prevent big seqscans from
populating cache, but I don't know of any portable way to tell the OS
that we don't want it to cache a page we are reading.

            regards, tom lane

Re: reindex/vacuum locking/performance?

From: "Matt Clark"
> The point is that a big seqscan (either VACUUM or a plain table scan)
> hits a lot of pages, and thereby tends to fill your cache with pages
> that aren't actually likely to get hit again soon, perhaps pushing out
> pages that will be needed again soon.  This happens at both the
> shared-buffer and kernel-disk-cache levels of caching.

OK, I had thought (wrongly it seems, as usual, but this is how we learn!)
that a plain VACUUM did not incur a read of all pages.  I still don't
understand *why* it does, but I'll take your word for it.

Clearly if it distorts the 'normal' balance of pages in any caches, PG's or
the OS's, that's a _bad thing_.  I am currently in the nice position of
having a DB that (just about) fits in RAM, so I pretty much don't care about
read performance, but I will have to soon, as it grows beyond 3GB :-(  These
conversations are invaluable in planning for that dread time...

> It would be good to find some way to prevent big seqscans from
> populating cache, but I don't know of any portable way to tell the OS
> that we don't want it to cache a page we are reading.

Quite.  The only natural way would be to read those pages through some
special device, but then you might as well do raw disk access from the
get-go.  Portability vs. Performance, the age old quandary.  FWIW I and many
others stand back in pure amazement at the sheer _quality_ of PostgreSQL.


Rgds,

Matt



Re: reindex/vacuum locking/performance?

From: Tom Lane
"Matt Clark" <matt@ymogen.net> writes:
> OK, I had thought (wrongly it seems, as usual, but this is how we learn!)
> that a plain VACUUM did not incur a read of all pages.  I still don't
> understand *why* it does, but I'll take your word for it.

Mainly 'cause it doesn't know where the dead tuples are till it's
looked.  Also, VACUUM is the data collector for the free space map,
and so it is also charged with finding out how much free space exists
on every page.

            regards, tom lane

Re: reindex/vacuum locking/performance?

From: "Matt Clark"
> Mainly 'cause it doesn't know where the dead tuples are till it's
> looked.

At this point I feel very stupid...

> Also, VACUUM is the data collector for the free space map,
> and so it is also charged with finding out how much free space exists
> on every page.

Ah, now I just feel enlightened!  That makes perfect sense.  I think I had
been conflating free pages with free space, without understanding what the
difference was.  Of course I still don't really understand, but at least I
now _know_ I don't.

Many thanks

Matt


Re: reindex/vacuum locking/performance?

From: Christopher Browne
After a long battle with technology, matt@ymogen.net ("Matt Clark"), an earthling, wrote:
>> The point is that a big seqscan (either VACUUM or a plain table scan)
>> hits a lot of pages, and thereby tends to fill your cache with pages
>> that aren't actually likely to get hit again soon, perhaps pushing out
>> pages that will be needed again soon.  This happens at both the
>> shared-buffer and kernel-disk-cache levels of caching.
>
> OK, I had thought (wrongly it seems, as usual, but this is how we learn!)
> that a plain VACUUM did not incur a read of all pages.  I still don't
> understand *why* it does, but I'll take your word for it.

How does it know what to do on any given page if it does not read it
in?  It has to evaluate whether tuples can be thrown away or not, and
that requires looking at the tuples.  It may only be looking at a
small portion of the page, but that still requires reading each page.

No free lunch, unfortunately...
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','cbbrowne.com').
http://www3.sympatico.ca/cbbrowne/sgml.html
"End users  are just test loads  for verifying that  the system works,
kind of like resistors in an electrical circuit."
-- Kaz Kylheku in c.o.l.d.s

Re: reindex/vacuum locking/performance?

From: Neil Conway
On Sat, 2003-10-04 at 11:22, Andrew Sullivan wrote:
> Also, a vacuum pretty much destroys your shared buffers, so you have
> to be aware of that trade-off too.

True, although there is no reason that this necessarily needs to be the
case (at least, as far as the PostgreSQL shared buffer goes). As has
been pointed out numerous times on -hackers and in the literature, using
LRU for a DBMS shared buffer cache is far from optimal, and better
algorithms have been proposed (e.g. LRU-K, ARC). We could even have the
VACUUM command inform the bufmgr that the pages it is in the process of
reading in are part of a seqscan, and so are unlikely to be needed in
the immediate future.

-Neil



Re: reindex/vacuum locking/performance?

From: Tom Lane
Neil Conway <neilc@samurai.com> writes:
> ... We could even have the
> VACUUM command inform the bufmgr that the pages it is in the process of
> reading in are part of a seqscan, and so are unlikely to be needed in
> the immediate future.

This would be relatively easy to fix as far as our own buffering is
concerned, but the thing that's needed to make it really useful is
to prevent caching of seqscan-read pages in the kernel disk buffers.
I don't know any portable way to do that :-(

            regards, tom lane

Re: reindex/vacuum locking/performance?

From: Neil Conway
On Sun, 2003-10-05 at 19:43, Tom Lane wrote:
> This would be relatively easy to fix as far as our own buffering is
> concerned, but the thing that's needed to make it really useful is
> to prevent caching of seqscan-read pages in the kernel disk buffers.

True.

> I don't know any portable way to do that :-(

For the non-portable way of doing this, are you referring to O_DIRECT?

Even if it isn't available everywhere, it might be worth considering
this at least for the platforms on which it is supported.
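
(For concreteness, a minimal sketch of an O_DIRECT read loop --
Linux-flavoured, untested, and assuming the usual requirement that the
buffer be block-aligned:)

    /* directscan.c -- read a file while bypassing the kernel page cache */
    #define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        void *buf;
        int   fd = open("/some/relation/file", O_RDONLY | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 8192, 8192) != 0)
            return 1;               /* 8K matches PG's block size */
        while (read(fd, buf, 8192) > 0)
        {
            /* process one page; these reads never touch the OS cache */
        }
        free(buf);
        close(fd);
        return 0;
    }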

-Neil



Re: reindex/vacuum locking/performance?

From: Andrew Sullivan
On Sun, Oct 05, 2003 at 07:32:47PM -0400, Neil Conway wrote:

> been pointed out numerous times on -hackers and in the literature, using
> LRU for a DBMS shared buffer cache is far from optimal, and better
> algorithms have been proposed (e.g. LRU-K, ARC). We could even have the
> VACUUM command inform the bufmgr that the pages it is in the process of
> reading in are part of a seqscan, and so are unlikely to be needed in
> the immediate future.

Hey, when that happens, you'll find me first in line to praise the
implementor; but until then, it's important that people not get the
idea that vacuum is free.

It is _way_ improved, and on moderately loaded boxes, it's almost
unnoticeable.  But under heavy load, you need to be _real_ careful
about calling vacuum.  I think one of the biggest needs in the AVD (the
autovacuum daemon) is some sort of intelligence about current load on the
postmaster, but I haven't the foggiest idea how to give it such
intelligence.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: reindex/vacuum locking/performance?

From: Jeff
On Sun, 5 Oct 2003, Neil Conway wrote:

>
> > I don't know any portable way to do that :-(
>
> For the non-portable way of doing this, are you referring to O_DIRECT?
>
> Even if it isn't available everywhere, it might be worth considering
> this at least for the platforms on which it is supported.
>

I strongly agree here, but only if we can prove there is a benefit.
It would be silly of us to pass it up if some OS supported a
SnazzyFeatureC that was able to speed up PG by a large percentage
(hopefully in a rather non-invasive way in the code).  But I do see the
problem here with bloat, and with PG being radically different platform
to platform.  I suppose we could dictate that at least N OSes had to have
it... or perhaps supply it as contrib/ patches....  Something to think
about.

I'd be interested in tinkering with this, but I'm more interested at the
moment in why (with proof, not anecdotes) Solaris is so much slower than
Linux, and what we can do about this.  We're looking to move a rather large
Informix db to PG, and ops has reservations about ditching Sun hardware.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/



Re: reindex/vacuum locking/performance?

From: Andrew Sullivan
On Mon, Oct 06, 2003 at 08:07:27AM -0400, Jeff wrote:
> I strongly agree here only if we can prove there is a benefit.

There's plenty of academic work which purports to show that LRU is
far from the best choice.  Just in principle, it seems obvious that a
single-case seqscan-type operation (such as vacuum does) is a good
way to lose your cache for no real gain.

> I'd be interested in tinkering with this, but I'm more interested at the
> moment in why (with proof, not anecdotes) Solaris is so much slower than
> Linux, and what we can do about this.  We're looking to move a rather large
> Informix db to PG, and ops has reservations about ditching Sun hardware.

Interestingly, we're contemplating ditching Solaris because of the
terrible reliability we're getting from the hardware.

You can use truss to find some of the problems on Solaris.  The
open() syscall takes forever when you don't hit the Postgres shared
buffers (even if you can be sure the page is in filesystem buffers --
we could demonstrate it on a 1 G database on a machine with 10 G of
RAM).  I've heard grumblings about spinlocks on Solaris which might
explain this problem.  I certainly notice that performance gets
geometrically worse when you add a few hundred extra connections.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: reindex/vacuum locking/performance?

From: Shridhar Daithankar
Jeff wrote:
> I'd be interested in tinkering with this, but I'm more interested at the
> moment in why (with proof, not anecdotes) Solaris is so much slower than
> Linux, and what we can do about this.  We're looking to move a rather large
> Informix db to PG, and ops has reservations about ditching Sun hardware.

Is Linux on SPARC hardware an option? :-)

  Shridhar


locking/performance, Solaris performance discovery

From: Jeff
On Mon, 6 Oct 2003, Andrew Sullivan wrote:

> There's plenty of academic work which purports to show that LRU is
> far from the best choice.  Just in principle, it seems obvious that a
> single-case seqscan-type operation (such as vacuum does) is a good
> way to lose your cache for no real gain.
>

Logically bypassing caches for a seq scan also makes sense.

> Interestingly, we're contemplating ditching Solaris because of the
> terrible reliability we're getting from the hardware.
>

The reason ops likes Solaris / Sun is twofold.  1. We have a pile of big
Sun machines around.  2. Solaris / Sun is quite a bit more graceful in the
event of a hardware failure.  We've burned out our fair share of CPUs,
etc., and Solaris has been rather graceful about it.

I've started profiling and running tests... currently it is leaning
towards the sysv semaphores. I see in src/backend/port/ that pg_sema.c is
linked to the sysv implementation.  So what I did was create a
semaphore set, and then fired off 5 copies of a program that attaches
to that semaphore and then locks/unlocks it 1M times.

2xP2-450, Linux 2.4.18: 1 process: 221680 / sec; 5 processes: 98039 / sec
4xUltraSPARC II-400MHz, Solaris 2.6: 1 process: 142857 / sec; 5 processes:
23809 / sec

So I'm guessing that is where a LOT of the suck is coming from.
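
The guts of each copy look roughly like this (a sketch; for a clean run,
the set should be created and initialized once before the copies start):

    /* sembench.c -- hammer one SysV semaphore; run several copies at once */
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    #define SEMKEY 0x5050           /* made-up key; all copies share it */
    #define ITERS  1000000

    int
    main(void)
    {
        struct sembuf lock   = { 0, -1, 0 };
        struct sembuf unlock = { 0,  1, 0 };
        int i, semid = semget(SEMKEY, 1, IPC_CREAT | 0600);

        if (semid < 0)
            return 1;
        semctl(semid, 0, SETVAL, 1);    /* strictly a union semun arg */
        for (i = 0; i < ITERS; i++)
        {
            semop(semid, &lock, 1);     /* P */
            semop(semid, &unlock, 1);   /* V */
        }
        return 0;
    }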

What I plan to do next is looking to see if there are other interprocess
locking mechanisms on solaris (perhaps pthread_mutex with that
inter-process flag or something) to see if I can get those numbers a
little closer.


--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/



Re: locking/performance, Solaris performance discovery

From: Tom Lane
Jeff <threshar@torgo.978.org> writes:
> I've started profiling and running tests... currently it is leaning
> towards the sysv semaphores. I see in src/backend/port/ that pg_sema.c is
> linked to the sysv implementation.

Does Solaris have Posix semaphores?  You could try using those instead.

            regards, tom lane

Re: locking/performance, Solaris performance discovery

From: Jeff
On Mon, 6 Oct 2003, Tom Lane wrote:

>
> Does Solaris have Posix semaphores?  You could try using those instead.
>
>             regards, tom lane

Yep. It does.

I learned them quickly enough (using posix_sema.c as a guide) and found
out that at least on Sol 2.6 they are slower than sysv -- with 5
processes it went to about 16k lock/unlock cycles a second.
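
The POSIX flavour of the same loop, for reference (again a sketch; on
Solaris it needs -lposix4):

    /* psemabench.c -- POSIX named-semaphore version of the same test */
    #include <fcntl.h>
    #include <semaphore.h>

    #define ITERS 1000000

    int
    main(void)
    {
        int    i;
        sem_t *sem = sem_open("/semabench", O_CREAT, 0600, 1);  /* made-up name */

        if (sem == SEM_FAILED)
            return 1;
        for (i = 0; i < ITERS; i++)
        {
            sem_wait(sem);
            sem_post(sem);
        }
        sem_close(sem);
        return 0;
    }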

I'm going to try to find a box around here I can get sol(8|9) on that has
sufficient disk space and see.  I'm guessing sun has likely made
improvements...


Another odd thing I'm trying to work out is why my profiles come out so
radically different on the linux box and the sun box.

Sun:
 31.17     18.90    18.90                             internal_mcount
 19.10     30.48    11.58  8075381     0.00     0.00  _bt_checkkeys
  5.66     33.91     3.43 24375253     0.00     0.00  FunctionCall2
  4.82     36.83     2.92  8073010     0.00     0.00  _bt_step
  3.51     38.96     2.13    14198     0.15     0.15  _read
  2.77     40.64     1.68  8069040     0.00     0.00  varchareq
  2.59     42.21     1.57    28454     0.06     0.23  _bt_next
  2.29     43.60     1.39     1003     1.39     1.40  AtEOXact_Buffers
  1.86     44.73     1.13 16281197     0.00     0.00  pg_detoast_datum
  1.81     45.83     1.10                             _mcount
  1.68     46.85     1.02     2181     0.47     0.47  pglz_decompress


Linux:
 11.14      0.62     0.62     1879     0.00     0.00  pglz_decompress
  6.71      0.99     0.37     1004     0.00     0.00  AtEOXact_Buffers
  3.80      1.20     0.21  1103045     0.00     0.00  AllocSetAlloc
  3.23      1.38     0.18   174871     0.00     0.00  nocachegetattr
  2.92      1.54     0.16  1634957     0.00     0.00  AllocSetFreeIndex
  2.50      1.68     0.14    20303     0.00     0.00  heapgettup
  1.93      1.79     0.11     1003     0.00     0.00  AtEOXact_CatCache
  1.76      1.89     0.10   128442     0.00     0.00  hash_any
  1.72      1.98     0.10    90312     0.00     0.00  FunctionCall3
  1.69      2.08     0.09    50632     0.00     0.00  ExecTargetList
  1.60      2.17     0.09    51647     0.00     0.00  heap_formtuple
  1.55      2.25     0.09   406162     0.00     0.00  newNode
  1.46      2.33     0.08   133044     0.00     0.00  hash_search

It is the same query with slightly different data (the Sun has probably
20-40k more rows in the table the query hits).

I'll be digging up more info later today.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/



Seqscan buffer promotion (was: reindex/vacuum locking/performance?)

From: James Rogers
On Mon, 2003-10-06 at 05:15, Andrew Sullivan wrote:
> There's plenty of academic work which purports to show that LRU is
> far from the best choice.  Just in principle, it seems obvious that a
> single-case seqscan-type operation (such as vacuum does) is a good
> way to lose your cache for no real gain.


Traditionally, seqscan type operations are accommodated in LRU type
managers by having multiple buffer promotion policies, primarily because
it is simple to implement.  For example, if you are doing a seqscan, a
buffer loaded from disk is never promoted to the top of the LRU.
Instead it is only partially promoted (say, halfway for example) toward
the top of the buffer list.  A page that is already in the buffer is
promoted either to the halfway point or top depending on where it was
found.  There are numerous variations on the idea, some being more
clever and complex than others.

The point of this being that a pathological or rare sequential scan can
never trash more than a certain percentage of the cache, while not
significantly impacting the performance of a sequential scan. The
primary nuisance is that it slightly increases the API complexity. I'll
add that I don't know what PostgreSQL actually does in this regard, but
from the thread it appears as though seqscans are handled like the
default case.
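
To make the idea concrete, here is a toy version of the promotion rule
(not PostgreSQL code; eviction is only described in the comments):

    /* lru2seg.c -- two-segment LRU: a big seqscan can only churn the cold
     * segment, so it never evicts more than that segment's share of the
     * cache.  Toy illustration only. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Buf
    {
        struct Buf *prev, *next;
        int         page;
    } Buf;

    typedef struct              /* one segment: head = MRU, tail = LRU */
    {
        Buf *head, *tail;
    } Seg;

    typedef struct
    {
        Seg hot, cold;
    } Cache;

    static void
    unlink_buf(Seg *s, Buf *b)
    {
        if (b->prev) b->prev->next = b->next; else s->head = b->next;
        if (b->next) b->next->prev = b->prev; else s->tail = b->prev;
        b->prev = b->next = NULL;
    }

    static void
    push_head(Seg *s, Buf *b)
    {
        b->next = s->head;
        b->prev = NULL;
        if (s->head) s->head->prev = b; else s->tail = b;
        s->head = b;
    }

    /* Seqscan faults enter the cold segment only; a buffer that is
     * referenced again (cur != NULL), or read by normal access, is
     * promoted to the hot segment. */
    static void
    reference(Cache *c, Buf *b, Seg *cur, bool seqscan)
    {
        if (cur)
            unlink_buf(cur, b);
        if (cur != NULL || !seqscan)
            push_head(&c->hot, b);
        else
            push_head(&c->cold, b);
    }

    /* Eviction would pop c->cold.tail first, then fall back to c->hot.tail. */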

Cheers,

-James Rogers
 jamesr@best.com



Re: reindex/vacuum locking/performance?

From: Neil Conway
On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> On Sun, 2003-10-05 at 19:43, Tom Lane wrote:
> > This would be relatively easy to fix as far as our own buffering is
> > concerned, but the thing that's needed to make it really useful is
> > to prevent caching of seqscan-read pages in the kernel disk buffers.

> For the non-portable way of doing this, are you referring to O_DIRECT?

I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
right?

(If you were referring to O_DIRECT, I wanted to add that I wouldn't be
surprised if using O_DIRECT on many kernels reduces or eliminates any
readahead the OS will be doing on the sequential read, so the net result
may actually be a loss for a typical seqscan.)

-Neil



Re: reindex/vacuum locking/performance?

From: Tom Lane
Neil Conway <neilc@samurai.com> writes:
> On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
> right?

Not necessarily --- as you point out, it's not clear that O_DIRECT would
help us.  What would be way cool is something similar to what James
Rogers was talking about: a way to tell the kernel not to promote this
page all the way to the top of its LRU list.  I'm not sure that *any*
Unixen have such an API, let alone one that's common across more than
one platform :-(

            regards, tom lane

Re: reindex/vacuum locking/performance?

From: Larry Rosenman

--On Monday, October 06, 2003 14:26:10 -0400 Tom Lane <tgl@sss.pgh.pa.us>
wrote:

> Neil Conway <neilc@samurai.com> writes:
>> On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
>> I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
>> right?
>
> Not necessarily --- as you point out, it's not clear that O_DIRECT would
> help us.  What would be way cool is something similar to what James
> Rogers was talking about: a way to tell the kernel not to promote this
> page all the way to the top of its LRU list.  I'm not sure that *any*
> Unixen have such an API, let alone one that's common across more than
> one platform :-(
I think Veritas's VxFS has this as an option/ioctl.

You can read the Veritas doc on my
http://www.lerctr.org:8458/

pages under filesystems.

That should work on UnixWare and Solaris sites that have VxFS installed.

VxFS is standard on UW.

LER


--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749


Re: reindex/vacuum locking/performance?

From: Bruce Momjian
Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> > I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
> > right?
>
> Not necessarily --- as you point out, it's not clear that O_DIRECT would
> help us.  What would be way cool is something similar to what James
> Rogers was talking about: a way to tell the kernel not to promote this
> page all the way to the top of its LRU list.  I'm not sure that *any*
> Unixen have such an API, let alone one that's common across more than
> one platform :-(

Solaris has "free-behind", which prevents a large kernel sequential scan
from blowing out the cache.

I only read about it in the Mauro Solaris Internals book, and it seems
to be done automatically.  I guess most OS's don't do this optimization
because they usually don't read files larger than their cache.

I see BSD/OS madvise() has:

     #define MADV_NORMAL     0       /* no further special treatment */
     #define MADV_RANDOM     1       /* expect random page references */
     #define MADV_SEQUENTIAL 2       /* expect sequential references */
     #define MADV_WILLNEED   3       /* will need these pages */
-->  #define MADV_DONTNEED   4       /* don't need these pages */
     #define MADV_SPACEAVAIL 5       /* insure that resources are reserved */

The marked one seems to have the control we need.  Of course, the kernel
madvise() code has:

    /* Not yet implemented */

Looks like NetBSD implements it, but it also unmaps the page from the
address space, which might be more than we want.  NetBSD also has:

     #define MADV_FREE       6       /* pages are empty, free them */

which frees the page.  I am unclear on its use.

FreeBSD has this comment:

/*
 * vm_page_dontneed
 *
 *  Cache, deactivate, or do nothing as appropriate.  This routine
 *  is typically used by madvise() MADV_DONTNEED.
 *
 *  Generally speaking we want to move the page into the cache so
 *  it gets reused quickly.  However, this can result in a silly syndrome
 *  due to the page recycling too quickly.  Small objects will not be
 *  fully cached.  On the otherhand, if we move the page to the inactive
 *  queue we wind up with a problem whereby very large objects
 *  unnecessarily blow away our inactive and cache queues.
 *
 *  The solution is to move the pages based on a fixed weighting.  We
 *  either leave them alone, deactivate them, or move them to the cache,
 *  where moving them to the cache has the highest weighting.
 *  By forcing some pages into other queues we eventually force the
 *  system to balance the queues, potentially recovering other unrelated
 *  space from active.  The idea is to not force this to happen too
 *  often.
 */

The Linux comment is:

/*
 * Application no longer needs these pages.  If the pages are dirty,
 * it's OK to just throw them away.  The app will be more careful about
 * data it wants to keep.  Be sure to free swap resources too.  The
 * zap_page_range call sets things up for refill_inactive to actually free
 * these pages later if no one else has touched them in the meantime,
 * although we could add these pages to a global reuse list for
 * refill_inactive to pick up before reclaiming other pages.
 *
 * NB: This interface discards data rather than pushes it out to swap,
 * as some implementations do.  This has performance implications for
 * applications like large transactional databases which want to discard
 * pages in anonymous maps after committing to backing store the data
 * that was kept in them.  There is no reason to write this data out to
 * the swap area if the application is discarding it.
 *
 * An interface that causes the system to free clean pages and flush
 * dirty pages is already available as msync(MS_INVALIDATE).
 */

It seems mmap is more for controlling the memory mapping of files than
for controlling the cache itself.
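
To make that distinction concrete: advice like MADV_DONTNEED applies to
pages you have mapped, so using it would mean doing the scan through
mmap() instead of read(), which PostgreSQL doesn't do.  A hypothetical
sketch:

    /* scandontneed.c -- seqscan via mmap, telling the VM we won't return */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct stat st;
        char  *p;
        long   i, sum = 0;
        int    fd = open("/some/relation/file", O_RDONLY);  /* placeholder */

        if (fd < 0 || fstat(fd, &st) < 0)
            return 1;
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;
        madvise(p, st.st_size, MADV_SEQUENTIAL);    /* read once, in order */
        for (i = 0; i < st.st_size; i++)
            sum += p[i];                            /* stand-in for the scan */
        madvise(p, st.st_size, MADV_DONTNEED);      /* done; reclaim these */
        munmap(p, st.st_size);
        close(fd);
        printf("%ld\n", sum);
        return 0;
    }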



--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: reindex/vacuum locking/performance?

From: Jason Hihn
Stepping out on a limb... (I'm not a disk kernel guy)

I have long thought that as part of a cache descriptor, there should be a
process-definable replacement strategy (RS).  Each cache entry would be
associated with its process's replacement-strategy variable, and the
page-replacement algorithm would then take the desired policy into
consideration.

Imagine, for simplicity's sake, that each strategy gets its own cache
table.  When it comes time to replace a page, the system scans the cache
tables, picks the most likely page for replacement from each table, then
selects the most likely page across all policies.  This allows the 99% of
apps that can make excellent use of LRU to use LRU among themselves (best
for everyone), and the MRU apps (better for databases, and also best for
everyone) to only sacrifice the best pages among themselves.  Though once
an MRU process is involved, the final decision about which page to take
should use MRU, not LRU.

Of course there are a number of questions: does each RS get its own table,
to be managed independently, or can we combine them all into one table?
What are the performance implications of the multiple-table management?

One day, I'd like to see function pointers and kernel modules used as ways
for apps to manage replacement policy:

    fantasyland# insmod MRU.o
    fantasyland# vi postgresql.conf      { replacement_policy=MRU }
    {meanwhile in some postgres .c file:}
    set_cache_policy(get_cfg_replacement_policy());
    fantasyland# service postmaster restart

Anyone want to throw this at the kernel developers?

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Tom Lane
> Sent: Monday, October 06, 2003 2:26 PM
> To: Neil Conway
> Cc: Andrew Sullivan; PostgreSQL Performance
> Subject: Re: [PERFORM] reindex/vacuum locking/performance?
>
>
> Neil Conway <neilc@samurai.com> writes:
> > On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> > I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
> > right?
>
> Not necessarily --- as you point out, it's not clear that O_DIRECT would
> help us.  What would be way cool is something similar to what James
> Rogers was talking about: a way to tell the kernel not to promote this
> page all the way to the top of its LRU list.  I'm not sure that *any*
> Unixen have such an API, let alone one that's common across more than
> one platform :-(
>
>             regards, tom lane


Re: reindex/vacuum locking/performance?

From: "Ronald Khoo"
> This would be relatively easy to fix as far as our own buffering is
> concerned, but the thing that's needed to make it really useful is
> to prevent caching of seqscan-read pages in the kernel disk buffers.
> I don't know any portable way to do that :-(

raw disc ? :-)