Thread: Heh, the disappearing problem!

Heh, the disappearing problem!

From: Karl Denninger

Hi folks,

Remember me?  Remember I was bitching about hash and index scans not
being used?

Guess what - it magically fixed itself.

If you want to talk about things that *bother* me, this one tops the pack.

The same query now returns an index/hash query plan, which executes in a few
seconds and requires little memory (as opposed to looped sequential scans
requiring 500MB on the server).

This is really, really, odd.

--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

Re: [HACKERS] Heh, the disappearing problem!

From: Maarten Boekhold

On Mon, 9 Mar 1998, Karl Denninger wrote:

>
> Hi folks,
>
> Remember me?  Remember I was bitching about hash and index scans not
> being used?
>
> Guess what - it magically fixed itself.
>
> If you want to talk about things that *bother* me, this one tops the pack.
>
> The same query now returns an index/hash query plan, which executes in a few
> seconds and requires little memory (as opposed to looped sequential scans
> requiring 500MB on the server).
>
> This is really, really, odd.

Uh, maybe a stupid question, but did you do a 'vacuum'?

Maarten

_____________________________________________________________________________
| TU Delft, The Netherlands, Faculty of Information Technology and Systems  |
|                   Department of Electrical Engineering                    |
|           Computer Architecture and Digital Technique section             |
|                          M.Boekhold@et.tudelft.nl                         |
-----------------------------------------------------------------------------


Re: [HACKERS] Heh, the disappearing problem!

From: Karl Denninger

On Mon, Mar 09, 1998 at 10:22:13PM +0100, Maarten Boekhold wrote:
> On Mon, 9 Mar 1998, Karl Denninger wrote:
>
> >
> > Hi folks,
> >
> > Remember me?  Remember I was bitching about hash and index scans not
> > being used?
> >
> > Guess what - it magically fixed itself.
> >
> > If you want to talk about things that *bother* me, this one tops the pack.
> >
> > The same query now returns an index/hash query plan, which executes in a few
> > seconds and requires little memory (as opposed to looped sequential scans
> > requiring 500MB on the server).
> >
> > This is really, really, odd.
>
> Uh, maybe a stupid question, but did you do a 'vacuum'?

Uh, yeah.  Several of them, thinking that perhaps the indices were out of
date or otherwise damaged.  Then, after some period of time during which
queries and updates were being run against the table, it started working
properly (!)

I have *absolutely* no idea what could have changed - I made no alterations,
didn't shut the server down, nothing - between the time that it was last
returning the wrong "explain" output and when it started returning the
correct output.

--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

Re: [HACKERS] Heh, the disappearing problem!

From: Maarten Boekhold

> > Uh, maybe a stupid question, but did you do a 'vacuum'?
>
> Uh, yeah.  Several of them, thinking that perhaps the indices were out of
> date or otherwise damaged.  Then, after some period of time during which
> queries and updates were being run against the table, it started working
> properly (!)
>
> I have *absolutely* no idea what could have changed - I made no alterations,
> didn't shut the server down, nothing - between the time that it was last
> returning the wrong "explain" output and when it started returning the
> correct output.

Maybe the tables grew to the point where the optimizer decided that using
indices would now be faster? Just trying to make some sense out of this...

Maarten

_____________________________________________________________________________
| TU Delft, The Netherlands, Faculty of Information Technology and Systems  |
|                   Department of Electrical Engineering                    |
|           Computer Architecture and Digital Technique section             |
|                          M.Boekhold@et.tudelft.nl                         |
-----------------------------------------------------------------------------


Re: [HACKERS] Heh, the disappearing problem!

From: Karl Denninger

On Mon, Mar 09, 1998 at 10:27:19PM +0100, Maarten Boekhold wrote:
> > > Uh, maybe a stupid question, but did you do a 'vacuum'?
> >
> > Uh, yeah.  Several of them, thinking that perhaps the indices were out of
> > date or otherwise damaged.  Then, after some period of time during which
> > queries and updates were being run against the table, it started working
> > properly (!)
> >
> > I have *absolutely* no idea what could have changed - I made no alterations,
> > didn't shut the server down, nothing - between the time that it was last
> > returning the wrong "explain" output and when it started returning the
> > correct output.
>
> Maybe the tables grew to the point where the optimizer decided that using
> indices would now be faster? Just trying to make some sense out of this...
>
> Maarten

I don't think so.  We do a lot of replaces on these tables (a shitload of
them in fact) but very few inserts.  Vacuum typically finds that about the
same number of entries have been replaced as there are in the table to begin
with (ie: the update rate is roughly 1:1 relative to the number of tuples in
the database, or perhaps even a bit higher).

And, we run vacuum across the database every night through cron.

--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

Re: [HACKERS] Heh, the disappearing problem!

From: Bruce Momjian

>
>
> Hi folks,
>
> Remember me?  Remember I was bitching about hash and index scans not
> being used?
>
> Guess what - it magically fixed itself.
>
> If you want to talk about things that *bother* me, this one tops the pack.
>
> The same query now returns an index/hash query plan, which executes in a few
> seconds and requires little memory (as opposed to looped sequential scans
> requiring 500MB on the server).
>
> This is really, really, odd.

Vacuum?  Vacuum analyze?  Improper index?
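
For reference, those three checks look something like this (table, column,
and key names here are hypothetical, just to illustrate):

    -- reclaim dead tuples, then rebuild the planner's statistics
    VACUUM mytable;
    VACUUM ANALYZE mytable;

    -- confirm the planner actually picks the index
    EXPLAIN SELECT * FROM mytable WHERE key_col = 42;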

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Heh, the disappearing problem!

From: Karl Denninger

On Mon, Mar 09, 1998 at 04:58:59PM -0500, Bruce Momjian wrote:
> >
> >
> > Hi folks,
> >
> > Remember me?  Remember I was bitching about hash and index scans not
> > being used?
> >
> > Guess what - it magically fixed itself.
> >
> > If you want to talk about things that *bother* me, this one tops the pack.
> >
> > The same query now returns an index/hash query plan, which executes in a few
> > seconds and requires little memory (as opposed to looped sequential scans
> > requiring 500MB on the server).
> >
> > This is really, really, odd.
>
> Vacuum?  Vacuum analyze?  Improper index?

Nope, nope, and it was working until this morning with no indicated errors.

This and the text indices are the two really, really odd things I'm seeing
in 6.3, along with *some* operations taking a lot longer than they used to.

One of the things we do around here from libpq is an operation that looks
like this:


1)    Select a set of records from a join on a couple of tables.

2)    Select from a different table (using the results from the first
    select), looking specifically for certain data which we then use
    programmatically to perform a set of computations.

Now, the first select is just peachy.  It returns ~1500 or so records.

The iterative loop on the second select used to run through the entire 1500
records or so (doing a select for each) in ~20-30 seconds.  Now it takes
roughly 2-3 minutes to do the same job.  I have checked with "explain" that
the indices are being used for all queries - they are.

I've seen this also with a few other processes that do the same kind of
thing, and get the same results.
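
For concreteness, a minimal sketch of that access pattern (all table and
column names are hypothetical):

    -- step 1: one big select over a join; this part runs fine
    SELECT c.cust_id, c.acct_no
      FROM customers c, services s
     WHERE c.cust_id = s.cust_id;

    -- step 2: issued once per row returned above, from libpq;
    -- this is the loop that went from ~20-30 seconds to 2-3 minutes
    SELECT usage, quota
      FROM limits
     WHERE acct_no = 1234;   -- value substituted per candidate row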

I'm wondering if what we're seeing here is a severe degradation of locking
performance.  That is, could this be related to the new deadlock detection
code?  We're doing selects/updates within several tables in these jobs, but
all within one transaction block (begin/commit or begin/rollback).

--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

Re: [HACKERS] Heh, the disappearing problem!

From: Bruce Momjian

>
> Nope, nope and it was working until this morning with no indicated errors.
>
> This and the text indices are the two really, really odd things I'm seeing
> in 6.3, along with *some* operations taking a lot longer than they used to.
>
> One of the things we do around here from libpq is an operation that looks
> like this:
>
>
> 1)    Select a set of records from a join on a couple of tables.
>
> 2)    Select from a different table (using the results from the first
>     select), looking specifically for certain data which we then use
>     programmatically to perform a set of computations.
>
> Now, the first select is just peachy.  It returns ~1500 or so records.
>
> The iterative loop on the second select used to run through the entire 1500
> records or so (doing a select for each) in ~20-30 seconds.  Now it takes
> roughly 2-3 minutes to do the same job.  I have checked with "explain" that
> the indices are being used for all queries - they are.
>
> I've seen this also with a few other processes that do the same kind of
> thing, and get the same results.
>
> I'm wondering if what we're seeing here is a severe degradation of locking
> performance.  That is, could this be related to the new deadlock detection
> code?  We're doing selects/updates within several tables in these jobs, but
> all within one transaction block (begin/commit or begin/rollback).

The deadlock code only runs after a minute of waiting for a lock, and
every minute thereafter, so if you are not waiting on a lock, you don't
run that code.

Now the queue behavior has changed.  We used to give readers preference,
but now we do some queue management that is fairer to readers and
writers.

See backend/storage/lmgr/proc.c::ProcSleep for the details:

    /*
     * If the first entries in the waitQueue have a greater priority than
     * we have, we must be a reader, and they must be a writers, and we
     * must be here because the current holder is a writer or a reader but
     * we don't share shared locks if a writer is waiting. We put
     * ourselves after the writers.  This way, we have a FIFO, but keep
     * the readers together to give them decent priority, and no one
     * starves.  Because we group all readers together, a non-empty queue
     * only has a few possible configurations:
     *
     * [readers] [writers] [readers][writers] [writers][readers]
     * [writers][readers][writers]
     *
     * In a full queue, we would have a reader holding a lock, then a writer
     * gets the lock, then a bunch of readers, made up of readers who
     * could not share the first readlock because a writer was waiting,
     * and new readers arriving while the writer had the lock.
     *
     */


--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Heh, the disappearing problem!

From: Karl Denninger

On Mon, Mar 09, 1998 at 10:01:17PM -0500, Bruce Momjian wrote:
> > 1)    Select a set of records from a join on a couple of tables.
> >
> > 2)    Select from a different table (using the results from the first
> >     select), looking specifically for certain data which we then use
> >     programmatically to perform a set of computations.
> >
> > Now, the first select is just peachy.  It returns ~1500 or so records.
> >
> > The iterative loop on the second select used to run through the entire 1500
> > records or so (doing a select for each) in ~20-30 seconds.  Now it takes
> > roughly 2-3 minutes to do the same job.  I have checked with "explain" that
> > the indices are being used for all queries - they are.
> >
> > I've seen this also with a few other processes that do the same kind of
> > thing, and get the same results.
> >
> > I'm wondering if what we're seeing here is a severe degradation of locking
> > performance.  That is, could this be related to the new deadlock detection
> > code?  We're doing selects/updates within several tables in these jobs, but
> > all within one transaction block (begin/commit or begin/rollback).
>
> The deadlock code only runs after a minute of waiting for a lock, and
> every minute thereafter, so if you are not waiting on a lock, you don't
> run that code.
>
> Now the queue behavior has changed.  We used to give readers preference,
> but now we do some queue management that is fairer to readers and
> writers.

Well, the other problem that comes with this, however, is CPU load on the
server.  I find that the system load goes through the ceiling - there's
not a lot of disk I/O going on during this time, but the CPU is damn busy,
almost all of it in user time.

I don't think this explains all of it.  Something is going on in the
interaction of the different processes being handled.

Basically, we have the following:

1)    One big "select" to get candidates.

2)    A second select on another table using values from the candidate to
    check two algorithmic values (which can't really be pushed up into
    the first select as a join - if it could, that would make things
    easy).

3)    If either of the two conditions is met, then we take a process
    detour and do the appropriate system-level things, and update the
    candidate record. If we do take either of these, we open a
    transaction to make sure we either commit all changes or none of
    them.

4)    If the candidate was modified (ie: a "begin" was executed along
    with one or more "updates") we send a "commit", assuming we were
    able to do the system-level things ok.

While this is running, no other instance can be running (we use another
table with a single "flag" value to prevent this), because if two of these
came along at once, all hell would break loose.
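
A minimal sketch of that single-instance interlock, under the assumption
that the flag table holds exactly one row (table and column names are
hypothetical):

    BEGIN;
    -- claim the flag; if this updates zero rows (check PQcmdTuples in
    -- libpq), another instance is already running, so roll back and exit
    UPDATE runlock SET busy = 't' WHERE busy = 'f';
    -- ... do the real work ...
    UPDATE runlock SET busy = 'f';
    COMMIT;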

This is "set off" by an async trigger when there could be candidates.

With 6.2.1 this worked fine - we got through the entire process in a few
seconds, and all was well.

With 6.3 I have had to schedule this to run on 30-minute intervals, because
it's 10x+ slower than it was under 6.2.1.  I'm giving serious consideration
to a table reconstruction so I can use a join to get all of this in one
pass, which would get rid of the need to loop over the first select's
returned values.
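
That reconstruction would fold the per-candidate select into the first
query, roughly like this (hypothetical names again, matching the sketch
earlier in the thread):

    -- one pass: join the per-candidate table in directly, instead of
    -- issuing a separate select for each of the ~1500 candidate rows
    SELECT c.cust_id, c.acct_no, l.usage, l.quota
      FROM customers c, services s, limits l
     WHERE c.cust_id = s.cust_id
       AND l.acct_no = c.acct_no;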

I've seen this same kind of behavior in a few other places as well; in
places where you are doing reads and writes in a mixed environment (ie: read
something, write something (typically in a different table) based on what
you read) performance of 6.3 has gone in the toilet.  Update jobs that used
to run in tens of seconds are requiring several minutes to complete now.

And again, we're not seeing much disk I/O during this period - but we *ARE*
seeing a hell of a lot of CPU activity, almost all in user mode.

--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

Re: [HACKERS] Heh, the disappearing problem!

From: Bruce Momjian

> I've seen this same kind of behavior in a few other places as well; in
> places where you are doing reads and writes in a mixed environment (ie: read
> something, write something (typically in a different table) based on what
> you read) performance of 6.3 has gone in the toilet.  Update jobs that used
> to run in tens of seconds are requiring several minutes to complete now.
>
> And again, we're not seeing much disk I/O during this period - but we *ARE*
> seeing a hell of a lot of CPU activity, almost all in user mode.

OK, how about using the postgres -t option or profiling to find the
function using so much CPU?  This may help us tell where the problem
lies.  Does anyone else see such problems?  All the other reports I hear
say that 6.3 was faster.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Heh, the disappearing problem!

From: Karl Denninger

On Mon, Mar 09, 1998 at 11:05:49PM -0500, Bruce Momjian wrote:
> > I've seen this same kind of behavior in a few other places as well; in
> > places where you are doing reads and writes in a mixed environment (ie: read
> > something, write something (typically in a different table) based on what
> > you read) performance of 6.3 has gone in the toilet.  Update jobs that used
> > to run in tens of seconds are requiring several minutes to complete now.
> >
> > And again, we're not seeing much disk I/O during this period - but we *ARE*
> > seeing a hell of a lot of CPU activity, almost all in user mode.
>
> OK, how about using the postgres -t option or profiling to find the
> function using so much CPU?  This may help us tell where the problem
> lies.  Does anyone else see such problems?  All the other reports I hear
> say that 6.3 was faster.

Docs on this somewhere?

I'll be happy to profile it if I can figure out how! :-)

--
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly to FULL DS-3 Service
                             | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

Re: [HACKERS] Heh, the disappearing problem!

From: Bruce Momjian

> > was faster.
>
> Docs on this somewhere?
>
> I'll be happy to profile it if I can figure out how! :-)
>

See the FAQ.  Usually profiling is enabled with the -pg compiler flag.

--
Bruce Momjian                          |  830 Blythe Avenue
maillist@candle.pha.pa.us              |  Drexel Hill, Pennsylvania 19026
  +  If your life is a hard drive,     |  (610) 353-9879(w)
  +  Christ can be your backup.        |  (610) 853-3000(h)

Re: [HACKERS] Heh, the disappearing problem!

From: Maarten Boekhold

On Mon, 9 Mar 1998, Karl Denninger wrote:

> On Mon, Mar 09, 1998 at 11:05:49PM -0500, Bruce Momjian wrote:
> > > I've seen this same kind of behavior in a few other places as well; in
> > > places where you are doing reads and writes in a mixed environment (ie: read
> > > something, write something (typically in a different table) based on what
> > > you read) performance of 6.3 has gone in the toilet.  Update jobs that used
> > > to run in tens of seconds are requiring several minutes to complete now.
> > >
> > > And again, we're not seeing much disk I/O during this period - but we *ARE*
> > > seeing a hell of a lot of CPU activity, almost all in user mode.
> >
> > OK, how about using the postgres -t option or profiling to find the
> > function using so much CPU?  This may help us tell where the problem
> > lies.  Does anyone else see such problems?  All the other reports I hear
> > say that 6.3 was faster.
>
> Docs on this somewhere?
>
> I'll be happy to profile it if I can figure out how! :-)

I have just figured this out. First, you can get some timing information
from PostgreSQL itself by including '-tpa -tpl -te' in the backend
options. Second, you can recompile PostgreSQL with profiling on:

1)    Look in Makefile.global for the profiling options and adjust them
    for your system.

2)    *Don't do a 'make install'*; only copy the postgres binary (in the
    backend dir) to your PostgreSQL bin-dir, as postgres.profile.

3)    Whenever you want to profile, edit your start script to use that
    postgres.profile as the backend.

4)    Run your queries, then copy the profile output (gmon.out for gcc
    profiling) from the *database* directory you ran your queries against
    (ie. if you connected to database 'template1', it will be in
    data/base/template1).

Then do the normal profiling steps (for gcc profiling, run
'gprof bin/postgres.profile gmon.out > prof.out' to get readable output).

If you have questions, please mail me privately.

Maarten

_____________________________________________________________________________
| TU Delft, The Netherlands, Faculty of Information Technology and Systems  |
|                   Department of Electrical Engineering                    |
|           Computer Architecture and Digital Technique section             |
|                          M.Boekhold@et.tudelft.nl                         |
-----------------------------------------------------------------------------


Re: [HACKERS] Heh, the disappearing problem!

From: "Boersenspielteam"

Hello,

I posted a similar problem here on Saturday. Now I've had a little time
to look at the queries we send.

This is the output of PG 6.2.1:

boersenspiel=> explain SELECT DISTINCT spieler_nr, Trans.wpk_nr,
state, anzahl, buyprice, buydate, sellprice, selldate, Kurse.wpk_nr,
name, curr, kurs, datum, Trans.oid from Trans, Kurse WHERE
Trans.wpk_nr=Kurse.wpk_nr AND spieler_nr=3 ORDER BY Trans.wpk_nr,
selldate USING >;

NOTICE:  QUERY PLAN:

Unique  (cost=0.00 size=0 width=0)
  ->  Sort  (cost=4.10 size=0 width=0)
        ->  Nested Loop  (cost=4.10 size=1 width=73)
              ->  Index Scan on trans  (cost=2.05 size=1 width=41)
              ->  Index Scan on kurse  (cost=2.05 size=14305 width=32)


Now the same query in 6.3:

Unique  (cost=1164.21 size=0 width=0)
  ->  Sort  (cost=1164.21 size=0 width=0)
        ->  Hash Join  (cost=1164.21 size=1 width=73)
              ->  Seq Scan on kurse  (cost=688.07 size=14305 width=32)
              ->  Hash  (cost=0.00 size=0 width=0)
                    ->  Index Scan on trans  (cost=2.05 size=1 width=41)

All indices are created (all btrees), but the index on kurse doesn't
seem to be used.
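
One thing worth trying here (assuming the btree index on kurse.wpk_nr
really exists) is to refresh the planner statistics for the table being
sequentially scanned, then re-check the plan:

    VACUUM ANALYZE kurse;
    -- then re-run the EXPLAIN above and see whether the
    -- "Seq Scan on kurse" becomes an Index Scan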


> Guess what - it magically fixed itself.
>
> If you want to talk about things that *bother* me, this one tops the pack.
>
> The same query now returns an index/hash query plan, which executes in a few
> seconds and requires little memory (as opposed to looped sequential scans
> requiring 500MB on the server).
>
> This is really, really, odd.

Ditto.

BTW: I tried to apply the patches from Massimo, since the only major
problem for us in 6.2.1p6 is the buggy deadlock code. Has anybody managed
to get them working?

Ciao

Ulrich



Ulrich Voss                            \ \   / /__  / ___|__ _| |
VoCal web publishing                    \ \ / / _ \| |   / _` | |
voss@vocalweb.de                         \ V / (_) | |__| (_| | |
http://www.vocalweb.de                    \_/ \___/ \____\__,_|_|
Tel: 0203-306-1560                                 web publishing