Thread: Heh, the disappearing problem!
Hi folks,

Remember me? Remember I was complaining about hash and index scans not
being used?

Guess what - it magically fixed itself.

If you want to talk about things that *bother* me, this one tops the pack.

The same query now returns an index/hash query plan, which executes in a
few seconds and requires little memory (as opposed to looped sequential
scans requiring 500MB on the server).

This is really, really, odd.

--
Karl Denninger (karl@MCS.Net) | MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/           | T1's from $600 monthly to FULL DS-3 Service
                              | NEW! K56Flex support on ALL modems
Voice: [+1 312 803-MCS1 x219] | EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]      | *SPAMBLOCK* Technology now included at no cost
On Mon, 9 Mar 1998, Karl Denninger wrote:

> Guess what - it magically fixed itself.
> [...]
> This is really, really, odd.

Uh, maybe a stupid question, but did you do a 'vacuum'?

Maarten
_____________________________________________________________________________
| TU Delft, The Netherlands, Faculty of Information Technology and Systems |
| Department of Electrical Engineering                                     |
| Computer Architecture and Digital Technique section                      |
| M.Boekhold@et.tudelft.nl                                                 |
-----------------------------------------------------------------------------
On Mon, Mar 09, 1998 at 10:22:13PM +0100, Maarten Boekhold wrote:

> Uh, maybe a stupid question, but did you do a 'vacuum'?

Uh, yeah. Several of them, thinking that perhaps the indices were out of
date or otherwise damaged. Then, after some period of time during which
queries and updates were being run against the table, it started working
properly (!)

I have *absolutely* no idea what could have changed - I made no alterations,
didn't shut the server down, nothing - between the time it was last
returning the wrong "explain" output and when it started returning the
correct output.

--
Karl Denninger (karl@MCS.Net)
> Uh, yeah. Several of them, thinking that perhaps the indices were out of
> date or otherwise damaged. Then, after some period of time during which
> queries and updates were being run against the table, it started working
> properly (!)

Maybe the tables grew to the point where the optimizer decided it was now
faster to use the indices? Just trying to make some sense out of this...

Maarten
On Mon, Mar 09, 1998 at 10:27:19PM +0100, Maarten Boekhold wrote:

> Maybe the tables grew to the point where the optimizer decided it was now
> faster to use the indices? Just trying to make some sense out of this...

I don't think so. We do a lot of replaces on these tables (a shitload of
them, in fact) but very few inserts. Vacuum typically finds that about the
same number of entries have been replaced as there are in the table to
begin with (ie: the update rate is roughly 1:1 against the number of tuples
in the database, or perhaps even a bit higher). And we run vacuum across
the database every night through cron.

--
Karl Denninger (karl@MCS.Net)
> Guess what - it magically fixed itself.
> [...]
> This is really, really, odd.

Vacuum? Vacuum analyze? Improper index?

--
Bruce Momjian                   | 830 Blythe Avenue
maillist@candle.pha.pa.us       | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup.    | (610) 853-3000(h)
On Mon, Mar 09, 1998 at 04:58:59PM -0500, Bruce Momjian wrote:

> Vacuum? Vacuum analyze? Improper index?

Nope, nope, and it was working until this morning with no indicated errors.

This and the text indices are the two really, really odd things I'm seeing
in 6.3, along with *some* operations taking a lot longer than they used to.

One of the things we do around here from libpq is an operation that looks
like this:

1) Select a set of records from a join on a couple of tables.

2) Select from a different table (using the results from the first
   select), looking specifically for certain data which we then use
   programmatically to perform a set of computations.

Now, the first select is just peachy. It returns ~1500 or so records.

The iterative loop on the second select used to run through the entire 1500
records or so (doing a select for each) in ~20-30 seconds. Now it takes
roughly 2-3 minutes to do the same job. I have checked with "explain" that
the indices are being used for all queries - they are.

I've seen this also with a few other processes that do the same kind of
thing, and they get the same results.

I'm wondering if what we're seeing here is a severe degradation of locking
performance. That is, could this be related to the new deadlock detection
code? We're doing selects/updates within several tables in these jobs, but
all within one transaction block (begin/commit or begin/rollback).

--
Karl Denninger (karl@MCS.Net)
> I'm wondering if what we're seeing here is a severe degradation of locking
> performance. That is, could this be related to the new deadlock detection
> code? We're doing selects/updates within several tables in these jobs, but
> all within one transaction block (begin/commit or begin/rollback).

The deadlock code only runs after a minute of waiting for a lock, and every
minute thereafter, so if you are not waiting on a lock, you don't run that
code.

The queue behavior, however, has changed. We used to give readers
preference, but now we do some queue management that is fairer to readers
and writers. See backend/storage/lmgr/proc.c::ProcSleep for the details:

    /*
     * If the first entries in the waitQueue have a greater priority than
     * we have, we must be a reader, and they must be writers, and we
     * must be here because the current holder is a writer or a reader
     * but we don't share shared locks if a writer is waiting. We put
     * ourselves after the writers. This way, we have a FIFO, but keep
     * the readers together to give them decent priority, and no one
     * starves. Because we group all readers together, a non-empty queue
     * only has a few possible configurations:
     *
     *  [readers]
     *  [writers]
     *  [readers][writers]
     *  [writers][readers]
     *  [writers][readers][writers]
     *
     * In a full queue, we would have a reader holding a lock, then a
     * writer gets the lock, then a bunch of readers, made up of readers
     * who could not share the first readlock because a writer was
     * waiting, and new readers arriving while the writer had the lock.
     */

--
Bruce Momjian (maillist@candle.pha.pa.us)
On Mon, Mar 09, 1998 at 10:01:17PM -0500, Bruce Momjian wrote:

> The deadlock code only runs after a minute of waiting for a lock, and
> every minute thereafter, so if you are not waiting on a lock, you don't
> run that code.
>
> Now the queue behavior has changed. We used to give readers preference,
> but now we do some queue management that is fairer to readers and
> writers.

Well, the other problem that comes with this is CPU load on the server. I
find that the system load goes through the ceiling - there's not a lot of
disk I/O going on during this time, but the CPU is damn busy, almost all of
it in user time.

I don't think this explains all of it. Something is going on in the
interaction of the different processes being handled. Basically, we have
the following:

1) One big "select" to get candidates.

2) A second select on another table, using values from the candidate, to
   check two algorithmic values (which can't really be pushed up into the
   first select as a join - if it could, that would make things easy).

3) If either of the two conditions is met, we take a process detour, do the
   appropriate system-level things, and update the candidate record. If we
   do take either of these paths, we open a transaction to make sure we
   either commit all changes or none of them.

4) If the candidate was modified (ie: a "begin" was executed along with one
   or more "updates") we send a "commit", assuming we were able to do the
   system-level things ok.

While this is running, no other instance can be (we use another table with
a single "flag" value as a means of preventing this), because if two of
these came along at once all hell would break loose. This is set off by an
async trigger when there could be candidates.

With 6.2.1 this worked fine - we got through the entire process in a few
seconds, and all was well. With 6.3 I have had to schedule this to run on
30-minute intervals, because it's 10x+ slower than it was under 6.2.1.

I'm giving serious consideration to a table reconstruction so I can use a
join to get all of this in one pass, which will get rid of the need to loop
over the first select's returned values.

I've seen this same kind of behavior in a few other places as well; in
places where you are doing reads and writes in a mixed environment (ie:
read something, write something (typically in a different table) based on
what you read) performance of 6.3 has gone in the toilet. Update jobs that
used to run in tens of seconds are requiring several minutes to complete
now.

And again, we're not seeing much disk I/O during this period - but we *ARE*
seeing a hell of a lot of CPU activity, almost all in user mode.

--
Karl Denninger (karl@MCS.Net)
> And again, we're not seeing much disk I/O during this period - but we *ARE*
> seeing a hell of a lot of CPU activity, almost all in user mode.

OK, how about using the postgres -t option, or profiling, to find the
function using so much CPU? This may help us tell where the problem lies.
Does anyone else see such problems? All the other reports I hear say that
6.3 is faster.

--
Bruce Momjian (maillist@candle.pha.pa.us)
On Mon, Mar 09, 1998 at 11:05:49PM -0500, Bruce Momjian wrote:

> OK, how about using the postgres -t option, or profiling, to find the
> function using so much CPU? This may help us tell where the problem lies.

Docs on this somewhere?

I'll be happy to profile it if I can figure out how! :-)

--
Karl Denninger (karl@MCS.Net)
> Docs on this somewhere?
>
> I'll be happy to profile it if I can figure out how! :-)

See the FAQ. Usually profiling is -pg.

--
Bruce Momjian (maillist@candle.pha.pa.us)
On Mon, 9 Mar 1998, Karl Denninger wrote:

> Docs on this somewhere?
>
> I'll be happy to profile it if I can figure out how! :-)

I have just figured this out.

First, you can get some timing information from PostgreSQL itself by
including '-tpa -tpl -te' in the backend options.

Second, you can recompile PostgreSQL with profiling on (look in
Makefile.global for the profiling options, and adjust them for your
system). Then *don't* do a 'make install'; only copy the postgres binary
(in the backend dir) to your PostgreSQL bin-dir, as postgres.profile.
Whenever you want to profile, edit your start-script to use that
postgres.profile as the backend.

Run your queries, then copy the profile output (gmon.out for gcc
profiling) from the *database* directory you ran your queries on (ie: if
you connected to database 'template1', it'll be in data/base/template1).
Then do the normal profiling steps (for gcc profiling, run
'gprof bin/postgres.profile gmon.out > prof.out' to get readable output).

If you have questions, please mail me privately.

Maarten
Hello,

I posted a similar problem here on Saturday. Now I have had a little time
to look at the queries we send. This is the output of PG 6.2.1:

boersenspiel=> explain SELECT DISTINCT spieler_nr, Trans.wpk_nr, state,
    anzahl, buyprice, buydate, sellprice, selldate, Kurse.wpk_nr, name,
    curr, kurs, datum, Trans.oid
    FROM Trans, Kurse
    WHERE Trans.wpk_nr=Kurse.wpk_nr AND spieler_nr=3
    ORDER BY Trans.wpk_nr, selldate USING >;

NOTICE:QUERY PLAN:

Unique  (cost=0.00 size=0 width=0)
  ->  Sort  (cost=4.10 size=0 width=0)
        ->  Nested Loop  (cost=4.10 size=1 width=73)
              ->  Index Scan on trans  (cost=2.05 size=1 width=41)
              ->  Index Scan on kurse  (cost=2.05 size=14305 width=32)

Now the same query in 6.3:

Unique  (cost=1164.21 size=0 width=0)
  ->  Sort  (cost=1164.21 size=0 width=0)
        ->  Hash Join  (cost=1164.21 size=1 width=73)
              ->  Seq Scan on kurse  (cost=688.07 size=14305 width=32)
              ->  Hash  (cost=0.00 size=0 width=0)
                    ->  Index Scan on trans  (cost=2.05 size=1 width=41)

All indices are created (all btrees), but the index on kurse doesn't seem
to be used.

> Guess what - it magically fixed itself.
> [...]
> This is really, really, odd.

Ditto.

BTW: I tried to apply the patches from Massimo, as the only major problem
for us in 6.2.1p6 is the buggy deadlock code. Anybody managed to get it
working?

Ciao,
Ulrich

Ulrich Voss       \ \   / /__   / ___|__ _| |    VoCal web publishing
voss@vocalweb.de   \ \ / / _ \ | |   / _` | |    http://www.vocalweb.de
                    \ V / (_) || |__| (_| | |    Tel: 0203-306-1560
                     \_/ \___/  \____\__,_|_|    web publishing