Thread: Dead Lock problem with 8.1.3
Hi Folks, I have a strange Problem (to be honest there are more than one, but this is one of it) after Upgrading to Postgres 8.1.3. The following SQL seems to produce a deadlock while doing an endless reading of a temp table: SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = 34 AND s.sid NOT IN ( SELECT sid FROM stud_vera WHERE veraid = 2 ); It should give me a List of sids which are connected to veraid 34 but not to veraid 2 at the same time. When removing the 'AND v.veraid = 34': SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND s.sid NOT IN ( SELECT sid FROM stud_vera WHERE veraid = 2 ); everything works really fine (even if I include an additional criteria like 'AND stud_vera.status > 0' ). On Postgres 8.0.8 this problem doesn't exist. The first clause works with a lot of additional criteria very fast. Any Idea? *greets* Kai
No one any idea? *sigh* Kai Hessing wrote: > Hi Folks, > > I have a strange Problem (to be honest there are more than one, but this > is one of it) after Upgrading to Postgres 8.1.3. The following SQL seems > to produce a deadlock while doing an endless reading of a temp table: > > SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = > 34 AND s.sid NOT IN ( SELECT sid FROM stud_vera WHERE veraid = 2 ); > > It should give me a List of sids which are connected to veraid 34 but > not to veraid 2 at the same time. When removing the 'AND v.veraid = 34': > > SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND s.sid NOT > IN ( SELECT sid FROM stud_vera WHERE veraid = 2 ); > > everything works really fine (even if I include an additional criteria > like 'AND stud_vera.status > 0' ). On Postgres 8.0.8 this problem > doesn't exist. The first clause works with a lot of additional criteria > very fast. > > Any Idea? > > *greets* > Kai
On Tue, Sep 26, 2006 at 03:59:48PM +0200, Kai Hessing wrote: > No one any idea? *sigh* It probably has something to with the fact that you didn't explain what you meant by "deadlock". Also, you refer to a temp table, yet don't indicate which table it is. You'll need to be a lot more specific about your problem before we can help. For example, a script to reproduce the problem, or a copy of psql output demonstrating it. Hope this helps, > Kai Hessing wrote: > > Hi Folks, > > > > I have a strange Problem (to be honest there are more than one, but this > > is one of it) after Upgrading to Postgres 8.1.3. The following SQL seems > > to produce a deadlock while doing an endless reading of a temp table: > > > > SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = > > 34 AND s.sid NOT IN ( SELECT sid FROM stud_vera WHERE veraid = 2 ); Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
Kai Hessing wrote: > No one any idea? *sigh* > > Kai Hessing wrote: >> Hi Folks, >> >> I have a strange Problem (to be honest there are more than one, but this >> is one of it) after Upgrading to Postgres 8.1.3. The following SQL seems >> to produce a deadlock while doing an endless reading of a temp table: >> >> SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = >> 34 AND s.sid NOT IN ( SELECT sid FROM stud_vera WHERE veraid = 2 ); I'm pretty sure it's not a deadlock. It probably takes very long for some reason; maybe an explain of that query will give some insight. You probably lack some indices. Alternatively, have you tried an EXCEPT query? I think this is what you try to query: SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = 34 EXCEPT SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = 2 -- Alban Hertroys alban@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //
Martijn van Oosterhout wrote: > On Tue, Sep 26, 2006 at 03:59:48PM +0200, Kai Hessing wrote: > >>No one any idea? *sigh* > > > It probably has something to with the fact that you didn't explain what > you meant by "deadlock". Also, you refer to a temp table, yet don't > indicate which table it is. > > You'll need to be a lot more specific about your problem before we can > help. For example, a script to reproduce the problem, or a copy of psql > output demonstrating it. Also EXPLAIN output, and information regarding the number of rows in each table, and the number of rows matching veraid = 2 and veraid = 34 might help. While the query is running, how much CPU is the process consuming, and what does vmstat show for disk and swap I/O? Joe
Kai Hessing <kai.hessing@hobsons.de> writes: > No one any idea? *sigh* What makes you think it's a deadlock and not a very slow query? I'd be checking if the tables were all ANALYZEd and comparing EXPLAIN output to the old database ... regards, tom lane
Alban Hertroys wrote: >>> SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = >>> 34 AND s.sid NOT IN ( SELECT sid FROM stud_vera WHERE veraid = 2 ); > > I'm pretty sure it's not a deadlock. It probably takes very long for > some reason; maybe an explain of that query will give some insight. You > probably lack some indices. No. The system goes into an endless loop. The part ('SELECT sid FROM stud_vera WHERE veraid = 2') seems to create a temporary table again and again and again .... The same clause needs around 5 seconds under Postgres 8.0.8. On 8.1.3 we killed the process after 40 hours while using constantly 80% CPU power. (Btw. Explain leads to the same problem, it just hangs up...) We're currently trying to separate the problem into a simple setup. > Alternatively, have you tried an EXCEPT query? I think this is what you > try to query: > > SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = 34 > EXCEPT > SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = 2 Thanks, I will try this out. To be honest, I didn't know the EXCEPT query (and I think the programmer also doesn't *g*). *greets* Kai
Joe Conway wrote: > Martijn van Oosterhout wrote: >> It probably has something to with the fact that you didn't explain what >> you meant by "deadlock". Also, you refer to a temp table, yet don't >> indicate which table it is. Deadlock means it hangs up and doesn't terminate through timeout. The system goes into an endless loop. The part ('SELECT sid FROM stud_vera WHERE veraid = 2') seems to create a temporary table again and again and again .... >> You'll need to be a lot more specific about your problem before we can >> help. For example, a script to reproduce the problem, or a copy of psql >> output demonstrating it. There is no output. It just takes forever. We're currently working to seperate this problem into a simple setup. If we have one, I will post it. > Also EXPLAIN output, and information regarding the number of rows in > each table, and the number of rows matching veraid = 2 and veraid = 34 > might help. Explain produces the same problem. It just takes forever... > While the query is running, how much CPU is the process consuming, and > what does vmstat show for disk and swap I/O? ~80% CPU power. Disk usage is not noticable. Thanks and *greets* Kai
Tom Lane wrote: > Kai Hessing <kai.hessing@hobsons.de> writes: >> No one any idea? *sigh* > > What makes you think it's a deadlock and not a very slow query? I'd be > checking if the tables were all ANALYZEd and comparing EXPLAIN output > to the old database ... *hmm* the difference of 5 seconds on 8.0.8 compared to killing the process after 40 hours? All databases are ANALYZEd and EXPLAIN output cannot be compared because it hangs also up on 8.0.8.
Kai Hessing wrote: > Deadlock means it hangs up and doesn't terminate through timeout. > The system goes into an endless loop. The part ('SELECT sid FROM > stud_vera WHERE veraid = 2') seems to create a temporary table again and > again and again .... It is possible for a query to run for many days, and still finish. This classifies as slow, not hung. The difference is important in troubleshooting to determine the cause. >>Also EXPLAIN output, and information regarding the number of rows in >>each table, and the number of rows matching veraid = 2 and veraid = 34 >>might help. > > Explain produces the same problem. It just takes forever... Did you try EXPLAIN, or EXPLAIN ANALYZE? The former only does the planning, the latter actually materializes the result in order to get actual timings, and will therefore take at least as long as the query itself. >>While the query is running, how much CPU is the process consuming, and >>what does vmstat show for disk and swap I/O? > > ~80% CPU power. Disk usage is not noticable. Can you attach with a debugger and see exactly what's going on? If not we'd need that self contained test case to reproduce the problem. Joe
On Wed, Sep 27, 2006 at 06:46:42PM +0200, Kai Hessing wrote: > > Deadlock means it hangs up and doesn't terminate through timeout. No, it doesn't. Deadlock means, for the two deadlocked queries, both cannot possibly finish because each waits on a lock that the other one holds. You can cause such deadlocks in your application, too, of course, but they're not database deadlocks. Also. . . > There is no output. It just takes forever. . . .define "forever". Is it doing any work? Do you see i/o? Is it in SELECT WAITING state? A -- Andrew Sullivan | ajs@crankycanuck.ca This work was visionary and imaginative, and goes to show that visionary and imaginative work need not end up well. --Dennis Ritchie
Kai Hessing wrote: > Alban Hertroys wrote: >>>> SELECT s.sid FROM stud s, stud_vera v WHERE s.sid = v.sid AND v.veraid = >>>> 34 AND s.sid NOT IN ( SELECT sid FROM stud_vera WHERE veraid = 2 ); >> I'm pretty sure it's not a deadlock. It probably takes very long for >> some reason; maybe an explain of that query will give some insight. You >> probably lack some indices. > > No. The system goes into an endless loop. The part ('SELECT sid FROM > stud_vera WHERE veraid = 2') seems to create a temporary table again and > again and again .... Ah, this is where part of the confusion stems from. That subquery just results in a resultset being created (maybe not even that, depends a bit on the query planner). That's not a temporary table. I rather doubt that postgres would repeat that query, it's results won't change between comparisons with rows from your main query (this depends a bit on what type of transaction isolation you use). > The same clause needs around 5 seconds under Postgres 8.0.8. On 8.1.3 we > killed the process after 40 hours while using constantly 80% CPU power. > (Btw. Explain leads to the same problem, it just hangs up...) EXPLAIN without ANALYZE locking up?!? Maybe some application is holding a lock on a record in your result set. Did you try this query without any other applications connecting to that database? If you can't do that, you could dump that database and restore it in a different one for testing cases like this. Lastly, considering the odd behaviour and the huge differences between minor versions of the database, you might have a corrupted index somewhere. You can fix those with REINDEX. -- Alban Hertroys alban@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //
> EXPLAIN without ANALYZE locking up?!? Maybe some application is holding > a lock on a record in your result set. OK, this was a good hint. I got EXPLAIN working. My mistake was, that I tried explain the last time, while the 'Killer-SQL' was running. Then EXPLAIN didn't answer until I killed the CPU-eating process. Interesting is, that it seems to be not possible to access pg_locks while this process is running. A 'select * from pg_locks' takes also for... a very long time... > Did you try this query without > any other applications connecting to that database? > > If you can't do that, you could dump that database and restore it in a > different one for testing cases like this. Yes, I surely have a testing environment with the same setup as the live system. The Query does not work. But after terminating all running queries the EXPLAIN did work. This is the result (The 8.0.8 database has a dataset which is around 2 weeks older): PostgreSQL 8.1.4 (The one which is taking... a very long time...) Merge Join (cost=7751.81..50026810.45 rows=7364 width=4) Merge Cond: ("outer".sid = "inner".sid) -> Index Scan using stud_pkey on stud s (cost=7751.81..49994210.01 rows=56607 width=4) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=7751.81..8497.94 rows=53613 width=4) -> Seq Scan on stud_vera (cost=0.00..7488.20 rows=53613 width=4) Filter: (veraid = 2) -> Index Scan using stud_vera_sid_veraid_idx on stud_vera v (cost=0.00..37646.74 rows=14729 width=4) Index Cond: (veraid = 34) PostgreSQL 8.0.8 (The one which takes only a few seconds... on a much slower system...) Merge Join (cost=22724.05..23019.46 rows=731 width=4) Merge Cond: ("outer".sid = "inner".sid) -> Sort (cost=15223.02..15363.42 rows=56159 width=4) Sort Key: s.sid -> Seq Scan on stud s (cost=7427.89..10792.85 rows=56159 width=4) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on stud_vera (cost=0.00..7424.24 rows=1461 width=4) Filter: (veraid = 2) -> Sort (cost=7501.03..7504.69 rows=1461 width=4) Sort Key: v.sid -> Seq Scan on stud_vera v (cost=0.00..7424.24 rows=1461 width=4) Filter: (veraid = 34) > Lastly, considering the odd behaviour and the huge differences between > minor versions of the database, you might have a corrupted index > somewhere. You can fix those with REINDEX. I did REINDEX and REINDEX FORCE and VACUUM FULL ANALYZE and all of this doesn't take any noticable effect neither in calling the query nor in explaining it. Another thing I did was to do strace to the problematic process and noticed that it is also repeatedly accessing a file called (for the whole time): postgresql/8.1/main/base/1740468/pgsql_tmp/pgsql_tmp21938.0 It doesn't cost much disk performance but is permanently working.
Andrew Sullivan wrote: > No, it doesn't. Deadlock means, for the two deadlocked queries, both > cannot possibly finish because each waits on a lock that the other > one holds. Thanks for the clarification. I thought a deadlock is also, when the system runs into an endless loop. >> There is no output. It just takes forever. > > . . .define "forever". Is it doing any work? Do you see i/o? Is it > in SELECT WAITING state? No it is not in WAITING state, it is permanently working. strace is telling me that it is constantly accessing a file: 'postgresql/8.1/main/base/1740468/pgsql_tmp/pgsql_tmp21938.0' It doesn't cost much performance, but it is running all the time. Please see the full update at Message-ID: <4o2bkgFclmdhU1@individual.net>
Joe Conway wrote: > It is possible for a query to run for many days, and still finish. This > classifies as slow, not hung. The difference is important in > troubleshooting to determine the cause. OK, what do you suggest, how long should the process run, until I can except it not to end? >>>Also EXPLAIN output, and information regarding the number of rows in >>>each table, and the number of rows matching veraid = 2 and veraid = 34 >>>might help. >> >> Explain produces the same problem. It just takes forever... > > Did you try EXPLAIN, or EXPLAIN ANALYZE? The former only does the > planning, the latter actually materializes the result in order to get > actual timings, and will therefore take at least as long as the query > itself. I got EXPLAIN to work. Please see the full update at Message-ID: <4o2bkgFclmdhU1@individual.net> So, I don't have to post on different Thread-Branches and can keep the information together. Thanks ;) >>>While the query is running, how much CPU is the process consuming, and >>>what does vmstat show for disk and swap I/O? >> >> ~80% CPU power. Disk usage is not noticable. > > Can you attach with a debugger and see exactly what's going on? If not > we'd need that self contained test case to reproduce the problem. Sorry, I'm currently not able to attach a debugger. I think I'm right that this means a recompilation of the package would be necessary. The only thing I was seeing with strace was that the Process is permanently accessing a file called: postgresql/8.1/main/base/1740468/pgsql_tmp/pgsql_tmp21938.0
One Addition: > PostgreSQL 8.1.4 (The one which is taking... a very long time...) > > Merge Join (cost=7751.81..50026810.45 rows=7364 width=4) > Merge Cond: ("outer".sid = "inner".sid) > -> Index Scan using stud_pkey on stud s (cost=7751.81..49994210.01 > rows=56607 width=4) > Filter: (NOT (subplan)) > SubPlan > -> Materialize (cost=7751.81..8497.94 rows=53613 width=4) > -> Seq Scan on stud_vera (cost=0.00..7488.20 > rows=53613 width=4) > Filter: (veraid = 2) > -> Index Scan using stud_vera_sid_veraid_idx on stud_vera v > (cost=0.00..37646.74 rows=14729 width=4) > Index Cond: (veraid = 34) The interesting thing is that (like I've written in the main post) if I remove the 'AND veraid=34' it working fine (0.3seconds). The explain for this working query is: Merge Join (cost=45500.73..46741955.20 rows=147288 width=4) Merge Cond: ("outer".sid = "inner".sid) -> Index Scan using stud_pkey on stud s (cost=7734.28..46701101.71 rows=56607 width=4) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=7734.28..8431.06 rows=50078 width=4) -> Seq Scan on stud_vera (cost=0.00..7488.20 rows=50078 width=4) Filter: (veraid = 2) -> Sort (cost=37766.46..38502.90 rows=294576 width=4) Sort Key: v.sid -> Seq Scan on stud_vera v (cost=0.00..6751.76 rows=294576 width=4)
Kai Hessing <kai.hessing@hobsons.de> writes: > PostgreSQL 8.1.4 (The one which is taking... a very long time...) > Filter: (NOT (subplan)) > SubPlan > -> Materialize (cost=7751.81..8497.94 rows=53613 width=4) > PostgreSQL 8.0.8 (The one which takes only a few seconds... on a much > slower system...) > Filter: (NOT (hashed subplan)) > SubPlan > -> Seq Scan on stud_vera (cost=0.00..7424.24 rows=1461 > width=4) > Filter: (veraid = 2) I bet the problem is that you're not getting a hashed subplan in 8.1. What do you have work_mem set to on the two systems? The fact that the rowcount estimate for the subplan is much larger may be causing the change, too. Why is that --- perhaps you haven't ANALYZEd stud_vera recently on one system or the other? How many rows will really come out of the sub-select (ie, what's "select count(*) from stud_vera where veraid = 2")? regards, tom lane
Tom Lane wrote: > I bet the problem is that you're not getting a hashed subplan in 8.1. > What do you have work_mem set to on the two systems? The fact that the > rowcount estimate for the subplan is much larger may be causing the > change, too. Why is that --- perhaps you haven't ANALYZEd stud_vera > recently on one system or the other? How many rows will really come out > of the sub-select (ie, what's "select count(*) from stud_vera > where veraid = 2")? HEUREKA... Increasing the work_mem had solved the problem. Result: 3170 rows in 1,487.927ms. Working_mem was set to the default of 1MB. Increasing it to 4 solved it. Are there any suggestions for the size. Maybe depending on database size? Many thanks! But I still wonder why it didn't work with the lower size of working_mem and the temporary file on the disk? I had a similar problem in the past where I forget to ANALYZE and before ANALYZEing also a temporary file was used. The time difference had been from ten seconds to five minutes and not from one second to over 40 hours. The temporary file is created and has a size of around 2MB and doesn't change size while working. I fear to run into the same problem, if I use more complex queries. And, the problem with the estimated rows is still very interesting. The Table is FULLy ANALYZEd and the actual count of stud_vera with verid=2 is 49176, so the 53000 are real close. Don't now, why it is so low on 8.0.8? One other interesting thing is that the query plan has changed after increasing the working mem: Hash Join (cost=12991.28..21472.83 rows=7512 width=4) Hash Cond: ("outer".sid = "inner".sid) -> Bitmap Heap Scan on stud_vera v (cost=1841.02..5834.80 rows=15023 width=4) Recheck Cond: (veraid = 34) -> Bitmap Index Scan on stud_vera_sid_veraid_idx (cost=0.00..1841.02 rows=15023 width=0) Index Cond: (veraid = 34) -> Hash (cost=11008.74..11008.74 rows=56607 width=4) -> Seq Scan on stud s (cost=7617.57..11008.74 rows=56607 width=4) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on stud_vera (cost=0.00..7488.20 rows=51747 width=4) Filter: (veraid = 2) *greets* Kai