Thread: PostgreSQL struggling during high load
Hello,

We have problems with one PostgreSQL database with a high data change
rate, and we are already under pressure to switch from PostgreSQL to
Oracle. I cannot post the schema and queries to the list, but I can do
so privately.

The tables are not big (20000-150000 rows each) but have a very high
turnover rate - 100+ updates/inserts/deletes/selects per second - so
the contents of the database change very fast. The problem is that
when pg_autovacuum vacuums, those changes slow down too much. We keep
autovacuum quite aggressive (-v 1000 -V 0.5 -a 1000 -A 0.1 -s 10) to
avoid bloating the database and an even bigger impact later. Analyze
does not seem to impact performance much.

The tables have 2-3 indexes each, and one table has a foreign key
constraint. PostgreSQL is 8.0.1. vmstat shows that neither I/O nor CPU
is saturated. The DB is on RAID 1+0 on a controller with battery-backed
write cache.

What can we tune to improve performance in our case? Please help defend
PostgreSQL against Oracle here :).
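For reference, the contrib/pg_autovacuum switches above mean roughly
the following (paraphrased from memory, so check the README):

  -v 1000   vacuum base threshold, in rows
  -V 0.5    vacuum scale factor (fraction of table size added to the base)
  -a 1000   analyze base threshold, in rows
  -A 0.1    analyze scale factor
  -s 10     base sleep time between loops, in seconds

Thanks,

Mindaugas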
On Fri, May 13, 2005 at 03:52:38PM +0300, Mindaugas Riauba wrote:
> The tables are not big (20000-150000 rows each) but have a very high
> turnover rate - 100+ updates/inserts/deletes/selects per second - so
> the contents of the database change very fast. The problem is that
> when pg_autovacuum vacuums, those changes slow down too much. We keep
> autovacuum quite aggressive (-v 1000 -V 0.5 -a 1000 -A 0.1 -s 10) to
> avoid bloating the database and an even bigger impact later. Analyze
> does not seem to impact performance much.

Are you using the bgwriter? See
http://developer.postgresql.org/~wieck/vacuum_cost/ for details.

/* Steinar */
--
Homepage: http://www.sesse.net/
"Mindaugas Riauba" <mind@bi.lt> writes: > ... So contents of database changes very fast. Problem is that when > pg_autovacuum does vacuum those changes slows down too much. The "vacuum cost" parameters can be adjusted to make vacuums fired by pg_autovacuum less of a burden. I haven't got any specific numbers to suggest, but perhaps someone else does. regards, tom lane
> > ... so the contents of the database change very fast. The problem
> > is that when pg_autovacuum vacuums, those changes slow down too
> > much.
>
> The "vacuum cost" parameters can be adjusted to make vacuums fired by
> pg_autovacuum less of a burden. I haven't got any specific numbers to
> suggest, but perhaps someone else does.

It looks like vacuum is not the only cause of our problems. vacuum_cost
seems to lower vacuum's impact, but we are still noticing "storms" of
slow queries. We are logging queries that take >2000ms to process, and
there are quiet periods and then suddenly 30+ slow queries appear in
the log within the same second. What else could cause such behaviour?
A WAL log switch? One WAL file seems to last <1 minute.

Also, for the slow queries only the function call is shown in the log.
Is it possible to get the exact query inside the function that slows
everything down logged in the serverlog?

Thanks,

Mindaugas
"Mindaugas Riauba" <mind@bi.lt> writes: > It looks like that not only vacuum causes our problems. vacuum_cost > seems to lower vacuum impact but we are still noticing slow queries "storm". > We are logging queries that takes >2000ms to process. > And there is quiet periods and then suddenly 30+ slow queries appears in > log within the same second. What else could cause such behaviour? Checkpoints? You should ensure that the checkpoint settings are such that checkpoints don't happen too often (certainly not oftener than every five minutes or so), and make sure the bgwriter is configured to dribble out dirty pages at a reasonable rate, so that the next checkpoint doesn't have a whole load of stuff to write. regards, tom lane
> > It looks like vacuum is not the only cause of our problems.
> > vacuum_cost seems to lower vacuum's impact, but we are still
> > noticing "storms" of slow queries. We are logging queries that take
> > >2000ms to process, and there are quiet periods and then suddenly
> > 30+ slow queries appear in the log within the same second. What
> > else could cause such behaviour?
>
> Checkpoints? You should ensure that the checkpoint settings are such
> that checkpoints don't happen too often (certainly not oftener than
> every five minutes or so), and make sure the bgwriter is configured
> to dribble out dirty pages at a reasonable rate, so that the next
> checkpoint doesn't have a whole load of stuff to write.

The bgwriter settings are the defaults: bgwriter_delay=200,
bgwriter_maxpages=100, bgwriter_percent=1. checkpoint_segments=8,
checkpoint_timeout=300, checkpoint_warning=30. But there are no
checkpoint warnings in the serverlog.

And btw, we are running with fsync=off (yes, I know the consequences).
From the disk's point of view the database is practically write-only,
since the amount of data is smaller than the memory available.

I also added some 'vmstat 1' output below. How can we get a more even
load? As you can see, neither the disk nor the CPU looks too busy.

Thanks,

Mindaugas

procs                      memory      swap          io     system         cpu
 r  b  w   swpd   free   buff   cache   si   so    bi     bo    in    cs  us  sy  id
 1  0  0 194724  12140  10220 1045356    0    1    33     24    60    20  13   3  83
 2  0  0 194724  11988  10228 1045464    0    0    12      0  1147  6107  13   4  82
 0  2  0 194724  12172  10284 1046076    0    0   244  20692  2067  3117   8   8  84
 1  0  0 194724  12164  10280 1045912    0    0     0      4   876  8831  15  11  74
 3  0  0 194724  11704  10328 1045952    0    0    24   2116   928  5122  13  12  75
 1  0  0 194724  11444  10236 1046264    0    0   340      0  1048  6538  19  10  71
 1  0  0 194724  11924  10236 1045816    0    0     0      0   885  7616  14  20  66
 0  0  0 194724  11408  10252 1044824    0    0    28   5488   959  4749  11  14  75
 1  0  0 194724  11736  10296 1042992    0    0   460   2868  1001  4116  12  12  75
 0  0  0 194724  12024  10296 1043064    0    0    36      0   903  5081  13  12  76
 1  0  0 194724  12404  10240 1043440    0    0   280      0   899  4246  12  12  75
 1  0  0 194724  13128  10236 1043472    0    0     0      0  1016  5394  12  10  78
 0  4  0 194724  13064  10244 1043652    0    0     0  14736  1882  9290  10  15  74
 0  4  0 194724  13056  10252 1043660    0    0     0   6012  1355  2378   2   3  95
12 21  0 194724  13140  10220 1043640    0    0     8      4   723  2984   5   3  92
 1  0  0 194724  13712  10228 1043956    0    0   200      0  1144 10310  30  21  50
 0  0  0 194724  13100  10220 1043992    0    0     4      0   840  4676  15  14  71
 0  0  0 194724  13048  10296 1041212    0    0     4   6132   918  4074  10  10  80
 0  0  0 194724  12688  10324 1041508    0    0   240   1864   849  3873  12  11  77
 2  0  0 194724  12544  10240 1041944    0    0    32      0  1171  4844  14   7  78
 1  0  0 194724  12384  10232 1041756    0    0     4      0   973  6063  16   9  75
 1  0  0 194724  12904  10244 1042116    0    0   264   6052  1049  4762  15  14  71
 0  0  0 194724  12616  10236 1042164    0    0     8      0   883  4748  13   8  79
 2  0  0 194724  12576  10288 1042460    0    0   252   3136   857  3929  13  15  73
 2  0  0 194724  12156  10284 1042504    0    0     0      0   858  8832  13   6  81
 2  0  0 194724  12024  10284 1042556    0    0     0      0   834  4229  16  10  74
 3  1  0 194724  12024  10364 1043096    0    0   316  10328  1024  5686  14   7  80
 0  5  0 194724  12024  10352 1043116    0    0     4   7996  2156  2816   4   5  90
 0  4  0 194724  12024  10360 1043124    0    0     4   8560  1369  2700   6   5  90
 3  0  0 194724  12024  10264 1043124    0    0     0      4  1037  5132  14  15  71
 1  1  0 194724  11876  10264 1043176    0    0     4      0   932  7761  20  20   6
Mindaugas Riauba wrote:
>> The "vacuum cost" parameters can be adjusted to make vacuums fired by
>> pg_autovacuum less of a burden. I haven't got any specific numbers to
>> suggest, but perhaps someone else does.
>
> It looks like vacuum is not the only cause of our problems.
> vacuum_cost seems to lower vacuum's impact, but we are still noticing
> "storms" of slow queries. We are logging queries that take >2000ms to
> process, and there are quiet periods and then suddenly 30+ slow
> queries appear in the log within the same second. What else could
> cause such behaviour?

I've seen that happen when you're placing (explicitly or *implicitly*)
locks on the records you're trying to update/delete.

If you're willing to investigate, the `pg_locks' system view holds
information about db locks.
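For instance, something along these lines (a sketch written from
memory, untested) should show any backend waiting on a lock it has not
been granted:

  SELECT c.relname, l.pid, l.mode, l.granted
    FROM pg_locks l
    LEFT JOIN pg_class c ON c.oid = l.relation
   WHERE NOT l.granted;

--
Cosimo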
> >>The "vacuum cost" parameters can be adjusted to make vacuums fired > >>by pg_autovacuum less of a burden. I haven't got any specific numbers > >>to suggest, but perhaps someone else does. > > > > It looks like that not only vacuum causes our problems. vacuum_cost > > seems to lower vacuum impact but we are still noticing slow queries "storm". > > We are logging queries that takes >2000ms to process. > > And there is quiet periods and then suddenly 30+ slow queries appears in > > log within the same second. What else could cause such behaviour? > > I've seen that happen when you're placing (explicitly or > *implicitly*) locks on the records you're trying to update/delete. > > If you're willing to investigate, `pg_locks' system view holds > information about db locks. Hm. Yes. Number of locks varies quite alot (10-600). Now what to investigate further? We do not use explicit locks in our functions. We use quite simple update/delete where key=something; Some sample (select * from pg_locks order by pid) is below. Thanks, Mindaugas | | 584302172 | 11836 | ExclusiveLock | t 17236 | 17230 | | 11836 | AccessShareLock | t 17236 | 17230 | | 11836 | RowExclusiveLock | t 127103 | 17230 | | 11836 | RowExclusiveLock | t 127106 | 17230 | | 11836 | RowExclusiveLock | t 127109 | 17230 | | 11836 | AccessShareLock | t 127109 | 17230 | | 11836 | RowExclusiveLock | t 127109 | 17230 | | 11837 | AccessShareLock | t 127109 | 17230 | | 11837 | RowExclusiveLock | t 17236 | 17230 | | 11837 | AccessShareLock | t 17236 | 17230 | | 11837 | RowExclusiveLock | t 127106 | 17230 | | 11837 | RowExclusiveLock | t 127103 | 17230 | | 11837 | RowExclusiveLock | t | | 584302173 | 11837 | ExclusiveLock | t 127103 | 17230 | | 11838 | RowExclusiveLock | t 17236 | 17230 | | 11838 | RowExclusiveLock | t 127109 | 17230 | | 11838 | RowExclusiveLock | t | | 584302174 | 11838 | ExclusiveLock | t 17285 | 17230 | | 11838 | AccessShareLock | t 17251 | 17230 | | 11838 | AccessShareLock | t 130516 | 17230 | | 11838 | AccessShareLock | t 127106 | 17230 | | 11838 | RowExclusiveLock | t 17278 | 17230 | | 11838 | AccessShareLock | t
On Fri, May 13, 2005 at 05:45:45PM +0300, Mindaugas Riauba wrote:
> But there are no checkpoint warnings in the serverlog. And btw, we
> are running with fsync=off (yes, I know the consequences).

Just a note here; since you have a battery-backed hardware cache, you
probably won't notice much of a slowdown with fsync=on. However, you
are already pushed, so... :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
"Mindaugas Riauba" <mind@bi.lt> writes: > Hm. Yes. Number of locks varies quite alot (10-600). Now what to > investigate > further? We do not use explicit locks in our functions. We use quite simple > update/delete where key=something; > Some sample (select * from pg_locks order by pid) is below. The sample doesn't show any lock issues (there are no processes waiting for ungranted locks). The thing that typically burns people is foreign key conflicts. In current releases, if you have a foreign key reference then an insert in the referencing table takes an exclusive row lock on the referenced (master) row --- which means that two inserts using the same foreign key value block each other. You can alleviate the issue by making all your foreign key checks deferred, but that just shortens the period of time the lock is held. There will be a real solution in PG 8.1, which has sharable row locks. regards, tom lane
Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> "Mindaugas Riauba" <mind@bi.lt> writes:
> > ... so the contents of the database change very fast. The problem
> > is that when pg_autovacuum vacuums, those changes slow down too
> > much.
>
> The "vacuum cost" parameters can be adjusted to make vacuums fired by
> pg_autovacuum less of a burden. I haven't got any specific numbers to
> suggest, but perhaps someone else does.

I solved one problem by cranking the sleep scaling up to -S 20. It made
pg_autovacuum back off longer during extended periods of heavy
disk-intensive query activity. Our update activity is a near-constant
insert rate, then once or twice a day, massive deletes.
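With the original poster's thresholds, the invocation would look
something like this (a sketch; keep whatever other switches you already
pass):

  pg_autovacuum -v 1000 -V 0.5 -a 1000 -A 0.1 -s 10 -S 20

--
Dreams come true, not free.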
Mindaugas Riauba wrote:
>> The "vacuum cost" parameters can be adjusted to make vacuums fired by
>> pg_autovacuum less of a burden. I haven't got any specific numbers to
>> suggest, but perhaps someone else does.
>
> It looks like vacuum is not the only cause of our problems.
> vacuum_cost seems to lower vacuum's impact, but we are still noticing
> "storms" of slow queries. We are logging queries that take >2000ms to
> process, and there are quiet periods and then suddenly 30+ slow
> queries appear in the log within the same second. What else could
> cause such behaviour? A WAL log switch? One WAL file seems to last
> <1 minute.

How long are these quiet periods? Do the "storm" periods correspond to
pg_autovacuum loops? I have heard from one person who had LOTS of
databases and tables that caused pg_autovacuum to create a noticeable
load just updating all its stats. The solution in that case was to add
a small delay inside the inner pg_autovacuum loop.
Actually, that solution didn't work so well. Even very small delays in
the loop caused the entire loop to perform too slowly to be useful in
the production environment. I ended up producing a small patch out of
it :P, but we ended up using pgpool to reduce connections from another
part of the app, which made the pg_autovacuum spikes less troublesome
overall.

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC

Strategic Open Source: Open Your i™

http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

On May 15, 2005, at 9:26 PM, Matthew T. O'Connor wrote:

> Mindaugas Riauba wrote:
>>> The "vacuum cost" parameters can be adjusted to make vacuums fired
>>> by pg_autovacuum less of a burden. I haven't got any specific
>>> numbers to suggest, but perhaps someone else does.
>>
>> It looks like vacuum is not the only cause of our problems.
>> vacuum_cost seems to lower vacuum's impact, but we are still noticing
>> "storms" of slow queries. We are logging queries that take >2000ms
>> to process, and there are quiet periods and then suddenly 30+ slow
>> queries appear in the log within the same second. What else could
>> cause such behaviour? A WAL log switch? One WAL file seems to last
>> <1 minute.
>
> How long are these quiet periods? Do the "storm" periods correspond
> to pg_autovacuum loops? I have heard from one person who had LOTS of
> databases and tables that caused pg_autovacuum to create a noticeable
> load just updating all its stats. The solution in that case was to
> add a small delay inside the inner pg_autovacuum loop.
> > Hm. Yes. The number of locks varies quite a lot (10-600). Now what
> > to investigate further? We do not use explicit locks in our
> > functions; we use quite simple "update/delete ... where
> > key=something" statements. A sample (select * from pg_locks order
> > by pid) is below.
>
> The sample doesn't show any lock issues (there are no processes
> waiting for ungranted locks). The thing that typically burns people
> is foreign key conflicts. In current releases, if you have a foreign
> key reference then an insert in the referencing table takes an
> exclusive row lock on the referenced (master) row --- which means
> that two inserts using the same foreign key value block each other.
>
> You can alleviate the issue by making all your foreign key checks
> deferred, but that just shortens the period of time the lock is held.
> There will be a real solution in PG 8.1, which has sharable row locks.

In that case our foreign key constraint should not be an issue, since
it is on msg_id, which is pretty much unique among concurrent
transactions.

I also noticed that the "storms" happen along with higher write
activity. If bo in vmstat shows 25+ MB within 2s, then most likely I
will get a "storm" of slow queries in the serverlog. How can we even
out the write activity? fsync=off, and the bgwriter settings are the
defaults.

Also, I do not know if it matters, but the PG database is on an ext3
partition with the data=journal option.

And is it possible to log which query in a function takes the longest
time to complete?
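In the meantime, one low-tech thing we could try ourselves is timing
the steps inside the function with RAISE NOTICE. A sketch only - the
function and table names below are made up, and the UPDATE is a
placeholder for the real work:

  CREATE OR REPLACE FUNCTION process_msg(integer) RETURNS void AS $$
  DECLARE
      t0 timestamptz;
  BEGIN
      t0 := timeofday()::timestamptz;  -- wall clock; now() is frozen per tx
      UPDATE messages SET status = 1 WHERE msg_id = $1;
      RAISE NOTICE 'update took %', timeofday()::timestamptz - t0;
      RETURN;
  END;
  $$ LANGUAGE plpgsql;

Thanks,

Mindaugas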
Tom,

Thanks for the post - I think I am getting this problem for a synthetic
workload at high connection loads. The whole system seems to stop.

Can you give some examples of what to try out in the .conf file?

I tried

  bgwriter_all_percent = 30, 10, and 3

which I understand to mean 30%, 10% and 3% of the dirty pages should be
written out *between* checkpoints. I didn't see any change in effect.

/regards
Don C.

Tom Lane wrote:

> "Mindaugas Riauba" <mind@bi.lt> writes:
>> It looks like vacuum is not the only cause of our problems.
>> vacuum_cost seems to lower vacuum's impact, but we are still noticing
>> "storms" of slow queries. We are logging queries that take >2000ms
>> to process, and there are quiet periods and then suddenly 30+ slow
>> queries appear in the log within the same second. What else could
>> cause such behaviour?
>
> Checkpoints? You should ensure that the checkpoint settings are such
> that checkpoints don't happen too often (certainly not oftener than
> every five minutes or so), and make sure the bgwriter is configured
> to dribble out dirty pages at a reasonable rate, so that the next
> checkpoint doesn't have a whole load of stuff to write.
>
>	regards, tom lane
What platform is this?

We had a similar issue (PG 7.4.7). Raising the number of checkpoint
segments to 125 and separating the WAL onto a different LUN helped, but
it's still not completely gone.

As far as disk I/O for flushing the buffers out is concerned, I am not
ruling out the combination of the Dell PERC4 RAID card and RH AS 3.0
Update 3 being a problem.

Thanks,
Anjan

-----Original Message-----
From: Donald Courtney [mailto:Donald.Courtney@Sun.COM]
Sent: Thu 5/19/2005 12:54 PM
To: Tom Lane
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] PostgreSQL struggling during high load

Tom,

Thanks for the post - I think I am getting this problem for a synthetic
workload at high connection loads. The whole system seems to stop.

Can you give some examples of what to try out in the .conf file?

I tried

  bgwriter_all_percent = 30, 10, and 3

which I understand to mean 30%, 10% and 3% of the dirty pages should be
written out *between* checkpoints. I didn't see any change in effect.

/regards
Don C.

[...]
Anjan Dave wrote:

> What platform is this?

It's a Dell running RH 4, with the xlog on a separate externally
mounted file system. The data directory is on an externally mounted
file system as well.

> We had a similar issue (PG 7.4.7). Raising the number of checkpoint
> segments to 125 and separating the WAL onto a different LUN helped,
> but it's still not completely gone.

I'll try raising the number. I guess the bg* config variables don't do
much?

thanks

> As far as disk I/O for flushing the buffers out is concerned, I am
> not ruling out the combination of the Dell PERC4 RAID card and RH AS
> 3.0 Update 3 being a problem.
>
> Thanks,
> Anjan
>
> [...]
On May 19, 2005, at 2:12 PM, Anjan Dave wrote:

> As far as disk I/O for flushing the buffers out is concerned, I am
> not ruling out the combination of the Dell PERC4 RAID card ...

That'd be my first guess as to I/O speed issues. I have some Dell
hardware that by all means should be totally blowing out my other boxes
in speed, but the I/O sucks out the wazoo. I'm migrating to
Opteron-based DB servers with LSI-branded cards (not the Dell
re-branded ones).

Vivek Khera, Ph.D.
+1-301-869-4449 x806
Anjan,

> As far as disk I/O for flushing the buffers out is concerned, I am
> not ruling out the combination of the Dell PERC4 RAID card and RH AS
> 3.0 Update 3 being a problem.

You know that Update 4 is out, yes? Update 3 is currently throttling
your I/O by about 50%.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus wrote:

> Anjan,
>
>> As far as disk I/O for flushing the buffers out is concerned, I am
>> not ruling out the combination of the Dell PERC4 RAID card and RH AS
>> 3.0 Update 3 being a problem.
>
> You know that Update 4 is out, yes? Update 3 is currently throttling
> your I/O by about 50%.

Is that 50% just for the Dell PERC4 RAID on RH AS 3.0? That sounds like
severe context switching.

Steve Poe
Yes, I am using it for another DB/application. A few more days and I'll
have a free hand on this box as well.

Thanks,
Anjan

-----Original Message-----
From: Josh Berkus [mailto:josh@agliodbs.com]
Sent: Thursday, May 19, 2005 3:58 PM
To: Anjan Dave
Cc: Donald Courtney; Tom Lane; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] PostgreSQL struggling during high load

Anjan,

> As far as disk I/O for flushing the buffers out is concerned, I am
> not ruling out the combination of the Dell PERC4 RAID card and RH AS
> 3.0 Update 3 being a problem.

You know that Update 4 is out, yes? Update 3 is currently throttling
your I/O by about 50%.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco