Thread: ANALYZE getting dead tuple count hopelessly wrong
I have a table with about 15 million rows which is constantly having tuples added to the head and deleted in blocks from the tail to maintain the size. The dead tuple count in pg_stat_user_tables tracks the deleted rows fairly accurately until an auto-ANALYZE is done in the background, at which point the value it calculates is wrong by a factor of 2-3 (the calculated value is 30-50% of the correct value), which completely throws off the auto-VACUUMing. In one case the auto-VACUUM only ran once there were 12 million (real) dead rows!

Any ideas?

Thanks
 Stuart

PS. Running 8.3.1 on NetBSD 3.
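For reference, 8.3's autovacuum launches a VACUUM on a table once its dead tuple count exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. With the default settings of 50 and 0.2, that works out to roughly 50 + 0.2 * 15,000,000 ≈ 3 million dead rows for a 15-million-row table. A VACUUM arriving only at 12 million real dead rows therefore suggests the stats system was seeing about a quarter of the true figure, consistent with the 30-50% underestimate described above.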
On Mon, Mar 31, 2008 at 1:33 PM, Stuart Brooks <stuartb@cat.co.za> wrote:
> I have a table with about 15 million rows which is constantly having
> tuples added to the head and deleted in blocks from the tail to maintain
> the size. The dead tuple count in pg_stat_user_tables tracks the deleted
> rows fairly accurately until an auto-ANALYZE is done in the background,
> at which point the value it calculates is wrong by a factor of 2-3
> (the calculated value is 30-50% of the correct value)

(copying -hackers)

Seems like the redirected-dead line pointers are playing spoilsport here. In this particular example, the deleted tuples may get truncated to redirected-dead line pointers. ANALYZE would report them as empty slots and not as dead tuples. So in the worst case, if all the deleted tuples have already been truncated to redirected-dead line pointers, ANALYZE may report a dead tuple count of zero.

This is a slightly tricky situation. In the normal case we might want to delay autovacuum to let subsequent UPDATEs in the page reuse the space released by the deleted tuples. But in this particular example, delaying autovacuum is not a good thing, because the relation would just keep growing.

I think we should check for redirected-dead line pointers in analyze.c and report them as dead tuples. The other, longer-term alternative could be to track redirected-dead line pointers and give them some weight when deciding on autovacuum. We could also update the FSM information of a page when it's pruned/defragged, so that the page can be used for subsequent INSERTs or non-HOT UPDATEs in other pages. This might be easier said than done.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB
http://www.enterprisedb.com
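For anyone not steeped in the HOT code, the relevant distinction is among the four line pointer states a heap page slot can be in after pruning. A rough sketch against the 8.3 macros in src/include/storage/itemid.h and storage/bufpage.h (count_lp_states is a made-up helper for illustration, not anything in the tree):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

/*
 * Illustrative only: classify the line pointers on one heap page.
 * After heap_page_prune(), a deleted tuple's storage is reclaimed and
 * its slot becomes LP_DEAD -- there is no tuple header left to look at,
 * but the slot cannot be reused until VACUUM removes the index entries
 * still pointing at it.
 */
static void
count_lp_states(Page page, int *normal, int *redirect, int *dead, int *unused)
{
	OffsetNumber off,
				maxoff = PageGetMaxOffsetNumber(page);

	*normal = *redirect = *dead = *unused = 0;
	for (off = FirstOffsetNumber; off <= maxoff; off++)
	{
		ItemId		itemid = PageGetItemId(page, off);

		if (ItemIdIsNormal(itemid))
			(*normal)++;		/* LP_NORMAL: tuple storage present */
		else if (ItemIdIsRedirected(itemid))
			(*redirect)++;		/* LP_REDIRECT: head of a pruned HOT chain */
		else if (ItemIdIsDead(itemid))
			(*dead)++;			/* LP_DEAD: storage gone, index refs remain */
		else
			(*unused)++;		/* LP_UNUSED: genuinely free */
	}
}

The point is that an LP_DEAD slot carries no tuple header at all, so a scan that only fetches tuples cannot tell it from a free slot, even though VACUUM still owes it index cleanup.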
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > Seems like the redirected-dead line pointers are playing spoil-sport here. > In this particular example, the deleted tuples may get truncated to > redirected-dead line pointers. Analyze would report them as empty > slots and not as dead tuples. So in the worst case, if all the deleted > tuples are already truncated to redirected-dead line pointers, analyze > may report "zero" dead tuple count. [ Please see if you can stop using the "redirected dead" terminology ] Yeah, I think I agree. The page pruning code is set up so that changing a line pointer to DEAD state doesn't change the count of dead tuples in the table, so we are counting unreclaimed DEAD pointers as still being dead tuples requiring VACUUM. ANALYZE should surely not affect that. It looks like there's no trivial way to get ANALYZE to do things that way, though. heap_release_fetch() doesn't distinguish a DEAD line pointer from an unused or redirected one. But in the current implementation of ANALYZE there's really no benefit to using heap_release_fetch anyway --- it always examines all line pointers on each selected page, so we might as well rewrite it to use a simple loop more like vacuum uses. I notice that this'd leave heap_release_fetch completely unused... at least in HEAD I'd be tempted to get rid of it and restore heap_fetch to its former simplicity. regards, tom lane
On Mon, Mar 31, 2008 at 9:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> [ Please see if you can stop using the "redirected dead" terminology ]

Apologies, will keep that in mind. Seems like a hang-over from the past :-)

> Yeah, I think I agree. The page pruning code is set up so that changing
> a line pointer to DEAD state doesn't change the count of dead tuples in
> the table, so we are counting unreclaimed DEAD pointers as still being
> dead tuples requiring VACUUM. ANALYZE should surely not affect that.
>
> It looks like there's no trivial way to get ANALYZE to do things that
> way, though. heap_release_fetch() doesn't distinguish a DEAD line
> pointer from an unused or redirected one. But in the current
> implementation of ANALYZE there's really no benefit to using
> heap_release_fetch anyway --- it always examines all line pointers
> on each selected page, so we might as well rewrite it to use a simple
> loop more like vacuum uses.

I agree. I'll write a patch along these lines, unless you are already on to it.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB
http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On Mon, Mar 31, 2008 at 9:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It looks like there's no trivial way to get ANALYZE to do things that >> way, though. heap_release_fetch() doesn't distinguish a DEAD line >> pointer from an unused or redirected one. But in the current >> implementation of ANALYZE there's really no benefit to using >> heap_release_fetch anyway --- it always examines all line pointers >> on each selected page, so we might as well rewrite it to use a simple >> loop more like vacuum uses. > I agree. I would write a patch on these lines, unless you are already on to it. Please do --- I have a lot of other stuff on my plate. regards, tom lane
On Tue, Apr 1, 2008 at 1:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Please do --- I have a lot of other stuff on my plate.

Please see the attached patch. One change I made is to hold the SHARE lock on the page while ANALYZE is reading tuples from it. I thought it would be the right thing to do instead of repeatedly acquiring/releasing the lock.

Another thing I noticed while working on this is that VACUUM probably reports the number of dead tuples incorrectly. We don't count DEAD line pointers in "tups_vacuumed", which is fine if the line pointer was marked DEAD by the immediately preceding heap_page_prune(); in that case the DEAD line pointer is counted in the "ndeleted" count returned by heap_page_prune(). But it fails to count line pointers that were already DEAD. For example:

postgres=# CREATE TABLE test (a int, b char(500));
CREATE TABLE
postgres=# INSERT INTO test VALUES (generate_series(1,15),'foo');
INSERT 0 15
postgres=# DELETE FROM test;
DELETE 15
postgres=# select count(*) from test;
 count
-------
     0
(1 row)

postgres=# VACUUM VERBOSE test;
INFO:  vacuuming "public.test"
INFO:  "test": removed 0 row versions in 1 pages
INFO:  "test": found 0 removable, 0 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
1 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "test": truncated 1 to 0 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

So VACUUM reports zero dead row versions, which may seem counter-intuitive, especially in the autovac log message (someone may wonder why autovac got triggered on the table).

I am thinking we can make heap_page_prune() return only the number of HOT tuples pruned, and then explicitly count the DEAD line pointers in tups_vacuumed.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB
http://www.enterprisedb.com
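Concretely, the spot in question is the per-page scan in lazy_scan_heap() in vacuumlazy.c, which already records an existing DEAD line pointer for the index-cleanup pass but doesn't count it. A rough sketch of the proposed accounting change (illustrative only, not the eventual patch):

		if (ItemIdIsDead(itemid))
		{
			/*
			 * An already-DEAD line pointer: no tuple storage is left,
			 * but its index entries still have to be removed, so count
			 * it in tups_vacuumed as well as recording it for the index
			 * cleanup pass.  heap_page_prune() would then return only
			 * the number of HOT tuples it pruned, to avoid counting the
			 * same slot twice.
			 */
			ItemPointerSet(&(tuple.t_self), blkno, offnum);
			lazy_record_dead_tuple(vacrelstats, &(tuple.t_self));
			tups_vacuumed += 1;
			continue;
		}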
On Tue, 2008-04-01 at 13:07 +0530, Pavan Deolasee wrote:
> Please see the attached patch. One change I made is to hold the SHARE lock
> on the page while ANALYZE is reading tuples from it. I thought it would
> be the right thing to do instead of repeatedly acquiring/releasing the lock.

ANALYZE is a secondary task and so we shouldn't be speeding it up at the possible expense of other primary tasks. So I think holding locks for longer than minimum is not good in this case and I see no reason to make the change described.

We can speed up ANALYZE by using the background reader to preread the blocks, assuming bgreader will happen one day.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
>> Please do --- I have a lot of other stuff on my plate.
>
> Please see the attached patch. One change I made is to hold the SHARE lock
> on the page while ANALYZE is reading tuples from it. I thought it would
> be the right thing to do instead of repeatedly acquiring/releasing the lock.

I have applied the patch and started my test again; it takes a little while to fill up, so I should have the results sometime tomorrow. Thanks for the quick response.

Stuart
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2008-04-01 at 13:07 +0530, Pavan Deolasee wrote:
>> Please see the attached patch. One change I made is to hold the SHARE lock
>> on the page while ANALYZE is reading tuples from it. I thought it would
>> be the right thing to do instead of repeatedly acquiring/releasing the lock.

> ANALYZE is a secondary task and so we shouldn't be speeding it up at the
> possible expense of other primary tasks. So I think holding locks for
> longer than minimum is not good in this case and I see no reason to make
> the change described.

I think Pavan's change is probably good. In the first place, it's only a shared buffer lock, and besides, ANALYZE isn't going to be holding it long (all it's doing at this point is counting tuples and copying some of them into memory). In the second place, repeated lock release and re-grab threatens cache line contention and a context swap storm if there is anyone else trying to access the page. In the third, whether there's contention or not, the extra acquire/release work will cost CPU cycles. In the fourth, if we actually believed this was a problem we'd need to redesign VACUUM too, as it does the same thing.

I haven't read the patch yet, but I'm inclined to go with the design Pavan suggests.

			regards, tom lane
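To spell out the locking difference being argued over: the patch takes the buffer lock once per sampled page rather than once per examined tuple, i.e. roughly this (a sketch; examine_slot is a hypothetical stand-in for the per-line-pointer work, not a real function):

	/* per-page locking, as in the patch */
	LockBuffer(buf, BUFFER_LOCK_SHARE);
	for (off = FirstOffsetNumber; off <= maxoff; off++)
		examine_slot(page, off);	/* hypothetical helper */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);

	/* versus the heap_release_fetch() style: re-lock for every tuple */
	for (off = FirstOffsetNumber; off <= maxoff; off++)
	{
		LockBuffer(buf, BUFFER_LOCK_SHARE);
		examine_slot(page, off);
		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
	}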
On Tue, 2008-04-01 at 10:22 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>> On Tue, 2008-04-01 at 13:07 +0530, Pavan Deolasee wrote:
>>> Please see the attached patch. One change I made is to hold the SHARE lock
>>> on the page while ANALYZE is reading tuples from it. I thought it would
>>> be the right thing to do instead of repeatedly acquiring/releasing the lock.
>
>> ANALYZE is a secondary task and so we shouldn't be speeding it up at the
>> possible expense of other primary tasks. So I think holding locks for
>> longer than minimum is not good in this case and I see no reason to make
>> the change described.
>
> I think Pavan's change is probably good. In the first place, it's only
> a shared buffer lock, and besides, ANALYZE isn't going to be holding it
> long (all it's doing at this point is counting tuples and copying some
> of them into memory). In the second place, repeated lock release and
> re-grab threatens cache line contention and a context swap storm if
> there is anyone else trying to access the page. In the third, whether
> there's contention or not, the extra acquire/release work will cost CPU
> cycles. In the fourth, if we actually believed this was a problem we'd
> need to redesign VACUUM too, as it does the same thing.

VACUUM waits until nobody else has the buffer pinned, so lock contention is much less of a consideration there. Plus it rearranges the block, which is hard to do one tuple at a time even if we wanted to.

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2008-04-01 at 10:22 -0400, Tom Lane wrote:
>> In the fourth, if we actually believed this was a problem we'd
>> need to redesign VACUUM too, as it does the same thing.

> VACUUM waits until nobody else has the buffer pinned, so lock contention
> is much less of a consideration there. Plus it rearranges the block,
> which is hard to do one tuple at a time even if we wanted to.

That's the second scan. The first scan acts exactly like Pavan is proposing for ANALYZE.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Simon Riggs <simon@2ndquadrant.com> writes: >> On Tue, 2008-04-01 at 13:07 +0530, Pavan Deolasee wrote: >>> Please see the attached patch. One change I made is to hold the SHARE lock >>> on the page while ANALYZE is reading tuples from it. I thought it would >>> be a right thing to do instead of repeatedly acquiring/releasing the lock. > >> ANALYZE is a secondary task and so we shouldn't be speeding it up at the >> possible expense of other primary tasks. So I think holding locks for >> longer than minimum is not good in this case and I see no reason to make >> the change described. > > I think Pavan's change is probably good. In the first place, it's only > a shared buffer lock and besides ANALYZE isn't going to be holding it > long (all it's doing at this point is counting tuples and copying some > of them into memory). In the second place, repeated lock release and > re-grab threatens cache line contention and a context swap storm if > there is anyone else trying to access the page. In the third, whether > there's contention or not the extra acquire/release work will cost CPU > cycles. In the fourth, if we actually believed this was a problem we'd > need to redesign VACUUM too, as it does the same thing. I'm not sure all those arguments are valid (at least the first two seem contradictory). However I'm skeptical about Simon's premise. It's not clear any changes to ANALYZE here are at the expense of other proceses. Any cycles saved in ANALYZE are available for those other processes after all... -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > Please see the attached patch. One change I made is to hold the SHARE lock > on the page while ANALYZE is reading tuples from it. I thought it would > be a right thing to do instead of repeatedly acquiring/releasing the lock. I've applied a modified/extended form of this patch for 8.3.2. regards, tom lane
On Thu, Apr 3, 2008 at 10:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I've applied a modified/extended form of this patch for 8.3.2.

Thanks. I had another concern about VACUUM not reporting DEAD line pointers (please see upthread). Any comments on that?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB
http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > Thanks. I had another concern about VACUUM not reporting DEAD line > pointers (please see up thread). Any comments on that ? If you want to work on that, go ahead, but I wanted it separate because I didn't think it merited back-patching. It's strictly cosmetic in terms of being about what VACUUM VERBOSE prints, no? regards, tom lane
On Thu, Apr 3, 2008 at 10:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
>
>> Thanks. I had another concern about VACUUM not reporting DEAD line
>> pointers (please see upthread). Any comments on that?
>
> If you want to work on that, go ahead

Ok, I will do that.

> but I wanted it separate because
> I didn't think it merited back-patching. It's strictly cosmetic in
> terms of being about what VACUUM VERBOSE prints, no?

Umm... whatever we decide on for the fix, I think we should backpatch it to 8.3, because I am worried that someone may get completely confused by the current vacuum report, especially if autovac is triggered by a heap full of DEAD line pointers. The number of dead rows reported may be awfully low in that case.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB
http://www.enterprisedb.com
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > On Thu, Apr 3, 2008 at 10:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I didn't think it merited back-patching. It's strictly cosmetic in >> terms of being about what VACUUM VERBOSE prints, no? > Umm.. Whatever we decide on the fix, I think we should backpatch it to > 8.3 because I am worried that someone way get completely confused with > the current vacuum report, "Somebody might misread an optional report" doesn't seem to me to be on the same risk level as "we might destabilize a stable release". The policy of this project is that we only put nontrivial bug fixes into back branches, and I don't think this item qualifies ... regards, tom lane
On Fri, Apr 4, 2008 at 11:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> The policy of this project is that we only put nontrivial bug fixes into
> back branches, and I don't think this item qualifies ...

Got it. I will submit a patch for HEAD.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB
http://www.enterprisedb.com
Pavan Deolasee wrote:
> On Fri, Apr 4, 2008 at 11:10 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The policy of this project is that we only put nontrivial bug fixes into
>> back branches, and I don't think this item qualifies ...
>
> Got it. I will submit a patch for HEAD.

As I mentioned earlier, I patched 8.3.1 with Pavan's patch and have been running tests. After a few days I have got postgres to lock up - not sure if it is related. Below is a ps from my system (NetBSD 3):

TEST> ps -ax | grep post
 1952 ?     IW<s  13:52.24 postgres: writer process
 2113 ?     S<s    0:03.04 postgres: logger process
 2157 ?     S<s    0:03.12 postgres: autovacuum launcher process
 2199 ?     I<s    0:00.04 postgres: metauser metadb [local] SELECT
 2472 ?     DW<s 814:23.50 postgres: metauser metadb localhost(65524) COMMIT
 2661 ?     DW<s   0:11.27 postgres: metauser metadb localhost(65525) idle
 2680 ?     S<s    1:18.75 postgres: stats collector process
 3156 ?     S<s    0:45.12 postgres: wal writer process
24362 ?     IW<s   0:00.00 postgres: autovacuum worker process
25024 ?     IW<s   0:00.00 postgres: autovacuum worker process
25134 ?     IW<s   0:00.00 postgres: autovacuum worker process
 3289 ttyp5 I<     0:01.96 /usr/local/pgsql/bin/postgres -D ../data/metadb

and I was disconnected in my client app with the following messages:

[WARN] PSQL:exec - failed in command <SELECT relname,n_tup_ins,n_live_tup,n_dead_tup,pg_total_relation_size('s8_0000.' || relname)*10/(1024*1024),last_autovacuum FROM pg_stat_user_tables WHERE schemaname='s8_0000' ORDER BY n_tup_ins DESC>
[WARN] error = 'server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.'
[WARN] ConnectionNB: PQconsumeInput failed with error 'server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.'

The server is still running but I can't access it. Running top yields:

load averages: 0.23, 0.23, 0.21                              09:53:58
110 processes: 109 sleeping, 1 on processor
Memory: 513M Act, 256M Inact, 1336K Wired, 75M Exec, 557M File, 2776K Free
Swap: 600M Total, 600M Free

  PID USERNAME PRI NICE  SIZE   RES STATE    TIME   WCPU   CPU COMMAND
  463 root       2    0 6132K   14M select   0:06  0.05% 0.05% kdeinit
 2472 postgres -22   -2 4580K    4K mclpl  814:23  0.00% 0.00% <postgres>
 2631 root     -22    0  644K    4K mclpl  606:25  0.00% 0.00% <test_writer
 1622 root       2    0 8456K   14M select  19:05  0.00% 0.00% kdeinit
 1952 postgres   2   -2 3544K    4K netlck  13:52  0.00% 0.00% <postgres>
  233 root       2    0   24M   31M select   4:47  0.00% 0.00% XFree86
  451 root       2    0 3544K   15M select   4:45  0.00% 0.00% kdeinit
   16 root      18    0    0K  182M syncer   3:51  0.00% 0.00% [ioflush]
   17 root     -18    0    0K  182M aiodoned 1:46  0.00% 0.00% [aiodoned]
   15 root     -18    0    0K  182M pgdaemon 1:30  0.00% 0.00% [pagedaemon]
 1301 root     -22    0 4092K    4K mclpl    1:23  0.00% 0.00% <kdeinit>
 2680 postgres   2   -2 3560K 1588K poll     1:18  0.00% 0.00% postgres
 1493 root       2    0 3488K   17M select   1:09  0.00% 0.00% korgac
  461 root       2    0 3748K   16M select   0:57  0.00% 0.00% kdeinit
 3156 postgres   2   -2 3448K 1792K select   0:45  0.00% 0.00% postgres
 1174 root       2    0 2608K 2928K select   0:40  0.00% 0.00% profiler
 1428 root       2    0 3376K   13M select   0:26  0.00% 0.00% kdeinit
 2661 postgres -22   -2 4896K    4K mclpl    0:11  0.00% 0.00% <postgres>

I'm not convinced this is a PostgreSQL bug (state=mclpl concerns me), but it's the first time I've seen it. I suppose it could be: http://www.netbsd.org/cgi-bin/query-pr-single.pl?number=35224. Anything I can do which might help isolate the problem?

Regards
 Stuart