Thread: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range
Hi hackers,
I'd like to propose searching for the min and max values in the index with SnapshotAny in the get_actual_variable_range function.
The current implementation scans the index with SnapshotDirty, which accepts uncommitted rows and rejects dead rows.
In the code there is a comment about this:
/*
* In principle, we should scan the index with our current
* active snapshot, which is the best approximation we've got
* to what the query will see when executed. But that won't
* be exact if a new snap is taken before running the query,
* and it can be very expensive if a lot of uncommitted rows
* exist at the end of the index (because we'll laboriously
* fetch each one and reject it). What seems like a good
* compromise is to use SnapshotDirty. That will accept
* uncommitted rows, and thus avoid fetching multiple heap
* tuples in this scenario. On the other hand, it will reject
* known-dead rows, and thus not give a bogus answer when the
* extreme value has been deleted; that case motivates not
* using SnapshotAny here.
*/
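For context, the probe that this comment sits above is set up roughly as follows (paraphrased from get_actual_variable_range() in selfuncs.c; the exact code varies by version):

    /* Paraphrased sketch of the current probe in get_actual_variable_range() */
    SnapshotData SnapshotDirty;

    InitDirtySnapshot(SnapshotDirty);

    index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, 1, 0);
    index_rescan(index_scan, scankeys, 1, NULL, 0);

    /* Fetch the first tuple in the sort operator's direction */
    if ((tup = index_getnext(index_scan, indexscandir)) != NULL)
    {
        /* extract the endpoint value of the index column from the heap tuple */
        ...
    }

    index_endscan(index_scan);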
But if we delete many rows from the beginning or end of the index, it is also
very expensive, because we will fetch each dead row and reject it.
The following sequence can be used to reproduce this issue:
psql -c "DROP DATABASE test_polygon";
psql -c "CREATE DATABASE test_polygon";
psql test_polygon -c "CREATE EXTENSION postgis";
psql test_polygon -f /tmp/data.sql;
psql test_polygon -c "ANALYZE";
# \d polygon_table
Table "public.polygon_table"
Column | Type | Modifiers
-----------+--------------------------+------------------------------------------------------------
id | integer | not null default nextval('polygon_table_id_seq'::regclass)
time | timestamp with time zone | not null
poly | geometry(Polygon,4326) | not null
second_id | integer | not null
Indexes:
"polygon_table_pkey" PRIMARY KEY, btree (id)
"polygon_table_b179ed4a" btree (second_id)
"polygon_table_poly_id" gist (poly)
"polygon_table_time" btree ("time")
Foreign-key constraints:
"second_table_id" FOREIGN KEY (second_id) REFERENCES second_table(id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
1st session:
pgbench test_polygon -P 1 -R 6000 -c 24 -j 8 -T 1000 -n -f /tmp/bad_request
2nd session:
psql test_polygon -c "DELETE FROM polygon_table WHERE time <= '2017-03-30 16:00:00+03'"
After the delete we have many dead rows at the beginning of the index "polygon_table_time" (time).
pgbench output:
progress: 1.0 s, 6023.8 tps, lat 1.170 ms stddev 1.022, lag 0.157 ms
progress: 2.0 s, 6023.8 tps, lat 1.045 ms stddev 0.182, lag 0.076 ms
progress: 3.0 s, 5957.0 tps, lat 1.046 ms stddev 0.176, lag 0.071 ms
progress: 4.0 s, 6066.9 tps, lat 1.061 ms stddev 0.184, lag 0.072 ms
progress: 5.0 s, 6178.1 tps, lat 1.060 ms stddev 0.189, lag 0.076 ms
progress: 6.0 s, 6079.0 tps, lat 1.075 ms stddev 0.195, lag 0.075 ms
progress: 7.0 s, 6246.0 tps, lat 1.069 ms stddev 0.194, lag 0.076 ms
progress: 8.0 s, 6046.0 tps, lat 1.050 ms stddev 0.181, lag 0.073 ms
progress: 9.0 s, 1255.0 tps, lat 79.114 ms stddev 189.686, lag 63.194 ms
progress: 10.0 s, 4696.0 tps, lat 1015.294 ms stddev 36.291, lag 1009.983 ms
progress: 11.0 s, 6031.0 tps, lat 1001.354 ms stddev 59.379, lag 997.375 ms
progress: 12.0 s, 6013.0 tps, lat 961.725 ms stddev 104.536, lag 957.736 ms
progress: 13.0 s, 6098.0 tps, lat 936.516 ms stddev 140.039, lag 932.580 ms
progress: 14.0 s, 6032.0 tps, lat 935.867 ms stddev 137.761, lag 931.892 ms
progress: 15.0 s, 5975.0 tps, lat 950.911 ms stddev 153.438, lag 946.895 ms
progress: 16.0 s, 6044.0 tps, lat 953.380 ms stddev 146.601, lag 949.413 ms
progress: 17.0 s, 6105.0 tps, lat 956.524 ms stddev 134.940, lag 952.593 ms
progress: 18.0 s, 6097.0 tps, lat 950.913 ms stddev 135.902, lag 946.980 ms
progress: 19.0 s, 6004.9 tps, lat 933.010 ms stddev 142.037, lag 929.014 ms
progress: 20.0 s, 6078.1 tps, lat 920.415 ms stddev 157.117, lag 916.469 ms
progress: 21.0 s, 5402.0 tps, lat 945.490 ms stddev 145.262, lag 941.048 ms
progress: 22.0 s, 5226.0 tps, lat 1082.013 ms stddev 141.718, lag 1077.423 ms
progress: 23.0 s, 12794.1 tps, lat 479.046 ms stddev 434.510, lag 478.106 ms
progress: 24.0 s, 5914.8 tps, lat 0.604 ms stddev 0.075, lag 0.067 ms
progress: 25.0 s, 5994.0 tps, lat 0.596 ms stddev 0.071, lag 0.066 ms
progress: 26.0 s, 6126.9 tps, lat 0.598 ms stddev 0.072, lag 0.067 ms
progress: 27.0 s, 6076.2 tps, lat 0.601 ms stddev 0.072, lag 0.068 ms
progress: 28.0 s, 6035.0 tps, lat 0.608 ms stddev 0.077, lag 0.068 ms
After the delete (at 9 s), latency increases significantly, up to about 1000 ms, until autovacuum comes and performs index cleanup (at 23 s).
From EXPLAIN ANALYZE we can see that planning time has increased significantly:
# explain (analyze, verbose, timing, buffers) SELECT * FROM polygon_table polygon INNER JOIN second_table second ON (polygon.second_id = second.id) WHERE ST_Intersects(poly, ST_SetSrid(ST_MakePoint(52.3433914, 58.7438431), 4326));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.56..16.86 rows=1 width=160) (actual time=0.100..0.248 rows=4 loops=1)
Output: polygon.id, polygon."time", polygon.poly, polygon.second_id, second.id
Buffers: shared hit=49
-> Index Scan using polygon_table_poly_id on public.polygon_table polygon (cost=0.29..8.55 rows=1 width=156) (actual time=0.081..0.220 rows=4 loops=1)
Output: polygon.id, polygon."time", polygon.poly, polygon.second_id
Index Cond: (polygon.poly && '0101000020E6100000245DD83FF42B4A4079ED2D40365F4D40'::geometry)
Filter: _st_intersects(polygon.poly, '0101000020E6100000245DD83FF42B4A4079ED2D40365F4D40'::geometry)
Rows Removed by Filter: 6
Buffers: shared hit=37
-> Index Only Scan using second_table_pkey on public.second_table second (cost=0.28..8.29 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=4)
Output: second.id
Index Cond: (second.id = polygon.second_id)
Heap Fetches: 4
Buffers: shared hit=12
Planning time: 115.122 ms
Execution time: 0.422 ms
(16 rows)
Time: 116.926 ms
# explain (analyze, verbose, timing, buffers) SELECT * FROM polygon_table polygon INNER JOIN second_table second ON (polygon.second_id = second.id) WHERE ST_Intersects(poly, ST_SetSrid(ST_MakePoint(52.3433914, 58.7438431), 4326));
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.56..16.86 rows=1 width=160) (actual time=0.059..0.373 rows=46 loops=1)
Output: polygon.id, polygon."time", polygon.poly, polygon.second_id, second.id
Buffers: shared hit=170
-> Index Scan using polygon_table_poly_id on public.polygon_table polygon (cost=0.29..8.55 rows=1 width=156) (actual time=0.045..0.269 rows=46 loops=1)
Output: polygon.id, polygon."time", polygon.poly, polygon.second_id
Index Cond: (polygon.poly && '0101000020E6100000245DD83FF42B4A4079ED2D40365F4D40'::geometry)
Filter: _st_intersects(polygon.poly, '0101000020E6100000245DD83FF42B4A4079ED2D40365F4D40'::geometry)
Rows Removed by Filter: 44
Buffers: shared hit=32
-> Index Only Scan using second_table_pkey on public.second_table second (cost=0.28..8.29 rows=1 width=4) (actual time=0.001..0.002 rows=1 loops=46)
Output: second.id
Index Cond: (second.id = polygon.second_id)
Heap Fetches: 46
Buffers: shared hit=138
Planning time: 6.139 ms
Execution time: 0.482 ms
(16 rows)
Time: 7.722 ms
Initially, the function used the active snapshot from GetActiveSnapshot(). But in commit fccebe421d0c410e6378fb281419442c84759213
this behavior was "weakened" to SnapshotDirty (I suppose for a similar reason).
Was there a particular reason for allowing the planner to see uncommitted rows but forbidding it access to dead ones?
A simple patch that uses SnapshotAny is attached. The comments in the code are not changed yet.
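The attachment itself is not reproduced in this archive; based on the description, the change is presumably just to pass the existing SnapshotAny snapshot to the index scan instead of a locally initialized dirty snapshot, along these lines (a sketch, not the actual patch):

    -   InitDirtySnapshot(SnapshotDirty);
    -   index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, 1, 0);
    +   /*
    +    * SnapshotAny accepts every index entry, live or dead, so the probe
    +    * never has to fetch and reject a long run of dead heap rows.
    +    */
    +   index_scan = index_beginscan(heapRel, indexRel, SnapshotAny, 1, 0);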
data.sql file: https://yadi.sk/d/GtBW6Hhu3HQ4CA
Regards,
Dmitriy Sarafannikov
Attachment
On Thu, Apr 27, 2017 at 4:08 AM, Dmitriy Sarafannikov <dsarafannikov@yandex.ru> wrote: > I'd like to propose to search min and max value in index with SnapshotAny in > get_actual_variable_range function. +1 from me, but Tom rejected that approach last time. > But if we delete many rows from beginning or end of index, it would be > very expensive too because we will fetch each dead row and reject it. Yep, and I've seen that turn into a serious problem in production. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Apr 27, 2017 at 4:08 AM, Dmitriy Sarafannikov > <dsarafannikov@yandex.ru> wrote: >> I'd like to propose to search min and max value in index with SnapshotAny in >> get_actual_variable_range function. > +1 from me, but Tom rejected that approach last time. I remain concerned about the fact that this would allow accepting deleted values that might be way outside the surviving range of values. SnapshotDirty is a bit lax about tuple state, but it's not so lax as to accept fully dead tuples. >> But if we delete many rows from beginning or end of index, it would be >> very expensive too because we will fetch each dead row and reject it. > Yep, and I've seen that turn into a serious problem in production. How so? Shouldn't the indexscan go back and mark such tuples dead in the index, such that they'd be visited this way only once? If that's not happening, maybe we should try to fix it. regards, tom lane
On 2017-04-27 17:22:25 -0400, Tom Lane wrote: > > Yep, and I've seen that turn into a serious problem in production. > > How so? Shouldn't the indexscan go back and mark such tuples dead in > the index, such that they'd be visited this way only once? If that's > not happening, maybe we should try to fix it. One way that never happens is if you end up choosing bitmap index scans all the time... I've previously had to force a certain percentage of queries with it disabled to keep indexes in a good state :( - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-04-27 17:22:25 -0400, Tom Lane wrote: >> How so? Shouldn't the indexscan go back and mark such tuples dead in >> the index, such that they'd be visited this way only once? If that's >> not happening, maybe we should try to fix it. > One way that never happens is if you end up choosing bitmap index scans > all the time... What I'm thinking of is the regular indexscan that's done internally by get_actual_variable_range, not whatever ends up getting chosen as the plan for the user query. I had supposed that that would kill dead index entries as it went, but maybe that's not happening for some reason. regards, tom lane
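For readers following the thread: the mechanism Tom is referring to is the kill_prior_tuple / LP_DEAD hinting that ordinary index scans perform. A rough sketch of the flow (paraphrased from indexam.c; details vary by version), including the in-recovery exception that becomes relevant later in the thread:

    /* index_fetch_heap(): after probing the heap for an index entry ... */
    got_heap_tuple = heap_hot_search_buffer(tid, scan->heapRelation,
                                            scan->xs_cbuf, scan->xs_snapshot,
                                            &scan->xs_ctup, &all_dead,
                                            !scan->xs_continue_hot);

    /*
     * If the whole HOT chain turned out to be dead, tell the index AM to mark
     * its entry LP_DEAD (this takes effect on the next amgettuple call), so
     * later scans can skip the heap fetch.  Note this is skipped during
     * recovery, i.e. on standbys the hint is never set.
     */
    if (!scan->xactStartedInRecovery)
        scan->kill_prior_tuple = all_dead;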
Hi all.
> On 28 Apr 2017, at 0:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
>> Yep, and I've seen that turn into a serious problem in production.
And that’s why we started digging into it :) We have already seen this be a problem in another project (not so serious), and that time we rewrote the query. But now planning of such a query consumes much more CPU (in [1], each line is one DB host) and nearly all queries on all hosts (primary and all standbys) perform much worse.
How so? Shouldn't the indexscan go back and mark such tuples dead in
the index, such that they'd be visited this way only once? If that's
not happening, maybe we should try to fix it.
That is definitely not happening. Here is the perf output when the problem happens:
root@pgload01e ~ # perf report | grep -v '^#' | head
    56.67%  postgres  postgres  [.] _bt_checkkeys
    19.27%  postgres  postgres  [.] _bt_readpage
     2.09%  postgres  postgres  [.] pglz_decompress
     2.03%  postgres  postgres  [.] LWLockAttemptLock
     1.61%  postgres  postgres  [.] PinBuffer.isra.3
     1.14%  postgres  postgres  [.] hash_search_with_hash_value
     0.68%  postgres  postgres  [.] LWLockRelease
     0.42%  postgres  postgres  [.] AllocSetAlloc
     0.40%  postgres  postgres  [.] SearchCatCache
     0.40%  postgres  postgres  [.] ReadBuffer_common
root@pgload01e ~ #
And here is a backtrace of one of the backends (they all look pretty much the same):
[Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 0x00007fe9e2660709 in _bt_checkkeys (scan=scan@entry=0x7fe9e48ec838, page=page@entry=0x7fe8e477d780 "\263$", offnum=offnum@entry=338, dir=dir@entry=ForwardScanDirection, continuescan=continuescan@entry=0x7ffe3addfd1f "\001h\341\b") at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/nbtree/nbtutils.c:1378 #0 0x00007fe9e2660709 in _bt_checkkeys (scan=scan@entry=0x7fe9e48ec838, page=page@entry=0x7fe8e477d780 "\263$", offnum=offnum@entry=338, dir=dir@entry=ForwardScanDirection, continuescan=continuescan@entry=0x7ffe3addfd1f "\001h\341\b") at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/nbtree/nbtutils.c:1378 #1 0x00007fe9e265d301 in _bt_readpage (scan=scan@entry=0x7fe9e48ec838, dir=dir@entry=ForwardScanDirection, offnum=338) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/nbtree/nbtsearch.c:1198 #2 0x00007fe9e265dbce in _bt_steppage (scan=scan@entry=0x7fe9e48ec838, dir=dir@entry=ForwardScanDirection) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/nbtree/nbtsearch.c:1357 #3 0x00007fe9e265efec in _bt_endpoint (dir=ForwardScanDirection, scan=0x7fe9e48ec838) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/nbtree/nbtsearch.c:1739 #4 _bt_first (scan=scan@entry=0x7fe9e48ec838, dir=dir@entry=ForwardScanDirection) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/nbtree/nbtsearch.c:746 #5 0x00007fe9e265c119 in btgettuple (scan=0x7fe9e48ec838, dir=ForwardScanDirection) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/nbtree/nbtree.c:326 #6 0x00007fe9e2656162 in index_getnext_tid (scan=scan@entry=0x7fe9e48ec838, direction=direction@entry=ForwardScanDirection) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/index/indexam.c:415 #7 0x00007fe9e2656354 in index_getnext (scan=scan@entry=0x7fe9e48ec838, direction=direction@entry=ForwardScanDirection) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/access/index/indexam.c:553 #8 0x00007fe9e2942f70 in get_actual_variable_range (vardata=vardata@entry=0x7ffe3ade1620, sortop=<optimized out>, min=0x7fe9e4912ae0, max=0x0, root=0x7fe9e48e89b0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/utils/adt/selfuncs.c:5137 #9 0x00007fe9e2943c24 in ineq_histogram_selectivity (root=root@entry=0x7fe9e48e89b0, vardata=vardata@entry=0x7ffe3ade1620, opproc=opproc@entry=0x7ffe3ade1550, isgt=isgt@entry=0 '\000', constval=constval@entry=493076, consttype=consttype@entry=23) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/utils/adt/selfuncs.c:833 #10 0x00007fe9e29445f8 in scalarineqsel (root=root@entry=0x7fe9e48e89b0, operator=operator@entry=97, isgt=isgt@entry=0 '\000', vardata=vardata@entry=0x7ffe3ade1620, constval=493076, consttype=23) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/utils/adt/selfuncs.c:563 #11 0x00007fe9e2946233 in mergejoinscansel (root=root@entry=0x7fe9e48e89b0, clause=<optimized out>, opfamily=<optimized out>, strategy=<optimized out>, nulls_first=<optimized out>, leftstart=leftstart@entry=0x7ffe3ade1728, leftend=leftend@entry=0x7ffe3ade1730, rightstart=rightstart@entry=0x7ffe3ade1738, rightend=rightend@entry=0x7ffe3ade1740) at 
/home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/utils/adt/selfuncs.c:3056 #12 0x00007fe9e27e1b16 in cached_scansel (rinfo=0x7fe9e49100c0, rinfo=0x7fe9e49100c0, pathkey=0x7fe9e490fb80, root=0x7fe9e48e89b0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/costsize.c:2626 #13 initial_cost_mergejoin (root=root@entry=0x7fe9e48e89b0, workspace=workspace@entry=0x7ffe3ade1830, jointype=jointype@entry=JOIN_INNER, mergeclauses=mergeclauses@entry=0x7fe9e4912a60, outer_path=outer_path@entry=0x7fe9e490f810, inner_path=inner_path@entry=0x7fe9e490f5b8, outersortkeys=outersortkeys@entry=0x7fe9e4912a10, innersortkeys=innersortkeys@entry=0x7fe9e4912ab0, sjinfo=0x7ffe3ade1a70) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/costsize.c:2226 #14 0x00007fe9e27ebde6 in try_mergejoin_path (root=root@entry=0x7fe9e48e89b0, joinrel=joinrel@entry=0x7fe9e4910b28, outer_path=outer_path@entry=0x7fe9e490f810, inner_path=inner_path@entry=0x7fe9e490f5b8, pathkeys=0x0, mergeclauses=mergeclauses@entry=0x7fe9e4912a60, outersortkeys=outersortkeys@entry=0x7fe9e4912a10, innersortkeys=innersortkeys@entry=0x7fe9e4912ab0, jointype=jointype@entry=JOIN_INNER, extra=extra@entry=0x7ffe3ade1970) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/joinpath.c:443 #15 0x00007fe9e27ec705 in sort_inner_and_outer (extra=0x7ffe3ade1970, jointype=JOIN_INNER, innerrel=0x7fe9e48eba00, outerrel=0x7fe9e48ea990, joinrel=0x7fe9e4910b28, root=0x7fe9e48e89b0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/joinpath.c:766 #16 add_paths_to_joinrel (root=root@entry=0x7fe9e48e89b0, joinrel=joinrel@entry=0x7fe9e4910b28, outerrel=outerrel@entry=0x7fe9e48ea990, innerrel=innerrel@entry=0x7fe9e48eba00, jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x7ffe3ade1a70, restrictlist=restrictlist@entry=0x7fe9e4912850) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/joinpath.c:173 #17 0x00007fe9e27ee201 in make_join_rel (root=root@entry=0x7fe9e48e89b0, rel1=rel1@entry=0x7fe9e48ea990, rel2=rel2@entry=0x7fe9e48eba00) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/joinrels.c:754 #18 0x00007fe9e27eeabb in make_rels_by_clause_joins (other_rels=<optimized out>, old_rel=<optimized out>, root=<optimized out>) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/joinrels.c:274 #19 join_search_one_level (root=root@entry=0x7fe9e48e89b0, level=level@entry=2) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/joinrels.c:96 #20 0x00007fe9e27df09b in standard_join_search (root=0x7fe9e48e89b0, levels_needed=2, initial_rels=<optimized out>) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/allpaths.c:2186 #21 0x00007fe9e27df41b in make_one_rel (root=root@entry=0x7fe9e48e89b0, joinlist=joinlist@entry=0x7fe9e48ebd58) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/path/allpaths.c:176 #22 0x00007fe9e27fa10e in query_planner (root=root@entry=0x7fe9e48e89b0, tlist=tlist@entry=0x7fe9e48e9020, qp_callback=qp_callback@entry=0x7fe9e27fa740 <standard_qp_callback>, qp_extra=qp_extra@entry=0x7ffe3ade1d60) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/plan/planmain.c:255 #23 0x00007fe9e27fbc14 in grouping_planner 
(root=root@entry=0x7fe9e48e89b0, inheritance_update=inheritance_update@entry=0 '\000', tuple_fraction=<optimized out>, tuple_fraction@entry=0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/plan/planner.c:1701 #24 0x00007fe9e27fe861 in subquery_planner (glob=glob@entry=0x7fe9e4533270, parse=parse@entry=0x7fe9e4760290, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=tuple_fraction@entry=0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/plan/planner.c:759 #25 0x00007fe9e27ff7bd in standard_planner (parse=0x7fe9e4760290, cursorOptions=256, boundParams=0x0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/optimizer/plan/planner.c:292 #26 0x00007fe9dbeb6345 in pathman_planner_hook () from /usr/lib/postgresql/9.6/lib/pg_pathman.so #27 0x00007fe9e288d1d4 in pg_plan_query (querytree=<optimized out>, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/tcop/postgres.c:798 #28 0x00007fe9e288d2c4 in pg_plan_queries (querytrees=<optimized out>, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/tcop/postgres.c:857 #29 0x00007fe9e288f52d in exec_simple_query (query_string=0x7fe9e475e9d0 "SELECT * FROM polygon_table polygon INNER JOIN second_table second ON (polygon.second_id = second.id) WHERE ST_Intersects(poly, ST_SetSrid(ST_MakePoint(52.3433914, 58.7438431), 4326));") at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/tcop/postgres.c:1022 #30 PostgresMain (argc=<optimized out>, argv=argv@entry=0x7fe9e452f428, dbname=0x7fe9e452f350 "small_weather", username=<optimized out>) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/tcop/postgres.c:4076 #31 0x00007fe9e26134e5 in BackendRun (port=0x7fe9e452d0b0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/postmaster/postmaster.c:4271 #32 BackendStartup (port=0x7fe9e452d0b0) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/postmaster/postmaster.c:3945 #33 ServerLoop () at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/postmaster/postmaster.c:1701 #34 0x00007fe9e282e4db in PostmasterMain (argc=5, argv=<optimized out>) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/postmaster/postmaster.c:1309 #35 0x00007fe9e2614232 in main (argc=5, argv=0x7fe9e44e1210) at /home/robot-gerrit/workspace/postgresql-9.6-deb/src/build/../src/backend/main/main.c:228
And if you disable autovacuum on the table, the problem will continue until you manually run vacuum on it. There could be thousands of index scans during this time. Dmitriy investigated the problem a bit more and may provide some more details.
On Thu, Apr 27, 2017 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> But if we delete many rows from beginning or end of index, it would be >>> very expensive too because we will fetch each dead row and reject it. > >> Yep, and I've seen that turn into a serious problem in production. > > How so? Shouldn't the indexscan go back and mark such tuples dead in > the index, such that they'd be visited this way only once? If that's > not happening, maybe we should try to fix it. Hmm. Actually, I think the scenario I saw was where there was a large number of tuples at the end of the index that weren't dead yet due to an old snapshot held open. That index was being scanned by lots of short-running queries. Those queries executed just fine, but they took a long time to plan because they had to step over all of the dead tuples in the index one by one. That increased planning time - multiplied by the number of times it was incurred - was sufficient to cripple the system. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Apr 27, 2017 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> How so? Shouldn't the indexscan go back and mark such tuples dead in >> the index, such that they'd be visited this way only once? If that's >> not happening, maybe we should try to fix it. > Hmm. Actually, I think the scenario I saw was where there was a large > number of tuples at the end of the index that weren't dead yet due to > an old snapshot held open. That index was being scanned by lots of > short-running queries. Those queries executed just fine, but they > took a long to plan because they had to step over all of the dead > tuples in the index one by one. But that was the scenario that we intended to fix by changing to SnapshotDirty, no? Or I guess not quite, because dead-but-still-visible-to-somebody tuples are rejected by SnapshotDirty. Maybe we need another type of snapshot that would accept any non-vacuumable tuple. I really don't want SnapshotAny semantics here, but a tuple that was live more recently than the xmin horizon seems like it's acceptable enough. HeapTupleSatisfiesVacuum already implements the right behavior, but we don't have a Snapshot-style interface for it. regards, tom lane
From: Dmitriy Sarafannikov
What I'm thinking of is the regular indexscan that's done internally
by get_actual_variable_range, not whatever ends up getting chosen as
the plan for the user query. I had supposed that that would kill
dead index entries as it went, but maybe that's not happening for
some reason.
Really, this happens as you said. Index entries are marked as dead.
But after this, backends spend CPU time skipping these killed entries
in _bt_checkkeys:
    if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
    {
        /* return immediately if there are more tuples on the page */
        if (ScanDirectionIsForward(dir))
        {
            if (offnum < PageGetMaxOffsetNumber(page))
                return NULL;
        }
        else
        {
            BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

            if (offnum > P_FIRSTDATAKEY(opaque))
                return NULL;
        }
This is confirmed by the perf records and the backtrace reported by Vladimir earlier.
root@pgload01e ~ # perf report | grep -v '^#' | head
    56.67%  postgres  postgres  [.] _bt_checkkeys
    19.27%  postgres  postgres  [.] _bt_readpage
     2.09%  postgres  postgres  [.] pglz_decompress
     2.03%  postgres  postgres  [.] LWLockAttemptLock
     1.61%  postgres  postgres  [.] PinBuffer.isra.3
     1.14%  postgres  postgres  [.] hash_search_with_hash_value
     0.68%  postgres  postgres  [.] LWLockRelease
     0.42%  postgres  postgres  [.] AllocSetAlloc
     0.40%  postgres  postgres  [.] SearchCatCache
     0.40%  postgres  postgres  [.] ReadBuffer_common
root@pgload01e ~ #
It seems like killing dead tuples does not solve this problem.
On Fri, Apr 28, 2017 at 12:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Apr 27, 2017 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> How so? Shouldn't the indexscan go back and mark such tuples dead in >>> the index, such that they'd be visited this way only once? If that's >>> not happening, maybe we should try to fix it. > >> Hmm. Actually, I think the scenario I saw was where there was a large >> number of tuples at the end of the index that weren't dead yet due to >> an old snapshot held open. That index was being scanned by lots of >> short-running queries. Those queries executed just fine, but they >> took a long to plan because they had to step over all of the dead >> tuples in the index one by one. > > But that was the scenario that we intended to fix by changing to > SnapshotDirty, no? Or I guess not quite, because > dead-but-still-visible-to-somebody tuples are rejected by SnapshotDirty. Yup. > Maybe we need another type of snapshot that would accept any > non-vacuumable tuple. I really don't want SnapshotAny semantics here, > but a tuple that was live more recently than the xmin horizon seems > like it's acceptable enough. HeapTupleSatisfiesVacuum already > implements the right behavior, but we don't have a Snapshot-style > interface for it. Maybe. What I know is that several people have found SnapshotDirty to be problematic, and in the case with which I am acquainted, using SnapshotAny fixed it. I do not know whether, if everybody in the world were using SnapshotAny, somebody would have the problem you're talking about, or some other one. And if they did, I don't know whether using the new kind of snapshot you are proposing would fix it. I do know that giving SnapshotAny to people seems to have only improved things according to the information currently available to me. I don't, in general, share your intuition that using SnapshotAny is the wrong thing. We're looking up the last value in the index for planning purposes. It seems to me that, as far as making index scans more or less expensive to scan, a dead tuple is almost as good as a live one. Until that tuple is not only marked dead, but removed from the index page, it contributes to the cost of an index scan. To put that another way, suppose the range of index keys is 0 to 2 million, but the heap tuples for values 1 million and up are committed deleted. All the index tuples remain (and may or may not be removable depending on what other snapshots exist in the system). Now, consider the following three cases: (a) index scan from 0 to 10,000 (b) index scan from 1,000,000 to 1,010,000 (c) index scan from 3,000,000 to 3,010,000 I contend that the cost of index scan (b) is a lot closer to the cost of (a) than to the cost of (c). (b) finds a whole bunch of index tuples; (c) gives up immediately and goes home. So I actually think that using the actual last value in the index - 2,000,000 - is conceptually *correct* regardless of whether it's marked dead and regardless of whether the corresponding heap tuple is dead. The cost depends mostly on which tuples are present in the index, not which table rows the user can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Apr 28, 2017 at 12:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Maybe we need another type of snapshot that would accept any >> non-vacuumable tuple. I really don't want SnapshotAny semantics here, > I don't, in general, share your intuition that using SnapshotAny is > the wrong thing. I think you are mistaken. > We're looking up the last value in the index for > planning purposes. It seems to me that, as far as making index scans > more or less expensive to scan, a dead tuple is almost as good as a > live one. You are confusing number of tuples in the index, which we estimate from independent measurements such as the file size, with endpoint value, which is used for purposes like guessing whether a mergejoin will be able to stop early. For purposes like that, we do NOT want to include dead tuples, because the merge join is never gonna see 'em. In short, arguing about the cost of an indexscan per se is quite irrelevant, because this statistic is not used when estimating the cost of an indexscan. Or put another way: the observed problem is planning time, not that the resulting estimates are bad. You argue that we should cut planning time by changing the definition of the estimate, and claim that the new definition is better, but I think you have nothing but wishful thinking behind that. I'm willing to try to cut planning time, but I don't want the definition to change any further than it has to. regards, tom lane
On Fri, Apr 28, 2017 at 3:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > You are confusing number of tuples in the index, which we estimate from > independent measurements such as the file size, with endpoint value, > which is used for purposes like guessing whether a mergejoin will be > able to stop early. For purposes like that, we do NOT want to include > dead tuples, because the merge join is never gonna see 'em. I spent several hours today thinking about this and, more than once, thought I'd come up with an example demonstrating why my idea was better than yours (despite the fact that, as you point out, the merge join is never gonna see 'em). However, in each instance, I eventually realized that I was wrong, so I guess I'll have to concede this point. > Or put another way: the observed problem is planning time, not that the > resulting estimates are bad. You argue that we should cut planning > time by changing the definition of the estimate, and claim that the > new definition is better, but I think you have nothing but wishful > thinking behind that. I'm willing to try to cut planning time, but > I don't want the definition to change any further than it has to. OK, I guess that makes sense. There can be scads of dead tuples at the end of the index, and there's no surety that the query actually requires touching that portion of the index at all apart from planning, so it seems a bit unfortunate to burden planning with the overhead of cleaning them up. But I guess with your new proposed definition at least that can only happen once. After that there may be a bunch of pages to skip at the end of the index before we actually find a tuple, but they should at least be empty. Right now, you can end up skipping many tuples repeatedly for every query. I tested a two-table join with a million committed deleted but not dead tuples at the end of one index; that increased planning time from ~0.25ms to ~90ms; obviously, a two-and-a-half order of magnitude increase in CPU time spent planning is not a welcome development on a production system. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Dmitriy Sarafannikov
> Maybe we need another type of snapshot that would accept any
> non-vacuumable tuple. I really don't want SnapshotAny semantics here,
> but a tuple that was live more recently than the xmin horizon seems
> like it's acceptable enough. HeapTupleSatisfiesVacuum already
> implements the right behavior, but we don't have a Snapshot-style
> interface for it.

If I understood correctly, this new type of snapshot would help if there are long-running transactions which can see these tuples.
But if there are no long-running transactions, it will be the same. Am I right?

And what about not fetching the actual min and max values from indexes whose columns aren't involved in the join?
Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes: >> Maybe we need another type of snapshot that would accept any >> non-vacuumable tuple. I really don't want SnapshotAny semantics here, > If I understood correctly, this new type of snapshot would help if > there are long running transactions which can see this tuples. > But if there are not long running transactions, it will be the same. > Am i right? Right. You haven't shown us much about the use-case you're concerned with, so it's not clear what's actually needed. > And what about don’t fetch actual min and max values from indexes > whose columns doesn’t involved in join? We don't fetch that info unless we need it. I'm not entirely certain, but there could be cases where a single planning cycle ends up fetching that data more than once. (There's caching at the RestrictInfo level, but that might not be enough.) So a line of thought that might be worth looking into is adding a lower layer of caching to make sure it's not done more than once per plan. Again, whether this saves anything would depend a lot on specific use-cases. regards, tom lane
On Fri, Apr 28, 2017 at 10:02 PM, Dmitriy Sarafannikov <dsarafannikov@yandex.ru> wrote: > > What I'm thinking of is the regular indexscan that's done internally > by get_actual_variable_range, not whatever ends up getting chosen as > the plan for the user query. I had supposed that that would kill > dead index entries as it went, but maybe that's not happening for > some reason. > > > Really, this happens as you said. Index entries are marked as dead. > But after this, backends spends cpu time on skip this killed entries > in _bt_checkkeys : > If that is the case, then how would using SnapshotAny solve this problem. We get the value from index first and then check its visibility in heap, so if time is spent in _bt_checkkeys, why would using a different kind of Snapshot solve the problem? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> On 29 Apr 2017, at 17:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes:
>>> Maybe we need another type of snapshot that would accept any
>>> non-vacuumable tuple. I really don't want SnapshotAny semantics here,
>
>> If I understood correctly, this new type of snapshot would help if
>> there are long-running transactions which can see these tuples.
>> But if there are no long-running transactions, it will be the same.
>> Am I right?
>
> Right. You haven't shown us much about the use-case you're concerned
> with, so it's not clear what's actually needed.

The use case is nearly the same as the way to reproduce the problem described in the first letter. It's an OLTP database with short, mostly read-only queries (~6k rps). Every 10 minutes new data is inserted (~5-10% of rows in polygon_table) and old data is deleted (~5-10% of rows in polygon_table). Insert and delete are made in different transactions. Until the vacuum after the delete is finished, the planning time is two orders of magnitude higher than usual. If we use prepared statements the problem doesn't reproduce, since planning is not actually done.

>> And what about not fetching the actual min and max values from indexes
>> whose columns aren't involved in the join?
>
> We don't fetch that info unless we need it.

We used to think that this is actually not so (there was a problem in our test case), but we rechecked and it is: the planner doesn't find min and max in indexes that aren't needed for the join.

> I'm not entirely certain, but there could be cases where a single
> planning cycle ends up fetching that data more than once. (There's
> caching at the RestrictInfo level, but that might not be enough.)
> So a line of thought that might be worth looking into is adding a
> lower layer of caching to make sure it's not done more than once per
> plan. Again, whether this saves anything would depend a lot on
> specific use-cases.
>
> regards, tom lane
From: Dmitriy Sarafannikov
> If that is the case, then how would using SnapshotAny solve this
> problem. We get the value from index first and then check its
> visibility in heap, so if time is spent in _bt_checkkeys, why would
> using a different kind of Snapshot solve the problem?

The first scan of the index with SnapshotAny will accept these rows and will not mark these entries as killed. Furthermore, killed entries are ignored on standby, and scanning with SnapshotAny will always accept the first row from the index.

    /*
     * During recovery we ignore killed tuples and don't bother to kill them
     * either. We do this because the xmin on the primary node could easily be
     * later than the xmin on the standby node, so that what the primary
     * thinks is killed is supposed to be visible on standby. So for correct
     * MVCC for queries during recovery we must ignore these hints and check
     * all tuples. Do *not* set ignore_killed_tuples to true when running in a
     * transaction that was started during recovery. xactStartedInRecovery
     * should not be altered by index AMs.
     */
    scan->kill_prior_tuple = false;
    scan->xactStartedInRecovery = TransactionStartedDuringRecovery();
    scan->ignore_killed_tuples = !scan->xactStartedInRecovery;

We execute these read-only queries on standby.
Hi.

> On 25 Apr 2017, at 18:13, Dmitriy Sarafannikov <dsarafannikov@yandex.ru> wrote:
>
> I'd like to propose searching for the min and max values in the index with SnapshotAny in the get_actual_variable_range function.
> The current implementation scans the index with SnapshotDirty, which accepts uncommitted rows and rejects dead rows.

The patch is already being discussed here [1].

[1] https://www.postgresql.org/message-id/05C72CF7-B5F6-4DB9-8A09-5AC897653113%40yandex.ru
From: Dmitriy Sarafannikov
> Maybe we need another type of snapshot that would accept any
> non-vacuumable tuple. I really don't want SnapshotAny semantics here,
> but a tuple that was live more recently than the xmin horizon seems
> like it's acceptable enough. HeapTupleSatisfiesVacuum already
> implements the right behavior, but we don't have a Snapshot-style
> interface for it.

I have tried to implement this new type of snapshot that accepts any non-vacuumable tuples.
We have tried this patch in our load environment, and it has smoothed out and reduced the magnitude of the CPU usage peaks.
But this snapshot does not solve the problem completely.

Patch is attached.
Attachment
On Thu, May 4, 2017 at 9:42 PM, Dmitriy Sarafannikov <dsarafannikov@yandex.ru> wrote: > >> Maybe we need another type of snapshot that would accept any >> non-vacuumable tuple. I really don't want SnapshotAny semantics here, >> but a tuple that was live more recently than the xmin horizon seems >> like it's acceptable enough. HeapTupleSatisfiesVacuum already >> implements the right behavior, but we don't have a Snapshot-style >> interface for it. > > > I have tried to implement this new type of snapshot that accepts any > non-vacuumable tuples. > We have tried this patch in our load environment. And it has smoothed out > and reduced magnitude of the cpu usage peaks. > But this snapshot does not solve the problem completely. > > Patch is attached. 1. +#define InitNonVacuumableSnapshot(snapshotdata) \ + do { \ + (snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable; \ + (snapshotdata).xmin = RecentGlobalDataXmin; \ + } while(0) + Can you explain and add comments why you think RecentGlobalDataXmin is the right to use it here? As of now, it seems to be only used for pruning non-catalog tables. 2. +bool +HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot, + Buffer buffer) +{ + return HeapTupleSatisfiesVacuum(htup, snapshot->xmin, buffer) + != HEAPTUPLE_DEAD; +} + Add comments on top of this function and for the sake of consistency update the file header as well (Summary of visibility functions:) -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
From: Dmitriy Sarafannikov
Amit, thanks for the comments!

> 1.
> +#define InitNonVacuumableSnapshot(snapshotdata) \
> + do { \
> + (snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable; \
> + (snapshotdata).xmin = RecentGlobalDataXmin; \
> + } while(0)
> +
>
> Can you explain and add comments why you think RecentGlobalDataXmin is
> the right to use it here? As of now, it seems to be only used for
> pruning non-catalog tables.

Can you explain to me what value for xmin should be used there?

> 2.
> +bool
> +HeapTupleSatisfiesNonVacuumable(HeapTuple htup, Snapshot snapshot,
> + Buffer buffer)
> +{
> + return HeapTupleSatisfiesVacuum(htup, snapshot->xmin, buffer)
> + != HEAPTUPLE_DEAD;
> +}
> +
>
> Add comments on top of this function and for the sake of consistency
> update the file header as well (Summary of visibility functions:)

Yes, I will add comments and send a new patch.
On Fri, May 5, 2017 at 1:28 PM, Dmitriy Sarafannikov <dsarafannikov@yandex.ru> wrote: > Amit, thanks for comments! > >> 1. >> +#define InitNonVacuumableSnapshot(snapshotdata) \ >> + do { \ >> + (snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable; \ >> + (snapshotdata).xmin = RecentGlobalDataXmin; \ >> + } while(0) >> + >> >> Can you explain and add comments why you think RecentGlobalDataXmin is >> the right to use it here? As of now, it seems to be only used for >> pruning non-catalog tables. > > Can you explain me, what value for xmin should be used there? > I think we can use RecentGlobalDataXmin for non-catalog relations and RecentGlobalXmin for catalog relations (probably a check similar to what we have in heap_page_prune_opt). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
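For reference, the check in heap_page_prune_opt that Amit mentions branches on whether the relation is a catalog, roughly like this (simplified sketch; the old-snapshot-threshold handling is left out):

    /* Simplified sketch of heap_page_prune_opt()'s horizon choice */
    if (IsCatalogRelation(relation) ||
        RelationIsAccessibleInLogicalDecoding(relation))
        xmin_horizon = RecentGlobalXmin;
    else
        xmin_horizon = RecentGlobalDataXmin;    /* laxer horizon for user tables */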
From: Dmitriy Sarafannikov
I think we can use RecentGlobalDataXmin for non-catalog relations and
RecentGlobalXmin for catalog relations (probably a check similar to
what we have in heap_page_prune_opt).
I took the check from heap_page_prune_opt (maybe this check should be a separate function?).
But it requires initializing the snapshot for a specific relation. Maybe we need to use GetOldestXmin()?
A new version of the patch is attached.
Attachment
On Mon, May 8, 2017 at 6:30 PM, Dmitriy Sarafannikov <dsarafannikov@yandex.ru> wrote: > > I think we can use RecentGlobalDataXmin for non-catalog relations and > RecentGlobalXmin for catalog relations (probably a check similar to > what we have in heap_page_prune_opt). > > > I took check from heap_page_prune_opt (Maybe this check must be present as > separate function?) > But it requires to initialize snapshot for specific relation. > + else \ + (snapshotdata).xmin = \ + TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin, \ + relation); \ I think we don't need to use TransactionIdLimitedForOldSnapshots() as that is required to override xmin for table vacuum/pruning purposes. > Maybe we need > to use GetOldestXmin()? It is a costly call as it needs ProcArrayLock, so I don't think it makes sense to use it here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
From: Dmitriy Sarafannikov
> + else \
> + (snapshotdata).xmin = \
> + TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin, \
> + relation); \
>
> I think we don't need to use TransactionIdLimitedForOldSnapshots() as
> that is required to override xmin for table vacuum/pruning purposes.
>
>> Maybe we need
>> to use GetOldestXmin()?
>
> It is a costly call as it needs ProcArrayLock, so I don't think it
> makes sense to use it here.

OK, I agree. Patch is attached.
Attachment
From: Dmitriy Sarafannikov
> OK, I agree. Patch is attached.

I have added the patch to the CF.
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, failed
Spec compliant: tested, failed
Documentation: tested, failed

Hi! I've looked into the patch.

There is not so much of the code. The code itself is pretty clear and well documented. I've found just one small typo, "transactionn", in the HeapTupleSatisfiesNonVacuumable() comments.

The chosen snapshot type is not a solution to the author's problem at hand, though the implemented SnapshotNonVacuumable is closer to a rational choice than the currently used SnapshotDirty for range sketching. As far as I can see, that is what get_actual_variable_range() is used for, despite the word "actual" in its name.
The only place where get_actual_variable_range() is used for the actual range is preprocessed-out in get_variable_range():

/*
 * XXX It's very tempting to try to use the actual column min and max, if
 * we can get them relatively-cheaply with an index probe.  However, since
 * this function is called many times during join planning, that could
 * have unpleasant effects on planning speed.  Need more investigation
 * before enabling this.
 */
#ifdef NOT_USED
if (get_actual_variable_range(root, vardata, sortop, min, max))
    return true;
#endif

I think it would be good if we could have something like "give me the actual values, but only if they are on the first index page", like conditional locks. But I think this patch is a step in the right direction.

Best regards, Andrey Borodin.
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Oops, missed those checkboxes. Sorry for the noise. Here's the correct checklist: installcheck-world passes, documented as expected. Feature is there.

Best regards, Andrey Borodin.
> On 18 Aug 2017, at 08:50, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, failed
> Implements feature: tested, failed
> Spec compliant: tested, failed
> Documentation: tested, failed
>
> Hi! I've looked into the patch.
>
> There is not so much of the code. The code itself is pretty clear and well documented. I've found just one small typo, "transactionn", in the HeapTupleSatisfiesNonVacuumable() comments.
>
> The chosen snapshot type is not a solution to the author's problem at hand, though the implemented SnapshotNonVacuumable is closer to a rational choice than the currently used SnapshotDirty for range sketching. As far as I can see, that is what get_actual_variable_range() is used for, despite the word "actual" in its name.
> The only place where get_actual_variable_range() is used for the actual range is preprocessed-out in get_variable_range():
>
> /*
>  * XXX It's very tempting to try to use the actual column min and max, if
>  * we can get them relatively-cheaply with an index probe.  However, since
>  * this function is called many times during join planning, that could
>  * have unpleasant effects on planning speed.  Need more investigation
>  * before enabling this.
>  */
> #ifdef NOT_USED
> if (get_actual_variable_range(root, vardata, sortop, min, max))
>     return true;
> #endif
>
> I think it would be good if we could have something like "give me the actual values, but only if they are on the first index page", like conditional locks. But I think this patch is a step in the right direction.

Thanks for your review!

Based on your review and conclusions, I've bumped the status to "Ready for Committer". If you believe another status would be more appropriate, please go in and update it.

cheers ./daniel
Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes: > [ snapshot_non_vacuumable_v3.patch ] Starting to look at this. I think that the business with choosing RecentGlobalXmin vs. RecentGlobalDataXmin is just wrong. What we want to do is accept any tuple that would not be considered killable by heap_hot_search_buffer, and that looks only at RecentGlobalXmin. It's possible that it'd be sensible for heap_hot_search_buffer to use RecentGlobalDataXmin when dealing with a non-catalog table, allowing it to consider more tuples killable. But I'm not sure; it looks like the decision which to use costs some effort, which we probably don't want to be spending in heap_hot_search_buffer. In any case that would be a different patch topic. With the patch as it stands, if we have a user-data tuple that is dead more recently than RecentGlobalXmin but not more recently than RecentGlobalDataXmin, the snapshot will reject it so the planner's indexscan will keep searching. But heap_hot_search_buffer won't tell the index AM to kill the tuple, so the index entry stays marked as live, which means the next get_actual_variable_range call will have to scan past it again. So we don't get the hoped-for benefit with tuple deaths in that range. I do not know whether this effect might explain your finding that the patch didn't entirely fix your performance problem. In short I think we should just set up the threshold as RecentGlobalXmin. However, this seems like a decision that might be pretty specific to get_actual_variable_range's use-case, so I'm inclined to have the threshold be chosen by get_actual_variable_range itself not hard-wired into InitNonVacuumableSnapshot. Another point is that I think it might be possible for RecentGlobalXmin to advance after get_actual_variable_range looks at it (for example, if we have to take a new catalog snapshot during index scan setup). This creates the possibility that the snapshot accepts tuples that heap_hot_search_buffer would consider dead. That seems fairly harmless though, so I'm inclined not to fuss about it. We could arrange for HeapTupleSatisfiesNonVacuumable to use RecentGlobalXmin directly, but that seems like it makes it too special-purpose. regards, tom lane
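To illustrate the shape Tom describes here (a sketch only; the code as committed may differ in detail): the snapshot-init macro takes the horizon as a parameter, and get_actual_variable_range itself passes RecentGlobalXmin:

    /* Sketch: macro parameterized by the xmin horizon (snapmgr.h style) */
    #define InitNonVacuumableSnapshot(snapshotdata, xmin_horizon)  \
        ((snapshotdata).satisfies = HeapTupleSatisfiesNonVacuumable, \
         (snapshotdata).xmin = (xmin_horizon))

    /* Sketch: call site in get_actual_variable_range() */
    SnapshotData SnapshotNonVacuumable;

    InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
    index_scan = index_beginscan(heapRel, indexRel, &SnapshotNonVacuumable, 1, 0);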
I wrote: > Dmitriy Sarafannikov <dsarafannikov@yandex.ru> writes: >> [ snapshot_non_vacuumable_v3.patch ] > In short I think we should just set up the threshold as RecentGlobalXmin. Pushed with that adjustment and some fooling with the comments. regards, tom lane