Thread: Are bitmap index scans slow to start?
(Sorry moderators for any double posts, I keep making subscription errors. Hopefully this one gets through)
Hi speed freaks,
Can anyone tell me why the bitmap heap scan takes so long to start for this query? (SQL and EXPLAIN ANALYZE follow.)
The big culprit in this appears to be:
->  Bitmap Index Scan on log_2013_01_session_idx  (cost=0.00..63186.52 rows=2947664 width=0) (actual time=32611.918..32611.918 rows=2772042 loops=1)
      Index Cond: (session_id = 27)
I can't see anything that occurs between actual time 0.0..32611.918 that this could be waiting on. Is it building the bitmap?
Running the query a second time yields this:
->  Bitmap Index Scan on log_2013_01_session_idx  (cost=0.00..63186.52 rows=2947664 width=0) (actual time=2896.601..2896.601 rows=2772042 loops=1)
      Index Cond: (session_id = 27)
Does the bitmap then get cached? These queries are built dynamically and called rarely, so their first-time performance is important. I'd prefer a strategy that allowed fast performance the first time, rather than slow the first time and extremely fast subsequently.
Thanks,
Carlo

SELECT
   l.session_id,
   l.log_id,
   s.session_type_code,
   coalesce(st.name, '?' || s.session_type_code || '?') AS session_type_name,
   l.input_resource_id,
   ir.impt_schema AS input_resource_table_schema,
   ir.impt_table AS input_resource_table_name,
   ir.resource AS input_resource_name,
   l.input_resource_pkey_id,
   tar_table.table_schema,
   tar_table.table_name,
   l.target_pkey_id AS table_pkey_id,
   tar_op.name AS operation,
   tar_note.name AS note
FROM mdx_audit.log_2013_01 AS l
JOIN mdx_audit.session AS s USING (session_id)
JOIN mdx_audit.target_table AS tar_table USING (target_table_id)
JOIN mdx_audit.target_operation_type AS tar_op USING (target_operation_type_code)
LEFT OUTER JOIN mdx_audit.target_note AS tar_note USING (target_note_id)
LEFT OUTER JOIN mdx_audit.session_type AS st USING (session_type_code)
LEFT OUTER JOIN mdx_core.input_resource AS ir USING (input_resource_id)
WHERE l.session_id = 27
   AND (
      input_resource_pkey_id IS NULL
      OR input_resource_pkey_id IN (
         494568472, 494568473, 494568474, 494568475, 494568476,
         494568477, 494568478, 494568479, 494568480, 494568481,
         494568482, 494568483, 494568484, 494568485, 494568486,
         494568487, 494568488, 494568489, 494568490
      )
   )

Hash Left Join  (cost=63191.88..853169.29 rows=92 width=2199) (actual time=34185.045..44528.710 rows=603 loops=1)
  Hash Cond: (l.input_resource_id = ir.input_resource_id)
  ->  Hash Left Join  (cost=63190.22..853165.68 rows=92 width=1377) (actual time=34184.963..44528.391 rows=603 loops=1)
        Hash Cond: (l.target_note_id = tar_note.target_note_id)
        ->  Hash Join  (cost=63189.07..853164.06 rows=92 width=1161) (actual time=34184.872..44528.167 rows=603 loops=1)
              Hash Cond: (l.target_operation_type_code = tar_op.target_operation_type_code)
              ->  Nested Loop  (cost=63188.00..853161.72 rows=92 width=1125) (actual time=34184.809..44527.884 rows=603 loops=1)
                    ->  Nested Loop Left Join  (cost=0.00..9.34 rows=1 width=65) (actual time=12.057..12.068 rows=1 loops=1)
                          Join Filter: (s.session_type_code = st.session_type_code)
                          ->  Index Scan using session_pkey on session s  (cost=0.00..8.27 rows=1 width=7) (actual time=6.847..6.850 rows=1 loops=1)
                                Index Cond: (session_id = 27)
                          ->  Seq Scan on session_type st  (cost=0.00..1.03 rows=3 width=70) (actual time=5.204..5.207 rows=3 loops=1)
                    ->  Hash Join  (cost=63188.00..853151.47 rows=92 width=1064) (actual time=34172.746..44515.696 rows=603 loops=1)
                          Hash Cond: (l.target_table_id = tar_table.target_table_id)
                          ->  Bitmap Heap Scan on log_2013_01 l  (cost=63186.57..853148.39 rows=194 width=34) (actual time=34172.631..44515.318 rows=603 loops=1)
                                Recheck Cond: (session_id = 27)
                                Filter: ((input_resource_pkey_id IS NULL) OR (input_resource_pkey_id = ANY ('{494568472,494568473,494568474,494568475,494568476,494568477,494568478,494568479,494568480,494568481,494568482,494568483,494568484,494568485,494568486,494568487,494568488,494568489,494568490}'::bigint[])))
                                ->  Bitmap Index Scan on log_2013_01_session_idx  (cost=0.00..63186.52 rows=2947664 width=0) (actual time=32611.918..32611.918 rows=2772042 loops=1)
                                      Index Cond: (session_id = 27)
                          ->  Hash  (cost=1.19..1.19 rows=19 width=1034) (actual time=0.059..0.059 rows=44 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 4kB
                                ->  Seq Scan on target_table tar_table  (cost=0.00..1.19 rows=19 width=1034) (actual time=0.023..0.037 rows=44 loops=1)
              ->  Hash  (cost=1.03..1.03 rows=3 width=46) (actual time=0.029..0.029 rows=3 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 1kB
                    ->  Seq Scan on target_operation_type tar_op  (cost=0.00..1.03 rows=3 width=46) (actual time=0.024..0.025 rows=3 loops=1)
        ->  Hash  (cost=1.07..1.07 rows=7 width=220) (actual time=0.060..0.060 rows=59 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 4kB
              ->  Seq Scan on target_note tar_note  (cost=0.00..1.07 rows=7 width=220) (actual time=0.021..0.025 rows=59 loops=1)
  ->  Hash  (cost=1.29..1.29 rows=29 width=826) (actual time=0.035..0.035 rows=33 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 4kB
        ->  Seq Scan on input_resource ir  (cost=0.00..1.29 rows=29 width=826) (actual time=0.015..0.025 rows=33 loops=1)
Total runtime: 44529.075 ms
(Sorry moderators for any double posts, I keep making subscription errors.
Hopefully this one gets through)
Hi speed freaks,
Can anyone tell me why the bitmap heap scan takes so long to start for this
query? (SQL and EXPLAIN ANALYZE follows).
It is probably reading data from disk. explain (analyze, buffers) would be more helpful, especially if track_io_timing were also turned on. In the absence of that, my thoughts below are just best-guesses.
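(Illustrative only, not part of the original exchange: with superuser access on 9.2 or later, the extra detail mentioned above could be gathered like this.)
-- Collect buffer hit/read counts and, with track_io_timing, actual I/O time per plan node:
SET track_io_timing = on;   -- superuser-settable; available in 9.2+
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM mdx_audit.log_2013_01
WHERE session_id = 27;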
The big culprit in this appears to be:
->  Bitmap Index Scan on log_2013_01_session_idx  (cost=0.00..63186.52 rows=2947664 width=0) (actual time=32611.918..32611.918 rows=2772042 loops=1)
      Index Cond: (session_id = 27)
I can't see anything that occurs between actual time 0.0..32611.918 that
this could be waiting on. Is it building the bitmap?
Yes. More importantly, it is reading the index data needed in order to build the bitmap.
Running the query a second time yields this:
->  Bitmap Index Scan on log_2013_01_session_idx  (cost=0.00..63186.52 rows=2947664 width=0) (actual time=2896.601..2896.601 rows=2772042 loops=1)
      Index Cond: (session_id = 27)
Does the bitmap then get cached?
No, the bitmap itself doesn't get cached. But the data needed to construct the bitmap does get cached. It gets cached by the generic caching methods of PG and the OS, not through something specific to bitmaps.
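(An illustrative check, assuming the contrib module pg_buffercache is installed and the default 8 kB block size: how much of the index currently sits in shared_buffers. It says nothing about the OS page cache, which is the other caching layer mentioned above.)
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
-- Buffers of the index currently held in PostgreSQL's own cache:
SELECT count(*)                        AS buffers_cached,
       pg_size_pretty(count(*) * 8192) AS bytes_cached
FROM   pg_buffercache b
JOIN   pg_class c ON c.relfilenode = b.relfilenode
WHERE  c.relname = 'log_2013_01_session_idx';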
These queries are built dynamically and
called rarely, so their first-time performance is important.
What is going on during the interregnum? Whatever it is, it seems to be driving the log_2013_01_session_idx index out of the cache, but not the log_2013_01 table. (Or perhaps the table visit is getting the benefit of effective_io_concurrency?)
Rebuilding the index might help, as it would put all the leaf pages holding values for session_id=27 adjacent to each other, so they would read from disk faster. But with a name like "session_id", I don't know how long such clustering would last though.
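(A hedged sketch of one way to do such a rebuild without blocking normal access for its whole duration, assuming the index is simply on (session_id); a plain REINDEX would block writes and any reads that want to use the index.)
-- Build a fresh, tightly-packed copy of the index without blocking normal access:
CREATE INDEX CONCURRENTLY log_2013_01_session_idx_new
    ON mdx_audit.log_2013_01 (session_id);
-- Swap it in; only these statements need a brief exclusive lock:
BEGIN;
DROP INDEX mdx_audit.log_2013_01_session_idx;
ALTER INDEX mdx_audit.log_2013_01_session_idx_new RENAME TO log_2013_01_session_idx;
COMMIT;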
If I'm right about the index disk-read time, then switching to a plain index scan rather than a bitmap index scan would make no difference--either way the data has to come off the disk.
I'd prefer a
strategy that allowed fast performance the first time, rather than slow the
first time and extremely fast subsequently.
So would PG, but it can't find such a strategy. PG optimizes all top-level queries in isolation; it never assumes you will execute the same query repeatedly, and it never builds that assumption into the costing process. (This is not true of subqueries, where it does account for repeated executions in the cost.)
Jeff
Hi Jeff, thanks for the reply.
<<
What is going on during the interregnum? Whatever it is, it seems to be driving the log_2013_01_session_idx index out of the cache, but not the log_2013_01 table. (Or perhaps the table visit is getting the benefit of effective_io_concurrency?)
…
Rebuilding the index might help, as it would put all the leaf pages holding values for session_id=27 adjacent to each other, so they would read from disk faster. But with a name like "session_id", I don't know how long such clustering would last though.
>>
Technically, nothing should be happening. We used to keep one massive audit log, and it was impossible to manage due to its size. We then changed to a strategy where every month a new audit log would be spawned, and since log_2013_01 represents January, the log should be closed and nothing should have changed (it is technically possible that a long-running process would spill over into February, but not by this much). So, assuming that it’s stable, it should be a very good candidate for reindexing, no?
Our effective_io_concurrency is 1, and last I heard the PG host was a Linux 4-drive RAID 10, so I don’t know if there is any benefit to raising this number – and if there were any benefit, it would go to the bitmap heap scan, while the problem is building the bitmap before that.
>> the bitmap itself doesn't get cached. But the data needed to construct the bitmap does get cached. It gets cached by the generic caching methods of PG and the OS, not through something specific to bitmaps.
<<
This has always been a problem for me. I spend hours trying different strategies and think I’ve solved the problem, when in fact it seems like a cache has spun up, and then something else expires it and the problem is back. Is there a way around this problem, can I force the expiration of a cache?
Carlo
A cool idea, but if I understand it correctly, it's very specific and fussy. New DBs are spawned on this model, so all the developers would have to be aware of this non-standard behaviour, and DBAs would have to create these indexes every month, for every DB (as the log tables are created every month). There are 89 session_id values in the January log (log_2013_01), so this would quickly get out of control. But – like I said – an interesting idea for more specific challenges.
From: Marc Mamin [mailto:M.Mamin@intershop.de]
Sent: February 21, 2013 2:41 PM
To: Jeff Janes; Carlo Stonebanks
Cc: pgsql-performance@postgresql.org
Subject: AW: [PERFORM] Are bitmap index scans slow to start?
>Rebuilding the index might help, as it would put all the leaf pages holding values for session_id=27 adjacent to each other, so they would read from disk faster. But with a name like >"session_id", I don't know how long such clustering would last though.
>If I'm right about the index disk-read time, then switching to a plain index scan rather than a bitmap index scan would make no difference--either way the data has to come off the disk.
>>I'd prefer a
>>strategy that allowed fast performance the first time, rather than slow the
>>first time and extremely fast subsequently.
Hello,
if the index is only used to locate rows for a single session_id, you may consider splitting it into a set of partial indexes,
e.g.
create index i_0 on foo (session_id) where session_id % 4 = 0;
create index i_1 on foo (session_id) where session_id % 4 = 1;
create index i_2 on foo (session_id) where session_id % 4 = 2;
create index i_3 on foo (session_id) where session_id % 4 = 3;
(they can be built in parallel using separate connections)
Then you will have to ensure that all your WHERE clauses also contain the index condition:
WHERE session_id = 27 AND session_id % 4 = 27 % 4
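(A quick sanity check, using Marc's example names: the planner should only pick the matching partial index when the extra modulo condition is present.)
EXPLAIN
SELECT *
FROM foo
WHERE session_id = 27
  AND session_id % 4 = 27 % 4;  -- 27 % 4 = 3, so this should use i_3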
regards,
Marc Mamin
>> Also, you might try clustering newly created tables on session_id and setting the fillfactor down so rows with the same session id will stick together on disk.
<<
My understanding of PG’s cluster is that this is a one-time command that creates a re-ordered table and doesn’t maintain the clustered order until the command is issued again. During the CLUSTER, the table is read and write locked. So, in order for me to use this I would need to set up a timed event to CLUSTER occasionally.
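(For reference, a one-shot run against the closed-off monthly table, assuming the index from the plan above, would look like this; it rewrites the table in index order and holds an ACCESS EXCLUSIVE lock while it runs.)
CLUSTER mdx_audit.log_2013_01 USING log_2013_01_session_idx;
ANALYZE mdx_audit.log_2013_01;   -- refresh planner statistics after the rewrite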
>> I can't really help, but I can make it more clear why postgres is choosing a _bitmap_ index scan rather than a regular index scan
<<
The EXPLAIN ANALYZE is showing it is taking a long time to prepare the bitmap (i.e. -> Bitmap Index Scan on log_2013_01_session_idx (cost=0.00..63186.52 rows=2947664 width=0) (actual time=32611.918..32611.918 rows=2772042 loops=1), Index Cond: (session_id = 27)); the scan itself is actually very fast. Jeff says that the bitmap is not cached, so I will assume the PG general caches being created are of general use.
I think what I need to figure out is:
1) Why does it take 36 seconds to set up the general index caches?
2) What can I do about it (what stats do I need to look at)?
3) How can I force these caches to expire so I can tell if the strategy worked?
From: Nikolas Everett [mailto:nik9000@gmail.com]
Sent: February 22, 2013 2:05 PM
To: Carlo Stonebanks
Cc: Marc Mamin; Jeff Janes; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Are bitmap index scans slow to start?
I can't really help, but I can make it more clear why postgres is choosing a _bitmap_ index scan rather than a regular index scan. With a regular index scan it pumps the index for the locations of the rows that it points to and loads those rows as it finds them. This works great if the rows in the index are sorta sorted - that way it isn't jumping around the table randomly. Random IO is slow. In a bitmap index scan, pg pumps the index and buffers the row locations by shoving them into a big bitmap. Then it walks the bitmap in order, to produce in-order IO. PG makes the choice based on a measure of the index's correlation.
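(The correlation referred to here is visible in the pg_stats view; a quick way to look at it for the column in question:)
-- Values near +1/-1 mean heap order follows session_id closely (plain index scan is cheap);
-- values near 0 mean rows are scattered, which pushes the planner toward a bitmap scan.
SELECT schemaname, tablename, attname, correlation
FROM   pg_stats
WHERE  schemaname = 'mdx_audit'
  AND  tablename  = 'log_2013_01'
  AND  attname    = 'session_id';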
The problem comes down to you inserting the sessions concurrently with one another. My instinct would be to lower the FILLFACTOR on newly created indexes so they can keep their entries more in order. I'm not sure why I have that instinct but it feels right. Also, you might try clustering newly created tables on session_id and setting the fillfactor down so rows with the same session id will stick together on disk.
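(Illustrative only; the 70% figure is an arbitrary guess, and log_2013_02 stands in for whatever next month's table is actually called.)
-- Leave slack in index pages so later entries can be absorbed without page splits:
CREATE INDEX log_2013_02_session_idx
    ON mdx_audit.log_2013_02 (session_id)
    WITH (fillfactor = 70);
-- Leave slack in heap pages as well:
ALTER TABLE mdx_audit.log_2013_02 SET (fillfactor = 70);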
Now that I look stuff up on the internet I'm not sure where I saw that pg tries to maintain a cluster using empty space from FILLFACTOR but I _think_ it does. I'm not sure what is going on with my google foo today.
Nik
Hi Jeff, thanks for the insight.
<< And then the next question would be, once they are in the cache, why don't they stay there? For that you would have to know what other types of activities are going on that might be driving the data out of the cache.
>>
To give you an idea of the activity level, each physical machine hosts multiple DB’s with the same structure – one DB per client.
We run automated ETL processes which digest client feeds (E), normalize them (T) and then store them in our DB (L).
Looking at the stats from our audit log, the average feed load is 4 hours, divided up into 14 client sessions. Each session averages about 50 write (update, insert, no deletes) operations per second, representing about 700 write operations per second in total. The ratio of reads per write is pretty high as the system goes through the transformation process.
Since I don’t know how this compares to other PG installations, the question of using periodic REINDEX and CLUSTER brings up these questions:
1) Because we are hosting multiple DB’s, what is the impact on OS and disk caches?
2) Is there an automated CLUSTER and REINDEX strategy that will not interfere with normal operations?
3) By PG standards, is this a busy DB - and does that explain why the general caches expire?
Thanks,
Carlo
Hi Jeff, thanks for the insight.
<< And then the next question would be, once they are in the cache, why don't they stay there? For that you would have to know what other types of activities are going on that might be driving the data out of the cache.
>>
To give you an idea of the activity level, each physical machine hosts multiple DB’s with the same structure – one DB per client.
We run automated ETL processes which digest client feeds (E), normalize them (T) and then store them in our DB (L).
Looking at the stats from our audit log, the average feed load is 4 hours, divided up into 14 client sessions. Each session averages about 50 write (update, insert, no deletes) operations per second, representing 700 write operations per second.
Is each of these write operations just covering a single row? Does this description apply to just one of the many (how many?) databases, so that there are really 14*N concurrent sessions?
The ratio of reads per write is pretty high as the system goes through the transformation process.
Since I don’t know how this compares to other PG installations, the question of using periodic REINDEX and CLUSTER brings up these questions:
1) Because we are hosting multiple DB’s, what is the impact on OS and disk caches?
They have to share the RAM. One strategy would be to run the ETL processes only one at a time, rather than trying to run several concurrently, if that is what you are doing. That way you can concentrate one customer's data in RAM, and then another's, to reduce the competition.
2) Is there an automated CLUSTER and REINDEX strategy that will not interfere with normal operations?
3) By PG standards, is this a busy DB - and does that explain why the general caches expire?
You really need to know whether those reads and writes are concentrated in a small region (relative to the amount of your RAM), or widely scattered. If you are reading and writing intensively (which you do seem to be doing) but only within a compact region, then it should not drive other data out of the cache. But, since you do seem to have IO problems from cache misses, and you do have a high level of activity, the easy conclusion is that you have too little RAM to hold the working size of your data.
Cheers,
Jeff
<<Is each of these write operations just covering a single row? Does this description apply to just one of the many (how many?) databases, so that there are really 14*N concurrent sessions?
>>
All writes are single row. All DBs have exactly the same structure, only the content is different. Currently the server is hosting five active DBs – although there are 14 DBs actually on the host, the balance are backups and/or testing environments.
When a feed comes in, it can be anything from dozens to millions of rows, and may take minutes or days to run. I had asked that pgbouncer be installed in front of the host to act as a traffic cop. Try as I may to convince the engineering team that fewer sessions running faster is optimal, they say that the 14 concurrent sessions is based on real-world experience of what imports the fastest.
<< You really need to know whether those reads and writes are concentrated in a small region (relative to the amount of your RAM), or widely scattered. If you are reading and writing intensively (which you do seem to be doing) but only within a compact region, then it should not drive other data out of the cache. But, since you do seem to have IO problems from cache misses, and you do have a high level of activity, the easy conclusion is that you have too little RAM to hold the working size of your data.
>>
It won’t be a problem of physical RAM; I believe there is at least 32GB of RAM. What constitutes “a compact region”? The ETL process takes the feed and distributes it to 85 core tables. I have been through many PG configuration cycles with the generous help of people in this forum. I think the big problem when getting help has been getting those offering assistance to understand that, the whopping majority of the time, the system is performing single-row reads and writes. The assumption tends to be that the end point of an ETL should just be a series of COPY statements, and it should all happen very quickly in classic SQL bulk queries.
<<Is each of these write operations just covering a single row? Does this description apply to just one of the many (how many?) databases, so that there are really 14*N concurrent sessions?
>>
All writes are single row. All DBs have exactly the same structure, only the content is different. Currently the server is hosting five active DBs – although there are 14 DBs actually on the host, the balance are backups and/or testing environments.
I had thought you were saying that any one ETL procedure into one database used 14 concurrent threads. But really, each ETL procedure is single-threaded, and there can be up to 5 (or theoretically up to 14) of them running at a time into different databases?
When a feed comes in, it can be anything from dozens to millions of rows, and may take minutes or days to run. I had asked that PG bouncer be installed in front of the host to act as a traffic cop. Try as I may to convince the engineering team that fewer sessions running faster is optimal, they say that the 14 concurrent sessions is based on real-world experience of what imports the fastest.
pgbouncer is more for making connections line up single-file when the line is moving at a very fast clip, say 0.01 second per turn. If I were trying to make tasks that can each last for hours or days line up and take turns, I don't think pgbouncer would be the way to go.
<< You really need to know whether those reads and writes are concentrated in a small region (relative to the amount of your RAM), or widely scattered. If you are reading and writing intensively (which you do seem to be doing) but only within a compact region, then it should not drive other data out of the cache. But, since you do seem to have IO problems from cache misses, and you do have a high level of activity, the easy conclusion is that you have too little RAM to hold the working size of your data.
>>
It won’t be a problem of physical RAM, I believe there is at least 32GB of RAM. What constitutes “a compact region”?
If you have 14 actively going on simultaneously, I'd say a compact region would then be about 512 MB.
(32GB/ 14 / margin of safety of 4). Again, assuming that that is the problem.
The ETL process takes the feed and distributes it to 85 core tables. I have been through many PG configuration cycles with the generous help of people in this forum. I think the big problem when getting help has been this issue of those offering assistance understanding that the whopping majority of the time, the system is performing single row reads and writes. The assumption tends to be that the end point of an ETL should just be a series of COPY statements, and it should all happen very quickly in classic SQL bulk queries.
That is often a reasonable assumption, as ETL does end with L :)
Is the original query you posted part of the transform process, rather than being the production query you run after the ETL is over?
If so, maybe you need a EL(S)TL process, were you first load the data to staging table in bulk, and then transform it in bulk rather than one row at a time.
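(A minimal sketch of that shape, with made-up staging and target names; the point is that the transform happens as one set-based statement rather than row by row.)
-- Bulk-load the raw feed into an untyped staging table:
CREATE TABLE staging_feed (col_a text, col_b text);
COPY staging_feed FROM '/path/to/feed.csv' WITH (FORMAT csv);
-- Transform and load in one pass:
INSERT INTO core_target (col_a, col_b)
SELECT trim(col_a), col_b::bigint
FROM   staging_feed
WHERE  col_b ~ '^[0-9]+$';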
Cheers,
Jeff
<< pgbouncer is more for making connections line up single-file when the line is moving at a very fast clip, say 0.01 second per turn. If I were trying to make tasks that can each last for hours or days line up and take turns, I don't think pgbouncer would be the way to go.
>>
The recommendation at the time was assuming that write contention was slowing things down and consuming resources, since I can’t stop people from creating big multi-threaded imports. Each import consists of about 50 writes per second.
>> Is the original query you posted part of the transform process, rather than being the production query you run after the ETL is over?
Neither, it is part of our auditing and maintenance processes. It is not called with any great frequency. The audit report generates rows defining how a particular item (an “item” being a particular table/row) was created: it returns the names of the import tables, the row ids, the write operations and any transformation messages that may have been generated – all in the order they occurred.
You can imagine how useful this is in creating a document describing what happened and why.
The same data generated by the report is used to “resurrect” an item. If, for example, our business logic has changed, but the change only affects a small sub-set of our core data, then we perform a “rollback” (a logical cascading delete) on the affected items. Then we create a “rebuild”, which is a script that is generated to re-import ONLY the import table rows defined in the audit report.
So, this query is not called often, but the fact is that if it takes over 30 seconds to load an item (because the audit report takes so long to prepare the bitmap index scan when passed new query parameters) then it severely restricts how much data we can resurrect at any one time.
<<
I had thought you were saying that any one ETL procedure into one database used 14 concurrent threads. But really, each ETL procedure is single-threaded, and there can be up to 5 (or theoretically up to 14) of them running at a time into different databases?
>>
Sorry, just caught this.
Your first interpretation was correct. Each DB runs an ETL that can have up to 14 concurrent threads. I don’t think the number should be that high, but the engineering team insists the load time is better this way than with fewer threads running faster.
Could you use CLUSTER on the table after it had been closed off? If appropriate, that should make the queries run much faster, as related entries will be in the same or nearby blocks on disk.
<<Could you use CLUSTER on the table after it had been closed off? If appropriate, that should make the queries run much faster, as related entries will be in the same or nearby blocks on disk.
>>
Technically, yes. That would really help, but the issue is scheduling. Although the logs are closed off for writes, they aren’t closed off for reads, ref PG documentation: “When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.”
Not ideal, but a lot better than doing nothing at all!
<<Could you use CLUSTER on the table after it had been closed off? If appropriate, that should make the queries run much faster, as related entries will be in the same or nearby blocks on disk.
>>
Technically, yes. That would really help, but the issue is scheduling. Although the logs are closed off for writes, they aren’t closed off for reads, ref PG documentation: “When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished.”
Not ideal, but a lot better than doing nothing at all!
The swap process will take an exclusive lock, but it will only last for a fraction of a second rather than the duration of the clustering operation.
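(The archive doesn't spell out which swap mechanism is meant; one hedged sketch along those lines, workable only because the monthly table is closed to writes, is to build an ordered copy and rename it into place.)
-- Build a clustered copy without locking the live table for the duration:
CREATE TABLE mdx_audit.log_2013_01_new AS
SELECT * FROM mdx_audit.log_2013_01 ORDER BY session_id;
CREATE INDEX log_2013_01_new_session_idx
    ON mdx_audit.log_2013_01_new (session_id);
-- Only the rename needs the exclusive lock, and only for an instant:
BEGIN;
ALTER TABLE mdx_audit.log_2013_01 RENAME TO log_2013_01_old;
ALTER TABLE mdx_audit.log_2013_01_new RENAME TO log_2013_01;
COMMIT;
-- (the copy does not carry constraints or defaults; recreate as needed, then
--  drop log_2013_01_old and tidy up index names once the swap is verified)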
Cheers,
Jeff
>> Is the original query you posted part of the transform process, rather than being the production query you run after the ETL is over?
Neither, it is part of our auditing and maintenance processes. It is not called with any great frequency. The audit report generates rows defining how a particular item (an “item” being a particular table/row) was created: it returns the names of the import tables, the row ids, the write operations and any transformation messages that may have been generated – all in the order they occurred.
...
So, this query is not called often, but the fact is that if it takes over 30 seconds to load an item (because the audit report takes so long to prepare the bitmap index scan when passed new query parameters) then it severely restricts how much data we can resurrect at any one time.
Is that a restriction you have observed, or are you extrapolating based on a single query? If you run a bunch of similar queries in close succession, it is likely that the first few queries will warm up the cache, and following queries will then run much faster. Also, if you restructure the series of queries into a large one that reconstructs many rows simultaneously, it might choose a more efficient path than if it is fed the queries one at a time.
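(A hedged sketch of the batching idea, reusing ids from the original post: instead of one audit query per item, a whole batch of keys is joined in at once, so the expensive scan of the log table happens only once.)
SELECT l.*
FROM mdx_audit.log_2013_01 AS l
JOIN (VALUES (494568472), (494568473), (494568474)) AS items(pkey_id)
  ON l.input_resource_pkey_id = items.pkey_id
WHERE l.session_id = 27;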
Cheers,
Jeff
Sorry this took so long to get back to you. Here is where we were:
I said: <<
So, this query is not called often, but the fact is that if it takes over 30 seconds to load an item (because the audit report takes so long to prepare the bitmap index scan when passed new query parameters) then it severely restricts how much data we can resurrect at any one time.
>>
Your reply: <<
Is that a restriction you have observed, or are you extrapolating based on a single query? If you run a bunch of similar queries in close succession, it is likely that the first few queries will warm up the cache, and following queries will then run much faster. Also, if you restructure the series of queries into a large one that reconstructs many rows simultaneously, it might choose a more efficient path than if it is fed the queries one at a time.
>>
Actual observation. The first run with a new parameter actually takes 90 seconds. Another run with the same parameter takes 15-30 seconds. Running the query immediately afterwards with different parameters starts a new 90-second query. Unfortunately, since going to Linux, our sys ops hiss and snarl at anyone who comes anywhere near machine or DB server configs, so I am no longer well informed on how well optimized the machines are.
Ultimately, the machines need to be optimized by an expert. As I mentioned before, our ETL is entirely single-row reads and writes (I didn’t go into the “why” of this because the nature of the data and the product dictates this). And this is an example of one of the few complex joins that return hundreds/thousands of rows. The problem is that a full index scan has to be done before we can start building the results. So, if clustering helps such that the index scan KNOWS there is no point in scanning the rest of the index because we’ve gone beyond the maximum value in the list of possible values, then that would help, as each table being scanned has 50-100 million rows (there is one table for every month of production).
As always, thanks.