Thread: Plan time Improvement - 64bit bitmapset
Hi,

While analyzing some complex queries and switching away from materialized views to their underlying views, I got interested in the long plan times (minutes and up) and did some profiling work. The queries are high-dimensional, star-schema-like queries (unfortunately on quite private (health) data and a schema I may not make public).

Using oprofile and "valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes --simulate-cache=yes --simulate-hwpref=yes" I found that one of the bitmapset functions is near the top of the profile.

When switching bitmapword and its companions in bitmapset.h to u64 and s64 respectively, I get an improvement of up to 15% in queries with 16+ joins. The more joins, the bigger the win. In a structurally very simple query with 16 joins the improvement is around 1-2%. With the most complex query I tested (the number of participating relations is hard to count because of the many views) I get an improvement of up to 15%. I did not test with bigger/more complex queries because it got too slow to gather sufficiently thorough results.

When playing around with join_collapse_limit, from_collapse_limit, geqo and geqo_threshold, I found that unless the settings are set to really low values I see performance improvements for most combinations. I could not find any regression in the queries we use - and I can't see where there would be a significant overhead.

Unfortunately the more interesting trace is the valgrind one - which, with these options, currently only "kcachegrind" can read. I could not get a usable text export out of the latter. Linked are two overview pictures, before (32bit.png) and after (64bit.png) the switch to 64bit bitmapsets, from the backend evaluating a complex query once:

http://anarazel.de/pg/32bit_bitmapsets.png
http://anarazel.de/pg/64bit_bitmapsets.png

That seems like an easy change - is there a reason not to do this if the arch is a 64bit one? Can anybody else with complex queries test my results? (I can provide a patch if wanted.)

Andres

PS: If kcachegrind users want to see the trace, speak up...
Andres Freund <andres@anarazel.de> writes:
> When switching bitmapword and companions in bitmapset.h to u64 and s64
> respectively I get an improvement up to 15% in queries with 16+ joins.

I find this *really* hard to believe, because I've never seen the bitmap support operations show up noticeably at all in profiles. What sort of queries are you testing?

regards, tom lane
Hi,

On 06/03/2009 06:21 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> When switching bitmapword and companions in bitmapset.h to u64 and s64
>> respectively I get an improvement up to 15% in queries with 16+ joins.
> I find this *really* hard to believe, because I've never seen the bitmap
> support operations show up noticeably at all in profiles. What sort of
> queries are you testing?

Many left joins from one base relation to additional dimensions. All the dimensions are relatively complex views consisting out of multiple joins or subselects. Some correlated subqueries and some [NOT] EXISTS() are also included in some of the queries.

I tested by compiling with 64bit bitmaps and without, by repeatedly changing just those three definitions. I don't see how I could get false results with that?

I guess the biggest advantage comes from less cache thrashing?

Andres
Andres Freund <andres@anarazel.de> writes:
> On 06/03/2009 06:21 PM, Tom Lane wrote:
>> I find this *really* hard to believe, because I've never seen the bitmap
>> support operations show up noticeably at all in profiles. What sort of
>> queries are you testing?

> Many left joins from one base relation to additional dimensions. All the
> dimensions are relatively complex views consisting out of multiple joins
> or subselects.
> Some correlated subqueries and some [NOT] EXISTS() are also included in
> some of the queries.

Hmmm, could you provide a complete test case? I'm thinking the behavior might indicate some other performance issue, ie an unreasonable number of bitmapset calls in some particular planning path.

There used to be some performance issues in this area back when we represented sets of relids as integer Lists :-(, but the change to bitmap sets pretty much stomped that. I'm just really surprised that there would be anything measurable from changing the word width.

regards, tom lane
On 06/03/2009 06:42 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 06/03/2009 06:21 PM, Tom Lane wrote:
>>> I find this *really* hard to believe, because I've never seen the bitmap
>>> support operations show up noticeably at all in profiles. What sort of
>>> queries are you testing?
>
>> Many left joins from one base relation to additional dimensions. All the
>> dimensions are relatively complex views consisting out of multiple joins
>> or subselects.
>> Some correlated subqueries and some [NOT] EXISTS() are also included in
>> some of the queries.
> Hmmm, could you provide a complete test case? I'm thinking the behavior
> might indicate some other performance issue, ie an unreasonable number
> of bitmapset calls in some particular planning path.

Uh. I will try, but it probably won't be easy. (Once more, a datawarehouse-ish example database would be useful.)

The graph linked in my earlier email includes all callers with a relevant number of calls (generate_join_implied_equalities, join_is_legal and have_join_order_restriction are, in that order, by far the most costly). I put up the raw profile data at:

http://anarazel.de/pg/32bit_bitmaps.out.gz
http://anarazel.de/pg/64bit_bitmaps.out.gz

As I said, unfortunately only kcachegrind seems to be able to load the data - it is included in most Linux distros though.

> There used to be some performance issues in this area back when we
> represented sets of relids as integer Lists :-(, but the change to
> bitmap sets pretty much stomped that. I'm just really surprised that
> there would be anything measurable from changing the word width.

Well, the number of memory accesses is halved, and I think that bitwise NOT and AND take the same number of cycles whether they operate on 32 or 64 bits. That would explain some of the difference.

Andres
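A minimal sketch of the kind of word-at-a-time loop under discussion - an illustrative stand-in, not the actual nodes/bitmapset.c code (the struct layout and the function name here are simplified):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t bitmapword;        /* uint32_t in the unpatched build */

typedef struct Bitmapset
{
    int         nwords;             /* length of the words[] array */
    bitmapword  words[];            /* one bit per member (e.g. per relid) */
} Bitmapset;

/* Do two sets share a member?  One AND per word of the shorter set. */
static bool
bms_overlap_sketch(const Bitmapset *a, const Bitmapset *b)
{
    int shortlen = (a->nwords < b->nwords) ? a->nwords : b->nwords;
    int i;

    for (i = 0; i < shortlen; i++)
    {
        if (a->words[i] & b->words[i])
            return true;
    }
    /* with 64-bit words: half the iterations and half the loads per call */
    return false;
}

The word width only changes how many of these iterations (and cache lines) a set of a given size costs; the per-word AND itself costs the same either way, which is the point Andres makes above.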
Andres Freund <andres@anarazel.de> wrote:

> long plan times (minutes and up)

Wow. I thought I had some pretty complex queries, including some which join using several views, each of which has several joins; but I've never gone to multiple seconds on plan time (much less multiple minutes!) without very high statistics targets and many indexes on the tables. Any rough estimates on those?

If you think your patch could have a significant impact on a query with a 260 ms plan time, I could give it a try.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Andres Freund <andres@anarazel.de> wrote: > >> long plan times (minutes and up) > > Wow. I thought I had some pretty complex queries, including some > which join using several views, each of which has several joins; but > I've never gone to multiple seconds on plan time (much less multiple > minutes!) without very high statistics targets and many indexes on the > tables. Any rough estimates on those? My money's still on very large statistics targets. If you have a lot of columns and 1,000-element arrays for each column that can get big pretty quickly. But that doesn't explain the bitmap ops being important. Hm. Actually having a lot of columns and then joining a lot of tables could mean having fairly large bitmapsets. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark <stark@enterprisedb.com> writes:
> But that doesn't explain the bitmap ops being important. Hm. Actually
> having a lot of columns and then joining a lot of tables could mean
> having fairly large bitmapsets.

Yeah, but then you have a lot of *other* expensive operations too, such as the aforementioned statistics-pushing. It's still pretty mystifying why bitmapsets would be eating a noticeable fraction of the total.

regards, tom lane
Gregory Stark <stark@enterprisedb.com> wrote:

> My money's still on very large statistics targets. If you have a lot
> of columns and 1,000-element arrays for each column that can get big
> pretty quickly.

I'm finding that even the ones that had a plan time in the range of 260 ms go down to 15 ms to 85 ms once the statistics are cached. I wonder if the long run time is because it's having to read statistics multiple times because they don't fit in cache? Like with really wide values? Would the wider bitmap type help with that situation in any way?

-Kevin
On Wed, Jun 3, 2009 at 3:18 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Gregory Stark <stark@enterprisedb.com> wrote:
>> My money's still on very large statistics targets. If you have a lot
>> of columns and 1,000-element arrays for each column that can get big
>> pretty quickly.
>
> I'm finding that even the ones that had a plan time in the range of
> 260 ms go down to 15 ms to 85 ms once the statistics are cached. I
> wonder if the long run time is because it's having to read statistics
> multiple times because they don't fit in cache? Like with really wide
> values? Would the wider bitmap type help with that situation in any
> way?
>
> -Kevin

I had some performance results back when we were last looking at default_statistics_target that indicated that the time to repeatedly decompress a toasted statistics array contributed significantly to the total planning time, but my suggestion to disable compression for pg_statistic was summarily pooh-poohed for reasons that still aren't quite clear to me.

When you say "don't fit in cache", exactly what cache are you talking about? It seems to me that the statistics should be far smaller than the underlying tables, so if even your statistics don't fit in shared buffers (let alone main memory), it doesn't really matter how long your query takes to plan, because it will probably take literally forever to execute. How many tables would you have to be joining to get a GB of statistics, even with dst = 1000? A few hundred?

...Robert
Robert Haas <robertmhaas@gmail.com> wrote:

> When you say "don't fit in cache", exactly what
> cache are you talking about? It seems to me that the statistics
> should be far smaller than the underlying tables, so if even your
> statistics don't fit in shared buffers (let alone main memory), it
> doesn't really matter how long your query takes to plan because it
> will probably take literally forever to execute. How many tables
> would you have to be joining to get a GB of statistics, even with
> dst = 1000? A few hundred?

Since he can't share the schema, and hasn't even given much of a hint, I don't know whether one (or more) of the columns is a bytea filled with 100 MB values; and I don't remember any description of the hardware environment either. Since the behavior seems so out-of-the-ordinary, I was casting about for possible extraordinary characteristics of his environment which might cause it. I'm probably way off base....

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Since he can't share the schema, and hasn't even given much of a hint, > I don't know whether one (or more) of the columns is a bytea filled > with 100 MB values; and I don't remember any description of the > hardware environment either. Since the behavior seems so > out-of-the-ordinary, I was casting about for possible extraordinary > characteristics of his environment which might cause it. I'm probably > way off base.... There's a hard-wired restriction in analyze.c that makes it discard data values wider than 1KB on-sight. So no such value will ever be found in a statistics array. You could still have a few meg in a pg_statistics row, I suppose, but not a really horrendous amount. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jun 3, 2009 at 3:18 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> I'm finding that even the ones that had a plan time in the range of
>> 260 ms go down to 15 ms to 85 ms once the statistics are cached.

> I had some performance results back when we were last looking at
> default_statistics_target that indicated that the time to repeatedly
> decompress a toasted statistics array contributed significantly to the
> total planning time, but my suggestion to disable compression for
> pg_statistic was summarily pooh-poohed for reasons that still aren't
> quite clear to me.

Well, smaller is better. Kevin's example demonstrates that it's good to have the stats sucked into cache. If they were uncompressed then fewer of them would fit in however much cache space you have, and whatever CPU savings you get would be lost to more I/O to read in stats entries.

Of course, this all depends on total database size vs total RAM, but that's how I'd interpret the observation. PG is still mostly optimized for databases bigger than RAM, so this decision still makes sense.

(I think you could try marking the columns of pg_statistic as "don't compress" if you have a DB you want to optimize for all-in-memory behavior.)

regards, tom lane
On 06/03/2009 07:05 PM, Kevin Grittner wrote:
> Andres Freund <andres@anarazel.de> wrote:
>> long plan times (minutes and up)
>
> Wow. I thought I had some pretty complex queries, including some
> which join using several views, each of which has several joins; but
> I've never gone to multiple seconds on plan time (much less multiple
> minutes!) without very high statistics targets and many indexes on the
> tables. Any rough estimates on those?

The statistics target is 250. Lowering it to 10 lowers the query plan time somewhat, but not significantly, and increases query runtime significantly.

The real dataset is a bit less than 1.5TB without materialized views and a bit over 3TB with them. The production machine (old) is a 2x dual-core Xeon 5150 with 32GB RAM. The test dataset is about 15GB, on a Core2 Duo 2.4GHz with 4GB RAM.

Example query (from which the traces are) on the test dataset (I can't simply do a full analyze on the real data):

Stat target 10:   22283.187ms PREPARE
Stat target 1000: 23986.504ms PREPARE

So, no really interesting difference. For the timings I always PREPARE'd the query multiple times in a transaction to make sure there are no caching effects - a small drop but nothing significant on the average.

> If you think your patch could have a significant impact on a query
> with a 260 ms plan time, I could give it a try.

From what I have seen so far I doubt that it will have a really measurable effect on relatively short planning times. If you want to try it, it's a very simple change - just change all the 32s into their 64-bit equivalents in include/nodes/bitmapset.h:

#define BITS_PER_BITMAPWORD 32

typedef uint32 bitmapword;		/* must be an unsigned type */
typedef int32 signedbitmapword; /* must be the matching signed type */

Andres
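The 64-bit variant being benchmarked would then be roughly the same three definitions with the widths doubled (uint64/int64 being the usual PostgreSQL integer typedefs); as noted upthread, nothing beyond these three lines was changed between the two builds:

#define BITS_PER_BITMAPWORD 64

typedef uint64 bitmapword;		/* must be an unsigned type */
typedef int64 signedbitmapword; /* must be the matching signed type */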
Doesn't that still add up to 3GB for a table's stats in the worst case?

1KB * 1,000 buckets * 1,500 attributes * 2 (histogram + MFVs)

Except you can't actually get 1,500 toast pointers on a page. I suppose with games with nulls you could make this worst case happen though.

It does seem like it ought to be possible to truncate strings in the histogram, since any string between the actual values is equally good.

--
Greg

On 3 Jun 2009, at 22:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Since he can't share the schema, and hasn't even given much of a hint,
>> I don't know whether one (or more) of the columns is a bytea filled
>> with 100 MB values; and I don't remember any description of the
>> hardware environment either. Since the behavior seems so
>> out-of-the-ordinary, I was casting about for possible extraordinary
>> characteristics of his environment which might cause it. I'm probably
>> way off base....
>
> There's a hard-wired restriction in analyze.c that makes it discard data
> values wider than 1KB on sight. So no such value will ever be found in
> a statistics array. You could still have a few meg in a pg_statistic
> row, I suppose, but not a really horrendous amount.
>
> regards, tom lane
Greg Stark <greg.stark@enterprisedb.com> writes:
> It does seem like it ought to be possible to truncate strings in the
> histogram since any string between the actual values is equally good.

Yeah, that was the justification for dropping the wide values --- that, and the theory that they'd be unlikely to be most-common values.

regards, tom lane
On 06/03/2009 08:57 PM, Gregory Stark wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Andres Freund <andres@anarazel.de> wrote:
>>> long plan times (minutes and up)
>> Wow. I thought I had some pretty complex queries, including some
>> which join using several views, each of which has several joins;
>> but I've never gone to multiple seconds on plan time (much less
>> multiple minutes!) without very high statistics targets and many
>> indexes on the tables. Any rough estimates on those?
> My money's still on very large statistics targets. If you have a lot
> of columns and 1,000-element arrays for each column that can get big
> pretty quickly.

Only a relatively small difference (stat target 10 vs. 1000): 22283.187ms vs. 23986.504ms.

> But that doesn't explain the bitmap ops being important. Hm. Actually
> having a lot of columns and then joining a lot of tables could mean
> having fairly large bitmapsets.

Some of the views have a large number of columns (200-400) - none of the actual tables has many columns though (10 user columns is the biggest, I think). There are 110 tables containing real data. The queries include the same tables quite often.

Andres
On 06/03/2009 10:42 PM, Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> When you say "don't fit in cache", exactly what
>> cache are you talking about? It seems to me that the statistics
>> should be far smaller than the underlying tables, so if even your
>> statistics don't fit in shared buffers (let alone main memory), it
>> doesn't really matter how long your query takes to plan because it
>> will probably take literally forever to execute. How many tables
>> would you have to be joining to get a GB of statistics, even with
>> dst = 1000? A few hundred?

The whole pgstat.stat is around 500k on the test database - that seems relatively reasonable.

> Since he can't share the schema, and hasn't even given much of a hint,

The schema isn't the clearest one - the original developers are long gone and I only somewhat recently jumped on the wagon. If what I have gathered is correct, the biggest reason for implementing materialized views was plan time, not execution time.

The schema is a rather normalized, snowflake-like DW schema - with the abnormality that most of the time a single dimension is actually multidimensional, i.e. multiple different joins to it are needed. The relatively high degree of normalization introduces a rather large number of joins for each additional dimension... I find it hard to give a short overview of a relatively complex schema without showing it - but that is not up to me.

> I don't know whether one (or more) of the columns is a bytea filled
> with 100 MB values; and I don't remember any description of the
> hardware environment either. Since the behavior seems so
> out-of-the-ordinary, I was casting about for possible extraordinary
> characteristics of his environment which might cause it. I'm probably
> way off base....

I would love to find such an issue, but I fear there is none. The problem exists on different machines, different pg versions, different settings...

Please keep in mind that when using the system "normally" the materialized views are used and the query plan times stay around 1-2s, which is quite okay for reporting queries, I think. Only that the materialized views start to take too much space...

Andres
On Wed, Jun 3, 2009 at 5:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jun 3, 2009 at 3:18 PM, Kevin Grittner
>> <Kevin.Grittner@wicourts.gov> wrote:
>>> I'm finding that even the ones that had a plan time in the range of
>>> 260 ms go down to 15 ms to 85 ms once the statistics are cached.
>
>> I had some performance results back when we were last looking at
>> default_statistics_target that indicated that the time to repeatedly
>> decompress a toasted statistics array contributed significantly to the
>> total planning time, but my suggestion to disable compression for
>> pg_statistic was summarily pooh-poohed for reasons that still aren't
>> quite clear to me.
>
> Well, smaller is better. Kevin's example demonstrates that it's good
> to have the stats sucked into cache. If they were uncompressed then
> fewer of them would fit in however much cache space you have, and
> whatever CPU savings you get would be lost to more I/O to read in
> stats entries.

Smaller-but-compressed isn't better if the limiting factor is CPU rather than I/O, which I suspect is nearly always the case for pg_statistic. I would be very surprised to find out that anyone has a database where query planning regularly results in disk access. Sure, there are cold-cache effects at system startup, and maybe for rarely used tables whose pg_statistic entries have been paged out to make room for more heavily used data, but it doesn't make any sense to me to optimize for those cases at the expense of the common case where you are constantly executing new and different queries. In that case, fast access to the statistics tuples means "I don't need to decompress this", not "I don't need to read this in from disk".

It wouldn't be so bad if memory-resident tuples, once decompressed, stayed decompressed at least for the duration of the current query planning cycle. But it seemed to me when I looked at this a while back that every call to eqjoinsel() and the various other functions that look at statistics to do their thing appears to decompress the tuple all over again. That leads to a huge slowdown in query planning beginning, for the example Greg Stark constructed, right around a statistics target of 90.

http://archives.postgresql.org/pgsql-hackers/2008-12/msg00386.php

The exact point where you hit the slowdown will presumably vary with the size of the relevant columns, but it's surely going to affect a lot more people now that we've made the default target 10x larger.

> Of course, this all depends on total database size vs total RAM,
> but that's how I'd interpret the observation. PG is still mostly
> optimized for databases bigger than RAM, so this decision still
> makes sense.
>
> (I think you could try marking the columns of pg_statistic as "don't
> compress" if you have a DB you want to optimize for all-in-memory
> behavior.)

If you set allow_system_table_mods...

...Robert
Hi,

On 06/03/2009 06:42 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 06/03/2009 06:21 PM, Tom Lane wrote:
>>> I find this *really* hard to believe, because I've never seen the bitmap
>>> support operations show up noticeably at all in profiles. What sort of
>>> queries are you testing?
>> Many left joins from one base relation to additional dimensions. All the
>> dimensions are relatively complex views consisting out of multiple joins
>> or subselects.
>> Some correlated subqueries and some [NOT] EXISTS() are also included in
>> some of the queries.
> Hmmm, could you provide a complete test case? I'm thinking the behavior
> might indicate some other performance issue, ie an unreasonable number
> of bitmapset calls in some particular planning path.

Ok. I tried to reproduce it using only pg_catalog and succeeded to some degree:
- The query is pointless, pointless, pointless
- The effect is only around 5-10% instead of the 15-20% I have measured (fewer tables involved - fewer cache misses?)
- The query is crazy, but so is the one on the schema in question
- I could get more consistent results with geqo disabled, but the results are in the same ballpark whether it is enabled or not
- Sometimes adding a single join more/less dropped the planning time to a fraction - strange.
- The same with changing {join,from}_collapse_limit - sometimes changing it yields plan times different by orders of magnitude in both directions.

On the real data it is naturally not only one view but multiple ones... And there are fewer views involved, but they are more complex (including EXISTS() and some subqueries).

Plan time (averaged) without change:
cnt: 40 (4 times per session)
avg: 4572ms

Plan time (averaged) with change:
cnt: 40 (4 times per session)
avg: 4236ms

~7% difference. Same with higher numbers of repetitions and with most other planner settings I tried.

Now that's not a lot of change, but again, this is smaller than with the original queries.

Does that help?

Andres
Attachment
Andres Freund <andres@anarazel.de> writes:
> Plan time (averaged) without change:
> cnt: 40 (4 times per session)
> avg: 4572ms
>
> Plan time (averaged) with change:
> cnt: 40 (4 times per session)
> avg: 4236ms
>
> ~7% difference. Same with higher number of repetitions and with most other
> planner settings I tried

What are you comparing here?
Hi,

On 06/10/2009 01:38 PM, Gregory Stark wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Plan time (averaged) without change:
>> cnt: 40 (4 times per session)
>> avg: 4572ms
>>
>> Plan time (averaged) with change:
>> cnt: 40 (4 times per session)
>> avg: 4236ms
>>
>> ~7% difference. Same with higher number of repetitions and with most other
>> planner settings I tried
>
> What are you comparing here?

32bit and 64bit bitmapsets, with the attached query; 32bit being the default and the slower one.

Does that answer the question?

Andres
Andres Freund <andres@anarazel.de> wrote:

> - Sometimes adding a single join more/less dropped the planning time
>   to a fraction - strange.
> - The same with changing {join,from}_collapse_limit - sometimes
>   changing it yields plan times different by orders of magnitude in
>   both directions.

That seems like the place to spend the time looking, rather than nibbling away at the run times by a few percent after letting them jump by orders of magnitude.

-Kevin
Hi,

On 06/10/2009 06:01 PM, Kevin Grittner wrote:
> Andres Freund <andres@anarazel.de> wrote:
>> - Sometimes adding a single join more/less dropped the planning time
>>   to a fraction - strange.
>> - The same with changing {join,from}_collapse_limit - sometimes
>>   changing it yields plan times different by orders of magnitude in
>>   both directions.
> That seems like the place to spend the time looking, rather than
> nibbling away at the run times by a few percent after letting them
> jump by orders of magnitude.

Yes. I noticed this while looking for reasons. And I will continue to do so...

Andres