Thread: GiST -- making my index faster makes it slower
I just tried an experiment that I thought would improve the performance of the PostGIS spatial index. The old way used a BOX (bounding box of 4 doubles) as the key in the index. The new way uses a bounding box of 4 single-precision floats (it's only 16 bytes long - a BOX is 32). I thought that it would significantly reduce the size of the index (it did), and that the indexing would be faster since there are fewer disk pages to fiddle through and those pages would be more likely to fit in the disk cache.

It turns out this is not the case - it's significantly slower. I think it has to do with more keys fitting in the leaf tuples. As far as I can tell, the GiST index looks through the internal nodes, then when it hits a leaf node, it searches through each of the keys in the leaf. I save time searching through the internal nodes (because there are fewer of them) - this is an O(log n) saving. I lose time searching through the leaves (because there are more keys per leaf) - this is an O(n) cost.

For a query that does a nested loop of 10,000 index scans (on the same table), the old method takes about 2 seconds; the new "faster" method takes about 20 seconds.

I'm still trying to verify that this is the actual reason why it's slower - anyone have any other ideas? Anyone know a good way to profile this?

dave
> Couple of observations:
>
> 1. Are you sure your data is handled as 32 bit all the way through? Run
> time casting will offset performance gains on 32 bit floats. Is your
> comparison routine casting to double?

I thought this might be the case - but I thought the effect would be small. The only place it might be doing hidden casts would be in statements like "query->xmin <= key->xmax".

> 2. Math CPUs usually crunch at 80 bits, can't save much by using 32 bit
> floats, although cache coherency will be better.
>
> 3. 64 bit cpu will probably run better on 64 bit floats.
>
> 4. Is your dataset congested enough that you now have duplicate values
> by loss of precision? This would of course impact performance. How
> big is your dataset? How big is your avg. result set?

My test datasets are quite spatially separate, so I don't expect there to be any "accidental" overlaps. There are about 12 million rows (in total) in the datasets - I've only noticed about 2 of these overlaps. My test datasets are 1000 rows, 10,000 rows, 10,000,000 rows, and a few different ones in the 200,000 row range. I'm testing queries that return anywhere from 1 geometry to about 10,000.

The actual index search is only a few milliseconds longer using the float32 bounding box for a mid-sized table returning a handful of rows. When the result sets get bigger (or you start doing nested queries), the performance differences become greater.

dave
David Blasby <dblasby@refractions.net> writes:
> The old way used a BOX (bounding box of 4 doubles) as the key in the index.
> The new way uses a bounding box of 4 single-precision floats (it's only
> 16 bytes long - a BOX is 32).

> It turns out this is not the case - it's significantly slower. I think
> it has to do with more keys fitting in the leaf tuples.

Hm. With twice as many keys per page, you'd think that it could be no more than 2x slower, if the problem is one of O(n) search through the keys on the page. That doesn't explain a 10x slowdown. Could it be that some part of the GiST code is O(n^2), or worse, in the number of keys per page? If so, perhaps it's fixable.

I'd suggest profiling the backend with both key types to get an idea of where the time is going.

			regards, tom lane

PS: actually, allowing for the 12-byte index tuple overhead, you couldn't have even twice as many keys per page. So there's something mighty odd here. Keep us posted.
Tom Lane wrote:
> I'd suggest profiling the backend with both key types to get an idea of
> where the time is going.

I've been trying to use gprof to do some profiling, but I'm having troubles. What's the best way to profile?

> PS: actually, allowing for the 12-byte index tuple overhead, you
> couldn't have even twice as many keys per page. So there's something
> mighty odd here. Keep us posted.

Using the old system, I'd get about 7 internal node hits and about 110 "leaf" hits. Under the new one, I get about 4 internal node hits and about 160 "leaf" hits.

I'm just in the process of changing the key to:

typedef struct
{
	float xmin;
	float ymin;
	float xmax;
	float ymax;
	char  junk[16];	/* make the 16 byte type into 32! */
} BOX2DFLOAT4;

to see what happens.

dave
David Blasby <dblasby@refractions.net> writes:
> Tom Lane wrote:
>> I'd suggest profiling the backend with both key types to get an idea of
>> where the time is going.

> I've been trying to use gprof to do some profiling, but I'm having
> troubles. What's the best way to profile?

It's not hard; the only real gotcha is that on Linux systems you need to compile with -DLINUX_PROFILE so that the postmaster works around some Linux brain damage with dropping the profile status at fork(). I usually do (in an already configured and built source tree):

	cd src/backend
	make clean
	make PROFILE="-pg -DLINUX_PROFILE" all
	make install-bin

Don't forget that the backends will drop their gmon.out files in $PGDATA/data/DBOID/gmon.out.

			regards, tom lane
Humm -- strange results here:

typedef struct
{
	float xmin;
	float ymin;
	float xmax;
	float ymax;
} BOX2DFLOAT4;

This takes about 18,000 ms to do a nested query with 10,000 iterations.

typedef struct
{
	float xmin;
	float ymin;
	float xmax;
	float ymax;
	char  junk[16];
} BOX2DFLOAT4;

This takes about 15,000 ms to do a nested query with 10,000 iterations.

typedef struct
{
	double xmin;
	double ymin;
	double xmax;
	double ymax;
} BOX2DFLOAT4;

This takes about 1500 ms to do a nested query with 10,000 iterations. Yes - that's almost 14 seconds faster!

This doesn't make a lot of sense, since the only part of the GiST index that's being called that actually looks at the bounding boxes is this:

retval = (((key->xmax >= query->xmax) && (key->xmin <= query->xmax)) ||
          ((query->xmax >= key->xmax) && (query->xmin <= key->xmax)))
         &&
         (((key->ymax >= query->ymax) && (key->ymin <= query->ymax)) ||
          ((query->ymax >= key->ymax) && (query->ymin <= key->ymax)));

dave
David Blasby <dblasby@refractions.net> writes:
> Humm -- strange results here:

> typedef struct
> {
>	float xmin;
>	float ymin;
>	float xmax;
>	float ymax;
> } BOX2DFLOAT4;

> This takes about 18,000 ms to do a nested query with 10,000 iterations.

> typedef struct
> {
>	float xmin;
>	float ymin;
>	float xmax;
>	float ymax;
>	char  junk[16];
> } BOX2DFLOAT4;

> This takes about 15,000 ms to do a nested query with 10,000 iterations.

That is strange --- I'm still suspecting an O(n^2) bit of logic somewhere. But the size of the discrepancy is a little bit more reasonable. Could you profile these two cases and see where the extra time goes?

> typedef struct
> {
>	double xmin;
>	double ymin;
>	double xmax;
>	double ymax;
> } BOX2DFLOAT4;

> This takes about 1500 ms to do a nested query with 10,000 iterations.
> Yes - that's almost 14 seconds faster!

AFAICS there are only two possible explanations:

1. float-vs-float comparison is a lot slower than double-vs-double on your hardware. Actually, the comparisons might be float-vs-double (is the "query" struct the same as the "key"??). The compiler could well be promoting one or both floats to double and then doing double comparison, but even so, that's a heck of a large overhead.

2. The float bounding boxes are presumably slightly inaccurate. If they are a tad too large then perhaps you are visiting more leaf pages than you need to. (A worrisome possibility is that a float box could be slightly too small, causing you to fail to visit an entry you should :-()

			regards, tom lane
I tracked the problem down to the "penalty" function used to build the tree. Basically, it compares the areas of the bounding boxes. There wasn't enough precision in the area calculations - instead of giving 0.0, they would give numbers in the [+-]10^-6 range. This really screwed up how the tree was being built. I cast the floats to doubles when calculating area, and it all works well now.

dave