Thread: Large index scan performance and indexCorrelation (PG 8.1.4 Win32)
Hello all,

I have a large amount of phone call data (280M records, 25 GB). The best solution for this task is partitioning, and I use it now. But at first I tried to put all the data in a single table indexed by call date and time. Because of the nature of the data, the records are clustered by date and nearly ordered by time. The table definition:

  CREATE DOMAIN datetime AS timestamp NOT NULL;
  CREATE DOMAIN cause AS int2 DEFAULT 16 NOT NULL;
  CREATE DOMAIN conn_num AS varchar(34);
  CREATE DOMAIN dur AS int4 NOT NULL;
  CREATE DOMAIN lno AS int2;
  CREATE DOMAIN tgrp AS char(6);

  CREATE TABLE conn
  (
    datetime datetime,
    anum     conn_num,
    bnum     conn_num,
    dur      dur,
    itgrp    tgrp,
    ilno     lno,
    otgrp    tgrp,
    olno     lno,
    cause    cause
  ) WITHOUT OIDS;

  CREATE INDEX conn_dt ON conn USING btree (datetime);

The usual tasks on this table are exporting and searching the calls of one or more days. This causes a scan of 400K or more records, selected via the 'conn_dt' index. The best data access path is a bitmap heap scan. The tests I made showed excellent bitmap scan performance, almost equal to a seq scan. But PG always preferred a plain index scan, which is 20 times slower.

Digging into the PG internals brought me to indexCorrelation. For the 'datetime' column it was about 0.999999. So why, despite this, was the index scan so slow? As the next step I ran

  select ctid from conn where ... order by datetime;

The result showed that there was no sequential page access at all - pure random access only.

A simple model which explains the situation: the sequence of numbers 2, 1, 4, 3, 6, 5, ..., 100, 99 has a correlation of about 0.9994. Now imagine that is the page order of an index scan. Hmm, a bad choice, isn't it?

I think indexCorrelation can help to estimate the page count, but not the page fetch cost. Why not use the formula

  min_IO_cost = ceil(indexSelectivity * T) * random_page_cost

instead of

  min_IO_cost = ceil(indexSelectivity * T) ?
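The 0.9994 figure is easy to reproduce; a minimal sketch using generate_series() and the corr() aggregate (the aggregate appears in releases newer than the 8.1.4 discussed here), building the pairwise-swapped sequence inline:

  -- Pair each position 1..100 with the swapped value 2, 1, 4, 3, ..., 100, 99
  SELECT corr(pos::float8, val::float8) AS sequence_correlation
  FROM (SELECT g AS pos,
               CASE WHEN g % 2 = 1 THEN g + 1 ELSE g - 1 END AS val
        FROM generate_series(1, 100) AS g) AS swapped;
  -- yields roughly 0.9994, even though every other fetch steps backwards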
On Tue, 2006-06-27 at 16:14 +0400, Andrew Sagulin wrote:

> The result showed that there was no sequential page access at all -
> pure random access only.
>
> A simple model which explains the situation: the sequence of numbers
> 2, 1, 4, 3, 6, 5, ..., 100, 99 has a correlation of about 0.9994. Now
> imagine that is the page order of an index scan. Hmm, a bad choice,
> isn't it?

Your example is only possible if whole blocks of data were out of order, which I guess is possible within a multi-million row table. Slightly out-of-order values would be ignored, since I/O works at the block rather than the tuple level.

ANALYZE doesn't cope well with tables as large as yours. It doesn't sample enough rows, nor does it look within single blocks/groups to discover anomalies such as yours. As a result, the data that is sampled looks almost perfectly ordered, even though the main bulk is not.

I think what you are also pointing out is that the assumption about the effects of correlation doesn't match the current readahead logic of filesystems. If we were to actively force a readahead stride of 2 for this scan (somehow), then the lack of correlation would disappear completely. IIRC the current filesystem readahead logic would see such a sequence as random, and so no readahead would be performed at all - even though the data is highly correlated. That isn't PostgreSQL's fault directly, since the readahead ought to work better than it does, but we fail indirectly by relying upon it in this case.

> I think indexCorrelation can help to estimate the page count, but not
> the page fetch cost. Why not use the formula
>
>   min_IO_cost = ceil(indexSelectivity * T) * random_page_cost
>
> instead of
>
>   min_IO_cost = ceil(indexSelectivity * T) ?

That part is sensible. The min_IO_cost is for when the access is sequential, which by definition has a cost of 1.0. The bit you might have issue with is how we extrapolate from the min_IO_cost and correlation to arrive at a cost.

--
 Simon Riggs
 EnterpriseDB          http://www.enterprisedb.com
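On the ANALYZE sampling point, the per-column sample size can be raised so the planner at least sees more rows when it estimates ordering; a sketch against the conn table defined upthread (a larger statistics target will not, by itself, catch the block-level anomaly described above):

  -- Default statistics target in 8.1 is 10; raise it for the indexed column
  ALTER TABLE conn ALTER COLUMN datetime SET STATISTICS 1000;
  ANALYZE conn;

  -- Inspect the correlation estimate the planner will actually use
  SELECT attname, correlation
  FROM pg_stats
  WHERE tablename = 'conn' AND attname = 'datetime';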
Wednesday, June 28, 2006, 2:04:17, Simon Riggs, you wrote:

> That part is sensible. The min_IO_cost is for when the access is
> sequential, which by definition has a cost of 1.0.

In general - yes. But we are talking about the min_IO_cost of an index scan, which is barely sequential. Correct me if I'm wrong: the index scan algorithm is something like 'read a couple of index pages, read some data pages, index pages, data pages, ...'. So the current assumption behind min_IO_cost is too optimistic even in the case of ideal tuple ordering.

> The bit you might have issue with is how we extrapolate from the
> min_IO_cost and correlation to arrive at a cost.

Right now the index scan cost calculation uses indexCorrelation both as a measure of tuple clustering and as a degree of their sequentiality (physical ordering). As far as I know there are cases where this approach is wrong - for example, my issue, or any other case with high clustering but without ordering, where a bitmap heap scan is the best way but PostgreSQL prefers an index scan or even a sequential scan.

Does the PostgreSQL development team plan to revise the index scan cost algorithm, or are issues like mine too rare to take into account?
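For comparing the two access paths head to head on a case like this, the planner can be steered per session; a sketch with an illustrative one-day predicate (the dates are made up):

  -- Plan and timing the planner picks on its own (plain index scan here)
  EXPLAIN ANALYZE
  SELECT * FROM conn
  WHERE datetime >= '2006-06-01' AND datetime < '2006-06-02';

  -- Disallow plain index scans for this session to force the bitmap path
  SET enable_indexscan = off;
  EXPLAIN ANALYZE
  SELECT * FROM conn
  WHERE datetime >= '2006-06-01' AND datetime < '2006-06-02';
  RESET enable_indexscan;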
Andrew Sagulin <andrews42@yandex.ru> writes:
> Does the PostgreSQL development team plan to revise the index scan
> cost algorithm, or are issues like mine too rare to take into account?

The algorithm is certainly open for discussion, but we're not changing it on the basis of just a single report ...

You're mistaken to be fingering min_IO_cost as the source of the issue, because there is also a correction for near-sequential access in cost_bitmap_heap_scan. If we were to bias the system as heavily against index scans as you propose, we would logically have to put a similar bias into cost_bitmap_heap_scan, and you'd probably still end up with a plain indexscan. What you need to do is compare the two functions and figure out what part of the cost models is out of line with reality. I tend to agree with the upthread comment that the nonlinear interpolation between min_IO_cost and max_IO_cost is suspect ... but that may or may not have anything to do with your problem. It might be that cost_index is fine and cost_bitmap_heap_scan is overcharging.

BTW, there are already some changes in HEAD relating to this; please see the pghackers archives from the beginning of June (thread "More thoughts about planner's cost estimates").

			regards, tom lane
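For readers following along, the interpolation in question, as cost_index appears to compute it in 8.1, is roughly run_cost += max_IO_cost + indexCorrelation^2 * (min_IO_cost - max_IO_cost). Tabulating it for a few correlation values with made-up page counts (100,000 pages fetched, random_page_cost = 4) shows how sharply the estimate collapses toward the sequential cost as |correlation| approaches 1:

  SELECT c AS correlation,
         max_io + c * c * (min_io - max_io) AS interpolated_io_cost
  FROM (SELECT 0.0 AS c UNION ALL SELECT 0.9 UNION ALL SELECT 0.99
        UNION ALL SELECT 0.999 UNION ALL SELECT 0.9994 UNION ALL SELECT 1.0) AS t,
       (SELECT 400000.0 AS max_io, 100000.0 AS min_io) AS io
  ORDER BY c;
  -- e.g. c = 0.9 still costs about 157,000, while c = 0.9994 drops to ~100,360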
On Wed, Jun 28, 2006 at 10:37:24AM -0400, Tom Lane wrote:
> with a plain indexscan. What you need to do is compare the two
> functions and figure out what part of the cost models is out of line
> with reality. I tend to agree with the upthread comment that the
> nonlinear interpolation between min_IO_cost and max_IO_cost is suspect

If you're going to make such a comparison (which is badly needed, imho), http://stats.distributed.net/~decibel/ might be of use to you. It shows that the nonlinear interpolation between the correlated and uncorrelated index scans is way off base, at least for this example.

BTW, you'll have a hard time convincing people to increase the cost estimates of index scans, because experience has shown that they're already too high (see all the discussions about dropping random_page_cost, for example).
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
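On the random_page_cost point, that comparison is easy to try per session without touching postgresql.conf; a sketch, with a purely illustrative value and the same made-up predicate as above:

  -- Default is 4.0; lower values make random page fetches - and hence
  -- index scans - look cheaper relative to sequential scans
  SET random_page_cost = 2.0;
  EXPLAIN ANALYZE
  SELECT * FROM conn
  WHERE datetime >= '2006-06-01' AND datetime < '2006-06-02';
  RESET random_page_cost;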