Thread: again on index usage
Hello,

I have a table with ca. 1.7 million records, with the following structure:

Table "iplog_gate200112"
 Attribute |   Type    | Modifier
-----------+-----------+----------
 ipaddr    | inet      |
 input     | bigint    |
 output    | bigint    |
 router    | text      |
 ipdate    | timestamp |
Indices: iplog_gate200112_ipaddr_idx,
         iplog_gate200112_ipdate_idx

The following query

explain
SELECT sum(input) FROM iplog_gate200112
WHERE
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02' AND
network_subeq(ipaddr, '193.68.240.0/20') AND 'uni-gw' ~ router;

results in

NOTICE:  QUERY PLAN:

Aggregate  (cost=51845.51..51845.51 rows=1 width=8)
  ->  Seq Scan on iplog_gate200112  (cost=0.00..51845.04 rows=190 width=8)

Why would it not want to use an index scan?

Statistics for the table are as follows (from pg_statistic s, pg_attribute a,
pg_class c where starelid = c.oid and attrelid = c.oid and staattnum = attnum
and relname = 'iplog_gate200112'):

 attname | attdispersion | starelid  | staattnum | staop | stanullfrac | stacommonfrac | stacommonval           | staloval               | stahival
---------+---------------+-----------+-----------+-------+-------------+---------------+------------------------+------------------------+------------------------
 ipaddr  | 8.85397e-05   | 190565949 |         1 |  1203 |           0 |   0.000441917 | 192.92.129.1           | 192.92.129.0           | 212.72.197.154
 input   | 0.0039343     | 190565949 |         2 |   412 |           0 |     0.0183278 | 0                      | 0                      | 5929816798
 output  | 0.724808      | 190565949 |         3 |   412 |           0 |      0.835018 | 0                      | 0                      | 2639435033
 router  | 0.222113      | 190565949 |         4 |   664 |           0 |      0.416541 | sofia5                 | bourgas1               | varna3
 ipdate  | 0.014311      | 190565949 |         5 |  1322 |           0 |     0.0580676 | 2001-12-04 00:00:00+02 | 2001-12-01 00:00:00+02 | 2001-12-31 00:00:00+02
(5 rows)

The query

explain
SELECT sum(input) FROM iplog_gate200112
WHERE
ipdate < '2001-12-01 00:00:00+02' AND
network_subeq(ipaddr, '193.68.240.0/20') AND 'uni-gw' ~ router;

produces

Aggregate  (cost=4.91..4.91 rows=1 width=8)
  ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
      (cost=0.00..4.91 rows=1 width=8)

Note there are no records with ipdate < '2001-12-01 00:00:00+02' in the table.

Could anyone shed some light? This is on 7.1.3.

Daniel
Daniel Kalchev <daniel@digsys.bg> writes:
> explain
> SELECT sum(input) FROM iplog_gate200112
> WHERE
> '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
> '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02' AND
> network_subeq(ipaddr, '193.68.240.0/20') AND 'uni-gw' ~ router;

> results in

> NOTICE:  QUERY PLAN:

> Aggregate  (cost=51845.51..51845.51 rows=1 width=8)
>   ->  Seq Scan on iplog_gate200112  (cost=0.00..51845.04 rows=190 width=8)

> Why would it not want to use an index scan?

It's difficult to tell from this what it thinks the selectivity of the
ipdate index would be, since the rows estimate includes the effect of
the ipaddr and router restrictions.  What do you get from just

explain
SELECT sum(input) FROM iplog_gate200112
WHERE
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02';

If you say "set enable_seqscan to off", does that change the plan?

BTW, the planner does not associate function calls with indexes.  If you
want to have the ipaddr index considered for this query, you need to write
ipaddr <<= '193.68.240.0/20' not network_subeq(ipaddr, '193.68.240.0/20').
(But IIRC, that only works in 7.2 anyway, not earlier releases :-()

			regards, tom lane
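[Aside: a sketch of the rewrite Tom describes, applied to the original query.
The two date ranges in the WHERE clause intersect to the single day Dec 1-2,
so only that range is kept here; per Tom's note, the planner only considers
the operator form for the ipaddr index on 7.2:]

    EXPLAIN
    SELECT sum(input) FROM iplog_gate200112
    WHERE '2001-12-01 00:00:00+02' <= ipdate
      AND ipdate < '2001-12-02 00:00:00+02'
      AND ipaddr <<= '193.68.240.0/20'
      AND 'uni-gw' ~ router;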
> It's difficult to tell from this what it thinks the selectivity of the
> ipdate index would be, since the rows estimate includes the effect of
> the ipaddr and router restrictions.  What do you get from just
>
> explain
> SELECT sum(input) FROM iplog_gate200112
> WHERE
> '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
> '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02';

I don't know if this'd help, but if you're summing the data and running this
query often, see if a functional index would help:

CREATE INDEX sum_input_fnc_idx ON iplog_gate200112 (sum(input));

-sc

--
Sean Chittenden
>>>Tom Lane said:
> It's difficult to tell from this what it thinks the selectivity of the
> ipdate index would be, since the rows estimate includes the effect of
> the ipaddr and router restrictions.  What do you get from just
>
> explain
> SELECT sum(input) FROM iplog_gate200112
> WHERE
> '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
> '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02';

Same result (sorry, should have included this originally):

Aggregate  (cost=47721.72..47721.72 rows=1 width=8)
  ->  Seq Scan on iplog_gate200112  (cost=0.00..47579.54 rows=56873 width=8)

> If you say "set enable_seqscan to off", does that change the plan?

Yes. As expected (I no longer have the problem of NaN estimates :)

Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
  ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
      (cost=0.00..100217.52 rows=56873 width=8)

My belief is that the planner does not want to use the index due to the low
value dispersion of the indexed attribute. When splitting the table into
several smaller tables, the index is used. This bites me, because each such
query takes at least 3 minutes, and the script that generates these needs to
execute a few thousand queries.

> BTW, the planner does not associate function calls with indexes.  If you
> want to have the ipaddr index considered for this query, you need to write
> ipaddr <<= '193.68.240.0/20' not network_subeq(ipaddr, '193.68.240.0/20').
> (But IIRC, that only works in 7.2 anyway, not earlier releases :-()

This is what I thought too, but using the ipdate index will be sufficient.

I understand my complaint is not a bug, but rather a question of proper
planner optimization (it worked 'as expected' in 7.0). Perhaps the planner
should consider the total number of rows as well as the dispersion factor.
With the dispersion being around 1.5% and total rows 1.7 million, this gives
about 25k rows with the same value - large enough to trigger a sequential
scan, as far as I understand it, but the cost of scanning 1.7 million rows
sequentially is just too high.

By the way, the same query takes approx 10 sec with set enable_seqscan to off.

Daniel
Daniel Kalchev <daniel@digsys.bg> writes:
> Same result (sorry, should have included this originally):
> Aggregate  (cost=47721.72..47721.72 rows=1 width=8)
>   ->  Seq Scan on iplog_gate200112  (cost=0.00..47579.54 rows=56873 width=8)

>>> If you say "set enable_seqscan to off", does that change the plan?

> Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
>   ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
>       (cost=0.00..100217.52 rows=56873 width=8)

So, what we've got here is a difference of opinion: the planner thinks
that the seqscan will be faster.  How many rows are actually selected
by this WHERE clause?  How long does each plan actually take?

			regards, tom lane
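[Aside: a sketch of answering the row-count half of Tom's question directly,
using the same predicates as the EXPLAINed query (the wider Dec 1 - Jan 1
range in the original is implied by the one-day range and is dropped here):]

    -- rows actually selected by the full WHERE clause
    SELECT count(*) FROM iplog_gate200112
    WHERE '2001-12-01 00:00:00+02' <= ipdate
      AND ipdate < '2001-12-02 00:00:00+02'
      AND network_subeq(ipaddr, '193.68.240.0/20')
      AND 'uni-gw' ~ router;

    -- rows matched by the indexable date constraints alone
    SELECT count(*) FROM iplog_gate200112
    WHERE '2001-12-01 00:00:00+02' <= ipdate
      AND ipdate < '2001-12-02 00:00:00+02';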
>>>Tom Lane said:
> Daniel Kalchev <daniel@digsys.bg> writes:
> > Same result (sorry, should have included this originally):
>
> > Aggregate  (cost=47721.72..47721.72 rows=1 width=8)
> >   ->  Seq Scan on iplog_gate200112  (cost=0.00..47579.54 rows=56873 width=8)
>
> >>> If you say "set enable_seqscan to off", does that change the plan?
>
> > Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
> >   ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
> >       (cost=0.00..100217.52 rows=56873 width=8)
>
> So, what we've got here is a difference of opinion: the planner thinks
> that the seqscan will be faster.  How many rows are actually selected
> by this WHERE clause?  How long does each plan actually take?
>
> 			regards, tom lane

3-5 minutes with sequential scan; 10-15 sec with index scan. The query
returns 4062 rows, out of ca. 1,700,000 rows.

With only the datetime constraints (which relate to the index), the number
of rows is 51764.

In any case, sequential scan of millions of rows cannot be faster than index
scan. The average number of records for each index key is around 25000 -
perhaps the planner thinks that because the number of tuples in this case is
higher, it should prefer a sequential scan. I guess the planner will do
better if there is some scaling of these values with respect to the total
number of rows.

Daniel
Daniel Kalchev wrote:
>
> >>>Tom Lane said:
> > Daniel Kalchev <daniel@digsys.bg> writes:
> > > Same result (sorry, should have included this originally):
> >
> > >>> If you say "set enable_seqscan to off", does that change the plan?
> >
> > > Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
> > >   ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
> > >       (cost=0.00..100217.52 rows=56873 width=8)
> >
> > So, what we've got here is a difference of opinion: the planner thinks
> > that the seqscan will be faster.  How many rows are actually selected
> > by this WHERE clause?  How long does each plan actually take?
>
> 3-5 minutes with sequential scan; 10-15 sec with index scan. The query
> returns 4062 rows, out of ca. 1,700,000 rows.
>
> With only the datetime constraints (which relate to the index), the number
> of rows is 51764.

The planner estimates 56873 rows, so the estimate isn't that bad. A
plausible reason is that your table is nearly clustered with respect to
ipdate.

regards,
Hiroshi Inoue
Daniel Kalchev <daniel@digsys.bg> writes:
> Aggregate  (cost=47721.72..47721.72 rows=1 width=8)
>   ->  Seq Scan on iplog_gate200112  (cost=0.00..47579.54 rows=56873 width=8)

>>> If you say "set enable_seqscan to off", does that change the plan?

> Aggregate  (cost=100359.71..100359.71 rows=1 width=8)
>   ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
>       (cost=0.00..100217.52 rows=56873 width=8)

>>> So, what we've got here is a difference of opinion: the planner thinks
>>> that the seqscan will be faster.  How many rows are actually selected
>>> by this WHERE clause?  How long does each plan actually take?

> 3-5 minutes with sequential scan; 10-15 sec with index scan. The query
> returns 4062 rows, out of ca. 1,700,000 rows.
> With only the datetime constraints (which relate to the index), the number
> of rows is 51764.

Hm.  Okay, so the number-of-rows estimate is not too far off.  I concur
with Hiroshi's comment: the reason the indexscan is so fast must be that
the table is clustered (physical order largely agrees with index order).
This would obviously hold if the records were entered in order by
ipdate; is that true?

The 7.2 planner does try to estimate index order correlation, and would
be likely to price this indexscan much lower, so that it would make the
right choice.  I'd suggest updating to 7.2 as soon as we have RC1 out.
(Don't do it yet, though, since we've got some timestamp bugs to fix
that are probably going to require another initdb.)

> In any case, sequential scan of millions of rows cannot be faster than
> index scan.

Snort.  If that were true, we could get rid of most of the complexity
in the planner.

			regards, tom lane
>>>Tom Lane said:
> Hm.  Okay, so the number-of-rows estimate is not too far off.  I concur
> with Hiroshi's comment: the reason the indexscan is so fast must be that
> the table is clustered (physical order largely agrees with index order).
> This would obviously hold if the records were entered in order by
> ipdate; is that true?

Yes. But... do you want me to cluster it by ipaddr for example and try it
again? I understand the clustering might help with sequential scans, but why
would it help with index scans?

Daniel
Daniel Kalchev <daniel@digsys.bg> writes:
> I understand the clustering might help with sequential scans, but why
> would it help with index scans?

No, the other way around: it makes no difference for seq scans, but can
speed up index scans quite a lot.  With a clustered table, successive
index-driven fetches tend to hit the same pages rather than hitting
random pages throughout the table.  That saves I/O.

Given the numbers you were quoting, if the table were in perfectly
random order by ipdate then there would probably have been about three
rows per page that the indexscan would've had to fetch.  This would mean
touching each page three times in some random order.  Unless the table
is small enough to fit in Postgres' shared buffer cache, that's going to
represent a lot of extra I/O --- a lot more than reading each page only
once, as a seqscan would do.  At the other extreme, if the table is
perfectly ordered by ipdate then the indexscan need only hit a small
number of pages (all the rows we want are in a narrow range) and we
touch each page many times before moving on to the next.  Very few I/O
requests in that case.

7.1 does not have any statistics about table order, so it uses the
conservative assumption that the ordering is random.  7.2 has more
statistical data and perhaps will make better estimates about the cost
of indexscans.

			regards, tom lane
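[Aside: on 7.2 the new statistics Tom mentions are visible directly - a
sketch, assuming the 7.2 pg_stats view:]

    -- correlation near +/-1.0 means physical order tracks the column;
    -- near 0 means random placement (the 7.1 worst-case assumption)
    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'iplog_gate200112';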
>>>Tom Lane said:
> Daniel Kalchev <daniel@digsys.bg> writes:
> > I understand the clustering might help with sequential scans, but why
> > would it help with index scans?
>
> No, the other way around: it makes no difference for seq scans, but can
> speed up index scans quite a lot.  With a clustered table, successive
> index-driven fetches tend to hit the same pages rather than hitting
> random pages throughout the table.  That saves I/O.

Ok, time to go home :-), but...

> Given the numbers you were quoting, if the table were in perfectly
> random order by ipdate then there would probably have been about three
> rows per page that the indexscan would've had to fetch.  This would mean
> touching each page three times in some random order.  Unless the table
> is small enough to fit in Postgres' shared buffer cache, that's going to
> represent a lot of extra I/O --- a lot more than reading each page only
> once, as a seqscan would do.  At the other extreme, if the table is
> perfectly ordered by ipdate then the indexscan need only hit a small
> number of pages (all the rows we want are in a narrow range) and we
> touch each page many times before moving on to the next.  Very few I/O
> requests in that case.

In any case, if we need to hit 50k pages (assuming the indexed data is
randomly scattered in the file), and having to read these three times each,
it will be less I/O than having to read 1.7 million records. The table will
never be laid out sequentially on the disk, at least not in this case (which
adds to the table day after day - and this is why the data is almost ordered
by ipdate).

What I am arguing about is the scaling - is 50k random reads worse than 1.7
million sequential reads? Eventually considering the tuple size, disk block
size, etc.

I will wait patiently for 7.2 to release and see how this same table
performs. :-)

Daniel
Daniel Kalchev <daniel@digsys.bg> writes:
> In any case, if we need to hit 50k pages (assuming the indexed data is
> randomly scattered in the file), and having to read these three times
> each, it will be less I/O than having to read 1.7 million records.

How do you arrive at that?  Assuming 100 records per page (probably the
right order of magnitude), the seqscan alternative is 17k page reads.
Yes, you examine more tuples, but CPUs are lots faster than disks.

That doesn't even address the fact that Unix systems reward sequential
reads and penalize random access.  In a seqscan, we can expect that the
kernel will schedule the next page read before we ask for it, so that
our CPU time to examine a page is overlapped with I/O for the next page.
In an indexscan that advantage goes away, because neither we nor the
kernel know which page will be touched next.  On top of the loss of
read-ahead, the filesystem is probably laid out in a way that rewards
sequential access with fewer and shorter seeks.

The tests I've made suggest that the penalty involved is about a factor
of four --- ie, a seqscan can scan four pages in the same amount of time
that it takes to bring in one randomly-accessed page.

			regards, tom lane
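[Aside: the records-per-page guess is easy to check from the catalog, since
relpages and reltuples are refreshed by VACUUM - a sketch:]

    SELECT relname, relpages, reltuples,
           reltuples / relpages AS tuples_per_page
    FROM pg_class
    WHERE relname = 'iplog_gate200112';

[Daniel's VACUUM output below - 1706202 tuples in 17722 pages - works out to
about 96 tuples per page, so the 17k-page seqscan figure holds.]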
>>>Tom Lane said:
> Daniel Kalchev <daniel@digsys.bg> writes:
> > In any case, if we need to hit 50k pages (assuming the indexed data is
> > randomly scattered in the file), and having to read these three times
> > each, it will be less I/O than having to read 1.7 million records.
>
> How do you arrive at that?  Assuming 100 records per page (probably the
> right order of magnitude), the seqscan alternative is 17k page reads.
> Yes, you examine more tuples, but CPUs are lots faster than disks.

I tried this:

db=# select * into iplog_test from iplog_gate200112;
SELECT
db=# create index iplog_test_ipaddr_idx on iplog_test(ipaddr);
CREATE
db=# cluster iplog_test_ipaddr_idx on iplog_test;
CLUSTER
db=# create index iplog_test_ipdate_idx on iplog_test(ipdate);
CREATE
db=# vacuum verbose analyze iplog_test;
NOTICE:  --Relation iplog_test--
NOTICE:  Pages 17722: Changed 0, reaped 0, Empty 0, New 0; Tup 1706202:
Vac 0, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 80, MaxLen 88; Re-using:
Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 1.48s/-1.86u sec.
NOTICE:  Index iplog_test_ipaddr_idx: Pages 5621; Tuples 1706202. CPU 0.51s/1.80u sec.
NOTICE:  Index iplog_test_ipdate_idx: Pages 4681; Tuples 1706202. CPU 0.36s/1.92u sec.
NOTICE:  --Relation pg_toast_253297758--
NOTICE:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail.
Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 0.00s/0.00u sec.
NOTICE:  Index pg_toast_253297758_idx: Pages 1; Tuples 0. CPU 0.00s/0.00u sec.
NOTICE:  Analyzing...
VACUUM
db=# explain
db-# SELECT sum(input), sum(output) FROM iplog_test
db-# WHERE
db-# '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
db-# '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02' AND
db-# ipaddr <<= '193.68.240.0/20' AND 'uni-gw' ~ router;
NOTICE:  QUERY PLAN:

Aggregate  (cost=56112.97..56112.97 rows=1 width=16)
  ->  Seq Scan on iplog_test  (cost=0.00..56111.54 rows=284 width=16)

EXPLAIN

The query runs for ca. 3.5 minutes.

db=# set enable_seqscan to off;

the query plan is

Aggregate  (cost=100507.36..100507.36 rows=1 width=16)
  ->  Index Scan using iplog_test_ipdate_idx on iplog_test
      (cost=0.00..100505.94 rows=284 width=16)

The query runs for ca. 2.2 minutes. Moves closer to your point :-)

Anyway, the platform is a dual Pentium III 550 MHz (Intel) machine with 512
MB RAM, with a 15000 RPM Cheetah for the database, running BSD/OS 4.2. The
machine is reasonably loaded all the time, so this is very much a real-world
test.

I agree that with the 'wrong' clustering the index scan is not so much
faster than the sequential scan. Perhaps I need to tune this machine's costs
to prefer more disk intensive operations over CPU intensive operations?

Let's see what 7.2 will result in.

Daniel
Daniel wrote: (stripped to info I used)

> NOTICE:  Pages 17722: Changed 0, reaped 0, Empty 0, New 0; Tup 1706202: Vac 0,
> NOTICE:  Index iplog_test_ipaddr_idx: Pages 5621; Tuples 1706202. CPU
> NOTICE:  Index iplog_test_ipdate_idx: Pages 4681; Tuples 1706202. CPU

>   ->  Seq Scan on iplog_test  (cost=0.00..56111.54 rows=284 width=16)
> query runs for ca 3.5 minutes.

>   ->  Index Scan using iplog_test_ipdate_idx on iplog_test
>       (cost=0.00..100505.94 rows=284 width=16)
> query runs for ca 2.2 minutes.

I cannot really see how 284 rows can have an estimated index cost of 100506 ?

> 512 MB RAM, with 15000 RPM Cheetah for the database, running

> Perhaps I need to tune this machine's costs to prefer more
> disk intensive operations over CPU intensive operations?

What is actually estimated wrong here seems to be the estimated effective
cache size, and thus the cache ratio of page fetches. Most of your pages
will be cached.

The tuning parameter is: effective_cache_size

With (an estimated) 50 % of 512 Mb for file caching that number would need
to be:

effective_cache_size = 32768	# 8k pages

Can you try this and tell us what happens ?

Andreas
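[Aside: a sketch of trying Andreas' number before committing it to
postgresql.conf - this assumes effective_cache_size is settable per session
with SET, as the 7.1 run-time configuration scheme allows:]

    SET effective_cache_size = 32768;   -- 8k pages, i.e. ~256 MB

    EXPLAIN
    SELECT sum(input), sum(output) FROM iplog_test
    WHERE '2001-12-01 00:00:00+02' <= ipdate
      AND ipdate < '2001-12-02 00:00:00+02'
      AND ipaddr <<= '193.68.240.0/20' AND 'uni-gw' ~ router;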
>>>"Zeugswetter Andreas SB SD" said:
> What is actually estimated wrong here seems to be the estimated
> effective cache size, and thus the cache ratio of page fetches.
> Most of your pages will be cached.
>
> The tuning parameter is: effective_cache_size
>
> With (an estimated) 50 % of 512 Mb for file caching that number would
> need to be:
> effective_cache_size = 32768	# 8k pages
>
> Can you try this and tell us what happens ?

I suspected this, but haven't really come to test it. On BSD/OS, the buffer
cache is 10% of the RAM; in my case

buffer cache = 53522432 (51.04 MB)

I guess effective_cache_size = 6400 will be ok.

Daniel
[with the new effective_cache_size = 6400]

explain
SELECT sum(input), sum(output) FROM iplog_gate200112
WHERE
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02' AND
'2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2002-01-01 00:00:00+02' AND
ipaddr <<= '193.68.240.0/20' AND 'uni-gw' ~ router;

gives

Aggregate  (cost=56111.97..56111.97 rows=1 width=16)
  ->  Seq Scan on iplog_gate200112  (cost=0.00..56110.54 rows=284 width=16)

and takes 3 min to execute (was 10 sec after a fresh restart).

db=# set enable_seqscan to off;

Aggregate  (cost=84980.10..84980.10 rows=1 width=16)
  ->  Index Scan using iplog_gate200112_ipdate_idx on iplog_gate200112
      (cost=0.00..84978.68 rows=284 width=16)

takes 1.8 min to execute (was 2 sec after a fresh restart).

Still proves my point. But the fresh-restart performance is impressive.
After a few minutes the database takes on its normal load, and in my opinion
the buffer cache is then too cluttered with pages from other tables.

Which brings up another question: with so much RAM in recent equipment, it
may be a good idea to add specific instructions to INSTALL on tuning the
system as soon as it is installed. Most people will stop there, especially
after an upgrade (as I did).

Daniel
> [with the new effective_cache_size = 6400]

This seems way too low for a 512 Mb machine. Why does your OS only use so
little for file cache ? Is the rest used for processes ? For the above
number you need to consider OS cache and shared_buffers; you can
approximately add them together, minus a few %.

With the data you gave, a calculated value for effective_cache_size would be
29370, assuming the random_page_cost is actually 4 on your machine. 29370
might be a slight overestimate, since your new table will probably still be
somewhat sorted by date within one IP.

Try to measure IO/s during the seq scan and during the index path and
calculate the ratio. This should be done during an average workload on the
machine.

Andreas
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > I cannot really see how 284 rows can have an estimated index cost of 100506 ? The estimated number of indexscanned rows is more like 50k. The number you are looking at includes the estimated selectivity of the non-indexable WHERE clauses, too. > What is actually estimated wrong here seems to be the estimated > effective cache size, and thus the cache ratio of page fetches. Good point, but I think the estimates are only marginally sensitive to estimated cache size (if they're not, we have a problem, considering how poorly we can estimate the kernel's disk buffer size). It would be interesting for Daniel to try a few different settings of effective_cache_size and see how much the EXPLAIN costs change. regards, tom lane
Daniel Kalchev <daniel@digsys.bg> writes:
> I agree that with the 'wrong' clustering the index scan is not so
> much faster than the sequential scan.

It would be interesting to check whether there is any correlation
between ipaddr and ipdate in your test data.  Perhaps clustering
on ipaddr might not destroy the ordering on ipdate as much as you
thought.  A more clearly random-order test would go:

	select * into iplog_test from iplog_gate200112 order by random();
	create index iplog_test_ipdate_idx on iplog_test(ipdate);
	vacuum verbose analyze iplog_test;
	<< run queries >>

> Perhaps I need to tune this machine's costs to prefer more disk intensive
> operations over CPU intensive operations?

Possibly.  I'm not sure there's much point in tuning the cost estimates
until the underlying model is more nearly right (ie, knows something
about correlation).  Do you care to try your dataset with 7.2 beta?

			regards, tom lane
> > What is actually estimated wrong here seems to be the estimated
> > effective cache size, and thus the cache ratio of page fetches.
>
> Good point, but I think the estimates are only marginally sensitive
> to estimated cache size (if they're not, we have a problem, considering
> how poorly we can estimate the kernel's disk buffer size).  It would
> be interesting for Daniel to try a few different settings of
> effective_cache_size and see how much the EXPLAIN costs change.

Well, the number I told him (29370) should clearly prefer the index.
The estimate is very sensitive to this value :-(
With 29370 (=229 Mb) the index cost is 1,364 instead of 3,887 with the
default of 1000 pages ==> index scan.

229 Mb file cache with 512 Mb RAM is a reasonable value; I have a lot
more here:

Memory   Real     Virtual
free     0 MB     218 MB
procs    95 MB    293 MB
files    159 MB
total    256 MB   512 MB

Andreas
>>>"Zeugswetter Andreas SB SD" said:
> > [with the new effective_cache_size = 6400]
>
> This seems way too low for a 512 Mb machine. Why does your OS
> only use so little for file cache ? Is the rest used for processes ?
> For the above number you need to consider OS cache and shared_buffers.
> You can approximately add them together, minus a few %.

As far as I am aware, 10% for buffer space is the default for BSD operating
systems... although I have seen buffer space = 50% on MacOS X. There is no
problem increasing the buffer space in the kernel, although I am not very
confident this will give much better overall performance (well, more memory
can be added as well).

> With the data you gave, a calculated value for effective_cache_size
> would be 29370, assuming the random_page_cost is actually 4 on your
> machine. 29370 might be a slight overestimate, since your new table
> will probably still be somewhat sorted by date within one IP.

random_page_cost is 4. If the select into, then cluster, does this, then
yes, it is possible, but not guaranteed.

I will try with an increased effective_cache_size. Postmaster is started with

-N 128 -B 256 -i -o "-e -S 10240"

Daniel
>>>"Zeugswetter Andreas SB SD" said:
> > > What is actually estimated wrong here seems to be the estimated
> > > effective cache size, and thus the cache ratio of page fetches.
> >
> > Good point, but I think the estimates are only marginally sensitive
> > to estimated cache size (if they're not, we have a problem, considering
> > how poorly we can estimate the kernel's disk buffer size).  It would
> > be interesting for Daniel to try a few different settings of
> > effective_cache_size and see how much the EXPLAIN costs change.
>
> Well, the number I told him (29370) should clearly prefer the index.
> The estimate is very sensitive to this value :-(
> With 29370 (=229 Mb) the index cost is 1,364 instead of 3,887 with the
> default of 1000 pages ==> index scan.

But... if I understand it right (effective_cache_size being related to
kernel buffer space), it turns out that the estimates differ from reality -
my buffer cache is ca. 50 MB and I still get at least twice the performance
with index scan instead of sequential scan - whereas, as Tom explained,
things should be much worse.

I considered the possibility that the clustered table can still maintain
some ordering by ipdate after being clustered by ipaddr - but with over 65k
IP addresses, almost evenly spread, this should not be so significant.

Best Regards,
Daniel
> > > > What is actually estimated wrong here seems to be the estimated
> > > > effective cache size, and thus the cache ratio of page fetches.
> > >
> > > Good point, but I think the estimates are only marginally sensitive
> > > to estimated cache size (if they're not, we have a problem, considering
> > > how poorly we can estimate the kernel's disk buffer size).  It would
> > > be interesting for Daniel to try a few different settings of
> > > effective_cache_size and see how much the EXPLAIN costs change.
> >
> > Well, the number I told him (29370) should clearly prefer the index.
> > The estimate is very sensitive to this value :-(
> > With 29370 (=229 Mb) the index cost is 1,364 instead of 3,887 with the
> > default of 1000 pages ==> index scan.
>
> But... if I understand it right (effective_cache_size being related to
> kernel buffer space), it turns out that the estimates differ from
> reality - my buffer cache is ca. 50 MB and I still get at least twice the
> performance with index scan instead of sequential scan - whereas, as Tom
> explained, things should be much worse.

Since pg only reads one 8k page at a time, the IO performance of a seq scan
is probably not nearly as good as it could be when a lot of other IO is done
on the same drive.

First thing you should verify is whether there is actually a measurable
difference in IO throughput on the pg drive during the seq scan and the
index scan (iostat). If there is not, then random_page_cost is too high in
your scenario. (All assuming your data is not still clustered, as Tom
suggested.)

Andreas
This topic seems to come up a lot.  Is there something we are missing in
the FAQ?

---------------------------------------------------------------------------

Zeugswetter Andreas SB SD wrote:
> Since pg only reads one 8k page at a time, the IO performance of a seq
> scan is probably not nearly as good as it could be when a lot of other IO
> is done on the same drive.
>
> First thing you should verify is whether there is actually a measurable
> difference in IO throughput on the pg drive during the seq scan and the
> index scan (iostat). If there is not, then random_page_cost is too high
> in your scenario. (All assuming your data is not still clustered, as Tom
> suggested.)
>
> Andreas

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>>>"Zeugswetter Andreas SB SD" said:
> First thing you should verify is whether there is actually a measurable
> difference in IO throughput on the pg drive during the seq scan and the
> index scan (iostat). If there is not, then random_page_cost is too high
> in your scenario. (All assuming your data is not still clustered, as Tom
> suggested.)

At this idle time (got to have other emergency at 5am in the office :) here
is what I have (sd0 is the 'system' drive, sd1 is where postgres data lives):

 tin tout   sps tps msps   sps tps msps  usr nic sys int idl
   0   39   831  12  2.0  8962 121  3.6    4  26   7   0  63
   0   13   215   4  7.7  9917 122  3.7    5  24   5   0  66
   0   13   216   3  6.1  7116 115  4.1    5  23   4   0  68
   0   13   220   3  5.0  9401 128  5.0    5  17   4   0  74
   0   13   226   3 12.2  9232 122  3.8    4  24   4   0  67
   0   13   536  26  8.5 11353 147  4.4   13  16   9   0  62
   0   13   259   5  5.8 12102 165  4.1    8  14   8   0  70
   0   13   492  20  7.2 13913 186  4.5    8   9   6   0  76
   0   13   185   2  4.7 11423 184  5.0   14   6   8   0  72

running index scan:
   0   13   274   8  4.9  5786 145  4.4   18  10   8   0  64
   0   13   210   3  8.1  5707 153  3.9   20   9   6   0  64
   0   13   286   8  7.7  6283 139  4.3   21   9   8   0  62
   0   13   212   3  9.7  5900 133  3.3   22  13   7   0  58
   0   13   222   3  6.0  5811 148  3.5   20  12   6   0  61
   0   13   350  16  7.5  5640 134  4.1   22  12   7   0  58

(seems to be slowing down other I/O :)

running seq scan:
   0   13    50   4  1.9  4787 101  3.8   24  12   7   0  57
   0   13    34   3  5.6  5533 105  3.4   24  12   6   0  58
   0   13    42   4  3.1  5414 103  3.0   25  12   6   0  58
   0   13    26   2  0.0  5542 102  3.9   28  12   6   0  54
   0   13    52   5  2.8  5644 112  4.1   24  11   7   0  58
   0   13    27   2  4.1  6462 122  4.0   26   8   7   0  60
   0   13    36   3  2.0  5616 128  4.2   22   8   7   0  63

I can't seem to find any difference... Perhaps this is because the
'sequential' data is anyway scattered all around the disk.

I did this test first; now I will try the random() clustering Tom suggested
(although... isn't random not so random to trust it in this scenario? :)

Daniel
>>>Tom Lane said:
> It would be interesting to check whether there is any correlation
> between ipaddr and ipdate in your test data.  Perhaps clustering
> on ipaddr might not destroy the ordering on ipdate as much as you
> thought.  A more clearly random-order test would go:
>
> 	select * into iplog_test from iplog_gate200112 order by random();
> 	create index iplog_test_ipdate_idx on iplog_test(ipdate);
> 	vacuum verbose analyze iplog_test;
> 	<< run queries >>

NOTICE:  --Relation iplog_test--
NOTICE:  Pages 17761: Changed 17761, reaped 0, Empty 0, New 0; Tup 1706202:
Vac 0, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 80, MaxLen 88; Re-using:
Free/Avail. Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 2.36s/0.32u sec.
NOTICE:  Index iplog_test_ipdate_idx: Pages 4681; Tuples 1706202. CPU 0.26s/1.98u sec.
NOTICE:  --Relation pg_toast_275335644--
NOTICE:  Pages 0: Changed 0, reaped 0, Empty 0, New 0; Tup 0: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 0, MaxLen 0; Re-using: Free/Avail.
Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 0.00s/0.00u sec.
NOTICE:  Index pg_toast_275335644_idx: Pages 1; Tuples 0. CPU 0.00s/0.00u sec.
NOTICE:  Analyzing...

explain (with enable_seqscan to on):

Aggregate  (cost=56151.97..56151.97 rows=1 width=16)
  ->  Seq Scan on iplog_test  (cost=0.00..56150.54 rows=284 width=16)

average run time 4 minutes

(iostat before seq run)
 tin tout   sps tps msps   sps tps msps  usr nic sys int idl
   5   44    94   6  5.9  6842 120  3.3   25  11   6   0  58
   1   14    27   1  5.9  5968 138  3.6   22  11   5   0  62
   5   44    58   2  6.3  5647 117  3.0   27   9   6   0  58
   1   13     7   1  5.0  5723 125  3.4   24  10   5   0  61
   0   15    50   2  4.5  5606 110  3.1   27  10   5   0  58
   5   44    90   5  9.1  4702  87  2.5   28  12   4   0  55
   1   15    52   1  6.3  5045 114  4.1   24  10   4   0  61

(iostat during seq run)
   1   40    41   2  2.1  5847 123  3.6   25  11   4   0  60
   1   16   164  13  6.2  7280 128  3.2   28   8   8   0  57
   0   13    36   1  7.9  6059 147  3.9   25   8   5   0  62
   0   13    48   3  5.3  6691 126  3.4   26   8   7   0  59
   0   13    28   3  4.6  6473 103  2.3   28  11   7   0  54
   0   13   138  11  7.6 10848 151  4.9   19   6   6   0  69
   0   13    33   3  3.3  5568 100  3.6   21  16   3   0  59
   0   13    24   2  1.1  6752 144  3.4   22  12   2   0  64

(sometime at the end of query run)
   0   38    20   2  5.5  5621  57  1.2   23  23   3   0  51
   0   13    89   7  7.7  8854 101  3.4   21  18   4   0  57
   0   13    72   6  7.3  9929  88  2.2   18  21   4   0  57
   0   13   129   6  9.6  4865  43  1.0   21  24   4   0  51
   0   13    72   3  4.2  5421  46  0.5   24  22   4   0  50
   0   13    52   2  3.5  5877  64  1.8   22  21   4   0  53
   0   13    50   3  6.1  5561  54  1.7   19  26   3   0  52
   0   13    42   1  3.8  5455  76  1.4   22  22   3   0  53

(see lower msps - perhaps other queries slowed down? - but then again
returned to 'normal')
   0   13   244  20  6.6  6629 199  4.1   19   9   5   0  67
   0   13    68   4  6.2  6080 191  4.3   14  14   3   0  70
   0   13    75   3  5.9  6542 214  4.1   19   8   4   0  70
   0   13   615  18  5.0  5454 129  4.1   20  18   3   0  59
   0   13    88   2  5.7  3718  48  2.5   21  21   4   0  54
   0   13    46   3  3.1  4533  75  2.9   20  19   5   0  56
   0   13   143   7  5.1  4349  58  2.7   22  18   4   0  55
   0   13    58   2  9.9  4038  45  2.0   20  24   4   0  52
   0   13   111   4  5.8  4523  60  3.4   18  22   4   0  56

(with enable_seqscan to off)

Aggregate  (cost=85110.08..85110.08 rows=1 width=16)
  ->  Index Scan using iplog_test_ipdate_idx on iplog_test
      (cost=0.00..85108.66 rows=284 width=16)

average run time 2 minutes (worst 5 minutes, but this happened only once)

(iostat output)
   0   38     1   1  6.7  5110 224  3.2    7  16   3   0  73
   0   13    21   2  3.6  5249 219  4.1   12  13   2   0  73
   0   13     0   0 10.0  5341 216  3.9   12  11   4   0  73
   0   13     6   0  9.9  5049 218  3.5   10  14   3   0  72
   0   13     6   0  0.0  7654 216  3.8   10  10   2   0  78
   0   13     2   1  4.0  8758 222  4.1    6  11   4   0  80
   0   13     6   0  9.9  8594 219  4.4    6  10   3   0  81
   0   13     4   1  0.0  7290 210  4.3    6  10   4   0  80
   0   13    36   3  4.9  5912 196  4.7    9  10   4   0  76
   0   13     7   1 13.3  4787 209  3.7    9  17   2   0  72
   0   13     0   0 10.0  4691 209  3.8    9  17   2   0  72

This sort of proves that Tom is right about the need for random data. <grin>

Having the chance to play with it at such an idle time in the morning, when
nobody is supposed to be working with the database, I dared to shut down all
other sessions.

(idle iostat)

iostat 5
      tty             sd0              sd1            % cpu
 tin tout   sps tps msps   sps tps msps  usr nic sys int idl
   0  559   148   2  7.7     8   1  3.3    0   0   0   0  99
   0  559   235   4  4.1    30   2  2.0    0   0   1   0  98
   0  559   189   2  8.7     3   0 10.0    0   0   1   0  99

seq scan (10 sec):
   0  575 12103 101  9.2 18798 164  1.4   19   0   8   0  73
   0  575 10118  82 11.2 30507 243  0.9   33   0  11   0  55
   0  559 12704 101  9.3  7642  61  0.7    9   0   8   0  83

index scan (45 sec):
   0   38     8   1  4.0  6179 375  2.5    1   0   2   0  98
   0   13     4   1  2.5  6208 378  2.4    1   0   2   0  97
   0   13     7   1  3.3  6323 382  2.4    1   0   2   0  97
   0   13     9   2  0.0  6310 389  2.4    1   0   2   0  97
   0   13     0   0  0.0  3648 226  2.3    0   0   1   0  98

This proves to me that 15000 RPM Cheetah drives are damn fast at sequential
reads. So why does the index scan win under 'normal' database load??? I
would expect that if the disk subsystem is overloaded, it is overloaded more
seek-wise than throughput-wise. Perhaps this penalizes the 'expected to be
faster' sequential reads much more than the random-page index reads.

I will try 7.2 with the same data as soon as it is available and see the
differences - on a separate machine with about the same configuration (same
OS, same postgres install, same disks, only one 733 MHz CPU instead of two
550 MHz CPUs).

Daniel
> This topic seems to come up a lot. Is there something we are missing in
> the FAQ?

Most of the reports we seem to see are the usual "but the seq scan is
actually faster" case. Daniel actually has a case where the optimizer
chooses a bad plan. The difficulty seems to be that the optimizer chooses a
correct plan for an idle system, but with his workload the index path would
be far better (2 vs. 4 minutes).

This is one of the main problems of the current optimizer, which imho rather
aggressively chooses seq scans over index scans. During high load this does
not pay off.

My preference would actually be a way to make the optimizer choose a plan
that causes minimal workload, and not shortest runtime (which will obviously
only be fast with low overall workload). The reasoning behind this is that
during low workload your response times will be good enough with a "bad"
plan, but during high workload your response times will be best with a plan
that produces the least additional workload.

Andreas
Don Baccus wrote:
> Zeugswetter Andreas SB SD wrote:
> >
> > This is one of the main problems of the current optimizer, which imho
> > rather aggressively chooses seq scans over index scans. During high
> > load this does not pay off.
>
> Bingo ... dragging huge tables through the buffer cache via a sequential
> scan guarantees that a) the next query sequentially scanning the same
> table will have to read every block again (if the table's longer than
> available PG and OS cache) b) on a high-concurrency system other queries
> end up doing extra I/O, too.
>
> Oracle partially mitigates the second effect by refusing to trash its
> entire buffer cache on any given sequential scan.  Or so I've been told
> by people who know Oracle well.  A repeat of the sequential scan will
> still have to reread the entire table, but that's true anyway if the
> table's at least one block longer than available cache.

That is on our TODO list, at least.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes: > My preference would actually be a way to make the optimizer > choose a plan that causes minimal workload, and not shortest runtime ?? I am not sure that I see the difference. What I think you are saying is that when there's lots of competing work, seqscans have less advantage over indexscans because the sequential-access locality advantage is lost when the disk drive has to go off and service some other request. If that's the mechanism, I think that the appropriate answer is just to reduce random_page_cost. It's true that the current default of 4.0 was chosen using measurements on otherwise-unloaded systems. If you assume that the system is (a) too busy to do read-aheads for you, and (b) has to move the disk arm to service other requests between each of your requests, then it's not clear that sequential reads have any performance advantage at all :-(. I don't think I'd go as far as to lower random_page_cost to 1.0, but certainly there's a case for using an intermediate value. regards, tom lane
Zeugswetter Andreas SB SD wrote:
> This is one of the main problems of the current optimizer, which imho
> rather aggressively chooses seq scans over index scans. During high load
> this does not pay off.

Bingo ... dragging huge tables through the buffer cache via a sequential
scan guarantees that a) the next query sequentially scanning the same table
will have to read every block again (if the table's longer than available PG
and OS cache) b) on a high-concurrency system other queries end up doing
extra I/O, too.

Oracle partially mitigates the second effect by refusing to trash its entire
buffer cache on any given sequential scan. Or so I've been told by people
who know Oracle well. A repeat of the sequential scan will still have to
reread the entire table, but that's true anyway if the table's at least one
block longer than available cache.

Of course, Oracle picks sequential scans in horribly and obviously wrong
cases as well. On one project over the summer I had a query Oracle refused
to use an available index on until I told it to do so explicitly, and when I
did, it sped up by a factor of about 100. All optimizers will fail miserably
for certain queries and datasets.

--
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org
Bruce Momjian wrote:
> > Oracle partially mitigates the second effect by refusing to trash its
> > entire buffer cache on any given sequential scan.  Or so I've been told
> > by people who know Oracle well.  A repeat of the sequential scan will
> > still have to reread the entire table, but that's true anyway if the
> > table's at least one block longer than available cache.
>
> That is on our TODO list, at least.

I didn't realize this; it's good news. (I don't follow PG development
closely these days.)

BTW, overall I think the cost-estimating portion of the PG optimizer does
about as well as Oracle's. Oracle is a lot smarter about doing
transformations of certain types of queries (turning "scalar in (select
...)" into something akin to an "exists"), but of course this has nothing to
do with estimating the cost of index vs. sequential scans.

--
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org
>>>Tom Lane said:
> "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> > My preference would actually be a way to make the optimizer
> > choose a plan that causes minimal workload, and not shortest runtime
>
> ?? I am not sure that I see the difference.

There can be a difference only if the optimizer takes into account already
executing plans (by other backends).

> What I think you are saying is that when there's lots of competing work,
> seqscans have less advantage over indexscans because the
> sequential-access locality advantage is lost when the disk drive has to
> go off and service some other request.

This is exactly my point. The primary goal of the optimizer in my opinion
should be to avoid thrashing. :-) Now, it is not easy to figure out when the
system starts thrashing - at least not in a portable way I can think of
immediately.

> I don't think I'd go as far as to lower random_page_cost to 1.0, but
> certainly there's a case for using an intermediate value.

The question is: how does one find the proper value? That is, is it possible
to design a planner benchmarking utility to aid in tuning Postgres? One that
does not execute a single query and optimize on an idle system.

Daniel
On Fri, Jan 11, 2002 at 11:42:43AM -0500, Bruce Momjian wrote:
> Don Baccus wrote:
> > Zeugswetter Andreas SB SD wrote:
> > >
> > > This is one of the main problems of the current optimizer, which imho
> > > rather aggressively chooses seq scans over index scans. During high
> > > load this does not pay off.
> >
> > Bingo ... dragging huge tables through the buffer cache via a sequential
> > scan guarantees that a) the next query sequentially scanning the same
> > table will have to read every block again (if the table's longer than
> > available PG and OS cache) b) on a high-concurrency system other queries
> > end up doing extra I/O, too.
> >
> > Oracle partially mitigates the second effect by refusing to trash its
> > entire buffer cache on any given sequential scan.  Or so I've been told
> > by people who know Oracle well.  A repeat of the sequential scan will
> > still have to reread the entire table, but that's true anyway if the
> > table's at least one block longer than available cache.
>
> That is on our TODO list, at least.

Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
scans) would have interesting load-sensitive behavior: since Linux uses all
not-otherwise-in-use RAM as buffer cache, if you've got a low-load system,
even very large tables will be cached. Once other processes start competing
for RAM, your buffers go away. Bruce, what does xBSD do?

I like it, since anything that needs to be sensitive to system-wide
information, like the total load on the machine, should be handled by the
system, not a particular app.

Ross
Daniel Kalchev <daniel@digsys.bg> writes:
>>> I don't think I'd go as far as to lower random_page_cost to 1.0, but
>>> certainly there's a case for using an intermediate value.

> The question is: how does one find the proper value? That is, is it
> possible to design a planner benchmarking utility to aid in tuning
> Postgres?

The trouble is that getting trustworthy numbers requires huge test
cases, because you have to swamp out the effects of the kernel's own
buffer caching.  I spent about a week running 24-hour-constant-disk-
banging experiments when I came up with the 4.0 number we use now,
and even then I didn't feel that I had a really solid range of test
cases to back it up.

My advice to you is just to drop it to 2.0 and see if you like the plans
you get any better.

			regards, tom lane
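[Aside: a sketch of acting on that advice - random_page_cost is an ordinary
run-time setting, so it can be tried per session before being made permanent
in postgresql.conf:]

    SET random_page_cost = 2.0;   -- halve the assumed random-read penalty

    EXPLAIN
    SELECT sum(input) FROM iplog_gate200112
    WHERE '2001-12-01 00:00:00+02' <= ipdate
      AND ipdate < '2001-12-02 00:00:00+02'
      AND 'uni-gw' ~ router;
    -- if the plan flips to the ipdate indexscan and stays faster under
    -- real load, put the setting in postgresql.conf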
Ross J. Reedstrom wrote:
> Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
> scans) would have interesting load-sensitive behavior: since Linux uses
> all not-otherwise-in-use RAM as buffer cache, if you've got a low-load
> system, even very large tables will be cached. Once other processes start
> competing for RAM, your buffers go away. Bruce, what does xBSD do?

For people who run dedicated database services, simply not using pg buffers
for sequential scans is probably too simplistic - assuming one allocates a
very large pg buffer space, as I tend to do. Remember that shuffling data
between a pg buffer and an OS cache buffer takes cycles, and modern machines
tend to be starved for memory bandwidth - perhaps another reason why Oracle
requested and got writes that bypass the OS cache entirely? This bypasses
the byte-shuffling.

Of course, Oracle's preferred approach is to have you set up your OS
environment so that Oracle pretty much eats the machine. They tell you to
set SHMAX to 4GB in the installation docs, for instance, then the installer
by default will configure Oracle to use about 1/3 of your available RAM for
its buffer cache. Books on tuning generally tell you that's far too low.

Anyway, I've been told that Oracle's approach is more along the lines of
"don't cache sequential scans that eat up more than N% of our own cache
space". Then shorter tables still get fully cached when sequentially
scanned, while humongous tables don't wipe out the cache, causing dirty
pages to be flushed to the platter and other concurrent processes to do file
I/O reads because everything but the huge table has disappeared. Someone in
an earlier post mentioned "thrashing", and that's what dragging a table
bigger than cache causes on busy systems.

> I like it, since anything that needs to be sensitive to system-wide
> information, like the total load on the machine, should be handled by
> the system, not a particular app.
>
> Ross

--
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org
Daniel Kalchev wrote:
> >>>Tom Lane said:
> > "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:
> > > My preference would actually be a way to make the optimizer
> > > choose a plan that causes minimal workload, and not shortest runtime
> >
> > ?? I am not sure that I see the difference.
>
> There can be a difference only if the optimizer takes into account
> already executing plans (by other backends).
>
> > What I think you are saying is that when there's lots of competing work,
> > seqscans have less advantage over indexscans because the
> > sequential-access locality advantage is lost when the disk drive has to
> > go off and service some other request.
>
> This is exactly my point. The primary goal of the optimizer in my opinion
> should be to avoid thrashing. :-) Now, it is not easy to figure out when
> the system starts thrashing - at least not in a portable way I can think
> of immediately.

I have always felt some feedback mechanism from the executor back to the
optimizer was required, but I was never sure quite how to implement it.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
> > That is on our TODO list, at least.
>
> Hmm, on Linux this sort of behavior (skip the pg buffers for sequential
> scans) would have interesting load-sensitive behavior: since Linux uses
> all not-otherwise-in-use RAM as buffer cache, if you've got a low-load
> system, even very large tables will be cached. Once other processes start
> competing for RAM, your buffers go away. Bruce, what does xBSD do?

FreeBSD does, NetBSD will soon, not sure about the others. I believe NetBSD
will be tunable because page cache vs. i/o cache is not always best done
with a FIFO setup.

Also, when we pull from the kernel cache we have to read into our shared
buffers; much faster than disk i/o, but slower than if they were already in
the cache. For me the idea of doing non-cached sequential scans came from a
Solaris internals book I was reading. I think it will be possible to mark
large sequential scan buffer i/o with lower priority caching, which may help
performance. However, if others are also doing sequential scans of the same
table _only_, our normal caching may be best. As you can see, this gets
quite complicated, and I am doubtful there will be a general solution to
this problem --- more of a feedback loop may be the best bet --- determine
which sequential scans are wiping the cache with little other purpose and
start not caching them.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Tom Lane wrote:
> What I think you are saying is that when there's lots of competing work,
> seqscans have less advantage over indexscans because the
> sequential-access locality advantage is lost when the disk drive has to
> go off and service some other request.  If that's the mechanism, I think
> that the appropriate answer is just to reduce random_page_cost.  It's
> true that the current default of 4.0 was chosen using measurements on
> otherwise-unloaded systems.  If you assume that the system is (a) too
> busy to do read-aheads for you, and (b) has to move the disk arm to
> service other requests between each of your requests, then it's not
> clear that sequential reads have any performance advantage at all :-(.
> I don't think I'd go as far as to lower random_page_cost to 1.0, but
> certainly there's a case for using an intermediate value.

It is even more interesting than simple sequential-vs-random scan thinking.
Depending on the maker of the drive, even an unloaded system will move the
head randomly. Modern drives bear almost no resemblance to their
predecessors. Sectors are mapped however the OEM sees fit. A numerically
sequential read from a hard disk may have the drive heads bouncing all over
the disk, because the internal configuration of the disk has almost nothing
to do with the external representation.

Think about a RAID device. What does a sequential scan mean to a RAID
system? Very little, depending on how the image is constructed. Storage
devices are now black boxes. The only predictable advantage a "sequential
scan" can have on a modern computer is OS level caching.
mlw <markw@mohawksoft.com> writes:
> ... Storage devices are now black boxes. The only predictable advantage a
> "sequential scan" can have on a modern computer is OS level caching.

You mean read-ahead.  True enough, but that "only advantage" is very
significant.  The 4.0 number did not come out of the air, it came
from actual measurements.

I think the real point in this thread is that measurements on an idle
system might not extrapolate very well to measurements on a heavily
loaded system.  I can see the point, but I don't really have time to
investigate it right now.  I'd be willing to reduce the default value of
random_page_cost to something around 2, if someone can come up with
experimental evidence justifying it ...

			regards, tom lane
Bruce Momjian wrote:
> I have always felt some feedback mechanism from the executor back to the
> optimizer was required, but I was never sure quite how to implement it.

The folks at DEC (rdb???) wrote a paper on it a long time ago (duh, back
when DEC existed). I ran across it in the Tufts library about a year ago,
back when my girlfriend was in grad school.

--
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org
Bruce Momjian wrote:
> Also, when we pull from the kernel cache we have to read into our shared
> buffers; much faster than disk i/o, but slower than if they were already
> in the cache.

Yes ... this is one reason why folks like Oracle want to be able to bypass
the kernel cache.

> For me the idea of doing non-cached sequential scans came
> from a Solaris internals book I was reading.  I think it will be
> possible to mark large sequential scan buffer i/o with lower priority
> caching that may help performance.  However, if others are also doing
> sequential scans of the same table _only_, our normal caching may be
> best.  As you can see, this gets quite complicated and I am doubtful
> there will be a general solution to this problem --- more of a feedback
> loop may be the best bet --- determine which sequential scans are wiping
> the cache with little other purpose and start not caching them.

It would be interesting to learn more about the actual heuristic Oracle uses
(a straight percentage of the buffer cache? something based on concurrency?
I have no idea). The Oracle folks have got tons and tons of data on
real-world big, busy db installations to draw from when they investigate
such things.

--
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org
>>>Tom Lane said:
> mlw <markw@mohawksoft.com> writes:
> > ... Storage devices are now black boxes. The only predictable advantage
> > a "sequential scan" can have on a modern computer is OS level caching.
>
> You mean read-ahead.  True enough, but that "only advantage" is very
> significant.  The 4.0 number did not come out of the air, it came
> from actual measurements.

On what OS? Linux? Windows? BSD? OSF/1? System V? All these differ
significantly in how the buffer cache is managed. For example, the BSD 'soft
updates' will not penalize large directory updates, but will not do any good
for sequential reads (considering what was said already about modern disks).

SCSI tag queueing will significantly improve raw disk reads ('sequential'
again) because of the low overhead of host<->SCSI subsystem communication -
any decent SCSI host adapter will do bus-master DMA, without the
interference of the processor (simplified as much as to illustrate it :).
Today's IDE drives on PC hardware don't do that! Which is not to say that
only SCSI drive controllers are intelligent enough - I still remember an
older Motorola VME based UNIX system (that can now only serve the purpose of
coffee table :), where an MFM controller board had all the intelligence of
the SCSI subsystem, although it operated with 'dumb' MFM disks. So many
examples can be given here.

> I think the real point in this thread is that measurements on an idle
> system might not extrapolate very well to measurements on a heavily
> loaded system.  I can see the point, but I don't really have time to
> investigate it right now.  I'd be willing to reduce the default value of
> random_page_cost to something around 2, if someone can come up with
> experimental evidence justifying it ...

Agreed. My preference would be, if you have reasonable enough test data that
can be shared, for many people on different platforms to run performance
tests and come up with an array of recommended values for their particular
OS/hardware configuration. I believe these two items are most significant
for the tuning of an installation.

Daniel Kalchev
Don Baccus wrote:
>
> Zeugswetter Andreas SB SD wrote:
>
> > This is one of the main problems of the current optimizer which imho rather
> > aggressively chooses seq scans over index scans. During high load this does
> > not pay off.
>
> Bingo ... dragging huge tables through the buffer cache via a sequential
> scan guarantees that a) the next query sequentially scanning the same
> table will have to read every block again (if the table's longer than
> available PG and OS cache) b) on a high-concurrency system other queries
> end up doing extra I/O, too.
>
> Oracle partially mitigates the second effect by refusing to trash its
> entire buffer cache on any given sequential scan. Or so I've been told
> by people who know Oracle well. A repeat of the sequential scan will
> still have to reread the entire table but that's true anyway if the
> table's at least one block longer than available cache.

One radical way to get better-than-average cache behaviour in such pathological cases would be to discard a _random_ page instead of the LRU page (perhaps tuned to not select from the 1/N of pages that are MRU).

-------------
Hannu
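A minimal sketch of that replacement policy - random eviction with the most recently used 1/N of pages protected. The class name and the protection knob are made up for illustration:

    import random

    class RandomEvictCache:
        """Evict a random page, but never one of the most recently used 1/N."""

        def __init__(self, capacity, protect_fraction=4):
            self.capacity = capacity
            self.pages = []                           # page ids, oldest first
            self.protect = capacity // protect_fraction

        def access(self, page_id):
            if page_id in self.pages:
                self.pages.remove(page_id)            # hit: refresh recency
            elif len(self.pages) >= self.capacity:
                # Miss with a full cache: pick a random victim, excluding the
                # `protect` most recently used entries at the tail of the list.
                victim = random.randrange(len(self.pages) - self.protect)
                self.pages.pop(victim)
            self.pages.append(page_id)

    # Under a scan longer than the cache, pure LRU ends up with a 0% hit rate
    # on the next pass; random eviction keeps a scattered residue of the table
    # that later scans can reuse.
    cache = RandomEvictCache(capacity=100)
    for page in list(range(150)) * 2:   # two passes over a 150-page table
        cache.access(page)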
Hannu Krosing wrote:
> One radical way to get better-than-average cache behaviour in such
> pathological cases would be to discard a _random_ page instead of the
> LRU page (perhaps tuned to not select from the 1/N of pages that are
> MRU).

Yep, that's one of the ways to improve performance when the same table's being scanned sequentially multiple times, or where different queries sometimes scan it sequentially, other times by index. MRU would help if you're constantly doing sequential scans. So would flipping the scan order depending on what's in the cache :)

But none of these would mitigate the effects on other concurrent queries that don't query the large table at all.

--
Don Baccus
Portland, OR
http://donb.photo.net, http://birdnotes.net, http://openacs.org
Tom wrote:
> > The question is: how does one find the proper value? That is, is it
> > possible to design a planner benchmarking utility to aid in tuning
> > Postgres?
>
> The trouble is that getting trustworthy numbers requires huge test
> cases, because you have to swamp out the effects of the kernel's own
> buffer caching. I spent about a week running 24-hour-constant-disk-
> banging experiments when I came up with the 4.0 number we use now,
> and even then I didn't feel that I had a really solid range of test
> cases to back it up.

Well, I certainly think you did a great job on your test cases :-) If you look at Daniel's idle system he has a measured factor of ~4.8. The number is also (imho correctly) rather an underestimate than an overestimate. That is, I haven't been able to empirically prove mlw's point about modern disks not being sensitive to scans with larger blocks vs. random 8k. (My tests typically used raw devices, circumventing the OS completely, since I do the tests for Informix servers.)

> My advice to you is just to drop it to 2.0 and see if you like the plans
> you get any better.

Yup, maybe even lower. I cannot see a difference in disk IO throughput during seq or index scan in Daniel's test case during his normal workload test. This is because his system had a CPU bottleneck during his normal workload test. (Shows again how badly OSes distribute processes: 1 CPU serves 2 backends that have a CPU bottleneck, while 1 CPU sits idle.)

Andreas
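To see how much random_page_cost moves the break-even point, here is back-of-the-envelope arithmetic in the spirit of the planner's cost model - deliberately simplified (the real planner also charges CPU per-tuple costs and accounts for index/heap correlation), and all three table figures below are invented round numbers, not Daniel's actual statistics:

    def seqscan_cost(table_pages):
        # every page read sequentially at unit cost
        return table_pages * 1.0

    def indexscan_cost(matching_rows, index_pages, random_page_cost):
        # crude worst case: each index page and each matching heap tuple
        # is assumed to need its own random page fetch (no correlation)
        return (index_pages + matching_rows) * random_page_cost

    TABLE_PAGES = 20000   # assumed size of a table like iplog_gate200112
    INDEX_PAGES = 60      # assumed btree pages touched
    MATCHING = 8000       # assumed rows matching the predicate

    for rpc in (4.0, 2.0):
        seq = seqscan_cost(TABLE_PAGES)
        idx = indexscan_cost(MATCHING, INDEX_PAGES, rpc)
        winner = "index scan" if idx < seq else "seq scan"
        print("random_page_cost=%.1f  seq=%.0f  index=%.0f  -> %s"
              % (rpc, seq, idx, winner))

With these numbers the plan flips: at 4.0 the sequential scan looks cheaper (32240 vs 20000), while at 2.0 the index scan wins (16120 vs 20000) - which is exactly why the default matters so much near the break-even region.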
> -----Original Message-----
> From: Daniel Kalchev
>
> I tried this:
>
> db=# select * into iplog_test from iplog_gate200112;
> SELECT
> db=# create index iplog_test_ipaddr_idx on iplog_test(ipaddr);
> CREATE
> db=# cluster iplog_test_ipaddr_idx on iplog_test;
> CLUSTER
> db=# create index iplog_test_ipdate_idx on iplog_test(ipdate);
> CREATE
> db=# explain
> db-# SELECT sum(input), sum(output) FROM iplog_test
> db-# WHERE
> db-# '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02
> 00:00:00+02'

Is there only one ipdate value which satisfies the above where clause?

regards,
Hiroshi Inoue
> -----Original Message-----
> From: Hiroshi Inoue
>
> > -----Original Message-----
> > From: Daniel Kalchev
> >
> > I tried this:
> >
> > db=# explain
> > db-# SELECT sum(input), sum(output) FROM iplog_test
> > db-# WHERE
> > db-# '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02
> > 00:00:00+02'
>
> Is there only one ipdate value which satisfies the above where clause?

If '2001-12-01 00:00:00+02' is the unique ipdate value which satisfies '2001-12-01 00:00:00+02' <= ipdate AND ipdate < '2001-12-02 00:00:00+02', and CREATE INDEX preserves the physical order of equal keys, the index scan would see physically ordered tuples. There's no strangeness even if the scan is faster than a sequential scan.

regards,
Hiroshi Inoue
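The intuition here can be checked with a toy model: count how many heap-page fetches an index scan incurs when the index order matches the heap order versus when it doesn't. This is pure illustration (assuming only the most recent page stays buffered), not the planner's actual bookkeeping:

    import random

    ROWS_PER_PAGE = 50
    N_ROWS = 100000

    def heap_page_fetches(scan_order):
        """Count page reads, assuming the last-touched page stays buffered."""
        fetches, last_page = 0, None
        for row in scan_order:
            page = row // ROWS_PER_PAGE
            if page != last_page:
                fetches += 1
                last_page = page
        return fetches

    correlated = list(range(N_ROWS))                   # index order == heap order
    uncorrelated = random.sample(correlated, N_ROWS)   # index order unrelated

    print("correlated:  ", heap_page_fetches(correlated))    # ~ number of pages
    print("uncorrelated:", heap_page_fetches(uncorrelated))  # ~ number of rows

With perfect correlation the scan touches each heap page once, sequentially - about as cheap as a seq scan - while with no correlation nearly every tuple costs a fresh page fetch.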
> -----Original Message-----
> From: Daniel Kalchev [mailto:daniel@digsys.bg]
>
> Well, we already played around this by first creating new table
> ordered by random().

This seems to have little meaning if there's only one ipdate value which satisfies the WHERE clause and the index order matches the physical order of heap tuples.

I tried a very simple test case. I have a table which has 100000 tuples and made an index on a column whose value is always '' (i.e. the same value). Contrary to my expectation, CREATE INDEX doesn't seem to preserve the physical (input) order. However, it seems sufficiently ordered, and we could probably expect a scan using the index to be as fast as a sequential scan. The result (distribution of corresponding CTIDs) is as follows:

(3448, 17)
(3542, 26) -- (5172, 24)  consecutive 94539 tuples
(3542, 25) -- (3448, 18)  consecutive 5460 tuples

regards,
Hiroshi Inoue