Tuple sampling - Mailing list pgsql-hackers
From | Manfred Koizar
---|---
Subject | Tuple sampling
Date |
Msg-id | 467780tvbj8tln1eutjp9q0ks9mvn4ie9h@email.aon.at
List | pgsql-hackers
The proposed new sampling method (http://archives.postgresql.org/pgsql-hackers/2004-04/msg00036.php and http://archives.postgresql.org/pgsql-patches/2004-04/msg00045.php) basically incorporates two independent changes:

(1) Two-stage sampling: Stage one collects a random sample of pages, stage two collects a random sample of tuples out of these pages.

(2) Improved row count estimation: Once a page is read in, all tuples on this page are looked at to see whether they are active (visible, live) or dead. The estimated row count is calculated from the number of active tuples seen, the number of visited pages, and the total number of pages.
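To make the two changes a bit more concrete, here is a minimal standalone sketch. It is not the patch: the table and the visibility checks are faked by a toy live_rows_on_page() function, stage one uses plain selection sampling over the block numbers, stage two is reduced to counting the live tuples it would otherwise sample, and all constants are invented for illustration.

```c
/*
 * Minimal standalone sketch, not the patch: pages and tuple
 * visibility are faked, stage one uses plain selection sampling
 * (Knuth 3.4.2, Algorithm S) over the block numbers, and stage two
 * is reduced to counting live tuples.  All constants are invented.
 */
#include <stdio.h>
#include <stdlib.h>

#define TOTAL_PAGES  1000		/* relation size in blocks (made up) */
#define SAMPLE_PAGES 30			/* stage-one sample size (made up) */

/* toy page: 20..29 tuples, roughly 90% of them live */
static int
live_rows_on_page(int blockno)
{
	int		total = 20 + blockno % 10;

	return (total * 9) / 10;
}

int
main(void)
{
	double	live_seen = 0.0;
	int		pages_visited = 0;
	int		blockno;

	srand(42);

	for (blockno = 0; blockno < TOTAL_PAGES; blockno++)
	{
		int		blocks_left = TOTAL_PAGES - blockno;
		int		still_needed = SAMPLE_PAGES - pages_visited;
		double	u = (double) rand() / ((double) RAND_MAX + 1.0);

		/* stage one: select this block with probability still_needed/blocks_left */
		if (u * blocks_left >= still_needed)
			continue;

		/*
		 * Stage two would reservoir-sample individual tuples of this
		 * block into the row sample; change (2) only needs the number
		 * of active (live) tuples seen along the way.
		 */
		pages_visited++;
		live_seen += live_rows_on_page(blockno);

		if (pages_visited >= SAMPLE_PAGES)
			break;
	}

	/* change (2): scale the live tuples seen up to the whole relation */
	printf("visited %d pages, estimated %.0f live rows\n",
		   pages_visited,
		   live_seen / pages_visited * (double) TOTAL_PAGES);
	return 0;
}
```

The last printf is all there is to change (2): live tuples seen, divided by pages visited, multiplied by the total number of pages in the relation.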
The preview patch implements three sampling methods: method 0 is the original sampling method, method 1 uses both changes, and method 2 is essentially change (1) without change (2). Theoretically we could apply only the second change, if we come to the conclusion that we do not want two-stage sampling, but I don't have test results for this.

I'd like to present my *interpretation* of the results I got from testing various sampling methods. Whoever is interested in looking at the raw test results can get them from http://www.pivot.at/pg/SamplePerf2.sxc or (for those still without OOo) http://www.pivot.at/pg/SamplePerf2.html.

This set of tests measures only the sampling times; actually analysing the sample and storing the results in pg_statistic has been disabled. To eliminate as many sources of noise as possible, the following steps have been taken prior to each test run:

. SELECT count(*) FROM <testtable>; to make sure that all visibility bits in the tuple headers are set correctly,
. tar cf /dev/null <something> to fill the OS cache with unrelated stuff,
. SELECT count(*) FROM <othertable>; to fill the shared buffers with unrelated data,
. CHECKPOINT; to avoid a checkpoint happening during the test run.

The tests have been performed with two types of tables. The tables named x, x2, x3, x3d have initially been copied from tenk1 in the regression database; these tables have between 20 and 30 tuples per page. Tables of the other type -- named y, y3, y3d -- have much smaller tuples, ca. 150 per page.

Method 1 vs. method 2: With large tuples method 2 is sometimes faster, sometimes slower than method 1, never by enough to worry about. This has been tested for statistics targets 10 and 100. With small tuples and statistics targets 10 to 100 there is no reproducible difference up to 490 pages (80000 tuples); starting at 1900 pages (320000 tuples) method 2 is consistently faster. The difference can easily be explained by hundreds of thousands or even millions of heap_fetch() calls. OTOH with method 2 we get row count estimation errors of up to 20%, compared to less than 1% for method 1. So I conclude that change (2) is well worth the CPU cycles it costs. Method 0 produced estimation errors of up to 60%.

What about two-stage sampling? Comparing the speeds of method 0 and method 2 we get the following pattern, which is basically the same for all statistics targets I tried. Sample size 3000 (30000): For tables smaller than 1000 (5000) pages all sampling methods access all or almost all pages and there is hardly any runtime difference. For table sizes between 1000 (5000) and 3000 (30000) pages the new method reads the whole table while the old method starts to skip pages; this doesn't result in a significant runtime difference, however. If the table size grows beyond 3000 (30000) pages, the number of page reads continues to grow only for the old method, but up to ca. 30000 (300000) pages the new method is not much faster (if at all). For tables larger than 30000 (300000) pages two-stage sampling is reliably faster.

Other pros and cons of two-stage sampling:

Pro and con: Cache pollution. Two-stage sampling reads the same number of pages as one-stage sampling for small tables, slightly more for medium sized tables (worst case is ca. 15%), and a little less to far less for large to very large tables (3000 vs. 20000 for 430000 pages with statistics target 10, 30000 vs. 98000 for the same table with statistics target 100).

Con: In Tom's words, "There's still a risk of not being able to collect N rows out of N blocks, if you are unfortunate enough to select a lot of wholly-empty pages. But that seems like a low-probability scenario; besides such a table would be so desperately in need of VACUUM FULL that the possible low quality of the stats hardly matters."

Con: The sample generated is not perfectly random (cf. the discussion starting at http://archives.postgresql.org/pgsql-performance/2004-04/msg00196.php). If somebody has an idea how we could steer against that effect of collecting tuples from too few different pages, I'd be happy to implement it. If nothing pops up, I think the current consensus is that we don't care.

Anything else?

Servus
Manfred