Tuple sampling - Mailing list pgsql-hackers
From | Manfred Koizar
---|---
Subject | Tuple sampling
Date |
Msg-id | 467780tvbj8tln1eutjp9q0ks9mvn4ie9h@email.aon.at
List | pgsql-hackers
The proposed new sampling method (http://archives.postgresql.org/pgsql-hackers/2004-04/msg00036.php and http://archives.postgresql.org/pgsql-patches/2004-04/msg00045.php) basically incorporates two independent changes:

(1) Two-stage sampling: Stage one collects a random sample of pages, stage two collects a random sample of tuples out of these pages.

(2) Improved row count estimation: Once a page is read in, all tuples on this page are looked at to see whether they are active (visible, live) or dead. The estimated row count is calculated from the number of active tuples seen, the number of visited pages, and the total number of pages.
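To make the two changes a bit more concrete, here is a minimal standalone sketch. It is not the patch: the table and the visibility checks are faked by a toy live_rows_on_page() function, stage one uses plain selection sampling over the block numbers, stage two is reduced to counting the live tuples it would otherwise sample, and all constants are invented for illustration.

```c
/*
 * Minimal standalone sketch, not the patch: pages and tuple
 * visibility are faked, stage one uses plain selection sampling
 * (Knuth 3.4.2, Algorithm S) over the block numbers, and stage two
 * is reduced to counting live tuples.  All constants are invented.
 */
#include <stdio.h>
#include <stdlib.h>

#define TOTAL_PAGES  1000		/* relation size in blocks (made up) */
#define SAMPLE_PAGES 30			/* stage-one sample size (made up) */

/* toy page: 20..29 tuples, roughly 90% of them live */
static int
live_rows_on_page(int blockno)
{
	int		total = 20 + blockno % 10;

	return (total * 9) / 10;
}

int
main(void)
{
	double	live_seen = 0.0;
	int		pages_visited = 0;
	int		blockno;

	srand(42);

	for (blockno = 0; blockno < TOTAL_PAGES; blockno++)
	{
		int		blocks_left = TOTAL_PAGES - blockno;
		int		still_needed = SAMPLE_PAGES - pages_visited;
		double	u = (double) rand() / ((double) RAND_MAX + 1.0);

		/* stage one: select this block with probability still_needed/blocks_left */
		if (u * blocks_left >= still_needed)
			continue;

		/*
		 * Stage two would reservoir-sample individual tuples of this
		 * block into the row sample; change (2) only needs the number
		 * of active (live) tuples seen along the way.
		 */
		pages_visited++;
		live_seen += live_rows_on_page(blockno);

		if (pages_visited >= SAMPLE_PAGES)
			break;
	}

	/* change (2): scale the live tuples seen up to the whole relation */
	printf("visited %d pages, estimated %.0f live rows\n",
		   pages_visited,
		   live_seen / pages_visited * (double) TOTAL_PAGES);
	return 0;
}
```

The last printf is all there is to change (2): live tuples seen, divided by pages visited, multiplied by the total number of pages in the relation.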
The preview patch implements three sampling methods: method 0 is the original sampling method, method 1 uses both changes, and method 2 is essentially change (1) without change (2). Theoretically we could apply only the second change, if we come to the conclusion that we do not want two-stage sampling, but I don't have test results for this.

I'd like to present my *interpretation* of the results I got from testing various sampling methods. Whoever is interested in looking at the raw test results can get them from http://www.pivot.at/pg/SamplePerf2.sxc or (for those still without OOo) http://www.pivot.at/pg/SamplePerf2.html.

This set of tests measures only the sampling times; actually analysing the sample and storing the results in pg_statistic has been disabled. To eliminate as many sources of noise as possible, the following steps have been taken prior to each test run:

. SELECT count(*) FROM <testtable>; to make sure that all visibility bits in the tuple headers are set correctly,
. tar cf /dev/null <something> to fill the OS cache with unrelated stuff,
. SELECT count(*) FROM <othertable>; to fill the shared buffers with unrelated data,
. CHECKPOINT; to avoid a checkpoint happening during the test run.

The tests have been performed with two types of tables. The tables named x, x2, x3, x3d have initially been copied from tenk1 in the regression database; these tables have between 20 and 30 tuples per page. Tables of the other type -- named y, y3, y3d -- have much smaller tuples, ca. 150 per page.

Method 1 vs. method 2: With large tuples method 2 is sometimes faster, sometimes slower than method 1, never by enough to worry about. This has been tested for statistics targets 10 and 100. With small tuples and statistics targets 10 to 100 there is no reproducible difference up to 490 pages (80000 tuples); starting at 1900 pages (320000 tuples) method 2 is consistently faster. The difference can easily be explained by hundreds of thousands or even millions of heap_fetch() calls. OTOH with method 2 we get row count estimation errors of up to 20%, compared to less than 1% for method 1. So I conclude that change (2) is well worth the CPU cycles it costs. Method 0 produced estimation errors of up to 60%.

What about two-stage sampling? Comparing the speeds of method 0 and method 2 we get the following pattern, which is basically the same for all statistics targets I tried. Sample size 3000 (30000): For tables smaller than 1000 (5000) pages all sampling methods access all or almost all pages and there is hardly any runtime difference. For table sizes between 1000 (5000) and 3000 (30000) pages the new method reads the whole table while the old method starts to skip pages; this doesn't result in a significant runtime difference, however. If the table size grows beyond 3000 (30000) pages, the number of page reads continues to grow only for the old method, but up to ca. 30000 (300000) pages the new method is not much faster (if at all). For tables larger than 30000 (300000) pages two-stage sampling is reliably faster.

Other pros and cons of two-stage sampling:

Pro and con: Cache pollution. Two-stage sampling reads the same number of pages as one-stage sampling for small tables, slightly more for medium sized tables (worst case is ca. 15%), and a little less to far less for large to very large tables (3000 vs. 20000 for 430000 pages with statistics target 10, 30000 vs. 98000 for the same table with statistics target 100).

Con: In Tom's words, "There's still a risk of not being able to collect N rows out of N blocks, if you are unfortunate enough to select a lot of wholly-empty pages. But that seems like a low-probability scenario; besides such a table would be so desperately in need of VACUUM FULL that the possible low quality of the stats hardly matters."

Con: The sample generated is not perfectly random (cf. the discussion starting at http://archives.postgresql.org/pgsql-performance/2004-04/msg00196.php). If somebody has an idea how we could steer against that effect of collecting tuples from too few different pages, I'd be happy to implement it. If nothing pops up, I think the current consensus is that we don't care.

Anything else?

Servus
Manfred