Re: postgres_fdw: using TABLESAMPLE to collect remote sample - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Date
Msg-id 03182011-10a4-97b4-7c70-d99aa1c11eed@enterprisedb.com
In response to Re: postgres_fdw: using TABLESAMPLE to collect remote sample  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
Hi!

Here's an updated patch, including all the changes proposed by Tom in
his review. There were two more points left [1], namely:

1) The useless test added to postgres_fdw.sql consuming a lot of time.

I decided to just rip this out and replace it with a much simpler test
that simply changes the sampling method and does ANALYZE. Ideally we'd
also verify we actually generated the right SQL query to be executed on
the remote server, but there doesn't seem to be a good way to do that
(e.g. we can't do EXPLAIN to show the "Remote SQL" on the "ANALYZE").

So I guess just changing the method and running ANALYZE will have to be
enough for now (coverage report should at least tell us we got to all
the SQL variants).
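Since there's no easy way to display the "Remote SQL" for an ANALYZE, here is roughly what the generated query shapes look like for each method, sketched in Python (the helper name and the exact query text are my illustration, not the patch's actual C code):

```python
# Illustrative sketch of the remote-query shapes per sampling method;
# remote_sample_sql() is a made-up helper, not code from the patch.
def remote_sample_sql(table, method, sample_frac):
    if method == "off":
        # current behavior: transfer everything, sample locally
        return f"SELECT * FROM {table}"
    if method == "random":
        # filter rows on the remote side with random()
        return f"SELECT * FROM {table} WHERE random() < {sample_frac:g}"
    # "system" / "bernoulli" use the TABLESAMPLE clause, which
    # takes a percentage rather than a fraction
    return f"SELECT * FROM {table} TABLESAMPLE {method.upper()} ({sample_frac * 100:g})"

print(remote_sample_sql("public.t", "bernoulli", 0.25))
# SELECT * FROM public.t TABLESAMPLE BERNOULLI (25)
```

A coverage report over code like this at least tells us every variant got exercised, even if we can't eyeball the generated SQL in the regression output.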


2) the logic disabling sampling when the fraction gets too high

I based the condition on the absolute number of "unsampled" rows, Tom
suggested maybe it should be just a simple condition on sample_frac
(e.g. 95%). But neither of us was sure what would be a good value.

So I did a couple tests to get at least some idea what would be a good
threshold, and it turns out the threshold is mostly useless. I'll
discuss the measurements I did in a bit (some of the findings are a bit
amusing or even hilarious, actually), but the main reasons are:

- The sampling overhead is negligible (compared to transferring the
data, even on localhost), and the behavior is very smooth. So remote
sampling almost never costs more than reading everything, unless
sample_frac >= 0.99, and even there the difference is tiny. So there's
almost no risk, I think.

- For small tables it doesn't really matter. It's going to be fast
anyway, and the difference between reading 10000 or 11000 rows is going
to be just noise. Who cares if this takes 10 or 11 ms ...

- For large tables, we'll never even get to these high sample_frac
values. Imagine a table with 10M rows - the highest stats target we
allow is 10k, and the sample size is 300 * target, so 3M rows. It
doesn't matter if the condition is 0.95 or 0.99, because for this table
we'll never ask for a sample above 30%.

- For the tables in between it might be more relevant, but the simple
truth is that reading the row and sampling it remotely is way cheaper
than the network transfer, even on localhost. The data suggest that
reading+sampling a row costs ~0.2us at most, while sending it costs
~1.5us (localhost) or ~5.5us (local network).

So I just removed the threshold from the patch, and we'll request
sampling even with sample_frac=100% (if that happens).
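The back-of-envelope model behind dropping the threshold can be sketched like this (a simplified linear cost model using the measured per-row costs above; the real behavior is smoother than this):

```python
# Simplified per-row cost model (microseconds), using the measured
# numbers above: remote sampling pays read+sample for every row but
# transfers only the sampled fraction, while "full" transfers all rows.
def cost_remote_sample(nrows, sample_frac, sample_us=0.2, transfer_us=1.5):
    return nrows * sample_us + nrows * sample_frac * transfer_us

def cost_full_transfer(nrows, transfer_us=1.5):
    return nrows * transfer_us

# Remote sampling only loses once sample_frac exceeds this fraction:
def breakeven_frac(sample_us=0.2, transfer_us=1.5):
    return 1 - sample_us / transfer_us

print(round(breakeven_frac(), 2))                 # localhost: 0.87
print(round(breakeven_frac(transfer_us=5.5), 2))  # local network: 0.96
```

And since 0.2us is an upper bound on the sampling cost, the actual crossover sits even closer to 1.0, which is why a threshold buys us essentially nothing.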


sampling test
-------------

I did a simple test to collect some data - create a table, and sample
various fractions in the ways discussed in this patch - either locally
or through a FDW.

This required a bit of care to ensure the sampling happens in the right
place (and e.g. we don't push the random() or tablesample down), which I
did by pointing the FDW table not to a remote table, but to a view with
an optimization fence (OFFSET 0). See the run-local.sh script.

The script populates the table with different numbers of rows, samples
different fractions of it, etc. I also did this from a different
machine, to see what a bit more network latency would do (that's what
run-remote.sh is for).


results (fdw-sampling-test.pdf)
-------------------------------

The attached PDF shows the results - first page is for the foreign table
in the same instance (i.e. localhost, latency ~0.01ms), second page is
for FDW pointing to a machine in the same network (latency ~0.1ms).

Left column is always the table "directly", right column is through the
FDW. On the x-axis is the fraction of the table we sample, y-axis is
duration in milliseconds. "full" means "reading everything" (i.e. what
the FDW does now), the other options should be clear I think.

The first two dataset sizes (10k and 10M rows) are small - 10M is ~2GB,
so both fit into RAM (8GB). The 100M data set is ~21GB, so much larger.

In the "direct" (non-FDW) sampling, the various sampling methods start
losing to seqscan fairly soon - "random" is consistently slower,
"bernoulli" starts losing at ~30%, "system" at ~80%. This is not very
surprising, particularly for bernoulli/random which actually read all
the rows anyway. But the overhead is pretty limited to ~30% on top of
the seqscan.

But in the "FDW" sampling (right column), it's an entirely different story,
and all the methods clearly win over just transferring everything and
only then doing the sampling.

Who cares if the remote sampling means we have to pay 0.2us
instead of 0.15us (per row), when the transfer costs 1.5us per row?

The 100M case shows an interesting behavior for the "system" method,
which quickly spikes to ~2x of the "full" method when sampling ~20% of
the table, and then gradually improves again.

My explanation is that this is due to "system" making the I/O pattern
increasingly more random, because it jumps blocks in a way that makes
readahead impossible. And then as the fraction increases, it becomes
more sequential again.

All the other methods are pretty much equal to just scanning everything
sequentially, and sampling rows one by one.

The "system" method in TABLESAMPLE would probably benefit from explicit
prefetching, I guess. For ANALYZE this is probably not a huge issue, as
we'll never sample such a large fraction of a large table (for 100M rows we peak
at ~3% with target 10000, which is way before the peak). And for smaller
tables we're more likely to hit cache (which is why the smaller data
sets don't have this issue). But for explicit TABLESAMPLE queries that
may not be the case.

Although, ANALYZE uses something like "system" to sample rows too, no?

However, even this is not an issue for the FDW case - there, it still
clearly wins over the current "local sample" approach, because
transferring the data is so expensive that it turns the "peak" into a
tiny hump.

The second page (different machine, local network) tells the same story,
except that the differences are even clearer.


ANALYZE test
------------

So I changed the patch, and did a similar test by running ANALYZE either
on the local or foreign table, using the different sampling methods.
This does not require the hacks to prevent pushdown etc. but it also
means we can't determine sample_frac directly, only through statistics
target (which is capped to 10k).

In this case, "local" means ANALYZE on the local table (which you might
think of as the "baseline"), and "off" means reading all data without
remote sampling.

For the two smaller data sets (10k and 10M rows), the benefits are
pretty obvious. We're very close to the "local" results, because we save
a lot on copying only some of the rows. For 10M we only get to ~30%
before we hit target=10k, which is why we don't see it get closer to "off".
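The statistics-target cap is what limits the reachable sample_frac here; the arithmetic (sample size = 300 * target, target capped at 10000) can be sketched as:

```python
# ANALYZE samples 300 * statistics_target rows, and the target is
# capped at 10000, so the sampled fraction shrinks with table size.
def max_sample_frac(nrows, stats_target=10_000):
    return min(1.0, 300 * stats_target / nrows)

for nrows in (10_000, 10_000_000, 100_000_000):
    print(f"{nrows:>11,} rows -> {max_sample_frac(nrows):.2f}")
# 10k rows sample everything (1.00), 10M rows cap at 0.30 (~30%),
# and 100M rows at 0.03 (~3%), matching the fractions mentioned above.
```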

But now we get to the *hilarious* thing - if you look at the 10M result,
you may notice that *all* the sampling methods beat ANALYZE on the local
table.

For "system" (which wins from the very beginning) we might make some
argument that the algorithm is simpler than what ANALYZE does, skips
blocks differently, etc. - perhaps ...

But bernoulli/random are pretty much ideal sampling, reading/sampling
all rows. And yet both methods start winning after crossing ~1% on this
table. In other words, it's about 3x faster to ANALYZE a table through
FDW than directly ;-)

Anyway, I don't think those issues have much impact on this patch. I
believe the results show that what the patch does is reasonable.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
