Re: postgres_fdw: using TABLESAMPLE to collect remote sample - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Msg-id: 03182011-10a4-97b4-7c70-d99aa1c11eed@enterprisedb.com
In response to: Re: postgres_fdw: using TABLESAMPLE to collect remote sample (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: postgres_fdw: using TABLESAMPLE to collect remote sample
List: pgsql-hackers
Hi!

Here's an updated patch, including all the changes proposed by Tom in his review. There were two more points left [1], namely:

1) the useless test added to postgres_fdw.sql, consuming a lot of time

I decided to just rip this out and replace it with a much simpler test that simply changes the sampling method and does ANALYZE. Ideally we'd also verify we actually generated the right SQL query to be executed on the remote server, but there doesn't seem to be a good way to do that (e.g. we can't use EXPLAIN to show the "Remote SQL" for ANALYZE). So I guess just changing the method and running ANALYZE will have to be enough for now (the coverage report should at least tell us we got to all the SQL variants).
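Just to illustrate, the simplified test boils down to something like this (a sketch only - the table name is hypothetical, and I'm assuming the analyze_sampling option name and values from the patch):

    -- switch the sampling method on the foreign table, then re-run ANALYZE,
    -- exercising each of the SQL variants the FDW can generate
    ALTER FOREIGN TABLE analyze_ftable OPTIONS (ADD analyze_sampling 'bernoulli');
    ANALYZE analyze_ftable;

    ALTER FOREIGN TABLE analyze_ftable OPTIONS (SET analyze_sampling 'system');
    ANALYZE analyze_ftable;

    ALTER FOREIGN TABLE analyze_ftable OPTIONS (SET analyze_sampling 'off');
    ANALYZE analyze_ftable;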
2) the logic disabling sampling when the fraction gets too high

I based the condition on the absolute number of "unsampled" rows, while Tom suggested maybe it should be just a simple condition on sample_frac (e.g. 95%). But neither of us was sure what a good value would be. So I did a couple of tests to get at least some idea of what a good threshold might be, and it turns out the threshold is mostly useless. I'll discuss the measurements I did in a bit (some of the findings are a bit amusing or even hilarious, actually), but the main reasons are:

- The sampling overhead is negligible (compared to transferring the data, even on localhost), and the behavior is very smooth, so sampling almost never costs more than reading everything - unless sample_frac >= 0.99, and even there it's a tiny difference. So there's almost no risk, I think.

- For small tables it doesn't really matter. It's going to be fast anyway, and the difference between reading 10000 or 11000 rows is going to be just noise. Who cares if this takes 10 or 11 ms ...

- For large tables, we'll never even get to these high sample_frac values. Imagine a table with 10M rows - the highest stats target we allow is 10k, and the sample size is 300 * target, so 3M rows. It doesn't matter if the condition is 0.95 or 0.99, because for this table we'll never ask for a sample above 30%.

- For the tables in between it might be more relevant, but the simple truth is that reading a row and sampling it remotely is way cheaper than the network transfer, even on localhost. The data suggest that reading+sampling a row costs ~0.2us at most, while sending it costs ~1.5us (localhost) or ~5.5us (local network).

So I just removed the threshold from the patch, and we'll request sampling even with sample_frac=100% (if that happens).

sampling test
-------------

I did a simple test to collect some data - create a table, and sample various fractions of it in the ways discussed in this patch, either locally or through the FDW. This required a bit of care to ensure the sampling happens in the right place (e.g. that we don't push the random() or TABLESAMPLE down), which I did by pointing the FDW table not at a remote table, but at a view with an optimization fence (OFFSET 0). See the run-local.sh script.

The script populates the table with different numbers of rows, samples different fractions of it, etc. I also ran the test from a different machine, to see what a bit more network latency would do (that's what run-remote.sh is for).
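To sketch the setup (schematic only - the table, column and server names here are hypothetical, the real steps are in run-local.sh):

    -- OFFSET 0 acts as an optimization fence, so sampling applied on top
    -- of the foreign table cannot be pushed down into the scan of "t"
    CREATE VIEW t_fenced AS SELECT * FROM t OFFSET 0;

    -- point the foreign table at the view instead of the table itself
    CREATE FOREIGN TABLE ft (a int, b text)
        SERVER loopback OPTIONS (table_name 't_fenced');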
results (fdw-sampling-test.pdf)
-------------------------------

The attached PDF shows the results - the first page is for the foreign table in the same instance (i.e. localhost, latency ~0.01ms), the second page is for the FDW pointing at a machine on the same network (latency ~0.1ms). The left column is always the table accessed "directly", the right column is through the FDW.

On the x-axis is the fraction of the table we sample, on the y-axis the duration in milliseconds. "full" means "reading everything" (i.e. what the FDW does now); the other options should be clear, I think. The first two dataset sizes (10k and 10M rows) are small (10M is ~2GB) and fit into RAM, which is 8GB. The 100M dataset is ~21GB, so much larger than RAM.

In the "direct" (non-FDW) sampling, the various sampling methods start losing to a seqscan fairly soon - "random" is consistently slower, "bernoulli" starts losing at ~30%, "system" at ~80%. This is not very surprising, particularly for bernoulli/random, which actually read all the rows anyway. But the overhead is limited to ~30% on top of the seqscan.

But in the "FDW" sampling (right column) it's an entirely different story, and all the methods clearly win over just transferring everything and only then doing the sampling. Who cares if remote sampling means we have to pay 0.2us instead of 0.15us per row, when the transfer costs 1.5us per row?

The 100M case shows an interesting behavior for the "system" method, which quickly spikes to ~2x of the "full" method when sampling ~20% of the table, and then gradually improves again. My explanation is that this is due to "system" making the I/O pattern increasingly random, because it skips blocks in a way that makes readahead impossible. And then, as the fraction increases, it becomes more sequential again. All the other methods are pretty much equal to just scanning everything sequentially and sampling rows one by one. The "system" method in TABLESAMPLE would probably benefit from explicit prefetching, I guess.

For ANALYZE this is probably not a huge issue, as we'll never sample this large a fraction of a large table (for 100M rows we peak at ~3% with target 10000, which is way before the peak). And for smaller tables we're more likely to hit cache (which is why the smaller datasets don't have this issue). But for explicit TABLESAMPLE queries that may not be the case. Although ANALYZE uses something like "system" to sample rows too, no?

However, even this is not an issue for the FDW case - there the remote sample still clearly wins over the current "local sample" approach, because transferring the data is so expensive that it turns the "peak" into a tiny hump.

The second page (different machine, local network) tells the same story, except that the differences are even clearer.

ANALYZE test
------------

So I changed the patch, and did a similar test by running ANALYZE either on the local or the foreign table, using the different sampling methods. This does not require the hacks to prevent pushdown etc., but it also means we can't set sample_frac directly, only through the statistics target (which is capped at 10k). In this case, "local" means ANALYZE on the local table (which you might think of as the "baseline"), and "off" means reading all the data without remote sampling.
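Just to make the target-to-fraction mapping concrete, a trivial query computing the numbers cited above (sample size being 300 * target, per ANALYZE's usual sizing):

    SELECT target,
           300 * target AS sample_rows,
           round(100.0 * 300 * target / 10000000, 2)  AS pct_of_10m,
           round(100.0 * 300 * target / 100000000, 2) AS pct_of_100m
    FROM (VALUES (100), (1000), (10000)) AS t(target);

With target=10000 that's 3M rows, i.e. 30% of the 10M table but only 3% of the 100M one, which is why the larger table never gets anywhere near the high sample fractions.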
For the two smaller datasets (10k and 10M rows), the benefits are pretty obvious. We're very close to the "local" results, because we save a lot by copying only some of the rows. For 10M we only get to ~30% before we hit target=10k, which is why we don't see it get closer to "off".

But now we get to the *hilarious* thing - if you look at the 10M results, you may notice that *all* the sampling methods beat ANALYZE on the local table. For "system" (which wins from the very beginning) we might make some argument that the algorithm is simpler than what ANALYZE does, skips blocks differently, etc. - perhaps ...

But bernoulli/random are pretty much ideal sampling, reading/sampling all the rows. And yet both methods start winning after crossing ~1% on this table. In other words, it's about 3x faster to ANALYZE a table through the FDW than directly ;-)

Anyway, I don't think those issues really affect this patch - I believe the results show that what the patch does is reasonable.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company