But I think you might be on to something, because I manually collected a random sample of 30k rows (by explicitly generating 30k random TIDs), and I get this:
tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt from lineitem_sample group by 1) foo group by 1;
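(For completeness, the sample table itself was built roughly along these lines - a sketch, not the exact script, and the 100 max offset per page is just a guess, so generated TIDs that point past the end of a page simply match nothing:)

-- rough sketch: generate random (block, offset) pairs, turn them into TIDs,
-- and fetch the matching rows; the source TID is kept as src_tid so the
-- per-block distribution of the sample can be inspected later
CREATE TABLE lineitem_sample AS
SELECT r.tid AS src_tid, l.*
  FROM (SELECT format('(%s,%s)',
                      floor(random() * (SELECT relpages FROM pg_class
                                         WHERE relname = 'lineitem'))::bigint,
                      1 + floor(random() * 100)::int)::tid AS tid
          FROM generate_series(1, 30000)) r
  JOIN lineitem l ON l.ctid = r.tid;

(In practice this has to be repeated until the table actually holds 30k rows, because some of the generated TIDs don't match any row.)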
That's quite different from what ANALYZE gets, which effectively looks something like this (derived from the logs, so not perfectly accurate - I only have f1, ndistinct and nmultiple):
The sample is not truly random. The two-stage sampling method (first pick a set of blocks, then pick rows from within those blocks) causes too few blocks to have exactly one row chosen from them, and too many to have either 0 or 2+ rows chosen from them.
When values in the same block are likely to be equal, it finds too many duplicates, because it too often picks two or more rows from a single block.
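One way to check that directly (untested, and it assumes the sample keeps the source TID around, like the src_tid column in the sketch above) is to count how many sampled rows come from each heap block:

-- the text->point cast is just a trick to extract the block number from a TID
SELECT rows_per_block, count(*) AS num_blocks
  FROM (SELECT (src_tid::text::point)[0]::bigint AS blkno,
               count(*) AS rows_per_block
          FROM lineitem_sample
         GROUP BY 1) per_block
 GROUP BY 1
 ORDER BY 1;

For a truly random sample the per-block counts should look roughly Poisson; a block-first sample puts noticeably more weight on 0 and on 2+.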
If we assume all the blocks have the same tuple density, then it is easy to correct this. But without that assumption of constant tuple density, I don't know how to get a truly random sample while still skipping most of the table.
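For instance, if every block held exactly the same number of tuples (say 60), a truly random TID could be generated directly, by picking a tuple number uniformly from 0 .. nblocks * tuples_per_block - 1 and splitting it into a block and an offset. A sketch (the 60 tuples/block figure is made up, and this samples with replacement, so duplicates are possible):

-- sketch: truly random TIDs under the constant-density assumption
-- (60 tuples per block is an assumed figure, not measured)
WITH params AS (
    SELECT 60::bigint AS tuples_per_block,
           (SELECT relpages FROM pg_class
             WHERE relname = 'lineitem')::bigint AS nblocks
)
SELECT format('(%s,%s)', k / tuples_per_block,
                         k % tuples_per_block + 1)::tid AS tid
  FROM params,
       LATERAL (SELECT floor(random() * (nblocks * tuples_per_block))::bigint AS k
                  FROM generate_series(1, 30000)) s;

Without that assumption, the mapping from a uniform tuple number to a TID depends on the per-block tuple counts, which you don't know without visiting the blocks.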