Gary Cowell <gary_cowell@yahoo.co.uk> writes:
> ->  Sort  (cost=117865.77..119220.13 rows=541741 width=132) (actual time=63623.417..66127.641 rows=541741 loops=1)
This is clearly where the time is going.
> sort_mem = 16384
Probably not enough for this problem. The estimated data size is
upwards of 60 meg (132 bytes * half a mil rows); allowing for per-row
overhead I suspect that you'd need sort_mem approaching 100 meg for
a fully-in-memory sort. (Also I'd take the width=132 with a *big*
grain of salt, unless you have reason to know that it's accurate.)
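If you want to experiment: sort_mem is measured in kilobytes and can
be changed per-session, so something like the below (the exact number
is a guess based on the estimate above) would show whether an
all-in-memory sort fixes it:

    SET sort_mem = 102400;  -- roughly 100 meg, this session only
    -- then re-run EXPLAIN ANALYZE on your query
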
The on-disk sorting algorithm that we use is designed to favor minimum
disk space consumption over speed. It has a fairly nonrandom access
pattern that can be pretty slow if your disks don't have good seek-time
specs.
I don't know whether Oracle's performance advantage is because they're
not swapping the sort to disk at all, or because they use a different
on-disk sort method with a more sequential access pattern.
[... thinks for a while ...] It seems possible that they use sort
code that knows it is performing a DISTINCT operation and discards
duplicates on sight. Given that there are only 534 distinct values,
the sort would easily stay in memory if that were happening.
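(Incidentally, if you're on 7.4 or later you can get essentially that
discard-duplicates-on-sight behavior today by writing GROUP BY instead
of DISTINCT; the planner can then use a hash aggregate, and with only
534 distinct values the hash table stays trivially small. With
stand-in table/column names:

    -- tab/col are placeholders for your real names
    EXPLAIN ANALYZE SELECT col FROM tab GROUP BY col;

)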
It would be interesting to compare Oracle and PG times for a straight
sort of half a million rows, without the DISTINCT part; that would
give us a clue whether they simply have much better sort technology,
or whether they have a special optimization for sort+unique.
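On the PG side that test would look about like this, with your real
table and column plugged in; run the equivalent pair on the Oracle
side for comparison:

    -- straight sort, no duplicate elimination
    EXPLAIN ANALYZE SELECT col FROM tab ORDER BY col;
    -- sort plus unique
    EXPLAIN ANALYZE SELECT DISTINCT col FROM tab;
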
regards, tom lane