Re: [HACKERS] qsort again (was Re: Strange Create - Mailing list pgsql-performance

From PFC
Subject Re: [HACKERS] qsort again (was Re: Strange Create
Date
Msg-id op.s435ywc1cigqcu@apollo13
Whole thread Raw
In response to Re: [HACKERS] qsort again (was Re: Strange Create  (Markus Schaber <schabi@logix-tt.com>)
List pgsql-performance
    Has anybody got profiler data on the amount of time spent in
comparisons during a sort? The proposals here would give the most
gains on simple types like INTEGER, so it would be interesting to know
how much time is currently spent in comparisons when sorting a column of
ints. If it's something like 10% of the total time, well...

    More hand-waving:

    What are the use cases for sorts?

    - potentially huge data sets: CREATE INDEX, big joins, reporting
queries, etc.
    - small data sets: typically a query with an ORDER BY that returns a
small number of rows (website stuff), or joins not small enough to use a
HashAggregate, but not big enough to justify creating an index just for
them.

    The cost of a comparison vs. moving and fetching data is probably
very different in these two cases. If everything fits neatly in
sort_mem, you can do fancy things (like sorting on SortKeys), which will
need to access the data in almost random order when the time comes to hand
the sorted data back. So I guess the SortKey idea would rather apply to
the latter case only, which is CPU-limited.

    Anyway, I was wondering about queries with multipart keys, like ORDER BY
zipcode, client_name, date and the like. Using just an int64 as the key
isn't going to help much here. Why not use a binary string of limited
length? I'd expect it to be not much slower than comparing ints, and
faster than calling each column's comparison function separately. Each
key part would be assigned a byte range in the string.
    It would waste some memory, but using 2x the memory for half the
time would be a good tradeoff when the amount of memory involved is on
the order of megabytes.
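    To illustrate the idea (this is just my sketch, not an actual
PostgreSQL implementation; the layout and the pack_key name are made up):
each key part gets a fixed byte range, encoded so that a plain bytewise
comparison agrees with the column-by-column comparison.

```python
import struct

def pack_key(zipcode: int, name: str, date_ord: int) -> bytes:
    # Hypothetical layout: 4-byte big-endian zipcode, 16-byte
    # zero-padded name, 4-byte big-endian date ordinal.  Big-endian
    # encoding makes bytewise order agree with numeric order for
    # unsigned values; zero padding makes shorter names sort first.
    name_part = name.encode("ascii", "replace")[:16].ljust(16, b"\x00")
    return struct.pack(">I", zipcode) + name_part + struct.pack(">I", date_ord)

rows = [(75001, "martin", 200), (13001, "dupont", 100), (75001, "albert", 150)]
# Sorting the packed keys with a single memcmp-style comparison gives
# the same order as comparing the three columns one by one.
assert sorted(pack_key(*r) for r in rows) == [pack_key(*r) for r in sorted(rows)]
```

    Note the truncation to 16 bytes: ties beyond the truncated prefix
would still need a fallback to the full comparison.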
    It would also help with the problem of locales. Locale-aware
comparisons are slow, but strings can be converted to a canonical form
(removing accents and so on) and then binary-sorted.
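    A minimal sketch of that canonicalization, assuming a simple
accent-stripping transform (real locale collation is more involved; the
canonical name here is mine):

```python
import unicodedata

def canonical(s: str) -> bytes:
    # Decompose accented characters (NFD), drop the combining marks,
    # and lowercase, so that a plain byte comparison approximates an
    # accent- and case-insensitive sort.
    decomposed = unicodedata.normalize("NFD", s.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.encode("utf-8")

names = ["Émile", "edgar", "Ève", "emma"]
assert sorted(names, key=canonical) == ["edgar", "Émile", "emma", "Ève"]
```

    The expensive transform runs once per row instead of once per
comparison, which is the whole point.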

    I'll also insert a plug for the idea that the Sort needs to know
whether a LIMIT follows; that way it could reduce its working set by
simply discarding the rows which would have been discarded anyway by the
LIMIT. Say you want the first 100 rows out of a million ordered rows. If
the sort knows this, it can be performed in the amount of memory needed
for a 100-row sort.
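    That is essentially a bounded-heap selection; a sketch using
Python's standard library (the row count is scaled down here, but the
memory behavior is the point):

```python
import heapq
import random

random.seed(0)
rows = [random.random() for _ in range(100_000)]

# heapq.nsmallest keeps a heap of only the current best 100 candidates,
# so the "sort" needs memory proportional to the LIMIT, not to the
# total number of input rows.
top100 = heapq.nsmallest(100, rows)
assert top100 == sorted(rows)[:100]
```

    Each incoming row is compared against the worst of the current top
100 and usually discarded immediately, so the full input never has to be
held in sort memory at once.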

