Thread: Merging large volumes of data

Merging large volumes of data

From
"Ambrus Wagner (IJ/ETH)"
Date:
Dear All,

I have several tables containing data sorted by 2 keys (neither are keys in db terms (not unique), however). I would
liketo retrieve all rows from all tables sorted by the same keys, essentially merging the contents of the tables
together.While I am completely aware of sort order not being a (fundamental) property of an RDBMS table, I am also
awareof indices and clustering (in fact, data is inserted into the tables into the correct order, and not consequently
modifiedin any way). I have a union query like this one:
 

select a,b,c,d,e from table1 union all
select a,b,c,d,e from table2 union all
etc...
select a,b,c,d,e from tablen order by a,b;

Is there a way to prevent PostgreSQL from doing a full sort on the result set after the unions have been completed?
Evenif I write
 

(select a,b,c,d,e from table1 order by a,b) union all
(select a,b,c,d,e from table2 order by a,b)  union all
etc...
(select a,b,c,d,e from tablen order by a,b)  order by a,b;

PostgreSQL does not seem to realise (maybe it should not be able to do this trick anyway) that the last "order by"
clauseis merely a final merge step on the ordered data sets.
 

Is there a workaround for this within PostgreSQL (another type of query, parameter tuning, stored procedure, anything)
orshould I use my back-up plan of making separate queries and merging the results in the target language?
 

Thanks a lot,
Ambrus

--
Wagner, Ambrus (IJ/ETH/GBD)
Tool Designer
GSDC Hungary

Location: Science Park, A2 40 008
Phone: +36 1 439 5282 

Re: Merging large volumes of data

From
Andreas Kostyrka
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I think you'll have to stick with doing your sorting (or merging) in
your client. Don't think that PG recognizes the fact it's just a merge step.

Andreas

Ambrus Wagner (IJ/ETH) wrote:
> Dear All,
>
> I have several tables containing data sorted by 2 keys (neither are keys in db terms (not unique), however). I would
liketo retrieve all rows from all tables sorted by the same keys, essentially merging the contents of the tables
together.While I am completely aware of sort order not being a (fundamental) property of an RDBMS table, I am also
awareof indices and clustering (in fact, data is inserted into the tables into the correct order, and not consequently
modifiedin any way). I have a union query like this one: 
>
> select a,b,c,d,e from table1 union all
> select a,b,c,d,e from table2 union all
> etc...
> select a,b,c,d,e from tablen order by a,b;
>
> Is there a way to prevent PostgreSQL from doing a full sort on the result set after the unions have been completed?
Evenif I write 
>
> (select a,b,c,d,e from table1 order by a,b) union all
> (select a,b,c,d,e from table2 order by a,b)  union all
> etc...
> (select a,b,c,d,e from tablen order by a,b)  order by a,b;
>
> PostgreSQL does not seem to realise (maybe it should not be able to do this trick anyway) that the last "order by"
clauseis merely a final merge step on the ordered data sets. 
>
> Is there a workaround for this within PostgreSQL (another type of query, parameter tuning, stored procedure,
anything)or should I use my back-up plan of making separate queries and merging the results in the target language? 
>
> Thanks a lot,
> Ambrus
>
> --
> Wagner, Ambrus (IJ/ETH/GBD)
> Tool Designer
> GSDC Hungary
>
> Location: Science Park, A2 40 008
> Phone: +36 1 439 5282
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGPy1CHJdudm4KnO0RAuKlAKCbYu2G/MYfmX9gAlSxkzA6KB4A+QCeIlAT
USxhGD5XL7oGlIh+i2rVyN4=
=APcb
-----END PGP SIGNATURE-----

Re: Merging large volumes of data

From
Gregory Stark
Date:
"Andreas Kostyrka" <andreas@kostyrka.org> writes:

>> (select a,b,c,d,e from table1 order by a,b) union all
>> (select a,b,c,d,e from table2 order by a,b)  union all
>> etc...
>> (select a,b,c,d,e from tablen order by a,b)  order by a,b;
>>
>> PostgreSQL does not seem to realise (maybe it should not be able to do this
>> trick anyway) that the last "order by" clause is merely a final merge step
>> on the ordered data sets.

There's no plan type in Postgres for merging pre-sorted data like this. The
only merge plan type is for joins which isn't going to be what you need.

But the queries as written here would be just as fast or faster to do one big
sort as they would be to do separate sorts and merge the results.

You might want to do it the way you describe if there were selective WHERE
clauses that you've left out that make the intermediate orderings come for
free.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Merging large volumes of data

From
Tom Lane
Date:
Andreas Kostyrka <andreas@kostyrka.org> writes:
> Ambrus Wagner (IJ/ETH) wrote:
>> Is there a way to prevent PostgreSQL from doing a full sort on the result set after the unions have been completed?
Evenif I write 
>>
>> (select a,b,c,d,e from table1 order by a,b) union all
>> (select a,b,c,d,e from table2 order by a,b)  union all
>> etc...
>> (select a,b,c,d,e from tablen order by a,b)  order by a,b;
>>
>> PostgreSQL does not seem to realise (maybe it should not be able to do this trick anyway) that the last "order by"
clauseis merely a final merge step on the ordered data sets. 

At least to a first-order approximation, teaching it this would be a
waste of time.

Assume for simplicity that each of your K sub-selects yields N tuples.
Then there are KN items altogether, so if we just sort the big data set
it takes O(KN*log(KN)) time ... which is the same as O(KN*(log K + log N)).
OTOH, if we sort each sub-select by itself, that takes O(N*log N) time,
or O(KN*log N) for all K sub-sorts.  Now we've got to do a K-way merge,
which will take O(log K) comparisons for each of the KN tuples, ie,
O(KN*log K)).  Net result: exactly the same runtime.

Of course this argument fails if you have some way of obtaining the
sub-select values pre-sorted for free.  But it's never really free.
Historical experience is that full-table indexscans often underperform
explicit sorts, at least when there are enough tuples involved to
make the problem interesting.

So the bottom line is that the use-case for this optimization seems
far too narrow to justify the implementation effort.

            regards, tom lane