Thread: Merging large volumes of data
Dear All, I have several tables containing data sorted by 2 keys (neither are keys in db terms (not unique), however). I would liketo retrieve all rows from all tables sorted by the same keys, essentially merging the contents of the tables together.While I am completely aware of sort order not being a (fundamental) property of an RDBMS table, I am also awareof indices and clustering (in fact, data is inserted into the tables into the correct order, and not consequently modifiedin any way). I have a union query like this one: select a,b,c,d,e from table1 union all select a,b,c,d,e from table2 union all etc... select a,b,c,d,e from tablen order by a,b; Is there a way to prevent PostgreSQL from doing a full sort on the result set after the unions have been completed? Evenif I write (select a,b,c,d,e from table1 order by a,b) union all (select a,b,c,d,e from table2 order by a,b) union all etc... (select a,b,c,d,e from tablen order by a,b) order by a,b; PostgreSQL does not seem to realise (maybe it should not be able to do this trick anyway) that the last "order by" clauseis merely a final merge step on the ordered data sets. Is there a workaround for this within PostgreSQL (another type of query, parameter tuning, stored procedure, anything) orshould I use my back-up plan of making separate queries and merging the results in the target language? Thanks a lot, Ambrus -- Wagner, Ambrus (IJ/ETH/GBD) Tool Designer GSDC Hungary Location: Science Park, A2 40 008 Phone: +36 1 439 5282
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I think you'll have to stick with doing your sorting (or merging) in your client. Don't think that PG recognizes the fact it's just a merge step. Andreas Ambrus Wagner (IJ/ETH) wrote: > Dear All, > > I have several tables containing data sorted by 2 keys (neither are keys in db terms (not unique), however). I would liketo retrieve all rows from all tables sorted by the same keys, essentially merging the contents of the tables together.While I am completely aware of sort order not being a (fundamental) property of an RDBMS table, I am also awareof indices and clustering (in fact, data is inserted into the tables into the correct order, and not consequently modifiedin any way). I have a union query like this one: > > select a,b,c,d,e from table1 union all > select a,b,c,d,e from table2 union all > etc... > select a,b,c,d,e from tablen order by a,b; > > Is there a way to prevent PostgreSQL from doing a full sort on the result set after the unions have been completed? Evenif I write > > (select a,b,c,d,e from table1 order by a,b) union all > (select a,b,c,d,e from table2 order by a,b) union all > etc... > (select a,b,c,d,e from tablen order by a,b) order by a,b; > > PostgreSQL does not seem to realise (maybe it should not be able to do this trick anyway) that the last "order by" clauseis merely a final merge step on the ordered data sets. > > Is there a workaround for this within PostgreSQL (another type of query, parameter tuning, stored procedure, anything)or should I use my back-up plan of making separate queries and merging the results in the target language? > > Thanks a lot, > Ambrus > > -- > Wagner, Ambrus (IJ/ETH/GBD) > Tool Designer > GSDC Hungary > > Location: Science Park, A2 40 008 > Phone: +36 1 439 5282 > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.2 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFGPy1CHJdudm4KnO0RAuKlAKCbYu2G/MYfmX9gAlSxkzA6KB4A+QCeIlAT USxhGD5XL7oGlIh+i2rVyN4= =APcb -----END PGP SIGNATURE-----
"Andreas Kostyrka" <andreas@kostyrka.org> writes: >> (select a,b,c,d,e from table1 order by a,b) union all >> (select a,b,c,d,e from table2 order by a,b) union all >> etc... >> (select a,b,c,d,e from tablen order by a,b) order by a,b; >> >> PostgreSQL does not seem to realise (maybe it should not be able to do this >> trick anyway) that the last "order by" clause is merely a final merge step >> on the ordered data sets. There's no plan type in Postgres for merging pre-sorted data like this. The only merge plan type is for joins which isn't going to be what you need. But the queries as written here would be just as fast or faster to do one big sort as they would be to do separate sorts and merge the results. You might want to do it the way you describe if there were selective WHERE clauses that you've left out that make the intermediate orderings come for free. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Andreas Kostyrka <andreas@kostyrka.org> writes: > Ambrus Wagner (IJ/ETH) wrote: >> Is there a way to prevent PostgreSQL from doing a full sort on the result set after the unions have been completed? Evenif I write >> >> (select a,b,c,d,e from table1 order by a,b) union all >> (select a,b,c,d,e from table2 order by a,b) union all >> etc... >> (select a,b,c,d,e from tablen order by a,b) order by a,b; >> >> PostgreSQL does not seem to realise (maybe it should not be able to do this trick anyway) that the last "order by" clauseis merely a final merge step on the ordered data sets. At least to a first-order approximation, teaching it this would be a waste of time. Assume for simplicity that each of your K sub-selects yields N tuples. Then there are KN items altogether, so if we just sort the big data set it takes O(KN*log(KN)) time ... which is the same as O(KN*(log K + log N)). OTOH, if we sort each sub-select by itself, that takes O(N*log N) time, or O(KN*log N) for all K sub-sorts. Now we've got to do a K-way merge, which will take O(log K) comparisons for each of the KN tuples, ie, O(KN*log K)). Net result: exactly the same runtime. Of course this argument fails if you have some way of obtaining the sub-select values pre-sorted for free. But it's never really free. Historical experience is that full-table indexscans often underperform explicit sorts, at least when there are enough tuples involved to make the problem interesting. So the bottom line is that the use-case for this optimization seems far too narrow to justify the implementation effort. regards, tom lane