Thread: Sorting Tuples
Hi to all. INTRODUCTION: I created a patch for PostgreSQL even if i'm not a PostgreSQL user. I just wanted to implement an idea for possibly getting better External Sorting module on PostgreSQL just using some basic C programming knowledge and now I'm facing to using PostgreSQL for a while in order to test my patch. My patch "activates" when user is trying to sort a load of data more or less twice bigger than his available memory (e.g.: sorting 5GB of data just disposing of 1GB of available memory) PROBLEM: First of all I want to realise my patch does sorts correctly or not. But it's a little bit difficult to check when SELECT * FROM huge_table ORDER BY a_column_of_huge_table; means checking a 4 or 5 GB result... QUESTION: Is there a way to discover if data is sorted correctly other than spending 2 months taking a look to the 5 GB result tuples, one by one? Thanks for your attention. Regards, Manolo.
On Thu, Mar 13, 2008 at 7:25 AM, Manolo <manolo.espa@gmail.com> wrote: > Hi to all. > > INTRODUCTION: > I created a patch for PostgreSQL even if i'm not a PostgreSQL user. I > just wanted to implement an idea for possibly getting better External > Sorting module on PostgreSQL just using some basic C programming > knowledge and now I'm facing to using PostgreSQL for a while in order to > test my patch. > My patch "activates" when user is trying to sort a load of data more or > less twice bigger than his available memory (e.g.: sorting 5GB of data > just disposing of 1GB of available memory) > > PROBLEM: > First of all I want to realise my patch does sorts correctly or not. But > it's a little bit difficult to check when > > SELECT * FROM huge_table ORDER BY a_column_of_huge_table; > > means checking a 4 or 5 GB result... > > QUESTION: > Is there a way to discover if data is sorted correctly other than > spending 2 months taking a look to the 5 GB result tuples, one by one? I can think of a couple of answers: 1) test your sorting routine on smaller data sets 2) have an external program verify the sort result on larger sets The bigger question is what you hope to accomplish. In other words, what about postgresql sorting doesn't work for you? Sean
Good job... even if it requires 10GB more of disk space (5GB for the unpatched and 5GB for the patched result) and multiplying by 3 or 4 the time needed to scan a 5GB table (twice for producing the sorted result and once or twice in order to produce the diff). Till now it's the candidate method: I could run it ad night while I'm sleeping! Thank you Ken! On Thu, 2008-03-13 at 07:32 -0500, Kenneth Marshall wrote: > Run the same query on an un-patched version and diff the output. > > Ken > > On Thu, Mar 13, 2008 at 12:25:53PM +0100, Manolo wrote: > > Hi to all. > > > > INTRODUCTION: > > I created a patch for PostgreSQL even if i'm not a PostgreSQL user. I > > just wanted to implement an idea for possibly getting better External > > Sorting module on PostgreSQL just using some basic C programming > > knowledge and now I'm facing to using PostgreSQL for a while in order to > > test my patch. > > My patch "activates" when user is trying to sort a load of data more or > > less twice bigger than his available memory (e.g.: sorting 5GB of data > > just disposing of 1GB of available memory) > > > > PROBLEM: > > First of all I want to realise my patch does sorts correctly or not. But > > it's a little bit difficult to check when > > > > SELECT * FROM huge_table ORDER BY a_column_of_huge_table; > > > > means checking a 4 or 5 GB result... > > > > QUESTION: > > Is there a way to discover if data is sorted correctly other than > > spending 2 months taking a look to the 5 GB result tuples, one by one? > > > > Thanks for your attention. > > Regards, Manolo. > > > > > > -- > > Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org) > > To make changes to your subscription: > > http://www.postgresql.org/mailpref/pgsql-novice > >
> I can think of a couple of answers: > > 1) test your sorting routine on smaller data sets How smaller? Remember that I have got a 1GB RAM... and even if I run it on a smaller RAM machine... that would mean checking MBs of tuples anyway... I don't think checking manually tuples one by one would be a good choice in my case. > 2) have an external program verify the sort result on larger sets Do you have one? > The bigger question is what you hope to accomplish. In other words, > what about postgresql sorting doesn't work for you? I suppose that the current postgresql sorting module works. As I said I'm just creating a patch to get better (faster, more efficient) the actual sorting module. > > Sean Thank you Sean!
Hi Brian. That's what actually I was trying to build in order to get it. I was thinking about 2 nested query. The inner one just sorting the output, something like: SELECT * FROM huge_table ORDER BY a_column_of_huge_table ASC The outer would compare for example the current tuple with the following one and as you said "If it finds an index such that a[i] > a[i+1], you know the result isn't sorted correctly." Could you please write to me a sort of "what should that nested query be", please? Remark I'm not a database user and SQL is not that fresh to me. Thank you Brian! > My advice would be to write a quick application that creates a cursor of > the sorted results, and sucks them out. If it finds an index such that > a[i] > a[i+1], you know the result isn't sorted correctly. > > Brian
On Thu, Mar 13, 2008 at 7:25 AM, Manolo <manolo.espa@gmail.com> wrote: > I created a patch for PostgreSQL even if i'm not a PostgreSQL user. I > just wanted to implement an idea for possibly getting better External > Sorting module on PostgreSQL just using some basic C programming > knowledge and now I'm facing to using PostgreSQL for a while in order to > test my patch. OK. > My patch "activates" when user is trying to sort a load of data more or > less twice bigger than his available memory (e.g.: sorting 5GB of data > just disposing of 1GB of available memory) OK, we kinda already have that covered. > PROBLEM: > First of all I want to realise my patch does sorts correctly or not. But > it's a little bit difficult to check when > > SELECT * FROM huge_table ORDER BY a_column_of_huge_table; > > means checking a 4 or 5 GB result... 5GB is nothing, but I'll continue... > QUESTION: > Is there a way to discover if data is sorted correctly other than > spending 2 months taking a look to the 5 GB result tuples, one by one? I'm trying to understand this. You wrote a patch to improve external sorting... and, you're not sure how to test whether it sorted correctly? I don't mean to be rude, but am I missing something here? To your question, just sort the output and run checksums against the output for comparison. -- Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com Edison, NJ 08837 | http://www.enterprisedb.com/