Thread: Sorting Tuples

Sorting Tuples

From
Manolo
Date:
Hi to all.

INTRODUCTION:
I created a patch for PostgreSQL even if i'm not a PostgreSQL user. I
just wanted to implement an idea for possibly getting better External
Sorting module on PostgreSQL just using some basic C programming
knowledge and now I'm facing to using PostgreSQL for a while in order to
test my patch.
My patch "activates" when user is trying to sort a load of data more or
less twice bigger than his available memory (e.g.: sorting 5GB of data
just disposing of 1GB of available memory)

PROBLEM:
First of all I want to realise my patch does sorts correctly or not. But
it's a little bit difficult to check when

    SELECT * FROM huge_table ORDER BY a_column_of_huge_table;

means checking a 4 or 5 GB result...

QUESTION:
Is there a way to discover if data is sorted correctly other than
spending 2 months taking a look to the 5 GB result tuples, one by one?

Thanks for your attention.
Regards, Manolo.


Re: Sorting Tuples

From
"Sean Davis"
Date:
On Thu, Mar 13, 2008 at 7:25 AM, Manolo <manolo.espa@gmail.com> wrote:
> Hi to all.
>
>  INTRODUCTION:
>  I created a patch for PostgreSQL even if i'm not a PostgreSQL user. I
>  just wanted to implement an idea for possibly getting better External
>  Sorting module on PostgreSQL just using some basic C programming
>  knowledge and now I'm facing to using PostgreSQL for a while in order to
>  test my patch.
>  My patch "activates" when user is trying to sort a load of data more or
>  less twice bigger than his available memory (e.g.: sorting 5GB of data
>  just disposing of 1GB of available memory)
>
>  PROBLEM:
>  First of all I want to realise my patch does sorts correctly or not. But
>  it's a little bit difficult to check when
>
>         SELECT * FROM huge_table ORDER BY a_column_of_huge_table;
>
>  means checking a 4 or 5 GB result...
>
>  QUESTION:
>  Is there a way to discover if data is sorted correctly other than
>  spending 2 months taking a look to the 5 GB result tuples, one by one?

I can think of a couple of answers:

1) test your sorting routine on smaller data sets

2) have an external program verify the sort result on larger sets

The bigger question is what you hope to accomplish.  In other words,
what about postgresql sorting doesn't work for you?

Sean

Re: Sorting Tuples

From
Manolo
Date:

Good job... even if it requires 10GB more of disk space (5GB for the
unpatched and 5GB for the patched result) and multiplying by 3 or 4 the
time needed to scan a 5GB table (twice for producing the sorted result
and once or twice in order to produce the diff).

Till now it's the candidate method: I could run it ad night while I'm
sleeping!

Thank you Ken!


On Thu, 2008-03-13 at 07:32 -0500, Kenneth Marshall wrote:
> Run the same query on an un-patched version and diff the output.
>
> Ken
>
> On Thu, Mar 13, 2008 at 12:25:53PM +0100, Manolo wrote:
> > Hi to all.
> >
> > INTRODUCTION:
> > I created a patch for PostgreSQL even if i'm not a PostgreSQL user. I
> > just wanted to implement an idea for possibly getting better External
> > Sorting module on PostgreSQL just using some basic C programming
> > knowledge and now I'm facing to using PostgreSQL for a while in order to
> > test my patch.
> > My patch "activates" when user is trying to sort a load of data more or
> > less twice bigger than his available memory (e.g.: sorting 5GB of data
> > just disposing of 1GB of available memory)
> >
> > PROBLEM:
> > First of all I want to realise my patch does sorts correctly or not. But
> > it's a little bit difficult to check when
> >
> >     SELECT * FROM huge_table ORDER BY a_column_of_huge_table;
> >
> > means checking a 4 or 5 GB result...
> >
> > QUESTION:
> > Is there a way to discover if data is sorted correctly other than
> > spending 2 months taking a look to the 5 GB result tuples, one by one?
> >
> > Thanks for your attention.
> > Regards, Manolo.
> >
> >
> > --
> > Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-novice
> >


Re: Sorting Tuples

From
Manolo
Date:

> I can think of a couple of answers:
>
> 1) test your sorting routine on smaller data sets

How smaller? Remember that I have got a 1GB RAM... and even if I run it
on a smaller RAM machine... that would mean checking MBs of tuples
anyway... I don't think checking manually tuples one by one would be a
good choice in my case.

> 2) have an external program verify the sort result on larger sets

Do you have one?

> The bigger question is what you hope to accomplish. In other words,
> what about postgresql sorting doesn't work for you?

I suppose that the current postgresql sorting module works. As I said
I'm just creating a patch to get better (faster, more efficient) the
actual sorting module.
>
> Sean

Thank you Sean!


Re: Sorting Tuples

From
Manolo
Date:
Hi Brian.

That's what actually I was trying to build in order to get it.

I was thinking about 2 nested query. The inner one just sorting the output, something like:

SELECT * FROM huge_table ORDER BY a_column_of_huge_table ASC

The outer would compare for example the current tuple with the following one and as you said
 "If it finds an index such that a[i] > a[i+1], you know the result isn't sorted correctly."

Could you please write to me a sort of "what should that nested query be", please?
Remark I'm not a database user and SQL is not that fresh to me.

Thank you Brian!

> My advice would be to write a quick application that creates a cursor of
> the sorted results, and sucks them out.  If it finds an index such that
> a[i] > a[i+1], you know the result isn't sorted correctly.
>
> Brian


Re: Sorting Tuples

From
"Jonah H. Harris"
Date:
On Thu, Mar 13, 2008 at 7:25 AM, Manolo <manolo.espa@gmail.com> wrote:
>  I created a patch for PostgreSQL even if i'm not a PostgreSQL user. I
>  just wanted to implement an idea for possibly getting better External
>  Sorting module on PostgreSQL just using some basic C programming
>  knowledge and now I'm facing to using PostgreSQL for a while in order to
>  test my patch.

OK.

>  My patch "activates" when user is trying to sort a load of data more or
>  less twice bigger than his available memory (e.g.: sorting 5GB of data
>  just disposing of 1GB of available memory)

OK, we kinda already have that covered.

>  PROBLEM:
>  First of all I want to realise my patch does sorts correctly or not. But
>  it's a little bit difficult to check when
>
>         SELECT * FROM huge_table ORDER BY a_column_of_huge_table;
>
>  means checking a 4 or 5 GB result...

5GB is nothing, but I'll continue...

>  QUESTION:
>  Is there a way to discover if data is sorted correctly other than
>  spending 2 months taking a look to the 5 GB result tuples, one by one?

I'm trying to understand this.  You wrote a patch to improve external
sorting... and, you're not sure how to test whether it sorted
correctly?  I don't mean to be rude, but am I missing something here?

To your question, just sort the output and run checksums against the
output for comparison.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/