Thread: Brain-Dead Sort Algorithm??

Brain-Dead Sort Algorithm??

From: "Tim Perdue" <tim@perdue.net>


> >It uses over 1GB of disk space to do that sort,
> >and it would have used a lot more if I hadn't
> >run out.
> >
> >Then it doesn't fail gracefully; instead it
> >just hangs and leaves temp files completely
> >filling up the hard drive.

> Because maybe you're doing a really dumb join
> before you sort?  SQL is full of such "gotchas".

No, sorry.

select distinct serial into serial_good from
serial_half;

serial_half is a 1-column list of 10-digit
numbers. I'm doing a select distinct because I
believe there may be duplicates in that column.

The misunderstanding on my end came because
serial_half was a 60MB text file, but when it was
inserted into postgres, it became 345MB (6.8
million rows has a lot of bloat apparently).

So the temp-sort space for 345MB could easily
surpass the 1GB I had on my hard disk. Although
how anyone can take a 60MB text file and turn it
into > 1GB is beyond me.

> And, of course, you've posed your question
> stupidly - "my query's slow, why is Postgres so
> horrible?" and you haven't bothered posting
> your query.

None of that was ever stated.

Actually what was stated is that it is retarded to
fill up a hard disk and then hang instead of
bowing out gracefully, forcing the user to
manually delete the temp_sort files and kill -9
postgres.

You can't argue with that portion. 

And it happens on v6.4, v6.4.2, and v6.5.2 on
Red Hat 6.1 and LinuxPPC.

Yes, my post was rather harsh - I posted it when I
was pissed and that was a mistake.  I had this
same problem in March when trying to sort a 2.5GB
file with 9GB free.

I use postgres on every project I work on,
including this site, Geocrawler.com, and my
PHPBuilder.com site, because it's a decent and
free database and it will scale beyond 2GB, unlike
MySQL.

Tim


Re: [HACKERS] Brain-Dead Sort Algorithm??

From: Tom Lane
"Tim Perdue" <archiver@db.geocrawler.com> writes:
> serial_half is a 1-column list of 10-digit
> numbers. I'm doing a select distinct because I
> believe there may be duplicates in that column.

> The misunderstanding on my end came because
> serial_half was a 60MB text file, but when it was
> inserted into postgres, it became 345MB (6.8
> million rows has a lot of bloat apparently).

The overhead per tuple is forty-something bytes, IIRC.  So when the only
useful data in a tuple is an int, the expansion factor is unpleasantly
large.  Little to be done about it though.  All the overhead fields
appear to be necessary if you want proper transaction semantics.
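
One way to sanity-check where the space goes is to ask pg_class
directly.  A rough sketch (this assumes the default 8K block size;
relpages and reltuples are only estimates, refreshed by VACUUM):

    -- refresh the stored page/tuple counts, then read them back
    vacuum serial_half;
    select relname, relpages, reltuples,
           relpages * 8192 as approx_bytes
    from pg_class
    where relname = 'serial_half';

approx_bytes should land somewhere near the 345MB you're seeing for
the heap file.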

> So the temp-sort space for 345MB could easily surpass the 1GB I had on
> my hard disk.

Yes, the merge algorithm used up through 6.5.* seems to have typical
space usage of about 4X the actual data volume; 4 x 345MB is roughly
1.4GB, so running a 1GB disk out of space fits that pattern.  I'm
trying to reduce this to just 1X for 7.0, although some folks are
complaining that the result is slower than before :-(.

> Actually what was stated is that it is retarded to fill up a hard disk
> and then hang instead of bowing out gracefully,

Yup, that was a bug --- failure to check for write errors on the sort
temp files.  I believe it's fixed in current sources too.
        regards, tom lane


Re: [HACKERS] Brain-Dead Sort Algorithm??

From: Thomas Lockhart
> serial_half is a 1-column list of 10-digit
> numbers. I'm doing a select distinct because I
> believe there may be duplicates in that column.
> 
> The misunderstanding on my end came because
> serial_half was a 60MB text file, but when it was
> inserted into postgres, it became 345MB (6.8
> million rows has a lot of bloat apparently).
> 
> So the temp-sort space for 345MB could easily
> surpass the 1GB I had on my hard disk. Although
> how anyone can take a 60MB text file and turn it
> into > 1GB is beyond me.

Sigh. Y'all like the sweeping statement, which got you in a bit of
trouble the first time too :)

Without knowing your schema, I can't say why you have *exactly* the
storage requirement you see. But, you have chosen the absolute worst
case for *any* relational database: a schema with only a single, very
small column.

For Postgres (and other DBs, but the details will vary) there is a 36
byte overhead per row to manage the tuple and the transaction
behavior. So if you stored your data as int8 (int4 is too small for 10
digits, right?) I see an average usage of slightly over 44 bytes per
row (36+8). So, for 6.8 million rows, you will require 300MB. I'm
guessing that you are using char(10) fields, which gives 50 bytes/row
or a total of 340MB, which matches your number to two digits.
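
If the disk footprint is the real problem, the int8 route is easy
enough to try.  A rough sketch only (the load file path here is made
up; use whatever you loaded the original table from):

    -- recreate the table with int8 instead of the char(10) guessed above
    create table serial_half (serial int8);
    copy serial_half from '/tmp/serial_half.txt';
    select distinct serial into serial_good from serial_half;

That gets you from the ~50 bytes/row above down to ~44, and the
distinct sort then has correspondingly less data to spill to temp
files.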

Note that the tuple header size will stay the same (with possibly some
modest occasional bumps) for rows with more columns, so the overhead
decreases as you increase the number of columns in your tables.

By the way, I was going to say to RTFM, but I see a big blank spot on
this topic (I could have sworn that some of the info posted to the
mailing lists on this topic had made it into the manual, but maybe
not).

Does anyone see where this is in the docs, or have an interest in
writing a bit? The place is doc/src/sgml/storage.sgml and page.sgml
...

Good luck.
                    - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Brain-Dead Sort Algorithm??

From: Bruce Momjian
> Sigh. Y'all like the sweeping statement, which got you in a bit of
> trouble the first time too :)
> 
> Without knowing your schema, I can't say why you have *exactly* the
> storage requirement you see. But, you have chosen the absolute worst
> case for *any* relational database: a schema with only a single, very
> small column.
> 
> For Postgres (and other DBs, but the details will vary) there is a 36
> byte overhead per row to manage the tuple and the transaction
> behavior. So if you stored your data as int8 (int4 is too small for 10
> digits, right?) I see an average usage of slightly over 44 bytes per
> row (36+8). So, for 6.8 million rows, you will require 300MB. I'm
> guessing that you are using char(10) fields, which gives 50 bytes/row
> or a total of 340MB, which matches your number to two digits.
> 
> Note that the tuple header size will stay the same (with possibly some
> modest occasional bumps) for rows with more columns, so the overhead
> decreases as you increase the number of columns in your tables.
> 
> By the way, I was going to say to RTFM, but I see a big blank spot on
> this topic (I could have sworn that some of the info posted to the
> mailing lists on this topic had made it into the manual, but maybe
> not).

This is an FAQ item.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026