PG-related ACM Article: "The Pathologies of Big Data" - Mailing list pgsql-performance

From Josh Kupershmidt
Subject PG-related ACM Article: "The Pathologies of Big Data"
Date
Msg-id 4ec1cf760908071317o4ffa3ad8y4182e425bf71ce26@mail.gmail.com
Whole thread Raw
Responses Re: PG-related ACM Article: "The Pathologies of Big Data"  (Greg Stark <gsstark@mit.edu>)
Re: PG-related ACM Article: "The Pathologies of Big Data"  (Scott Marlowe <scott.marlowe@gmail.com>)
List pgsql-performance
Just stumbled across this recent article published in the
Communications of the ACM:

http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext

The author shares some insights relating to difficulties processing a
6.75 billion-row
table, a dummy table representing census-type data for everyone on earth, in
Postgres.

I'd really like to replicate the author's experiment, but it's not clear from
the article what his table definition looks like. He claims to be using a
16-byte record to store the several columns he needs for each row, so perhaps
he's using a user-defined type?

The author implies with his definition of "big data" that the dataset he
analyzed is "... too large to be placed in a relational database... ". From
Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
run on 6.75 billion rows. This amount of time for the query didn't seem
surprising to me given how many rows he has to process, but in a recent post
on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
ran the same SELECT query in 105 minutes! This would be very impressive (a
10-fold improvement over Postgres) if true.

The author complained that "on larger tables [Postgres' planner] switched to
sorting by grouping columns", which he blamed for the slow query execution. I
don't personally see this plan as a problem, but maybe someone can enlighten
me.

One intriguing tidbit I picked up from the article: "in modern systems, as
demonstrated in the figure, random access to memory is typically slower than
sequential access to disk." In hindsight, this seems plausible (since modern
disks can sustain sequential reads at well over 100MB/sec).

Anyway, it would be very interesting to attempt to speed up the author's query
if at all possible.

pgsql-performance by date:

Previous
From: Culley Harrelson
Date:
Subject: Need suggestions on kernel settings for dedicated FreeBSD/Postgresql machine
Next
From: Greg Stark
Date:
Subject: Re: PG-related ACM Article: "The Pathologies of Big Data"