Home > mailing lists

PG-related ACM Article: "The Pathologies of Big Data" - Mailing list pgsql-performance

From	Josh Kupershmidt
Subject	PG-related ACM Article: "The Pathologies of Big Data"
Date	August 7, 2009 21:09:20
Msg-id	4ec1cf760908071317o4ffa3ad8y4182e425bf71ce26@mail.gmail.com Whole thread Raw
Responses	Re: PG-related ACM Article: "The Pathologies of Big Data" Re: PG-related ACM Article: "The Pathologies of Big Data"
List	pgsql-performance

Tree view

Just stumbled across this recent article published in the
Communications of the ACM:

http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext

The author shares some insights relating to difficulties processing a
6.75 billion-row
table, a dummy table representing census-type data for everyone on earth, in
Postgres.

I'd really like to replicate the author's experiment, but it's not clear from
the article what his table definition looks like. He claims to be using a
16-byte record to store the several columns he needs for each row, so perhaps
he's using a user-defined type?

The author implies with his definition of "big data" that the dataset he
analyzed is "... too large to be placed in a relational database... ". From
Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
run on 6.75 billion rows. This amount of time for the query didn't seem
surprising to me given how many rows he has to process, but in a recent post
on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
ran the same SELECT query in 105 minutes! This would be very impressive (a
10-fold improvement over Postgres) if true.

The author complained that "on larger tables [Postgres' planner] switched to
sorting by grouping columns", which he blamed for the slow query execution. I
don't personally see this plan as a problem, but maybe someone can enlighten
me.

One intriguing tidbit I picked up from the article: "in modern systems, as
demonstrated in the figure, random access to memory is typically slower than
sequential access to disk." In hindsight, this seems plausible (since modern
disks can sustain sequential reads at well over 100MB/sec).

Anyway, it would be very interesting to attempt to speed up the author's query
if at all possible.

pgsql-performance by date:

From: Culley Harrelson
Date: 07 August 2009, 18:26:18
Subject: Need suggestions on kernel settings for dedicated FreeBSD/Postgresql machine

From: Greg Stark
Date: 07 August 2009, 21:28:53
Subject: Re: PG-related ACM Article: "The Pathologies of Big Data"

PG-related ACM Article: "The Pathologies of Big Data" - Mailing list pgsql-performance

Previous

Next