Re: PG-related ACM Article: "The Pathologies of Big Data" - Mailing list pgsql-performance

From: Greg Stark
Subject: Re: PG-related ACM Article: "The Pathologies of Big Data"
Date:
Msg-id: 407d949e0908071728y17ed4b31g7b95e2be5fce4924@mail.gmail.com
In response to: PG-related ACM Article: "The Pathologies of Big Data" (Josh Kupershmidt <schmiddy@gmail.com>)
List: pgsql-performance
On Fri, Aug 7, 2009 at 9:17 PM, Josh Kupershmidt<schmiddy@gmail.com> wrote:
> Just stumbled across this recent article published in the
> Communications of the ACM:
>
> http://cacm.acm.org/magazines/2009/8/34493-the-pathologies-of-big-data/fulltext
>
> The author shares some insights relating to difficulties processing a
> 6.75 billion-row
> table, a dummy table representing census-type data for everyone on earth, in
> Postgres.
>
> I'd really like to replicate the author's experiment, but it's not clear from
> the article what his table definition looks like. He claims to be using a
> 16-byte record to store the several columns he needs for each row, so perhaps
> he's using a user-defined type?

or four integers, or who knows. Postgres's per-row overhead is 24
bytes of tuple header plus a 4-byte line pointer, so with a 16-byte
record you're talking about 44 bytes per row. There's per-page
overhead and alignment padding too, but in this case it shouldn't
amount to much.
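
Not the author's actual schema (the article never gives one), but as a
rough sanity check you could try something like the following, assuming
the 16-byte record is just four int4 columns (table and column names
are made up):

-- hypothetical stand-in for the article's table: 4 x int4 = 16 bytes of data
CREATE TABLE census_guess (age int, sex int, region int, income int);

-- a million throwaway rows is enough to get a realistic on-disk size
INSERT INTO census_guess
SELECT i % 100, i % 2, i % 10000, i % 100000
FROM generate_series(1, 1000000) AS s(i);

-- bytes per row, including tuple header, line pointer and page overhead
SELECT pg_relation_size('census_guess') / 1000000.0 AS bytes_per_row;

That should come out in the mid-40s per row, not 16.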



> The author implies with his definition of "big data" that the dataset he
> analyzed is "... too large to be placed in a relational database... ". From
> Fig. 2, the SELECT query he ran took just under 10^5 seconds (~28 hours) when
> run on 6.75 billion rows. This amount of time for the query didn't seem
> surprising to me given how many rows he has to process, but in a recent post
> on comp.databases.ingres someone claimed that on a far-inferior PC, Ingres
> ran the same SELECT query in 105 minutes! This would be very impressive (a
> 10-fold improvement over Postgres) if true.

6.75 billion * 44 bytes is about 297GB.

Assuming you stick that on a single spindle capable of 100MB/s:

You have: 297GB / (100MB/s)
You want: min
        * 49.5

So a plain sequential scan should finish in under an hour, not ~28
hours. Something's not adding up.
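
Same arithmetic in psql, checked against the ~10^5 s from the article's
Fig. 2 (using the assumed 44 bytes/row and 100MB/s from above, decimal
units throughout):

-- time for a full sequential scan at 100MB/s
SELECT 6.75e9 * 44 / 100e6 / 60 AS scan_minutes;        -- about 49.5

-- how much slower the reported runtime is than a raw scan
SELECT 1e5 / (6.75e9 * 44 / 100e6) AS slowdown_factor;  -- roughly 34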

> One intriguing tidbit I picked up from the article: "in modern systems, as
> demonstrated in the figure, random access to memory is typically slower than
> sequential access to disk." In hindsight, this seems plausible (since modern
> disks can sustain sequential reads at well over 100MB/sec).

Sure, but even a modest PCIe link can sustain on the order of 1GB/s, and
your memory bandwidth is probably at least 8GB/s; the article's comparison
only holds when the memory access pattern is genuinely random and
cache-unfriendly.
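
Taking those figures at face value (100MB/s sequential disk, 1GB/s PCIe,
8GB/s memory, all decimal), the rough ratios:

SELECT 8e9 / 100e6 AS memory_vs_sequential_disk;  -- 80
SELECT 1e9 / 100e6 AS pcie_vs_sequential_disk;    -- 10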

--
greg
http://mit.edu/~gsstark/resume.pdf
