Re: Air-traffic benchmark - Mailing list pgsql-performance

From Lefteris
Subject Re: Air-traffic benchmark
Date
Msg-id 852badbc1001071457m42d4042brbb5c6b5fa7efce20@mail.gmail.com
Whole thread Raw
In response to Re: Air-traffic benchmark  (Greg Smith <greg@2ndquadrant.com>)
Responses Re: Air-traffic benchmark
List pgsql-performance
Hi Greg,

thank you for your help. The changes I did on the dataset was just
removing the last comma from the CSV files as it was interpreted by pg
as an extra column. The schema I used, the load script and queries can
be found at:

http://homepages.cwi.nl/~lsidir/postgres/

(I understood that if I attach these files here, my email will not
reach the list so I give you a link to download them).

Also since you are interesting on the benchmark, you can also check

http://homepages.cwi.nl/~mk/ontimeReport

for a report of various experiments with MonetDB and comparison with
previously published numbers.

The schema I used for pg is slightly different from that one of
MonetDB since the delay fields could not be parsed by pg as integers
but only as varchar(4). Hence the extra index on DepDelay field:)

Also at http://homepages.cwi.nl/~lsidir/PostgreSQL-ontimeReport you
can see the detailed times I got from postgres.

I really appreciate your help! this is a great opportunity for me to
get some feeling and insights on postgres since I never had the chance
to use it in a large scale project.

lefteris


On Thu, Jan 7, 2010 at 11:21 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Lefteris wrote:
>>
>> So we all agree that the problem is on the scans:)
>>
>> So the next question is why changing shared memory buffers will fix
>> that? i only have one session with one connection, do I have like many
>> reader workers or something?
>>
>
> I wouldn't expect it to.  Large sequential scans like this one are optimized
> in PostgreSQL to only use up a small portion of the shared_buffers cache.
>  Allocating more RAM to the database won't improve the fact that you're
> spending the whole time waiting for physical I/O to happen very much.
>
> What might help is increasing effective_cache_size a lot though, because
> there you might discover the database switching to all new sorts of plans
> for some of these queries.  But, again, that doesn't impact the situation
> where a sequential scan is the only approach.
>
> I have this whole data set on my PC already and have been trying to find
> time to get it loaded and start my own tests here, it is a quite interesting
> set of information.  Can you tell me what you had to do in order to get it
> running in PostgreSQL?  If you made any customizations there, I'd like to
> get a copy of them.  Would save me some time and help me get to where I
> could give suggestions out if I had a "pgdumpall --schema-only" dump from
> your database for example, or however you got the schema into there, and the
> set of PostgreSQL-compatible queries you're using.
>
> By the way:  if anybody else wants to join in, here's a script that
> generates a script to download the whole data set:
>
> #!/usr/bin/env python
> for y in range(1988,2010):
>   for m in range(1,13):
>       print "wget --limit-rate=100k
> http://www.transtats.bts.gov/Download/On_Time_On_Time_Performance_%s_%s.zip"
> % (y,m)
>
> It's 3.8GB of download that uncompresses into 46GB of CSV data, which is why
> I put the rate limiter on there--kept it from clogging my entire Internet
> connection.
>
> --
> Greg Smith    2ndQuadrant   Baltimore, MD
> PostgreSQL Training, Services and Support
> greg@2ndQuadrant.com  www.2ndQuadrant.com
>
>

pgsql-performance by date:

Previous
From: Greg Smith
Date:
Subject: Re: Air-traffic benchmark
Next
From: Lefteris
Date:
Subject: Re: Air-traffic benchmark