Re: old synchronized scan patch - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: old synchronized scan patch
Msg-id 1165275500.25371.63.camel@dogma.v10.wvs
In response to Re: old synchronized scan patch  ("Luke Lonergan" <llonergan@greenplum.com>)
Responses Re: old synchronized scan patch
Re: old synchronized scan patch
List pgsql-hackers
On Mon, 2006-12-04 at 15:03 -0800, Luke Lonergan wrote:
> Jeff,
> > Now that 8.3 is open, I was considering a revival of this old patch:
> > 
> > http://archives.postgresql.org/pgsql-hackers/2005-02/msg00832.php
> > 
> > I could probably clean it up with a little help from someone on this
> > list.
> > 
> > 
> > Is there some interest in this patch?
> 
> Yes.
> 

<snip>

> Where I think sync scan could have a big benefit is for multi-user business
> intelligence workloads where there are a few huge fact tables of interest to
> a wide audience.  Example: 5 business analysts come to work at 9AM and start
> ad-hoc queries expected to run in about 15 minutes each.  Each query
> sequential scans a 10 billion row fact table once, which takes 10 minutes of
> the query runtime.  With sync scan the last one completes in 35 minutes.
> Without sync scan the last completes in 75 minutes.  In this case sync scan
> significantly improves the experience of 5 people.
> 

Thank you for your input. 

> > How would I go about proving whether it's useful enough or not?
> 
> Can you run the above scenario on a table whose size is ten times the memory
> on the machine?  As a simple starting point, a simple "SELECT COUNT(*) FROM
> BIGTABLE" should be sufficient, but the scans need to be separated by enough
> time to invalidate the OS I/O cache.
> 

I'll try to run a test like that this week. I will be doing this on my
home hardware (bad, consumer-grade stuff), so if I gave you a patch
against HEAD, could you test it on more realistic hardware (and
data)?

To open up the implementation topic: 

My current patch starts a new sequential scan on a given relation at the
page of an already-running scan. It makes no guarantees that the scans
stay together, but in practice I don't think they deviate much. I fear
that trying to enforce synchronization of the scans would do more harm
than good. Thoughts?
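
As a rough illustration of the idea (not code from the patch), a scan that
starts at another scan's current page still has to visit every page once,
so it wraps around at the end of the relation. The function name
scan_order and its parameters are made up for this sketch:

```python
def scan_order(nblocks, start_block):
    """Yield each block number of a relation exactly once.

    The scan begins at start_block (e.g. the page another scan is
    currently reading), runs to the end of the relation, then wraps
    around to block 0 and stops just before start_block.
    """
    if nblocks <= 0:
        return
    start = start_block % nblocks  # guard against an out-of-range start
    for i in range(nblocks):
        yield (start + i) % nblocks


# A new scan joins while an existing scan is on block 3 of a 5-block table.
print(list(scan_order(5, 3)))  # prints [3, 4, 0, 1, 2]
```

Nothing here keeps two scans in step after they start; they only begin at
the same place, which matches the "no guarantees" behavior described above.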

Also, it's more of a "hint" system: a direct mapping from the
relation's Oid holds the position of the scan. That means that, in rare
cases, the page offset could be wrong, in which case the scan simply
degenerates to the current performance characteristics at no extra cost.
The benefit of doing it this way is that it's simple code, with
essentially no performance penalty or additional locking. Also, I can
use a fixed amount of shared memory (1 page is about right).
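
A minimal sketch of such a hint table, under stated assumptions: the class
ScanHintTable, its method names, and the slot count are hypothetical and
not taken from the patch. The default of 2048 slots assumes an 8 kB page
holding 4-byte block numbers. Slots are chosen by Oid modulo the table
size and do not record which relation wrote them, so two relations can
share a slot and a scan may read a position that belongs to another table:

```python
class ScanHintTable:
    """Fixed-size, direct-mapped table of last-reported scan positions."""

    def __init__(self, nslots=2048):
        # Assumption: one 8 kB page / 4-byte block numbers = 2048 slots.
        self.nslots = nslots
        self.slots = [0] * nslots

    def _slot(self, rel_oid):
        # Direct mapping: no collision handling, no stored key.
        return rel_oid % self.nslots

    def report(self, rel_oid, block):
        """Record the block a running scan of rel_oid is currently on."""
        self.slots[self._slot(rel_oid)] = block

    def start_block(self, rel_oid, nblocks):
        """Suggest where a new scan of rel_oid should begin.

        The hint may be stale or belong to a colliding relation. An
        out-of-range hint falls back to block 0; an in-range but wrong
        hint is still a valid start, just without any shared-I/O benefit.
        """
        if nblocks <= 0:
            return 0
        hint = self.slots[self._slot(rel_oid)]
        return hint if 0 <= hint < nblocks else 0


hints = ScanHintTable(nslots=4)      # tiny table to force a collision
hints.report(16384, 7)               # scan of relation 16384 is on block 7
print(hints.start_block(16384, 100))
hints.report(16388, 500)             # 16388 maps to the same slot
print(hints.start_block(16384, 100))
```

The first lookup returns the reported position; after the colliding
report, the second lookup falls back to block 0, which is the same place
an ordinary scan would start.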

Regards,
	Jeff Davis
