Re: experiments in query optimization - Mailing list pgsql-performance

From Faheem Mitha
Subject Re: experiments in query optimization
Date
Msg-id alpine.DEB.2.00.1003300141290.13883@orwell.homelinux.org
Whole thread Raw
In response to Re: experiments in query optimization  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-performance

On Mon, 29 Mar 2010, Robert Haas wrote:

> On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha <faheem@email.unc.edu> wrote:
>>> It's not really too clear to me from reading this what specific
>>> questions you're trying to answer.
>>
>> Quote from opt.{tex/pdf}, Section 1:
>>
>> "If I have to I can use Section~\ref{ped_hybrid} and
>> Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
>> I do out of the earlier versions. Specifically, why is
>> Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
>> why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
>> Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?"
>
> Here and in the document, you refer to section numbers for the
> "hybrid" version but I don't see where you define what the "hybrid"
> version actually is.

It is defined later in the file. I don't know if you are looking at the
pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text
file, I guess the easist way would be to grep for the label ped_hybrid.

> And the differences between your queries are not real clear either -
> first you say you took out pheno and sex because they weren't necessary,
> but then you decide to put them back.  I don't know what that means.
> If they're not necessary, leave them out.

I don't see where I say that pheno and sex weren't necessary. In fact, the
word 'necessary' does not appear in the opt document. I took them out to
see how it would affect performance. Which is does, dramatically. I say

"So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, and the runtime dropped to 150 min, while the
memory stayed around 5G."

Maybe I wasn't being sufficiently explicit here. Perhaps

"So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, to see how it would affect performance..."

would have been better.

>>> One random thought: WHERE row_number() = 1 is not too efficient.
>>> Try using LIMIT or DISTINCT ON instead.
>>
>> Possibly. However, the CTE that uses
>>
>> WHERE row_number() = 1
>>
>> doesn't dominate the runtime or memory usage, so I'm not too concerned
>> about it.
>
> Hmm, you might be right.
>
>>> If you're concerned about memory usage, try reducing work_mem; you've
>>> probably got it set to something huge.
>>
>> work_mem = 1 GB (see diag.{tex/pdf}).
>>
>> The point isn't that I'm using so much memory. Again, my question is, why
>> are these changes affecting memory usage so drastically?
>
> Well each sort or hash can use an amount of memory that is limited
> from above by work_mem.  So if you write the query in a way that
> involves more sorts or hashes, each one can add up to 1GB to your
> memory usage, plus overhead.  However, it doesn't look like any of
> your queries including 30 sorts or hashes, so I'm thinking that the
> RSS number probably also includes some of the shared memory that has
> been mapped into each backend's address space.  RSS is not a terribly
> reliable number when dealing with shared memory; it's hard to say what
> that really means.

>>> You might need to create some indices, too.

>> Ok. To what purpose? This query picks up everything from the tables and the
>> planner does table scans, so conventional wisdom and indeed my experience,
>> says that indexes are not going to be so useful.

> Well, a hash join is not usually the first thing that pops to mind when
> dealing with a table that has 825 million rows (geno).  I don't know if
> a nested loop with inner-indexscan would be faster, but it would almost
> certainly use less memory.

Can you provide an illustration of what you mean? I don't know what a
"nested loop with inner-indexscan" is in this context.

                                                        Regards, Faheem.

pgsql-performance by date:

Previous
From: Robert Haas
Date:
Subject: Re: experiments in query optimization
Next
From: Josh Berkus
Date:
Subject: Re: why does swap not recover?