Re: experiments in query optimization - Mailing list pgsql-performance
From | Faheem Mitha |
---|---|
Subject | Re: experiments in query optimization |
Date | |
Msg-id | alpine.DEB.2.00.1003300141290.13883@orwell.homelinux.org Whole thread Raw |
In response to | Re: experiments in query optimization (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-performance |
On Mon, 29 Mar 2010, Robert Haas wrote: > On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha <faheem@email.unc.edu> wrote: >>> It's not really too clear to me from reading this what specific >>> questions you're trying to answer. >> >> Quote from opt.{tex/pdf}, Section 1: >> >> "If I have to I can use Section~\ref{ped_hybrid} and >> Section~\ref{tped_hybrid}, but I am left wondering why I get the performance >> I do out of the earlier versions. Specifically, why is >> Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and >> why does the memory usage in Section~\ref{ped_phenoout} blow up relative to >> Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?" > > Here and in the document, you refer to section numbers for the > "hybrid" version but I don't see where you define what the "hybrid" > version actually is. It is defined later in the file. I don't know if you are looking at the pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text file, I guess the easist way would be to grep for the label ped_hybrid. > And the differences between your queries are not real clear either - > first you say you took out pheno and sex because they weren't necessary, > but then you decide to put them back. I don't know what that means. > If they're not necessary, leave them out. I don't see where I say that pheno and sex weren't necessary. In fact, the word 'necessary' does not appear in the opt document. I took them out to see how it would affect performance. Which is does, dramatically. I say "So, I decided to remove the joins to tables corresponding to the patient data, namely pheno and sex, and the runtime dropped to 150 min, while the memory stayed around 5G." Maybe I wasn't being sufficiently explicit here. Perhaps "So, I decided to remove the joins to tables corresponding to the patient data, namely pheno and sex, to see how it would affect performance..." would have been better. >>> One random thought: WHERE row_number() = 1 is not too efficient. >>> Try using LIMIT or DISTINCT ON instead. >> >> Possibly. However, the CTE that uses >> >> WHERE row_number() = 1 >> >> doesn't dominate the runtime or memory usage, so I'm not too concerned >> about it. > > Hmm, you might be right. > >>> If you're concerned about memory usage, try reducing work_mem; you've >>> probably got it set to something huge. >> >> work_mem = 1 GB (see diag.{tex/pdf}). >> >> The point isn't that I'm using so much memory. Again, my question is, why >> are these changes affecting memory usage so drastically? > > Well each sort or hash can use an amount of memory that is limited > from above by work_mem. So if you write the query in a way that > involves more sorts or hashes, each one can add up to 1GB to your > memory usage, plus overhead. However, it doesn't look like any of > your queries including 30 sorts or hashes, so I'm thinking that the > RSS number probably also includes some of the shared memory that has > been mapped into each backend's address space. RSS is not a terribly > reliable number when dealing with shared memory; it's hard to say what > that really means. >>> You might need to create some indices, too. >> Ok. To what purpose? This query picks up everything from the tables and the >> planner does table scans, so conventional wisdom and indeed my experience, >> says that indexes are not going to be so useful. > Well, a hash join is not usually the first thing that pops to mind when > dealing with a table that has 825 million rows (geno). I don't know if > a nested loop with inner-indexscan would be faster, but it would almost > certainly use less memory. Can you provide an illustration of what you mean? I don't know what a "nested loop with inner-indexscan" is in this context. Regards, Faheem.
pgsql-performance by date: