Re: Parallel copy - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: Parallel copy
Msg-id: CAA4eK1LyAyPCtBk4rkwomeT6=yTse5qWws-7i9EFwnUFZhvu5w@mail.gmail.com
In response to: Re: Parallel copy (Andrew Dunstan <andrew.dunstan@2ndquadrant.com>)
List: pgsql-hackers
On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
>
> On 2/15/20 7:32 AM, Amit Kapila wrote:
> > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >>
> >> The problem case that I see is the chunk boundary falling in the
> >> middle of a quoted field where
> >> - The quote opens in chunk 1
> >> - The quote closes in chunk 2
> >> - There is an EoL character between the start of chunk 2 and the
> >>   closing quote
> >>
> >> When the worker processing chunk 2 starts, it believes itself to be
> >> in out-of-quote state, so only data between the start of the chunk
> >> and the EoL is regarded as belonging to the partial line. From that
> >> point on, the parsing of the rest of the chunk goes off track.
> >>
> >> Some of the resulting errors can be avoided by, for instance,
> >> requiring a quote to be preceded by a delimiter or EoL. That answer
> >> fails when fields end with EoL characters, which happens often
> >> enough in the wild.
> >>
> >> Recovering from an incorrect in-quote/out-of-quote state assumption
> >> at the start of parsing a chunk just seems like a hole with no
> >> bottom. So it looks to me like it's best done in a single process
> >> which can keep track of that state reliably.
> >>
> > Good point, and I agree with you that having a single process would
> > avoid any such problem. However, I will think some more on it, and
> > if you/anyone else gets some idea on how to deal with this in a
> > multi-worker setup (where we can allow each worker to read and
> > process its chunk), then feel free to share your thoughts.
>
> IIRC, in_quote only matters here in CSV mode (because CSV fields can
> have embedded newlines).

AFAIU, that is correct.

> So why not just forbid parallel copy in CSV mode, at least for now? I
> guess it depends on the actual use case. If we expect to be parallel
> loading humungous CSVs then that won't fly.

I am not sure about this part. However, I guess we should at the very
least have an extendable solution that can be made to deal with CSV;
otherwise, we might end up re-designing everything if someday we want
to support CSV. One naive idea is that in CSV mode we can set things
up slightly differently: a worker won't start processing its chunk
until the previous chunk is completely parsed. Each worker would first
parse and tokenize its entire chunk and only then start writing it.
This makes the reading/parsing part serialized, but writes can still
happen in parallel. Now, I don't know if it is a good idea to process
things differently for CSV mode.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
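
To make the failure mode Alastair describes concrete, here is a
minimal standalone C sketch (illustrative only; this is not the actual
COPY parsing code, and the scanner and sample data are invented for
the example). The first record boundary a worker finds in its chunk
depends entirely on the in-quote/out-of-quote state it assumes at the
chunk start, and the worker has no local way to know that state:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the offset just past the first record-terminating newline in
 * a chunk, given an assumed starting quote state, or -1 if none is
 * found.  Doubled quotes ("") inside a quoted field are treated as an
 * escaped quote, as in CSV.
 */
static int
first_record_end(const char *chunk, int len, bool in_quote)
{
    for (int i = 0; i < len; i++)
    {
        if (chunk[i] == '"')
        {
            if (in_quote && i + 1 < len && chunk[i + 1] == '"')
                i++;            /* escaped quote, stay in-quote */
            else
                in_quote = !in_quote;
        }
        else if (chunk[i] == '\n' && !in_quote)
            return i + 1;       /* newline outside quotes ends a record */
    }
    return -1;
}

int
main(void)
{
    /*
     * One logical record whose quoted field contains a newline, split
     * so the quote opens in chunk 1 and closes here in chunk 2.
     */
    const char *chunk2 = "line\" ,3\n4,\"d\",5\n";
    int         len = (int) strlen(chunk2);

    /* Wrong assumption: worker believes it starts out-of-quote. */
    printf("out-of-quote: record ends at %d\n",
           first_record_end(chunk2, len, false));
    /* Correct state: the chunk actually starts inside a quoted field. */
    printf("in-quote:     record ends at %d\n",
           first_record_end(chunk2, len, true));
    return 0;
}

With the wrong out-of-quote assumption the scanner reports the record
ending at offset 17 instead of 9, silently swallowing the following
record as part of the "partial line", which is exactly the point where
parsing goes off track.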
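And a rough sketch of the ordering in the naive "serialized parse,
parallel write" idea above, using plain pthreads as a stand-in for
parallel workers (all names are hypothetical; an actual patch would
use PostgreSQL's background workers and shared memory):

#include <pthread.h>
#include <stdio.h>

#define NCHUNKS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int parse_done = 0;      /* chunks fully parsed so far, in order */

static void *
worker(void *arg)
{
    int chunkno = (int) (long) arg;

    /* Phase 1: parsing is serialized in chunk order. */
    pthread_mutex_lock(&lock);
    while (parse_done != chunkno)
        pthread_cond_wait(&cond, &lock);

    /* ... tokenize the chunk, ending with a known quote state ... */
    printf("chunk %d: parsed\n", chunkno);

    parse_done++;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);

    /* Phase 2: writing the tokenized tuples can overlap freely. */
    printf("chunk %d: writing in parallel\n", chunkno);
    return NULL;
}

int
main(void)
{
    pthread_t threads[NCHUNKS];

    for (long i = 0; i < NCHUNKS; i++)
        pthread_create(&threads[i], NULL, worker, (void *) i);
    for (int i = 0; i < NCHUNKS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Each worker blocks until its predecessor finishes parsing, so the
quote state at every chunk boundary is known exactly, while the write
phase is free to overlap across workers.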