Thread: what would tar file FDW look like?
<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">I'm starting to work on a tarFDW as a proxy for a much more specific FDW. (It's the 'faster to build two and toss the first away' approach - tar letsme get the FDW stuff nailed down before attacking the more complex container.) It could also be useful in its own right,or as the basis for a zip file FDW.</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">Ihave figured out that in one mode the FDW mapping that would take thename of the tarball as an option and produce a relation that has all of the metadata for the contained files - filename,size, owner, timestamp, etc. I can use the same approach I used for the /etc/passwd FDW for that.</div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">(BTWthe current version is at <a href="https://github.com/beargiles/passwd-fdw">https://github.com/beargiles/passwd-fdw</a>.It's skimpy on automated testsuntil I can figure out how to handle the user mapping but it works.)</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">Theproblem is the second mode where I pull a single file out of the FDW.I've identified three approachs so far:</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">1.A FDW mapping specific to each file. It would take the name of thetarfile and the embedded file. Cleanest in some ways but it would be a real pain if you're reading a tarball dynamically.</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">2. A user-defined function that takes the name ofthe tarball and file and returns a blob. This is the traditional approach but why bother with a FDW then? It also bringsup access control issues since it requires disclosure of the tarball name to the user. A FDW could hide that.</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">3. A user-defined function that takes a tar FDWand the name of a file and returns a blob. I think this is the best approach but I don't know if I can specify a FDW asa parameter or how to access it.</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">I'veskimmed the existing list of FDW but didn't find anything that canserve as a model. The foreign DB are closest but, again, they aren't designed for dynamic use where you want to do somethingwith every file in an archive / table in a foreign DB.</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">Isthere an obvious approach? Or is it simply a bad match for FDW andshould be two standard UDF? (One returns the metadata, the second returns the specific file.)</div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">Thanks,</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">Bear</div></div>
On Mon, Aug 17, 2015 at 3:14 PM, Bear Giles <bgiles@coyotesong.com> wrote: > I'm starting to work on a tar FDW as a proxy for a much more specific FDW. > (It's the 'faster to build two and toss the first away' approach - tar lets > me get the FDW stuff nailed down before attacking the more complex > container.) It could also be useful in its own right, or as the basis for a > zip file FDW. Hm. tar may be a bad fit where zip may be much easier. Tar has no index or table of contents. You have to scan the entire file to find all the members. IIRC Zip does have a table of contents at the end of the file. The most efficient way to process a tar file is to describe exactly what you want to happen with each member and then process it linearly from start to end (or until you've found the members you're looking for). Trying to return meta info and then go looking for individual members will be quite slow and have a large startup cost. -- greg
On 08/17/2015 10:14 AM, Bear Giles wrote: > I'm starting to work on a tar FDW as a proxy for a much more specific > FDW. (It's the 'faster to build two and toss the first away' approach > - tar lets me get the FDW stuff nailed down before attacking the more > complex container.) It could also be useful in its own right, or as > the basis for a zip file FDW. > > I have figured out that in one mode the FDW mapping that would take > the name of the tarball as an option and produce a relation that has > all of the metadata for the contained files - filename, size, owner, > timestamp, etc. I can use the same approach I used for the /etc/passwd > FDW for that. > > (BTW the current version is at > https://github.com/beargiles/passwd-fdw. It's skimpy on automated > tests until I can figure out how to handle the user mapping but it works.) > > The problem is the second mode where I pull a single file out of the > FDW. I've identified three approachs so far: > > 1. A FDW mapping specific to each file. It would take the name of the > tarfile and the embedded file. Cleanest in some ways but it would be a > real pain if you're reading a tarball dynamically. > > 2. A user-defined function that takes the name of the tarball and file > and returns a blob. This is the traditional approach but why bother > with a FDW then? It also brings up access control issues since it > requires disclosure of the tarball name to the user. A FDW could hide > that. > > 3. A user-defined function that takes a tar FDW and the name of a file > and returns a blob. I think this is the best approach but I don't know > if I can specify a FDW as a parameter or how to access it. > > I've skimmed the existing list of FDW but didn't find anything that > can serve as a model. The foreign DB are closest but, again, they > aren't designed for dynamic use where you want to do something with > every file in an archive / table in a foreign DB. > > Is there an obvious approach? Or is it simply a bad match for FDW and > should be two standard UDF? (One returns the metadata, the second > returns the specific file.) > > I would probably do something like this: In this mode, define a table that has <path, blob>. To get the blob for a single file, just do "select blob from fdwtable where path = '/path/to/foo'". Make sure you process the qual in the FDW. e.g. create foreign table tarblobs (path text, blob bytea) server tarfiles options (filename '/path/to/tarball', mode 'contents'); cheers andrew
<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">I've written readers for bothfrom scratch. Tar isn't that bad since it's blocked - you read the header, skip forward N blocks, continue. The hardestpart is setting up the decompression libraries if you want to support tar.gz or tar.bz2 files.</div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">Zipfiles are more complex. You have (iirc) 5 control blocks - start ofarchive, start of file, end of file, start of index, end of archive, and the information in the control block is prettylimited. That's not a huge burden since there's support for extensions for things like the unix file metadata. Onecomplication is that you need to support compression from the start.</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">Zipfiles support two types of encryption. There's a really weak versionthat almost nobody supports and a much stronger modern version that's subject to license restrictions. (Some peopleuse the weak version on embedded systems because of legal requirements to /do something/, no matter how lame.)</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">There are third-party libraries, of course, butthat introduces dependencies. Both formats are simple enough to write from scratch.</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br/></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:#000000">Iguess my bigger question is if there's an interest in either or bothfor "real" use. I'm doing this as an exercise but am willing to contrib the code if there's a general interest in it.</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">(BTW the more complex object I'm working on is the.p12 keystore for digital certificates and private keys. We have everything we need in the openssl library so there'sno additional third-party dependencies. I have a minimal FDW for the digital certificate itself and am now workingon a way to access keys stored in a standard format on the filesystem instead of in the database itself. A naturalfit is a specialized archive FDW. Unlike tar and zip it will have two payloads, the digital certificate and the (optionallyencrypted) private key. It has searchable metadata, e.g., finding all records with a specific subject.)</div><divclass="gmail_default" style="font-family:tahoma,sans-serif;color:#000000"><br /></div><div class="gmail_default"style="font-family:tahoma,sans-serif;color:#000000">Bear</div></div><div class="gmail_extra"><br /><divclass="gmail_quote">On Mon, Aug 17, 2015 at 8:29 AM, Greg Stark <span dir="ltr"><<a href="mailto:stark@mit.edu"target="_blank">stark@mit.edu</a>></span> wrote:<br /><blockquote class="gmail_quote" style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Mon, Aug 17, 2015 at 3:14 PM, BearGiles <<a href="mailto:bgiles@coyotesong.com">bgiles@coyotesong.com</a>> wrote:<br /> > I'm starting to workon a tar FDW as a proxy for a much more specific FDW.<br /> > (It's the 'faster to build two and toss the first away'approach - tar lets<br /> > me get the FDW stuff nailed down before attacking the more complex<br /> > container.)It could also be useful in its own right, or as the basis for a<br /> > zip file FDW.<br /><br /></span>Hm.tar may be a bad fit where zip may be much easier. Tar has no<br /> index or table of contents. You have to scanthe entire file to find<br /> all the members. IIRC Zip does have a table of contents at the end of<br /> the file.<br/><br /> The most efficient way to process a tar file is to describe exactly<br /> what you want to happen with eachmember and then process it linearly<br /> from start to end (or until you've found the members you're looking<br /> for).Trying to return meta info and then go looking for individual<br /> members will be quite slow and have a large startupcost.<br /><span class="HOEnZb"><font color="#888888"><br /><br /> --<br /> greg<br /></font></span></blockquote></div><br/></div>