I think I have a novel syscall proposal.

I have a table represented as N files, one file per column. Each file is of a uniform, known type, and is mapped into memory. On this table I perform a relational restrict operation (SQL WHERE), testing each row against a predicate. The output is manifested as a similar set of memory-mapped files.

The restriction is order-preserving and can be parallelized: the work could be partitioned among multiple processors. For N rows and P processors, each process p would read rows [N(p-1)/P, Np/P). (That is, given 100 rows, process 3 of four processors reads rows [50,75).) A sketch of that arithmetic follows this post.

At the end of the parallel stage, I have P sets of partial outputs. Of course, it's possible to concatenate them together to form a single (multifile) output. But why is that necessary? A given logical output column consists of P components in a known order. Instead of

    $ cat part1 part2 part3 part4 > output

why can I not

    $ union -o output part1 part2 part3 part4

where the putative union(1) creates a file by knitting together the blocks assigned to its arguments?

ISTM that "cat foo* > bar" is a pretty common operation, and fairly often the input is discarded once the output is ready. And I have an analogy:

    cp is to mv as cat is to union

To answer my own question -- why not? -- the filesystem doesn't offer block-level access or inode manipulation. You can assign names to inodes and place those names in directories, but you can't adjust what resources an inode uses.

You may object that "union" could result in fragmented files. But so what? If each fragment is a terabyte, the "fragmentation" resulting from ganging the "fragments" together might be far preferable to copying all the blocks around just for the sake of colocation.

It's not difficult to imagine a syscall to union two files without exposing inode details. In fact, it's easy to imagine that

    int union( const char *from, const char *to );

might leave to the discretion of the kernel or filesystem whether to diddle with the inodes or just extend the file conventionally, based on disk geometry or whatever.

Have you heard of this idea before? Seen it implemented?

--jkl
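A minimal sketch of the row-partitioning arithmetic described above, assuming N rows split across P workers into half-open ranges; the function and names are illustrative only:

    #include <stdio.h>

    /* Worker p (1-based) of P gets the half-open row range
       [N*(p-1)/P, N*p/P).  The ranges tile [0, N) exactly. */
    static void row_range(long N, int P, int p, long *lo, long *hi)
    {
        *lo = N * (long)(p - 1) / P;
        *hi = N * (long)p / P;
    }

    int main(void)
    {
        long lo, hi;
        for (int p = 1; p <= 4; p++) {
            row_range(100, 4, p, &lo, &hi);
            printf("worker %d: rows [%ld, %ld)\n", p, lo, hi);
        }
        return 0;   /* worker 3 prints [50, 75), matching the post */
    }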
On 12/13/2016 11:30 PM, James K. Lowden wrote:
> [...]
> A given logical output column consists of P components in a known
> order. Instead of
>
>     $ cat part1 part2 part3 part4 > output
>
> why can I not
>
>     $ union -o output part1 part2 part3 part4
>
> where the putative union(1) creates a file by knitting together the
> blocks assigned to its arguments?

One difficulty would be dealing with allocation "atoms" -- "blocks" or "pages" or whatever your file system of choice likes to call them. In an ordinary file, every block except the last is chock-a-block full, but the file you illustrate could have as many as four partially-filled blocks. (Even in sparse files, the gaps are block-aligned.) So, how many such partial-block gaps might appear, how much bookkeeping might the file system need to do, and could lseek() and stat() still be done efficiently?

Not to say this would be an insuperable amount of work, but it would probably be a "more than epsilon" amount.

> Have you heard of this idea before? Seen it implemented?

I've seen something vaguely like it, but not on a file system with Unixoid semantics. (It supported things like deletion and insertion in the middle of a file; you could build a file from the middle outward instead of from beginning to end ...) But even that one required special conventions for unused block frags: they were filled with zeroes that the applications had to know to ignore (hence, such files were, er, "not recommended" for binary data unless the data had its own "record structure" to distinguish padding from payload).

--
esosman@comcast-dot-net.invalid
"Nobody ever went broke underestimating the intelligence
of the American public." -- HLM (paraphrased)
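To make the partial-block difficulty concrete, here is a small sketch with made-up part sizes and an assumed 4096-byte allocation unit, computing the gaps a block-knitting union would have to hide:

    #include <stdio.h>

    #define BLK 4096L   /* assumed allocation unit */

    int main(void)
    {
        /* hypothetical sizes of the four parts, in bytes */
        long sizes[] = { 1000000L, 999424L, 1000001L, 4096L };

        for (int i = 0; i < 4; i++) {
            long tail = sizes[i] % BLK;          /* bytes used in the final block */
            long gap  = tail ? BLK - tail : 0;   /* padding to be hidden */
            printf("part%d: %ld unused bytes in its final block\n",
                   i + 1, gap);
        }
        return 0;   /* in an ordinary file, only the last gap may be nonzero */
    }

Every nonzero gap that lands in the middle of the knitted file is exactly the bookkeeping problem described above.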
James K. Lowden <jklowden@speakeasy.net> wrote:
<snip>
> ISTM that "cat foo* > bar" is a pretty common operation, and fairly
> often the input is discarded once the output is ready. And I have an
> analogy:
>
>     cp is to mv as cat is to union
>
> To answer my own question -- why not? -- the filesystem doesn't offer
> block-level access or inode manipulation. You can assign names to
> inodes and place those names in directories, but you can't adjust what
> resources an inode uses.
>
> You may object that "union" could result in fragmented files. But so
> what? If each fragment is a terabyte, the "fragmentation" resulting
> from ganging the "fragments" together might be far preferable to
> copying all the blocks around just for the sake of colocation.
>
> It's not difficult to imagine a syscall to union two files without
> exposing inode details. In fact, it's easy to imagine that
>
>     int union( const char *from, const char *to );
>
> might leave to the discretion of the kernel or filesystem whether to
> diddle with the inodes or just extend the file conventionally, based on
> disk geometry or whatever.
>
> Have you heard of this idea before? Seen it implemented?

Linux has the splice(2) system call. The idea behind splice is to let you copy or move in-kernel buffers between objects (pipes, the file buffer cache, and sockets) without copying into and out of userspace; with vmsplice(2) you can even move user VM pages. It's similar to sendfile(2) (from FreeBSD and Solaris) but far more generic.

There was at least one patch to optimize file-to-file splices, particularly for filesystems like Btrfs where blocks can be shared between files:

https://lwn.net/Articles/567086/

See also https://lwn.net/Articles/550621/
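For reference, the usual file-to-file splice(2) idiom bounces the data through a pipe, since at least one side of each splice call must be a pipe. A minimal sketch (Linux-specific; file names are placeholders, error handling is skeletal):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int in  = open("in.dat", O_RDONLY);
        int out = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int p[2];
        if (in < 0 || out < 0 || pipe(p) < 0) { perror("setup"); return 1; }

        for (;;) {
            /* move up to 64 KiB from the input file into the pipe, in kernel */
            ssize_t n = splice(in, NULL, p[1], NULL, 1 << 16, SPLICE_F_MOVE);
            if (n < 0) { perror("splice in"); return 1; }
            if (n == 0) break;                      /* EOF */
            while (n > 0) {
                /* drain the pipe into the output file, still in kernel */
                ssize_t m = splice(p[0], NULL, out, NULL, n, SPLICE_F_MOVE);
                if (m < 0) { perror("splice out"); return 1; }
                n -= m;
            }
        }
        return 0;
    }

The bytes never enter userspace, but note that this still copies (or at best remaps) pages; it does not re-link blocks the way the proposed union would.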
On Wed, 14 Dec 2016 00:23:49 -0500
Eric Sosman <esosman@comcast-dot-net.invalid> wrote:

> In an ordinary file, every block except the last is chock-a-block
> full, but the file you illustrate could have as many as four
> partially-filled blocks.

Mmph. Thinking in terms of memory-mapped files, the notion of partially used blocks in the middle of the file is especially problematic. There's no hardware mechanism to "fake out" the block alignment -- to make file2 appear contiguous with file1 when file1's final block isn't full. Without that, the kernel has no advantage; any "elision" of the unused interim bytes could as easily be done in userspace.

Non-epsilon, indeed. Thanks for the reply.

--jkl
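The hardware constraint can be seen directly with mmap(2): two files can be made to appear contiguous in memory only if the first file's length is an exact multiple of the page size. A sketch, with placeholder file names and minimal error handling:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        long pg = sysconf(_SC_PAGESIZE);
        int fd1 = open("file1", O_RDONLY);
        int fd2 = open("file2", O_RDONLY);
        struct stat st1, st2;
        if (fd1 < 0 || fd2 < 0 ||
            fstat(fd1, &st1) < 0 || fstat(fd2, &st2) < 0) {
            perror("open/fstat"); return 1;
        }
        if (st1.st_size % pg != 0) {
            /* file1's last page is partial: its tail and the start of
               file2 would have to share one page, which no MMU allows */
            fprintf(stderr, "file1 not page-aligned; cannot be contiguous\n");
            return 1;
        }
        /* reserve one address range, then map both files into it */
        char *base = mmap(NULL, st1.st_size + st2.st_size, PROT_READ,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }
        if (mmap(base, st1.st_size, PROT_READ,
                 MAP_PRIVATE | MAP_FIXED, fd1, 0) == MAP_FAILED ||
            mmap(base + st1.st_size, st2.st_size, PROT_READ,
                 MAP_PRIVATE | MAP_FIXED, fd2, 0) == MAP_FAILED) {
            perror("mmap file"); return 1;
        }
        /* base now views file1 immediately followed by file2 */
        return 0;
    }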
"James K. Lowden" , dans le message <20161213233012.331fe9316e545212d6ace577@speakeasy.net>, a �crit�: > why can I not > > $ union -o output part1 part part3 part4 > > where the putative union(1) creates a file by knitting together the > blocks assigned to its arguments? Because it is awfully specific. To implement what you request, it would require generic code for a system call, plus specific code in each filesystem implementation plus fallback emulation code for the filesystems where it is not implemented. All this to satisfy a single use case, mostly useless for anything else, and giving only a slight speedup compared to doing it in userland. If you need this, it seems to me you are using the wrong file format or the wrong tool.
On 14 Dec 2016 16:22:21 GMT
Nicolas George <nicolas$george@salle-s.org> wrote:

> "James K. Lowden", in message
> <20161213233012.331fe9316e545212d6ace577@speakeasy.net>, wrote:
> > why can I not
> >
> >     $ union -o output part1 part2 part3 part4
> >
> > where the putative union(1) creates a file by knitting together the
> > blocks assigned to its arguments?
>
> Because it is awfully specific. To implement what you request, it
> would require generic code for a system call,

true for any new call

> plus specific code in each filesystem implementation

true for any new filesystem feature

> plus fallback emulation code for the filesystems where it is not
> implemented.

It's fairly common for a syscall to fail if the underlying device doesn't support it. ENODEV comes to mind.

> All this to satisfy a single use case, mostly useless for anything
> else, and giving only a slight speedup compared to doing it in
> userland.

Could I do what I want to do in userland? Of course. What I can't do is join many files into one without moving the data on the disk.

Is it highly specialized? I don't know. Is splice(2) specialized? Is passing file descriptors over a Unix domain socket? I think we could find quite a few arcane features of the kernel.

Eric Sosman convinced me it's impractical on technical grounds. No underlying machine could map a sparse file such as I describe as contiguous memory. No device could represent the beginning of a second file as contiguous with its predecessor. Camouflaging that is no simple matter.

On the other hand, a lot of effort these days goes into minimizing copying. We've had copy-on-write forever, and it is implemented in some filesystems. C++ got move semantics. And there's sendfile(2). So it seemed worth asking.

--jkl
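For completeness, a minimal sendfile(2) sketch (Linux; since 2.6.33 the destination may be a regular file rather than only a socket; file names are placeholders):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>

    int main(void)
    {
        int in  = open("payload.dat", O_RDONLY);
        int out = open("copy.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct stat st;
        if (in < 0 || out < 0 || fstat(in, &st) < 0) {
            perror("setup"); return 1;
        }
        off_t off = 0;
        while (off < st.st_size) {
            /* the kernel copies directly from in's page cache to out */
            ssize_t n = sendfile(out, in, &off, st.st_size - off);
            if (n <= 0) { perror("sendfile"); return 1; }
        }
        return 0;
    }

Like splice, it avoids the userspace round trip but still moves the bytes; the blocks on disk are not shared.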
On 14.12.16 06.23, Eric Sosman wrote:
> On 12/13/2016 11:30 PM, James K. Lowden wrote:
>> why can I not
>>
>>     $ union -o output part1 part2 part3 part4
>>
>> where the putative union(1) creates a file by knitting together the
>> blocks assigned to its arguments?
>
> One difficulty would be dealing with allocation "atoms" --
> "blocks" or "pages" or whatever your file system of choice likes
> to call them. In an ordinary file, every block except the last
> is chock-a-block full, but the file you illustrate could have as
> many as four partially-filled blocks. (Even in sparse files,
> the gaps are block-aligned.) So, how many such partial-block
> gaps might appear, how much bookkeeping might the file system
> need to do, and could lseek() and stat() be done efficiently?

In fact it would be possible to write a driver that provides exactly this behavior with reasonable performance at the VFS level. It is basically the data structure of a rope class.

When dealing with very large data, mapping it unfragmented into memory can be difficult because of fragmentation of the virtual address space, and insert/delete operations in the middle are extraordinarily expensive. (In fact, any editor that can deal with large files has this problem.) Cutting the data into arbitrary chunks improves the situation significantly. The resulting structure is similar to a B-tree, so random access becomes O(log n), which is not that bad in general. (A sketch of such a lookup follows this post.)

> Not to say this would be an insuperable amount of work, but
> it would probably be a "more than epsilon" amount.

The question is more like: should this be a standard function of a file system? Is it really needed that often? Many file formats (e.g. XML, JSON ...) do not tolerate simple concatenation at the binary level. In those cases it is useless. If this requirement becomes more common, a one-to-many symlink could be the solution. However, that requires changes to the filesystem API to access the feature.

>> Have you heard of this idea before? Seen it implemented?
>
> I've seen something vaguely like it, but not on a file system
> with Unixoid semantics. (It supported things like deletion and
> insertion in the middle of a file; you could build a file from
> the middle outward instead of from beginning to end ...) But
> even that one required special conventions for unused block frags:
> they were filled with zeroes that the applications had to know to
> ignore.

This sounds more like sparse files. They can be quite useful if you want to add trim support to the virtual disk of a VM. There is no other way to mark arbitrary chunks of the virtual-disk file as free such that the host file system can trim those regions on the physical device as well.

Marcel
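A minimal sketch of the rope-style lookup described above: binary search over cumulative chunk ends maps a logical file offset to (chunk, offset within chunk) in O(log n). The chunk sizes are made up for illustration:

    #include <stdio.h>

    /* Return the index of the chunk containing logical offset `off`,
       storing the offset within that chunk in *within.
       ends[i] is the cumulative size of chunks 0..i. */
    static int locate(const long *ends, int n, long off, long *within)
    {
        int lo = 0, hi = n - 1;
        while (lo < hi) {               /* first chunk whose end exceeds off */
            int mid = (lo + hi) / 2;
            if (ends[mid] <= off) lo = mid + 1; else hi = mid;
        }
        *within = off - (lo ? ends[lo - 1] : 0);
        return lo;
    }

    int main(void)
    {
        long sizes[] = { 700, 4096, 50, 8192 };   /* arbitrary chunk sizes */
        long ends[4], sum = 0;
        for (int i = 0; i < 4; i++) ends[i] = (sum += sizes[i]);

        long within;
        int c = locate(ends, 4, 4800L, &within);
        printf("offset 4800 -> chunk %d, offset %ld\n", c, within);
        return 0;   /* prints: offset 4800 -> chunk 2, offset 4 */
    }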
"James K. Lowden" , dans le message <20161215182335.56c8f237068ac0140e55176d@speakeasy.net>, a �crit�: > true for any new call > true for any new filesystem feature This is why there are so few of them. > It's fairly common for a syscall to fail if the underlying device > doesn't support it. ENODEV comes to mind. Then consider you have your syscall. > Could I do what I want to do in userland? Of course. What I can't do > is join many files into one You can do that in userland. > without moving the data on the disk. And this is a break of abstraction. > Is it highly speciallzed? I don't know. Is splice(2) speciallized? > Is passing file descriptors over a Unix domain socket? I think we > could find quite a few arcane features of the kernel. The features you did find are IMHO much less specialized than what you request. > On the other hand, a lot of effort these days goes into minimizing > copying. Yes, but it is supported by specialized file formats, not dumped on top of an aging API.