Or: A tale about Linux file write patterns.
So I once wrote a custom core dump handler to be used with Linux’s
core_pattern. What it does is take a core dump on
STDIN plus a few arguments, and then write the core to a predictable
location on disk with a time stamp and suitable access rights. Core
dumps tend to be rather large, and in general you don’t know in
advance how much data you’ll write to disk. So I built a functionality
to write a chunk of data to disk (say, 16MB) and then check with
fstatfs()
if the disk has still more than threshold capacity (say,
10GB). This way, a rapidly restarting and core-dumping application
cannot lead to “disk full” follow up failures that will inevitably
lead to a denial of service for most data handling services.
So… how do we write a lot of data to disk really fast? – Let us maybe
rephrase the question: How do we write data to disk in the first
place? Let’s assume we have already opened file descriptors in
and
out
, and we just want to copy everything from in
to out
.
One might be tempted to try something like this:
ssize_t read_write(int in, int out)
{
ssize_t n, t = 0;
char buf[1024];
while((n = read(in, buf, 1024)) > 0) {
t += write(out, buf, n);
}
return t;
}
“But…!”, you cry out, “there’s so much wrong with this!” And you are right, of course:
n
is not checked. It might be -1
. This might be
because e.g. we have got a bad file descriptor, or because the
syscall was interrupted.write(out, buf, 1024)
will – if it does not return -1
– write at least one byte, but we have no guarantee that we will
actually write all n
bytes to disk. So we have to loop the write
until we have written n
bytes.An updated and semantically correct pattern reads like this (in a real program you’d have to do real error handling instead of assertions, of course):
ssize_t read_write_bs(int in, int out, ssize_t bs)
{
ssize_t w = 0, r = 0, t, n, m;
char *buf = malloc(bs);
assert(buf != NULL);
t = filesize(in);
while(r < t && (n = read(in, buf, bs))) {
if(n == -1) { assert(errno == EINTR); continue; }
r = n;
w = 0;
while(w < r && (m = write(out, buf + w, (r - w)))) {
if(m == -1) { assert(errno == EINTR); continue; }
w += m;
}
}
free(buf);
return w;
}
We have a total number of bytes to read (t
), the number of bytes
already read (r
), and the number of bytes already written (w
).
Only when t == r == w
are we done (or if the input stream ends
prematurely). Error checking is performed so that we restart
interrupted syscalls and crash on real errors.
What about the bs
parameter? Of course you may have already noticed
in the first example that we always copied 1024 bytes. Typically, a
block on the file system is 4KB, so we are only writing quarter
blocks, which is likely bad for performance. So we’ll try different
block sizes and compare the results.
We can find out the file system’s block size like this (as usual, real error handling left out):
ssize_t block_size(int fd)
{
struct statfs st;
assert(fstatfs(fd, &st) != -1);
return (ssize_t) st.f_bsize;
}
OK, let’s do some benchmarks! (Full code is on GitHub.) For simplicity I’ll try things on my laptop computer with Ext3+dmcrypt and an SSD. This is “read a 128MB file and write it out”, repeated for different block sizes, timing each version three times and printing the best time in the first column. In parantheses you’ll see the percentage increase in comparison to the best run of all methods:
read+write 16bs 164ms 191ms 206ms
read+write 256bs 167ms 168ms 187ms (+ 1.8%)
read+write 4bs 169ms 169ms 177ms (+ 3.0%)
read+write bs 184ms 191ms 200ms (+ 12.2%)
read+write 1k 299ms 317ms 329ms (+ 82.3%)
Mh. Seems like multiples of the FS’s block sizes don’t really matter here. In some runs, the 16x blocksize is best, sometimes it’s the 256x. The only obvious point is that writing only a single block at once is bad, and writing fractions of a block at once is very bad indeed performance-wise.
Now what’s there to improve? “Surely it’s the overhead of using
read()
to get data,” I hear you saying, “Use mmap()
for that!”
So we come up with this:
ssize_t mmap_write(int in, int out)
{
ssize_t w = 0, n;
size_t len;
char *p;
len = filesize(in);
p = mmap(NULL, len, PROT_READ, MAP_SHARED, in, 0);
assert(p != NULL);
while(w < len && (n = write(out, p + w, (len - w)))) {
if(n == -1) { assert(errno == EINTR); continue; }
w += n;
}
munmap(p, len);
return w;
}
Admittedly, the pattern is simpler. But, alas, it is even a little bit slower! (YMMV)
read+write 16bs 167ms 171ms 209ms
mmap+write 186ms 187ms 211ms (+ 11.4%)
“Surely copying around useless data is hurting performance,” I hear you say, “it’s 2014, use zero-copy already!” – OK. So basically there are two approaches for this on Linux: One cumbersome but rather old and known to work, and then there is the new and shiny sendfile interface.
For the splice
approach, since either reader or writer of your splice call must be
pipes (and in our case both are regular files), we need to create a
pipe solely for the purpose of splicing data from in
to the write
end of the pipe, and then again splicing that same chunk from the read
end to the out
fd:
ssize_t pipe_splice(int in, int out)
{
size_t bs = 65536;
ssize_t w = 0, r = 0, t, n, m;
int pipefd[2];
int flags = SPLICE_F_MOVE | SPLICE_F_MORE;
assert(pipe(pipefd) != -1);
t = filesize(in);
while(r < t && (n = splice(in, NULL, pipefd[1], NULL, bs, flags))) {
if(n == -1) { assert(errno == EINTR); continue; }
r += n;
while(w < r && (m = splice(pipefd[0], NULL, out, NULL, bs, flags))) {
if(m == -1) { assert(errno == EINTR); continue; }
w += m;
}
}
close(pipefd[0]);
close(pipefd[1]);
return w;
}
“This is not true zero copy!”, I hear you cry, and it’s true, the ‘page stealing’ mechanism has been discontinued as of 2007. So what we get is an “in-kernel memory copy”, but at least the file contents don’t cross the kernel/userspace boundary twice unnecessarily (we don’t inspect it anyway, right?).
The sendfile()
approach is more immediate and clean:
ssize_t do_sendfile(int in, int out)
{
ssize_t t = filesize(in);
off_t ofs = 0;
while(ofs < t) {
if(sendfile(out, in, &ofs, t - ofs) == -1) {
assert(errno == EINTR);
continue;
}
}
return t;
}
So… do we get an actual performance gain?
sendfile 159ms 168ms 175ms
pipe+splice 161ms 162ms 163ms (+ 1.3%)
read+write 16bs 164ms 165ms 178ms (+ 3.1%)
“Yes! I knew it!” you say. But I’m lying here. Every time I execute
the benchmark, another different approach is the fastest. Sometimes
the read/write approach comes in first before the two others. So it
seems that this is not really a performance saver, is it? I like the
sendfile()
semantics, though. But beware:
In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately.
Strangely, sendfile()
works on regular files in the default Debian
Squeeze Kernel (2.6.32-5) without problems. (Update 2015-01-17:
Przemysław Pawełczyk, who in 2011 sent Changli Gao’s patch which
re-enables this behaviour to stable@kernel.org for inclusion in Linux
2.6.32, wrote to me explaining how exactly it ended up being
backported. If you’re interested, see this excerpt from his
email.)
“But,” I hear you saying, “the system has no clue what your intentions are, give it a few hints!” and you are probably right, that shouldn’t hurt:
void advice(int in, int out)
{
ssize_t t = filesize(in);
posix_fadvise(in, 0, t, POSIX_FADV_WILLNEED);
posix_fadvise(in, 0, t, POSIX_FADV_SEQUENTIAL);
}
But since the file is very probably fully cached, the performance is not improved significantly. “BUT you should supply a hint on how much you will write, too!” – And you are right. And this is where the story branches off into two cases: Old and new file systems.
I’ll just tell the kernel that I want to write t
bytes to disk now,
and please reserve space (I don’t care about a “disk full” that I
could catch and act on):
void do_falloc(int in, int out)
{
ssize_t t = filesize(in);
posix_fallocate(out, 0, t);
}
I’m using my workstation’s SSD with XFS now (not my laptop any more). Suddenly everything is much faster, so I’ll simply run the benchmarks on a 512MB file so that it actually takes time:
sendfile + advices + falloc 205ms 208ms 208ms
pipe+splice + advices + falloc 207ms 209ms 210ms (+ 1.0%)
sendfile 226ms 226ms 229ms (+ 10.2%)
pipe+splice 227ms 227ms 231ms (+ 10.7%)
read+write 16bs + advices + falloc 235ms 240ms 240ms (+ 14.6%)
read+write 16bs 258ms 259ms 263ms (+ 25.9%)
Wow, so this posix_fallocate()
thing is a real improvement! It seems
reasonable enough, of course: Already the file system can prepare an
– if possible contiguous – sequence of blocks in the requested size. But
wait! What about Ext3? Back to the laptop:
sendfile 161ms 171ms 194ms
read+write 16bs 164ms 174ms 189ms (+ 1.9%)
pipe+splice 167ms 170ms 178ms (+ 3.7%)
read+write 16bs + advices + falloc 224ms 229ms 229ms (+ 39.1%)
pipe+splice + advices + falloc 229ms 239ms 241ms (+ 42.2%)
sendfile + advices + falloc 232ms 235ms 249ms (+ 44.1%)
Bummer. That was unexpected. Why is that? Let’s check strace
while
we execute this program:
fallocate(1, 0, 0, 134217728) = -1 EOPNOTSUPP (Operation not supported)
...
pwrite(1, "\0", 1, 4095) = 1
pwrite(1, "\0", 1, 8191) = 1
pwrite(1, "\0", 1, 12287) = 1
pwrite(1, "\0", 1, 16383) = 1
...
What? Who does this? – Glibc does this! It sees the syscall fail and re-creates the semantics by hand. (Beware, Glibc code follows. Safe to skip if you want to keep your sanity.)
/* Reserve storage for the data of the file associated with FD. */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
#ifdef __NR_fallocate
# ifndef __ASSUME_FALLOCATE
if (__glibc_likely (__have_fallocate >= 0))
# endif
{
INTERNAL_SYSCALL_DECL (err);
int res = INTERNAL_SYSCALL (fallocate, err, 6, fd, 0,
__LONG_LONG_PAIR (offset >> 31, offset),
__LONG_LONG_PAIR (len >> 31, len));
if (! INTERNAL_SYSCALL_ERROR_P (res, err))
return 0;
# ifndef __ASSUME_FALLOCATE
if (__glibc_unlikely (INTERNAL_SYSCALL_ERRNO (res, err) == ENOSYS))
__have_fallocate = -1;
else
# endif
if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
return INTERNAL_SYSCALL_ERRNO (res, err);
}
#endif
return internal_fallocate (fd, offset, len);
}
And you guessed it, internal_fallocate()
just does a pwrite()
on
the first byte for every block until the space requirement is
fulfilled. This is slowing things down considerably. This is bad. –
“But other people just truncate the file! I saw this!”, you interject, and again you are right.
void enlarge_truncate(int in, int out)
{
ssize_t t = filesize(in);
ftruncate(out, t);
}
Indeed the truncate versions work faster on Ext3:
pipe+splice + advices + trunc 157ms 158ms 160ms
read+write 16bs + advices + trunc 158ms 167ms 188ms (+ 0.6%)
sendfile + advices + trunc 164ms 167ms 181ms (+ 4.5%)
sendfile 164ms 171ms 193ms (+ 4.5%)
pipe+splice 166ms 167ms 170ms (+ 5.7%)
read+write 16bs 178ms 185ms 185ms (+ 13.4%)
Alas, not on XFS. There, the fallocate()
system call is just more
performant. (You can also use
xfsctl
directly for that.) –
And this is where the story ends.
In place of a sweeping conclusion, I’m a little bit disappointed that
there seems to be no general semantics to say “I’ll write n
bytes
now, please be prepared”. Obviously, using posix_fallocate()
on Ext3
hurts very much (this may be why cp
is not
employing
it). So I guess the best solution is still something like this:
if(fallocate(out, 0, 0, len) == -1 && errno == EOPNOTSUPP)
ftruncate(out, len);
Maybe you have another idea how to speed up the writing process? Then drop me an email, please.
Update 2014-05-03: Coming back after a couple of days’ vacation, I found the post was on HackerNews and generated some 23k hits here. I corrected the small mistake in example 2 (as pointed out in the comments – thanks!). – I trust that the diligent reader will have noticed that this is not a complete survey of either I/O hierarchy, file system and/or hard drive performace. It is, as the subtitle should have made clear, a “tale about Linux file write patterns”.
Update 2014-06-09: Sebastian pointed out an error
in the mmap write pattern (the write should start at p + w
, not at p
).
Also, the basic read/write pattern contained a subtle error. Tricky business –
Thanks!