Julius Plenz - Blog - So you want to write to a file real fast…

So you want to write to a file real fast…

Or: A tale about Linux file write patterns.

So I once wrote a custom core dump handler to be used with Linux’s core_pattern. What it does is take a core dump on STDIN plus a few arguments, and then write the core to a predictable location on disk with a time stamp and suitable access rights. Core dumps tend to be rather large, and in general you don’t know in advance how much data you’ll write to disk. So I built a functionality to write a chunk of data to disk (say, 16MB) and then check with fstatfs() if the disk has still more than threshold capacity (say, 10GB). This way, a rapidly restarting and core-dumping application cannot lead to “disk full” follow up failures that will inevitably lead to a denial of service for most data handling services.

So… how do we write a lot of data to disk really fast? – Let us maybe rephrase the question: How do we write data to disk in the first place? Let’s assume we have already opened file descriptors in and out, and we just want to copy everything from in to out.

One might be tempted to try something like this:

ssize_t read_write(int in, int out)
{
    ssize_t n, t = 0;
    char buf[1024];
    while((n = read(in, buf, 1024)) > 0) {
        t += write(out, buf, n);
    }
    return t;
}

“But…!”, you cry out, “there’s so much wrong with this!” And you are right, of course:

The return value n is not checked. It might be -1. This might be because e.g. we have got a bad file descriptor, or because the syscall was interrupted.
A call to write(out, buf, 1024) will – if it does not return -1 – write at least one byte, but we have no guarantee that we will actually write all n bytes to disk. So we have to loop the write until we have written n bytes.

An updated and semantically correct pattern reads like this (in a real program you’d have to do real error handling instead of assertions, of course):

ssize_t read_write_bs(int in, int out, ssize_t bs)
{
    ssize_t w = 0, r = 0, t, n, m;

    char *buf = malloc(bs);
    assert(buf != NULL);

    t = filesize(in);

    while(r < t && (n = read(in, buf, bs))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        r = n;
        w = 0;
        while(w < r && (m = write(out, buf + w, (r - w)))) {
            if(m == -1) { assert(errno == EINTR); continue; }
            w += m;
        }
    }

    free(buf);

    return w;
}

We have a total number of bytes to read (t), the number of bytes already read (r), and the number of bytes already written (w). Only when t == r == w are we done (or if the input stream ends prematurely). Error checking is performed so that we restart interrupted syscalls and crash on real errors.

What about the bs parameter? Of course you may have already noticed in the first example that we always copied 1024 bytes. Typically, a block on the file system is 4KB, so we are only writing quarter blocks, which is likely bad for performance. So we’ll try different block sizes and compare the results.

We can find out the file system’s block size like this (as usual, real error handling left out):

ssize_t block_size(int fd)
{
    struct statfs st;
    assert(fstatfs(fd, &st) != -1);
    return (ssize_t) st.f_bsize;
}

OK, let’s do some benchmarks! (Full code is on GitHub.) For simplicity I’ll try things on my laptop computer with Ext3+dmcrypt and an SSD. This is “read a 128MB file and write it out”, repeated for different block sizes, timing each version three times and printing the best time in the first column. In parantheses you’ll see the percentage increase in comparison to the best run of all methods:

read+write 16bs             164ms      191ms      206ms
read+write 256bs            167ms      168ms      187ms  (+ 1.8%)
read+write 4bs              169ms      169ms      177ms  (+ 3.0%)
read+write bs               184ms      191ms      200ms  (+ 12.2%)
read+write 1k               299ms      317ms      329ms  (+ 82.3%)

Mh. Seems like multiples of the FS’s block sizes don’t really matter here. In some runs, the 16x blocksize is best, sometimes it’s the 256x. The only obvious point is that writing only a single block at once is bad, and writing fractions of a block at once is very bad indeed performance-wise.

Now what’s there to improve? “Surely it’s the overhead of using read() to get data,” I hear you saying, “Use mmap() for that!” So we come up with this:

ssize_t mmap_write(int in, int out)
{
    ssize_t w = 0, n;
    size_t len;
    char *p;

    len = filesize(in);
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, in, 0);
    assert(p != NULL);

    while(w < len && (n = write(out, p + w, (len - w)))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        w += n;
    }

    munmap(p, len);

    return w;
}

Admittedly, the pattern is simpler. But, alas, it is even a little bit slower! (YMMV)

read+write 16bs               167ms      171ms      209ms
mmap+write                    186ms      187ms      211ms  (+ 11.4%)

“Surely copying around useless data is hurting performance,” I hear you say, “it’s 2014, use zero-copy already!” – OK. So basically there are two approaches for this on Linux: One cumbersome but rather old and known to work, and then there is the new and shiny sendfile interface.

For the splice approach, since either reader or writer of your splice call must be pipes (and in our case both are regular files), we need to create a pipe solely for the purpose of splicing data from in to the write end of the pipe, and then again splicing that same chunk from the read end to the out fd:

ssize_t pipe_splice(int in, int out)
{
    size_t bs = 65536;
    ssize_t w = 0, r = 0, t, n, m;
    int pipefd[2];
    int flags = SPLICE_F_MOVE | SPLICE_F_MORE;

    assert(pipe(pipefd) != -1);

    t = filesize(in);

    while(r < t && (n = splice(in, NULL, pipefd[1], NULL, bs, flags))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        r += n;
        while(w < r && (m = splice(pipefd[0], NULL, out, NULL, bs, flags))) {
            if(m == -1) { assert(errno == EINTR); continue; }
            w += m;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);

    return w;
}

“This is not true zero copy!”, I hear you cry, and it’s true, the ‘page stealing’ mechanism has been discontinued as of 2007. So what we get is an “in-kernel memory copy”, but at least the file contents don’t cross the kernel/userspace boundary twice unnecessarily (we don’t inspect it anyway, right?).

The sendfile() approach is more immediate and clean:

ssize_t do_sendfile(int in, int out)
{
    ssize_t t = filesize(in);
    off_t ofs = 0;

    while(ofs < t) {
        if(sendfile(out, in, &ofs, t - ofs) == -1) {
            assert(errno == EINTR);
            continue;
        }
    }

    return t;
}

So… do we get an actual performance gain?

sendfile                    159ms      168ms      175ms
pipe+splice                 161ms      162ms      163ms  (+ 1.3%)
read+write 16bs             164ms      165ms      178ms  (+ 3.1%)

“Yes! I knew it!” you say. But I’m lying here. Every time I execute the benchmark, another different approach is the fastest. Sometimes the read/write approach comes in first before the two others. So it seems that this is not really a performance saver, is it? I like the sendfile() semantics, though. But beware:

In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately.

Strangely, sendfile() works on regular files in the default Debian Squeeze Kernel (2.6.32-5) without problems. (Update 2015-01-17: Przemysław Pawełczyk, who in 2011 sent Changli Gao’s patch which re-enables this behaviour to stable@kernel.org for inclusion in Linux 2.6.32, wrote to me explaining how exactly it ended up being backported. If you’re interested, see this excerpt from his email.)

“But,” I hear you saying, “the system has no clue what your intentions are, give it a few hints!” and you are probably right, that shouldn’t hurt:

void advice(int in, int out)
{
    ssize_t t = filesize(in);
    posix_fadvise(in, 0, t, POSIX_FADV_WILLNEED);
    posix_fadvise(in, 0, t, POSIX_FADV_SEQUENTIAL);
}

But since the file is very probably fully cached, the performance is not improved significantly. “BUT you should supply a hint on how much you will write, too!” – And you are right. And this is where the story branches off into two cases: Old and new file systems.

I’ll just tell the kernel that I want to write t bytes to disk now, and please reserve space (I don’t care about a “disk full” that I could catch and act on):

void do_falloc(int in, int out)
{
    ssize_t t = filesize(in);
    posix_fallocate(out, 0, t);
}

I’m using my workstation’s SSD with XFS now (not my laptop any more). Suddenly everything is much faster, so I’ll simply run the benchmarks on a 512MB file so that it actually takes time:

sendfile + advices + falloc            205ms      208ms      208ms
pipe+splice + advices + falloc         207ms      209ms      210ms  (+ 1.0%)
sendfile                               226ms      226ms      229ms  (+ 10.2%)
pipe+splice                            227ms      227ms      231ms  (+ 10.7%)
read+write 16bs + advices + falloc     235ms      240ms      240ms  (+ 14.6%)
read+write 16bs                        258ms      259ms      263ms  (+ 25.9%)

Wow, so this posix_fallocate() thing is a real improvement! It seems reasonable enough, of course: Already the file system can prepare an – if possible contiguous – sequence of blocks in the requested size. But wait! What about Ext3? Back to the laptop:

sendfile                               161ms      171ms      194ms
read+write 16bs                        164ms      174ms      189ms  (+ 1.9%)
pipe+splice                            167ms      170ms      178ms  (+ 3.7%)
read+write 16bs + advices + falloc     224ms      229ms      229ms  (+ 39.1%)
pipe+splice + advices + falloc         229ms      239ms      241ms  (+ 42.2%)
sendfile + advices + falloc            232ms      235ms      249ms  (+ 44.1%)

Bummer. That was unexpected. Why is that? Let’s check strace while we execute this program:

fallocate(1, 0, 0, 134217728)           = -1 EOPNOTSUPP (Operation not supported)
...
pwrite(1, "\0", 1, 4095)                = 1
pwrite(1, "\0", 1, 8191)                = 1
pwrite(1, "\0", 1, 12287)               = 1
pwrite(1, "\0", 1, 16383)               = 1
...

What? Who does this? – Glibc does this! It sees the syscall fail and re-creates the semantics by hand. (Beware, Glibc code follows. Safe to skip if you want to keep your sanity.)

/* Reserve storage for the data of the file associated with FD.  */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
#ifdef __NR_fallocate
# ifndef __ASSUME_FALLOCATE
  if (__glibc_likely (__have_fallocate >= 0))
# endif
    {
      INTERNAL_SYSCALL_DECL (err);
      int res = INTERNAL_SYSCALL (fallocate, err, 6, fd, 0,
                                  __LONG_LONG_PAIR (offset >> 31, offset),
                                  __LONG_LONG_PAIR (len >> 31, len));

      if (! INTERNAL_SYSCALL_ERROR_P (res, err))
        return 0;

# ifndef __ASSUME_FALLOCATE
      if (__glibc_unlikely (INTERNAL_SYSCALL_ERRNO (res, err) == ENOSYS))
        __have_fallocate = -1;
      else
# endif
        if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
          return INTERNAL_SYSCALL_ERRNO (res, err);
    }
#endif

  return internal_fallocate (fd, offset, len);
}

And you guessed it, internal_fallocate() just does a pwrite() on the first byte for every block until the space requirement is fulfilled. This is slowing things down considerably. This is bad. –

“But other people just truncate the file! I saw this!”, you interject, and again you are right.

void enlarge_truncate(int in, int out)
{
    ssize_t t = filesize(in);
    ftruncate(out, t);
}

Indeed the truncate versions work faster on Ext3:

pipe+splice + advices + trunc        157ms      158ms      160ms
read+write 16bs + advices + trunc    158ms      167ms      188ms  (+ 0.6%)
sendfile + advices + trunc           164ms      167ms      181ms  (+ 4.5%)
sendfile                             164ms      171ms      193ms  (+ 4.5%)
pipe+splice                          166ms      167ms      170ms  (+ 5.7%)
read+write 16bs                      178ms      185ms      185ms  (+ 13.4%)

Alas, not on XFS. There, the fallocate() system call is just more performant. (You can also use xfsctl directly for that.) –

And this is where the story ends.

In place of a sweeping conclusion, I’m a little bit disappointed that there seems to be no general semantics to say “I’ll write n bytes now, please be prepared”. Obviously, using posix_fallocate() on Ext3 hurts very much (this may be why cp is not employing it). So I guess the best solution is still something like this:

if(fallocate(out, 0, 0, len) == -1 && errno == EOPNOTSUPP)
    ftruncate(out, len);

Maybe you have another idea how to speed up the writing process? Then drop me an email, please.

Update 2014-05-03: Coming back after a couple of days’ vacation, I found the post was on HackerNews and generated some 23k hits here. I corrected the small mistake in example 2 (as pointed out in the comments – thanks!). – I trust that the diligent reader will have noticed that this is not a complete survey of either I/O hierarchy, file system and/or hard drive performace. It is, as the subtitle should have made clear, a “tale about Linux file write patterns”.

Update 2014-06-09: Sebastian pointed out an error in the mmap write pattern (the write should start at p + w, not at p). Also, the basic read/write pattern contained a subtle error. Tricky business – Thanks!

posted 2014-04-30 ∴ tagged linux and c

Julius Plenz – Blog

So you want to write to a file real fast…

Tagcloud