EWONTFIX

32-bit x86 Position Independent Code - It's that bad

15 Apr 2015 03:23:06 GMT

Let's start by looking at a simple C function to be compiled as position-independent code (i.e. -fPIC, for use in a shared library):

void bar(void);

void foo(void)
{
    bar();
}

And now, what GCC compiles it to (listing 2):

foo:
    pushl   %ebx
    subl    $24, %esp
    call    __x86.get_pc_thunk.bx
    addl    $_GLOBAL_OFFSET_TABLE_, %ebx
    movl    32(%esp), %eax
    movl    %eax, (%esp)
    call    bar@PLT
    addl    $24, %esp
    popl    %ebx
    ret
__x86.get_pc_thunk.bx:
    movl    (%esp), %ebx
    ret

Wow. What went wrong?

Obviously what we'd like to see is:

foo:
    jmp bar

And that's what we'd get with non-PIC code generation, or with PIC and applying hidden visibility to bar. That ideal form is not attainable with PIC, because the PC-relative address of the callee (bar) may not be fixed at link-time; it may reside in another shared library or the main program.

Readers familiar with the principles of position-independent code and GOTs/PLTs might wonder why we can't achieve this (listing 4):

foo:
    jmp bar@PLT

Here, the symbol@PLT notation in the assembly tells the assembler to generate a special type of relocation, which the linker will use to resolve the relative address in the call instruction to a “procedure linkage table” thunk it generates in output. This thunk exists at a fixed location in the shared library (and thus a fixed relative address from the caller, no matter what address the library is loaded at) and is responsible for loading the actual address of the function and jumping to it.

Here's where things start to go wrong.

In order to be able to jump to the actual function bar, the PLT thunk needs to be able to access global data: in particular, a pointer in the “global offset table” (GOT) which it will use as its jump destination. The PLT thunk code looks like this (listing 5):

bar@PLT:
    jmp *bar@GOT(%ebx)
    push $0
    jmp <PLT slot -1>

The second and third instructions are related to lazy binding (more on that later) but the first one is the one we're interested in right now. Since 32-bit x86 provides no direct way to do memory loads/stores relative to the current instruction pointer, the SysV ABI provides that, when code generated as PIC calls into a PLT thunk, it must pass a pointer to the GOT as a hidden argument in %ebx.

So why does that preclude the code in listing 4 above?

Well, per the ABI, the only call-clobbered registers on x86 are %eax, %ecx, and %edx. The register used for the hidden GOT argument, %ebx, is call-saved. That means foo is responsible for backing up and restoring %ebx if it modifies it. So we have a horrible cascade of inefficiency:

  1. foo has to load %ebx as an argument to bar@PLT.
  2. foo has to save %ebx on the stack and restore it before returning because it's a call-saved register.
  3. The call to bar can't be a tail-call because foo has work to do after bar returns: it has to restore %ebx.

Thus the monstrosity in listing 2.

What can be done to fix it?

Well, a first thought might be to change the register used for the hidden argument, but that can't be done without breaking the ABI contract between the compiler and linker, so it's out.

What about getting rid of the requirement for the hidden argument to bar@PLT? Well, that would also be an ABI change, but at least not an incompatible one. In any case it's not really practical. There's no simple way to make the PLT thunk load the GOT address without clobbering registers, and the only three registers which are call-clobbered are also used for argument-passing in certain non-default but supported "regparm" calling conventions. The choice of %ebx was almost certainly intentional, to avoid clashing with registers that could potentially be used as arguments.

So what's left?

How about getting rid of the PLT thunk entirely? Instead of aiming to generate the code in listing 4, let's aim for this (listing 6):

foo:
    call    __x86.get_pc_thunk.cx
    jmp *bar@GOT+_GLOBAL_OFFSET_TABLE_(%ecx)
__x86.get_pc_thunk.cx:
    movl    (%esp), %ecx
    ret

Not only does this eliminate nearly all of the overhead/bloat in foo; it also eliminates the one-instruction visit to an extra cache line where the PLT thunk resides. Seems like a big win.

Why wasn't this done from the beginning?

Unfortunately, there's a reason, and it's a really bad one.

The original purpose for having a PLT was for the main program, compiled for a fixed load address (think pre-PIE era) and not using position-independent code, to be able to call shared library functions. The type of PLT that appears in the main program does not need the hidden GOT argument in %ebx, because, being at a fixed address, it's free to use absolute addresses for accessing its own GOT. The main program does need the PLT, though, because it's incorporating “legacy” (non-PIC) object files that don't know the functions they're calling could be loaded at varying locations at run-time. (It's the linker that's responsible, when producing a dynamic-linked program from such object files, for generating the appropriate glue in the form of PLT thunks.)

Position-independent code does not need a PLT. As in my example in listing 6, such code can load the target address from the GOT itself and make the indirect call/jump. Rather, the use of a PLT in position-independent shared library code was instituted to exploit another property of having a PLT: lazy binding.

Basics of lazy binding

When lazy binding is used, the dynamic linker gets to skip looking up callee symbol names at load time; the lookup is deferred to the first time the function is called. In theory, this trades runtime overhead, in the form of complexity and major latency at the time of the first call to a function, for a moderate savings in startup-time.

At least, that was the theory a few decades ago when all this machinery was designed.

Nowdays, lazy binding is a huge liability for security, and its performance benefits have also come under question. The biggest problem is that, for lazy binding to work, the GOT must be writable at runtime, and that makes it an attack vector for arbitrary code execution. Modern hardened systems use relro, which protects part or all of the GOT read-only after loading, but GOT slots subject to lazy-binding in the PLT are excluded from this protection. To get significant benefit from the relro link feature, lazy binding must also be disabled, with the following link options:

-Wl,-z,relro -Wl,-z,now

So basically, lazy binding is, or should be considered, deprecated.

Incidentally, musl libc does not support lazy binding at all for these and other reasons.

Lazy binding and the PLT

Remember lines 2 and 3 of the sample PLT thunk in listing 5? Well, the way they work is that bar@GOT(%ebx) initially (prior to lazy binding) contains a pointer to line 2, setup by the dynamic linker. The constant 0 pushed in line 2 is the PLT/GOT slot number, and the code jumped to is a thunk that invokes the code to resolve the lazy binding, using the slot number that was pushed into the stack as its argument.

There's no easy way to achieve the same thing with the code in listing 6; attempting to do so would slow down the caller and require some code duplication at each call site.

So, the reason we don't have efficient x86 PIC function calls is to support an obsolete misfeature.

Fortunately, it's fixable!

If we can (optionally) give up lazy binding, that is.

Alexander Monakov has prepared this simple patch for GCC, which lets you disable PIC calls via PLT, and which probably has a chance of making it upstream:

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 3263656..cd5f246 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5451,7 +5451,8 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
   if (!TARGET_MACHO
       && !TARGET_64BIT
       && flag_pic
-      && (!decl || !targetm.binds_local_p (decl)))
+      && flag_plt
+      && (decl && !targetm.binds_local_p (decl)))
     return false;

   /* If we need to align the outgoing stack, then sibcalling would
@@ -25577,15 +25578,23 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
       /* Static functions and indirect calls don't need the pic register.  */
       if (flag_pic
          && (!TARGET_64BIT
+             || !flag_plt
              || (ix86_cmodel == CM_LARGE_PIC
                  && DEFAULT_ABI != MS_ABI))
          && GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF
          && ! SYMBOL_REF_LOCAL_P (XEXP (fnaddr, 0)))
        {
-         use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
-         if (ix86_use_pseudo_pic_reg ())
-           emit_move_insn (gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM),
-                           pic_offset_table_rtx);
+         if (flag_plt)
+           {
+             use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
+             if (ix86_use_pseudo_pic_reg ())
+               emit_move_insn (gen_rtx_REG (Pmode,
+                                            REAL_PIC_OFFSET_TABLE_REGNUM),
+                               pic_offset_table_rtx);
+           }
+         else
+           fnaddr = gen_rtx_MEM (QImode,
+                                 legitimize_pic_address (XEXP (fnaddr, 0), 0));
        }
     }

diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 301430c..aacc668 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -572,6 +572,10 @@ mprefer-avx128
 Target Report Mask(PREFER_AVX128) SAVE
 Use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.

+mplt
+Target Report Var(flag_plt) Init(0)
+Use PLT for PIC calls (-mno-plt: load the address from GOT at call site)
+
 ;; ISA support

 m32

I've been playing around with some similar changes to my local gcc 4.7.3 tree and was able to achieve the following output:

foo:
    call    __x86.get_pc_thunk.cx
    addl    $_GLOBAL_OFFSET_TABLE_, %ecx
    movl    bar@GOT(%ecx), %eax
    jmp *%eax
__x86.get_pc_thunk.cx:
    movl    (%esp), %ecx
    ret

It's still a ways off from the ideal 2 functions, but much better than the original output.

Compared to listing 6, there are two differences. Loading bar@GOT(%ecx) into %eax to make the indirect call is utterly useless and just bad codegen that's hopefully fixed on newer GCC versions. Failure to combine the constants bar@GOT and _GLOBAL_OFFSET_TABLE_ (which actually resolves to _GLOBAL_OFFSET_TABLE_-.) into a single constant is a more fundamental problem, though. It would take a new relocation type to resolve: a relocation that resolves not to the fixed GOT-base-relative offset of the GOT slot for bar, but rather to the fixed instruction-pointer-relative offset of the GOT slot for bar. Having this new relocation type would make all GOT accesses mildly cheaper.