Let's start by looking at a simple C function to be compiled as position-independent code (i.e. with -fPIC, for use in a shared library):
void bar(void);
void foo(void)
{
bar();
}
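For reference, output along the lines of listing 2 can be reproduced with something like gcc -m32 -O2 -fPIC -S foo.c, though the exact instruction sequence varies with the GCC version and options used.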
And now, what GCC compiles it to (listing 2):
foo:
pushl %ebx
subl $24, %esp
call __x86.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_, %ebx
call bar@PLT
addl $24, %esp
popl %ebx
ret
__x86.get_pc_thunk.bx:
movl (%esp), %ebx
ret
Obviously what we'd like to see is (listing 3):
foo:
jmp bar
And that's what we'd get with non-PIC code generation, or with PIC and hidden visibility applied to bar. That ideal form is not attainable with PIC, because the PC-relative address of the callee (bar) may not be fixed at link time; it may reside in another shared library or in the main program.
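For instance (using GCC's visibility attribute), declaring bar as hidden tells the compiler that the callee is defined in the same shared object, so it can emit the direct jump even under -fPIC:

/* Hidden visibility promises bar is defined within this same shared
 * object and cannot be interposed, so -fPIC code may jump to it
 * directly rather than going through the GOT/PLT. */
__attribute__((visibility("hidden"))) void bar(void);

void foo(void)
{
	bar();
}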
Readers familiar with the principles of position-independent code and GOTs/PLTs might wonder why we can't achieve this (listing 4):
foo:
jmp bar@PLT
Here, the symbol@PLT notation in the assembly tells the
assembler to generate a special type of relocation, which the linker
will use to resolve the relative address in the call instruction to a
“procedure linkage table” thunk it generates in output. This thunk
exists at a fixed location in the shared library (and thus a fixed
relative address from the caller, no matter what address the library
is loaded at) and is responsible for loading the actual address of the
function and jumping to it.
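(Concretely, assembling a call or jump written as bar@PLT and running readelf -r on the resulting object file should show an R_386_PLT32 relocation against bar.)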
In order to be able to jump to the actual function bar, the PLT
thunk needs to be able to access global data: in particular, a pointer
in the “global offset table” (GOT) which it will use as its jump
destination. The PLT thunk code looks like this (listing 5):
bar@PLT:
jmp *bar@GOT(%ebx)
push $0
jmp <PLT slot -1>
The second and third instructions are related to lazy binding (more on
that later) but the first one is the one we're interested in right
now. Since 32-bit x86 provides no direct way to do memory loads/stores
relative to the current instruction pointer, the SysV ABI provides
that, when code generated as PIC calls into a PLT thunk, it must pass
a pointer to the GOT as a hidden argument in %ebx.
So why does that preclude the code in listing 4 above?
Well, per the ABI, the only call-clobbered registers on x86 are %eax, %ecx, and %edx. The register used for the hidden GOT argument, %ebx, is call-saved. That means foo is responsible for backing up and restoring %ebx if it modifies it. So we have a horrible cascade of inefficiency:

1. foo has to load %ebx as an argument to bar@PLT.
2. foo has to save %ebx on the stack and restore it before returning, because it's a call-saved register.
3. The call to bar can't be a tail call, because foo has work to do after bar returns: it has to restore %ebx.

Thus the monstrosity in listing 2.
Well, a first thought might be to change the register used for the hidden argument, but that can't be done without breaking the ABI contract between the compiler and linker, so it's out.
What about getting rid of the requirement for the hidden argument to bar@PLT? Well, that would also be an ABI change, but at least not an
incompatible one. In any case it's not really practical. There's no
simple way to make the PLT thunk load the GOT address without
clobbering registers, and the only three registers which are
call-clobbered are also used for argument-passing in certain
non-default but supported "regparm" calling conventions. The choice of %ebx was almost certainly intentional, to avoid clashing with registers that could potentially be used as arguments.
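For illustration (this example is not part of the original code), GCC's regparm convention shows the clash: with regparm(3), the first three integer arguments are passed in %eax, %edx, and %ecx, exactly the call-clobbered registers a PLT thunk might otherwise have been tempted to use:

/* Illustration: under regparm(3), a, b and c arrive in %eax, %edx and
 * %ecx respectively, so a PLT thunk that clobbered any call-clobbered
 * register would corrupt arguments for callers using this convention. */
int __attribute__((regparm(3))) sum3(int a, int b, int c)
{
	return a + b + c;
}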
So what's left?
How about getting rid of the PLT thunk entirely? Instead of aiming to generate the code in listing 4, let's aim for this (listing 6):
foo:
call __x86.get_pc_thunk.cx
jmp *bar@GOT+_GLOBAL_OFFSET_TABLE_(%ecx)
__x86.get_pc_thunk.cx:
movl (%esp), %ecx
ret
Not only does this eliminate nearly all of the overhead/bloat in foo; it also eliminates the one-instruction visit to an extra cache line where the PLT thunk resides. Seems like a big win.
So why doesn't GCC generate code like this? Unfortunately, there's a reason, and it's a really bad one.
The original purpose for having a PLT was for the main program,
compiled for a fixed load address (think pre-PIE era) and not using
position-independent code, to be able to call shared library
functions. The type of PLT that appears in the main program does not need the hidden GOT argument in %ebx, because, being at a fixed
address, it's free to use absolute addresses for accessing its own
GOT. The main program does need the PLT, though, because it's
incorporating “legacy” (non-PIC) object files that don't know the
functions they're calling could be loaded at varying locations at
run-time. (It's the linker that's responsible, when producing a
dynamic-linked program from such object files, for generating the
appropriate glue in the form of PLT thunks.)
Position-independent code does not need a PLT. As in my example in listing 6, such code can load the target address from the GOT itself and make the indirect call/jump. Rather, the use of a PLT in position-independent shared library code was instituted to exploit another property of having a PLT: lazy binding.
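To make that concrete, here's a rough C-level picture of what a PLT-less call amounts to; this is an illustration only (the function name is made up), not what the compiler literally emits:

/* Illustration only: under -fPIC, taking the address of an external
 * function loads it from the GOT (filled in by the dynamic linker),
 * and the call through the pointer is a plain indirect call with no
 * PLT thunk involved. */
extern void bar(void);

void foo_indirect(void)
{
	void (*p)(void) = bar;  /* address loaded from the GOT */
	p();                    /* indirect call, no PLT */
}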
When lazy binding is used, the dynamic linker gets to skip looking up callee symbol names at load time; each lookup is deferred until the first time the function is called. In theory, this accepts run-time overhead, in the form of extra complexity and significant latency at the first call to each function, in exchange for a moderate savings in startup time.
At least, that was the theory a few decades ago when all this machinery was designed.
Nowadays, lazy binding is a huge liability for security, and its performance benefits have also come under question. The biggest problem is that, for lazy binding to work, the GOT must be writable at runtime, and that makes it an attack vector for arbitrary code execution. Modern hardened systems use relro, which makes part or all of the GOT read-only after loading, but GOT slots subject to lazy binding via the PLT are excluded from this protection. To get significant benefit from the relro link feature, lazy binding must also be disabled, with the following link options:
-Wl,-z,relro -Wl,-z,now
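(For example, a shared library might be built with something like gcc -shared -fPIC -o libfoo.so foo.c -Wl,-z,relro -Wl,-z,now; the file names here are just for illustration.)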
So basically, lazy binding is, or should be considered, deprecated.
Incidentally, musl libc does not support lazy binding at all for these and other reasons.
Remember lines 2 and 3 of the sample PLT thunk in listing 5? Well, the way they work is that bar@GOT(%ebx) initially (prior to lazy binding) contains a pointer to line 2, set up by the dynamic linker. The constant 0 pushed in line 2 is the PLT/GOT slot number, and the code jumped to in line 3 is a thunk that invokes the code to resolve the lazy binding, using the slot number that was pushed onto the stack as its argument.
There's no easy way to achieve the same thing with the code in listing 6; attempting to do so would slow down the caller and require some code duplication at each call site.
So, the reason we don't have efficient x86 PIC function calls is to support an obsolete misfeature. Efficient calls are entirely possible if we can (optionally) give up lazy binding.
Alexander Monakov has prepared this simple patch for GCC, which lets you disable PIC calls via PLT, and which probably has a chance of making it upstream:
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 3263656..cd5f246 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5451,7 +5451,8 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
if (!TARGET_MACHO
&& !TARGET_64BIT
&& flag_pic
- && (!decl || !targetm.binds_local_p (decl)))
+ && flag_plt
+ && (decl && !targetm.binds_local_p (decl)))
return false;
/* If we need to align the outgoing stack, then sibcalling would
@@ -25577,15 +25578,23 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
/* Static functions and indirect calls don't need the pic register. */
if (flag_pic
&& (!TARGET_64BIT
+ || !flag_plt
|| (ix86_cmodel == CM_LARGE_PIC
&& DEFAULT_ABI != MS_ABI))
&& GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF
&& ! SYMBOL_REF_LOCAL_P (XEXP (fnaddr, 0)))
{
- use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
- if (ix86_use_pseudo_pic_reg ())
- emit_move_insn (gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM),
- pic_offset_table_rtx);
+ if (flag_plt)
+ {
+ use_reg (&use, gen_rtx_REG (Pmode, REAL_PIC_OFFSET_TABLE_REGNUM));
+ if (ix86_use_pseudo_pic_reg ())
+ emit_move_insn (gen_rtx_REG (Pmode,
+ REAL_PIC_OFFSET_TABLE_REGNUM),
+ pic_offset_table_rtx);
+ }
+ else
+ fnaddr = gen_rtx_MEM (QImode,
+ legitimize_pic_address (XEXP (fnaddr, 0), 0));
}
}
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 301430c..aacc668 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -572,6 +572,10 @@ mprefer-avx128
Target Report Mask(PREFER_AVX128) SAVE
Use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.
+mplt
+Target Report Var(flag_plt) Init(0)
+Use PLT for PIC calls (-mno-plt: load the address from GOT at call site)
+
;; ISA support
m32
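With a patch along these lines applied, the new code generation would be requested with something like gcc -m32 -O2 -fPIC -mno-plt -S foo.c; the -mno-plt spelling is the negative form of the mplt option added above, not something present in stock GCC of this vintage.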
I've been playing around with some similar changes to my local gcc 4.7.3 tree and was able to achieve the following output:
foo:
call __x86.get_pc_thunk.cx
addl $_GLOBAL_OFFSET_TABLE_, %ecx
movl bar@GOT(%ecx), %eax
jmp *%eax
__x86.get_pc_thunk.cx:
movl (%esp), %ecx
ret
It's still a ways off from the two-instruction ideal of listing 6, but much better than the original output.
Compared to listing 6, there are two differences. Loading bar@GOT(%ecx) into %eax to make the indirect call is utterly useless and just bad codegen that's hopefully fixed in newer GCC versions. Failure to combine the constants bar@GOT and _GLOBAL_OFFSET_TABLE_ (which actually resolves to _GLOBAL_OFFSET_TABLE_-.) into a single constant is a more fundamental problem, though. Resolving it would take a new relocation type: one that resolves not to the fixed GOT-base-relative offset of the GOT slot for bar, but rather to the fixed instruction-pointer-relative offset of the GOT slot for bar. Having this new relocation type would make all GOT accesses mildly cheaper.