Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sysvabi64] Optional add x16, x16, :lo12: &.got.plt[N]] for ld -z now #202

Open
MaskRay opened this issue Apr 11, 2023 · 5 comments
Open

Comments

@MaskRay
Copy link
Contributor

MaskRay commented Apr 11, 2023

The sysvabi64/sysvabi64.rst document contains sample PLT sequences where x16 holds the address of the .got.plt entry. Some rtld implementations, such as glibc and FreeBSD rtld, with lazy PLT resolver support use x16 to compute the PLT entry index.

However, for ld -z now, there is no need to compute the PLT entry index, so the add x16, x16, ... instruction can be removed,
although the performance effect is likely negligible.
There are multiple variants of PLT due to BTI and PAC. Allowing optional add x16, x16, ... will add another dimension and increase the number of PLT types, so there is some debate about whether we should allow this flexibility.
I created the issue on this topic, as @appujee mentioned that the instruction seems unneeded on Android. Also CC @enh-google

Note that changing the PLT sequences may affect certain programs, including disassemblers, profilers, and debuggers. However, in my experience, they are unlikely to check the existence of add x16, x16, ....
If we consider add x16, x16, ... unneeded for -z now, ideally GNU ld and ld.lld can make the change for parity.

@appujee
Copy link

appujee commented Apr 11, 2023

Should we create a similar bug for binutils ?

@MaskRay
Copy link
Contributor Author

MaskRay commented Apr 11, 2023

I think some binutils aarch64 maintainers watch this repository. This issue tracker may be the best place to discuss the possible change. The owners of this repository have a better idea which binutils aarch64 maintainers should be tagged:)

@nsz-arm
Copy link
Contributor

nsz-arm commented Apr 12, 2023

this prevents plt hooking at runtime (not just lazy binding). e.g. glibc ld.so supports LD_PROFILE and LD_AUDIT even with bind-now (then plt got is not readonly). but i think there are external tools that hook plts like ltrace (although i think it only places breakpoints on the plt, but this means it has to know the plt layout). so this is not an obvious change (might need additional elf marking).

@smithp35
Copy link
Contributor

smithp35 commented Apr 12, 2023

From the perspective of just the sysvabi64 document I'd like to write down what the minimal requirements are for the PLT sequences. I think that the two extreme approaches are:

  • A minimalist document the calling conventions for PLT[0] and PLT[N]. No requirements on the instruction sequences, or requirements for uniform size of PLT[N]. This would maximise the freedom for linker implementers, but make it harder for disassembly/debugging tools as they would need to handle more possible implementations.
  • A maximalist documentation of the calling conventions and at least some of the heuristics used by disassemblers/debuggers to recognise PLT entries. This has the opposite properties of the first.

I think our approach so far has tended towards the maximalist as we've provided some dynamic tags such as DT_AARCH64_BTI_PLT and DT_AARCH64_PAC_PLT https://github.com/ARM-software/abi-aa/blob/main/aaelf64/aaelf64.rst#642dynamic-section

I personally tend towards the minimalist approach in the specifications to provide more freedom for implementers who may choose different trade-offs.

I'm wondering if there is a set of properties that we represent with a combination of dynamic tags so that we don't have to keep introducing new ones. For example:

  • DT_AARCH64_PLT_SIZE = <size of each PLT entry, with a special value for variable sized entries>
  • DT_AARCH64_NOLAZY =
  • DT_AARCH64_PAC_PLT = <PLT entry expects .got.plt entries to be signed>
  • DT_AARCH64_OS_PLT = <PLT has OS specific behaviour described by one or more tags between DT_LOOS and DT_HIOS>

I think DT_AARCH64_BTI_PLT may not be required as DT_AARCH64_PLT_SIZE should work for that purpose.

For this particular case we have documented the calling convention for PLT[0], but not PLT[N] which is strongly implied by PLT[0]. I think this could be made explicit by stating that when lazy loading is permitted, ip0 (x16) contains the address of the .got.plt entry corresponding to PLT[N], contents unspecified otherwise.

As an aside there are other possibilities for the PLT entries that could help:

  • For small ELF files where the .got.plt is within 1 MiB of the of the .plt then adr and ldr could be used to save an adrp.
  • For BTI it doesn't have to embedded in the PLT entry itself. There could be a thunk/stub for every indirect branch to a PLT entry that direct branches to the PLT which does not have a BTI. This reduces the amount of BTIs at the expense of a slightly more expensive sequence for indirect calls.

@Wilco1
Copy link
Contributor

Wilco1 commented Apr 12, 2023

A larger issue is that PLTs have become less efficient with the added BTI making them 20 bytes and span multiple fetch blocks. In principle we don't need BTI in PLTs. To reduce PLT uses we could always create function addresses via a GOT load. Canonical PLTs still need a BTI. An extra thunk just containing BTI would add significant overhead, so the thunk needs to load the address from the GOT and branch (making them non-lazy).

It would be feasible to remove the ADD x16 by slightly changing the PLT: the default (unlinked) GOT entry could point to a branch associated with the PLT entry which then branches to PLT[0]. So x16 contains a unique address relating to the PLT entry. The branch can be at the begin/end of the PLT sequence (eg. ADRP/LDR/BR/B) or placed after all PLTs (which would allow using BTI for canonical PLTs).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants