Skip to content
This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Instrumentation support for PIE & shared libs on x86_64 Linux #192

Closed
wants to merge 914 commits into from

Conversation

vleonen
Copy link

@vleonen vleonen commented Jul 21, 2021

This PR enables instrumentation support for PIE executables, shared libraries, and initial support for statically linked executables on x86_64 Linux systems.
We are using relative jumps in trampoline functions to avoid usage of absolute addresses calculated during build time on the linking stage for initialization/finalization functions and instrumentation calls. These absolute addresses are not valid for randomized addresses for PIE & shared libraries.
Also for correct operation for instrumented shared libraries, we added a set of syscalls and functions for searching a path to the current binary in /proc/self/map_files directory based on the current virtual address and symlinks in this directory. In case/proc/self/map_files directory access is restricted (e.g. in container environment), we added -instrumentation-binpath option to pass information about expected path to an instrumented binary during runtime.

Co-authored-by: Vladislav Khmelevsky [email protected]
Co-authored-by: Elvina Yakubova [email protected]

Fixes #137

Vasily Leonenko,
Advanced Software Technology Lab, Huawei

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Jul 21, 2021
maksfb and others added 28 commits July 22, 2021 17:46
Summary: The flag is no longer used/needed.

(cherry picked from commit ae10b98bfa5c3cb53295e0e79489de2bef1c8dc0)
Summary:
Change our X86 target to use long nops by default. In general,
BOLT does not put nops into the instruction stream that is going to be
executed, since it doesn't align basic blocks, only functions. Since we
rebased BOLT, our relationship with MCAssembler changed because it
stopped using multibyte nops and we never needed to revisit that. But it
makes a difference if we want to mitigate perf issues with the Intel
JCC erratum, since the nops inserted are going to be decoded and
executed. To make MCAssembler emit long nops again, we need to explictly
set mattr (Features) of the X86 target.

(cherry picked from commit f19079b7617a79ff0a98930400f47ba6d6fa13fc)
Summary: Add first bits to disassemble functions from a MachO binary.

(cherry picked from commit 24251682342940f7d560f0b206298fb86499fc5b)
Summary: Add first bits to build CFG.

(cherry picked from commit 36b89229c9897caf30c1d30d0596f7b90a41babf)
Summary: Enable reversing the order of basic blocks.

(cherry picked from commit 46ebeb83b67d2be5e5e3eb8ac6ddc7b42993dc44)
Summary:
(cherry picked from commit 7cd3b8e217f93b85df0090002df438c0fdda009f)
Summary: Add missing override in X86MCPlusBuilder.cpp.

(cherry picked from commit f5735b400d2eb086f4b3e98d83cc21743071ba87)
Summary: The interface is no longer in use.

(cherry picked from commit 460bdaaf5c48910d228d17211548bdc410f89d44)
(cherry picked from commit b392da1fbd055a555ef0fab3f943bbf214ace0ee)
(cherry picked from commit 0290dbaa15546fafa27a9a17240f0da4218299ce)
Summary:
Shrink wrapping has a mode where it will directly move push
pop pairs, instead of replacing them with stores/loads. This is an
ambitious mode that is triggered sometimes, but whenever matching with
a push, it would operate with the assumption that the restoring
instruction was a pop, not a load, otherwise it would assert. Fix this
assertion to bail nicely back to non-pushpop mode (use regular store and
load instructions).

(cherry picked from commit 2ff448dee7461b1b31ae06837e8d859b77985e6b)
(cherry picked from commit d4667d4e3eb21f481d2fd8df30d1be160fa0f646)
Summary: Fix begin decrementing.

(cherry picked from commit 664a2dbdae4741bb4a6048c9bb5de6e262708845)
Summary: The parameter is no longer used.

(cherry picked from commit 7ac00974b07895aeb9ad53f5f417119c8b15c7e5)
Summary: The option is not used. Remove all related code.

(cherry picked from commit ad05832b1e039215512df7a5e652931210963d6a)
Summary: Temporarily mark functions containing data as non-simple.

(cherry picked from commit 0e099b021028f71c447adc3d531b992313d2ae76)
Summary:
1. Uniquify names of local symbols.
2. Handle aliases.

(cherry picked from commit ced7da5591a5e688f4725d872355feae4800bc4d)
Summary:
There is no need to treat the emission of the original `.eh_frame`
section as a special case.

(cherry picked from commit 79adc11ef27b9a15aa417e0d9ae8581a664dd2ab)
Summary:
This is a prerequisite for larger emitter refactoring.

Since .dynamic is read unconditionally, add an error message if the
section is missing, or the size of the section is zero.

(cherry picked from commit 70c0e23504ef8463ef15950690f458affc6ba511)
Summary:
Consolidate code and data emission code in ELF-independent
BinaryEmitter. The high-level interface includes only two
functions emitBinaryContext() and emitFunctionBody() used
by RewriteInstance and BinaryContext respectively.

(cherry picked from commit 55c0e69dfd6ecf65c9d93340347f52e4751210bc)
(cherry picked from commit bd7ce3bc754556a7d4a31bc626f57148ffb89be0)
Summary:
Make ELF symbol table rewriting code more structured. While at it,
remove symbols from non-allocatable sections.

(cherry picked from commit 751a3b565878d16cd124dfddfb07baa3fddb03a4)
Summary:
The version of LLVM that we are based on lacks the support for base
address in DWARF location lists. Add the missing pieces.

(cherry picked from commit 39a52458e17adcdc8e718efbc690f469a5fdc44a)
Summary:
Some functions may have exactly the same code and exception handlers.
However, their action tables could be different leading to mismatching
semantics. We should verify their equivalence while running ICF.

(cherry picked from commit 2449406494a66fbe4995f92747795fc130ae7907)
Summary:
ICF may fold functions in arbitrary order when running multi-threaded.
This is fine in relocation mode as we end up with just one function
holding all function symbols.

However, in non-relocation mode we keep all function bodies, and if we
keep merging profiles in non-deterministic order, we end up with
functions with non deterministic profiles. The fix for non-relocation
mode is to not merge profiles as the factual new profile could be
different from the merged one since both function instances are
potentially callable.

Additionally, emit extra symbols for ICF functions in non-relocation
mode to make it possible to track the folding.

(cherry picked from commit 201427172392b8fe17ac91d09a3b1f9db5a38e26)
Summary:
Too many hash collisions may cause ICF to run slowly.

We used to hash BinaryFunction only looking at instruction opcodes,
ignoring instruction operands. With many almost identical functions,
such approach may lead to long ICF processing time. By including
operands into the hash, we reduce the number of collisions and
improve the runtime often by a factor of 2 or more.

(cherry picked from commit d425f3c272fdf5f15e68d62e6f4880b976ed58e6)
Summary:
Further speedup ICF by applying stricter rules for congruent functions.
While checking symbolic operands in congruent functions, consider
operands congruent only if they are equal or reference functions
with identical hashes, i.e. potentially foldable functions.
Note that jump table operands are handled as a special case.

(cherry picked from commit 03bca75462804d9ab12b64530e55ca376e0fef24)
Summary:
Indirect calls that use RSP to compute the target address would
break in instrumentation mode because we were adding instructions that
changed the stack pointer. Fix this.

(cherry picked from commit 7815cc30c420107cfeee77d42b2acdeb0bca1035)
luoj1 and others added 4 commits July 22, 2021 17:48
…tion

Summary:
Match new direct call generated during ICP to correct pseudo probe

New call is matched to the probes of original call instruction.
Summary:
.stab and .stabstr are special sections containing debugging
information and strings associated with the debugging information.
This commit adds them to the list of debugging sections, so
these sections can be removed for output binary.

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Summary: This shouldn't be upstreamed, it's just for our github presentation page.
Copy link
Contributor

@rafaelauler rafaelauler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I've left some suggestions below, let me know what do you think.

/// and searching for the appropriate link in address range in
/// /proc/self/map_files
static char *getBinaryPath() {
const char dirPath[] = "/proc/self/map_files/";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's follow LLVM coding standards here (see https://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly ) and use DirPath[] instead.

Similarly, target_path should be TargetPath, and so on. The only place in the instrumentation code that we are not following these guidelines is in the name of syscalls, such as __open, but that's a special symbol. So __readlink and __getdents (introduced in another patch in this PR). are good and shouldn't be changed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions :)

Comment on lines 582 to 589
static char target_path[NAME_MAX] = {};
char buf[BUF_SIZE], start[ADDR_SIZE], end[ADDR_SIZE];
char findBuf[NAME_MAX];
char *strcp;
int lenS, lenE;
long nread;
unsigned long curAddr, saddr, eaddr;
struct dirent *d;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads a little bit like C code when we initialize all variables at the start of the function. Because this is a c++ file that follows LLVM style, it makes more sense to push some of these definitions closer to where they are used. For example, LenS declaration can be moved to line 610, LenE declaration can be moved to line 621, etc.

I do understand that this file has very low-level code that looks more like systems programming (where C is very popular) and less like compiler code, but even LLVM has some low-level libraries that still follows strictly C++ styles, example: https://github.com/facebookincubator/BOLT/blob/main/llvm/lib/Support/Unix/Program.inc

Comment on lines 610 to 626
lenS = 0, lenE = 0;
while (*strcp != '-') {
if (*strcp == '\0') {
lenS = 0;
break;
}
*strcp++;
lenS++;
}
if (lenS == 0)
continue;
*strcp++;
while (*strcp != '\0') {
*strcp++;
lenE++;
}
assert(lenS + lenE < ADDR_SIZE, "file name too long");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code that does parsing gets complicated and hard to read because it usually needs to keep several pointers to the parsing buffer. To avoid this from distracting us from the main purpose of this function, what do you think about refactoring this in a "parseAddressRange()" helper function or a lambda? We can write it in such a way that it returns StartAddress and EndAddress as integers directly, and then add a comment that it is parsing <hex1>-<hex2>. I always find it helpful to add comments in parsing functions to make it clear what is the format that the code is trying to read

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will do. Thank you

assert(static_cast<int64_t>(FDdir) > 0,
"failed to open /proc/self/map_files");

for (;;) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this for (;;) here, I don't think we need to re-execute this loop body, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an example at the end of the page https://man7.org/linux/man-pages/man2/getdents.2.html and it seems like we need this loop (getdents can read to the buf only BUF_SIZE elements at once), otherwise we might lose some entries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed the line

     if (nread == 0)
      break;

with the exit condition. So it reads until nread is zero. Then perhaps this outer loop could be written as:

while (nread = __getdents(FDdir, (struct dirent *)buf, BUF_SIZE) ) {
}

To make the exit condition more explicit

Comment on lines 569 to 571
#define BUF_SIZE 1024
#define NAME_MAX 256
#define ADDR_SIZE 32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put these as const integers inside getBinaryPath():

const uint32_t BufSize = 1024;
const uint32_t NameMax = 256;
const uint32_t AddrSize = 32;

lenS = __readlink(findBuf, target_path, sizeof(target_path));
assert(lenS != -1 && lenS != BUF_SIZE, "readlink error");
target_path[lenS] = '\0';
goto end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goto here shouldn't be necessary once we remove the outer loop. It will be just one loop checking all entries, it will be something like:


for ( entries start... end boundaries...) {
  uint64_t StartAddress, EndAddress;
  if (!parseAddressRange(StartAddress, EndAddress))
    continue;
  if (CurAddr < StartAddress || CurAddr > EndAddress)
    continue;
  const *C = strCopy(FindBuf, DirPath, NameMaX]);
  C = strCopy(C, d->d_name, NameMax - (C - FindBuf)]);
  *C = '\0';
  uint32_t Ret = __readlink(FindBuf, TargetPath, sizeof(TargetPath));
  assert(Ret != -1 && Ret != BufSize, "readlink error");
  TargetPath[Ret] = '\0';
  return TargetPath;
}
return nulptr;
 

@@ -264,6 +264,23 @@ void reportNumber(const char *Msg, uint64_t Num, uint32_t Base) {

void report(const char *Msg) { __write(2, Msg, strLen(Msg)); }

unsigned long hexToLong(const char *str) {
unsigned long res = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

res -> Res
str -> Str

@@ -580,6 +580,8 @@ void RewriteInstance::discoverStorage() {
Phdr.p_filesz,
Phdr.p_align};
}
if (Phdr.p_type == ELF::PT_INTERP)
BC->hasInterpHeader = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it would be nice to refactor a bit:

  for (const ELF64LE::Phdr &Phdr : PHs) {
    if (Phdr.p_type == ELF::PT_INTERP) {
      BC->hasInterpHeader = true;
      continue;
    }
    if (Phdr.p_type != ELF::PT_LOAD)
      continue;
    BC->FirstAllocAddress = std::min(BC->FirstAllocAddress,
                                     static_cast<uint64_t>(Phdr.p_vaddr));
    NextAvailableAddress = std::max(NextAvailableAddress,
                                    Phdr.p_vaddr + Phdr.p_memsz);
    NextAvailableOffset = std::max(NextAvailableOffset,
                                   Phdr.p_offset + Phdr.p_filesz);
    BC->SegmentMapInfo[Phdr.p_vaddr] = SegmentInfo{Phdr.p_vaddr,
                                                   Phdr.p_memsz,
                                                   Phdr.p_offset,
                                                   Phdr.p_filesz,
                                                   Phdr.p_align};
  }

or a switch-case.

@@ -535,6 +535,9 @@ class BinaryContext {
/// linked.
bool IsStaticExecutable{false};

/// Set to true if the binary contains PT_INTERP header.
bool hasInterpHeader{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize:

HasInterpHeader

@@ -651,6 +651,12 @@ void Instrumentation::createAuxiliaryFunctions(BinaryContext &BC) {
createSimpleFunction(
"__bolt_fini_trampoline",
BC.MIB->createSymbolTrampoline(FiniSym, BC.Ctx.get()));
} else {
// Create dummy return function for tranpoline to avoid issues
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tranpoline -> trampoline

Copy link
Contributor

@rafaelauler rafaelauler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I finished reviewing the most important parts of this PR. Let me know what you think about the comments and thanks a lot for working on this, that was a great job.

createInstrumentedIndCallTrampoline(const MCSymbol *InstrTrampoline,
const MCSymbol *IndCallHandler,
MCContext *Ctx) const override {
const MCPhysReg TempReg = getIntArgRegister(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here with the assembly code being generated by this function? In the same way as lines 3247-3260 for createInstrumentedTailCallHandler(), so it's easier to read. It is a bit distracting to read and visualize the assembly code that it is trying to emit because the function needs to emplace_back each instr.

MCSymbol *IndCallHandlerFunc;
MCSymbol *IndTailCallHandlerFunc;
/// Pointer to runtime instrumentation handler
MCSymbol *IndCallInstrPointer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "Instr" here confused me a bit because I was associating with "Instr" being "Instruction". Perhaps we can rename to IndCallCounterFuncPtr? (since it is basically just updating a counter)

Summary->InitialIndCallHandlerFunction =
createSimpleFunction("__bolt_instr_default_ind_call_handler",
BC.MIB->createInstrumentedNoopIndCallHandler());
BinaryFunction *IndCallHandler =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to understand what was happening here because we are not creating two functions but the real intent is to create two basic blocks of the same function that is coordinating entry to the code that performs instrumentation of indirect calls.

Can we rename this variable to IndCallHandlerExitBB, and add a comment explaning? Also, the methods "createInstrumentedIndCallHandler" and "createInstrumentedIndCallTrampoline" could be renamed to "createInstrumentedIndCalllHandlerExitBB" and "createInstrumentedIndCallHandlerEntryBB", respectively, so it's extra clear what we are doing here.

Regarding the comment explaining, I was thinking in something like:

// Here IndCallHandlerExitBB has the instructions to finish handling traffic
// to an indirect call. We pass it to createInstrumentedIndCallHandlerEntryBB(), which
// will decide between jumping to code in the runtime library (to account traffic)
// if the pointer is set, and then finish by jumping to the exit bb, which will return
// normal program execution.

MCSymbol *IndTailCallHandlerFunc;
/// Pointer to runtime instrumentation handler
MCSymbol *IndCallInstrPointer;
MCSymbol *IndTailCallInstrPointer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Vasily Leonenko and others added 13 commits July 30, 2021 20:35
Summary:
This commit implements new method for _start & _fini functions hooking
which allows to use relative jumps for future PIE & .so library support.
Instead of using absolute address of _start & _fini functions known on
linking stage - we'll use dynamically created trampoline functions and
use corresponding symbols in instrumentation runtime library.

As we would like to use instrumentation for dynamically loaded binaries
(with PIE & .so), thus we need to compile instrumentation library with
"-fPIC" flag to support relative address resolution for functions and
data.

For shared libraries we need to handle initialization of instrumentation
library case by using DT_INIT section entry point.

Also this commit adds detection if the binary is executable or shared
library based on existence of PT_INTERP header. In case of shared
library we save information about real library init function address
for further usage for instrumentation library init trampoline function
creation and also update DT_INIT to point instrumentation library init
function.

Functions called from init/fini functions should be called with forced
stack alignment to avoid issues with instructions which relies on it.
E.g. optimized string operations.

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Summary:
This commit adds support for getting directory entries and
reading value of a symbolic link in instrumentation runtime library

Elvina Yakubova,
Advanced Software Technology Lab, Huawei
…lf/map_files

Summary:
This commit adds support for opening libs based on links /proc/self/map_files.
For this we're getting current virtual address and searching the lib in the
directory with such address range. After that, we're getting full path to the
binary by using readlink function. Direct read from link in /proc/self/map_files
entries is not possible because of lack of permissions.

Elvina Yakubova,
Advanced Software Technology Lab, Huawei
Summary:
This commit introduces static binaries instrumentation support.
Note that current implementation does not support profile output
on the instrumented binary finalization. So it requires to use
-instrumentation-sleep-time=N (N>0) option usage.
Note: There is unhandled case with static PIE executable which
might have dynamic header.

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Summary:
This commit fixes runtime instrumentation handlers for PIE
binaries case.

Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Summary:
This commit replaces IsStaticExecutable with !hasDynamicHeader to improve
actual meaning of this value and corresponding checks. Also this commit
introduces isExecutable() wrapper function for further handling static-pie
executable case during it's enabling.

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Vasily Leonenko,
Advanced Software Technology Lab, Huawei
Summary:
This commit introduces -instrumentation-binpath argument used to point
instumented binary in runtime in case if /proc/self/map_files path
is not accessible due to access restriction issues.

Vasily Leonenko
Advanced Software Technology Lab, Huawei
The trampolines are no loger pointers to the functions.
For propper name resolving by bolt use extern "C" for
all external symbols in instr.cpp

Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
To avoid RELATIVE relocations avoid using of GOT table
by using hidden visibility for all symbols in library.

Vladislav Khmelevsky,
Advanced Software Technology Lab, Huawei
Summary:
This commit adds dummy tests for checking instrumentation support for
PIE executables and shared libraries.

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
This is a temporary fix for failing tests on newer gcc versions.
Note that older gcc versions doesnot support -fcf-protection
option, so this commit will cause failure on these versions.

Related issues:
facebookarchive#117
facebookarchive#181

Vasily Leonenko,
Advanced Software Technology Lab, Huawei
@vleonen vleonen force-pushed the inst-pie-shlib-x86 branch from 9622680 to b4714ba Compare August 2, 2021 13:11
@vleonen
Copy link
Author

vleonen commented Aug 2, 2021

@rafaelauler Thank you for reviewing it. I just have pushed a version updated according to your comments and rebased on top of current main branch. Please have a look at it.

@rafaelauler
Copy link
Contributor

Thanks a lot folks!

@rafaelauler rafaelauler closed this Aug 5, 2021
@vleonen
Copy link
Author

vleonen commented Aug 5, 2021

@rafaelauler Thank you!
I just found that readme file in root and in bolt folder become duplicated some time ago. And after my commit they become different (this PR patch 1437d30 modifies bolt/README.md). Please give me know if I need to prepare one more PR to sync them or do you plan to remove bolt/README.md to avoid further duplication and uncertainly?

@rafaelauler
Copy link
Contributor

Oh, good catch! Let me fix this.

@vleonen vleonen deleted the inst-pie-shlib-x86 branch August 7, 2021 08:20
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
CLA Signed Do not delete this pull request or issue due to inactivity.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using Bolt in dynamic libary, occur Segmentation fault