Debugging a dynamic linking bug in a Nix project
July 17, 2020
Trying out the Nix development experience
The other day, while building a scientific
project to which I’m a contributor, I ran into a nasty version
conflict between two system libraries. In a fit of pique, I decided to
learn enough about Nix to be able to set
up a reproducible, tightly controlled local build. It’s done now, and
overall I’m very happy with the tooling and setup. I’m using direnv to
integrate my normal shell with Nix’s nix-shell feature, and for the most
part everything feels seamless. It is extremely refreshing to see cmake
report that it has found a plethora of binaries and libraries,
content-hashed and installed in neat little rows under /nix/store.
I’m using Nix to manage my development environment, but not
to build the project itself. Nix ensures that the project dependencies
are installed and discoverable by the compiler and linker. The project
itself is built with CMake, which is set up to find the Nix-installed
libraries. Nix achieves this by wrapping the C compiler with its own
shell script and injecting the paths to libraries and binaries via
environment variables. There’s very little to do to make cmake just
work, beyond declaring that the packages you want are buildInputs. The
first version of my shell.nix file looked like this:
# file shell.nix
{ pkgs ? import <nixpkgs> {} }:
pkgs.mkShell {
  buildInputs = with pkgs; [
    cmake
    (callPackage nix/petsc.nix {})
    metis
    hdf5
    openmpi
    (python38.withPackages (packages: [ packages.numpy ]))
  ];
}
Using this setup, I had very little trouble getting the project to build. I had to override the default PETSc derivation to compile with METIS and OpenMPI support, which was not too hard:
# file nix/petsc.nix
{ petsc , blas , gfortran , lapack , python , metis , openmpi }:
petsc.overrideAttrs (oldAttrs: rec {
  nativeBuildInputs = [ blas gfortran gfortran.cc.lib lapack python openmpi metis ];
  preConfigure = ''
    export FC="${gfortran}/bin/gfortran" F77="${gfortran}/bin/gfortran"
    patchShebangs .
    configureFlagsArray=(
      $configureFlagsArray
      "--with-mpi-dir=${openmpi}"
      "--with-metis=${metis}"
      "--with-blas-lib=[${blas}/lib/libblas.so,${gfortran.cc.lib}/lib/libgfortran.a]"
      "--with-lapack-lib=[${lapack}/lib/liblapack.so,${gfortran.cc.lib}/lib/libgfortran.a]"
    )
  '';
})
This Nix file returns a function, which shell.nix invokes using the
callPackage function. petsc.overrideAttrs is a neat way to override the
attributes of a derivation created with stdenv.mkDerivation. Building
PETSc with MPI and METIS support is as simple as passing in a different
set of arguments to the configure script.
Figuring out how to do all of this was fun. I mostly referred to the Nix “Pills”, which are a great progression through the Nix tool and language.
With these Nix files, I was able to execute cmake .. && make
successfully. Getting the project to run was another story. The final
binary failed immediately with a dynamic loading error:
➜ bin/warpxm
dyld: Library not loaded: /private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib
Referenced from: /Users/jack/src/warpxm/build/bin/warpxm
Reason: image not found
The binary was trying to load a dynamic lib from one of the temporary
directories that Nix created in the process of building PETSc. Of course
this failed: by the time I invoked bin/warpxm, that directory had been
cleaned up. Instead of a file under /private/tmp, the binary should have
linked to the result of the petsc derivation in the Nix store, under
/nix/store. At some point, it seemed, an environment variable was
incorrectly set to this intermediate directory. To figure out where, I
would have to learn a lot more about linking on OS X than I ever
expected.
Whither the linker?
First I checked the compiler and linker flags that are inserted by
Nix’s compiler wrapper. These come in via NIX_CFLAGS_COMPILE and
NIX_LDFLAGS. When you’re working with nix-shell and direnv, all of the
environment variables from your derivations are injected into your
shell. It’s a simple matter of echoing them out:
➜ echo $NIX_CFLAGS_COMPILE
... -isystem /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/include ...
➜ echo $NIX_LDFLAGS
... -L/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib ...
These look fine! Invoking cmake and make in this shell ought to pull in
the correct library.
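For a more structured look than echo gives, the -L entries can be pulled out of the flag string. What follows is a minimal Python sketch and was not part of my original debugging; library_search_dirs is a hypothetical helper name, and it assumes the flags are whitespace-separated as in the output above.

```python
import shlex
from typing import List


def library_search_dirs(ldflags: str) -> List[str]:
    """Return the directories named by -L flags, in the order they appear."""
    dirs = []
    tokens = iter(shlex.split(ldflags))
    for token in tokens:
        if token == "-L":
            # Form "-L /some/dir": the directory is the next token, if any.
            directory = next(tokens, None)
            if directory is not None:
                dirs.append(directory)
        elif token.startswith("-L"):
            # Form "-L/some/dir": the directory is fused to the flag.
            dirs.append(token[2:])
    return dirs
```

Inside the Nix shell, calling library_search_dirs(os.environ.get("NIX_LDFLAGS", "")) (after import os) would list every directory the wrapped linker is told to search.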
Then I remembered that this project uses pkg-config to find and pull
together the linked libraries. Frankly, I don’t understand pkg-config
very well, but I do know that in this project it is invoked from inside
of cmake. It searches for libraries according to its own rules, and it
runs after Nix has done its job setting everything up. Therefore, it
circumvents the compiler and linker flags that we just checked.
I happened to have pkg-config installed from before setting up this Nix
environment. Therefore, cmake was able to invoke the system pkg-config
from my user PATH. Perhaps the system version of pkg-config was somehow
finding the wrong library? Indeed, echo $PKG_CONFIG_PATH confirmed that
it was searching a directory under my $HOME. I thought it possible that
some wires got crossed while I was adding dependencies to my Nix
derivation one at a time: configuring pkg-config appropriately might
help.
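One way to see which .pc file a given search path would turn up is to walk the path by hand. This is a rough Python sketch under a simplifying assumption: it only looks at the colon-separated directories in the string it is given, whereas the real pkg-config also consults its built-in default directories. The name find_pc_files is hypothetical.

```python
import os
from typing import List


def find_pc_files(module: str, pkg_config_path: str) -> List[str]:
    """Return every <module>.pc file found along the search path, in search order."""
    hits = []
    for directory in pkg_config_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = os.path.join(directory, module + ".pc")
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits
```

Calling find_pc_files("PETSc", os.environ.get("PKG_CONFIG_PATH", "")) would show whether a stray PETSc.pc under $HOME is listed ahead of the one in the Nix store.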
I referred once again to the Nix wiki page on C projects, which also
has a section on using pkg-config. It seems that including the
pkg-config derivation as a nativeBuildInput will let packages like petsc
append their output paths to the PKG_CONFIG_PATH environment variable. I
did so:
pkgs.mkShell {
  buildInputs = with pkgs; [
    ...
  ];
  nativeBuildInputs = with pkgs; [
    pkg-config
  ];
}
but it didn’t fix the problem. I would have to go deeper and track down where the bad library was being pulled in.
Digging into the cmake documentation and the project’s .cmake files led
me to insert a trio of print statements:
  find_package(PkgConfig REQUIRED)
  pkg_check_modules(PETSC PETSc REQUIRED)
  link_directories(${PETSC_LIBRARY_DIRS})
+ message("petsc libraries: ${PETSC_LIBRARIES}")
+ message("petsc library dirs: ${PETSC_LIBRARY_DIRS}")
+ message("petsc link libraries: ${PETSC_LINK_LIBRARIES}")
  list(APPEND WARPXM_LINK_TARGETS ${PETSC_LIBRARIES})
These printed out three lines in my cmake output:
petsc libraries: petsc
petsc library dirs: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib
petsc link libraries: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib
The last two look good. But the first, just the library name petsc, was
a little too implicit for comfort. It was precisely this variable that
was being appended to the link targets list. At compile time, it would
be up to the linker to find the library petsc, and I wasn’t sure where
it would look. Safer to use the absolute path to the .dylib, like so:
- list(APPEND WARPXM_LINK_TARGETS ${PETSC_LIBRARIES})
+ list(APPEND WARPXM_LINK_TARGETS ${PETSC_LINK_LIBRARIES})
My thinking here was wrong. We can be sure where the linker will look
at compile time: in the paths listed in NIX_LDFLAGS! I wasn’t thinking
clearly about the flow of data in the compilation process.
Changing the link target to the absolute path eased my mind only for
the duration of the next cmake .. && make cycle.
Surely there was no way the linker could screw up now. No arcane library
search involved, just an absolute path, which couldn’t possibly be
misinterpreted…
➜ bin/warpxm
dyld: Library not loaded: /private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib
Referenced from: /Users/jack/src/warpxm/build/bin/warpxm
Reason: image not found
Damn it!
install_name and other depravities
At this point I was absolutely flummoxed. With every fix I attempted,
I grepped vainly for the offending /private/tmp path in my build
directory, and came up empty-handed. I tracked down the final,
irrevocable link options passed to the compiler, tucked away in a
link.txt file in the build tree. They showed incontrovertibly that my
binary was being linked to the correct library:
➜ cat build/src/CMakeFiles/warpxm.dir/link.txt
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ -O3 -DNDEBUG -isysroot ... -L/nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib ...
I had proved nearly to my satisfaction that CMake was doing the right thing with this library, and I was completely out of ideas. Finally, a very lucky Google search led me to the section of the Nix manual describing issues specific to the Darwin (MacOS) platform. It states:
On Darwin, libraries are linked using absolute paths, libraries are resolved by their install_name at link time. Sometimes packages won’t set this correctly causing the library lookups to fail at runtime. This can be fixed by adding extra linker flags or by running install_name_tool -id during the fixupPhase.
This is a very matter-of-fact way of stating something that, when I understood it, flabbergasted me. To the best of my understanding, here’s what happens on MacOS:
- My source code has an include directive, #include <petsc.h> or something like that, which creates a binary interface to be satisfied by the linker.
- At link time, we pass the list of absolute paths to libraries, and the linker finds the one that matches the interface.
- The linker then saves the install_name of the library it found in the binary’s load section.
- At run time, the binary (actually, the MacOS dyld system) loads the library. The install_name is all it has, so it looks there.
I’ve certainly gotten some aspect of this wrong, so I would definitely appreciate hearing from someone who understands it better than me!
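The steps above can be sketched as a toy model. This is plain Python standing in for the real Mach-O machinery, so every name in it (Dylib, Binary, link, load) is hypothetical, and it deliberately ignores things like @rpath and the DYLD_* environment variables. The one point it illustrates is that the linker copies the library's recorded install_name, not the path it was handed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dylib:
    path: str          # where the file actually lives on disk
    install_name: str  # the path the library records for itself


@dataclass
class Binary:
    # Names recorded at link time; the loader knows nothing else.
    load_commands: List[str] = field(default_factory=list)


def link(libraries: List[Dylib]) -> Binary:
    """Link time: record each library's install_name, not its on-disk path."""
    return Binary([lib.install_name for lib in libraries])


def load(binary: Binary, filesystem: Dict[str, Dylib]) -> List[str]:
    """Run time: look up each recorded name and fail if nothing is there."""
    for name in binary.load_commands:
        if name not in filesystem:
            raise FileNotFoundError("Library not loaded: " + name)
    return binary.load_commands
```

In this model, a library stored at a /nix/store path whose install_name still points into /private/tmp links without complaint and then fails in load, which matches the symptom I was seeing.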
In any case, this find pointed me to the concept of the install_name, so I had something to go on. More searching led to a helpful blog post describing exactly the issue that I was facing. It also described how to check the install_name of the library:
➜ otool -D /nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib/libpetsc.dylib
/nix/store/31d3hng4sclxi3sz8g3zi3yqmychj2kg-petsc-3.13.2/lib/libpetsc.dylib:
/private/tmp/nix-build-petsc-3.13.2.drv-0/petsc-3.13.2/arch-darwin-c-debug/lib/libpetsc.3.13.dylib
Gotcha.
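This check is easy to script. Below is a small Python sketch that parses the text printed by otool -D and flags an install_name left pointing at a temporary build directory. It assumes the two-line output format shown above; the function names and the list of suspicious prefixes are my own illustrative choices.

```python
from typing import Tuple


def parse_install_name(otool_output: str) -> str:
    """Extract the install_name from `otool -D` output.

    The first line is '<file>:' and the second line is the install_name.
    """
    lines = otool_output.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("no install_name found in otool output")
    return lines[1].strip()


def points_into_build_dir(
    install_name: str,
    prefixes: Tuple[str, ...] = ("/private/tmp/", "/tmp/"),
) -> bool:
    """Return True if the install_name starts with a temporary-directory prefix."""
    return install_name.startswith(prefixes)
```

The output text could come from subprocess.run(["otool", "-D", path], capture_output=True, text=True).stdout on a machine where otool is available.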
The Nix manual states that “some packages won’t set this correctly”,
and points to the fix, which is to use install_name_tool to change the
install_name of the built library. Is the PETSc derivation on nixpkgs
doing this correctly? I saw that it was doing something with
install_name_tool:
prePatch = ''
  substituteInPlace configure \
    --replace /bin/sh /usr/bin/python
'' + stdenv.lib.optionalString stdenv.isDarwin ''
  substituteInPlace config/install.py \
    --replace /usr/bin/install_name_tool install_name_tool
'';
This directive replaces the appearances of the string
/usr/bin/install_name_tool with just install_name_tool. The reason that
Nix packages do this is to ensure that builds rely on the Nix-built
tools, which are provided in the build shell’s PATH, and not on binaries
in system directories like /usr/bin.
The PR that introduced this substitution indicates that it fixed a
build on Darwin, so there must be some invocation of
/usr/bin/install_name_tool in PETSc. Searching for that in the PETSc
repo leads to this line, which is doing exactly what the Mark’s Logs
post on install_name instructed: it changes the install_name to the
absolute path of the library in its installation directory, using
install_name_tool -id.
if os.path.splitext(dst)[1] == '.dylib' and os.path.isfile('/usr/bin/install_name_tool'):
  [output,err,flg] = self.executeShellCommand("otool -D "+src)
  oldname = output[output.find("\n")+1:]
  installName = oldname.replace(os.path.realpath(self.archDir), self.installDir)
  self.executeShellCommand('/usr/bin/install_name_tool -id ' + installName + ' ' + dst)
According to this, the install_name of the library should have been
repaired by PETSc when the library was built! Except… notice something?
The second condition in the if statement. After the PETSc derivation
runs its prePatch step, that condition will become
and os.path.isfile('install_name_tool'). That will certainly fail:
install_name_tool is not going to be a file in the directory where the
install script is running! The patched config/install.py will silently
skip this step, leaving the install_name of the library as the temporary
directory where it was built!
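The failure mode is easy to reproduce outside of PETSc. The Python sketch below creates a stand-in executable in a temporary directory, puts that directory on PATH, and then asks both os.path.isfile and shutil.which about the bare name. The tool name is a made-up placeholder and the function is purely illustrative.

```python
import os
import shutil
import stat
import tempfile


def demo_isfile_vs_which(tool_name: str = "hypothetical_install_name_tool"):
    """Return (os.path.isfile(name), shutil.which(name)) for a tool that exists only on PATH."""
    with tempfile.TemporaryDirectory() as bin_dir, tempfile.TemporaryDirectory() as work_dir:
        # Create a tiny executable script in bin_dir.
        tool = os.path.join(bin_dir, tool_name)
        with open(tool, "w") as handle:
            handle.write("#!/bin/sh\nexit 0\n")
        os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)

        old_cwd = os.getcwd()
        old_path = os.environ.get("PATH", "")
        os.chdir(work_dir)  # an empty directory, like a build dir without the tool
        os.environ["PATH"] = bin_dir + os.pathsep + old_path
        try:
            # isfile resolves the bare name against the current directory only;
            # which searches the directories listed in PATH.
            return os.path.isfile(tool_name), shutil.which(tool_name)
        finally:
            os.chdir(old_cwd)
            os.environ["PATH"] = old_path
```

The first element of the result is False even though the second is a usable path, which is the situation the patched condition ends up in.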
Luckily, the solution to this problem is not too hard. Instead of the
name of a program on the PATH, we should pass the absolute path to the
program we want to run. This can be done by overriding the prePatch step
like so:
prePatch = ''
  substituteInPlace configure \
    --replace /bin/sh /usr/bin/python
'' + stdenv.lib.optionalString stdenv.isDarwin ''
  substituteInPlace config/install.py \
    --replace /usr/bin/install_name_tool ${darwin.cctools}/bin/install_name_tool
'';
The Nix variable ${darwin.cctools} will expand to the full path of the
built darwin.cctools derivation, which is a directory under /nix/store.
So the patched if statement inside of PETSc’s config/install.py becomes
if os.path.splitext(dst)[1] == '.dylib' and os.path.isfile('/nix/store/1dgdim74d05ypll85vslm8i7kgzq78vw-cctools-port/bin/install_name_tool'):
  # use install_name_tool
and the install_name of the resulting library will be correct. We can
check that with otool -D again:
➜ otool -D /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib
/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.dylib:
/nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.3.13.dylib
Looking much better! And since the error was in a dynamically loaded library, we don’t even have to recompile to check that it’s working:
➜ build git:(master) ✗ DYLD_PRINT_LIBRARIES=1 bin/warpxm
dyld: loaded: /Users/jack/src/warpxm/build/bin/warpxm
dyld: loaded: /nix/store/ni26aaiira47ak60vks1qv4apbkwbg1d-hdf5-1.10.6/lib/libhdf5.103.dylib
dyld: loaded: /nix/store/acsjaw04hrf4rv8gizai7gx1ibq92ksa-zlib-1.2.11/lib/libz.dylib
dyld: loaded: /nix/store/z4f1bq363m0ydmbyncfi2srij8vlsx32-Libsystem-osx-10.12.6/lib/libSystem.B.dylib
dyld: loaded: /nix/store/w23r8kplmfx2xc111cpvmdjwmkwy6ip3-petsc-3.13.2/lib/libpetsc.3.13.dylib
...
That’s more like it.
Epilogue
I spent most of my time debugging this problem without a working
understanding of the different build phases. It should have been clear
to me that neither the CMake nor the pkg-config setups could be the
cause, because at the time that I was invoking cmake, the offending
/private/tmp directory had long vanished. If I had focused exclusively
on the PETSc derivation provided by Nix, I might have homed in on the
install_name_tool patch a little sooner. As it went, I was lucky to find
the note in the Nix manual about Darwin-specific linker problems.
As for Nix, I will absolutely be using more of it. What’s remarkable is how little impact it can have. I am able to use it to manage my environment for this project without impacting the way the other developers manage their environments. Of course, if they asked, I would advocate that they try out Nix, but it’s nice for everyone to be able to do it on their own time.
I’m also looking forward to having my first contribution to Nix!