Post

Hunting the zombie GIL: how Cython profiling hooks in the release wheel slowed causalml's parallel Causal Forest

While doing uplift modeling to measure marketing performance, I ran into a performance problem in Uber’s causalml (0.16.0): CausalRandomForestRegressor got slower when I raised n_jobs to parallelize the fit. n_jobs=4 was about 6x slower than the default. The cause was a linetrace directive hardcoded into the tree-building Cython sources, which compiled profiling hooks all the way into the released wheel. This post walks through the symptom, the root cause, the verification, and the fix. The PR has been merged upstream.

The symptom

While comparing several meta-learners (T/S-Learner) with the Causal Forest, only the Causal Forest was unusually slow. Tuning n_jobs to parallelize it gave a result that ran against intuition: the more cores I added, the longer the fit took.

settingfit time
default (n_jobs unset)29.0s
n_jobs=129.5s
n_jobs=4192.6s

PyPI wheel 0.16.0, 50 trees, 12k rows.

There was no error and no warning, just a longer fit. The bug probably survived because you can’t spot anything wrong unless you compare across different parallel core counts.

Root cause

When adding cores makes things slower, it’s a sign that the workers are serializing somewhere. I had to find the point where the parallel workers converge on a single lock.

causalml’s tree building is written in Cython for speed. Five of its .pyx sources (and the matching .pxd headers) had this directive hardcoded:

1
# cython: linetrace=True

linetrace is a tool for measuring test coverage. The problem is that this setting doesn’t stay in the development build; it gets compiled into the released wheel as well. When the directive is on, the C code generated for CPython is built with CYTHON_PROFILE 1, which inserts profiling hooks (__Pyx_TraceCall / __Pyx_TraceSetupAndCall, etc.) into every function. That includes the hot-loop nogil tree-building criterion functions.

Because of these hooks, worker threads that should run with the GIL released re-acquire the GIL inside the hooks and serialize. The more threads there are, the worse the contention, which is why n_jobs=4 was slower than n_jobs=1 (the default).

You can confirm directly that the hooks made it into the released binary by inspecting its symbols:

1
2
3
4
$ nm .../causalml/inference/tree/causal/_criterion.cpython-312-darwin.so \
    | grep -E "TraceSetupAndCall|call_return_trace"
... __Pyx_TraceSetupAndCall ...
... __Pyx_call_return_trace_func ...

Verification and reproduction

To verify, I compared two builds that differed only in the linetrace=True directive, on the same machine, with the same Cython and the same setup.py. The results:

buildprofiling hooks in .son_jobs=1n_jobs=4
linetrace=True present (orig.)yes (2 symbols)29.4s190.9s
linetrace=True removednone4.1s1.7s

Removing the linetrace=True directive cleared the hooks from the binary. The single-threaded fit got about 7x faster, and n_jobs went back to acting as a speedup (n_jobs=4 moved from the slowest row to the fastest). Since only the compilation was changed, the statistical model’s predictions did not change: the downstream T-Learner’s Qini score matched to six decimal places, and tests/test_causal_trees.py still passed.

I also packaged the check as a Docker setup so it can be reproduced easily in an isolated environment. The Dockerfile and the reproduction command are in the issue linked at the top.

The fix

Instead of hardcoding the directive in the sources, I moved it behind an environment variable so it’s opt-in. In setup.py, linetrace is enabled only when CYTHON_TRACE is set:

1
2
3
4
5
import os
linetrace = os.environ.get("CYTHON_TRACE", "0") not in ("0", "", "false", "False")
define_macros = [("CYTHON_TRACE", "1"), ("CYTHON_TRACE_NOGIL", "1")] if linetrace else []
# Extension(..., define_macros=define_macros)
# cythonize(..., compiler_directives={"linetrace": linetrace})

I then removed # cython: linetrace=True from the five .pyx files and their .pxd headers. Coverage builds can still turn instrumentation on with CYTHON_TRACE=1 python setup.py build_ext, while every other release build is no longer instrumented. The profiling tool isn’t gone; it’s just separated out to build time so it doesn’t ride along into every function call.

Wrapping up

What made this bug tricky is that it disguises itself, with no error or warning, as “the model is just heavy.” The clue was the counterintuitive result: raising n_jobs made the fit take longer, not shorter. I went in to use a statistics library and ended up down in the generated C code and the build configuration. It was a fun excuse to actually dig into the Python GIL and wheel compilation/builds, things I’d only heard about until now.

Whether a model gives the right answer matters, and so does whether it gives that answer in time. A slow tool doesn’t get used enough, and a tool that isn’t used enough doesn’t get verified enough either.

This post is licensed under CC BY 4.0 by the author.

Comments powered by Disqus.