Optimizing Snabbwall

Asumu Takikawa

In my previous blog post, I wrote about the work I've been doing on Snabbwall, sponsored by the NLnet Foundation. The next milestone in the project was to write some user documentation (this is now done) and to do some benchmarking.

After some initial benchmarking, I found that Snabbwall wasn't performing as well as it could. One of the impressive things about Snabb is that well-engineered apps can achieve line-rate performance on 10gbps NICs. That means the LuaJIT program is processing packets at 10gbps, so if your packets are about 40 bytes (the minimum size of an IPv6 packet), it has only around 30 nanoseconds to spend on each packet.
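
As a rough check on that figure, here's a small C sketch of the arithmetic. The function name ns_per_packet is made up for illustration, and the calculation ignores Ethernet framing overhead, so it only gives a ballpark budget.

```c
#include <assert.h>
#include <stdint.h>

/* Time available per packet, in nanoseconds, on a link running at
 * link_bps bits per second with packets of packet_bytes bytes.
 * Framing overhead (preamble, inter-frame gap) is ignored. */
double ns_per_packet(double link_bps, uint32_t packet_bytes)
{
    assert(link_bps > 0.0);
    double bits = (double)packet_bytes * 8.0;  /* bytes -> bits */
    return bits / link_bps * 1e9;              /* seconds -> nanoseconds */
}
```

For a 10gbps link and 40-byte packets this works out to 32 nanoseconds: 320 bits divided by 10^10 bits per second.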

On the other hand, Snabbwall was clocking about 1gbps or less. This was based on measurements from a simple benchmarking script that uses the packetblaster program to fire a ton of packets at a NIC connected to an instance of Snabbwall. The benchmark output looked like this:

Firewall on 02:00.1 (CPU 3). Packetblaster on 82:00.1.
BENCH (BITTORRENT.pcap, 1 iters, 10 secs)
bytes: 1,396,392,179 packets: 1,736,085 bps: 1,090,257,569.1429
BENCH (rtmp_sample.cap, 1 iters, 10 secs)
bytes: 490,248,129 packets: 1,510,824 bps: 381,488,072.1031

(the bps numbers give the bits per second processed for the run)

For Snabbwall, we hadn't actually set a goal of processing packets at line-rate. And in any case, the performance of the system is limited by the processing speed of nDPI, which handles the actual deep-packet inspection work. But 1gbps is pretty far from line-rate, so I spent a few days looking for some low-hanging performance fruit.

Profiling and exploring traces

Most of the performance issues that I found were pinpointed by using the very helpful LuaJIT profiler and debugging tools. For debugging Snabb performance issues in particular, you can use an invocation like the following:

## dump verbose trace information
$ ./snabb snsh -jv -p program.to.run -f <flags> args

The -jv option prints verbose output from the JIT compiler that gives an overview of the tracing process. In particular, it shows when the trace recorder has to abort.

(see this page for details on LuaJIT's control commands)

In case you're not familiar with how tracing JITs like LuaJIT work, the basic idea is that the VM will run in an interpreted mode by default, and record traces through the program as it executes instructions.

(BTW, if you're wondering what a trace is, it is "a linear sequence of instructions with a single entry point and one or more exit points")

Once the VM finds a hot (i.e., frequently executed) trace that it is capable of compiling and is also worth compiling, the VM compiles the trace and runs the result.

If the compiler can't handle some aspect of the trace, however, it will abort and return to the interpreter. If this happens in hot code, you can get severely degraded performance.

This was what was happening in Snabbwall. Here's an excerpt from the trace info for a Snabbwall run:

[TRACE  83 util.lua:27 -> 72]
[TRACE --- util.lua:58 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE --- (78/1) scanner.lua:110 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE --- (78/1) scanner.lua:110 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE --- (78/1) scanner.lua:110 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE --- (78/1) scanner.lua:110 -- NYI: unsupported C type conversion at scanner.lua:202]
[TRACE 84 (78/1) scanner.lua:110 -- fallback to interpreter]

The source code documentation in the LuaJIT implementation explains what the notation means. What's important for our purposes is that the lines without a trace number which have --- are showing trace aborts where the compiler gave up.

As the comments note, trace aborts are not always a problem because the speed of the interpreter may be sufficient. Presumably more so if the code is not that warm.

In our case, however, these trace aborts are happening in the middle of the packet scanning code in scanner.lua, which is part of the core loop of the firewall. That's a bad sign.

It turns out that the unsupported C type conversion error occurs in some cases when a cdata (the type for LuaJIT's FFI objects) allocation is unsupported. You can see the code that's throwing this error in the LuaJIT implementation here.

The specific line in Snabbwall that is causing the trace to abort in cdata allocation is this one:

key = flow_key_ipv4()

which is allocating a new instance of an FFI data type. The call occurs in a function which is called repeatedly in the scanning loop, so it triggers the allocation issue each time. The data type it's trying to allocate is this one:

struct swall_flow_key_ipv4 {
  uint16_t vlan_id;
  uint8_t  __pad;
  uint8_t  ip_proto;
  uint8_t  lo_addr[4];
  uint8_t  hi_addr[4];
  uint16_t lo_port;
  uint16_t hi_port;
} __attribute__((packed));

Reading the LuaJIT internals a bit reveals that the issue is that an allocation of a struct which has an array field is unsupported in JIT-mode.

To test this hypothesis, here's a small Lua script that you can try out that just allocates a struct with a single array field:

local ffi = require("ffi")

ffi.cdef[[
struct foo {
  uint8_t a[4];
};
]]

for i=1, 1000 do
  local foo = ffi.new("struct foo")
end

Running this with the -jv option yields output like this:

$ luajit -jv cdata-test.lua 
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]
[TRACE --- cdata-test.lua:9 -- NYI: unsupported C type conversion at cdata-test.lua:10]

which is the same error we saw earlier from Snabbwall. For Snabbwall, we can work around this by allocating the swall_flow_key_ipv4 data structure just once in the module. On each loop iteration, we then re-write the fields on the single instance instead of allocating new ones.

This might sound iffy, but as long as the lifetime of this flow key data structure is controlled, it should be ok. In particular, the documented API for Snabbwall doesn't even expose this data structure so we can ensure that an old reference is never read after the fields get overwritten.
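
To make the reuse pattern concrete, here's a minimal sketch written in C rather than Lua (Snabbwall's actual code is Lua and isn't reproduced here). The struct is the one shown above; scratch_key and fill_flow_key are hypothetical names I'm using only for illustration, and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the flow key struct shown earlier. */
struct swall_flow_key_ipv4 {
    uint16_t vlan_id;
    uint8_t  __pad;
    uint8_t  ip_proto;
    uint8_t  lo_addr[4];
    uint8_t  hi_addr[4];
    uint16_t lo_port;
    uint16_t hi_port;
} __attribute__((packed));

/* Allocated once at module level and reused for every packet. */
static struct swall_flow_key_ipv4 scratch_key;

/* Hypothetical helper: overwrite the shared key in place and return it.
 * The result is only meaningful until the next call. */
struct swall_flow_key_ipv4 *fill_flow_key(uint16_t vlan_id, uint8_t ip_proto,
                                          const uint8_t *lo_addr,
                                          const uint8_t *hi_addr,
                                          uint16_t lo_port, uint16_t hi_port)
{
    assert(lo_addr != NULL && hi_addr != NULL);
    scratch_key.vlan_id  = vlan_id;
    scratch_key.__pad    = 0;
    scratch_key.ip_proto = ip_proto;
    memcpy(scratch_key.lo_addr, lo_addr, 4);
    memcpy(scratch_key.hi_addr, hi_addr, 4);
    scratch_key.lo_port  = lo_port;
    scratch_key.hi_port  = hi_port;
    return &scratch_key;
}
```

Calling fill_flow_key twice returns the same pointer, so anything still holding the first result sees the second packet's fields. That's exactly the hazard described above, and the reason the key must not escape the scanner.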

Using some dynasm

Once I optimized the flow key allocation, I saw another trace abort in Snabbwall that was trickier to work around. The relevant trace info is this excerpt here:

[TRACE  78 (71/3) scanner.lua:110 -> 72]
[TRACE --- (77/1) util.lua:34 -- NYI: unsupported C function type at wrap.lua:64]
[TRACE --- (77/1) util.lua:34 -- NYI: unsupported C function type at wrap.lua:64]
[TRACE --- (77/1) util.lua:34 -- NYI: unsupported C function type at wrap.lua:64]
[TRACE --- (77/1) util.lua:34 -- NYI: unsupported C function type at wrap.lua:64]
[TRACE 79 link.lua:45 return]
[TRACE 80 (77/1) util.lua:34 -- fallback to interpreter]

For this case, it wasn't necessary to go read the LuaJIT source code to figure out exactly what was going on (though I suspect the error comes from this line). The module wrap.lua in the nDPI FFI library uses two C functions with the following signatures:

typedef struct { uint16_t master_protocol, protocol; } ndpi_protocol_t;

ndpi_protocol_t ndpi_detection_process_packet (ndpi_detection_module_t *detection_module,
                                               ndpi_flow_t *flow,
                                               const uint8_t *packet,
                                               unsigned short packetlen,
                                               uint64_t current_tick,
                                               ndpi_id_t *src,
                                               ndpi_id_t *dst);

ndpi_protocol_t ndpi_guess_undetected_protocol (ndpi_detection_module_t *detection_module,
                                                uint8_t protocol,
                                                uint32_t src_host, uint16_t src_port,
                                                uint32_t dst_host, uint32_t dst_port);

Note that both functions return a struct by value. If you read the FFI semantics page for LuaJIT closely, you'll see that calls to "C functions with aggregates passed or returned by value" are described as having "suboptimal performance" because they're not compiled.

This is a little tricky to work around without writing some C code. At the C-level, it's easy to write a wrapper that returns the struct data by reference through a pointer argument to avoid the return. Then wrap.lua can allocate its own protocol struct and pass that into the wrapper instead. That's actually the first thing I did in order to test if this approach improves the performance (spoiler: it did).
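
A minimal sketch of that kind of wrapper might look like the following. To keep it compilable without nDPI, detect_by_value is a made-up stand-in with a trivial body and a shortened argument list; it is not the real nDPI function. The wrapper name detect_by_ref is likewise hypothetical. Only the ndpi_protocol_t typedef is taken from the signatures above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t master_protocol, protocol; } ndpi_protocol_t;

/* Stand-in for a library function that returns a struct by value.
 * The body is invented so that the sketch builds on its own. */
ndpi_protocol_t detect_by_value(const uint8_t *packet, unsigned short packetlen)
{
    ndpi_protocol_t result = { 0, 0 };
    if (packet != NULL && packetlen > 0)
        result.protocol = packet[0];
    return result;
}

/* Wrapper: same arguments plus an out-pointer. Nothing is returned by
 * value, so an FFI call to this function involves no aggregate return. */
void detect_by_ref(const uint8_t *packet, unsigned short packetlen,
                   ndpi_protocol_t *out)
{
    assert(out != NULL);
    *out = detect_by_value(packet, packetlen);
}
```

From Lua, the idea would be to allocate one ndpi_protocol_t through the FFI up front, pass it as the last argument on every call, and read the fields back afterwards.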

But using a C wrapper complicates the build process for Snabbwall and introduces some issues with linking. It turns out that dynasm, which came up in a previous blog post, can help us out.

Specifically, instead of using a C wrapper, we can just write what the C wrapper code would do in dynamically-generated assembly code. Generating the code once at run-time lets us avoid any build/linking issues and it's just as fast.

The downside is of course that it's harder to write and debug. I'm not really a seasoned x64 assembly hacker, so it took me a while to grok the ABI docs in order to put it all together.

Here's the wrapper code for the ndpi_detection_process_packet function:

local function gen(Dst)
-- pass the first stack argument onto the original function
| mov rax, [rsp+8]
| push rax

-- call the original function, do stack cleanup
| mov64 rax, orig_f
| call rax
| add rsp, 8

-- at this point, rax and rdx have struct
-- fields in them, which we want to write into
-- the struct pointer (2nd stack arg)
| mov rcx, [rsp+16]
| mov [rcx], rax
| mov [rcx+4], rdx

| ret
end

The code is basically just doing some function call plumbing following the x64 SystemV ABI.

What's going on is that the original C function has 7 arguments and our assembly wrapper is supposed to have 8 (an additional pointer that we'll write through). Under the x64 SystemV ABI, the first six integer or pointer arguments are passed in registers, so the original function receives one argument on the stack and our wrapper receives two.

That means we don't need to modify any registers in this wrapper (since we will immediately call the original function), but we do need to re-push the first stack argument so that it sits where the original function expects it.

We can then call the function and afterwards increment the stack pointer to clean up the stack (the ABI makes the caller responsible for this cleanup).

The rest of the code just writes the two returned struct fields from the registers through the pointer on the stack to the struct contained in Lua-land.

Speedup

With the changes I described in the post, the performance of Snabbwall on the benchmark improved quite a bit. Here are the numbers after implementing the optimizations described above:

Firewall on 02:00.1 (CPU 3). Packetblaster on 82:00.1.
BENCH (BITTORRENT.pcap, 1 iters, 10 secs)
bytes: 5,827,504,913 packets: 7,247,814 bps: 4,539,941,765.8323
BENCH (rtmp_sample.cap, 1 iters, 10 secs)
bytes: 6,099,115,315 packets: 18,793,998 bps: 4,745,353,670.4701

We're still not at line-rate, but at this point the profiler attributes a large portion of the time (23% + 20%) to the two C functions that do the actual packet inspection work:

24%  ipv4_addr_cmp
23%  process_packet
20%  guess_undetected_protocol
 9%  extract_packet_info
 4%  scan_packet
 3%  hash

(there might be some optimization potential in ipv4_addr_cmp, which is in the Lua code)

In doing this optimization work, I was happy to find that the LuaJIT performance tools were very helpful, though I do think there might be an opportunity to put a more solution-oriented interface on them. For example, an optimization coach for LuaJIT could be interesting and useful.