Strange effects with N3IWF handling (major TCP packet loss, kernel XFRM issue?)

Hi,
I’ve set up a small demo with free5gc-compose. As N3IWF clients I’m using either the free5gc n3iwue or the Python-based NWu-Non3GPP-5GC (a bit more flexible). After some fiddling (IP ranges etc.; the default configs are otherwise unaltered regarding the inner workings), the clients get an IP from the DNN pool, and traffic goes over the user plane, into “the internet”, and finds its way back. The tunnel seems to work: a flood ping at full MTU size shows no loss, and manual ssh sessions are fine. But in iperf3 TCP tests the bandwidth is really low and varies a lot (1 to 30 Mbit/s). Wireshark shows a total TCP mess with retransmissions, timeouts, duplicate packets and so on. It makes no difference which N3IWF client I use.

After long debugging sessions on all components in the data path, I found the following:

(preface #1) The N3IWF client essentially builds an IPsec connection to the n3iwf container for the data link. Inside the IPsec packet there is a GRE header that encapsulates the client’s raw IP packet.

(preface #2) The free5gc N3IWF decrypts the IPsec packet via the kernel’s XFRM functions, extracts the contained GRE packet and forwards it to the UPF.

I assumed that the XFRM decryption is transparent. But it is not! Looking at the kernel source for XFRM, the kernel dissects the decrypted payload: it actually looks inside the GRE packet and messes with its content. For TCP, it merges multiple TCP segments into one larger segment (>3 kB). But not always; this depends on timing and presumably also on the TCP PUSH flag. This large TCP segment is then forwarded to the UPF. When the UPF tries to inject the segment into the outgoing network, it gets truncated, and from that location an ICMP “fragmentation needed” message is sent back to the client. But since the client didn’t send this large packet in the first place, and even segments sent with an MTU of 556 bytes can get merged into >2 kB segments in the XFRM path, there is no chance that a sane TCP flow will ever occur with this packet loss.

But the kernel code for XFRM shows that the dissector is not enabled when the totally undocumented GRE_ROUTING bit (0x4000) is set in the GRE header. Since the Python client builds the GRE header manually, it’s easy to set. Result: no more fragmentation or lost packets, and steady, reproducible bandwidths >150 Mbit/s (limited only by Python’s performance).
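
For anyone who wants to try the same workaround, here is a minimal sketch of what setting the bit in a hand-built GRE header could look like. It is not taken from NWu-Non3GPP-5GC: the function names are hypothetical, and it assumes the header carries only the key field (K bit) and an IPv4 payload, with the bit flipped on top and no additional header fields added. The contents of the key are also left to the caller.

```python
import struct

GRE_KEY = 0x2000      # K bit: a 32-bit key field follows the base header
GRE_ROUTING = 0x4000  # the bit discussed above
ETH_P_IP = 0x0800     # GRE protocol type for an IPv4 payload


def build_gre_header(key: int, set_routing_bit: bool = True) -> bytes:
    """Hypothetical helper: pack flags, protocol type and key in network order.

    Assumption: only the flag bit is changed; no extra header fields are
    appended when GRE_ROUTING is set.
    """
    flags = GRE_KEY
    if set_routing_bit:
        flags |= GRE_ROUTING
    return struct.pack("!HHI", flags, ETH_P_IP, key)


def has_routing_bit(header: bytes) -> bool:
    """Check whether the first 16-bit flags word has GRE_ROUTING set."""
    (flags,) = struct.unpack("!H", header[:2])
    return bool(flags & GRE_ROUTING)


def encapsulate(inner_ip_packet: bytes, key: int) -> bytes:
    """Prepend the GRE header to the raw inner IP packet."""
    return build_gre_header(key) + inner_ip_packet
```

With the bit set, the flags word on the wire becomes 0x6000 instead of 0x2000, which is easy to confirm in Wireshark on the decrypted traffic.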

And now I’m lost. How can that work at all? Has that ever worked? Why do I have to set a totally undocumented bit in a network packet header for which even Google has no other hits than the source code references? As far as I can see the bit is also only read inside the kernel, never set. The actual use case for this bit is a total mystery to me and still it solves all issues I have with N3IWF. Unfortunately not for the native n3iwue as it uses kernel GRE…

Is this really a free5gc issue (somewhere in the XFRM setup) or am I just dumb and missing something obvious in the free5gc configs?

Hi @GeorgA,
Sorry for the late reply.

For iperf3 settings, you can refer to the following link, which may be helpful.

Also, thanks for sharing your informative insights!
If you have any questions, feel free to let me know.

Best regards.