Connection Pooling and Intermittent Failures in K8s
Intro
One of the aims of this blog is to occasionally publish smaller pieces of content that cover specific problems and troubleshooting situations that we encounter during our own day-to-day work. We believe this will be useful because it can a) help anyone facing the same problem save a bit of time, and b) over time teach our readers investigation methods that can be applied to other similar problems or scenarios.
We have not yet found a good name for this type of content, so in the absence of a better option, we’ll refer to them as “bitesize knowledge”.
In this post, for instance, we walk through troubleshooting a real-life problem involving intermittent inter-pod connection failures in our EKS setup.
So what happened?
Recently, a developer on my team reported intermittent failures with a specific HTTP request flow that involves our services running in AWS EKS. The failures occurred when certain REST requests hit an internal API service, which then called other microservices to gather information and return a response. As the requests were synchronous, any failure in those downstream calls propagated back to the client as an overall failure.
Even more strangely, they noticed a peculiar pattern:
Requests submitted one after another in quick succession completed successfully.
Requests started failing if there was a few seconds of wait time between them.
Requests failed if there was a “context switch”; that is, if the API Service had to call a downstream application different from the one in the previous request.
So, we started looking into it together…
Investigation
I started off as I usually would, by trying to understand the basic business logic, request flow, and components involved in the path as quickly and carefully as possible.
In addition to the information gathered in the previous section, I knew that all the pods have a Service configured in front and that they use Calico network policies for management and whitelisting of traffic. Knowing the latter is generally useful in the context of networking problems but is less relevant for intermittent failures, because issues relating to network policy are usually black or white: either it works or it doesn’t, there are no “in-betweens”.
I also learned that:
The “target” services are mostly data-model systems written in Python and served via Uvicorn, an ASGI web server.
The API Service (“source” app) is written in Java and uses Hikari connection pooling.
When I learned about connection pooling, I had an immediate hunch (from past experience) that it might have something to do with the issue. This was especially plausible considering that the developers were using the “tried and tested” topology described in the previous paragraph, which reduced the likelihood of more generic issues such as firewall or networking misconfiguration.
Before drawing any firm conclusions, however, I wanted to verify this hypothesis. I figured that if it were a genuine networking problem, I should be able to replicate it (at least to a similar extent) using raw TCP and/or HTTP calls.
So I exec’d into the source container(s) and ran telnet and curl calls from there to the K8s Service endpoint associated with the target (note this assumes those packages are installed in one of the container image layers; that’s normally the case for the more widely used base images, but do verify in your case too):
$ kubectl [-n <namespace_name>] exec -it <source_container> -- /bin/bash
// container shell ->
# curl -vvv https://<target_service>.<namespace>.svc
... connection establishment and handshakes etc.
... connected to <svc_name> (IP x.x.x.x) port 443 (#0)
...
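The telnet check is the same idea one layer down: if the TCP connection opens, the transport path is fine regardless of what HTTP does on top. The service name and port below are placeholders for your own setup, and the output shown is roughly what a successful probe looks like:
# telnet <target_service>.<namespace>.svc 443
Trying x.x.x.x...
Connected to <target_service>.<namespace>.svc.
Escape character is '^]'.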
When querying pods within the K8s cluster by hostname, be aware that the hostname rules change depending on the namespace you are in and the namespace you are trying to reach. DNS names are derived from the Service object’s name; a quick kubectl get service will list the Service resources available.
When querying a pod within the same namespace:
ping SERVICE_NAME
When querying a pod in another namespace:
ping SERVICE_NAME.NAMESPACE
If you don’t want to deal with any of this, use the FQDN:
ping SERVICE_NAME.NAMESPACE.svc.cluster.local
You can read more about this in the Kubernetes documentation on DNS for Services and Pods.
I ran quite a few calls simulating the HTTP request pattern and not a single one failed, so raw connectivity was indeed fine. I was now fairly sure this was not a networking problem but an application-level one, tied to the original hypothesis around connection pooling. That said, we still lacked definitive proof, a root cause, and a solution, so the work was far from over.
If you are experiencing connection issues, start by verifying that network-level connectivity is sound. This lets you eliminate unknowns and pinpoint which layer of the OSI model the problem lies in.
Starting from the top (oversimplifying here…), the application layer can be checked using HTTP-based tools like curl. If problems appear there, drop down to the transport layer with telnet, which tells you whether a TCP connection can be opened to a given port at all. If issues persist, evaluate the network (IP) layer using utilities like ping or traceroute to check packet routing and delivery. By diagnosing each layer in turn, you can efficiently isolate and rectify connectivity issues.
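To make that concrete, a rough sequence of checks from inside the source container might look like the below. The hostnames and addresses are placeholders; also note that a virtual ClusterIP may not answer ICMP at all depending on the cluster setup, so ping and traceroute are often more telling against pod or node IPs:
// application layer: full HTTP request/response
# curl -vvv https://<target_service>.<namespace>.svc
// transport layer: can a TCP connection be opened to the port?
# telnet <target_service>.<namespace>.svc 443
// network layer: is the address reachable, and where along the path do packets stop?
# ping <target_pod_ip>
# traceroute <target_pod_ip>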
To establish a higher degree of causality, we decided to track, in real time, the connections on the worker EC2 nodes that housed these services. So I ran the command below to find the names and IPs of the Amazon EC2 instances a couple of these containers were running on, and SSH’d into them:
$ kubectl [-n <namespace_name>] get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xxxx 1/1 Running 0 30h x.x.x.x ip-x-x-x-x.ec2.internal <none> <none>
...
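With the node names from the NODE column in hand, getting onto the instances is a plain SSH session (the username depends on your AMI; ec2-user is typical for Amazon Linux, and AWS SSM Session Manager is an alternative if direct SSH isn’t enabled):
$ ssh ec2-user@ip-x-x-x-x.ec2.internal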
The Linux kernel implements connection tracking, so if you SSH into a Linux-based machine you can use a handy userspace utility called conntrack that displays this information in a readable format (it’s not installed OOTB on most distros, but you can install it with your package manager). It’s best run with elevated privileges:
$ watch -n 0.5 sudo conntrack -L -p tcp --dport 8443
tcp 6 431968 ESTABLISHED src=x.x.x.x dst=x.x.x.x sport=60370 dport=8443 src=x.x.x.x dst=x.x.x.x sport=8443 dport=60370 [ASSURED] use=1
tcp 6 38 TIME_WAIT src=x.x.x.x dst=x.x.x.x sport=40026 dport=8443 src=x.x.x.x dst=x.x.x.x sport=8443 dport=40026 [ASSURED] use=1
...
So we had a watch of conntrack entries and, in parallel, resumed making the HTTP calls via Postman requests as we’d originally done. conntrack provides a point-in-time snapshot of the machine’s connections that match the filters given to it. The details of the various connection states are slightly beyond the scope of this post, but they are well worth reading up on if you want to experiment with conntrack further. Suffice it to say that after paying close attention for a few minutes, we noticed that some of the open connections were being evicted after what felt like almost exactly 5 seconds.
Root cause
Speaking to the team and the owner of the source service, we figured there was no reason why it should behave like this. As we were wrapping up that conversation, the owner of one of the target services raised a question: what if it’s our service? To reiterate, this is a Python service that uses Uvicorn. A few minutes into scanning the docs for a smoking gun around potentially short default timeouts, we came across this:
• Keep-Alive. Defaults to 5 seconds. Between requests, connections must receive new data within this period or be disconnected.
AHA! This might explain it!
So…
the source app opens new connections to the target app, some of which become idle but remain open →
after 5 seconds of idleness, the connection gets evicted by the target →
the connection pool within the source app then tries to reuse that connection entry but fails because it’s already gone.
The attempt made immediately afterwards always works because it opens a fresh connection rather than reusing one. And so on.
We further verified this by increasing the Uvicorn keep-alive from the default to 50 seconds (by tweaking the --timeout-keep-alive parameter) and noticed things started looking a lot better: connections now only got dropped after a wait of about 50 seconds instead of 5, i.e. a pretty strong correlation.
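For reference, the flag is passed when starting Uvicorn. The module and app names below are placeholders, and the port simply mirrors the 8443 seen in the conntrack output above (TLS flags omitted):
$ uvicorn <module>:<app> --port 8443 --timeout-keep-alive 50
The programmatic equivalent is the timeout_keep_alive argument to uvicorn.run().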
Solution
Assigning a high keep-alive timeout in Uvicorn could reduce the frequency of failures, but it does not address the underlying problem, so it cannot be considered an optimal or reliable long-term fix.
Speaking with the team, we came up with the following candidate solutions:
→ Disable connection reuse in the source app’s Hikari pool: we continue to use connection pooling to benefit from the OOTB graceful allocation and closing of connections, but disable the connection-reuse aspect.
pros: this is a config change and therefore likely involves the least amount of effort; it could also “simplify” the flow to some extent.
cons: opening a new connection every time is expensive because it’s I/O bound, hence disabling reuse will come with some performance penalty (roughly 20ms per call).
→ Explicitly evict idle connections at the *source*: make the source application aware of idle connections approaching eviction and have it clean them up so they’re not reused.
pros: a slightly more polished way of dealing with the problem compared to the first option.
cons: additional logic and coupling need to be introduced in the service. Also, this is in theory fairly similar to option 1 and will introduce a performance decrease (though a less noticeable one, because reuse still happens, just not on idle connections).
→ Retry failed connection attempts:
pros: it’s a foolproof protection against connection pooling issues, and it introduces less “state” in the system compared to the previous option, since calls can be made in a fire-and-forget mode and attention is only needed if an issue is hit downstream.
cons: it’s sub-optimal to attempt a connection only for it to fail and have to be retried. Preference should be given to preventing the issue in the first place rather than reacting to it.
The choice of solution really comes down to your specific requirements and the amount of effort involved. In our case, the team decided to go with option 2. There may well have been other, better solutions we did not think of, but this seemed sufficient for our use case.
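Whichever option you go for, one simple way to sanity-check the result is to repeat the conntrack watch from earlier while replaying the problematic request pattern, and confirm that pooled connections are no longer torn down after ~5 seconds of idleness. Assuming the same 8443 port filter as before:
$ watch -n 0.5 "sudo conntrack -L -p tcp --dport 8443 2>/dev/null"
// or count connection states (the 4th field of each entry) to spot early teardowns at a glance
$ sudo conntrack -L -p tcp --dport 8443 2>/dev/null | awk '{print $4}' | sort | uniq -c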
Hope you found this useful! If so, keep an eye out for more!
/ Sam