
11.1: Debugging Applications in Kubernetes

Transcript:

In the next module of the course, I want to take a look at how to debug applications running on Kubernetes when something is going wrong. Specifically, I want to focus on issues at the Kubernetes layer that would prevent your applications from running or cause them to crash, rather than issues at the application layer.

There is an awesome visual flowchart guide published by learnk8s.io. If you go to this URL, you can find it there, along with an accompanying blog post. Here's the full-resolution version of that. You're almost always going to start with a kubectl get pods command. This gives you a high-level overview of what pods exist in the cluster and what state they are in. Then, based on their state, you follow the path and try to figure out what exactly is going on. By looking at the logs, you might see that there's an application issue. By looking at the description, you can see all the events that the cluster has recorded about the pod. If your pod appears to be ready but is not working, you can port-forward to that pod and access it directly from your system. And so while I don't use this flowchart directly anymore and generally just know these commands, it is a great starting point when you're debugging an application for the first time and are not sure what to do next.

To use as an example for trying this out, I'm going to deploy an example application from Google called the microservices demo. It is a pretend online shop made up of a number of different microservices. As you can see here, there's one for checkout, one for currency, one for sending emails, one for generating load, et cetera. I have purposefully taken their configurations and made a handful of changes such that when you try to deploy it, it will break. Let's go ahead and deploy that broken version of the microservices demo into our cluster and then walk through figuring out what specifically is going on and how to fix it.

Let's just use the Civo cluster for this one; any of the other clusters would be fine. Then we'll navigate to our debugging module. We'll start by creating a namespace, and then we're going to install our microservices demo. To do that, I'm just going to run a kubectl apply -f on the microservices demo YAML file in the subdirectory. There's a note at the top which references both the original project from Google Cloud Platform as well as the fork that I made where I introduced these breaks.

If we look at all the different services in our cluster, we can see that the microservices demo created a number of services, including this LoadBalancer service here. Let me just access that from the browser and see what's going on. OK, so we're able to use the load balancer to access something, but it's surfacing this 500 Internal Server Error. It looks like it's trying to connect to one of the services; in this case, "could not retrieve currency." So maybe it's the currency service that failed, and this frontend is bubbling that application error back up to us.

Let's check the status of the pods. This is also a good opportunity to showcase a tool called k9s, a text-based user interface (TUI) for Kubernetes. If I just run k9s, it starts the TUI and I can navigate around via my keyboard. So rather than having to issue a whole bunch of kubectl get pods, kubectl describe pod, et cetera, I can just run k9s and navigate around accordingly.
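For reference, the flowchart's core checks map to a handful of kubectl commands. This is a minimal sketch; the namespace, pod name, and port below are placeholders rather than values from the lesson:

```bash
NS=debug-demo                     # placeholder namespace
POD=currencyservice-xxxxx-yyyyy   # placeholder pod name from 'kubectl get pods'

# High-level overview: which pods exist and what state are they in?
kubectl get pods -n $NS

# Application layer: is the app itself logging an error?
kubectl logs $POD -n $NS

# Kubernetes layer: scheduling failures, image pull errors, probe failures,
# and OOM kills all show up as events in the describe output.
kubectl describe pod $POD -n $NS

# Pod looks Ready but the app still misbehaves? Access it directly.
kubectl port-forward pod/$POD 8080:8080 -n $NS
```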
Back in k9s, you can see the current kubectl context at the top. And as you can see, four of my services appear to be broken. The one that we saw referenced on the frontend was currency, so it's probably this currency service. I can hit the d command to describe it, and now we're dropped into a description of that pod. If we scroll here to the bottom, we can see all the events that Kubernetes is publishing about this pod. By default, these persist for one hour, so after an hour they would get wiped out. We can see that two minutes ago it was assigned to a specific node, the container image was pulled, the container started, but then the readiness probe failed.

So let's take a look at what's going on here. If we scroll through the definition for this pod, we can see the reason here is that it was OOMKilled, which means out-of-memory killed: it tried to use more memory than the limit that was specified. And if we look down here in the resources section, it's only requesting five megabytes of memory and has a limit of ten. So why don't we bump those up, give it additional memory, and see if that solves the issue? I'll open up my microservicesdemo.yaml, search for the currency service, and find that deployment. As you can see, I've commented out the original values and replaced them with these insufficient values. So if I restore the original values and then do an apply again, and go back into k9s, you can see the currency service's original pod was crashing, we have a new replica coming up from that modified deployment, the new one appears to be healthy, and the old one was taken down.

OK, so that's one of our four issues solved. Let's load the application on the frontend again and see if we get a different error. OK, we're still getting a 500 error, but this time it's talking about the cart service. Let's take a look at the cart service. It is indeed in a non-ready state, and we can see that it has been restarting over and over. Let's go ahead and describe it and take a look at the events: it was assigned, the image was pulled, it started, and then the liveness probe is failing. If the liveness probes that we specify are not passing, Kubernetes will assume the application is unhealthy and will kill the pod in order to try again. If we scroll up to look at our readiness and liveness probe definitions here, we see that the liveness probe is checking our pod on port 8080, and the readiness probe is the same. So why might that not be working? If we scroll up here to the container definition, we can see that the port actually specified is port 7070. The kubelet is trying to check for liveness on port 8080, but the application is actually listening on port 7070. Let's go ahead and fix that. In our definition, we'll go to the cart service, scroll down, and as you can see, I had commented out the original version and replaced it with this bogus port. Now we'll reapply and go back into k9s to see if that fixes things. We've got our new pod coming up associated with that modified deployment. It appears to be running, but it's not in a ready state yet. And now our cart service appears to be healthy.

We've got two additional pods that seem like they're having issues. Let's take a look at the Redis cart. We see that it is in an ImagePullBackOff state, which means that Kubernetes is unable to pull the image associated with the container. If we look at the events, we can see more details about that.
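To recap the two fixes so far in imperative form: the lesson makes these changes by editing microservicesdemo.yaml and re-applying, but the sketch below shows roughly equivalent kubectl commands. The deployment names follow the upstream microservices demo, and the memory values are illustrative rather than the exact originals:

```bash
NS=debug-demo   # placeholder namespace

# Fix 1: the currency service was OOMKilled on a 5Mi request / 10Mi limit.
# Give it a realistic memory request and limit (illustrative values).
kubectl -n $NS set resources deployment/currencyservice \
  --requests=memory=64Mi --limits=memory=128Mi

# Fix 2: the cart service probes checked port 8080 while the container
# listens on 7070. Restore the original probe port in microservicesdemo.yaml,
# then re-apply and watch the rollout replace the crashing pod.
kubectl -n $NS apply -f microservicesdemo.yaml
kubectl -n $NS rollout status deployment/cartservice
```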
Back to the Redis cart events: it's pulling the image redis:debian and failing to pull redis:debian. So let's go to the Docker Hub page for the Redis container image and look under tags. We're trying to pull the debian tag; no tags found, so perhaps this tag is simply invalid. Instead, let's take a look at one of these other versions; let's use version 7.2.5. We'll go into the YAML file, search for Redis cart, and find the container image. And yes, here's the bogus image that I had added to make it fail. It looks like the original definition uses the redis:alpine image, so let's go ahead and use that. We'll apply, go into k9s, and now our Redis cart service has come up healthy.

And finally, we have our load generator service. Oh, interestingly, the cart service is still failing. Let's take a look at that. It looks like I modified only the readiness probe and not the liveness probe, so let's go and fix that. Yeah, I had set one of these back to the original value but not the other, so hopefully that fixes our cart service.

Then the load generator is the final one not working. Let's go ahead and describe it and scroll down to look at the events: zero of two nodes are available for scheduling this pod because two nodes have insufficient CPU. Let's look at the CPU required. We can see that the limit is specified as five virtual CPUs and the request as three CPUs. I believe the nodes that I provisioned in Civo have only two CPUs each, so that makes sense: the deployment is requesting at least three CPUs just for this application, but none of the nodes has the capacity to satisfy that, so it gets stuck in this Pending state. Let's look at the load generator in the YAML. Here you can see the original CPU request was 300 millicores, which I had upped to 3,000, and the original CPU limit was 500 millicores, which I had upped to 5,000. Those values were just too high, and our cluster could not meet the demand.

We can now apply one more time and go into k9s. It looks like our load generator is coming up, and now it's healthy. Now all of the services for our application are healthy. Let's go ahead and access it from the browser and refresh the page. This is what the demo application is supposed to look like: we've got our products with different pricing, we can choose any quantity and add items to the cart, those get saved in the cart, and we can switch to a different currency. So now all of the microservices appear to be functioning correctly, and that's represented both by the fact that our pods are all running and healthy and by the fact that we can access our application successfully from the browser.

Now, all of these breaks were within the deployment definitions. There could also be issues, for example, between the services and the deployments, where we specified the wrong set of selectors or the wrong ports. My goal here was to give you an idea of how to look at the information that the Kubernetes cluster provides, either via the kubectl command line or via a tool like k9s if you prefer, and walk down all the potential different issues, identify them, and fix them. As you work with Kubernetes, you're inevitably going to run into some issues. Make sure to take a step back, think about what the root cause could be, and systematically work through the different avenues for gathering information to identify and fix those bugs.
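As a closing recap, here is a sketch of the last two fixes as imperative kubectl commands. Again, the lesson itself makes the equivalent edits in microservicesdemo.yaml and re-applies; the deployment and container names here follow the upstream microservices demo and should be checked against the actual manifest:

```bash
NS=debug-demo   # placeholder namespace

# Fix 3: redis-cart sat in ImagePullBackOff because the redis:debian tag
# does not exist; point it back at a tag that does ('redis' is the container
# name used in the upstream manifest).
kubectl -n $NS set image deployment/redis-cart redis=redis:alpine

# Fix 4: the load generator requested 3 full CPUs (3000m) with a 5000m limit,
# which no 2-vCPU node could satisfy; restore the original 300m/500m values.
kubectl -n $NS set resources deployment/loadgenerator \
  --requests=cpu=300m --limits=cpu=500m

# Everything should now schedule, pass its probes, and report Ready.
kubectl -n $NS get pods
```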