14.2: GitOps (using Kluctl)


Transcript:

When it comes to GitOps, there are two names that people will always reference: Argo CD and Flux CD. Those are both very popular and very powerful projects. However, I'm going to showcase a third approach using Kluctl GitOps. One, because I think it's awesome and we can leverage all of that multi-environment configuration capability that we showcased in Module 12. And two, because I want more people to use this project. I think the developer experience is actually better in many cases than Flux or Argo, and because of that, I wanted to highlight it here and hopefully convince you that you should use it as well. The concept behind GitOps, as I mentioned before, is that you're going to have a controller running in your cluster that is able to pull updates from Git. Now, that can be triggered via a webhook to increase the speed with which updates make it into the cluster, but that's not super important. The important piece is that we have this operator which is able to pull in updates automatically and keep those updates in sync with the deployed state of the cluster. Now, within the kluctl-gitops subdirectory, we've got a few things defined. We've got our top-level .kluctl.yaml file. So we're going to define a deployment just like we did for our multi-cluster deployment, but in this case, the deployment is controlling the Kluctl GitOps controller. We've got our two targets. Again, staging is pointing to the Civo cluster and production is pointing to my GKE cluster. The args that we're going to pass into this deployment are just going to be the cluster name, and we've got a discriminator using that cluster name to enable Kluctl to properly prune and delete resources. The deployment specifies that we're going to start with our namespaces, we have a barrier, and then we're going to deploy our clusters. So for namespaces: there are two namespaces that Kluctl uses for GitOps, the kluctl-gitops namespace as well as the kluctl-system namespace. Those will be deployed first, and then it will look into this clusters subdirectory, where we find the following deployment. This is another interesting thing about Kluctl that we didn't look at in Module 12. In Module 12, all of our deployments were referencing things in our file tree; we were referencing a subdirectory. However, we can also reference a Git repo, public or private, and we can specify credentials to access it here. And so these configurations don't even need to live in our repo. They can live in, for example, the Kluctl repo, and we specify that it should install the controller and the web UI. So let's go to that repo and see what those contain. We're also specifying a tag directly, leveraging Git's ability to have versions and releases that serve us these files rather than needing to pull them in locally. So if we go to this repo and then to the install directory, the two things it was referencing were the controller and the UI. The controller one is going to deploy this, and the web UI is going to deploy this. The web UI is a stateless deployment that will allow us to visualize the state of our Kluctl resources, and the controller is the brains of the operation, actually performing the command-line executions that we would have done locally, but now they're going to happen inside of the cluster. So these are kind of the system components. And then we have two additional subdirectories. We've got the all subdirectory.
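To make that concrete, here is a rough sketch of what the top-level .kluctl.yaml and deployment.yaml described above might look like. The context names and directory layout are assumptions for illustration rather than a copy of the course repo:

# .kluctl.yaml (sketch)
args:
  - name: cluster_name
discriminator: "gitops-{{ args.cluster_name }}"
targets:
  - name: staging
    context: civo-staging        # assumed kubeconfig context name
    args:
      cluster_name: staging
  - name: production
    context: gke-production      # assumed kubeconfig context name
    args:
      cluster_name: production

# deployment.yaml (sketch)
deployments:
  - path: namespaces    # the kluctl-gitops and kluctl-system namespaces
  - barrier: true       # wait for the namespaces to exist before continuing
  - include: clusters   # the per-cluster deployments live in the clusters/ subdirectory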
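And here is roughly what the Git-based include inside that clusters deployment might look like. The subDir values follow Kluctl's install layout, while the tag and the per-cluster include names are placeholders:

# clusters/deployment.yaml (sketch)
deployments:
  - git:
      url: https://github.com/kluctl/kluctl.git
      subDir: install/controller
      ref:
        tag: v2.25.0             # placeholder; pin whatever release you actually want
  - git:
      url: https://github.com/kluctl/kluctl.git
      subDir: install/webui
      ref:
        tag: v2.25.0
  - include: all                          # shared resources for every cluster
  - include: "{{ args.cluster_name }}"    # the staging/ or production/ specific resources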
So this all subdirectory is where anything that's going to be common across our staging and production environments would live. Within it, we specify a custom resource called a KluctlDeployment. This is how we tell Kluctl about our project specifically. We say: here is the repo on Git, and I want you to look in Module 14, my current module, to find the resources I care about. By specifying this, Kluctl is going to be able to manage itself via GitOps. I want to target the cluster name specifically, and then I want to pass that in as I deploy this such that it will be usable as an argument. These two options allow the controller to then clean things up if I were to delete a resource from my Git repository. And so this is going to be shared across all of our clusters, with the only difference being that it will get a different cluster name specified from the argument that I pass it. Back to our clusters deployment.yaml. The final piece that we're going to install depends on the cluster name, one of which will be staging and one of which will be production; we're going to install one of those two deployments. This deployment is pointing to the Module 14 subdirectory. Those deployments are going to be pointing to the Module 12 directory and will control deployment of our application itself, just like we did in Module 12, but now in an automated fashion. I know that might have been a little hard to follow in terms of which elements the KluctlDeployments were managing, so I drew this diagram to hopefully help clarify things. On the left-hand side, we see the two KluctlDeployments, the first of which we're naming kluctl-gitops. This is going to be managing itself as well as the Kluctl controller and web UI deployments, and finally, it will manage the KluctlDeployment referencing our demo application deployment from the other module. By including itself in this deployment, we can now manage Kluctl via GitOps. That second KluctlDeployment, demo-app, references our Module 12 directory, and so it will be deploying our demo app Kluctl deployment that includes our third-party dependencies, the Postgres cluster, and all of our first-party services. We'll manually deploy the kluctl-gitops deployment one time; that will then install the second KluctlDeployment, which will, in turn, deploy all of our applications. Let's go ahead and deploy these two GitOps controllers into the staging and production clusters. We'll start with staging. I went ahead and cleaned up the namespaces and resources that we had deployed manually from Module 12, so that we would have a fresh slate and should be able to deploy into this cluster now. Navigating into our Module 14 directory, into the kluctl-gitops subdirectory, we can see we have four tasks here. The first one is to deploy into the production cluster; the second one is to deploy into the staging cluster. Let's start with staging. This just calls kluctl deploy, passing it the staging target. We'll say yes. If we list our KluctlDeployments across all our namespaces, we've got our GitOps deployment. This one looks like it has finished deploying; it was successful. This includes two pods within the kluctl-system namespace: the controller, which is where all of the Kluctl logic is being applied, and the web UI, which we can actually use to investigate how these deployments are progressing. When we deployed this, a random password was generated, which I can get out of the Kubernetes secret using this command.
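For reference, a KluctlDeployment along the lines of what's being described might look like the sketch below. The repo URL, paths, and names are placeholders rather than the course's actual values, and since this manifest is itself deployed by the outer Kluctl project, it can use the cluster_name arg via templating:

# all/kluctl-gitops.yaml (sketch)
apiVersion: gitops.kluctl.io/v1beta1
kind: KluctlDeployment
metadata:
  name: kluctl-gitops
  namespace: kluctl-gitops
spec:
  interval: 5m                        # re-check Git for new configuration every five minutes
  source:
    git:
      url: https://github.com/your-org/your-repo.git   # placeholder repo URL
      path: module-14/kluctl-gitops                    # placeholder path to this module
  target: "{{ args.cluster_name }}"   # staging or production, rendered from the arg we pass in
  args:
    cluster_name: "{{ args.cluster_name }}"
  context: default
  prune: true     # clean up resources that are removed from the Git repository
  delete: true    # clean everything up if this KluctlDeployment itself is deleted

The demo-app KluctlDeployment would look similar, but it would point at the Module 12 path and live under the per-cluster directories.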
I'll copy that password. Then I'll port-forward to the service in front of that web UI; that's the kluctl-webui service in the kluctl-system namespace, on port 8080. Now I can go to localhost:8080 and log in with admin and the password that I just copied. We can see here's the GitOps deployment, and it looks like it's healthy. It includes both the commands that I issued via the command line (that's this one, deployed two minutes ago) and the deployments that will be issued by the controller. It looks like our demo app is still reconciling. You can see the Node API is up. The db-migrator is crashing, likely because the database is not up yet; it just came up 38 seconds ago, so hopefully the next time the db-migrator restarts, it should connect successfully. That will enable the Golang API to come up, and then our React client will successfully start up as well once it detects all of the necessary backends. I'll also create that secret for our Python load generator. If this were a real project, rather than manually creating that image pull secret, I would store it in Google Cloud Secret Manager and use the External Secrets Operator, like we showed in an earlier module, to pull those values into our cluster automatically. Then the non-sensitive ExternalSecret configurations could live within the Git repo alongside the rest of our configuration, and there would be no manual step after the initial deployment. A few minutes later, after all of those bootstrapping steps have taken place, all of our service pods are now healthy. We can find our external IP and edit /etc/hosts (this was our staging cluster), then navigate to our staging URL, and there's our application, completely bootstrapped via GitOps. Let's take another look at that web UI. Refresh the page. I'm going to just revalidate now that our application is healthy. We've got a healthy validation state, but it's telling me that my reconcile failed; I think it's because it timed out on that initial run. Let's try it again. It's still showing an error here, specifically for the Traefik dashboard IngressRoute. The error appears to be related to how the Helm hook associated with the IngressRoute for the Traefik dashboard is being applied. Let me try manually deploying Module 12 to see if that resolves the issue. Deploy staging. One interesting thing about the Kluctl GitOps model is that it's meant to let you use the in-cluster controller in conjunction with the CLI, whereas Flux and Argo generally discourage that practice. By default, if you manually apply a change, the GitOps controller is not going to revert it unless you have specifically told it to, or until the next time you push an update to your Git repo, at which point it would revert those manual changes and apply whatever configuration you've pushed. Refreshing the web UI, we can see that we now have a command-line push here that was deployed one minute ago, and everything appears healthy. Let's rerun our deployment. I most likely could have just rerun that deployment on the server side and it would have succeeded; we'll try that in the production cluster. And so we've got our staging cluster deployed. Now let's deploy our production cluster: kluctl deploy, pass it the production target, say yes. Let's take a look at that web UI as the two KluctlDeployments come online. This is a random password that was generated, so it'll be different from the staging password. You can see the GitOps KluctlDeployment.
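As a sketch of that fully automated alternative, an ExternalSecret along these lines could have the External Secrets Operator render the image pull secret from Google Cloud Secret Manager. The store name, namespace, and Secret Manager key are illustrative, not values from the course repo:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: image-pull-secret
  namespace: demo-app                  # placeholder namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager           # assumes a ClusterSecretStore backed by GCP Secret Manager
  target:
    name: registry-credentials         # the pull secret the pods reference via imagePullSecrets
    template:
      type: kubernetes.io/dockerconfigjson
      data:
        .dockerconfigjson: "{{ .dockerconfig }}"
  data:
    - secretKey: dockerconfig
      remoteRef:
        key: demo-app-dockerconfigjson # placeholder Secret Manager entry holding the docker config JSON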
We deployed that GitOps deployment first through the command line, and then the first GitOps reconciliation was successful. You can watch the logs of this demo app production deployment going through. It's waiting on that db-migrator job, which will keep crashing until the Postgres cluster comes online. Looks like our database is healthy, our migrator job has completed, and our two replicas of the Golang API are coming online. Now I'm going to delete this pod so it'll restart. And we've got our healthy services. Let's get our Service and its external IP, put it in /etc/hosts, and navigate to our domain. And there we go: our production application is now deployed via GitOps. Let's check the web UI and see if that same issue that happened on staging happened in production. It looks like it did. When I click validate, it checks that the target is ready, and it is. Now let's click deploy. The controller should rerun that deploy command behind the scenes, and everything should come up healthy. There we go, we got a checkmark. At this point, these KluctlDeployments will check for new configurations on GitHub and rerun those deployments every five minutes, as specified in their configuration, and so they will automatically pull in new updates. To demonstrate this, we can go here and merge this pull request that was generated automatically by our GitHub Action, and we should see an upgrade across all our services on staging from this version to that version. Looks good; let's go ahead and merge it. We'll switch over to our staging cluster. Just to confirm the versions that we're currently running, let's do a k get pod with YAML output. We can see it was indeed using that 0.3.0 version. Now, I could wait for the next iteration of the reconcile loop, or I can go ahead here. Oh, it looks like it has detected the new version and run it on the GitHub side. Once that five minutes has elapsed, this will rerun and we should get that latest version; I can also skip the wait by clicking the deploy button. Let's go into our cluster. We'll do k9s. It looks like that new version must have been detected, because we've got new pods coming online, and all of our pods have now upgraded to the latest version. Let's just confirm that: we're now on version 0.5.0-45, and we can see that this was deployed just 40 seconds ago. So that showcases making a change in our Git repo and having it automatically reflected in the clusters. Now, instead of having to deploy things manually, we can just make those changes in our Git repo and they will flow automatically into the cluster at the next iteration of that deploy. One aspect of our GitHub Action that we didn't demonstrate was the trigger based on a release event. So now that we've made it all the way through the course, we've got everything up and running: our GitOps controllers are active and our CI/CD is set up. I'm going to tag a release. We'll go here to the releases page, say draft a new release, and this is going to be the 1.0.0 release. We'll create a new tag when we publish it, and we will click publish. This is going to create the 1.0.0 tag. We should be able to go to GitHub Actions and see that we now have a run of our image CI workflow based on that tag. And now, because of how we set up that generate-image-tag command using git describe, we'll see here that the image tag getting used for our container images is 1.0.0. Now that the images have been built and pushed, we're updating the tags.
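For context, the trigger and tag-generation pieces of an image CI workflow like the one described might look roughly like this. The workflow name, trigger details, and build steps in the course repo may differ:

# .github/workflows/image-ci.yaml (sketch)
name: image-ci
on:
  push:
    branches: [main]
    tags: ["*.*.*"]          # publishing the 1.0.0 release creates this tag and triggers a run
  # alternatively, trigger on the release event itself:
  # release:
  #   types: [published]
jobs:
  build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0     # full history so git describe can see the tags
      - name: Generate image tag
        run: echo "IMAGE_TAG=$(git describe --tags --always)" >> "$GITHUB_ENV"
      # ...build and push each service image tagged with $IMAGE_TAG,
      # then update the image tags in the Kluctl configuration...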
With that complete, we see this auto-generated pull request updating the image tags to 1.0.0. You'll notice that it is updating both our production tags and our staging tags, which is what we want: we want to be running this image across both environments. We can merge this, go to staging and redeploy here, then go to production and redeploy here. We can see that our pods updated a couple of minutes ago. Let's check the version, and there we can see that our pod is running that updated 1.0.0 version in production; I expect it to be the same in staging. Both of our apps are up and healthy, and we can hit either one from the browser. With that, we've got a fully automated pipeline: we make code or configuration changes, push them to Git, and those changes make their way into the appropriate environment by building the container image, pushing it to its registry, updating the Git manifests, and then having our GitOps controller automatically pull that into the cluster and update its state. This gives us that powerful, familiar workflow where we can push code to main and have it automatically deploy to staging. The one thing we would need to change is that, rather than creating a pull request from our GitHub Action, we could commit those staging image changes directly to main. I just wanted to use the pull request option to showcase how you could have a human in the loop if you wanted to. But now you've got a really robust GitOps-based workflow that you can use across any number of clusters within your organization.
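As a sketch of that final tweak, the tag-bump steps (continuing a job like the one sketched earlier) could either open a pull request for a human to approve or push the staging change straight to main. The action version and commit details below are assumptions, not necessarily what the course repo uses:

      # Option A: human in the loop, open a PR with the updated image tags
      - uses: peter-evans/create-pull-request@v6
        with:
          branch: update-image-tags
          title: "Update image tags to ${{ env.IMAGE_TAG }}"
      # Option B: fully automated staging, commit the staging tag bump directly to main
      - name: Commit staging image tags
        run: |
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          git commit -am "ci: bump staging image tags to ${IMAGE_TAG}"
          git push origin main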