And why Kubernetes should actually be part of your stack.
We’re writing this after a lot of thought about the article published by Coinbase on their blog, Container Technologies at Coinbase. Although we understand a lot of their reasoning, there are a few things that we would like to point out.
First of all, we totally agree that Kubernetes is not for everyone at this point in time. It is complex and comes with quite a learning curve. Although it is very easy to get started, it’s not that easy to do it right. Things are improving, however, and having someone at hand to guide you is a real help. There are also several courses available (including our own “Kubernetes for Developers” training) that will help you along tremendously. Once you get the hang of it, though, making use of the platform is a joy for most people. The speed at which you can apply changes, and the freedom to stop thinking in “machines” and start thinking in “resources”, make for a really nice way of working in our industry.
So we would like to address the points raised by Coinbase, not really to refute them, but to offer some additional perspective, and maybe some options if you find yourself in a similar situation.
One of the first bolded statements is that many of the advertised features, like storage orchestration, secret and config management, and automatic bin packing, do not work for large-scale installations without intense investments in forking, customization, or integrations and separation. That may be true in certain situations, but in our experience most companies use these services as provided by Kubernetes; that is certainly the case in the clusters we maintain. You may not like them (we’re personally not big fans of the way Secrets are handled, for instance), but if you really want to, alternative solutions are available. To point out a simple one: we see customers switching their Secrets to HashiCorp’s Vault and using that for fine-grained control over them. Storage orchestration, although a very nice feature, sees little use in general, as applications in containers should preferably be stateless and therefore not require persistent storage; temporary storage is easy to provide via the local machine. For configuration management, ConfigMaps are a very good solution and often used as-is. The flexibility of the platform even allows you to use the provided solutions where they make sense, and switch to different solutions where you need to.
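To give an idea of how little ceremony this takes, here is a minimal sketch (all names and values below are made up for illustration) of a ConfigMap consumed by a Pod, both as environment variables and as a mounted file:

```yaml
# All names and values are made up for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-app-config
data:
  LOG_LEVEL: "info"
  settings.ini: |
    [features]
    new_checkout = true
---
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
    - name: app
      image: registry.example.com/example-app:1.0.0
      envFrom:
        - configMapRef:               # every key becomes an environment variable
            name: example-app-config
      volumeMounts:
        - name: config
          mountPath: /etc/example-app # settings.ini appears here as a file
  volumes:
    - name: config
      configMap:
        name: example-app-config
```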
In regard to automatic bin packing: in general, this requires good insight into the resources your containers actually need. To make the best use of it, and save the most on your infrastructure, tuning and benchmarking as part of your CI pipeline is essential. You could even write your own scheduler that uses those measurements and tries to accommodate them. This requires a different way of thinking about your infrastructure: you are no longer packing your work units into “machines”, you’re packing them into “resources”. You benchmark your container for a performance pattern you’re happy with and set up its resources accordingly. If more resources are required, an additional replica is started based on your rules and you scale out. None of this is mandatory, however; if you prefer to simply give a specific application a large amount of resources, you can still do that, and sometimes that makes sense. But at least you have the option to do it smarter, if the time investment is worth it for you.
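As a concrete sketch of what resource-based packing looks like (the numbers are hypothetical; real values should come from your benchmarks, and the HorizontalPodAutoscaler uses the autoscaling/v2 API, so check what your cluster version supports):

```yaml
# Hypothetical numbers; derive real values from benchmarking your container.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: registry.example.com/example-app:1.0.0
          resources:
            requests:          # what the scheduler packs on
              cpu: 250m
              memory: 256Mi
            limits:            # hard per-container ceiling
              cpu: "1"
              memory: 512Mi
---
# Start extra replicas when the benchmarked envelope is exceeded.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```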
The number one mistake people make when evaluating Kubernetes (or any other container orchestration software) is expecting things to work out of the box exactly the way they want them to. Sorry, they do not. The value that orchestration software adds is the ease with which you can add or modify behaviour. Sometimes the out-of-the-box defaults are fine; other times, you need to tweak them. But this is just a fact of life: we have yet to find any application that is perfect out of the box for everyone.
And nobody is forcing you to make use of the entire platform. If you already have a great solution for handling secrets, use that! You do not need to plug it into Kubernetes, but you can if you really want to. That’s part of the magic of Kubernetes: the ease with which you can extend it, by either adding services into the Kubernetes framework or just running something outside of it.
Next up is a list of problems Coinbase expects to face if they were to switch to Kubernetes. Let us respond to each of them in turn, because honestly, we don’t see how many of these would prevent you from switching if it offers you enough benefits.
“We would need to build/staff a full-time Compute team.”
Yes, a team with knowledge about the software you’re running is a requirement, or at least someone at hand with expert knowledge. But that goes for any new technology, whether you develop it in house or adopt it from elsewhere. We suspect Coinbase has a full team running and working on Odin as well. And AWS-certified engineers to help out with Route53 and ALB settings. And of course blockchain experts to help with problems in that area. And if they are considering a switch to Fargate and ECS, experts would be needed for those services too. Any change to a new technology requires you to invest in building up knowledge and experience with it to get the most out of it. That should not stop you from considering the switch.
If Coinbase means that they do not expect this investment to provide a worthwhile return, well, that’s up to them to decide, of course. But keep in mind that even a small team can already do a lot, and you probably already have much of the required experience within your current operations team. You would still need the AWS expertise, but you get to commoditise it by wrapping it into Kubernetes building blocks. So you would end up adding Kubernetes experts to your team, or perhaps training a few of your current members into that role. But it is still applying a lot of the knowledge you already have available. Kubernetes by itself is just a few processes and a lot of glue, and you already have the knowledge about the glue!
And we think that especially if your setup is large enough, the investment is worth it. You get, in essence, a self-service platform in which you can provide spaces for your developers to try things out before sending them to production. Developing on the cluster is good practice, whether it’s on a large shared cluster (probably not something Coinbase would particularly want, but it actually makes a lot of sense for many companies), locally in minikube or kind, or even in telepresence environments. No longer are infrastructure engineers required to deliver a new environment; your developers just need to learn how to write manifests describing the resources and connections they need. Everything else can be automated: no more forgotten backups, no more separate deployments of log aggregation solutions, metrics are available out of the box and scraped as soon as your Pods are deployed, TLS certificates are requested automatically, et cetera.
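As an example of the kind of manifest-driven automation we mean, here is a sketch of a developer requesting a TLS-terminated endpoint with nothing more than an Ingress. It assumes cert-manager is installed on the cluster and that a ClusterIssuer named letsencrypt-production exists; the hostname and Service name are made up:

```yaml
# Sketch: assumes cert-manager is installed and a ClusterIssuer named
# "letsencrypt-production" exists; hostname and Service name are examples.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
spec:
  tls:
    - hosts:
        - app.example.com
      secretName: example-app-tls   # created and renewed by cert-manager
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
```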
A good operations team is automating the hell out of their jobs. Kubernetes provides them with an API to easily make that automation available everywhere, so your developers can reap the benefits of that without any intervention by operations. But also without compromising on the integrity and reliability of the platform as a whole.
“Securing Kubernetes is not a trivial, easy, or well understood operation.”
Again, totally true. Security is hard. So is hardening Linux, or your AWS accounts. And how certain are you about the security practices in ECS or DynamoDB? Not your usage of them, but the implementation at AWS? Any new technology requires an investment in security. Securing anything is neither trivial nor easy, and we would argue that while you might be familiar with the problem space, unless you’re some sort of demi-god, you don’t understand it half as well as you think you do.
Implementing known best practices and enforcing them is a really good way to prevent 99.9% of security problems. Most of the problems are not even Kubernetes-specific. For instance, Coinbase points to three CVEs, all of which are preventable or present on other platforms as well, no matter how secure your practices are. In particular, the last CVE they mention, CVE-2019-5736, affected AWS as well: both Fargate and ECS were vulnerable to it. This just shows that no platform is 100% secure; all of them require ongoing time and effort to secure, even when managed by Amazon. Kubernetes is no different.
A fair question here is whether the investment in securing the platform is worth the time you will have to spend on it. Yes, it might be a new beast to secure, but once the security is in place, it becomes a lot easier to enforce. Security is an ongoing process, whichever platform you choose; it’s never done. But being able to enforce it fairly easily will at least prevent the junior developer from running as root, without really stopping them in their tracks. RBAC is not perfect, but it gives your developers a minimal set of privileges without getting in the way. Open Policy Agent’s Gatekeeper can be added to enforce many security best practices with regard to running images. You can even write your own operators that check for specific policies you would like to apply. Although it does require time and investment to make the platform secure, it is by no means an impossible task in most situations. And the ability to enforce it for your users is pretty sweet.
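As a sketch of what that minimal set of privileges can look like, here is a namespace-scoped Role and RoleBinding; the namespace and group names are hypothetical:

```yaml
# Sketch: developers in the (hypothetical) group "team-a-developers"
# may manage common workload resources in namespace "team-a", nothing more.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: team-a
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "pods/log", "configmaps", "services", "deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers   # comes from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```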
“Managed Kubernetes (EKS on AWS, GKE on Google) is very much in its infancy and doesn’t solve most of the challenges with owning/operating Kubernetes (if anything it makes them more difficult at this time).”
This is so true, but it’s all the more reason to take control of the platform yourself. Or to buy into any of the experts already out there (like Kumina, this small company in Eindhoven, the Netherlands, that’s a Kubernetes Certified Service Partner and sets up and maintains clusters for its customers, making it easier to actually use the platform!). Providers like that can make the platform easy to use and have sound security practices in place, and they will help you keep it secure and easy to use as well. Talk to us, see if we can help you. This is our job; we’re pretty sure we can help you with any problem that’s preventing you from implementing it.
The managed Kubernetes solutions from the big cloud providers are still very basic and, in our view, not production ready. They lack a lot of the features that make Kubernetes such a nice platform to work with. The power of the platform is its extensibility, and the way the open source community picked up on that by creating a lot of operators that make life easier. Use the cloud provider’s offering to quickly test Kubernetes and see if you can get your application running on it, but as soon as you get serious about it, you’ll want to extend and customize the platform. Deploying and maintaining it yourself gives you that power, whether through an in-house team or an external one.
“Cluster upgrades and management require a much more operationally heavy focus than we have today.”
Again, external providers can help you here. There are expert companies out there that perform these actions daily and can tell you exactly what to expect. Kubernetes is still under heavy development and new features are added in each release, but that doesn’t mean you have to start using those features immediately.
Development is getting to the point where changes to already-running services are few and far between. Keep in mind that, in essence, you just want the platform to start up your containers, and there are only so many ways of doing that. This is the core; everything else is extra.
Upgrades do need to be tested, whether it’s a platform upgrade, a distribution upgrade, or even a library upgrade. Coinbase’s argument that rolling updates of clusters are not yet mature does not match our experience. Yes, you need to run multiple clusters to have a staging/acceptance cluster on which you can try changes. No, you do not have to fail over your applications while you upgrade your cluster; that’s one update strategy that works, but it is absolutely not required.
A lot of our customers run a single cluster. They depend on our knowledge and experience to tell them how an upgrade will be handled. If an upgrade is disruptive, we set them up with a new cluster and they can indeed switch over. Most upgrades can be done in place with no disruption to services at all. Yes, that means you need to set up your applications to be able to migrate workloads, but you need to do that anyway for redundancy purposes; hardware can and will fail, after all. Rolling nodes on a Kubernetes cluster is no different to node failure. It’s what allows us to maintain clusters without having to bother the customer for each and every upgrade of Kubernetes, while also not requiring them to run an ancient version. It’s what the platform was designed for.
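The piece that makes such in-place node rolls safe is the PodDisruptionBudget, which tells the cluster how many replicas of an application must stay up while nodes are drained. A minimal sketch (it assumes the matching Deployment runs at least three replicas; older clusters use policy/v1beta1 instead of policy/v1):

```yaml
# Sketch: during a node drain, never let example-app drop below two
# available replicas (assumes the Deployment runs three or more).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example-app
```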
Do things fail? Of course they do; that’s life. That’s why you set up failsafes. You have those now as well, I hope.
“Today, we do not carry this burden.”
True, but you also don’t have any of the benefits, which Coinbase doesn’t really describe. Nowhere is it mentioned which problems might be solved by switching to Kubernetes, nor is any assessment made of whether the work involved would be continuous or a one-time effort to get started.
We’re not saying that we know better than Coinbase how to run their business, because we obviously do not. We are contending that the arguments made for not switching to Kubernetes are fairly weak. A new technology poses new challenges; this will always be the case. But what is the trade-off? And how much is there to gain?
Let us make a case for Coinbase to switch to Kubernetes, without knowing their specific problems or how they solve them currently, by looking at the problems described in their document.
Control over your own security
If you are running on AWS (especially when using the more advanced managed services), you’re essentially outsourcing security. This can be a conscious decision, but we’ve often seen people assume that in the cloud one does not need proper security measures, or that the cloud provider knows best. We do not doubt that there are a lot of smart people working for the cloud providers, but their security practices are on the whole very opaque. One can sometimes find security advisories, but we as customers generally do not know how well and how often they update their platform.
We do not intend to spread FUD here; we’re sure their practices are good. However, in our experience ops people want to have control over and insight into everything that is happening. We do, anyway. That’s the main reason the whole observability movement is getting so much traction: ops people want to know what’s happening and see it as clearly as possible. We consider this good practice for security as well. After all, trust is fine, but checking is better.
By using only a minimal number of black-box services at a cloud provider, you keep a lot of control over and insight into your infrastructure. Yes, it’s work, but at least you fully control everything you’re responsible for. If security matters to you, you should trust outsiders as little as possible and use as few black-box services as possible. A self-maintained Kubernetes on top of AWS EC2 instances fits that bill, in our not so humble opinion. Just running your own stuff on EC2 instances would fit it as well.
What Kubernetes adds here is the audit log and Admission Controllers. The audit log makes it easy to get a full view of what’s happening on your cluster and who is initiating actions. Admission Controllers limit the types of actions that can be performed and enforce best practices. You will still have to parse the AWS audit logs as well, of course, but for everything else, the Kubernetes audit log can tell you exactly who and what was active on your cluster.
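To illustrate, here is a minimal audit policy sketch of the kind you hand to the API server via its --audit-policy-file flag; the rule set is deliberately simple and should be tuned to your own needs:

```yaml
# Minimal audit policy sketch; the first matching rule wins.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Never log Secret payloads, only who touched them and when.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Record the full request body for every mutation.
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
  # For everything else, record who did what.
  - level: Metadata
```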
Allowing developers to take control
Most developers do not appreciate too much hand-holding; they want to deploy. Giving them the right access to Kubernetes, and enforcing your requirements through CI checks, Gatekeeper, or other mechanisms, gives your developers a lot of freedom. Freedom to experiment, but also freedom to take ownership. Most problems with applications do not originate in the infrastructure and can thus best be debugged by the developers themselves. A properly set up Kubernetes cluster gives them the tools to do that effectively.
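For example, a Gatekeeper constraint can require every Deployment to carry an owner label, so it is always clear who to call. This sketch assumes Gatekeeper and a K8sRequiredLabels ConstraintTemplate (as shipped in the Gatekeeper demos) are installed; the label and constraint names are our own invention:

```yaml
# Sketch: rejects Deployments without an "owner" label.
# Assumes Gatekeeper and the K8sRequiredLabels ConstraintTemplate are installed.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-must-have-owner
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels: ["owner"]
```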
Decouple infrastructure from code
DevOps is awesome, but we should not forget that it’s a cultural thing; there are still (at least) two different disciplines involved. Operations engineers want to be able to work on the infrastructure during normal business hours too. Kumina has been doing operations since 2007, so we know how it goes: everyone wants their changes applied only at night, during off hours, when the impact of a change is smallest.
With Kubernetes, this is no longer strictly necessary. Maintenance can be tested on a test cluster, so the ops engineer knows what to expect with each upgrade. Thanks to Kubernetes’ self-healing properties, an upgrade can be done during normal business hours without any interruption to the running applications: workloads are simply rescheduled to other instances, functionality you need for any form of high availability anyway.
Cost savings
The up-front investment in the infrastructure and in gathering institutional knowledge pays for itself quickly. When you use full machine instances for your application, chances are you will always have some overhead from the OS itself. This overhead is small, but non-zero. With Kubernetes, that overhead is amortized over many containers per machine, saving you those additional resources.
In addition, most of the time you will use overspecced servers for your application. After all, you want a bit of leeway in case of a sudden spike, or when it takes AWS a bit longer than usual to start up a new server. You’ll have this leeway on each of your servers, and with a large number of servers that adds up to quite a lot. With Kubernetes, you can still have this leeway while making better use of the available resources. By using Pod Priorities, you can free capacity for important workloads on demand: less important Pods are evicted, which in turn triggers a scale-up of the cluster. Those less important Pods can wait for a newly provisioned server, while your production environment uses the freed resources immediately. You can even create a few dummy workloads that keep a couple of “empty” servers waiting to take on load if needed, as sketched below. And by using the API and operators like kube-downscaler, you can reserve those additional resources only during the times you actually expect to need them.
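The dummy-workloads trick can be as simple as a negative-priority placeholder Deployment running pause containers: it reserves headroom, and the scheduler evicts it the moment real workloads need the space. A sketch, with made-up sizes:

```yaml
# A PriorityClass below zero marks Pods as "evict me first".
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder capacity, evicted as soon as real workloads need it."
---
# Placeholder Pods that do nothing but hold resource requests,
# keeping roughly two servers' worth of headroom available.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 2
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"        # made-up sizes; match your node capacity
              memory: 4Gi
```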
You can use the same trick to shut down staging and development environments outside office hours, saving you the cost of those machines when you’re not actually using them.
All in all, by making better use of your existing resources, you can lower your cloud computing bill quite quickly.
You could do the downscaling with ASGs and some Lambda functions as well, but by integrating it into Kubernetes and your infrastructure, you give developers access to the same tooling. They don’t need to know how to operate Lambda functions; they can just add a YAML annotation to their Deployment.
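For example, with kube-downscaler running in the cluster, keeping a staging Deployment scaled up only during office hours is a single annotation. A sketch (the annotation syntax below follows the kube-downscaler documentation as we know it; verify it against the version you deploy):

```yaml
# Sketch: kube-downscaler scales this Deployment to zero outside the
# given uptime window; everything except the annotation is a made-up example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staging-app
  annotations:
    downscaler/uptime: "Mon-Fri 07:30-19:30 Europe/Amsterdam"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: staging-app
  template:
    metadata:
      labels:
        app: staging-app
    spec:
      containers:
        - name: app
          image: registry.example.com/staging-app:1.0.0
```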
Conclusion
In the end, the article as provided by Coinbase does not offer enough insight into their current infrastructure to say whether or not it would be beneficial for them to switch to Kubernetes. However, the arguments they provide against switching are not very convincing to us. We’ve seen the Coinbase article cited a lot by people who think this whole Kubernetes thing is just a fad that will pass once we find something better, so we wanted to take this opportunity to point out that Kubernetes solves real-world issues and actually makes life a lot easier. If you’re looking for ways to improve your infrastructure and the access you can give to your developers, it makes sense to take a serious look at the benefits it can provide.
Do you lack the knowledge to implement this correctly and need it quickly? Well, we’re happy to help you! Contact us at sales@kumina.nl if you would like to talk about this. We can help you get the most out of a Kubernetes cluster without you having to learn how it all fits together in detail!