
GCP or: How I Learned to Stop Worrying and Love the Cloud

Lukas Resch
May 2019

At the beginning of this year, we became a Google Cloud Platform (GCP) Partner. We are therefore now able to resell GCP services and help you on your journey to the cloud.

In this article, we will outline the process of moving an on-premises service to the cloud. Based on the story of one of our customers, we will show how to increase performance and availability while reducing maintenance costs.

Onboarding

Before we begin, we must decide what kind of organizational setup we want. When working with a GCP reseller, you have two options: one is to set up your own organization and use only the billing account provided by your reseller; the other is to rely on your reselling partner to set up everything for you within their organization. Either way, you will receive your bill from your reseller, and there is no need to keep a credit card on file for payment.

Given the fine-grained privilege model provided by GCP, it is also possible to combine both options. For example, you can set up your own organization for convenience but defer the entire setup to your partner.
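To illustrate, such a delegation could be expressed with Terraform (the tool we use in the Implementation section below) roughly as follows; the organization ID, group address, and role are placeholders for whatever you agree on with your partner:

resource "google_organization_iam_member" "partner_admins" {
  org_id = "123456789012"                      # placeholder organization ID
  role   = "roles/resourcemanager.projectCreator"
  member = "group:gcp-admins@partner.example"  # placeholder partner group
}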

If you want to set up your own organization, you must either prove ownership of a domain to Google or sign an offline contract. Proving domain ownership is done with a simple TXT record on your nameserver; the setup wizard will guide you through the process.
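The verification record looks roughly like the following zone file entry, where the domain and token are placeholders; the wizard shows you the exact value to publish:

example.com.   3600   IN   TXT   "google-site-verification=<token-from-wizard>"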

Irrespective of your initial decision, it is possible to move to a different setup later. However, this is not the usual process, and because such a migration offers a rather large attack surface for information retrieval, denial of service, and other attacks, the process is deliberately burdensome. Take this into account before deciding which way you want to go.

Analysis

Once the organizational decisions have been made and the initial setup has been completed, we can start analyzing the current setup on your site. While it is possible to simply move VMs or bare-metal machines to the cloud as they are, it usually pays off to spend some extra time on the analysis.

In the story at hand, the first thing to be migrated to the cloud was a single, simple application: a monolith that acts as a proxy while also processing requests on its own. To compute proper answers to incoming requests, it needs a user database. Front end, back end, database: so far, a very typical application of the kind found in any given IT setup. The one slightly less common aspect is that incoming traffic is UDP rather than TCP. While this is unusual, it does not pose a problem at all, as we will see later.

The service is mission-critical and accessed from all over the world, so running it on-premises poses a risk of service outages. Most businesses cannot afford multiple uplinks or the personnel to run a 24/7 on-site operation. Keep in mind that 99.99% availability leaves a downtime budget of only about 53 minutes per year (0.0001 × 365 × 24 × 60 ≈ 52.6), so a single incident on a Saturday night can break your SLA for an entire year.

Developing a solution

The given setup could be replicated within GCP as is. That would give us a single VM running all the services, and every single fault would bring the entire system down. We would also have to overcommit on the required resources to deal with peaks, while the cost of maintenance would not change. Therefore, we decided to leverage as much of the GCP service catalog as feasible.

Breaking the monolithic application into its parts and rebuilding it with a microservice architecture was not possible, so we need VMs to run it in. Overcommitting on-premises is quite cheap: you bought those huge machines anyway, and the hypervisor does ballooning, balancing, and all sorts of other nice things to keep every VM supplied with the resources it really needs. In the cloud, you pay for the resources you allocate. While the underlying hypervisor is still able to do all those nice things, you pay the same price whether your utilization is at 1% or at 100%.

In the given setup, we reduced the resources for the VM by more than 50%; with that, we can cover the load 95% of the time. At the same time, we run a second VM in parallel. With the second VM we can not only cover load peaks but also handle the failure of a VM with no downtime noticeable to the user. The service was not built with that in mind: all clients rely on a single IP address to which they send their requests. Within GCP we can deal with this shortcoming of the service by configuring a load balancer. Remember that we are dealing with UDP traffic? We simply set the load balancer to handle UDP. Problem solved.
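A minimal Terraform sketch of such a setup might look like the following; the names, region, zone, and port are illustrative, not details of the actual deployment:

# Target pool grouping the two VMs that run the service.
resource "google_compute_target_pool" "app" {
  name   = "app-pool"
  region = "europe-west1"

  instances = [
    "europe-west1-b/app-vm-1",  # placeholder instance names
    "europe-west1-b/app-vm-2",
  ]
}

# One stable IP for all clients, forwarding UDP instead of TCP.
resource "google_compute_forwarding_rule" "app_udp" {
  name        = "app-udp"
  region      = "europe-west1"
  ip_protocol = "UDP"
  port_range  = "5000"  # assumed service port
  target      = google_compute_target_pool.app.self_link
}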

Besides distributing traffic between multiple machines, GCP's load balancer takes care of a few other things for us. The initial two VMs running our service cover the needs in most situations, but during larger peaks we would still have to deny service to some users. Denying service is not an option; scaling is. Combined with the load balancer, GCP's auto-scaling is in fact a good and easy-to-use option. Requests can cause high CPU load on our service, so we configured auto-scaling to start new VMs once the average CPU load across all available machines exceeds 80%. The additionally allocated resources generate costs, and to avoid a bill that threatens our business operation we set a cap on the number of machines that can be spawned; Cloud Logging will inform us about scaling events anyway. Should we require more resources in the future, we can deal with that by increasing the VM size, for example.
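Strictly speaking, the autoscaler attaches to the managed instance group behind the load balancer rather than to the load balancer itself. A sketch with assumed names and an illustrative cap of six machines:

resource "google_compute_autoscaler" "app" {
  name   = "app-autoscaler"
  zone   = "europe-west1-b"
  # Assumed managed instance group running the service VMs.
  target = google_compute_instance_group_manager.app.self_link

  autoscaling_policy {
    min_replicas = 2   # the two baseline VMs
    max_replicas = 6   # hard cap to keep the bill bounded

    cpu_utilization {
      target = 0.8     # scale out above 80% average CPU load
    }
  }
}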

The load balancer also comes with some further benefits. For example, we are now able to do canary releases. It also runs health checks on the existing VMs and replaces them with new instances if they become unresponsive for any reason. Furthermore, we can reduce the risk of downtime by spreading our machines across multiple availability zones.
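Such a health check could be declared as follows; the /healthz endpoint and the timing values are assumptions for the sake of the example:

resource "google_compute_http_health_check" "app" {
  name                = "app-health"
  request_path        = "/healthz"  # assumed status endpoint on each VM
  check_interval_sec  = 5
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 3           # replace a VM after three failed probes
}

Referencing this check from the target pool's health_checks attribute wires it into the load balancer.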

Finally, we still require the database, something we can look up data in. We could just install the DBMS of our choice on another VM. The better option, when it comes to operations, is to use Cloud SQL. While we are still only paying for the hardware resources we allocate, we get a fully managed DBMS: updates, access restrictions, and firewall settings are all taken care of. A high-availability setup is just one checkmark, and we can scale the required storage as we go.
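In Terraform, that checkmark and the growing storage correspond to two settings. The engine, tier, and names below are assumptions; any supported DBMS works the same way:

resource "google_sql_database_instance" "app_db" {
  name             = "app-db"
  region           = "europe-west1"
  database_version = "POSTGRES_11"  # assumed engine

  settings {
    tier              = "db-custom-2-7680"  # 2 vCPUs, 7.5 GB RAM
    availability_type = "REGIONAL"          # the high-availability checkmark
    disk_autoresize   = true                # storage grows as we go
  }
}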

The resulting architecture is shown in figure 1.

Implementation

All of the above can be configured in the Cloud Console, GCP's web interface. For software developers at heart, however, that is not an option: clicking through a web UI is an error-prone process that is impossible to trace or reproduce. The good thing about GCP is that it provides all features through its API; in fact, the web interface is just built on top of that API. While it is possible to use the API directly, it is not comfortable, and Google supports a wide range of tools to manage your resources within GCP.

The tool of our choice is Terraform. It is cloud-agnostic, which allows us to build multi-cloud setups and reduce vendor lock-in as much as possible. Terraform uses its own declarative configuration language, the HashiCorp Configuration Language (HCL). The workflow is not much different from what you are used to with on-premises tools like Puppet, Chef, Salt, or whatever else you use for configuration management.

We describe the system we designed above in HCL, then let Terraform plan the change and apply it. The configuration is plain text, so it is naturally stored in our version control system, and we can trace every change to the infrastructure. Reproducing the same setup for our test system just requires applying the same configuration to our test project. If we want to, we can set up the exact same system for every employee and thereby easily test new software releases before deploying them to production.
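Parameterizing the project makes this reproducibility concrete. A minimal sketch, with an assumed variable name and region:

# The target project is the only thing that differs between
# production, test, and per-employee setups.
variable "project" {
  description = "GCP project to deploy into"
  type        = string
}

provider "google" {
  project = var.project
  region  = "europe-west1"  # illustrative default region
}

Running terraform plan and terraform apply with a different project value then recreates the identical infrastructure in that project.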

Conclusion

In this article, we outlined the process of moving to the cloud. We showed how to identify the resources and services an application needs. To save on maintenance, we leveraged managed cloud services where possible and beneficial. We did not change a single line of code within the application itself, yet we were able to integrate it into a cloud environment. Thereby we reduced the risk of downtime and can grow with ease as required in the future.

We built a system that is highly dynamic: machines are spawned and wiped out of existence as needed. Still, keeping an eye on what is happening is easy with the integrated logging system, which scales with our machines as required. It can also be integrated with existing on-premises logging or monitoring systems.

If you want to move your workloads to the cloud, just get in touch with us!

