Why cloud is not the answer – just part of it

Avoiding mistakes in multi-cloud

Patrick McFadin

The flexibility and choice around cloud are compelling. However, this ease of deployment and management has to move over to the data side of applications too. Therefore, distributed data models have to be a part of the initial design. Patrick McFadin explains how developers will be able to avoid issues around scaling data.

Sometimes, it’s hard to know where you want to get to. Irish comedian Dave Allen once said that it’s all too common to ask directions in Ireland and hear, “Well, I wouldn’t start from here if I were you.” For developers, it is exactly the same.

We have so many options open to us now when we build applications. We have so many languages available that can meet different needs. We have cloud – public, private, hybrid, and even multi-cloud. We have lots of options open to us for compute. The problem is that it’s all too easy to start from the wrong place.

What is this wrong place? It’s the data.

Why data is the big problem for developers

Applications create data in response to requests, and they save that data as records. These records have to go somewhere, hence the creation of databases. As our applications have got bigger, so have our databases. As our applications have become more complex, so have the requirements for how and when to store data. Customer demand has grown, so our applications have got faster, and so have our databases.

As our approach to designing applications has moved to microservices, APIs and deployment within containers, our databases have not moved with it. While applications have become decoupled and decentralised, data has remained centralised (or siloed, for lack of a better word). From mainframe to client-server to web applications, traditional database design has been based on the concept of one database element being in charge of all data management.

Application requirements have dictated where all the data sits, and applications then pass that data on to wherever it is stored. This is fine for the initial application, but from a design and infrastructure perspective it is extremely limiting, because it does not take scalability, availability and distribution of data into account.

Modern applications can be spread around; in fact, this is one of the primary ways to deliver the service levels and functionality that business teams want. They can run in private clouds, on public cloud services, in hybrid public and private deployments, or across multiple clouds. With a centralised database, this makes it very hard to scale data without large-scale application rewrites. When one node has to be “in charge”, managing data across multiple locations gets harder once you pass a certain volume of data. When you spread across more than one cloud provider, it becomes more difficult still.

To avoid this problem, think about application design and data management together from the start. Rather than sharding systems or redeveloping your applications yet again, you should consider decentralising and distributing your data in the first place.

How to approach distributed data

Decentralisation involves spreading your data around. However, this is not the only consideration. When you decentralise, you have to work on how the systems will manage the data over time.

In the past, this was typically framed in terms of how you would choose to handle consistency, availability and partition tolerance, or CAP for short.

The CAP theorem describes how distributed computing systems have to choose which qualities to prioritise in order to run effectively. Under this theorem, a distributed data store can only guarantee two of the three properties at once. In practice, when a network partition occurs, it has to choose between staying consistent and staying available. Using CAP as a basis, you can think about what trade-offs you are willing to make for your application. Are you going to concentrate on consistency, or should you prioritise availability?
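The choice CAP forces can be shown with a toy model. This is a minimal sketch, not how any real database is implemented: the `Replica` class, the `make_pair` and `partition` helpers and the `"CP"` / `"AP"` mode labels are all invented for illustration. It models two copies of one value and shows what each priority does to a write once the copies can no longer reach each other.

```python
class Unavailable(Exception):
    """Raised when a replica refuses a request in order to stay consistent."""


class Replica:
    """One of two copies of a single value. mode is 'CP' or 'AP'."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.peer = None          # set by make_pair
        self.partitioned = False  # True when the peer is unreachable

    def write(self, value):
        if not self.partitioned:
            # Healthy network: update both copies together.
            self.value = value
            self.peer.value = value
        elif self.mode == "CP":
            # Consistency first: refuse rather than let the copies diverge.
            raise Unavailable("peer unreachable, write refused")
        else:
            # Availability first: accept locally; the peer is now stale.
            self.value = value


def make_pair(mode):
    """Create two replicas that point at each other."""
    a, b = Replica(mode), Replica(mode)
    a.peer, b.peer = b, a
    return a, b


def partition(a, b):
    """Simulate a network split between the two replicas."""
    a.partitioned = b.partitioned = True
```

In `"CP"` mode a write during the split raises `Unavailable` and both copies keep the old value. In `"AP"` mode the write succeeds, but the two copies disagree until something reconciles them.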

Today, distributed computing approaches have moved on. While CAP is still essential to planning, databases now offer ways to address some of the main concerns. Features like tunable consistency and eventual consistency help developers think through their requirements for application performance and how these affect data management in turn.

For example, you might have thousands or millions of customers all hitting your application at the same time and carrying out transactions. Are you most concerned about the data being available and accurate? Then requiring a quorum of nodes to agree on each read and write would be the key decision. Are you more concerned with serving data as fast as possible? Then reaching a quorum matters less than the speed of serving information back. Much of the decision will be driven by your use case and by how beholden you are to regulation.
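The trade-off between quorum agreement and raw speed can be put into numbers. The following is a minimal sketch, assuming a store that keeps `rf` copies of each row and lets you choose how many replicas must acknowledge a read or a write; the function names are illustrative and not part of any driver API. The rule it encodes is the standard one: a read is guaranteed to overlap the latest acknowledged write when replicas read plus replicas written exceeds the replication factor.

```python
def quorum(rf):
    """Smallest majority of rf replicas."""
    return rf // 2 + 1


def reads_see_latest_write(rf, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_acks + write_acks > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(q)                                  # 2
    print(reads_see_latest_write(rf, q, q))   # True: 2 + 2 > 3
    print(reads_see_latest_write(rf, 1, 1))   # False: 1 + 1 <= 3
```

Reading and writing at quorum costs extra round trips on every request; reading and writing a single replica is faster but may return stale data.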

For customers trying to buy products, accurate stock levels matter when they place an order. However, performance might matter more. Compare this with a banking application: you can’t make a mistake about the balance in a customer’s account. These applications therefore need different levels of data consistency, and instances can be tuned to deliver the most appropriate balance of consistency, availability and partition tolerance.
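One way to picture this tuning is as a per-application policy. The sketch below is illustrative only: the application names, the chosen levels and the replication factor are assumptions, and the level names mirror Cassandra's `ONE`, `QUORUM` and `ALL` without calling any real driver. It shows how many replica failures each choice can absorb before requests start to fail.

```python
REPLICATION_FACTOR = 3  # assumed: three copies of every row

# Hypothetical per-application policy, not taken from the article.
POLICY = {
    "product_catalogue": "ONE",  # speed first; slightly stale stock is tolerable
    "bank_ledger": "QUORUM",     # accuracy first; a majority must answer
}


def replicas_required(level, rf=REPLICATION_FACTOR):
    """Replicas that must respond before a request succeeds."""
    required = {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}
    if level not in required:
        raise ValueError("unknown consistency level: " + level)
    return required[level]


def failures_tolerated(level, rf=REPLICATION_FACTOR):
    """Replicas that can be down while requests still succeed."""
    return rf - replicas_required(level, rf)


if __name__ == "__main__":
    for app, level in POLICY.items():
        print(app, level, failures_tolerated(level))
```

With three replicas, the shop keeps serving with two replicas down, while the bank needs at least two of the three to be reachable.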

To avoid some of the big problems around data, it is therefore worth understanding CAP. However, it is just as important to look at what compromises are available on the data side, and what your priorities really are for each of your applications.

Why cloud is not the answer – just part of it

CAP has had a big role in the development of distributed computing systems and data. Today, the role of cloud has evolved alongside these models, with deployments that combine public, private, hybrid and multi-cloud options. Using the most appropriate mix of cloud services, it’s possible to design and build applications that can span the world and be as close to customers as possible, regardless of where they happen to be.

However, relying on a public cloud provider alone is not enough. While there are options to embed a public cloud element into your own data centre, this effectively ties you to that cloud service and its data models. While it might be possible to get more isolation for applications and compute using containers, this is not possible for data.

So how do you get that same approach in place across multiple clouds and avoid surrendering control? Open source databases have been developed to fill some of the gaps around storing and handling huge volumes of data; each one has its own approach and qualities that can help developers meet their needs. However, to run in a real multi-cloud or hybrid cloud environment, Apache Cassandra™ is currently the only option.

Cassandra runs across multiple locations and cloud services independently of the underlying service, and automatically distributes data across different data centres and geographies. At the same time, it was built to handle the huge volumes of data being created by applications, both in terms of data stored and transaction volumes. For those looking at multi-cloud or hybrid deployments, Cassandra can help solve some of the biggest challenges that exist around distributed data. More importantly, it can achieve this without requiring code rewrites.
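In Cassandra, this placement of data across data centres is declared per keyspace through the `NetworkTopologyStrategy` replication class. The sketch below only builds the CQL statement as a string; the helper name, the keyspace name and the data centre names are hypothetical, and in a real cluster the data centre names must match the ones the cluster itself reports.

```python
from typing import Dict


def create_keyspace_cql(keyspace: str, replicas_per_dc: Dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement that sets replicas per data centre."""
    if not replicas_per_dc:
        raise ValueError("at least one data centre is required")
    # One 'name': count entry per data centre, in the order given.
    options = ", ".join(
        "'{}': {}".format(dc, count) for dc, count in replicas_per_dc.items()
    )
    return (
        "CREATE KEYSPACE IF NOT EXISTS " + keyspace
        + " WITH replication = {'class': 'NetworkTopologyStrategy', "
        + options + "};"
    )


if __name__ == "__main__":
    # Assumed layout: three replicas in each of two clouds.
    print(create_keyspace_cql("shop", {"aws_eu_west": 3, "gcp_us_east": 3}))
```

Adding a third cloud or an on-premises site then becomes a change to the replication map rather than to application code.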

Looking ahead in the clouds

Cloud computing services will continue to be vital for developers. The flexibility and choice around cloud are compelling. However, this ease of deployment and management has to move over to the data side of applications too. Without this same approach in mind, it will be all too easy to commit to a strategy that locks you into a specific cloud service or approach to handling data. This will create technical debt that would then require significant work and redevelopment to fix.

To avoid this problem, distributed data models have to be a part of the initial design. However, this should not exchange one flawed model for another. It’s only by being able to run on any cloud – indeed, every cloud – that developers will be able to avoid issues around scaling data.


Patrick McFadin is Vice President Developer Relations at DataStax.
