Cloud Resilience with Clumio

How Atlassian improved Jira Cloud resilience - AWS re:Invent 2023

Clumio

In this podcast episode, Woon, the co-founder and CTO of Clumio, joins Andrew Jackson from Atlassian to discuss their successful collaboration in the realm of cloud data backup and recovery. Andrew, a Senior Engineer at Atlassian, provides insights into the challenges faced by their team in ensuring data resiliency, especially with the vast amount of data processed daily.

The conversation unfolds with an overview of Atlassian, its mission to unleash team potential, and the array of products facilitating innovation and collaboration. Andrew sheds light on the complexities of managing extensive data, compliance standards, and the architecture powered by various AWS services, with a spotlight on the pivotal role of S3 as the primary data store.

Clumio is redefining data protection for the cloud. Clumio’s massively scalable data protection service helps the world’s leading enterprises automate the protection and recovery of critical data in applications, data lakes, and other data services on AWS. Try Clumio today.


Hello, yes. My name is Woon. I'm the co-founder and CTO of Clumio, and today I have Andrew Jackson from Atlassian with me. Thank you, Woon. Hi, everyone. My name is Andrew. I'm a senior engineer at Atlassian, and I spend my days working with teams to solve problems surrounding data, with my most recent focus being disaster recovery. It's an absolute pleasure to be here today to talk a little bit about our successful collaboration with Clumio and how their game-changing approach to cloud data backup and recovery has been instrumental in tackling some of our biggest data resiliency challenges.

I'm going to quickly run through the agenda for today's session. First, I'll provide some context about Atlassian and what we're all about, along with a high-level walkthrough of our architecture. Then I'll tackle one of the most complex problems in this line of work: data resiliency. I'll cover some of the complexities and challenges associated with that space. Then I'll pass it over to Woon, who will delve deeper into the efforts required to provide a highly performant and highly scalable S3 backup solution. And then we'll come back together to talk a little more about the implementation at Atlassian specifically, as well as the benefits achieved.

So what is Atlassian? At its core, it's a company that thrives on innovation and collaboration, with a mission that is simple yet powerful: to unleash the potential of every team. Atlassian helps facilitate this by providing software that you probably already use in your day-to-day work. Some of you may already use Jira for project management, Confluence for extensive collaboration, maybe Bitbucket for source code management, or even Trello just to organize tasks in a fun and flexible way. Those are just four of the variety of products we provide, but in line with the mission, it's not necessarily about the range of products, it's more about how they're used. Since being founded in 2002, Atlassian has amassed over 260,000 customers worldwide, including some of the biggest names across various industries. What that really means is that whether you're coming from a small startup or a large enterprise, these tools are available to help streamline processes, enhance productivity, and overall foster innovation.

But with 260,000 customers, the amount of data that we have to process daily is vast, and not just vast in terms of volume, but also because of the requirements and compliance standards associated with that data. So I'm going to quickly walk through the high-level architecture to give you some idea of how we operate. At its core, Atlassian is powered by a variety of AWS services, including compute, data, network, and storage-related services. These services effectively underpin Atlassian's own internal platforms and services, which in turn provide capabilities for our products. Those product capabilities include things such as the Jira Issue Service or Confluence Analytics; for product-specific capabilities you've got things like the editor in Confluence, as well as media, identity, and commerce for anyone familiar with those aspects. Typically, an Atlassian product consists of multiple containerized services that have been deployed onto AWS using our in-house provisioning layer, Micros, which effectively orchestrates our AWS deployments.
And these services cover a variety of different features, from request handling, transactional and user-generated content, and authentication management to data lakes, observability, and even analytical services. With all these different services powering the Atlassian ecosystem, we need to have extremely strong foundations, which in this environment means our data stores. For today's purposes I will focus on S3, because S3 is probably the largest data store we've got, due to the vast variety of data it has to receive and process daily.

Now, with such a significant architecture come equally tricky challenges. First off, in an era where data is vital, it becomes an increasingly complex task to handle the amount of data we've got, as more unique ways to view and transform data appear day to day. That means it becomes harder to guarantee our requirements are satisfied. The second aspect is that compliance and scalability are crucial, and it was becoming an arduous task to guarantee those requirements were satisfied as more and more customers onboarded to our platform. These types of challenges are not just operational hurdles; they're actual impediments to the quality experience that people expect from Atlassian.

So we knew there had to be some form of data backup solution that didn't just meet our requirements for minimal downtime and quick recovery, but could tick off requirements like 99.95% uptime, data residency controls, and a one-hour RPO, and those are just the highlights; there are substantially more requirements. We knew there needed to be a solution in this space, we investigated internally, and we were keen to find out what was on the market. Luckily, Clumio reached out to us after seeing our interest in the space, and we collaborated very closely. Through that collaboration we were able to arrive at an optimal solution that enabled us to back up our S3 data at that scale, around 40 petabytes, whilst at the same time ensuring we had safety and accessibility in the event of unforeseen circumstances. So what I'm going to do now is pass it over to Woon, who will delve a little deeper into the efforts required to provide such an optimal solution, one that has effectively exceeded our expectations in this space and turned those challenges into a success story.

Thank you, Andrew. I'm going to start with a high-level overview of the architecture, and then we'll go deeper. First of all, on the left-hand side we have the customer account; in this example, that would be Atlassian's AWS account. The bucket in the middle is the bucket that we're trying to protect. The way you onboard is actually pretty straightforward: you install either a CloudFormation template or a Terraform template that we provide into your environment. Obviously, every environment is a little unique, so we also allow you to customize that CloudFormation template so it fits the needs of your environment. Once that template is installed, it sets up all the assets that are needed for us to carry out the backup.
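To make that concrete, here is a rough boto3 sketch of the kinds of assets Woon goes on to describe: a cross-account IAM role the backup service can assume, an S3 Inventory configuration, and EventBridge notifications on the bucket. This is not Clumio's actual template; all names, account IDs, and policy details are hypothetical.

```python
# Hypothetical onboarding sketch only: names, account IDs, and policies are
# made up for illustration and are not Clumio's actual template.
import json

import boto3

BUCKET = "example-protected-bucket"        # hypothetical bucket to protect
BACKUP_SERVICE_ACCOUNT = "111111111111"    # hypothetical backup-service account

iam = boto3.client("iam")
s3 = boto3.client("s3")

# 1) Cross-account IAM role the backup service assumes to operate on the bucket.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{BACKUP_SERVICE_ACCOUNT}:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "example-external-id"}},
    }],
}
iam.create_role(
    RoleName="example-s3-backup-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# 2) S3 Inventory: a scheduled full listing of every object in the bucket.
s3.put_bucket_inventory_configuration(
    Bucket=BUCKET,
    Id="example-backup-inventory",
    InventoryConfiguration={
        "Id": "example-backup-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "All",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": f"arn:aws:s3:::{BUCKET}-inventory",
                "Format": "Parquet",
            }
        },
    },
)

# 3) EventBridge notifications: the minute-by-minute delta of object changes.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)
```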
For example, an IAM role gets created in your AWS environment, and that is the role we assume to carry out all the operations needed for backups. Along with that, we also install things like S3 Inventory and S3 EventBridge notifications. These are the mechanisms we use to get the list of objects to back up: S3 Inventory gets us the full list of objects, while the S3 events get us the delta, the changes happening in your bucket minute by minute. These are the technologies that allow us to build continuous backups; for example, we actually back up every 15 minutes, and we provide a 15-minute RPO through that EventBridge integration.

Once that is set up, everything on the right-hand side is managed by Clumio. All the processing, all the cataloging, all the data verification happens on the Clumio side, and the entire architecture is pretty much serverless. It scales up and down based on the load you have: if there are a lot more objects to be backed up, we'll employ more Lambda functions, and if there are fewer, we'll employ fewer, but all of that is completely managed by us. On top of that, all this processing and the housing of the data happens in an AWS account that is dedicated to that one customer. In this example, this account over here is dedicated to Atlassian, and all the data processing and housing of the data happens in that account. Think of it as your secure vault account where we house your data.

Moving on: for the entire backup and restore solution, we built innovations in the area of backup and, on the restore side, Instant Access, which we'll talk about tomorrow. For the sake of time, in this presentation we'll focus on some of the innovations in the backup space, in the ingestion layer, that allow us to back up large buckets like the ones Atlassian has.

Let's start with a bucket, again on the right-hand side. Imagine you have a bucket with 30 billion objects, with keys running from prefix A all the way to prefix Z, all sorted in that list. At a high level you might think, how hard could it be? You fire up a bunch of Lambda functions, you start copying objects out, and that's basically your backup. But not so fast. If you have a ton of Lambda functions and you schedule them so they all work on the same prefix, what's going to happen is that you hit the API limit before anything else. You can add more Lambda functions and throw more at it, but if they're all working on the same prefix, essentially all you're going to get is API throttling. We know that S3, behind the scenes, partitions the key space into different partitions, and every partition can handle roughly 5,500 GET requests per second. So it doesn't matter how many Lambda functions you employ; if they all work on the same partition, all you get is API throttling. So one of the things we added is that, using various heuristics, we come up with a partitioning of our own, the Clumio partitions. We look at the key space and determine roughly where the partitions should be within the bucket itself.
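The actual heuristics and scheduler are Clumio's and are not public, so the following is only a minimal Python sketch of the general idea described here and in the next part of the talk: estimate split points from a sorted key listing, give each worker a disjoint slice of the key space, and shift workers away from a partition when S3 starts signaling back pressure (throttling / "SlowDown" responses). All names, numbers, and thresholds are illustrative.

```python
# Illustrative sketch only: not Clumio's implementation.
from collections import defaultdict


def estimate_partitions(sorted_keys, target_partitions):
    """Guess partition boundaries by sampling evenly through a sorted key
    listing (for example, from an S3 Inventory report). Returns the first
    key of each estimated partition."""
    step = max(1, len(sorted_keys) // target_partitions)
    return [sorted_keys[i] for i in range(0, len(sorted_keys), step)][:target_partitions]


def assign_workers(num_partitions, num_workers):
    """Spread workers round-robin across partitions so no single partition
    (and its requests-per-second budget) is hammered by every worker at once."""
    plan = defaultdict(list)
    for worker in range(num_workers):
        plan[worker % num_partitions].append(worker)
    return plan


def rebalance(plan, throttled_partition):
    """On sustained back pressure (e.g. S3 'SlowDown' / HTTP 503 responses)
    from one partition, move a worker to the least-loaded other partition."""
    if len(plan[throttled_partition]) <= 1:
        return  # keep at least one worker making slow progress there
    coolest = min(
        (p for p in plan if p != throttled_partition),
        key=lambda p: len(plan[p]),
    )
    plan[coolest].append(plan[throttled_partition].pop())


# Tiny usage example with made-up keys.
keys = sorted(f"tenant-{i:05d}/attachment-{j}" for i in range(100) for j in range(3))
boundaries = estimate_partitions(keys, target_partitions=4)
plan = assign_workers(num_partitions=len(boundaries), num_workers=8)
rebalance(plan, throttled_partition=0)  # pretend partition 0 started throttling
```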
Because the S3 bucket partitioning is not known to us, it's not exposed externally, we can only employ heuristics to guess where those lines would be. Once we know those partitions, we go ahead and employ the same number of Lambda functions, but schedule them across all the partitions. Effectively, all of these Lambda functions are working on different areas of the key space, and they're not bombarding the same partition, so to speak.

The other thing to note is that, at the end of the day, backup is always secondary. The primary application is the primary owner of that bucket. If the backup operation goes and steals all the API requests per second, all the TPS, then guess who suffers: the primary. That's the Jira ticket that loads slowly, or the attachment that doesn't load in a speedy manner. So we're constantly looking at the API back pressure; we observe it and track it, and we schedule accordingly. If we see a ton of back pressure on partition A, we reschedule the Lambda functions to give partition A a bit of a break and do a little more work on partition B, because there's more capacity there. This process happens continuously and dynamically until the backup is complete. And by the way, all of this is automatic, all of it is dynamic, and it's all managed by Clumio.

So, just to sum up my part: end to end, onboarding is pretty straightforward. It's really a Terraform template; in 15 minutes you're up and running, and all that complexity I talked about, the dynamic scheduling and partitioning and so on, happens behind the scenes and is delivered to you as a service. Andrew, do you want to take over and describe your experience?

Oh, absolutely. The beautiful thing about these kinds of results is that once we had that initial backup done within those 17 days, and for context, we had substantial challenges getting to that point, it was huge, because it meant we could start working towards the standard that our customers expect from us. The very first thing that comes to mind with backup and recovery is RPO. We wanted to make sure we could provide that one-hour RPO when it came to being able to restore data. What was fantastic was that after some testing and rolling out this feature in collaboration with Clumio, we identified through our tests that we were actually meeting a 15-minute RPO, which had practically been unheard of at this level of data. Once that technical feasibility was handled, it just became a question of tackling the business problems, and for the purposes of this talk I'll cover two specific ones, the first being data residency controls. The thing about data residency controls is that you have to make sure you can easily roll out your controls to new environments as you need to. The best part is that we were simply able to work with Clumio and enable those features for the specific regions, which lifted a substantial weight off our shoulders and ticked that box in terms of compliance standards.
The second part, which I'm sure everyone loves to talk about, is cost reduction. For context, Clumio provides an air-gapped solution. For those not familiar with that, it means that if we want to delete our backups, we have to go through a very rigid process to do so; developers can't just knock out their backups. With this in place, it meant we could revisit some of our existing backup solutions, review them in detail, optimize accordingly, remove some of those resources, and fine-tune our approach, which overall resulted in a 70% cost reduction. I can't stress enough how huge that is, given the type of data we're working with, and overall it has made this a major success.

Overall, it's been a ton of fun working with the folks at Atlassian. Today we're in 20 different AWS regions, and we're going to continue to expand to match the regions that Atlassian is using. The level of performance optimization and scalability work that we did with Atlassian was a huge amount of fun. We're at a point where we can back up 30 billion objects in a little over two weeks, 17 days to be exact, while also throttling ourselves, because we don't want to impact the primary applications. We could actually go faster, but we have to throttle, and that's part of the partnership, because we don't want the backup to impact the primary application. It's been a real collaboration with the team, with Andrew, and it's been a pleasure. Absolutely.

That's the end of the talk. We have a demo, so stop by for the demo, or visit us at Clumio.com. Thank you.