How to Properly Leverage Amazon S3 for Your Cloud Backups - Part 1

  • March 12, 2024

In today's digital age, data is the lifeblood of businesses. Whether you're a small startup or a large enterprise, protecting your data is paramount. With the rise of cloud technology, businesses now have more options than ever to securely store their valuable information. Amazon Simple Storage Service (Amazon S3) is one such option that has gained popularity due to its reliability, scalability, and cost-effectiveness. In this post, we'll explore how you can properly leverage Amazon S3 for your cloud backups to ensure the safety and accessibility of your data.

This is just the first part in a series of posts. Cloud backups, and backups in general, are a massive topic that I don't think anyone would want to read about in one shot. So in this post I'll explain what S3 is and how to start planning the cloud deployment of the storage services you'll need based on size, cost, availability, and data retrieval times. In future posts I'll get into some of the other related topics like backup strategies, compliance, and tool selection.

Understanding Amazon S3

Amazon S3 is like a giant digital warehouse offered by Amazon. It lets businesses easily store and access huge amounts of data online, anytime, from anywhere. Think of it as a super secure and reliable storage space where your files are always safe. With a durability rate of 99.999999999%, it's perfect for keeping your important data backed up and secure for the long haul. You might be asking yourself, "what does 'durability' even mean?" That "11 Nines", as we call it, represents the probability of your data being stored properly and staying intact for as long as you need it. Sometimes it's better to think about this from the opposite perspective: 11 9's of durability means there is a 0.000000001% chance that data you store on Amazon S3 will get corrupted or lost over the course of a year. In plain English, that means there is literally about a one in one hundred billion chance that your data won't be there exactly as you left it. To say the least, it's pretty unlikely!

Steps to Properly Leverage Amazon S3 for Cloud Backups

1. THINK! ANALYZE! DO YOUR DISCOVERY! 

Your backups simply will not be of any value if you forget to back things up. There are key things you'll need to know to inform your decisions as you make your cloud backup plan. You'll need to know how many files you have (the total number), the total size of your dataset, and the average size of those files, since that will affect the total number of objects (files) you store. Also, one of the most important things: WHERE are your files? The biggest mistake you can make in your backup strategy is forgetting about certain data and not backing it up at all!
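
If you want a head start on those numbers, here's a minimal sketch in Python (the paths are placeholders I made up; point it at wherever your data actually lives) that tallies file count, total size, and average file size:

    # Rough "discovery" helper: walk the folders you care about and tally
    # how many files you have, how big they are in total, and the average size.
    import os

    def inventory(paths):
        count, total_bytes = 0, 0
        for root_path in paths:
            for dirpath, _dirnames, filenames in os.walk(root_path):
                for name in filenames:
                    try:
                        total_bytes += os.path.getsize(os.path.join(dirpath, name))
                        count += 1
                    except OSError:
                        pass  # skip broken symlinks, permission errors, etc.
        average = total_bytes / count if count else 0
        print(f"Files:        {count:,}")
        print(f"Total size:   {total_bytes / 1024**3:,.1f} GiB")
        print(f"Average size: {average / 1024**2:,.2f} MiB")

    # Example paths only -- swap in your file shares, database dumps, laptops...
    inventory(["/srv/fileshare", "/var/backups", "/home"])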

Bottom line, use your noodle... if every server, datacenter, and old spare hard drive you had lying around all blew up in the same second, what would you need to restore everything? You should ask yourself that question repeatedly.

Then, last, but certainly not least, if you're going to do backups of any kind, an essential part of your backup strategy should be restore testing! More on that one later as well. 


2. Set Up Your Amazon S3 Bucket

Once you know the "what" of what you need to back up, the next step is to decide which Amazon S3 setup is best for storing your backup data. There are a whole lot of "gotchas" in here too (I've put a small setup sketch right after this list):

  • Enable Versioning - Why?

    • To protect yourself against accidental deletion. S3 does WHATEVER you tell it to do. If you accidentally spill coffee on your keyboard and lean on the delete key, it's all gone! Maybe you get hit with malware, or some odd system failure mode corrupts nearly everything while the files still appear to be there, or a ransomware attack encrypts all of your files and the attacker demands Bitcoin before agreeing to give you the unlock key. If you don't know these things have happened, the next time your backup runs, your backup software will detect that the files have changed, upload the same filenames, and overwrite the good copies... because... that's what you told it to do. Think about versioning as the cloud version of tape rotation. If you keep previous versions for 90 days and you find out the ransomware attack happened 4 days ago, fine... go back and restore your data from 5 days ago. There are also a whole bunch of other reasons to have versioning, like audit and compliance, data retention policies, and even ways to leverage it to speed up your development and testing workflows.
  • Implement Encryption - Why?
    • This one is simple: it's because your data belongs to you. You've got a passcode on your phone, so why wouldn't you have something protecting your other data as well? There are really two main ways to do this:
      • Server Side Encryption: This is where the keys to encrypt your data are stored and managed by AWS. This is probably good enough for most non-critical data and it has one massive benefit: YOU do not need to create or store the encryption keys, it's all done for you. Your data is encrypted at rest, and decrypted when a user or service with the proper permissions requests it back.
      • Client Side Encryption: This is where YOU own the keys. You encrypt your data before sending it to Amazon, in such a manner that even AWS can't access any of it. To them it's just blobs of data, which in the AWS S3 world we call "objects". You encrypt it, THEN upload it, so nobody outside of people with access to the key can EVER read it. This is, by far, the most secure option, but it does come with a few cons, one of them pretty major. First of all, your own server has to work much harder to process the encryption... and you may not be doing this on a device with that kind of power. Think of a small business running a local NAS they bought, probably from Amazon, that lacks the CPU power to encrypt things quickly. This could slow down your backups. The far more serious disadvantage is pretty simple: lost encryption keys. If you lose the key, or if you're not sure which one goes with which dataset, it's going to be very, very, very difficult to get that data back, and in all honesty, it might be impossible! It's worth thinking about whether or not you need THAT level of security.
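
Here's the small setup sketch I promised: a boto3 example (the bucket name and region are placeholders, not recommendations) that creates a bucket, blocks public access, turns on versioning, and sets default server-side encryption with S3-managed keys (SSE-S3). If you went the client-side route instead, you'd encrypt the files yourself before they ever reach this bucket.

    # Minimal backup-bucket setup sketch using boto3 (pip install boto3).
    # "example-backup-bucket" and "us-west-2" are placeholders.
    import boto3

    s3 = boto3.client("s3", region_name="us-west-2")
    bucket = "example-backup-bucket"

    # Create the bucket in the chosen region.
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )

    # Backups have no business being public.
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )

    # Versioning: keep prior copies so an overwrite or a ransomware'd upload
    # doesn't destroy the last good version.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Default server-side encryption with S3-managed keys (SSE-S3).
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
        },
    )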

3. Choosing A Storage Tier

There are a lot of them, but I'll highlight some of the most popular and why you might use them (there's also a short code sketch after this list showing how you'd actually pick a tier when you upload):

  • S3 Standard: Default storage class for frequently accessed data with high durability, availability, and low latency. Suitable for a wide range of use cases including dynamic websites.
    • Pros:
      • It's FAST. Nearly limitless. You can upload almost as much as you want, download it at any time, make deletes, updates, and replacements, and you only get charged for the exact amount you use for the exact number of hours you're using it.
    • Cons:
      • It's expensive! I have a project I'm working on right now for which I'll need to store about 20TB of data. That would cost me about $470 monthly, not including the charges per request or the charges to download any of it. With ANY S3 tier, uploading data is always free.
  • S3 Standard-IA: The "IA" here stands for "Infrequent Access", it is optimized for data that is accessed less frequently but requires rapid access when needed.

    • Pros: 
      • Cost. That same 20TB of data I'm playing with would only cost approximately $256 monthly on Standard Infrequent Access. Compared to the $470 of regular Standard, that's almost HALF the price. It's about 46% cheaper!
    • Cons:
      • It's billed for infrequent access, not constant use. The latency itself is about the same as Standard, so if you were trying to serve your website's static content from here, it isn't load time that would bite you, it's the next point.
      • FEES! If I download 1TB of that data at about $0.01 per GB, it ends up hitting my bill at about $10, compared to downloading the same data from S3 Standard at $0.0007 per GB for a total of less than $0.75! So yeah, if you're doing the math in your head already, downloading data from the Infrequent Access tier is well over 10x the price!
  • S3 Glacier Flexible Retrieval: Low-cost storage for long-term archival of data accessed infrequently with retrieval times of minutes to hours. Offers significantly lower storage costs.
    • Pros:
      • Cost, just like before. However, this is where the discounts start to get real! Take that same 20TB heap of "I don't know why I'm even saving this" that I intend to put in the cloud: once we get to Glacier, the cost drops like a rock. Now it only costs me about $74 monthly. Again, comparing that to Standard, Glacier Flexible Retrieval is about 84% cheaper!
    • Cons: (oh boy, there's a few!)
      • It's a LOT slower. In fact, without purchasing provisioned capacity, you would have to wait anywhere from 1 to 6 HOURS to get your data back, and it's not nearly as straightforward. First you have to "initiate a restore", which is how you tell S3 you want that data back, then wait until AWS fishes it out of whatever corner of their datacenters it's hiding in before you can actually download it.
      • Provisioned capacity fees. Let's say you really DO want to download something out of S3 Glacier Flexible Retrieval quickly. You would have to buy a minimum of one provisioned capacity unit, which costs $100 and is only good for one month. It does allow you to download up to 40TB, but you still have to pay the whole $100 even if you just want to download a single file quickly. Without that, you're just going to have to wait.
      • Much higher data retrieval fees either way. Even if you choose to wait a few hours for your data to be accessible again, remember that same 1TB download... $0.75 on Standard, about $10 on Standard Infrequent Access... well, here it would cost you about $75!
  • S3 Glacier Deep Archive: Lowest-cost storage class for long-term archival of data rarely accessed. Provides the lowest storage costs but has longer retrieval times compared to S3 Glacier.
    • Pros:
      • Cheap, cheap, cheap! We've already seen an 84% discount moving from Standard to Glacier Flexible Retrieval, so check this out... how would you like a 95% discount?! That same 20TB of data, stored in the depths of the AWS Cloud basement, would only cost you about $20! Yeah... if you don't need your data for anything OTHER than safe offsite storage, what would have cost you $470 a month on Standard now only costs 20 bucks!
    • Cons:
      • Slower than the passage of time! If you want your data back, not only is it the same multistep process as the Glacier tier described above, but now you have no option to pay to get it back quickly. It could take 12+ hours, and I've personally seen it go a decent bit higher than that... more like 18 to 24 hours.
      • Retrieval fees... again! This time it's not as bad, though. That same 1TB of data you'd like to download would actually only cost you about $23. So while that's well over 30x the price of downloading from Standard, it's only about a third of the cost of Flexible Retrieval. Not the best, but certainly a decent compromise!
  • S3 Intelligent-Tiering: Automatically moves data between frequent access and infrequent access tiers based on access patterns. Offers cost savings for data with unknown or changing access patterns.

    • Pros:
      • A combination of cost effectiveness and speed. I'm not going to fully "math" it out here... I know my brain is on fire writing this, no need to make you, the reader, cry. Long story short: it moves data around for you. You get the benefit of saving money on data you rarely access, but you still get to pull recent data fairly quickly. For most people, this makes a decent amount of sense. Think about it this way: how likely are you to restore a backup from last week vs. one from last year? Right?
        • So... a bit of math EXPLANATION, not the actual math, don't worry! Intelligent Tiering breaks your storage costs down into categories by age:
          • Frequent: The good stuff... stuff you've accessed in the last 30 days
          • Infrequent: Data that hasn't been accessed in the last 30 days
          • Archive: Data that hasn't been accessed in the last 90 days
          • Deep Archive: Data that hasn't been accessed in the last 180 days (Pro tip, THIS is probably where most of your data lives!)
        • So yeah, as you'd expect, data storage gets cheaper as it gets older. If you were to break up your data in a somewhat conservative ratio between frequent, infrequent, archive, and deep archive (for the purposes of this example I chose to do the math WITHOUT enabling the Deep Archive level, so I split it up at 85% Archive and 5% on all the rest), the blended storage cost across all of them only adds up to about $107 per month. That's still a 77% discount! It's not 95% like Glacier Deep Archive, but it's still pretty good!
    • Cons:
      • It's a liiiiiitle slower IF... major caveat... IF you do NOT enable Deep Archive. Across the top three tiers we're talking maybe a few milliseconds at the top and seconds on the lower ones. However, if you do enable the Deep Archive tier, you go back to the 12+ hour waits for anything that has aged into it. (The second sketch below shows how those optional archive tiers actually get switched on.)
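
To make the tier talk a bit more concrete, here's a boto3 sketch (the bucket and key names are made up, and the per-GB prices are approximate us-east-1 list prices that will drift over time) showing how you pick a storage class at upload time, how a Glacier restore gets kicked off, and the rough 20TB-per-month math from above:

    # Storage-class and retrieval sketch; names and prices are illustrative only.
    import boto3

    s3 = boto3.client("s3")
    bucket = "example-backup-bucket"  # placeholder

    # Upload straight into the tier you want -- no need to land in Standard first.
    s3.upload_file(
        "weekly-full.tar.gz", bucket, "backups/2024-03-10/weekly-full.tar.gz",
        ExtraArgs={"StorageClass": "GLACIER"},  # or STANDARD_IA, DEEP_ARCHIVE, INTELLIGENT_TIERING
    )

    # Glacier / Deep Archive objects need a restore request before you can download them.
    s3.restore_object(
        Bucket=bucket,
        Key="backups/2024-03-10/weekly-full.tar.gz",
        RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
    )

    # Back-of-the-envelope monthly storage cost for 20TB (prices are approximate).
    price_per_gb = {
        "STANDARD": 0.023,
        "STANDARD_IA": 0.0125,
        "GLACIER": 0.0036,       # Flexible Retrieval
        "DEEP_ARCHIVE": 0.00099,
    }
    for tier, price in price_per_gb.items():
        print(f"{tier:<13} ~${20 * 1024 * price:,.0f}/month")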
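
And since the Deep Archive opt-in I keep mentioning is a bucket-level setting, here's what enabling the optional Intelligent-Tiering archive tiers could look like (again, the bucket name and configuration ID are placeholders, and this only affects objects you've uploaded with the INTELLIGENT_TIERING storage class):

    # Opt an Intelligent-Tiering bucket into the optional archive tiers.
    # Without this, objects only move between the millisecond-access tiers.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_intelligent_tiering_configuration(
        Bucket="example-backup-bucket",  # placeholder
        Id="archive-old-backups",
        IntelligentTieringConfiguration={
            "Id": "archive-old-backups",
            "Status": "Enabled",
            "Tierings": [
                {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},        # hours to retrieve
                {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},  # 12+ hours to retrieve
            ],
        },
    )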

Next Up - Selecting Backup Software

Yeah... this has gone on long enough! Am I right?! I'll be putting together another post with a good selection of both enterprise and SMB tools. For now, let's just say... there are a LOT to choose from!!

Conclusion

To wrap it all up... I ❤️ Amazon S3! It's flexible, secure, and reliable, and with a little bit of planning and maybe some decent testing, you'd be hard-pressed not to find a solution that works for you! I've been using S3 myself for a long time to store both personal and company data, and the ability to mix and match, move between tiers, automatically replicate data to multiple locations all over the globe, and know that my data is right where I left it has saved me countless times! Just be mindful of what you're storing, what tier you're using, and how sensitive your data is, and there is surely an S3 solution out there that fits.

I'll do my best to follow this up with some additional posts about use cases, tools, personal experiences... the good and the bad... and if you end up deciding S3 is your next move, let me know! I'd love to hear some of your stories!