Cloud-based products like search, maps, and email have grown exponentially in the last 20 years. These services serve millions, even billions, of users, and to handle such vast numbers, they must be highly scalable. This requires a new approach to designing software systems.
We’ve seen highly available, scalable cloud architectures using microservices become one of the leading architecture paradigms. However, navigating security challenges in scalable architectures is critical, as these systems’ complexity introduces bugs, performance issues, maintenance problems, and, most importantly, security risks. Unlike other issues, security can't always be easily measured, which makes it more likely to be overlooked during design and development.
If you're not careful, security ends up bolted on after the service is built, and that can lead to serious problems.
What Is a Highly Scalable Architecture?
Highly scalable architectures exist for a reason. They are built to handle growing numbers of users, data, or transactions smoothly. A key feature of a scalable system is its ability to expand without needing design changes, keeping performance and reliability steady even when the load increases beyond expectations.
A few core components form the fundamental building blocks of any highly scalable architecture:
- Load balancers distribute incoming traffic across multiple servers so that no single point becomes a bottleneck.
- Distributed databases spread data across multiple machines to handle volumes of data that no single node could store or serve.
- Caching systems store frequently accessed data for quick retrieval, which reduces the load on primary data stores and improves performance.
- Message queues facilitate asynchronous communication between components to reduce blocking time and tight coupling between individual components.
Grasping the key concepts and components of scalable architecture is essential for building systems that can handle the growing demands of modern applications. By applying these principles and technologies, developers and architects can design strong, adaptable systems that scale to meet future needs while ensuring performance, reliability, and cost-efficiency.
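To make the caching component concrete, here is a minimal cache-aside sketch in Python. The class and the `slow_db_lookup` helper are illustrative names, not a specific library: the pattern is simply "check the cache, fall back to the primary store on a miss, then populate the cache."

```python
import time

class CacheAside:
    """Minimal cache-aside: serve from the cache when possible, fall back
    to the primary store on a miss, and cache the result with a TTL."""

    def __init__(self, backing_store, ttl_seconds=60):
        self.backing_store = backing_store  # e.g. a database lookup function
        self.ttl = ttl_seconds
        self._cache = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._cache.get(key)
        if entry and entry[1] > time.time():
            return entry[0]  # cache hit: no load on the primary store
        value = self.backing_store(key)  # cache miss: hit the primary store
        self._cache[key] = (value, time.time() + self.ttl)
        return value

# Usage: wrap an expensive lookup and track how often it is actually called.
calls = []
def slow_db_lookup(key):
    calls.append(key)
    return key.upper()

cache = CacheAside(slow_db_lookup, ttl_seconds=60)
cache.get("user:1")  # miss: goes to the store
cache.get("user:1")  # hit: served from the cache
```

Real deployments would use a shared cache such as Redis or Memcached rather than an in-process dictionary, but the access pattern is the same.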
Security Challenges
There's no such thing as a free lunch. Behind all the benefits of modern systems, there often hides an overlooked security risk, waiting to catch even the most experienced software architect off guard.
These systems are usually distributed, unlike the centralized designs where traditional security methods work best. Trying to force internal IT security practices onto external products often leads to frustration. Cloud services evolve much faster than internal IT systems, driven by customer needs. Using checklist-based security in this fast-paced environment is like tracking airline progress by measuring weight—it just doesn’t fit.
Let’s discuss some of the biggest challenges:
Increased Attack Surface
In complex systems with many parts, the risk of security problems being missed is higher. Engineers often focus on developing features and daily tasks, so small changes can lead to big issues that go unnoticed.
More complexity means more entry points: users who log in, some who access features without logging in, administrators, customer support, vendors, and more. Test accounts can get lost among millions of users, becoming easy targets for hackers to quietly exploit.
Authentication and Authorization Complexities
As systems grow, managing identities and permissions becomes more complicated. Complex designs need equally complex permission management. On top of this, creating user identities for internal operations adds more complexity, leading to a system filled with many exceptions.
Over time, permissions are manually assigned to individuals instead of being based on roles or groups. This breaks the principle of least privilege when someone switches roles but keeps their old permissions, which aren’t removed. If that identity is compromised, the potential damage is greater than it should be.
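The role-based alternative described above can be sketched in a few lines. The roles and permission names here are hypothetical, standing in for what would normally live in an IAM service or directory:

```python
# Hypothetical roles and permissions for illustration; in a real system
# these mappings would live in an IAM service or directory.
ROLE_PERMISSIONS = {
    "developer": {"read:logs", "deploy:staging"},
    "sre": {"read:logs", "deploy:staging", "deploy:prod"},
}

user_roles = {"alice": {"developer"}}

def permissions_for(user):
    # Effective permissions are always derived from current roles, so a
    # role switch automatically revokes the old role's extras.
    perms = set()
    for role in user_roles.get(user, set()):
        perms |= ROLE_PERMISSIONS.get(role, set())
    return perms

def can(user, permission):
    return permission in permissions_for(user)

# Alice switches teams: replacing (not appending to) her role set means
# no stale permissions survive the move.
user_roles["alice"] = {"sre"}
```

The key design choice is that nothing is ever granted to an individual directly, so revocation is a side effect of the role change rather than a manual cleanup task someone can forget.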
Visibility and Monitoring Challenges
As systems grow more complex, they become harder to understand. This reduced visibility makes it easier for threats to hide and for vulnerabilities to appear unnoticed. Traditional observability tools may struggle to keep up with this growth. While a simple logging and monitoring system might have worked early on, rapid expansion can cause these tools to fail. If attackers know you're using weak monitoring, they may exploit known vulnerabilities and use DDoS attacks to cover their tracks.
With the massive amount of data being generated, it becomes harder to spot patterns between normal use and malicious activity because of the wide range of behavior from legitimate users.
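One crude but cheap way to surface such patterns is to compare current traffic against a recent baseline. This sketch (illustrative, not a production detector) flags readings that sit several standard deviations from the norm:

```python
import statistics

def is_anomalous(history, current, threshold=3.0):
    # Flag a reading that deviates from the recent baseline by more than
    # `threshold` standard deviations -- a crude but cheap anomaly signal.
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold

# Requests per minute observed over the last eight minutes (sample data).
baseline = [100, 110, 95, 105, 98, 102, 107, 99]
```

A real platform would use far richer signals (per-endpoint rates, geographic spread, error ratios), but even a simple threshold like this catches the blunt spikes that precede many incidents.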
Third-Party Dependencies
Supply Chain Attacks
A typical software stack is composed of thousands of libraries at every level, working together like a well-oiled machine. There is a library for file reading, another for file compression, and so on. An attacker only needs to insert a backdoor into one widely used component, and every piece of software that consumes it is compromised. Worse, the attack is nearly invisible: a typical security review will not reveal the threat.
Some supply chain attacks can be very elaborate and drawn out. The xz-utils backdoor nearly infected the world. Were it not for a lone software engineer at Microsoft, who caught the issue after noticing that SSH login performance had degraded, we would be living every day with this major flaw. The attack was well thought out: a person going by the name of "Jia Tan" gained the trust of the xz-utils community and introduced the backdoor over time without raising any suspicion. As per Ars Technica, the extent of the backdoor was massive:
In a nutshell, it allows someone with the right private key to hijack sshd, the executable file responsible for making SSH connections, and from there to execute malicious commands. The backdoor is implemented through a five-stage loader that uses a series of simple but clever techniques to hide itself. It also provides the means for new payloads to be delivered without major changes being required.
Credits: Thomas Roccia on Mastodon
Vulnerabilities in Open-Source Components
One of the bombshell disclosures of 2014 was the Heartbleed bug. It affected OpenSSL, the foundational library on which pretty much all SSL features are built. To make matters worse, we all use OpenSSL one way or another without realizing it. It wasn't a crafted attack or a backdoor but a weakness in the implementation.
Here’s how it worked: the SSL standard has a heartbeat feature, which lets one computer in an SSL connection send a brief message to check if the other computer is still online and receive a reply. Researchers discovered that by crafting a malicious heartbeat message, they could trick the receiving computer into revealing sensitive information.
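The trick can be shown with a toy model. This is emphatically not OpenSSL's actual code; it just illustrates the core mistake of trusting an attacker-supplied length field:

```python
# Toy model of the Heartbleed flaw -- NOT OpenSSL's actual code. The
# server's memory holds the 4-byte heartbeat payload followed by data
# that should never leave the process.
MEMORY = b"PINGsecret-session-key-and-passwords"

def heartbeat_vulnerable(payload, claimed_len):
    # Bug: the claimed length is trusted, so a short payload with a large
    # claimed_len reads past the payload into adjacent memory.
    return MEMORY[:claimed_len]

def heartbeat_fixed(payload, claimed_len):
    # The fix: discard heartbeats whose claimed length disagrees with
    # the payload actually received.
    if claimed_len > len(payload):
        return None
    return MEMORY[:claimed_len]
```

Sending a 4-byte payload while claiming it is 36 bytes long makes the vulnerable handler echo back 32 bytes of adjacent memory, which in the real attack could contain private keys and passwords.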
The companies affected included Google, Dropbox, Netflix, and Facebook. In an odd twist of fate, companies running older versions were not affected, since the vulnerable heartbeat code had only been introduced in OpenSSL 1.0.1. While it worked out well for them once, that doesn't mean we should normalize running older unpatched systems.
Denial of Service (DoS) Vulnerabilities
This type of attack floods a service with requests to use up all its resources. These attacks are split into two types: L3/L4 and L7. L3/L4 attacks target the TCP/IP layer of the network, where the attacker starts many new TCP connections, hoping to overload the server by keeping those connections open without serving real traffic. Most cloud firewalls are designed to block these kinds of attacks.
L7 attacks focus on the application layer, often through HTTP requests. The problem worsens when certain actions on the service are more resource-intensive than others. If an attacker discovers the most costly operation, they can overload the system with minimal traffic. The best defense against L7 attacks is a Web Application Firewall with rate-limiting rules and protection against known exploits.
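Rate limiting is often implemented with a token bucket. The sketch below is a minimal single-process version (a real WAF tracks buckets per client IP or API key in shared storage): each request spends a token, and tokens refill at a fixed rate up to a burst capacity.

```python
import time

class TokenBucket:
    """Simple token bucket: each request costs one token; tokens refill
    at a fixed rate up to a burst capacity."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # reject -- typically an HTTP 429 in a real API
```

Because refill is continuous, short bursts up to the capacity are tolerated while the sustained rate stays bounded, which fits legitimate user behavior better than a hard per-second cap.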
Best Approaches
Zero Trust Architecture
It's easy to assume a private network is secure and overlook security for internal traffic between services, but this is a mistake. Use a service mesh for extra security and ensure all services use mutual TLS authentication.
All traffic should be encrypted, and services should communicate with strong authentication like OAuth. Permissions should be carefully controlled using an Access Control List (ACL) to manage and enforce authorization.
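As a simplified illustration of authenticated service-to-service calls (far less than full OAuth, but the verification idea is the same), here is an HMAC-signed bearer token. The service names and the hardcoded key are placeholders; a real system would fetch the secret from a vault:

```python
import hashlib
import hmac

# Illustrative shared secret; in production this comes from a secrets
# manager, and you would more likely use OAuth/JWT end to end.
SERVICE_KEY = b"shared-secret-from-a-vault"

def issue_token(service_name):
    sig = hmac.new(SERVICE_KEY, service_name.encode(), hashlib.sha256).hexdigest()
    return f"{service_name}.{sig}"

def verify_token(token):
    name, _, sig = token.rpartition(".")
    expected = hmac.new(SERVICE_KEY, name.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return name if hmac.compare_digest(sig, expected) else None
```

A forged or tampered token fails verification because the signature no longer matches, so a compromised network path alone is not enough to impersonate a service.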
Proper Observability Platform
There are plenty of observability tools like Datadog, Prometheus, and Axiom. It's important to strike the right balance in logging: too little creates blind spots, while too much adds unnecessary noise.
In addition to logging, use both standard and custom metrics from the code. Set up alarms to alert you about issues like DDoS attacks or suspicious activity. These alarms can serve as an early warning system, helping you catch problems before they cause major damage.
Containerization Security
When you start using containers for easier deployment and isolation, there's also the benefit of sandboxing. But it doesn't end there. What if the images contain components with known vulnerabilities or outdated versions? What if the Docker image itself is compromised?
These issues can be addressed, but it requires extra time and effort. Someone needs to monitor or automate the detection of these problems. Thankfully, tools like Clair (open-source) or Anchore (commercial) can scan images for vulnerabilities before deployment. You can also sign images with a private key and check their integrity using tools like the open-source Notary before deployment.
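The integrity half of this can be sketched with a pinned-digest check. Real signing tools like Notary use asymmetric signatures rather than a plain hash allowlist, and the tag and byte strings below are hypothetical, but the core idea (deploy only bytes that match what your pipeline published) is the same:

```python
import hashlib

# Hypothetical digests pinned by your CI pipeline at build time.
TRUSTED_DIGESTS = {
    "myapp:1.4.2": "sha256:" + hashlib.sha256(b"myapp-1.4.2-image-bytes").hexdigest(),
}

def digest_of(image_bytes):
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()

def safe_to_deploy(tag, image_bytes):
    # Deploy only when the bytes you pulled match the pinned digest.
    return TRUSTED_DIGESTS.get(tag) == digest_of(image_bytes)
```

Any tampering with the image between build and deployment changes the digest, so a swapped or modified image is rejected before it ever runs.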
API Security
APIs are usually the primary entry point into a supposedly secure service, and any point of entry is ripe for abuse and attack. Common threats include DDoS, injection attacks, and broken authentication and access controls, to name a few. We can implement a rate-limiting Web Application Firewall rule right on the API itself.
For example, on AWS, an API Gateway can be associated with a WebACL containing WAF rules. It protects not only against DDoS attacks but also against injection attacks, as AWS provides managed rules so that you do not have to write your own.
Data Encryption
It’s not 2005. Everything needs to be encrypted, both in transit and at rest. Encryption has come a long way, and there is no excuse to leave your customers’ data unprotected. Use strong encryption algorithms like AES-256 and apply them to files, databases, and credentials, to name a few.
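For at-rest encryption, authenticated modes like AES-256-GCM are a sensible default. Here is a minimal sketch assuming the third-party `cryptography` package is installed; in production the key would come from a KMS rather than being generated in application code:

```python
import os
# Requires the third-party `cryptography` package (pip install cryptography).
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt(key, plaintext):
    # A GCM nonce must be unique per message under the same key.
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

def decrypt(key, blob):
    nonce, ciphertext = blob[:12], blob[12:]
    # Raises InvalidTag if the ciphertext was tampered with.
    return AESGCM(key).decrypt(nonce, ciphertext, None)

key = AESGCM.generate_key(bit_length=256)  # a 32-byte AES-256 key
```

GCM authenticates as well as encrypts, so tampering with stored ciphertext is detected at decryption time instead of silently producing garbage.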
Use Let’s Encrypt or a similar certificate authority to issue X.509 certificates for all endpoints, including internal ones, and keep the rotation period short. One thing I love about Let’s Encrypt is that it enforces 90-day certificate validity, which encourages best practices in certificate management. They do not provide any exceptions.
Use a key management system like AWS KMS or Azure Key Vault. For extremely critical data like root keys, skip the cloud and keep such keys in a physical Hardware Security Module (HSM). Train your employees about the best practices, so that they know the steps to restore the system from root keys after mitigating the threats.
Conclusion
A software engineer must also think like a security engineer. Central security teams are often stretched thin and reactive, which isn't enough in today's constantly evolving cyber threat landscape. Software engineers regularly improve the product, upgrade components, and make configuration changes; without proper attention, any of those changes can let new security issues slip through.
Looking ahead, the intersection of scalability and security will continue to evolve. Emerging technologies like edge computing, 5G, and advanced AI will introduce new demands and security challenges.
We must keep learning and adapting our security practices, staying on top of new threats and understanding how changes in architecture and infrastructure impact overall security.
By adopting a proactive, holistic approach to security that aligns with scalable architecture, we can create systems that are not only powerful and flexible but also reliable and secure in an ever-changing threat landscape.