Systems should be fault-tolerant or resilient. Basically we need to design systems so that faults (one component stops working) do not lead to failures (whole system stops working).
Hardware fails all the time, we can prevent this by adding redundancy to the individiaul hw components, (set up disks in a RAID config, add dual power supplies to servers, hot-swappable cpus, etc). With cloud computing, we need to move toward systems that can tolerate the loss of entire machines (cloud platforms are now designed to prioritize flexibility and elasticity over single-machine reliability).
Software faults are also common and usually lie dormant for a long time until they are triggered by an unusual set of circumstances. We should carefully think about assumptions and interactions in the system, test thoroughly, process isolation, allowing processes to crash and restart, monitor applications, etc.
A lot of failures are also caused by human errors (e.g. wrong config). Prevent this by designing systems in a way that minimizes opportunity for error (good UX).
If the system grows in a particular way, what are our options for coping with the growth? How can we add computing resources to handle the additional load?
Current load on a system can be described with a few numbers called load parameters. And these vary by system.
Here’s an example, Twitter, using data from 2012:
SELECT tweets.*, users.* FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows on follows.followee_id = users.id
WHERE follows.follower_id = current_user
Using percentiles (and the median which is p50) is a why of dealing with ouliers. X p95 means 95% of response times are faster than X.
Service Level Objective (SLO) and Service Level Agreement (SLA), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be up if it has a median response time of less than 200ms and a 99th percentile under 1 s. And service may be required to be up at least 99.9% of the time.
Tail latency amplification: if you have your request needs to call multiple backend service, each service increases the probability of your request being slow (it takes just 1 slow request and it can queue up a service).
Machines can scale up or scale our or both. An elastic system is one that adds computing resources when they detect a load increase, whereas other systems are scaled manually by a human. Architectures of systems (and the way they scale) is very system dependent, as well as dependent on the load parameters.
Majority of cost of software development is for maintaining an existing system. To minimize pain during maintenance, here’s three design principles:
Make it easy for operations team to keep the system running smoothly. Operations teams responsibilities: monitor health of system and quickly restore service if it goes in a bad state. Tracking down the cause of problems. Keeping software/platforms up to date. Keep tabs on how different systems affect each other. Anticipate future problems, establish good practices for deploying stuff, etc…
Systems should then make routine tasks easy, allowing the ops teams to focus on high-value activities. Good systems should provide visibility into the runtime behavior of systems. Provide support for automation, avoid dependency on individual machines, provide good docs, etc.
Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.
Sometimes systems suffer from accidental complexity. Which is complexity that’s not inherent to the problem that the software solves, but only arises from the implementation. We can solve accidental complexity by abstraction. A good abstraction can hide a great deal of implementation behind a clean and clear façade. Can also be reused.
Make it easy for engineers to make changes to the system in the future, allowing for unanticipated use cases as requirements change. Agile frameworks alongside its technical tools like TDD and refactoring cna help. But more importatly the easy with which you can modify a system is linked to its simplicity and abstractions.