Availability and consistency are very important parts of system design, and based on our decision whether the system needs to be highly available or highly consistent, the entire design can be affected. Based on the requirements, we should understand how the system should behave when there is a network failure or node failure. Before diving deep into this, let's first understand the meaning of some terms:
Availability: Availability means a system should be available at all times, even in cases of node failures, network failures, etc. This is very critical for business continuity as a system that goes down from time to time may affect business or other systems as well. Example: We all use social media every day. Social media not only provides us with the latest updates of people around us, but it is also a great source for businesses to sell their products to potential customers via ads. These businesses pay money to social media platforms to run ads and, in return, they get business. Imagine a social media platform fails; the platform won't be able to receive money from these businesses to run their ads. Businesses won't be able to sell their products as the platform is down. And in turn, there is a huge loss of revenue during this time.
Consistency: Consistency means a system should be consistent at all times. Either the system should provide the latest data, or the system should return an error. Example: You might have used internet banking. Imagine you did a transaction of X amount to someone. The money got deducted from your account, but it is not visible in the account of the person you sent it to. There may be several causes to this: a network failure, a server failure, etc. At this point, the system is inconsistent. So consistency plays a vital role here. It would be better if the system had returned an error instead of a successful transaction.
Partition Tolerance: Partition tolerance means whenever there is a partition in the system (server failure, network failure, etc.), it should continue to work and should not crash. It may be working in a degraded mode where some services are working, and some are not working, instead of complete unavailability. Consider the bank example where we are transferring money from Account A to Account B. Due to some reason, there is an issue with the transfer. The issue can be that the transfer microservice is down. So instead of a complete system shutdown, some features should still work, like inquiring about the balance.
Consistency, Availability, and Partition Tolerance (CAP) theorem states that a system can only have two of the three properties defined. A system can be:
Let's see each of these properties in detail.
A system that is consistent and available assumes that there will never be a network partition. This is only possible in single-node systems or a monolithic application. These kinds of systems are unrealistic in the case of distributed architecture. In the real world, network-related issues do happen, and it will make the system prone to complete failures.
As an example, imagine a single-node small e-commerce store:
This defines that the system will always be available even if there is an incident of network partition. However, there is a catch here. The responses we are getting may not be the latest ones. The reason is, there are some node failures due to which the latest state is not updated across all the nodes.
As an example, consider a social media platform like Instagram:
This defines that the system will always be consistent even if there is a network partition. The system will always return the latest data. However, availability may take a hit here.
As an example, consider a banking system where someone is trying to transfer money from account A to account B. This system will be consistent and make sure that if the money is deducted, it should be reflected in account B.
In practice, no system follows 100% AP or 100% CP. The systems in practice always establish a balance between availability and consistency. Though we have a consistent banking system (CP) that blocks transactions during a network partition, it can still allow balance inquiries as they do not modify any data. Similarly, a social media platform (AP) allows users to interact with posts during a network failure, but like counts may be inconsistent. However, they do have consistent operations that these systems may block, like profile updates, authentication, etc., until the partition failure gets resolved.