KeyForge is a key-value store that helps you store and quickly retrieve data, like a super-fast dictionary for your applications. It is designed to work across multiple instances, making it reliable and able to handle lots of traffic without slowing down or losing data.
Where is KeyForge Used?
KeyForge is useful in many areas where applications need to save and access data quickly and reliably, such as:
Caching:
Keeping frequently used data ready for fast access, like storing a copy of your favorite website for quicker loading.
Session Management:
Saving user sessions, like when a website remembers you're logged in, even if you refresh the page.
Configuration Storage:
Storing settings for applications that need to change based on the environment, like a game adapting its controls for different players.
Event Tracking:
Keeping track of events in real-time systems, like logging actions in a multiplayer game.
Metadata Storage:
Saving extra information about files or data, like descriptions of photos in a photo-sharing app.
KeyForge is designed to be fast, reliable, and scalable, meaning it can handle more data and users by adding more instances to the system.
Functional requirements
KeyForge is a key-value store that exposes its operations through APIs. It supports the following:
Performing CRUD operations on key-value pairs:
Create: Add new keys with associated values to the store.
Read: Retrieve the value associated with a given key.
Update: Modify the value of an existing key.
Delete: Remove a key and its value from the store.
Ensuring data is persisted on disk, so the data remains available even if the application restarts.
Providing fast read and write operations, making it ideal for real-time applications.
Returning error messages for invalid operations, such as trying to read a non-existent key or updating a deleted key.
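To make these requirements concrete, here is a minimal Go sketch of the operations the store must expose. The interface and error names (`Store`, `ErrKeyNotFound`) are illustrative placeholders, not the final API, which is specified later.

```go
package keyforge

import "errors"

// ErrKeyNotFound is returned when a read, update, or delete targets a missing key.
var ErrKeyNotFound = errors.New("key not found")

// Store captures the CRUD operations required of KeyForge.
// The names here are illustrative, not the final API.
type Store interface {
	// Set creates a new key or updates an existing one.
	Set(key string, value []byte) error
	// Get returns the value for key, or ErrKeyNotFound.
	Get(key string) ([]byte, error)
	// Delete removes the key and its value, or returns ErrKeyNotFound.
	Delete(key string) error
}
```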
Non-Functional Requirements of KeyForge
KeyForge is designed to be reliable, scalable, and efficient. Below are its non-functional requirements, tailored for an achievable initial implementation:
1. Performance
Read operations should be completed in under 10ms for local reads and under 20ms for distributed reads.
Write operations should be completed in under 15ms for local writes and under 20ms for distributed writes with replication.
2. Scalability
The system must scale horizontally, allowing the addition of more servers to handle increased data and traffic.
Data must be evenly distributed across nodes using consistent hashing to avoid hotspots.
3. Fault Tolerance
If a node fails, the system must continue operating by relying on replicated data stored on other nodes.
Replicas must be kept consistent with primary data to minimize potential data loss.
4. Availability
KeyForge must provide high availability, ensuring minimal disruption during node failures or maintenance.
The system should aim for an uptime of at least 99.9%, which corresponds to roughly 8.8 hours of downtime per year.
5. Data Consistency
The system should ensure strong consistency for critical operations:
Once a write operation is acknowledged, subsequent reads must reflect the updated value.
Non-critical operations can use eventual consistency to optimize performance.
6. Reliability
The system must reliably recover from failures without manual intervention.
Node failures should trigger automated recovery mechanisms, including:
Promoting replicas to primary nodes.
Recalculating shard ownership.
7. Maintainability
The system must be easy to deploy, configure, and monitor.
Provide clear logging and telemetry to debug issues, monitor performance, and trace requests.
8. Security
Communication between nodes must be secure, using encryption protocols such as TLS.
Only authorized nodes and clients should be able to interact with the system, enforced through authentication mechanisms.
9. Extensibility
The system should be designed to allow future enhancements, such as:
Adding new replication strategies.
Supporting advanced caching mechanisms.
Integrating with different storage backends.
10. Resource Efficiency
KeyForge should optimize CPU, memory, and disk usage to run efficiently on standard hardware without requiring high-end infrastructure.
Single-Node Architecture
At its core, KeyForge operates as a key-value store capable of handling CRUD operations on a single node. This simple architecture forms the foundation for building more complex features like scaling and fault tolerance in a multi-node environment.
Key Components
Client Interface:
The entry point for external users or applications.
Exposes APIs (e.g., REST or gRPC) to perform operations like adding, retrieving, updating, or deleting key-value pairs.
Node Manager:
Responsible for managing key-value operations within the node.
In a single-node environment, its primary role is to interact with the storage layer and manage data locally.
Handles (discussed in more detail later):
Request validation and routing.
Shard ownership (in future multi-node setups).
Replication and failure detection (in more advanced setups).
Storage Layer:
Uses an embedded database like Pebble to persist key-value data.
Provides durability by ensuring that data is written to disk and is recoverable after restarts.
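As a rough illustration of the Storage Layer, the sketch below wraps Pebble in a small Set/Get API. It assumes the `github.com/cockroachdb/pebble` module; writes use `pebble.Sync` so data reaches disk before the call returns, and the `PebbleStore` name is just a placeholder.

```go
package storage

import (
	"errors"

	"github.com/cockroachdb/pebble"
)

// PebbleStore is a minimal storage layer backed by an embedded Pebble database.
type PebbleStore struct {
	db *pebble.DB
}

// Open creates or reopens the database directory.
func Open(dir string) (*PebbleStore, error) {
	db, err := pebble.Open(dir, &pebble.Options{})
	if err != nil {
		return nil, err
	}
	return &PebbleStore{db: db}, nil
}

// Set persists the key-value pair; pebble.Sync flushes the write to disk
// before returning, which gives durability across restarts.
func (s *PebbleStore) Set(key string, value []byte) error {
	return s.db.Set([]byte(key), value, pebble.Sync)
}

// Get returns a copy of the stored value, or an error if the key is missing.
func (s *PebbleStore) Get(key string) ([]byte, error) {
	val, closer, err := s.db.Get([]byte(key))
	if errors.Is(err, pebble.ErrNotFound) {
		return nil, errors.New("key not found")
	}
	if err != nil {
		return nil, err
	}
	defer closer.Close()
	out := make([]byte, len(val)) // copy before the closer invalidates the buffer
	copy(out, val)
	return out, nil
}

// Close releases the underlying database.
func (s *PebbleStore) Close() error { return s.db.Close() }
```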
Basic Workflow
Write Operation (Set)
The client sends a Set request to the API with the key and value to be stored.
The Node Manager:
Validates the request.
Passes the key-value pair to the Storage Layer, which persists it to disk.
Once the data is successfully written, the system responds to the client with a success message.
Request Lifecycle in a Single-Node KeyForge Cluster
In a single-node KeyForge cluster, the request lifecycle describes how client requests (CRUD operations) are handled from the moment they are received until a response is sent back. Below is a step-by-step explanation of the request lifecycle for write and read operations, followed by a short code sketch of both handlers.
1. Write Operation (Set)
Lifecycle Steps
Request Arrival:
The client sends a Set request with the key and value to the API endpoint (e.g., via REST or gRPC).
API Layer:
The Client API validates the request:
Ensures the key and value are well-formed.
Checks for basic errors (e.g., invalid payload).
Once validated, the request is forwarded to the Node Manager.
Node Manager:
The Node Manager processes the request:
It determines the appropriate action (e.g., storing the key-value pair).
The key-value pair is passed to the Storage Layer for persistence.
Storage Layer:
The Pebble storage engine stores the key-value pair:
The data is written to disk to ensure durability.
An acknowledgment is returned to the Node Manager.
Response to Client:
The Node Manager sends a success response back to the API Layer.
The API Layer formats the response (e.g., HTTP status code 200 OK) and sends it back to the client.
2. Read Operation (Get)
Lifecycle Steps
Request Arrival:
The client sends a Get request with the key to the API endpoint.
API Layer:
The Client API validates the request:
Ensures the key is well-formed.
Checks for basic errors (e.g., missing key in the request).
Once validated, the request is forwarded to the Node Manager.
Node Manager:
The Node Manager checks if the requested key exists in the Storage Layer.
If the key is found, it retrieves the associated value.
Storage Layer:
The Pebble storage engine fetches the value associated with the key:
If the key exists, the value is returned.
If the key does not exist, an error (e.g., Key Not Found) is returned to the Node Manager.
Response to Client:
The Node Manager sends the retrieved value (or error) back to the API Layer.
The API Layer formats the response (e.g., HTTP status code 200 OK with the value or 404 Not Found for a missing key) and sends it back to the client.
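Here is a minimal sketch of how both lifecycles might look as HTTP handlers. The routes, payload shape, `Store` interface, and `ErrKeyNotFound` sentinel are assumptions for illustration, not the actual KeyForge API.

```go
package api

import (
	"encoding/json"
	"errors"
	"net/http"
)

// ErrKeyNotFound is the sentinel the storage layer is assumed to return for missing keys.
var ErrKeyNotFound = errors.New("key not found")

// Store is the node-local abstraction the handlers delegate to.
type Store interface {
	Set(key string, value []byte) error
	Get(key string) ([]byte, error)
}

type setRequest struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// handleSet implements the write lifecycle: validate, store, acknowledge.
func handleSet(store Store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req setRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Key == "" {
			http.Error(w, "invalid payload", http.StatusBadRequest) // API-layer validation
			return
		}
		if err := store.Set(req.Key, []byte(req.Value)); err != nil {
			http.Error(w, "write failed", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK) // acknowledged only after the storage layer persists it
	}
}

// handleGet implements the read lifecycle: validate, fetch, 404 on missing keys.
func handleGet(store Store) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		key := r.URL.Query().Get("key")
		if key == "" {
			http.Error(w, "missing key", http.StatusBadRequest)
			return
		}
		value, err := store.Get(key)
		if errors.Is(err, ErrKeyNotFound) {
			http.Error(w, "key not found", http.StatusNotFound)
			return
		}
		if err != nil {
			http.Error(w, "read failed", http.StatusInternalServerError)
			return
		}
		json.NewEncoder(w).Encode(map[string]string{"key": key, "value": string(value)})
	}
}
```

In this sketch the handlers talk to storage directly through the `Store` interface; in the real system the API Layer would hand requests to the Node Manager, which then calls the Storage Layer.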
Summary of Lifecycle
Request Arrival: Client sends a request to the API.
API Layer: Validates the request and forwards it to the Node Manager.
Node Manager: Processes the request, determines the action, and interacts with the Storage Layer.
Storage Layer: Handles the actual storage or retrieval of key-value pairs using Pebble.
Response: Results are sent back through the API to the client.
Adding Scalability, Fault Tolerance, and High Availability
So far, we have covered the functionality of a single node, which satisfies most of the functional requirements. However, this is not enough. When designing scalable systems, it is crucial to make the system highly available, scalable, and fault-tolerant. Durability is another key aspect: if the system is not durable, it is prone to data loss in case of failures.
Before diving deep into the non-functional requirements, let's first understand what these terms mean:
1. High Availability:
Definition: High availability means the system is always accessible and operational, even during failures or maintenance.
Importance:
Ensures minimal downtime for users.
Critical for systems requiring 24/7 uptime, such as banking or e-commerce platforms.
How to Achieve:
Implementing replication so data is available on multiple nodes.
Using load balancers to distribute traffic evenly across nodes.
2. Scalability:
Definition: Scalability refers to the system's ability to handle an increase in workload by adding resources.
Importance:
Ensures the system can support more users, data, or requests as demand grows.
Types:
Vertical Scalability: Adding more resources (e.g., CPU, RAM) to existing nodes.
Horizontal Scalability: Adding more nodes to the cluster.
How to Achieve:
Distributing data and requests across multiple nodes using consistent hashing.
3. Fault Tolerance:
Definition: Fault tolerance ensures the system can continue functioning even when one or more components fail.
Importance:
Reduces the risk of complete system outages.
Improves user trust and reliability.
How to Achieve:
Replicating data across multiple nodes to avoid single points of failure.
Detecting and recovering from failures automatically using techniques like leader election and heartbeats.
4. Durability:
Definition: Durability ensures that once a write operation is acknowledged, the data remains intact, even in the event of a failure.
Importance:
Prevents data loss during crashes or hardware failures.
How to Achieve:
Writing data to persistent storage (e.g., Pebble) immediately.
Using Write-Ahead Logs (WAL) to recover data after a failure.
5. CAP Theorem:
The CAP theorem (Consistency, Availability, Partition Tolerance) states that a distributed system can guarantee at most two of the following three properties at the same time:
Consistency: Every read returns the most recent write or an error.
Availability: Every request receives a non-error response, even if some nodes are down, though the response may not reflect the most recent write.
Partition Tolerance: The cluster continues to function even when communication between nodes breaks down.
In practice, network partitions cannot be ruled out, so the real trade-off is between consistency and availability while a partition is in progress.
Things to Understand Before Designing Scalability
Before we design scalability for KeyForge, it's essential to understand the foundational concepts and mechanisms that enable distributed systems to scale effectively. These concepts will guide how data is distributed, requests are routed, and failures are handled in a multi-node environment.
1. Hashing Algorithms
Why It's Important:
Hashing algorithms are used to distribute data across nodes in a consistent and balanced way.
A good hashing mechanism ensures minimal disruption when nodes are added or removed.
Key Concepts:
Consistent Hashing:
Maps keys to a circular hash space.
Ensures that only a small portion of keys need to be redistributed when nodes are added or removed.
Uniform Distribution:
The hash algorithm should evenly distribute keys across all nodes to avoid hotspots.
Examples:
MD5, SHA-256, or custom hash functions optimized for speed and uniformity.
2. Partitioning (Sharding)
Why It's Important:
Partitioning divides the dataset into smaller chunks (shards), each managed by a specific node.
Helps in distributing the storage and computational load.
Key Concepts:
Shard Key:
The key used to determine which shard a piece of data belongs to.
Static vs. Dynamic Sharding:
Static: Fixed number of shards, requiring a predefined node count.
Dynamic: Adjusts the number of shards as nodes are added or removed.
Challenges:
Choosing a shard key that prevents uneven data distribution.
3. Data Replication
Why It's Important:
Ensures fault tolerance by creating copies of data on multiple nodes.
Key Concepts:
Replication Factor:
The number of nodes that store copies of the same data.
Synchronous vs. Asynchronous Replication:
Synchronous: Writes are acknowledged only after all replicas are updated.
Asynchronous: Writes are acknowledged immediately, and replicas are updated in the background.
Trade-offs:
Synchronous replication improves consistency but increases latency.
Asynchronous replication improves performance but may risk stale reads.
4. Node Discovery and Membership
Why It's Important:
Nodes in a cluster need to discover and communicate with each other.
Key Concepts:
Gossip Protocol:
A decentralized way for nodes to share state information.
Membership Management:
Ensures that all nodes in the cluster know which nodes are active or failed.
Challenges:
Handling node churn (frequent additions and removals).
5. Failure Detection
Why It's Important:
Quickly identifying failed nodes prevents the system from routing requests to unresponsive nodes.
Key Concepts:
Heartbeat Mechanism:
Periodic checks to ensure nodes are alive.
Timeouts and Retries:
Defines when a node is considered failed and how requests are retried.
Tools:
Distributed systems often use gossip protocols or centralized health checks for failure detection.
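As a sketch of the heartbeat mechanism, the loop below probes a health endpoint on a few random peers at a fixed interval and flags a peer as suspect after several consecutive misses. The `/health` path, timeout, and threshold are assumptions, not part of the KeyForge specification.

```go
package cluster

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// heartbeat probes up to `fanout` random peers every `interval` and reports
// peers that miss `threshold` consecutive checks on the suspect channel.
func heartbeat(peers []string, fanout, threshold int, interval time.Duration, suspect chan<- string) {
	client := &http.Client{Timeout: 2 * time.Second}
	misses := make(map[string]int)

	for range time.Tick(interval) {
		// Probe a random subset of peers this round.
		order := rand.Perm(len(peers))
		if fanout > len(order) {
			fanout = len(order)
		}
		for _, i := range order[:fanout] {
			peer := peers[i]
			resp, err := client.Get(fmt.Sprintf("http://%s/health", peer)) // assumed health endpoint
			healthy := err == nil && resp.StatusCode == http.StatusOK
			if err == nil {
				resp.Body.Close()
			}
			if healthy {
				misses[peer] = 0 // node responded; reset its miss counter
				continue
			}
			misses[peer]++
			if misses[peer] >= threshold {
				suspect <- peer // suspected failure; gossip would propagate this
			}
		}
	}
}
```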
6. Load Balancing
Why It's Important:
Ensures that traffic is evenly distributed across nodes to prevent overloading any single node.
Key Concepts:
Client-Side Load Balancing:
Clients are aware of the cluster topology and send requests directly to the appropriate node.
Server-Side Load Balancing:
A central load balancer routes traffic to the appropriate node.
Challenges:
Maintaining consistent routing while scaling.
7. Data Durability and Recovery
Why It's Important:
Ensures data remains intact during crashes or unexpected shutdowns.
Key Concepts:
Write-Ahead Logs (WAL):
Logs write operations before applying them to the main database for recovery purposes.
Snapshotting:
Periodic backups of the database state to speed up recovery.
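Pebble maintains its own write-ahead log when writes are performed with `pebble.Sync`, so WAL-based recovery comes largely for free. For snapshotting, one possible approach (a sketch, assuming the `github.com/cockroachdb/pebble` module) is to periodically call `Checkpoint`, which writes a consistent copy of the database to a fresh directory:

```go
package storage

import (
	"fmt"
	"time"

	"github.com/cockroachdb/pebble"
)

// snapshotLoop writes a consistent on-disk copy of the database every
// `every` interval. Each destination directory must not already exist.
func snapshotLoop(db *pebble.DB, baseDir string, every time.Duration) {
	for t := range time.Tick(every) {
		dest := fmt.Sprintf("%s/snapshot-%d", baseDir, t.Unix())
		if err := db.Checkpoint(dest); err != nil {
			fmt.Println("snapshot failed:", err) // in practice, log and alert
		}
	}
}
```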
By understanding these concepts, we can build a scalable, fault-tolerant, and highly available KeyForge system. These foundational principles will guide the design of a multi-node architecture that distributes data effectively, ensures durability, and handles failures gracefully.
System Components
Node
A node is an individual instance of KeyForge. It can be a service running on a specific port on a virtual machine, a service running inside a Docker container, or a pod in a Kubernetes cluster. A node includes all the components of KeyForge, including its database.
Cluster
A cluster is a collection of multiple nodes within the KeyForge system.
Node Manager
The Node Manager is a component within each node. It manages various operations, such as health checks, handling key-value operations, and managing replications. It is also responsible for calculating the node's position on the hash ring and redirecting requests to the correct node within the cluster.
Bootstrap Node
The Bootstrap Node is the first node in the cluster. Once the cluster is initialized with two or more nodes, any node can act as a bootstrap node. The primary responsibility of the bootstrap node is to serve as the initial point of contact whenever a new node is added to the cluster.
Deep Dive
Up to this point, we have discussed the design at a high level, and we know the system will rely on a hash ring, a gossip protocol, replication, and fault tolerance. But how will we implement these? This section covers that in detail. Feel free to skip it for now and come back when we actually implement it, but it is important to understand eventually.
In KeyForge, we use consistent hashing to distribute keys evenly across the nodes in the cluster. This ensures that the addition or removal of nodes impacts only a small portion of the keys, making the system scalable and resilient.
How Hashes are Calculated
Unique Identifiers for Nodes:
Each node in the cluster has a unique identifier, such as a combination of its Node ID and IP:Port.
Example: Node ID = bc12450a-20c1-4e7e-a14a-5ec3bde78fc5.
Generating the Hash:
To calculate the node's position on the hash ring, we generate a hash of the unique identifier using the crc32.ChecksumIEEE function.
The resulting CRC32 checksum is an unsigned 32-bit integer, which is used directly as the node's position on the ring.
Resulting Position on the Hash Ring:
The resulting integer from the CRC32 hash determines the node's position.
Example:
Node ID: bc12450a-20c1-4e7e-a14a-5ec3bde78fc5
CRC32 Hash (Hex): 5b8d9c8f
Position (Decimal): 1536007311 (the decimal value of 0x5b8d9c8f).
Key Placement
When a key-value pair is stored in the cluster:
The key (e.g., "foo") is hashed using the same CRC32 algorithm.
The resulting hash determines the position of the key on the hash ring.
The key is assigned to the first node clockwise from its position.
Example:
Key = foo
The key is hashed with the same CRC32 function; for simplicity, assume its position on the ring is 20.
Assigned Node: Assume nodes have positions 10, 40, and 70.
Position 20 is greater than 10 but less than 40, so the key is assigned to the node at position 40.
If the key's position exceeds the highest node position (e.g., key position is 89), it wraps around the ring and is assigned to the node at the lowest position (10).
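To make the placement logic concrete, here is a minimal Go sketch that positions nodes with `crc32.ChecksumIEEE` and assigns each key to the first node clockwise from its hash. The `Ring` type and method names are illustrative.

```go
package ring

import (
	"hash/crc32"
	"sort"
)

// Ring maps node identifiers to positions on a 32-bit hash ring.
type Ring struct {
	positions []uint32          // sorted node positions
	nodes     map[uint32]string // position -> node identifier
}

func New(nodeIDs []string) *Ring {
	r := &Ring{nodes: make(map[uint32]string)}
	for _, id := range nodeIDs {
		pos := crc32.ChecksumIEEE([]byte(id)) // node position = CRC32 of its identifier
		r.positions = append(r.positions, pos)
		r.nodes[pos] = id
	}
	sort.Slice(r.positions, func(i, j int) bool { return r.positions[i] < r.positions[j] })
	return r
}

// Owner returns the node responsible for key: the first node clockwise
// from the key's position, wrapping to the lowest position if needed.
func (r *Ring) Owner(key string) string {
	if len(r.positions) == 0 {
		return ""
	}
	keyPos := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.positions), func(i int) bool { return r.positions[i] >= keyPos })
	if i == len(r.positions) {
		i = 0 // wrap around the ring
	}
	return r.nodes[r.positions[i]]
}
```

With nodes at positions 10, 40, and 70 as in the example above, a key hashing to 20 lands on the node at 40, and a key hashing to 89 wraps around to the node at 10.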
Handling Collisions
While hash collisions are rare with CRC32, there is still a theoretical chance they could occur. In such cases:
Collisions are mitigated by ensuring that each node's identifier is unique (e.g., using Node ID + IP:Port).
KeyForge ensures that any conflicting positions on the hash ring are still resolved deterministically, avoiding any ambiguity.
Gossip Protocol
If you are new to the gossip protocol, please check out this article for a brief understanding of how the gossip protocol works: Gossip Protocol Explained.
A great way to visualize how messages spread to all the nodes in a distributed system: Gossip Simulator.
In our implementation, each node selects n random peers, where n is configurable. A message, for example the latest cluster state, is sent to these n peers. Once they receive the message, each of them forwards it to another n randomly selected peers. This process continues until all nodes converge on the same version of the cluster state.
In the KeyForge cluster, a gossip will be triggered in the following cases:
Each node periodically gossips its latest version of the cluster state, containing node details, their health status, and so on. This keeps cluster information synchronized across all nodes.
Inside each node, a separate thread runs health checks on n random nodes every t interval. If any node reports a failed status, a gossip is initiated immediately to propagate the failure.
Whenever a new node is added to the cluster, a gossip is initiated to inform every other node about the availability of this new node.
In case of a failed node recovery, a gossip is initiated by the recovering node to inform every other node that it has successfully recovered.
In the above diagram, the following things happen:
Node 10 initiates a gossip and sends a message to Node 90 and Node 40. These nodes are randomly selected.
Node 90 selects Node 40 and Node 60 randomly and propagates the message.
Node 40 selects Node 10 and Node 60 randomly and propagates the message.
Node 60 selects Node 90 and Node 10 randomly and propagates the message.
Once all the nodes have the same version, they stop propagating, and the entire cluster gets in sync.
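The sketch below captures one gossip round under the rules above: a node adopts an incoming cluster state if its version is newer than the local one and then forwards it to n random peers. The `ClusterState` fields, the `send` transport, and the fanout handling are simplified assumptions.

```go
package cluster

import (
	"math/rand"
	"sync"
)

// ClusterState is the payload exchanged by gossip; the fields are illustrative.
type ClusterState struct {
	Version int
	Nodes   map[string]string // nodeID -> health status ("alive", "suspect", "failed")
}

type Gossiper struct {
	mu     sync.Mutex
	state  ClusterState
	peers  []string
	fanout int                                // the configurable n
	send   func(peer string, s ClusterState) // transport is abstracted away here
}

// OnGossip handles an incoming state: adopt it if newer, then keep spreading it.
func (g *Gossiper) OnGossip(incoming ClusterState) {
	g.mu.Lock()
	if incoming.Version <= g.state.Version {
		g.mu.Unlock()
		return // already have this version (or newer); stop propagating
	}
	g.state = incoming
	g.mu.Unlock()
	g.spread(incoming)
}

// spread forwards the state to up to fanout randomly chosen peers.
func (g *Gossiper) spread(s ClusterState) {
	order := rand.Perm(len(g.peers))
	n := g.fanout
	if n > len(order) {
		n = len(order)
	}
	for _, i := range order[:n] {
		go g.send(g.peers[i], s) // fire-and-forget; retries omitted for brevity
	}
}
```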
Data Replication
A cluster-wide configuration file will define the replication factor. For example, if the replication factor is set to 2, the data from any node n will be replicated to the next two nodes (n+1 and n+2) in the hash ring. This replication strategy ensures high availability by providing redundancy. In case the primary node fails, reads and writes can still be served by the replicas, depending on the consistency mode (AP or CP) chosen for the cluster.
Replication Factor Configuration: The replication factor is defined at the cluster level and stored in a configuration file shared across all nodes. This ensures uniform replication behavior throughout the cluster.
Data Placement:
When a key-value pair is stored on the primary node (determined by consistent hashing), the primary node replicates the data to its n+1 and n+2 replicas in the hash ring.
The replicas are selected based on their positions on the hash ring, ensuring even distribution and fault tolerance.
Replication Workflow:
Upon a SetKey operation:
The primary node writes the key-value pair to its local database.
The primary node sends replication requests to the replica nodes (n+1 and n+2).
Each replica node writes the data to its local database and acknowledges the write back to the primary node.
If any replica node is temporarily unavailable, the primary node retries the replication periodically until it succeeds (a code sketch of this fan-out appears at the end of this section).
In the above diagram, n1 replicates its data to n2 and n3. Similarly, n2 replicates its data to n3 and n4, and so on.
Failure Handling:
Temporary Node Failure:
If a replica node is temporarily unavailable, the primary node keeps retrying until the node recovers.
Once the node recovers, it initiates a sync operation to ensure it has all missing data.
Permanent Node Failure:
In the event of a permanent node failure, the hash ring is recalculated, and replica assignments are adjusted so that other nodes take over as replicas.
Keys previously replicated to the failed node are redistributed to new replicas.
Read and Write Consistency:
AP Mode:
Reads can be served by any available replica, even if the primary node is unavailable.
Writes are acknowledged after being written to the primary node, with replication happening asynchronously.
CP Mode:
Reads require a majority of replicas to agree on the value, ensuring strong consistency.
Writes are acknowledged only after being successfully replicated to a quorum of replicas.
Periodic Data Sync:
Each node periodically syncs its data with its replicas to ensure eventual consistency, especially after temporary failures or network partitions.
This replication mechanism ensures that KeyForge provides high availability and fault tolerance while balancing performance and consistency based on the configured mode (AP or CP).
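Below is a rough sketch of the primary-side fan-out for a SetKey under these rules: the primary writes locally, replicates to its n+1 and n+2 nodes, and acknowledges the client either immediately (AP mode) or once a quorum of copies have the write (CP mode). The function names, `replicate` transport, and retry handling are simplified assumptions.

```go
package cluster

import "fmt"

// Store is the local storage abstraction used by the primary.
type Store interface {
	Set(key string, value []byte) error
}

type Node struct {
	store     Store                                       // local storage layer
	replicas  []string                                    // addresses of the n+1 and n+2 nodes on the ring
	cpMode    bool                                        // true = CP (wait for quorum), false = AP (async)
	replicate func(addr, key string, value []byte) error  // transport to a replica, assumed
}

// SetKey writes locally, then fans the write out to the replicas.
func (n *Node) SetKey(key string, value []byte) error {
	// 1. The primary persists the key-value pair in its own database.
	if err := n.store.Set(key, value); err != nil {
		return err
	}

	// 2. Replicate to the next two nodes on the hash ring.
	acks := make(chan error, len(n.replicas))
	for _, addr := range n.replicas {
		go func(addr string) { acks <- n.replicate(addr, key, value) }(addr)
	}

	if !n.cpMode {
		return nil // AP mode: acknowledge now; replication completes in the background
	}

	// CP mode: wait until a quorum of copies (primary + replicas) have the write.
	// With 1 primary + 2 replicas, quorum is 2, so one replica ack is enough.
	quorum := (len(n.replicas)+1)/2 + 1 // majority of all copies
	acked := 1                          // the primary's own write counts
	for range n.replicas {
		if err := <-acks; err == nil {
			acked++
			if acked >= quorum {
				return nil
			}
		}
	}
	return fmt.Errorf("replication quorum not reached (%d/%d)", acked, quorum)
}
```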
Lifecycle of a Node
Understanding the lifecycle of a node is crucial to comprehend how the system operates from start to finish. Below is a detailed breakdown of each stage in the lifecycle of a KeyForge node. Feel free to skip this part for now and come back later when we start implementing.
1. Initialization
When a node starts, it goes through the following steps to prepare itself for operation:
Generate or Retrieve Node ID:
If the node is starting for the first time, it generates a unique Node ID (e.g., a UUID) and persists it in its local database.
If the node has previously run, it retrieves its Node ID from the database to ensure continuity.
Load Configuration:
Reads its configuration file, which contains:
Address of the bootstrap node (if any).
Cluster-wide settings like replication factor and consistency mode (AP/CP).
Initialize Local Components:
Sets up the local database to store key-value pairs.
Starts internal services such as the Node Manager, health check modules, and gossip protocol handlers.
Determine Role:
If the node is specified as the bootstrap node, it initializes the cluster and prepares to serve as the primary point of contact for new nodes.
If it is a non-bootstrap node, it prepares to join an existing cluster.
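A minimal sketch of the initialization step, assuming a JSON configuration file and the `github.com/google/uuid` package for generating the Node ID; the field names and the `node_id` storage key are illustrative.

```go
package node

import (
	"encoding/json"
	"os"

	"github.com/google/uuid"
)

// Config mirrors the cluster-wide settings loaded at startup (illustrative fields).
type Config struct {
	BootstrapAddr     string `json:"bootstrap_addr"` // empty for the bootstrap node itself
	ReplicationFactor int    `json:"replication_factor"`
	ConsistencyMode   string `json:"consistency_mode"` // "AP" or "CP"
}

// LoadConfig reads the node's configuration file from disk.
func LoadConfig(path string) (Config, error) {
	var cfg Config
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, err
	}
	err = json.Unmarshal(data, &cfg)
	return cfg, err
}

// kvStore is the subset of the storage layer used during initialization.
type kvStore interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
}

// loadOrCreateNodeID reuses a persisted Node ID if one exists, so the node
// keeps its identity (and hash-ring position) across restarts.
func loadOrCreateNodeID(store kvStore) (string, error) {
	if id, err := store.Get("node_id"); err == nil {
		return string(id), nil
	}
	id := uuid.New().String()
	return id, store.Set("node_id", []byte(id))
}
```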
2. Joining the Cluster
When a non-bootstrap node starts, it performs the following steps to join the cluster:
Contact Bootstrap Node:
Sends a JoinCluster request to the bootstrap node, including its Node ID and address.
Requests the latest cluster state, including:
Hash ring positions.
Active nodes and their health statuses.
Verify Node Identity:
The bootstrap node checks if the joining node matches any previously failed nodes in its records.
If it is a recovered node:
The bootstrap node marks it as recovered and triggers a cluster-wide gossip to update the hash ring and synchronize state.
If it is a new node:
The bootstrap node updates the cluster state with the new node's information.
Calculate Position on Hash Ring:
The new node calculates its position on the hash ring based on its Node ID.
The updated hash ring is propagated to all nodes via gossip.
Replication Factor Check:
The new node checks if the current number of nodes in the cluster meets the configured replication factor.
If the number of nodes is less than the replication factor, the node adjusts its responsibilities to include additional replicas.
Data Synchronization:
The new node requests keys for its responsible range on the hash ring from neighboring nodes.
Synchronizes its local database with the latest data.
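The join flow can be modeled as a simple request/response exchange with the bootstrap node. The sketch below shows plausible message shapes only; the field names are assumptions rather than the actual wire format.

```go
package cluster

// JoinRequest is what a starting node sends to the bootstrap node (illustrative).
type JoinRequest struct {
	NodeID  string `json:"node_id"`
	Address string `json:"address"` // host:port the node listens on
}

// JoinResponse carries the cluster state the new node needs before serving traffic.
type JoinResponse struct {
	Recovered         bool              `json:"recovered"`          // true if this node was previously in the cluster
	ReplicationFactor int               `json:"replication_factor"`
	RingPositions     map[string]uint32 `json:"ring_positions"` // nodeID -> position on the hash ring
	NodeHealth        map[string]string `json:"node_health"`    // nodeID -> "alive", "suspect", "failed"
}
```

After receiving the response, the joining node computes its ring position, checks the replication factor, and syncs the keys for its range from its neighbors, as described in the steps above.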
3. Normal Operation
During normal operation, a node performs the following tasks:
Handling Client Requests:
Processes CRUD operations (SetKey, GetKey, DeleteKey) for keys within its range.
Forwards requests to other nodes if the key belongs to a different range on the hash ring.
Data Replication:
Ensures all keys are replicated to the required number of replicas.
Retries replication periodically if any replica nodes are temporarily unavailable.
Health Checks:
Periodically runs health checks on random nodes in the cluster using the gossip protocol.
Marks nodes as suspected failures if health checks fail multiple times.
Cluster State Synchronization:
Participates in gossip to propagate updates about node additions, failures, and recoveries.
Ensures its local view of the cluster is consistent with other nodes.
4. Handling Failures
When a node detects failures, it handles them as follows:
Temporary Node Failure:
If a node loses connectivity, other nodes mark it as a suspected failure.
Gossip propagates the suspected failure status across the cluster.
Writes targeting the failed node are redirected to replicas.
Permanent Node Failure:
If a node remains unresponsive after multiple gossip cycles, it is marked as permanently failed.
The hash ring is recalculated to exclude the failed node.
Keys previously managed by the failed node are redistributed to neighboring nodes.
Replication Adjustments:
Data stored on the failed node's replicas is promoted to primary status where necessary.
New replicas are created to maintain the replication factor.
5. Node Recovery
If a previously failed node comes back online, it performs the following steps:
Rejoin the Cluster:
Sends a recovery request to the bootstrap node or a known peer.
Retrieves the latest cluster state, including the hash ring and key responsibilities.
Data Synchronization:
Requests missing data for its range from the new primary nodes.
Synchronizes its local database to match the current state of the cluster.
Cluster Update:
The recovering node initiates a gossip message to inform other nodes of its recovery.
The hash ring is updated to include the recovered node.
6. Node Decommissioning
When a node is intentionally removed from the cluster, it follows these steps:
Data Migration:
Transfers its keys to the next responsible nodes on the hash ring.
Ensures all replicas are updated before leaving the cluster.
Cluster Notification:
Sends a decommissioning request to the bootstrap node.
Triggers a gossip message to inform all nodes about its removal.
Shutdown:
Stops all internal services, including the database and Node Manager.
Cleans up any resources associated with the node.
Next Steps
We will now discuss the API specification of the system in the next ticket.