
Comprehensive Guide to Distributed Systems


Table of Contents

  1. Operating System Paradigms
    • Network Operating Systems
    • Distributed Operating Systems
    • Middleware
  2. Networking Models
    • OSI Reference Model
    • Remote Procedure Call (RPC)
    • Remote Method Invocation (RMI)
  3. Communication Paradigms
    • Message-Oriented Communication
    • Stream-Oriented Communication
    • Group Communication
  4. Clock Synchronization
    • Physical Clock Synchronization
    • Logical Clock Synchronization
    • Leader Election Algorithms
  5. Mutual Exclusion
    • Centralized Approach
    • Distributed Approaches
    • Token-Based Approaches
  6. Resource Management
    • Global Scheduling
    • Task Assignment
    • Load Balancing


1. Operating System Paradigms

1.1 Network Operating Systems

Network Operating Systems (NOS) provide an abstraction layer that enables multiple independent computers to communicate and share resources while maintaining their individual autonomy. Unlike distributed operating systems, NOS maintains clear boundaries between machines.

```mermaid
graph TD
    A[Client] -->|Request| B[Server]
    B -->|Response| A
    C[Client] -->|Request| B
    B -->|Response| C
    D[Printer] --> B
    E[Storage] --> B
```

Key characteristics:

  • Distinct machine boundaries: Each computer retains its own identity, operating system, and resources
  • Explicit network awareness: Users must explicitly connect to remote resources
  • Authentication-based access: Users typically log into specific servers to access shared resources
  • Machine-oriented management: Administration happens at the individual machine level

Examples:

  • Early versions of Novell NetWare
  • Windows Server (with its file and print sharing capabilities)
  • Traditional Unix network services (NFS, NIS)

Architecture components:

  • Network protocols: TCP/IP, IPX/SPX, NetBEUI
  • Network services: File sharing, print services, authentication
  • Client software: Network redirectors, service clients
  • Server software: Resource management, authentication, permission management

NOS functions primarily as a means to share peripherals (printers, storage) and data (files, databases) among connected computers. Users typically need to know where resources are located and explicitly connect to them using network paths or mapped drives.

1.2 Distributed Operating Systems

Distributed Operating Systems (DOS) represent a more advanced paradigm where multiple networked computers function as a unified system, with the operating system abstracting away the distributed nature of the infrastructure.

```mermaid
graph TD
    subgraph "Single System Image"
        A[Process] --> B[Resource Manager]
        C[Process] --> B
        B --> D[CPU Cluster]
        B --> E[Distributed Storage]
        B --> F[Network]
    end
```

Key characteristics:

  • System-wide transparency: Users perceive a single, unified system
  • Automatic resource management: The system handles resource allocation automatically
  • Location transparency: Users don't need to know where resources physically reside
  • Migration transparency: Processes can move between nodes without user awareness
  • Replication transparency: System can replicate data for reliability without user intervention
  • Concurrency transparency: Multiple users can access resources simultaneously

Types of transparency in distributed OS:

| Type | Definition | Example |
| --- | --- | --- |
| Access | Hides differences in data representation | Different CPU architectures sharing data |
| Location | Hides where resources are located | Files accessible via the same path regardless of server |
| Migration | Hides resource movement | Process continues executing after moving to another node |
| Replication | Hides that resources are replicated | System automatically reads from nearest replica |
| Concurrency | Hides that a resource may be shared | Distributed locking handled automatically |
| Failure | Hides failures and recovery | Automatic failover to backup components |
| Performance | System can be reconfigured for performance | Load redistribution based on usage patterns |

Examples:

  • Amoeba (developed at Vrije Universiteit)
  • Sprite (UC Berkeley)
  • Plan 9 (Bell Labs)
  • Inferno (Bell Labs)

A true DOS provides the illusion of a single system image across multiple physical machines. Users interact with a single logical system, with the complexity of distribution handled by the operating system itself. This approach enables seamless resource sharing, better fault tolerance, and dynamic scalability.

1.3 Middleware

Middleware occupies the space between traditional operating systems and applications, providing services that facilitate communication and resource sharing across a distributed system without requiring a fully distributed operating system.

Definition: Software layer that sits between the operating system and applications, providing services that enable distributed systems integration.

```mermaid
graph LR
    A[App] --> B[Middleware]
    B --> C[OS]
    D[App] --> B
    B --> E[Network]
    F[App] --> B
```

Key functions:

  • Masking heterogeneity of hardware, operating systems, and networks
  • Providing uniform, high-level interfaces for developers and applications
  • Offering standardized services like naming, security, and transactions
  • Enabling interoperability between diverse components

Types of middleware:

  1. Communication-oriented middleware:

    • Message-Oriented Middleware (MOM): Systems like IBM MQ, RabbitMQ, Apache Kafka
    • Remote Procedure Call (RPC): ONC RPC, DCE RPC
    • Object Request Brokers: CORBA, DCOM
  2. Application-oriented middleware:

    • Web services middleware: SOAP, REST frameworks
    • Transaction Processing Monitors: CICS, Tuxedo
    • Database middleware: ODBC, JDBC, ORM frameworks
  3. Integration middleware:

    • Enterprise Service Bus (ESB): Mule ESB, WSO2 ESB
    • Enterprise Application Integration (EAI): WebMethods, TIBCO

Middleware services typically include:

  • Naming services: Allow resources to be located by name rather than address
  • Directory services: Like LDAP for resource discovery and organization
  • Security services: Authentication, authorization, encryption
  • Transaction services: ACID properties across distributed resources
  • Concurrency control: Coordination of access to shared resources
  • Event notification: Publish-subscribe mechanisms
  • Data conversion: Transforming data between different formats

Middleware vs. DOS comparison:

| Aspect | Middleware | Distributed OS |
| --- | --- | --- |
| Integration depth | Application-level | Kernel-level |
| Installation | Sits on top of existing OS | Replaces or modifies OS |
| Heterogeneity | Handles diverse systems | Usually requires homogeneity |
| Deployment | Easier to deploy incrementally | Often requires complete system overhaul |
| Transparency | Provides partial transparency | Aims for full transparency |

Middleware represents a pragmatic approach to building distributed systems, allowing heterogeneous systems to interoperate without requiring a completely new operating system. It's particularly valuable in enterprise environments where legacy systems need to be integrated.


2. Networking Models

2.1 OSI Reference Model

The Open Systems Interconnection (OSI) model is a conceptual framework that standardizes the functions of a communication system into seven distinct layers. Though not strictly implemented in most systems, it provides a reference for understanding network protocols and interactions.

```mermaid
graph BT
    A[Application] --> B[Presentation]
    B --> C[Session]
    C --> D[Transport]
    D --> E[Network]
    E --> F[Data Link]
    F --> G[Physical]
```

The Seven Layers:

  1. Physical Layer (Layer 1)

    • Function: Transmission of raw bit streams over physical medium
    • Units: Bits
    • Examples: Ethernet (IEEE 802.3), USB, Bluetooth physical layer
    • Key concepts: Signal encoding, transmission rates, physical connectors
    • Devices: Hubs, repeaters, cables, network interface cards
  2. Data Link Layer (Layer 2)

    • Function: Reliable transmission of data frames between nodes
    • Units: Frames
    • Examples: Ethernet, PPP, HDLC, Wi-Fi (IEEE 802.11)
    • Key concepts: MAC addressing, error detection/correction, flow control
    • Devices: Network switches, bridges
    • Sublayers:
      • Media Access Control (MAC): Hardware addressing and channel access
      • Logical Link Control (LLC): Flow/error control, multiplexing
  3. Network Layer (Layer 3)

    • Function: Routing of packets across networks
    • Units: Packets
    • Examples: IP, ICMP, OSPF, BGP
    • Key concepts: Logical addressing, routing, packet forwarding, fragmentation
    • Devices: Routers
    • Services:
      • Connectionless communication (IP)
      • Connection-oriented (virtual circuits)
  4. Transport Layer (Layer 4)

    • Function: End-to-end communication and data reliability
    • Units: Segments (TCP) or Datagrams (UDP)
    • Examples: TCP, UDP, SCTP
    • Key concepts: Port numbers, connection establishment, flow control, congestion control
    • Services:
      • Reliable delivery (TCP)
      • Best-effort delivery (UDP)
  5. Session Layer (Layer 5)

    • Function: Establishes, manages, and terminates connections
    • Examples: NetBIOS, RPC, SIP
    • Key concepts: Session establishment, maintenance, termination, synchronization
    • Features: Checkpointing, recovery, dialog control
  6. Presentation Layer (Layer 6)

    • Function: Data translation, encryption, compression
    • Examples: SSL/TLS, JPEG, MPEG, ASCII, Unicode
    • Key concepts: Syntax negotiation, data transformation
    • Services:
      • Data encryption/decryption
      • Character set conversion
      • Data compression/decompression
  7. Application Layer (Layer 7)

    • Function: User-facing applications and services
    • Examples: HTTP, FTP, SMTP, DNS, SSH, Telnet
    • Key concepts: Application-specific protocols, user authentication, resource sharing
    • Services: File transfer, email, web browsing, network management

Encapsulation Process:

  • Each layer adds its own header (and sometimes trailer) to the data
  • As data moves down the stack, each layer encapsulates the data from the layer above
  • At the receiving end, each layer strips its headers and passes the payload up

OSI vs. TCP/IP Model:

| OSI Layer | TCP/IP Layer | Protocols |
| --- | --- | --- |
| Application, Presentation, Session | Application | HTTP, FTP, SMTP, DNS |
| Transport | Transport | TCP, UDP |
| Network | Internet | IP, ICMP, ARP |
| Data Link, Physical | Network Interface | Ethernet, Wi-Fi |

The OSI model provides a conceptual framework that helps understand network operations, troubleshoot problems, and design interoperable systems, even though most modern networks are based on the simpler TCP/IP model.

2.2 Remote Procedure Call (RPC)

Remote Procedure Call (RPC) is a communication protocol that enables a program to cause a procedure to execute in another address space (commonly on another computer on a shared network) without explicitly coding the details of the remote interaction.

```mermaid
sequenceDiagram
    Client->>+Stub: Call
    Stub->>+Server: Marshalled Request
    Server->>+Stub: Marshalled Response
    Stub->>Client: Result
```

Key components of RPC:

  1. Client stub: Client-side proxy that makes the remote procedure look local
  2. Server stub: Server-side component that receives the call and invokes the actual procedure
  3. RPC runtime: Middleware that handles the communication details
  4. Interface Definition Language (IDL): Language for defining remote interfaces

RPC operation flow (a code sketch follows the list):

  1. Client application calls a local procedure (client stub)
  2. Client stub packages arguments into a message (marshalling)
  3. Client's RPC runtime transmits the message to the server
  4. Server's RPC runtime receives the message and passes it to the server stub
  5. Server stub unpacks the arguments (unmarshalling)
  6. Server stub calls the actual procedure
  7. Server procedure executes and returns results to the server stub
  8. Server stub packages the results into a message (marshalling)
  9. Server's RPC runtime transmits the message back to the client
  10. Client's RPC runtime receives the message and passes it to the client stub
  11. Client stub unpacks the results (unmarshalling)
  12. Client application receives the results

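To make the flow concrete, here is a minimal sketch in Python, using JSON over TCP as the marshalling format. Everything here (the `add` procedure, port 9000, single-recv framing) is illustrative rather than taken from any real RPC framework.

```python
import json
import socket

def marshal(payload: dict) -> bytes:
    # Package arguments/results into a self-describing message (marshalling)
    return json.dumps(payload).encode()

def unmarshal(data: bytes) -> dict:
    return json.loads(data.decode())

class ClientStub:
    """Client-side proxy that makes the remote procedure look local."""
    def __init__(self, host: str, port: int):
        self.addr = (host, port)

    def call(self, func_name: str, *args):
        with socket.create_connection(self.addr) as sock:
            sock.sendall(marshal({"func": func_name, "args": list(args)}))  # steps 2-3
            reply = unmarshal(sock.recv(4096))                              # steps 10-11
        return reply["result"]

def serve(port: int, procedures: dict):
    """Server stub + runtime: receive, unmarshal, dispatch, marshal the reply."""
    with socket.socket() as srv:
        srv.bind(("", port))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            with conn:
                request = unmarshal(conn.recv(4096))                    # steps 4-5
                result = procedures[request["func"]](*request["args"])  # steps 6-7
                conn.sendall(marshal({"result": result}))               # steps 8-9

# Usage: run serve(9000, {"add": lambda a, b: a + b}) in one process;
# ClientStub("localhost", 9000).call("add", 2, 3) then returns 5.
```
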
RPC binding process:

  • Static binding: Server location known at compile time
  • Dynamic binding: Server location determined at runtime via name service
  • Auto-binding: System automatically selects an appropriate server

RPC semantics:

  • At-most-once: Guarantees no duplication, may fail silently
  • At-least-once: Guarantees execution, may duplicate
  • Exactly-once: Ideal but hard to achieve in distributed systems
  • Maybe: No guarantees (best effort)

Common RPC implementations:

  • ONC RPC (Open Network Computing): Developed by Sun, used in NFS
  • DCE RPC (Distributed Computing Environment): By Open Software Foundation
  • gRPC: Modern, high-performance RPC framework by Google
  • XML-RPC: HTTP-based, uses XML for encoding
  • JSON-RPC: HTTP-based, uses JSON for encoding

Challenges in RPC:

  • Partial failures: One system may fail while others continue
  • Latency: Remote calls take longer than local calls
  • Parameter passing: Complexity with reference parameters
  • Security: Authentication and encryption concerns
  • Heterogeneity: Different data representations across systems

RPC remains a fundamental building block for distributed systems, providing a programming abstraction that makes remote service invocation appear similar to local procedure calls.

2.3 Remote Method Invocation (RMI)

Remote Method Invocation (RMI) extends the concept of Remote Procedure Call to object-oriented systems, allowing objects in one Java Virtual Machine (JVM) to invoke methods on objects in another JVM.

```mermaid
classDiagram
    class Client {
        +Stub proxy
    }
    class Stub {
        +marshal()
        +unmarshal()
    }
    class Skeleton {
        +dispatch()
    }
    class Server {
        +remoteMethod()
    }
    Client --> Stub
    Stub --> Skeleton
    Skeleton --> Server
```

Key characteristics of RMI:

  • Object-oriented: Operates on remote objects rather than procedures
  • Language-specific: Primarily a Java technology (though similar concepts exist in other languages)
  • Parameter passing: Supports passing objects by value or reference
  • Distributed garbage collection: Manages remote object lifecycles

RMI architecture components:

  1. Client: Application that invokes methods on remote objects
  2. Client-side stub: Proxy representing the remote object locally
  3. RMI Registry: Naming service for looking up remote objects
  4. Server-side skeleton: Receives calls and forwards to implementation
  5. Remote object implementation: Actual object implementation

RMI operation flow:

  1. Server registers remote object with RMI registry
  2. Client looks up remote object in registry
  3. Registry returns a stub for the remote object to client
  4. Client invokes methods on the stub as if it were local
  5. Stub marshals parameters and forwards call to server
  6. Server-side skeleton unmarshals parameters and invokes method
  7. Method executes and returns results
  8. Skeleton marshals results and sends back to client
  9. Client stub unmarshals results and returns to client application

RMI vs. RPC:

| Aspect | RMI | RPC |
| --- | --- | --- |
| Paradigm | Object-oriented | Procedure-oriented |
| Language support | Java-specific | Language-neutral |
| Parameter passing | Objects (by value or reference) | Basic data types |
| Dynamic downloading | Supports code mobility | Generally no |
| Garbage collection | Distributed GC | No GC |
| Interface definition | Java interfaces | IDL specifications |
| Naming service | RMI Registry | Various name services |

Advanced RMI features:

  • Dynamic class loading: Client can download implementation classes
  • Distributed garbage collection: Automatic management of remote references
  • Activation framework: On-demand activation of server objects
  • Custom socket factories: For security and custom transport protocols
  • Security manager: Controls access to resources

RMI over IIOP: RMI can also operate over the Internet Inter-ORB Protocol (IIOP), enabling Java RMI to interoperate with CORBA systems, referred to as RMI-IIOP.

RMI limitations:

  • Java-centricity: Limited interoperability with other languages
  • Performance overhead: Serialization and network latency
  • Network dependence: Vulnerable to network failures
  • Firewall issues: May be blocked by restrictive firewalls

RMI represents an elegant object-oriented approach to distributed computing, allowing developers to work with remote objects using familiar object-oriented programming techniques.


3. Communication Paradigms

3.1 Message-Oriented Communication

Message-Oriented Communication is a paradigm where applications communicate by sending discrete messages to each other, typically through intermediary message-queuing systems or message brokers.

```mermaid
graph LR
    P[Publisher] --> T[Topic]
    S1[Subscriber] --> T
    S2[Subscriber] --> T
    Q[Queue] --> C1[Consumer]
    Q --> C2[Consumer]
```

Key characteristics:

  • Asynchronous: Sender doesn't wait for receiver to process message
  • Loosely coupled: Minimal dependencies between components
  • Discrete messages: Information exchanged in self-contained units
  • Store-and-forward: Messages can be stored until delivery is possible
  • Middleware-mediated: Often uses dedicated message-oriented middleware

Core components:

  1. Message: Self-contained unit of information with:

    • Header: Metadata about the message (routing, priority, expiration)
    • Body: Actual payload/content
    • Properties: Additional application-specific metadata
  2. Queue: Storage structure for messages

    • Point-to-point: One sender, one receiver
    • Publish-subscribe: One sender, multiple receivers
  3. Message broker/MOM: Middleware that manages message routing, delivery, and persistence

    • Ensures reliable delivery
    • Handles routing and transformations
    • Provides persistence and transactions

Common interaction patterns (see the sketch after this list):

  1. Request-reply: Two-way communication pattern

    • Client sends request, includes reply address
    • Server processes request, sends response to reply address
  2. Publish-subscribe: One-to-many communication pattern

    • Publishers send messages to topics
    • Subscribers receive messages from topics they've subscribed to
    • Loose coupling between publishers and subscribers
  3. Message queuing: Reliable point-to-point communication

    • Messages stored until consumer retrieves them
    • Provides temporal decoupling
    • Supports load balancing across multiple consumers
  4. Broadcast/multicast: Sending to multiple recipients simultaneously

    • More efficient than multiple point-to-point sends
    • Can be implemented via IP multicast or application-level distribution

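To illustrate the queuing and publish-subscribe patterns above, here is a toy in-memory broker in Python. Real brokers (RabbitMQ, Kafka, and the like) add networking, persistence, and acknowledgments; every name below is illustrative.

```python
from collections import defaultdict, deque
from typing import Callable

class MiniBroker:
    """Toy broker: topics fan out to every subscriber;
    queues hand each message to exactly one consumer."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> subscriber callbacks
        self.queues = defaultdict(deque)  # queue -> stored messages

    # Publish-subscribe: one-to-many
    def subscribe(self, topic: str, callback: Callable):
        self.topics[topic].append(callback)

    def publish(self, topic: str, message):
        for callback in self.topics[topic]:
            callback(message)

    # Message queuing: point-to-point with temporal decoupling
    def enqueue(self, queue: str, message):
        self.queues[queue].append(message)      # stored until retrieved

    def consume(self, queue: str):
        q = self.queues[queue]
        return q.popleft() if q else None

broker = MiniBroker()
broker.subscribe("orders", lambda m: print("billing saw:", m))
broker.subscribe("orders", lambda m: print("shipping saw:", m))
broker.publish("orders", {"id": 1})   # both subscribers receive it
broker.enqueue("work", "task-1")      # waits even if no consumer is active yet
print(broker.consume("work"))         # exactly one consumer gets it
```
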
Messaging guarantees:

  • At-most-once: Message delivered once or not at all
  • At-least-once: Message delivered one or more times
  • Exactly-once: Message delivered exactly once (hardest to implement)

Message-oriented middleware platforms:

  • Apache Kafka: High-throughput distributed streaming platform
  • RabbitMQ: Feature-rich message broker implementing AMQP
  • IBM MQ: Enterprise messaging solution with strong transaction support
  • Apache ActiveMQ: Open-source message broker with multiple protocols
  • Amazon SQS/SNS: Cloud-based queuing and notification services

Advantages of message-oriented communication:

  • Temporal decoupling: Components don't need to be active simultaneously
  • Spatial decoupling: Components don't need to know each other's locations
  • Load balancing: Natural distribution of work across consumers
  • Fault tolerance: Messages persist even if receivers are unavailable
  • Scalability: Easy to add producers or consumers

Challenges and considerations:

  • Complexity: Additional infrastructure requirements
  • Latency: Potentially higher compared to direct communication
  • Ordering: Ensuring message sequence when required
  • Idempotency: Handling duplicate message processing
  • Transactionality: Managing transactions across multiple messages

Message-oriented communication is particularly valuable in distributed systems where reliability, scalability, and loose coupling are important design goals.

3.2 Stream-Oriented Communication

Stream-Oriented Communication is a paradigm where applications exchange data as continuous sequences of bytes rather than discrete messages, typically using a connected transport protocol like TCP.

```mermaid
graph LR
    A[Client] -- TCP --> B[Server]
    B -- Media Stream --> A
    C[Client] -- TCP --> B
```

Key characteristics:

  • Connection-based: Requires established connection before communication
  • Continuous data flow: Data transmitted as an uninterrupted sequence
  • Byte-oriented: Fundamental unit is a byte rather than a message
  • Ordered delivery: Bytes arrive in the same order they were sent
  • Bi-directional: Communication typically flows in both directions

Core concepts:

  1. Connection establishment: Three-way handshake (in TCP)

    • Client sends SYN packet
    • Server responds with SYN-ACK
    • Client confirms with ACK
  2. Data transfer: Sending bytes through established connection

    • Data segmented into packets for transmission
    • Sequence numbers ensure correct ordering
    • Acknowledgments confirm reception
  3. Flow control: Preventing sender from overwhelming receiver

    • Sliding window protocols
    • Receive window advertisements
  4. Congestion control: Preventing network overload

    • Slow start
    • Congestion avoidance
    • Fast retransmit/recovery
  5. Connection termination: Four-way handshake (in TCP)

    • Initiator sends FIN
    • Receiver acknowledges with ACK
    • Receiver sends FIN when ready
    • Initiator acknowledges with ACK

Transport protocols for stream communication:

  • TCP (Transmission Control Protocol):

    • Reliable, ordered, error-checked
    • Connection-oriented
    • Flow control and congestion control
    • Most common stream protocol
  • SCTP (Stream Control Transmission Protocol):

    • Multi-streaming and multi-homing capabilities
    • Message-oriented but over a connection
    • Partial reliability option
    • Used in telecommunications
  • WebSockets:

    • Full-duplex communication over TCP
    • Web-friendly protocol
    • Minimal overhead after connection
    • Supports sub-protocols

Application-level protocols over streams:

  • HTTP/1.1: Request-response protocol with persistent connections
  • HTTP/2: Multiplexed streams over a single connection
  • TLS/SSL: Security layer providing encryption
  • SSH: Secure shell protocol for remote operations
  • SMTP: Email transmission protocol

Framing in stream communication:

Since streams don't have inherent message boundaries, applications must implement framing strategies (a length-prefixed example follows the list):

  1. Length-prefixed: Prepend message size before content
  2. Delimiter-based: Use special characters to mark boundaries
  3. Fixed-length: Use predetermined message sizes
  4. TLV (Type-Length-Value): Structured approach with headers

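A sketch of strategy 1 (length-prefixed framing) in Python; the 4-byte big-endian header is one common convention, not anything mandated by TCP.

```python
import struct

HEADER = struct.Struct(">I")  # 4-byte big-endian unsigned message length

def send_message(sock, payload: bytes):
    # Prepend the size so the receiver knows where the message ends
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    # TCP is a byte stream: a single recv() may return a partial message
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def recv_message(sock) -> bytes:
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```
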
Advantages of stream-oriented communication:

  • Efficiency: Lower overhead for large data transfers
  • Reliability: Built-in mechanisms for reliable delivery
  • Ordered delivery: Guaranteed sequencing
  • Flow control: Prevents overwhelming receivers
  • Congestion awareness: Adapts to network conditions

Challenges and considerations:

  • Connection management: Overhead of establishing/maintaining connections
  • Head-of-line blocking: One lost packet blocks entire stream
  • Stateful nature: More complex failure handling
  • Latency: Connection establishment adds delay
  • Message boundaries: Must be implemented at application level

Stream-oriented communication is well-suited for applications requiring reliable, ordered data transfer, especially for larger data volumes or continuous interactions.

3.3 Group Communication

Group Communication involves sending messages to multiple recipients simultaneously, providing abstractions for coordinating distributed processes that need to work together as a logical group.

```mermaid
graph TD
    A[Member] -->|Multicast| B[Group]
    B --> C[Member]
    B --> D[Member]
    B --> E[Member]
```

Key characteristics:

  • One-to-many or many-to-many: Communication involving multiple participants
  • Group abstraction: Collection of processes treated as a logical unit
  • Membership management: Tracking who belongs to which groups
  • Ordering guarantees: Various delivery ordering semantics
  • Fault tolerance: Mechanisms to handle process failures

Fundamental concepts:

  1. Group membership:

    • Static groups: Fixed membership known at design time
    • Dynamic groups: Processes can join/leave during operation
    • Membership protocols: Algorithms to maintain consistent group view
    • Failure detection: Identifying crashed or disconnected members
  2. Message ordering guarantees:

    • FIFO (First-In-First-Out): Messages from same sender delivered in sending order
    • Causal ordering: If message A could have caused message B, A is delivered before B
    • Total ordering: All processes receive all messages in identical order
    • Atomic ordering: Either all correct processes deliver a message or none do
  3. Delivery semantics:

    • Best-effort: No guarantees about delivery
    • Reliable: All non-faulty members eventually receive the message
    • Atomic: All members receive the message or none do
    • Uniform: If any process (even faulty) delivers a message, all correct processes deliver it

Group communication patterns:

  1. Broadcast/multicast:

    • IP multicast: Network-level distribution (limited deployment)
    • Application-level multicast: Overlay networks for distribution
    • Gossip protocols: Epidemic-style dissemination
  2. Publish-subscribe:

    • Topic-based: Subscribers register interest in specific topics
    • Content-based: Subscribers specify predicates on message content
    • Type-based: Subscribers receive messages of specific types
  3. Active replication:

    • All replicas process all requests
    • Requires total order broadcast
    • State machine replication approach
  4. Virtual synchrony:

    • Provides illusion of synchronous execution
    • Group membership changes appear between message deliveries
    • Simplifies distributed algorithm design

Group communication frameworks and systems:

  • Isis Toolkit: One of the first group communication systems
  • Spread: High-performance group communication toolkit
  • JGroups: Java-based reliable group communication library
  • Apache ZooKeeper: Coordination service with group primitives
  • etcd: Distributed key-value store with watch functionality

Implementation approaches (a sequencer sketch follows the list):

  1. Centralized coordinator:

    • Sequencer assigns order to messages
    • Single point of failure
    • Simpler implementation
  2. Distributed agreement:

    • Consensus protocols (e.g., Paxos, Raft)
    • Higher fault tolerance
    • More complex implementation
  3. Gossip-based dissemination:

    • Probabilistic guarantees
    • Highly scalable
    • Eventually consistent

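As a concrete illustration of approach 1 (a centralized sequencer), the sketch below stamps every message with a global sequence number; members deliver strictly in stamp order, so all of them see an identical (total) order. Networking and sequencer failover are omitted.

```python
import itertools

class Sequencer:
    """Centralized coordinator: assigns each message a global sequence number."""
    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, sender: str, payload):
        return (next(self._counter), sender, payload)

class Member:
    def __init__(self, name: str):
        self.name = name
        self.next_expected = 1
        self.pending = {}   # hold-back queue for out-of-order arrivals

    def on_receive(self, stamped):
        seq, sender, payload = stamped
        self.pending[seq] = (sender, payload)
        while self.next_expected in self.pending:   # deliver in stamp order
            sender, payload = self.pending.pop(self.next_expected)
            print(f"{self.name} delivers #{self.next_expected} from {sender}: {payload}")
            self.next_expected += 1

seq = Sequencer()
alice, bob = Member("alice"), Member("bob")
m1, m2 = seq.stamp("p1", "hello"), seq.stamp("p2", "world")
for m in (m2, m1):        # even if messages arrive out of order...
    alice.on_receive(m)   # ...every member delivers them in the same order
for m in (m1, m2):
    bob.on_receive(m)
```
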
Applications of group communication:

  • Replicated databases: Maintaining consistent replicas
  • Distributed caches: Cache invalidation and updates
  • Cluster management: Coordinating server clusters
  • Collaborative applications: Real-time collaboration tools
  • Distributed monitoring: Event notification systems

Challenges and considerations:

  • Scalability: Many protocols don't scale to large groups
  • Network partitions: Handling split-brain scenarios
  • Performance overhead: Ensuring ordering guarantees adds complexity
  • Failure modes: Distinguishing between process and network failures
  • Consistency vs. availability: Fundamental tradeoffs

Group communication provides powerful abstractions for building fault-tolerant distributed systems, especially where coordination among multiple processes is essential.


4. Clock Synchronization

4.1 Physical Clock Synchronization

Physical clock synchronization involves aligning the time values of physical clocks across distributed systems to ensure coordinated operations, enabling correct event ordering and time-based operations.


Need for clock synchronization:

  • Event ordering: Determining the sequence of distributed events
  • Distributed algorithms: Many require synchronized notion of time
  • Timestamps: For logging, auditing, and debugging
  • Time-based coordination: Scheduling activities across systems
  • Security protocols: Authentication and cryptographic operations

Clock basics:

  • Clock drift: Tendency for clocks to run at slightly different rates
    • Typically 1-100 parts per million (ppm)
    • Results in growing divergence over time
  • Clock skew: Difference between two clocks at a given moment
  • UTC (Coordinated Universal Time): Standard reference time

Challenges in clock synchronization:

  • Network latency: Variable message transit times
  • Asymmetric paths: Different delays in different directions
  • Process scheduling delays: Unpredictable execution timing
  • Temperature effects: Environmental impacts on oscillator frequency
  • Reference source accuracy: Quality of time standard

Main synchronization algorithms:

Cristian's Algorithm

A centralized approach where clients synchronize with a time server (sketched in code after the steps):

  1. Client sends request to time server
  2. Server responds with its current time T
  3. Client records round-trip time RTT
  4. Client sets its time to T + RTT/2 (estimated one-way delay)

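A minimal sketch of these steps in Python, assuming a `get_server_time()` callable that performs the request/response exchange; error bounds and retries are omitted.

```python
import time

def cristian_sync(get_server_time):
    """Estimate the server's current time, compensating for network delay."""
    t0 = time.monotonic()
    server_time = get_server_time()   # steps 1-2: request and response
    rtt = time.monotonic() - t0       # step 3: measured round-trip time
    # Step 4: assume symmetric paths, so one-way delay is about RTT/2
    return server_time + rtt / 2
```
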
Characteristics:

  • Simple implementation
  • Accuracy limited by RTT and path asymmetry
  • Vulnerable to server failures
  • Error bounds can be calculated as ±(RTT/2 - min_delay)

Berkeley Algorithm

A decentralized approach where a coordinator polls other machines and calculates a time adjustment (sketched in code after the steps):

  1. Coordinator polls all machines for their time
  2. Coordinator calculates average time (often excluding outliers)
  3. Coordinator sends relative adjustments to each machine
  4. Machines adjust their clocks accordingly

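A sketch of the coordinator's averaging step (steps 2-3), given clock readings it has already collected; real implementations also compensate for polling delays.

```python
def berkeley_adjustments(readings: dict, max_skew: float = 1.0) -> dict:
    """Given {machine: clock_reading}, return {machine: relative adjustment}.
    Readings further than max_skew from the median are treated as outliers."""
    values = sorted(readings.values())
    median = values[len(values) // 2]
    good = [v for v in readings.values() if abs(v - median) <= max_skew]
    average = sum(good) / len(good)
    # Step 3: each machine gets a relative correction, not an absolute time
    return {name: average - value for name, value in readings.items()}

print(berkeley_adjustments({"a": 100.0, "b": 100.4, "c": 99.8, "d": 250.0}))
# 'd' is a faulty outlier; a, b, c converge toward ~100.07
```
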
Characteristics:

  • Works with faulty time sources
  • Achieves internal synchronization (relative to group)
  • Less dependent on absolute time source
  • Better fault tolerance than Cristian's

Network Time Protocol (NTP)

A hierarchical, fault-tolerant protocol for synchronizing clocks over variable-latency networks:

  1. Stratum levels:

    • Stratum 0: Reference clocks (atomic clocks, GPS)
    • Stratum 1: Servers directly connected to reference clocks
    • Stratum 2: Clients of stratum 1 servers
    • And so on (higher numbers = further from reference)
  2. Synchronization modes:

    • Client/server: Periodic polling of NTP servers
    • Symmetric: Peer-to-peer synchronization
    • Broadcast/multicast: One-to-many updates
  3. Clock selection algorithm:

    • Polls multiple servers
    • Filters outliers and falsetickers
    • Combines remaining sources using weighted average
  4. Clock discipline algorithm:

    • Phase-locked loop (PLL) for frequency stability
    • Gradual adjustments to avoid time steps
    • Handles both drift and offset correction

Characteristics:

  • Highly accurate (within milliseconds over internet, microseconds on LANs)
  • Robust against failures and Byzantine behavior
  • Scalable hierarchical design
  • Handles asymmetric paths and variable latency
  • Widely deployed internet standard

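At the heart of NTP's client/server mode is a standard on-wire calculation over four timestamps: client send (t0), server receive (t1), server send (t2), and client receive (t3).

```python
def ntp_offset_delay(t0: float, t1: float, t2: float, t3: float):
    """Standard NTP on-wire calculation.
    offset: estimated difference between server and client clocks
    delay:  round-trip network delay, excluding server processing time"""
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Client sends at 10.0, server receives at 12.0, replies at 12.1,
# client receives at 10.3 -> client is ~1.9 s behind; 0.2 s network delay:
print(ntp_offset_delay(10.0, 12.0, 12.1, 10.3))  # (1.9, 0.2)
```
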
GPS-based synchronization:

Modern systems often use GPS receivers for precision timing:

  • Accuracy typically within 100 nanoseconds of UTC
  • Requires antenna with clear sky view
  • Independent of network conditions
  • Often used as stratum 0/1 sources for NTP

Precision Time Protocol (PTP/IEEE 1588):

Designed for high-precision local network synchronization:

  • Hardware timestamping capabilities
  • Sub-microsecond accuracy
  • Path delay measurement
  • Best suited for industrial and technical applications

Physical clock synchronization remains a fundamental challenge in distributed systems, balancing practical considerations of accuracy, cost, and complexity against application requirements.

4.2 Logical Clock Synchronization

Logical clock synchronization focuses on maintaining consistent event ordering rather than absolute time values, providing a framework for reasoning about causality in distributed systems.

Key concepts:

  • Happened-before relation: Denoted a → b, meaning event a could have influenced event b
  • Concurrent events: Events that are not causally related (neither a → b nor b → a)
  • Causality: Capturing potential cause-effect relationships between events
  • Logical time: Abstract notion of time based on event ordering

Types of logical clocks:

Lamport's Logical Clocks

Developed by Leslie Lamport, these scalar clocks assign timestamps to events in a way that respects causality:

Rules (sketched in code after the list):

  1. When process P executes an event, increment P's logical clock
  2. When process P sends a message m, include P's logical clock timestamp T
  3. When process Q receives message m with timestamp T, set Q's logical clock to max(Q's clock, T) + 1

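The three rules translate almost line-for-line into code; a minimal sketch:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: any local event increments the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Rule 2: increment, then attach the timestamp to the message."""
        return self.tick()

    def receive(self, msg_timestamp: int):
        """Rule 3: jump past both our clock and the sender's timestamp."""
        self.time = max(self.time, msg_timestamp) + 1
        return self.time

p, q = LamportClock(), LamportClock()
ts = p.send()      # P's clock: 1
q.receive(ts)      # Q's clock: max(0, 1) + 1 = 2, so C(send) < C(receive)
```
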
Properties:

  • Monotonicity: Clock values always increase
  • Causality: If a → b, then C(a) < C(b) where C() is the clock function
  • Weak clock condition: Cannot determine if concurrent events are causally related

Data structure:

  • Single integer per process

Limitations:

  • Cannot detect concurrent events
  • Only partial ordering of events
  • No information about relationships between concurrent events

Vector Clocks

An extension of Lamport clocks that maintains vector timestamps, enabling detection of concurrent events:

Rules:

  1. Each process Pi maintains a vector Vi of size n (where n is the number of processes)
  2. Initially, Vi[j] = 0 for all i, j
  3. When Pi executes an internal event, increment Vi[i]
  4. When Pi sends a message, increment Vi[i] and send entire vector Vi with message
  5. When Pi receives a message with vector V, update Vi[j] = max(Vi[j], V[j]) for all j, then increment Vi[i]

Vector Clock Basics Recap:

For n processes, each process keeps a vector of length n.

  • Vi[j] means: What process i knows about process j's time.

  • Helps capture causal relationships between events.

  • If V(e1) < V(e2) → e1 happened before e2.

  • If neither V(e1) ≤ V(e2) nor V(e2) ≤ V(e1) → events are concurrent.


🧠 **Example: Three Processes P1, P2, P3**

Each process keeps a vector [P1, P2, P3]. Let's simulate a sequence of events:


🔹 Initial State:

All clocks start at:

P1: [0, 0, 0]
P2: [0, 0, 0]
P3: [0, 0, 0]

🔹 Step 1: P1 executes an internal event

Increment own index:

P1: [1, 0, 0]

🔹 Step 2: P1 sends a message to P2

Before sending, P1 increments its clock:

P1: [2, 0, 0]  → sends this vector to P2

🔹 Step 3: P2 receives the message from P1
  • P2 first updates its vector:
    V2[j] = max(V2[j], V1[j]) for all j

  • P2's current: [0, 0, 0]
    Incoming from P1: [2, 0, 0]
    So, updated: [2, 0, 0]

  • Then, P2 increments its own clock:
    [2, 1, 0]

P2: [2, 1, 0]

🔹 Step 4: P2 sends a message to P3
  • Increment own clock before sending:
    [2, 2, 0]
  • Send this vector to P3

🔹 Step 5: P3 receives message from P2
  • P3’s current: [0, 0, 0]
    Incoming vector: [2, 2, 0]

  • Update with max: → [2, 2, 0]

  • Increment P3’s clock: → [2, 2, 1]

P3: [2, 2, 1]

🔹 Step 6: P1 executes another internal event

[3, 0, 0]


🔍 Final Vector Clocks
| Process | Vector Clock |
| --- | --- |
| P1 | [3, 0, 0] |
| P2 | [2, 2, 0] |
| P3 | [2, 2, 1] |

🧩 How to Compare Events?

Let’s say:

  • Event e1: P1’s [2, 0, 0] (before message to P2)
  • Event e2: P2’s [2, 1, 0] (after receiving)

Compare:
[2, 0, 0] < [2, 1, 0] → Yes
So e1 happened before e2


Now consider:

  • Event e3: P1’s [3, 0, 0]
  • Event e4: P3’s [2, 2, 1]

Compare:
  • Not e3 < e4 (because 3 > 2)
  • Not e4 < e3 (because 2 > 0)

→ So e3 and e4 are concurrent events.

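The whole walkthrough above can be reproduced with a small Python class; the helper functions implement the comparison rules listed just below.

```python
class VectorClock:
    def __init__(self, n: int, pid: int):
        self.v = [0] * n      # one entry per process
        self.pid = pid

    def internal_event(self):
        self.v[self.pid] += 1             # rule 3

    def send(self) -> list:
        self.v[self.pid] += 1             # rule 4: increment, then ship a copy
        return list(self.v)

    def receive(self, incoming: list):
        # rule 5: component-wise max, then increment own entry
        self.v = [max(a, b) for a, b in zip(self.v, incoming)]
        self.v[self.pid] += 1

def happened_before(a: list, b: list) -> bool:
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a: list, b: list) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)

p1, p2, p3 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
p1.internal_event()                        # P1: [1, 0, 0]
p2.receive(p1.send())                      # P1: [2, 0, 0] -> P2: [2, 1, 0]
p3.receive(p2.send())                      # P2: [2, 2, 0] -> P3: [2, 2, 1]
p1.internal_event()                        # P1: [3, 0, 0]
print(happened_before([2, 0, 0], [2, 1, 0]))   # True: e1 -> e2
print(concurrent(p1.v, p3.v))                  # True: e3 || e4
```
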
Properties:

  • Causality capture: V(a) < V(b) if and only if a → b
  • Concurrency detection: If neither V(a) < V(b) nor V(b) < V(a), events are concurrent
  • Strong clock condition: Complete characterization of causal relationships

Data structure:

  • Vector of integers, one entry per process

Vector clock comparison:

  • V(a) ≤ V(b) if V(a)[i] ≤ V(b)[i] for all i
  • V(a) < V(b) if V(a) ≤ V(b) and V(a) ≠ V(b)
  • V(a) || V(b) (concurrent) if neither V(a) < V(b) nor V(b) < V(a)

Limitations:

  • O(n) space complexity, where n is the number of processes
  • Message size grows with system size
  • Overhead for maintaining complete causal history

Matrix Clocks (Version Vectors)

An extension of vector clocks that captures not just direct knowledge but also what each process knows about other processes:

Data structure:

  • n×n matrix where n is the number of processes
  • Mi[j, k] represents process i's knowledge of process j's knowledge of process k's events

Properties:

  • Complete information about causal knowledge
  • Enables garbage collection of message logs
  • Used in optimistic replication systems

Limitations:

  • O(n²) space complexity
  • Significant communication overhead

Applications of logical clocks:

  • Causal broadcast: Ensuring messages are delivered respecting causality
  • Distributed snapshots: Capturing consistent global states
  • Distributed debugging: Analyzing event relationships
  • Conflict detection: In optimistic replication and collaborative editing
  • Consistency models: Implementing causal consistency

Logical clocks provide the theoretical foundation for reasoning about event ordering in distributed systems, offering a framework that addresses the fundamental limitations of physical time in distributed environments.

4.3 Leader Election Algorithms

Leader election is a fundamental problem in distributed systems where processes must reach agreement on selecting a single process (the leader) to coordinate activities or make decisions.

```mermaid
sequenceDiagram
    participant A as Node1
    participant B as Node2
    participant C as Node3
    A->>B: Election
    A->>C: Election
    B-->>A: OK
    C-->>A: OK
    B->>C: Election
    C-->>B: OK
    C->>A: Coordinator
    C->>B: Coordinator
```

Use cases for leader election:

  • Centralized control: Coordinating distributed operations
  • Resource arbitration: Managing access to shared resources
  • Load balancing: Distributing work among processes
  • Fault tolerance: Recovering from leader failures
  • Consensus: Facilitating agreement protocols

Desirable properties:

  • Safety: At most one leader is elected at any time
  • Liveness: Eventually a leader is elected if a suitable candidate exists
  • Symmetry breaking: Distinguishing between otherwise identical processes
  • Fault tolerance: Handling process failures during and after election
  • Efficiency: Minimizing message complexity and time

Common leader election algorithms:

Bully Algorithm

The Bully Algorithm elects the process with the highest ID as the leader:

Process (sketched in code after the list):

  1. When a process P detects leader failure or system initialization:
    • P sends ELECTION message to all processes with higher IDs
    • If no responses, P declares itself leader and sends COORDINATOR message
    • If any process responds, P becomes inactive in election
  2. When receiving ELECTION message:
    • Send OK response to sender
    • Start new election if not already in one
  3. When receiving COORDINATOR message:
    • Update leader information

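A compact simulation of these rules in Python, with an `alive` set standing in for failure detection and synchronous calls standing in for ELECTION/OK/COORDINATOR messages; both are assumptions for illustration.

```python
def bully_election(initiator: int, processes: list, alive: set) -> int:
    """Return the ID of the new coordinator under the Bully rules."""
    higher = [p for p in processes if p > initiator]
    responders = [p for p in higher if p in alive]  # ELECTION -> OK replies
    if not responders:
        return initiator   # no higher process alive: declare self coordinator
    # Any responder takes over; recursing from the highest responder models
    # the election it now starts, ending with a COORDINATOR broadcast.
    return bully_election(max(responders), processes, alive)

procs = [1, 2, 3, 4, 5]
print(bully_election(2, procs, alive={1, 2, 3, 4}))  # 5 has crashed -> 4 wins
```
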
Characteristics:

  • Simple implementation
  • O(n²) worst-case message complexity
  • Works in synchronous systems
  • Vulnerable to repeated failures during election

Ring Algorithm

Processes are arranged in a logical ring, with election messages passed around the ring:

Process:

  1. When a process P detects leader failure:
    • P marks itself as participant
    • P sends ELECTION message containing its ID to its successor
  2. When receiving ELECTION message:
    • If message contains P's ID, P becomes leader
    • If ID in message is higher than P's ID, forward message
    • If ID in message is lower than P's ID, replace with P's ID and forward
  3. After completing ring traversal, highest ID process becomes leader
  4. Leader sends COORDINATOR message around ring

Characteristics:

  • O(n) to O(n²) message complexity (depending on variation)
  • Requires stable ring topology
  • More resilient to certain failure patterns than Bully

Paxos Leader Election

Based on the Paxos consensus algorithm, but specialized for leader election:

Process:

  1. Proposer suggests itself as leader with a ballot number (epoch, ID)
  2. Acceptors respond with promises if ballot number is highest seen
  3. If proposer receives majority promises, it sends accept messages
  4. Acceptors acknowledge acceptance if ballot still valid
  5. Process becomes leader when majority accepts

Characteristics:

  • Robust against network partitions
  • Works in asynchronous systems
  • Guaranteed correctness even with partial failures
  • Higher complexity than simpler algorithms

Raft Leader Election

Part of the Raft consensus algorithm, designed for understandability:

Process (the voting rule is sketched after the list):

  1. All nodes start as followers with random election timeout
  2. On timeout, follower becomes candidate and increments term
  3. Candidate votes for itself and requests votes from others
  4. Nodes grant their vote if the candidate's term is at least their own, the candidate's log is at least as up-to-date, and they haven't already voted in this term
  5. Candidate becomes leader if it receives majority votes
  6. Leaders send periodic heartbeats to prevent new elections

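A sketch of the voting rule (steps 3-4) as a single node would apply it; the log up-to-date check is elided here for brevity.

```python
import random

ELECTION_TIMEOUT = (0.15, 0.30)  # seconds; randomized to avoid split votes

class RaftNode:
    def __init__(self):
        self.current_term = 0
        self.voted_for = None
        self.timeout = random.uniform(*ELECTION_TIMEOUT)

    def handle_request_vote(self, candidate_term: int, candidate_id: str) -> bool:
        if candidate_term < self.current_term:
            return False                   # stale candidate: reject
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None          # new term: our vote is available again
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id  # at most one vote per term
            return True
        return False

node = RaftNode()
print(node.handle_request_vote(1, "A"))  # True: first vote in term 1
print(node.handle_request_vote(1, "B"))  # False: already voted this term
```
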
Characteristics:

  • Randomized timeouts prevent split votes
  • Clear role separation (follower, candidate, leader)
  • Integrated with broader consensus mechanism
  • Built-in term numbering for safety

Election in partition-prone systems:

  • Split-brain problem: Network partition can lead to multiple leaders
  • Quorum-based approaches: Require majority agreement to prevent split-brain
  • Fencing tokens: Monotonically increasing identifiers to detect stale leaders
  • Lease mechanisms: Time-bounded leadership grants

Practical considerations:

  • Timeout calibration: Balancing quick detection with false-positive risk
  • Heterogeneous environments: Handling varying process capabilities
  • Network topology: Impact on message delivery and timeouts
  • Stability vs. responsiveness: Preventing frequent re-elections

Leader election algorithms provide critical coordination mechanisms in distributed systems, serving as building blocks for more complex distributed algorithms and ensuring orderly system operation even in the face of failures.


5. Mutual Exclusion

5.1 Centralized Approach

```mermaid
graph TD
    A[Client] -->|Request| C[Coordinator]
    B[Client] -->|Request| C
    C -->|Grant| A
    A -->|Release| C
    C -->|Grant| B
```

The centralized approach to mutual exclusion relies on a single coordinator process that manages access to shared resources, serving as an arbiter for all critical section requests.

Basic algorithm (a coordinator sketch follows the list):

  1. Process P sends a REQUEST message to the coordinator when it wants to enter the critical section
  2. Coordinator either:
    • Grants permission immediately with REPLY if no other process is in the critical section
    • Adds request to a queue if another process is currently in the critical section
  3. After receiving REPLY, process P enters the critical section
  4. When P exits the critical section, it sends RELEASE to the coordinator
  5. Coordinator grants permission to the next process in the queue (if any) with REPLY

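The coordinator's side of this protocol fits in a few lines; in the sketch below, in-process method calls stand in for the REQUEST/REPLY/RELEASE messages.

```python
from collections import deque

class Coordinator:
    """Central arbiter: at most one process holds the critical section."""
    def __init__(self):
        self.holder = None
        self.queue = deque()      # FIFO queue of pending requests

    def request(self, pid) -> bool:
        if self.holder is None:
            self.holder = pid     # REPLY: grant immediately
            return True
        self.queue.append(pid)    # resource busy: queue the request
        return False

    def release(self, pid):
        assert self.holder == pid, "only the holder may RELEASE"
        # Grant to the next queued process, if any (REPLY would be sent here)
        self.holder = self.queue.popleft() if self.queue else None

c = Coordinator()
print(c.request("P1"))   # True: P1 enters the critical section
print(c.request("P2"))   # False: P2 is queued
c.release("P1")          # grant passes to P2
print(c.holder)          # P2
```
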
Message patterns:

  • For each critical section entry: 2 messages (REQUEST, REPLY)
  • For each critical section exit: 1 message (RELEASE)
  • Total per critical section usage: 3 messages

Data structures:

  • Coordinator: Boolean flag indicating if resource is in use, queue of pending requests
  • Process: State variable tracking if request is pending or granted

Queue management policies:

  • FIFO: First-come, first-served (fairness)
  • Priority-based: Higher priority requests served first
  • Deadline-aware: Scheduling based on time constraints

Advantages:

  • Simplicity: Straightforward implementation
  • Low message complexity: O(1) messages per critical section access
  • Fairness: Easy to implement fair scheduling policies
  • Deadlock-free: Centralized control prevents circular wait
  • Low overhead: Minimal state tracking required

Disadvantages:

  • Single point of failure: If coordinator crashes, entire system is affected
  • Performance bottleneck: All requests funnel through one process
  • Vulnerability to coordinator failures: Special recovery procedures needed
  • Network partition sensitivity: Partitions can prevent access to coordinator

Failure handling:

  1. Coordinator failure:

    • Detect coordinator failure (e.g., using timeouts or heartbeats)
    • Run leader election to choose new coordinator
    • Reconstruct queue state (may require processes to resubmit requests)
  2. Process failure:

    • While holding critical section: Requires recovery mechanism
    • Before/after critical section: Coordinator removes from queue if applicable
  3. Message loss:

    • Request lost: Timeout and retry
    • Reply lost: Timeout and retry, with duplicate detection
    • Release lost: Timeout at coordinator to handle stuck critical sections

Practical implementations:

  • Lock servers: Dedicated services for distributed locking
  • Zookeeper: Using ephemeral nodes for locks with automatic release
  • Database locks: Using central database for coordination
  • Redis-based locks: Using atomic operations in Redis

Optimizations:

  • Token passing within queue: Reduce coordinator involvement
  • Batching of requests/releases: Reduce message overhead
  • Hierarchical coordination: Multiple coordinators for scaling

The centralized approach represents the simplest solution to distributed mutual exclusion but trades fault tolerance and scalability for simplicity and low message overhead.

5.2 Distributed Approaches

Distributed approaches to mutual exclusion eliminate the central coordinator, distributing the responsibility of coordination among all participating processes to improve fault tolerance.

```mermaid
graph TD
    A[P1] -->|Request| B[P2]
    A -->|Request| C[P3]
    B -->|Reply| A
    C -->|Reply| A
    A --> CS((Critical Section))
```

Lamport's Distributed Mutual Exclusion Algorithm

Based on Lamport's logical clocks, this algorithm uses timestamps to establish a total ordering of critical section requests:

Algorithm steps:

  1. When process Pi wants to enter critical section:

    • Increment logical clock
    • Send timestamped REQUEST(ts, i) to all processes (including itself)
    • Place request in local request queue
  2. When process Pj receives REQUEST(ts, i):

    • Place request in local request queue
    • Send REPLY to Pi
  3. Process Pi enters critical section when:

    • Its request has the lowest timestamp in its queue
    • It has received a message (REPLY or later timestamped REQUEST) from every other process
  4. After exiting critical section:

    • Remove request from local queue
    • Send RELEASE to all processes
  5. When process Pj receives RELEASE from Pi:

    • Remove Pi's request from local queue

Message analysis:

  • Enter critical section: N REQUEST messages + (N-1) REPLY messages = 2N-1
  • Exit critical section: N-1 RELEASE messages
  • Total per critical section: 3(N-1) messages

Correctness properties:

  • Mutual exclusion: Guaranteed by total ordering of timestamps
  • Deadlock-free: Lowest timestamp always proceeds
  • Starvation-free: FIFO service based on timestamps

Advantages:

  • No single point of failure
  • Symmetric algorithm (all processes run same code)
  • Fair (first-come, first-served based on timestamps)

Disadvantages:

  • High message complexity: O(N) messages per critical section
  • Vulnerability to process failures
  • All processes involved in every critical section access

Ricart and Agrawala's Algorithm

An optimization of Lamport's algorithm that reduces message complexity:

Algorithm steps (the deferral rule is sketched after the list):

  1. When process Pi wants to enter critical section:

    • Increment logical clock
    • Send timestamped REQUEST(ts, i) to all other processes
    • Wait for REPLY from all other processes
  2. When process Pj receives REQUEST(ts, i):

    • If Pj is not interested in critical section or Pi's request has priority (lower timestamp or same timestamp but lower process ID):
      • Send REPLY immediately
    • Otherwise:
      • Defer REPLY until Pj exits critical section
  3. Process Pi enters critical section when:

    • It has received REPLY from all other processes
  4. After exiting critical section:

    • Send deferred REPLY messages to pending requests

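The heart of the algorithm is the deferral decision in step 2; a sketch of just that rule, with message transport abstracted away:

```python
def reply_immediately(my_state: str, my_ts: int, my_id: int,
                      req_ts: int, req_id: int) -> bool:
    """Step 2: reply now unless we hold or want the critical section
    with higher priority (lower timestamp; ties broken by process ID)."""
    if my_state == "RELEASED":   # not interested in the critical section
        return True
    if my_state == "HELD":       # currently inside: defer until exit
        return False
    # my_state == "WANTED": compare (timestamp, id) pairs lexicographically
    return (req_ts, req_id) < (my_ts, my_id)

# P1 requested at ts=5; an incoming request stamped ts=3 has priority:
print(reply_immediately("WANTED", 5, 1, 3, 2))   # True: reply now
print(reply_immediately("WANTED", 5, 1, 7, 2))   # False: defer the reply
```
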
Message analysis:

  • Enter critical section: (N-1) REQUEST messages + (N-1) REPLY messages = 2(N-1)
  • Exit critical section: 0 messages (replies are sent as needed)
  • Total per critical section: 2(N-1) messages

Advantages:

  • Reduced message complexity compared to Lamport's algorithm
  • No explicit RELEASE messages

Disadvantages:

  • Still O(N) message complexity
  • All processes must respond for critical section entry
  • Vulnerable to process failures

Maekawa's Algorithm

A quorum-based approach that significantly reduces message complexity:

Key idea:

  • Each process has a "voting set" (quorum) of processes
  • Any two voting sets must have a non-empty intersection
  • Permission needed only from voting set, not all processes

Voting set properties:

  • Each process belongs to K voting sets
  • Each voting set has K processes
  • Optimal K = √N (approximately)

Algorithm steps:

  1. When process Pi wants to enter critical section:

    • Send REQUEST to all processes in its voting set
    • Wait for GRANT from all processes in voting set
  2. When process Pj receives REQUEST from Pi:

    • If Pj hasn't sent GRANT to anyone:
      • Send GRANT to Pi
    • Otherwise:
      • Queue Pi's request
  3. Process Pi enters critical section when:

    • It has received GRANT from all processes in its voting set
  4. After exiting critical section:

    • Send RELEASE to all processes in voting set
  5. When process Pj receives RELEASE:

    • Send GRANT to next queued request (if any)

Message analysis:

  • Enter critical section: K REQUEST + K GRANT messages = 2K
  • Exit critical section: K RELEASE messages
  • Total per critical section: 3K messages (approximately 3√N)

Advantages:

  • Significantly reduced message complexity: O(√N) instead of O(N)
  • Distributed control

Disadvantages:

  • Complex voting set construction
  • Potential for deadlock in basic algorithm
  • More complex failure handling

Distributed approaches to mutual exclusion eliminate single points of failure but typically require more complex algorithms and higher message counts than centralized approaches.

5.3 Token-Based Approaches

Token-based mutual exclusion algorithms use the possession of a unique token to grant the right to enter the critical section, eliminating the need for request-grant protocols.

```mermaid
graph LR
    A -->|Token| B
    B -->|Token| C
    C -->|Token| D
    D -->|Token| A
```

Basic concept:

  • Only one token exists in the system
  • Only the process holding the token may enter the critical section
  • Token is passed between processes
  • No explicit permission needed—token possession is permission

Token Ring Algorithm

Processes are arranged in a logical ring, with the token circulating around the ring:

Algorithm steps:

  1. Token circulates in a fixed direction around the ring
  2. When process Pi receives the token:
    • If Pi wants to enter critical section:
      • Pi retains token and enters critical section
      • After exiting, Pi passes token to next process in ring
    • If Pi does not want to enter:
      • Pi passes token immediately to next process

Message analysis:

  • One token pass per process in ring
  • Average wait time proportional to ring size
  • Single token message circulating continuously

Advantages:

  • Predictable behavior and bounded waiting time
  • Simple implementation
  • Constant message size

Disadvantages:

  • Continuous token circulation even when no requests
  • Single token creates vulnerability to loss
  • Ring maintenance required

Raymond's Tree-Based Algorithm

Processes are arranged in a spanning tree with the token holder at the root:

Data structures:

  • For each process Pi:
    • HOLDER: pointer to neighbor closer to token (parent in tree)
    • USING: boolean indicating if Pi is in critical section
    • ASKED: boolean indicating if Pi has requested token
    • REQUEST_Q: queue of neighbors who have requested token

Algorithm steps:

  1. When process Pi wants to enter critical section:

    • If Pi holds token:
      • Mark USING = true and enter
    • Else:
      • Set ASKED = true
      • Send REQUEST to HOLDER
  2. When process Pi receives REQUEST from Pj:

    • If Pi holds token and not USING:
      • Send token to Pj
      • Update HOLDER = Pj
    • Else if Pi holds token and USING:
      • Add Pj to REQUEST_Q
    • Else:
      • If not ASKED:
        • Set ASKED = true
        • Send REQUEST to HOLDER
      • Add Pj to REQUEST_Q
  3. When process Pi receives token:

    • Set HOLDER = self
    • If ASKED:
      • Enter critical section (USING = true)
    • Else:
      • If REQUEST_Q not empty:
        • Dequeue Pj from REQUEST_Q
        • Send token to Pj
        • Set HOLDER = Pj
  4. When process Pi exits critical section:

    • Set USING = false
    • If REQUEST_Q not empty:
      • Dequeue Pj from REQUEST_Q
      • Send token to Pj
      • Set HOLDER = Pj

Message analysis:

  • O(log N) messages per critical section request (tree height)
  • No messages when token holder wants to enter
  • Dynamically optimizes paths based on usage patterns

Advantages:

  • More efficient than ring algorithm
  • Adapts to usage patterns
  • Lower average waiting time

Disadvantages:

  • More complex than ring algorithm
  • Tree maintenance required
  • Potential for path stretching

Suzuki-Kasami Broadcast Algorithm

A token-based algorithm using sequence numbers to track requests:

Data structures:

  • In each process Pi:
    • RN[1..n]: request sequence numbers for each process
    • LN[1..n]: sequence numbers of last requests satisfied (stored in token)
    • Token: contains privilege and queue

Algorithm steps:

  1. When process Pi wants to enter critical section:

    • Increment RN[i]
    • Broadcast REQUEST(i, RN[i]) to all processes
  2. When process Pj receives REQUEST(i, sn):

    • Set RN[i] = max(RN[i], sn)
    • If Pj has token and not using it, and RN[i] > LN[i]:
      • Send token to Pi
      • Update HOLDER reference
  3. When process Pi receives token:

    • Enter critical section
    • Update LN[i] = RN[i]
  4. When process Pi exits critical section:

    • For each process Pj where RN[j] > LN[j]:
      • Add Pj to token queue
    • If token queue not empty:
      • Dequeue process Pk
      • Send token to Pk
      • Update LN[k] = RN[k]

Message analysis:

  • Request broadcast: N-1 messages
  • Token transfer: 1 message
  • Best case (if already have token): 0 messages
  • Worst case: N messages

Advantages:

  • Efficient when critical section usage is infrequent
  • No explicit release messages
  • Tolerant to dynamic group membership

Disadvantages:

  • Broadcast overhead for requests
  • Queue management complexity

Token-based approaches offer elegant solutions to mutual exclusion that often perform well in practice, particularly in systems with predictable communication patterns or frequent critical section usage.


6. Resource Management

6.1 Global Scheduling

Global scheduling in distributed systems involves coordinating task execution across multiple nodes to optimize system-wide performance, resource utilization, and other objectives.

```mermaid
graph TD
    S[Scheduler] -->|Assign| W1[Worker]
    S -->|Assign| W2[Worker]
    S -->|Assign| W3[Worker]
    Q[Task Queue] --> S
```

Key goals of global scheduling:

  • Resource utilization: Maximize use of available computational resources
  • Throughput: Increase the number of tasks completed per unit time
  • Response time: Minimize time between task submission and completion
  • Fairness: Ensure equitable access to resources
  • Energy efficiency: Minimize power consumption
  • Deadline satisfaction: Meet time constraints for time-critical tasks

Scheduling dimensions:

  1. Static vs. dynamic scheduling:

    • Static: Assignment decisions made prior to execution
      • Based on a priori knowledge of task characteristics
      • Simpler implementation, lower runtime overhead
      • Less adaptive to changing conditions
    • Dynamic: Assignment decisions made during execution
      • Adapts to current system state and load conditions
      • Higher runtime overhead for decision-making
      • Better handles unpredictable workloads
  2. Centralized vs. distributed scheduling:

    • Centralized: Single scheduler makes all decisions
      • Global visibility of system state
      • Potential bottleneck and single point of failure
      • Simpler to implement optimal algorithms
    • Distributed: Multiple schedulers coordinate
      • Better scalability and fault tolerance
      • Partial system visibility
      • More complex coordination required
  3. Cooperative vs. non-cooperative scheduling:

    • Cooperative: Nodes work together for global objectives
      • System-wide optimization possible
      • Requires trust between nodes
    • Non-cooperative: Nodes optimize for local objectives
      • No trust assumptions required
      • May lead to suboptimal global outcomes

Common scheduling algorithms:

  1. First Come, First Served (FCFS):

    • Tasks executed in order of arrival
    • Simple implementation
    • Can lead to the convoy effect: short tasks stuck behind long ones (quantified in the sketch after this list)
  2. Shortest Job First (SJF):

    • Prioritizes tasks with shortest execution time
    • Minimizes average response time
    • Requires accurate execution time estimates
    • Can lead to starvation of larger tasks
  3. Priority-based scheduling:

    • Tasks assigned priorities based on various criteria
    • Higher priority tasks executed first
    • Can incorporate multiple objectives (deadlines, importance, etc.)
    • May require aging mechanisms to prevent starvation
  4. Fair share scheduling:

    • Resources divided equitably among users or groups
    • Historical usage tracked to ensure long-term fairness
    • May temporarily deviate from fairness for efficiency
  5. Deadline-based scheduling:

    • Tasks scheduled to meet time constraints
    • Earliest Deadline First (EDF): Prioritizes nearest deadlines
    • Rate Monotonic: Fixed priorities for periodic tasks, with shorter periods given higher priority
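
The convoy effect is easy to quantify. The sketch below compares average completion time under FCFS and SJF on a single machine; the job durations are invented purely for illustration.

```python
# Compare average completion time under FCFS and SJF.
# Durations (seconds) are illustrative: one long job arrives first.
jobs = [100, 2, 3, 1]

def avg_completion(order):
    t, total = 0, 0
    for duration in order:
        t += duration          # this job finishes at cumulative time t
        total += t
    return total / len(order)

print(f"FCFS: {avg_completion(jobs):.2f}s")          # ~103s: convoy effect
print(f"SJF : {avg_completion(sorted(jobs)):.2f}s")  # ~29s: short jobs first
```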

Scheduling information requirements:

  • Task characteristics: Execution time, resource requirements, dependencies
  • System state: Current load, available resources, queue lengths
  • Performance metrics: Current throughput, response times, utilization
  • Environmental factors: Network conditions, failure probabilities

Practical considerations:

  • Cost of migration: Overhead of moving tasks between nodes
  • Data locality: Proximity of computation to data
  • Communication patterns: Inter-task communication requirements
  • Heterogeneity: Varying capabilities of execution nodes
  • Reliability: Probability of node failures

Implementation approaches:

  1. Queue-based systems:

    • Central or distributed task queues
    • Workers pull from queues based on capacity
    • Examples: Apache Hadoop, traditional batch systems (a minimal pull-model sketch follows this list)
  2. Market-based approaches:

    • Resources priced based on supply and demand
    • Tasks bid for required resources
    • Price mechanisms drive efficient allocation
  3. Constraint satisfaction:

    • Formulate scheduling as constraint problem
    • Apply constraint solvers to find valid allocations
    • Handles complex requirements and dependencies
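
As a toy illustration of the queue-based pull model, the sketch below uses a single-process queue and threads as a stand-in for a distributed task queue and worker nodes; all names are arbitrary.

```python
import queue
import threading

# Workers pull tasks from a shared queue as their capacity frees up.
task_q = queue.Queue()

def worker() -> None:
    while True:
        task = task_q.get()
        if task is None:          # sentinel: shut this worker down
            task_q.task_done()
            return
        task()                    # run the pulled task
        task_q.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):               # submit ten toy tasks
    task_q.put(lambda n=n: print(f"task {n} done"))

task_q.join()                     # block until every task is processed
for _ in workers:                 # then stop the workers
    task_q.put(None)
for w in workers:
    w.join()
```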

Global scheduling provides the foundation for efficient resource utilization in distributed systems, balancing multiple competing objectives while adapting to the dynamic nature of distributed environments.

6.2 Task Assignment

Task assignment focuses on the specific problem of deciding which processor or node should execute each task in a distributed system, considering factors such as load balancing, communication costs, and resource requirements.

```mermaid
flowchart TB
    T1[Task] -->|CPU Intensive| N1[Node]
    T2[Task] -->|GPU| N2[Node]
    T3[Task] -->|Memory| N3[Node]
```

Core task assignment problems:

  1. Mapping: Assigning tasks to processors
  2. Ordering: Determining execution sequence on each processor
  3. Synchronization: Coordinating dependent task execution

Assignment models:

  1. One-time assignment:

    • Tasks permanently assigned to processors
    • Typical in batch processing systems
    • Focus on initial distribution quality
  2. Dynamic reassignment:

    • Tasks can migrate between processors
    • Adapts to changing system conditions
    • Balances migration costs against benefits

Assignment objectives:

  • Minimizing execution time: Fastest completion of tasks
  • Minimizing communication: Reducing network traffic
  • Load balancing: Even distribution of work
  • Resource matching: Placing tasks where their resource needs are best met
  • Energy efficiency: Minimizing power consumption
  • Reliability: Ensuring task completion despite failures

Task characteristics affecting assignment:

  • Computation requirements: CPU, memory, storage, specialized hardware
  • Communication patterns: Which tasks communicate with each other
  • Precedence constraints: Task dependencies and execution order
  • Divisibility: Whether tasks can be partitioned for parallel execution
  • State requirements: Access to specific data or services

Common assignment strategies:

Graph Partitioning

Models tasks and communication as a graph, then partitions to minimize communication while balancing load:

  • Nodes: Represent tasks or computation units
  • Edges: Represent communication or dependencies
  • Edge weights: Communication volume
  • Node weights: Computation requirements
  • Goal: Partition the graph to minimize total cut edge weight while balancing partition sizes (scored in the sketch after the algorithm list)

Algorithms:

  • Kernighan-Lin algorithm
  • Recursive bisection
  • Spectral partitioning
  • Multilevel partitioning (METIS)
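
The objective these partitioners optimize, minimum cut weight subject to balanced load, fits in a few lines of Python; the graph, weights, and candidate partitions below are invented for illustration.

```python
# Score a 2-way partition: total weight of cut edges plus the load split.
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 4, ("a", "d"): 1}
node_weight = {"a": 3, "b": 2, "c": 3, "d": 2}

def evaluate(part_a):
    # An edge is "cut" when its endpoints land in different partitions.
    cut = sum(w for (u, v), w in edges.items() if (u in part_a) != (v in part_a))
    load_a = sum(node_weight[n] for n in part_a)
    load_b = sum(node_weight.values()) - load_a
    return cut, (load_a, load_b)

print(evaluate({"a", "b"}))  # -> (2, (5, 5)): cut weight 2, balanced load
print(evaluate({"a", "d"}))  # -> (9, (5, 5)): balanced but a far worse cut
```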

Task Clustering

Groups heavily communicating tasks together to minimize communication overhead:

  • Tasks clustered based on communication patterns
  • Clusters assigned to processors as units
  • Trade-offs between parallelism and communication overhead

Approaches:

  • Communication volume-based clustering
  • Dependency-based clustering
  • Dominant sequence clustering

List Scheduling

Assigns prioritized tasks to available processors:

  1. Tasks sorted according to priority metric (critical path, etc.)
  2. Each task assigned to the processor that allows the earliest completion (sketched after the metrics below)

Common priority metrics:

  • Task size (longest first)
  • Communication requirements
  • Number of successors
  • Critical path contribution
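
A minimal greedy list scheduler for independent tasks is sketched below, using the longest-first metric; task names and durations are invented for illustration. For independent tasks, the earliest-available processor is also the one allowing earliest completion.

```python
import heapq

# Greedy list scheduling: take tasks in priority order and place each on
# the processor that frees up soonest.
tasks = [("t1", 4), ("t2", 3), ("t3", 2), ("t4", 2), ("t5", 1)]
tasks.sort(key=lambda t: -t[1])          # priority metric: longest first

free_at = [(0, p) for p in range(3)]     # (time processor becomes free, id)
heapq.heapify(free_at)
for name, duration in tasks:
    t, p = heapq.heappop(free_at)        # earliest-available processor
    print(f"{name} on P{p}: [{t}, {t + duration}]")
    heapq.heappush(free_at, (t + duration, p))
```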

Work Stealing

Dynamic load balancing approach where idle processors "steal" work from busy ones:

  • Each processor maintains a double-ended queue (deque) of ready tasks
  • When a processor's deque is empty, it attempts to steal from a randomly chosen victim
  • Thieves take the oldest tasks from the opposite end of the victim's deque, which tend to be the largest
  • Balances load while minimizing task migrations (see the sketch after the considerations below)

Considerations:

  • Stealing granularity (task size)
  • Victim selection policy
  • Locality awareness
  • Overhead of synchronization
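
A single-threaded simulation shows the mechanics; the task names and round-robin driver are illustrative stand-ins for real runtimes (such as Cilk or Java's ForkJoinPool), which run one worker thread per processor.

```python
import collections
import random

# Each "processor" owns a deque of tasks; idle processors steal the
# oldest task from a randomly chosen victim.
deques = [collections.deque() for _ in range(4)]
deques[0].extend(f"task{i}" for i in range(8))   # all work starts on P0

def next_task(p):
    if deques[p]:
        return deques[p].pop()                   # own work: newest first
    victims = [v for v in range(len(deques)) if v != p and deques[v]]
    if victims:
        return deques[random.choice(victims)].popleft()  # steal the oldest
    return None

executed = []
while any(deques):                               # drive until all work drains
    for p in range(4):
        task = next_task(p)
        if task is not None:
            executed.append((p, task))
print(executed)
```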

Advanced assignment techniques:

  1. Machine learning approaches:

    • Predictive task placement based on historical performance
    • Reinforcement learning for adaptive assignment
    • Classification of tasks into placement categories
  2. Container-based assignment:

    • Containerized tasks with specified resource requirements
    • Orchestration systems (Kubernetes, Docker Swarm) handling placement
    • Resource reservation and isolation
  3. Speculative execution:

    • Running multiple copies of slow tasks
    • Taking result from first to complete
    • Handling stragglers in large-scale systems
  4. Co-scheduling:

    • Coordinated scheduling of communicating tasks
    • Minimizing waiting time due to dependencies
    • Aligning execution slots across processors

Effective task assignment is critical for distributed system performance, requiring careful consideration of task characteristics, system capabilities, and assignment objectives.

6.3 Load Balancing

Load balancing distributes workload across multiple computing resources to optimize resource utilization, maximize throughput, minimize response time, and avoid overload of any single resource.

```mermaid
graph LR
    C1[Client] --> LB[Load Balancer]
    C2[Client] --> LB
    LB --> S1[Server]
    LB --> S2[Server]
    LB --> S3[Server]
```

Goals of load balancing:

  • Resource utilization: Making efficient use of all available resources
  • Throughput maximization: Processing more requests per unit time
  • Response time minimization: Reducing user-perceived delays
  • Avoiding hotspots: Preventing resource exhaustion on individual nodes
  • Scalability: Accommodating growing workloads by adding resources
  • Availability: Handling component failures without service disruption

Types of load balancing:

  1. Static load balancing:

    • Fixed distribution of work based on predetermined rules
    • No adaptation to current system state
    • Examples: Round-robin, weighted distribution, hash-based
    • Low runtime overhead, but doesn't adapt to changing conditions
  2. Dynamic load balancing:

    • Distribution adapts based on current system state
    • Monitors actual load and performance metrics
    • Examples: Least-connections, resource-based, predictive
    • Higher overhead but better handles variable workloads
  3. Centralized load balancing:

    • Single component makes all distribution decisions
    • Complete system visibility
    • Potential single point of failure
    • Examples: Traditional load balancer appliances, central dispatchers
  4. Distributed load balancing:

    • Multiple components cooperate to balance load
    • No single point of failure
    • Partial system visibility
    • Examples: DNS-based distribution, anycast, client-side selection

Common load balancing algorithms:

Static Algorithms

  1. Round-Robin:

    • Tasks distributed in circular order among nodes
    • Simple implementation, no need for system state
    • Assumes homogeneous nodes and tasks
    • Variations: Weighted Round-Robin (accommodates node differences)
  2. Hash-Based:

    • Consistent hashing maps requests to nodes
    • Same request always routed to same node (useful for caching)
    • Minimizes redistribution when nodes added/removed
    • Examples: Consistent hashing in distributed caches (a minimal ring sketch follows this list)
  3. Fixed Mapping:

    • Predetermined assignments based on request type
    • Specializes nodes for specific workloads
    • Enables hardware optimization for specific tasks
    • Inflexible to changing workload characteristics
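
A minimal consistent-hash ring with virtual nodes is sketched below; server names and the key format are arbitrary, and MD5 is used only as a convenient stable hash.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: keys map to the next node clockwise on a ring."""

    def __init__(self, nodes, vnodes=100):
        # Each node appears vnodes times on the ring to smooth the spread.
        self._ring = sorted(
            (self._h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, self._h(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.node_for("user:42"))   # the same key always maps to the same node
```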

Dynamic Algorithms

  1. Least Connections:

    • Tasks sent to node with fewest active connections
    • Adapts to varying request processing times
    • Requires continuous connection tracking
    • Variations: Weighted Least Connections (a selection sketch follows this list)
  2. Shortest Queue:

    • Tasks sent to node with shortest job queue
    • Minimizes waiting time for new tasks
    • Requires queue length monitoring
    • May not account for varying job sizes
  3. Fastest Response:

    • Tasks sent to node with lowest response time
    • Adapts to actual processing capacity
    • Requires performance monitoring
    • Automatically accommodates heterogeneous nodes
  4. Resource-Based:

    • Distribution based on current resource utilization
    • Considers CPU, memory, network, disk metrics
    • Prevents resource exhaustion
    • Requires comprehensive monitoring
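
The core of a dynamic balancer is just a selection rule plus bookkeeping, as in this least-connections sketch; server names and counts are illustrative, and a real balancer would update the counts from its connection table.

```python
# Route each request to the server with the fewest active connections.
active = {"server-a": 0, "server-b": 0, "server-c": 0}

def on_request():
    server = min(active, key=active.get)   # least-connections rule
    active[server] += 1                    # connection opened
    return server

def on_complete(server):
    active[server] -= 1                    # connection closed

s = on_request()
print(f"routed to {s}:", active)
on_complete(s)
```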

Specialized Approaches

  1. Two-Phase Algorithms:

    • Phase 1: Nodes classified as senders or receivers based on load
    • Phase 2: Load transferred from senders to receivers
    • Reduces unnecessary migrations
    • Examples: Diffusion methods, dimension exchange
  2. Agent-Based:

    • Mobile agents collect system state and make decisions
    • Reduces central coordination requirements
    • Can adapt to network topology changes
    • Higher implementation complexity
  3. Auction-Based:

    • Nodes bid for tasks based on their capacity
    • Market mechanisms drive efficient allocation
    • Self-organizing and adaptive
    • Examples: Spawn system, computational economies

Implementation mechanisms:

  1. DNS-based load balancing:

    • Multiple IP addresses returned for single domain
    • Client connects to one of the provided addresses
    • Simple but limited control and responsiveness
  2. Transport layer (L4) load balancing:

    • Distributes based on IP address and port
    • Network Address Translation (NAT) or Direct Server Return (DSR)
    • Efficient but application-unaware
  3. Application layer (L7) load balancing:

    • Distribution based on request content (URLs, headers, cookies)
    • Content-aware routing and sticky sessions
    • More intelligent but higher processing overhead
  4. Global server load balancing (GSLB):

    • Distributes load across multiple geographic locations
    • Considers network proximity and regional availability
    • Improves user experience and disaster resilience

Practical considerations:

  • Session persistence: Keeping user connected to same server
  • Health checking: Detecting and avoiding failed nodes
  • Geographic distribution: Routing users to nearby resources
  • Capacity planning: Ensuring sufficient overall capacity
  • Elasticity: Adding/removing resources based on demand
  • Cost optimization: Balancing performance against resource costs

Load balancing is a critical component of distributed systems, ensuring efficient resource utilization while providing the scalability and reliability needed for modern applications.


Summary and Further Reading

Key Takeaways

  1. Operating System Paradigms:

    • Network OS provides resource sharing while maintaining distinct machine boundaries
    • Distributed OS creates a unified system image across multiple machines
    • Middleware serves as a pragmatic middle ground, enabling integration without replacing existing OS
  2. Communication Models:

    • The OSI model provides a conceptual framework for network protocols
    • RPC and RMI enable remote invocation of procedures and methods
    • Message-oriented and stream-oriented communication trade off discrete exchanges against continuous, timing-sensitive data flow
    • Group communication provides abstractions for coordinating multiple processes
  3. Clock Synchronization:

    • Physical clock synchronization (Cristian's, Berkeley, NTP) aligns actual time values
    • Logical clocks (Lamport, Vector) maintain causal ordering without absolute time
    • Leader election algorithms select coordinators for distributed operations
  4. Mutual Exclusion:

    • Centralized approaches offer simplicity but create single points of failure
    • Distributed approaches (Lamport, Ricart-Agrawala, Maekawa) eliminate central coordinator
    • Token-based solutions provide elegant access control with different topologies
  5. Resource Management:

    • Global scheduling coordinates task execution across nodes for system-wide objectives
    • Task assignment maps tasks to processors, weighing load, communication, and resource fit
    • Load balancing spreads work to avoid hotspots and maximize utilization