
Comprehensive Guide to Distributed Systems


Table of Contents

  1. Operating System Paradigms
    • Network Operating Systems
    • Distributed Operating Systems
    • Middleware
  2. Networking Models
    • OSI Reference Model
    • Remote Procedure Call (RPC)
    • Remote Method Invocation (RMI)
  3. Communication Paradigms
    • Message-Oriented Communication
    • Stream-Oriented Communication
    • Group Communication
  4. Clock Synchronization
    • Physical Clock Synchronization
    • Logical Clock Synchronization
    • Leader Election Algorithms
  5. Mutual Exclusion
    • Centralized Approach
    • Distributed Approaches
    • Token-Based Approaches
  6. Resource Management
    • Global Scheduling
    • Task Assignment
    • Load Balancing


1. Operating System Paradigms

1.1 Network Operating Systems

Network Operating Systems (NOS) provide an abstraction layer that enables multiple independent computers to communicate and share resources while maintaining their individual autonomy. Unlike distributed operating systems, NOS maintains clear boundaries between machines.

```mermaid
graph TD
    A[Client] -->|Request| B[Server]
    B -->|Response| A
    C[Client] -->|Request| B
    B -->|Response| C
    D[Printer] --> B
    E[Storage] --> B
```

Key characteristics:

  • Distinct machine boundaries: Each computer retains its own identity, operating system, and resources
  • Explicit network awareness: Users must explicitly connect to remote resources
  • Authentication-based access: Users typically log into specific servers to access shared resources
  • Machine-oriented management: Administration happens at the individual machine level

Examples:

  • Early versions of Novell NetWare
  • Windows Server (with its file and print sharing capabilities)
  • Traditional Unix network services (NFS, NIS)

Architecture components:

  • Network protocols: TCP/IP, IPX/SPX, NetBEUI
  • Network services: File sharing, print services, authentication
  • Client software: Network redirectors, service clients
  • Server software: Resource management, authentication, permission management

NOS functions primarily as a means to share peripherals (printers, storage) and data (files, databases) among connected computers. Users typically need to know where resources are located and explicitly connect to them using network paths or mapped drives.

1.2 Distributed Operating Systems

Distributed Operating Systems (DOS) represent a more advanced paradigm where multiple networked computers function as a unified system, with the operating system abstracting away the distributed nature of the infrastructure.

```mermaid
graph TD
    subgraph "Single System Image"
        A[Process] --> B[Resource Manager]
        C[Process] --> B
        B --> D[CPU Cluster]
        B --> E[Distributed Storage]
        B --> F[Network]
    end
```

Key characteristics:

  • System-wide transparency: Users perceive a single, unified system
  • Automatic resource management: The system handles resource allocation automatically
  • Location transparency: Users don't need to know where resources physically reside
  • Migration transparency: Processes can move between nodes without user awareness
  • Replication transparency: System can replicate data for reliability without user intervention
  • Concurrency transparency: Multiple users can access resources simultaneously

Types of transparency in distributed OS:

| Type | Definition | Example |
| --- | --- | --- |
| Access | Hides differences in data representation | Different CPU architectures sharing data |
| Location | Hides where resources are located | Files accessible via the same path regardless of server |
| Migration | Hides resource movement | Process continues executing after moving to another node |
| Replication | Hides that resources are replicated | System automatically reads from nearest replica |
| Concurrency | Hides that a resource may be shared | Distributed locking handled automatically |
| Failure | Hides failures and recovery | Automatic failover to backup components |
| Performance | System can be reconfigured for performance | Load redistribution based on usage patterns |

Examples:

  • Amoeba (developed at Vrije Universiteit)
  • Sprite (UC Berkeley)
  • Plan 9 (Bell Labs)
  • Inferno (Bell Labs)

A true DOS provides the illusion of a single system image across multiple physical machines. Users interact with a single logical system, with the complexity of distribution handled by the operating system itself. This approach enables seamless resource sharing, better fault tolerance, and dynamic scalability.

1.3 Middleware

Middleware occupies the space between traditional operating systems and applications, providing services that facilitate communication and resource sharing across a distributed system without requiring a fully distributed operating system.

Definition: Software layer that sits between the operating system and applications, providing services that enable distributed systems integration.

```mermaid
graph LR
    A[App] --> B[Middleware]
    B --> C[OS]
    D[App] --> B
    B --> E[Network]
    F[App] --> B
```

Key functions:

  • Masking heterogeneity of hardware, operating systems, and networks
  • Providing uniform, high-level interfaces for developers and applications
  • Offering standardized services like naming, security, and transactions
  • Enabling interoperability between diverse components

Types of middleware:

  1. Communication-oriented middleware:

    • Message-Oriented Middleware (MOM): Systems like IBM MQ, RabbitMQ, Apache Kafka
    • Remote Procedure Call (RPC): ONC RPC, DCE RPC
    • Object Request Brokers: CORBA, DCOM
  2. Application-oriented middleware:

    • Web services middleware: SOAP, REST frameworks
    • Transaction Processing Monitors: CICS, Tuxedo
    • Database middleware: ODBC, JDBC, ORM frameworks
  3. Integration middleware:

    • Enterprise Service Bus (ESB): Mule ESB, WSO2 ESB
    • Enterprise Application Integration (EAI): WebMethods, TIBCO

Middleware services typically include:

  • Naming services: Allow resources to be located by name rather than address
  • Directory services: Like LDAP for resource discovery and organization
  • Security services: Authentication, authorization, encryption
  • Transaction services: ACID properties across distributed resources
  • Concurrency control: Coordination of access to shared resources
  • Event notification: Publish-subscribe mechanisms
  • Data conversion: Transforming data between different formats

Middleware vs. DOS comparison:

| Aspect | Middleware | Distributed OS |
| --- | --- | --- |
| Integration depth | Application-level | Kernel-level |
| Installation | Sits on top of existing OS | Replaces or modifies OS |
| Heterogeneity | Handles diverse systems | Usually requires homogeneity |
| Deployment | Easier to deploy incrementally | Often requires complete system overhaul |
| Transparency | Provides partial transparency | Aims for full transparency |

Middleware represents a pragmatic approach to building distributed systems, allowing heterogeneous systems to interoperate without requiring a completely new operating system. It's particularly valuable in enterprise environments where legacy systems need to be integrated.


2. Networking Models

2.1 OSI Reference Model

The Open Systems Interconnection (OSI) model is a conceptual framework that standardizes the functions of a communication system into seven distinct layers. Though not strictly implemented in most systems, it provides a reference for understanding network protocols and interactions.

```mermaid
graph BT
    A[Application] --> B[Presentation]
    B --> C[Session]
    C --> D[Transport]
    D --> E[Network]
    E --> F[Data Link]
    F --> G[Physical]
```

The Seven Layers:

  1. Physical Layer (Layer 1)

    • Function: Transmission of raw bit streams over physical medium
    • Units: Bits
    • Examples: Ethernet (IEEE 802.3), USB, Bluetooth physical layer
    • Key concepts: Signal encoding, transmission rates, physical connectors
    • Devices: Hubs, repeaters, cables, network interface cards
  2. Data Link Layer (Layer 2)

    • Function: Reliable transmission of data frames between nodes
    • Units: Frames
    • Examples: Ethernet, PPP, HDLC, Wi-Fi (IEEE 802.11)
    • Key concepts: MAC addressing, error detection/correction, flow control
    • Devices: Network switches, bridges
    • Sublayers:
      • Media Access Control (MAC): Hardware addressing and channel access
      • Logical Link Control (LLC): Flow/error control, multiplexing
  3. Network Layer (Layer 3)

    • Function: Routing of packets across networks
    • Units: Packets
    • Examples: IP, ICMP, OSPF, BGP
    • Key concepts: Logical addressing, routing, packet forwarding, fragmentation
    • Devices: Routers
    • Services:
      • Connectionless communication (IP)
      • Connection-oriented (virtual circuits)
  4. Transport Layer (Layer 4)

    • Function: End-to-end communication and data reliability
    • Units: Segments (TCP) or Datagrams (UDP)
    • Examples: TCP, UDP, SCTP
    • Key concepts: Port numbers, connection establishment, flow control, congestion control
    • Services:
      • Reliable delivery (TCP)
      • Best-effort delivery (UDP)
  5. Session Layer (Layer 5)

    • Function: Establishes, manages, and terminates connections
    • Examples: NetBIOS, RPC, SIP
    • Key concepts: Session establishment, maintenance, termination, synchronization
    • Features: Checkpointing, recovery, dialog control
  6. Presentation Layer (Layer 6)

    • Function: Data translation, encryption, compression
    • Examples: SSL/TLS, JPEG, MPEG, ASCII, Unicode
    • Key concepts: Syntax negotiation, data transformation
    • Services:
      • Data encryption/decryption
      • Character set conversion
      • Data compression/decompression
  7. Application Layer (Layer 7)

    • Function: User-facing applications and services
    • Examples: HTTP, FTP, SMTP, DNS, SSH, Telnet
    • Key concepts: Application-specific protocols, user authentication, resource sharing
    • Services: File transfer, email, web browsing, network management

Encapsulation Process:

  • Each layer adds its own header (and sometimes trailer) to the data
  • As data moves down the stack, each layer encapsulates the data from the layer above
  • At the receiving end, each layer strips its headers and passes the payload up

OSI vs. TCP/IP Model:

| OSI Layer | TCP/IP Layer | Protocols |
| --- | --- | --- |
| Application, Presentation, Session | Application | HTTP, FTP, SMTP, DNS |
| Transport | Transport | TCP, UDP |
| Network | Internet | IP, ICMP, ARP |
| Data Link, Physical | Network Interface | Ethernet, Wi-Fi |

The OSI model provides a conceptual framework that helps understand network operations, troubleshoot problems, and design interoperable systems, even though most modern networks are based on the simpler TCP/IP model.

2.2 Remote Procedure Call (RPC)

Remote Procedure Call (RPC) is a communication protocol that enables a program to cause a procedure to execute in another address space (commonly on another computer on a shared network) without explicitly coding the details of the remote interaction.

```mermaid
sequenceDiagram
    Client->>+Stub: Call
    Stub->>+Server: Marshalled Request
    Server->>+Stub: Marshalled Response
    Stub->>Client: Result
```

Key components of RPC:

  1. Client stub: Client-side proxy that makes the remote procedure look local
  2. Server stub: Server-side component that receives the call and invokes the actual procedure
  3. RPC runtime: Middleware that handles the communication details
  4. Interface Definition Language (IDL): Language for defining remote interfaces

RPC operation flow (a code sketch follows the list):

  1. Client application calls a local procedure (client stub)
  2. Client stub packages arguments into a message (marshalling)
  3. Client's RPC runtime transmits the message to the server
  4. Server's RPC runtime receives the message and passes it to the server stub
  5. Server stub unpacks the arguments (unmarshalling)
  6. Server stub calls the actual procedure
  7. Server procedure executes and returns results to the server stub
  8. Server stub packages the results into a message (marshalling)
  9. Server's RPC runtime transmits the message back to the client
  10. Client's RPC runtime receives the message and passes it to the client stub
  11. Client stub unpacks the results (unmarshalling)
  12. Client application receives the results

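To make the flow concrete, here is a minimal sketch in Python, using JSON over TCP as the marshalling format. Everything here (the `add` procedure, port 9000, single-recv framing) is illustrative rather than taken from any real RPC framework.

```python
import json
import socket

def marshal(payload: dict) -> bytes:
    # Package arguments/results into a self-describing message (marshalling)
    return json.dumps(payload).encode()

def unmarshal(data: bytes) -> dict:
    return json.loads(data.decode())

class ClientStub:
    """Client-side proxy that makes the remote procedure look local."""
    def __init__(self, host: str, port: int):
        self.addr = (host, port)

    def call(self, func_name: str, *args):
        with socket.create_connection(self.addr) as sock:
            sock.sendall(marshal({"func": func_name, "args": list(args)}))  # steps 2-3
            reply = unmarshal(sock.recv(4096))                              # steps 10-11
        return reply["result"]

def serve(port: int, procedures: dict):
    """Server stub + runtime: receive, unmarshal, dispatch, marshal the reply."""
    with socket.socket() as srv:
        srv.bind(("", port))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            with conn:
                request = unmarshal(conn.recv(4096))                    # steps 4-5
                result = procedures[request["func"]](*request["args"])  # steps 6-7
                conn.sendall(marshal({"result": result}))               # steps 8-9

# Usage: run serve(9000, {"add": lambda a, b: a + b}) in one process;
# ClientStub("localhost", 9000).call("add", 2, 3) then returns 5.
```
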
RPC binding process:

  • Static binding: Server location known at compile time
  • Dynamic binding: Server location determined at runtime via name service
  • Auto-binding: System automatically selects an appropriate server

RPC semantics:

  • At-most-once: Guarantees no duplication, may fail silently
  • At-least-once: Guarantees execution, may duplicate
  • Exactly-once: Ideal but hard to achieve in distributed systems
  • Maybe: No guarantees (best effort)

Common RPC implementations:

  • ONC RPC (Open Network Computing): Developed by Sun, used in NFS
  • DCE RPC (Distributed Computing Environment): By Open Software Foundation
  • gRPC: Modern, high-performance RPC framework by Google
  • XML-RPC: HTTP-based, uses XML for encoding
  • JSON-RPC: HTTP-based, uses JSON for encoding

Challenges in RPC:

  • Partial failures: One system may fail while others continue
  • Latency: Remote calls take longer than local calls
  • Parameter passing: Complexity with reference parameters
  • Security: Authentication and encryption concerns
  • Heterogeneity: Different data representations across systems

RPC remains a fundamental building block for distributed systems, providing a programming abstraction that makes remote service invocation appear similar to local procedure calls.

2.3 Remote Method Invocation (RMI)

Remote Method Invocation (RMI) extends the concept of Remote Procedure Call to object-oriented systems, allowing objects in one Java Virtual Machine (JVM) to invoke methods on objects in another JVM.

```mermaid
classDiagram
    class Client {
        +Stub proxy
    }
    class Stub {
        +marshal()
        +unmarshal()
    }
    class Skeleton {
        +dispatch()
    }
    class Server {
        +remoteMethod()
    }
    Client --> Stub
    Stub --> Skeleton
    Skeleton --> Server
```

Key characteristics of RMI:

  • Object-oriented: Operates on remote objects rather than procedures
  • Language-specific: Primarily a Java technology (though similar concepts exist in other languages)
  • Parameter passing: Supports passing objects by value or reference
  • Distributed garbage collection: Manages remote object lifecycles

RMI architecture components:

  1. Client: Application that invokes methods on remote objects
  2. Client-side stub: Proxy representing the remote object locally
  3. RMI Registry: Naming service for looking up remote objects
  4. Server-side skeleton: Receives calls and forwards to implementation
  5. Remote object implementation: Actual object implementation

RMI operation flow:

  1. Server registers remote object with RMI registry
  2. Client looks up remote object in registry
  3. Registry returns a stub for the remote object to client
  4. Client invokes methods on the stub as if it were local
  5. Stub marshals parameters and forwards call to server
  6. Server-side skeleton unmarshals parameters and invokes method
  7. Method executes and returns results
  8. Skeleton marshals results and sends back to client
  9. Client stub unmarshals results and returns to client application

RMI vs. RPC:

| Aspect | RMI | RPC |
| --- | --- | --- |
| Paradigm | Object-oriented | Procedure-oriented |
| Language support | Java-specific | Language-neutral |
| Parameter passing | Objects (by value or reference) | Basic data types |
| Dynamic downloading | Supports code mobility | Generally no |
| Garbage collection | Distributed GC | No GC |
| Interface definition | Java interfaces | IDL specifications |
| Naming service | RMI Registry | Various name services |

Advanced RMI features:

  • Dynamic class loading: Client can download implementation classes
  • Distributed garbage collection: Automatic management of remote references
  • Activation framework: On-demand activation of server objects
  • Custom socket factories: For security and custom transport protocols
  • Security manager: Controls access to resources

RMI over IIOP: RMI can also operate over the Internet Inter-ORB Protocol (IIOP), enabling Java RMI to interoperate with CORBA systems, referred to as RMI-IIOP.

RMI limitations:

  • Java-centricity: Limited interoperability with other languages
  • Performance overhead: Serialization and network latency
  • Network dependence: Vulnerable to network failures
  • Firewall issues: May be blocked by restrictive firewalls

RMI represents an elegant object-oriented approach to distributed computing, allowing developers to work with remote objects using familiar object-oriented programming techniques.


3. Communication Paradigms

3.1 Message-Oriented Communication

Message-Oriented Communication is a paradigm where applications communicate by sending discrete messages to each other, typically through intermediary message-queuing systems or message brokers.

```mermaid
graph LR
    P[Publisher] --> T[Topic]
    S1[Subscriber] --> T
    S2[Subscriber] --> T
    Q[Queue] --> C1[Consumer]
    Q --> C2[Consumer]
```

Key characteristics:

  • Asynchronous: Sender doesn't wait for receiver to process message
  • Loosely coupled: Minimal dependencies between components
  • Discrete messages: Information exchanged in self-contained units
  • Store-and-forward: Messages can be stored until delivery is possible
  • Middleware-mediated: Often uses dedicated message-oriented middleware

Core components:

  1. Message: Self-contained unit of information with:

    • Header: Metadata about the message (routing, priority, expiration)
    • Body: Actual payload/content
    • Properties: Additional application-specific metadata
  2. Queue: Storage structure for messages

    • Point-to-point: One sender, one receiver
    • Publish-subscribe: One sender, multiple receivers
  3. Message broker/MOM: Middleware that manages message routing, delivery, and persistence

    • Ensures reliable delivery
    • Handles routing and transformations
    • Provides persistence and transactions

Common interaction patterns (see the sketch after this list):

  1. Request-reply: Two-way communication pattern

    • Client sends request, includes reply address
    • Server processes request, sends response to reply address
  2. Publish-subscribe: One-to-many communication pattern

    • Publishers send messages to topics
    • Subscribers receive messages from topics they've subscribed to
    • Loose coupling between publishers and subscribers
  3. Message queuing: Reliable point-to-point communication

    • Messages stored until consumer retrieves them
    • Provides temporal decoupling
    • Supports load balancing across multiple consumers
  4. Broadcast/multicast: Sending to multiple recipients simultaneously

    • More efficient than multiple point-to-point sends
    • Can be implemented via IP multicast or application-level distribution

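To illustrate the queuing and publish-subscribe patterns above, here is a toy in-memory broker in Python. Real brokers (RabbitMQ, Kafka, and the like) add networking, persistence, and acknowledgments; every name below is illustrative.

```python
from collections import defaultdict, deque
from typing import Callable

class MiniBroker:
    """Toy broker: topics fan out to every subscriber;
    queues hand each message to exactly one consumer."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> subscriber callbacks
        self.queues = defaultdict(deque)  # queue -> stored messages

    # Publish-subscribe: one-to-many
    def subscribe(self, topic: str, callback: Callable):
        self.topics[topic].append(callback)

    def publish(self, topic: str, message):
        for callback in self.topics[topic]:
            callback(message)

    # Message queuing: point-to-point with temporal decoupling
    def enqueue(self, queue: str, message):
        self.queues[queue].append(message)      # stored until retrieved

    def consume(self, queue: str):
        q = self.queues[queue]
        return q.popleft() if q else None

broker = MiniBroker()
broker.subscribe("orders", lambda m: print("billing saw:", m))
broker.subscribe("orders", lambda m: print("shipping saw:", m))
broker.publish("orders", {"id": 1})   # both subscribers receive it
broker.enqueue("work", "task-1")      # waits even if no consumer is active yet
print(broker.consume("work"))         # exactly one consumer gets it
```
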
Messaging guarantees:

  • At-most-once: Message delivered once or not at all
  • At-least-once: Message delivered one or more times
  • Exactly-once: Message delivered exactly once (hardest to implement)

Message-oriented middleware platforms:

  • Apache Kafka: High-throughput distributed streaming platform
  • RabbitMQ: Feature-rich message broker implementing AMQP
  • IBM MQ: Enterprise messaging solution with strong transaction support
  • Apache ActiveMQ: Open-source message broker with multiple protocols
  • Amazon SQS/SNS: Cloud-based queuing and notification services

Advantages of message-oriented communication:

  • Temporal decoupling: Components don't need to be active simultaneously
  • Spatial decoupling: Components don't need to know each other's locations
  • Load balancing: Natural distribution of work across consumers
  • Fault tolerance: Messages persist even if receivers are unavailable
  • Scalability: Easy to add producers or consumers

Challenges and considerations:

  • Complexity: Additional infrastructure requirements
  • Latency: Potentially higher compared to direct communication
  • Ordering: Ensuring message sequence when required
  • Idempotency: Handling duplicate message processing
  • Transactionality: Managing transactions across multiple messages

Message-oriented communication is particularly valuable in distributed systems where reliability, scalability, and loose coupling are important design goals.

3.2 Stream-Oriented Communication

Stream-Oriented Communication is a paradigm where applications exchange data as continuous sequences of bytes rather than discrete messages, typically using a connected transport protocol like TCP.

```mermaid
graph LR
    A[Client] -- TCP --> B[Server]
    B -- Media Stream --> A
    C[Client] -- TCP --> B
```

Key characteristics:

  • Connection-based: Requires established connection before communication
  • Continuous data flow: Data transmitted as an uninterrupted sequence
  • Byte-oriented: Fundamental unit is a byte rather than a message
  • Ordered delivery: Bytes arrive in the same order they were sent
  • Bi-directional: Communication typically flows in both directions

Core concepts:

  1. Connection establishment: Three-way handshake (in TCP)

    • Client sends SYN packet
    • Server responds with SYN-ACK
    • Client confirms with ACK
  2. Data transfer: Sending bytes through established connection

    • Data segmented into packets for transmission
    • Sequence numbers ensure correct ordering
    • Acknowledgments confirm reception
  3. Flow control: Preventing sender from overwhelming receiver

    • Sliding window protocols
    • Receive window advertisements
  4. Congestion control: Preventing network overload

    • Slow start
    • Congestion avoidance
    • Fast retransmit/recovery
  5. Connection termination: Four-way handshake (in TCP)

    • Initiator sends FIN
    • Receiver acknowledges with ACK
    • Receiver sends FIN when ready
    • Initiator acknowledges with ACK

Transport protocols for stream communication:

  • TCP (Transmission Control Protocol):

    • Reliable, ordered, error-checked
    • Connection-oriented
    • Flow control and congestion control
    • Most common stream protocol
  • SCTP (Stream Control Transmission Protocol):

    • Multi-streaming and multi-homing capabilities
    • Message-oriented but over a connection
    • Partial reliability option
    • Used in telecommunications
  • WebSockets:

    • Full-duplex communication over TCP
    • Web-friendly protocol
    • Minimal overhead after connection
    • Supports sub-protocols

Application-level protocols over streams:

  • HTTP/1.1: Request-response protocol with persistent connections
  • HTTP/2: Multiplexed streams over a single connection
  • TLS/SSL: Security layer providing encryption
  • SSH: Secure shell protocol for remote operations
  • SMTP: Email transmission protocol

Framing in stream communication:

Since streams don't have inherent message boundaries, applications must implement framing strategies (a length-prefixed example follows the list):

  1. Length-prefixed: Prepend message size before content
  2. Delimiter-based: Use special characters to mark boundaries
  3. Fixed-length: Use predetermined message sizes
  4. TLV (Type-Length-Value): Structured approach with headers

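A sketch of strategy 1 (length-prefixed framing) in Python; the 4-byte big-endian header is one common convention, not anything mandated by TCP.

```python
import struct

HEADER = struct.Struct(">I")  # 4-byte big-endian unsigned message length

def send_message(sock, payload: bytes):
    # Prepend the size so the receiver knows where the message ends
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    # TCP is a byte stream: a single recv() may return a partial message
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf

def recv_message(sock) -> bytes:
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```
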
Advantages of stream-oriented communication:

  • Efficiency: Lower overhead for large data transfers
  • Reliability: Built-in mechanisms for reliable delivery
  • Ordered delivery: Guaranteed sequencing
  • Flow control: Prevents overwhelming receivers
  • Congestion awareness: Adapts to network conditions

Challenges and considerations:

  • Connection management: Overhead of establishing/maintaining connections
  • Head-of-line blocking: One lost packet blocks entire stream
  • Stateful nature: More complex failure handling
  • Latency: Connection establishment adds delay
  • Message boundaries: Must be implemented at application level

Stream-oriented communication is well-suited for applications requiring reliable, ordered data transfer, especially for larger data volumes or continuous interactions.

3.3 Group Communication

Group Communication involves sending messages to multiple recipients simultaneously, providing abstractions for coordinating distributed processes that need to work together as a logical group.

```mermaid
graph TD
    A[Member] -->|Multicast| B[Group]
    B --> C[Member]
    B --> D[Member]
    B --> E[Member]
```

Key characteristics:

  • One-to-many or many-to-many: Communication involving multiple participants
  • Group abstraction: Collection of processes treated as a logical unit
  • Membership management: Tracking who belongs to which groups
  • Ordering guarantees: Various delivery ordering semantics
  • Fault tolerance: Mechanisms to handle process failures

Fundamental concepts:

  1. Group membership:

    • Static groups: Fixed membership known at design time
    • Dynamic groups: Processes can join/leave during operation
    • Membership protocols: Algorithms to maintain consistent group view
    • Failure detection: Identifying crashed or disconnected members
  2. Message ordering guarantees:

    • FIFO (First-In-First-Out): Messages from same sender delivered in sending order
    • Causal ordering: If message A could have caused message B, A is delivered before B
    • Total ordering: All processes receive all messages in identical order
    • Atomic ordering: Either all correct processes deliver a message or none do
  3. Delivery semantics:

    • Best-effort: No guarantees about delivery
    • Reliable: All non-faulty members eventually receive the message
    • Atomic: All members receive the message or none do
    • Uniform: If any process (even faulty) delivers a message, all correct processes deliver it

Group communication patterns:

  1. Broadcast/multicast:

    • IP multicast: Network-level distribution (limited deployment)
    • Application-level multicast: Overlay networks for distribution
    • Gossip protocols: Epidemic-style dissemination
  2. Publish-subscribe:

    • Topic-based: Subscribers register interest in specific topics
    • Content-based: Subscribers specify predicates on message content
    • Type-based: Subscribers receive messages of specific types
  3. Active replication:

    • All replicas process all requests
    • Requires total order broadcast
    • State machine replication approach
  4. Virtual synchrony:

    • Provides illusion of synchronous execution
    • Group membership changes appear between message deliveries
    • Simplifies distributed algorithm design

Group communication frameworks and systems:

  • Isis Toolkit: One of the first group communication systems
  • Spread: High-performance group communication toolkit
  • JGroups: Java-based reliable group communication library
  • Apache ZooKeeper: Coordination service with group primitives
  • etcd: Distributed key-value store with watch functionality

Implementation approaches (a sequencer sketch follows the list):

  1. Centralized coordinator:

    • Sequencer assigns order to messages
    • Single point of failure
    • Simpler implementation
  2. Distributed agreement:

    • Consensus protocols (e.g., Paxos, Raft)
    • Higher fault tolerance
    • More complex implementation
  3. Gossip-based dissemination:

    • Probabilistic guarantees
    • Highly scalable
    • Eventually consistent

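As a concrete illustration of approach 1 (a centralized sequencer), the sketch below stamps every message with a global sequence number; members deliver strictly in stamp order, so all of them see an identical (total) order. Networking and sequencer failover are omitted.

```python
import itertools

class Sequencer:
    """Centralized coordinator: assigns each message a global sequence number."""
    def __init__(self):
        self._counter = itertools.count(1)

    def stamp(self, sender: str, payload):
        return (next(self._counter), sender, payload)

class Member:
    def __init__(self, name: str):
        self.name = name
        self.next_expected = 1
        self.pending = {}   # hold-back queue for out-of-order arrivals

    def on_receive(self, stamped):
        seq, sender, payload = stamped
        self.pending[seq] = (sender, payload)
        while self.next_expected in self.pending:   # deliver in stamp order
            sender, payload = self.pending.pop(self.next_expected)
            print(f"{self.name} delivers #{self.next_expected} from {sender}: {payload}")
            self.next_expected += 1

seq = Sequencer()
alice, bob = Member("alice"), Member("bob")
m1, m2 = seq.stamp("p1", "hello"), seq.stamp("p2", "world")
for m in (m2, m1):        # even if messages arrive out of order...
    alice.on_receive(m)   # ...every member delivers them in the same order
for m in (m1, m2):
    bob.on_receive(m)
```
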
Applications of group communication:

  • Replicated databases: Maintaining consistent replicas
  • Distributed caches: Cache invalidation and updates
  • Cluster management: Coordinating server clusters
  • Collaborative applications: Real-time collaboration tools
  • Distributed monitoring: Event notification systems

Challenges and considerations:

  • Scalability: Many protocols don't scale to large groups
  • Network partitions: Handling split-brain scenarios
  • Performance overhead: Ensuring ordering guarantees adds complexity
  • Failure modes: Distinguishing between process and network failures
  • Consistency vs. availability: Fundamental tradeoffs

Group communication provides powerful abstractions for building fault-tolerant distributed systems, especially where coordination among multiple processes is essential.


4. Clock Synchronization

4.1 Physical Clock Synchronization

Physical clock synchronization involves aligning the time values of physical clocks across distributed systems to ensure coordinated operations, enabling correct event ordering and time-based operations.


Need for clock synchronization:

  • Event ordering: Determining the sequence of distributed events
  • Distributed algorithms: Many require synchronized notion of time
  • Timestamps: For logging, auditing, and debugging
  • Time-based coordination: Scheduling activities across systems
  • Security protocols: Authentication and cryptographic operations

Clock basics:

  • Clock drift: Tendency for clocks to run at slightly different rates
    • Typically 1-100 parts per million (ppm)
    • Results in growing divergence over time
  • Clock skew: Difference between two clocks at a given moment
  • UTC (Coordinated Universal Time): Standard reference time

Challenges in clock synchronization:

  • Network latency: Variable message transit times
  • Asymmetric paths: Different delays in different directions
  • Process scheduling delays: Unpredictable execution timing
  • Temperature effects: Environmental impacts on oscillator frequency
  • Reference source accuracy: Quality of time standard

Main synchronization algorithms:

Cristian's Algorithm

A centralized approach where clients synchronize with a time server (sketched in code after the steps):

  1. Client sends request to time server
  2. Server responds with its current time T
  3. Client records round-trip time RTT
  4. Client sets its time to T + RTT/2 (estimated one-way delay)

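A minimal sketch of these steps in Python, assuming a `get_server_time()` callable that performs the request/response exchange; error bounds and retries are omitted.

```python
import time

def cristian_sync(get_server_time):
    """Estimate the server's current time, compensating for network delay."""
    t0 = time.monotonic()
    server_time = get_server_time()   # steps 1-2: request and response
    rtt = time.monotonic() - t0       # step 3: measured round-trip time
    # Step 4: assume symmetric paths, so one-way delay is about RTT/2
    return server_time + rtt / 2
```
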
Characteristics:

  • Simple implementation
  • Accuracy limited by RTT and path asymmetry
  • Vulnerable to server failures
  • Error bounds can be calculated as ±(RTT/2 - min_delay)

Berkeley Algorithm

A decentralized approach where a coordinator polls other machines and calculates a time adjustment (sketched in code after the steps):

  1. Coordinator polls all machines for their time
  2. Coordinator calculates average time (often excluding outliers)
  3. Coordinator sends relative adjustments to each machine
  4. Machines adjust their clocks accordingly

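A sketch of the coordinator's averaging step (steps 2-3), given clock readings it has already collected; real implementations also compensate for polling delays.

```python
def berkeley_adjustments(readings: dict, max_skew: float = 1.0) -> dict:
    """Given {machine: clock_reading}, return {machine: relative adjustment}.
    Readings further than max_skew from the median are treated as outliers."""
    values = sorted(readings.values())
    median = values[len(values) // 2]
    good = [v for v in readings.values() if abs(v - median) <= max_skew]
    average = sum(good) / len(good)
    # Step 3: each machine gets a relative correction, not an absolute time
    return {name: average - value for name, value in readings.items()}

print(berkeley_adjustments({"a": 100.0, "b": 100.4, "c": 99.8, "d": 250.0}))
# 'd' is a faulty outlier; a, b, c converge toward ~100.07
```
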
Characteristics:

  • Works with faulty time sources
  • Achieves internal synchronization (relative to group)
  • Less dependent on absolute time source
  • Better fault tolerance than Cristian's

Network Time Protocol (NTP)

A hierarchical, fault-tolerant protocol for synchronizing clocks over variable-latency networks:

  1. Stratum levels:

    • Stratum 0: Reference clocks (atomic clocks, GPS)
    • Stratum 1: Servers directly connected to reference clocks
    • Stratum 2: Clients of stratum 1 servers
    • And so on (higher numbers = further from reference)
  2. Synchronization modes:

    • Client/server: Periodic polling of NTP servers
    • Symmetric: Peer-to-peer synchronization
    • Broadcast/multicast: One-to-many updates
  3. Clock selection algorithm:

    • Polls multiple servers
    • Filters outliers and falsetickers
    • Combines remaining sources using weighted average
  4. Clock discipline algorithm:

    • Phase-locked loop (PLL) for frequency stability
    • Gradual adjustments to avoid time steps
    • Handles both drift and offset correction

Characteristics:

  • Highly accurate (within milliseconds over internet, microseconds on LANs)
  • Robust against failures and Byzantine behavior
  • Scalable hierarchical design
  • Handles asymmetric paths and variable latency
  • Widely deployed internet standard

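At the heart of NTP's client/server mode is a standard on-wire calculation over four timestamps: client send (t0), server receive (t1), server send (t2), and client receive (t3).

```python
def ntp_offset_delay(t0: float, t1: float, t2: float, t3: float):
    """Standard NTP on-wire calculation.
    offset: estimated difference between server and client clocks
    delay:  round-trip network delay, excluding server processing time"""
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Client sends at 10.0, server receives at 12.0, replies at 12.1,
# client receives at 10.3 -> client is ~1.9 s behind; 0.2 s network delay:
print(ntp_offset_delay(10.0, 12.0, 12.1, 10.3))  # (1.9, 0.2)
```
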
GPS-based synchronization:

Modern systems often use GPS receivers for precision timing:

  • Accuracy typically within 100 nanoseconds of UTC
  • Requires antenna with clear sky view
  • Independent of network conditions
  • Often used as stratum 0/1 sources for NTP

Precision Time Protocol (PTP/IEEE 1588):

Designed for high-precision local network synchronization:

  • Hardware timestamping capabilities
  • Sub-microsecond accuracy
  • Path delay measurement
  • Best suited for industrial and technical applications

Physical clock synchronization remains a fundamental challenge in distributed systems, balancing practical considerations of accuracy, cost, and complexity against application requirements.

4.2 Logical Clock Synchronization

Logical clock synchronization focuses on maintaining consistent event ordering rather than absolute time values, providing a framework for reasoning about causality in distributed systems.

Key concepts:

  • Happened-before relation: Denoted a → b, meaning event a could have influenced event b
  • Concurrent events: Events that are not causally related (neither a → b nor b → a)
  • Causality: Capturing potential cause-effect relationships between events
  • Logical time: Abstract notion of time based on event ordering

Types of logical clocks:

Lamport's Logical Clocks

Developed by Leslie Lamport, these scalar clocks assign timestamps to events in a way that respects causality:

Rules (sketched in code after the list):

  1. When process P executes an event, increment P's logical clock
  2. When process P sends a message m, include P's logical clock timestamp T
  3. When process Q receives message m with timestamp T, set Q's logical clock to max(Q's clock, T) + 1

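The three rules translate almost line-for-line into code; a minimal sketch:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Rule 1: any local event increments the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Rule 2: increment, then attach the timestamp to the message."""
        return self.tick()

    def receive(self, msg_timestamp: int):
        """Rule 3: jump past both our clock and the sender's timestamp."""
        self.time = max(self.time, msg_timestamp) + 1
        return self.time

p, q = LamportClock(), LamportClock()
ts = p.send()      # P's clock: 1
q.receive(ts)      # Q's clock: max(0, 1) + 1 = 2, so C(send) < C(receive)
```
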
Properties:

  • Monotonicity: Clock values always increase
  • Causality: If a → b, then C(a) < C(b) where C() is the clock function
  • Weak clock condition: Cannot determine if concurrent events are causally related

Data structure:

  • Single integer per process

Limitations:

  • Cannot detect concurrent events
  • Only partial ordering of events
  • No information about relationships between concurrent events

Vector Clocks

An extension of Lamport clocks that maintains vector timestamps, enabling detection of concurrent events:

Rules:

  1. Each process Pi maintains a vector Vi of size n (where n is the number of processes)
  2. Initially, Vi[j] = 0 for all i, j
  3. When Pi executes an internal event, increment Vi[i]
  4. When Pi sends a message, increment Vi[i] and send entire vector Vi with message
  5. When Pi receives a message with vector V, update Vi[j] = max(Vi[j], V[j]) for all j, then increment Vi[i]

Vector Clock Basics Recap:

For n processes, each process keeps a vector of length n.

  • Vi[j] means: What process i knows about process j's time.

  • Helps capture causal relationships between events.

  • If V(e1) < V(e2) → e1 happened before e2.

  • If neither V(e1) ≤ V(e2) nor V(e2) ≤ V(e1) → events are concurrent.


🧠 **Example: Three Processes P1, P2, P3**

Each process keeps a vector [P1, P2, P3]. Let's simulate a sequence of events:


🔹 Initial State:

All clocks start at:

P1: [0, 0, 0]
P2: [0, 0, 0]
P3: [0, 0, 0]

🔹 Step 1: P1 executes an internal event

Increment own index:

P1: [1, 0, 0]

🔹 Step 2: P1 sends a message to P2

Before sending, P1 increments its clock:

P1: [2, 0, 0]  → sends this vector to P2

🔹 Step 3: P2 receives the message from P1
  • P2 first updates its vector:
    V2[j] = max(V2[j], V1[j]) for all j

  • P2's current: [0, 0, 0]
    Incoming from P1: [2, 0, 0]
    So, updated: [2, 0, 0]

  • Then, P2 increments its own clock:
    [2, 1, 0]

P2: [2, 1, 0]

🔹 Step 4: P2 sends a message to P3
  • Increment own clock before sending:
    [2, 2, 0]
  • Send this vector to P3

🔹 Step 5: P3 receives message from P2
  • P3’s current: [0, 0, 0]
    Incoming vector: [2, 2, 0]

  • Update with max: → [2, 2, 0]

  • Increment P3’s clock: → [2, 2, 1]

P3: [2, 2, 1]

🔹 Step 6: P1 executes another internal event

[3, 0, 0]


🔍 Final Vector Clocks
| Process | Vector Clock |
| --- | --- |
| P1 | [3, 0, 0] |
| P2 | [2, 2, 0] |
| P3 | [2, 2, 1] |

🧩 How to Compare Events?

Let’s say:

  • Event e1: P1’s [2, 0, 0] (before message to P2)
  • Event e2: P2’s [2, 1, 0] (after receiving)

Compare:
[2, 0, 0] < [2, 1, 0] → Yes
So e1 happened before e2


Now consider:

  • Event e3: P1’s [3, 0, 0]
  • Event e4: P3’s [2, 2, 1]

Compare:
  • Not e3 < e4 (because 3 > 2)
  • Not e4 < e3 (because 2 > 0)

→ So e3 and e4 are concurrent events.

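The whole walkthrough above can be reproduced with a small Python class; the helper functions implement the comparison rules listed just below.

```python
class VectorClock:
    def __init__(self, n: int, pid: int):
        self.v = [0] * n      # one entry per process
        self.pid = pid

    def internal_event(self):
        self.v[self.pid] += 1             # rule 3

    def send(self) -> list:
        self.v[self.pid] += 1             # rule 4: increment, then ship a copy
        return list(self.v)

    def receive(self, incoming: list):
        # rule 5: component-wise max, then increment own entry
        self.v = [max(a, b) for a, b in zip(self.v, incoming)]
        self.v[self.pid] += 1

def happened_before(a: list, b: list) -> bool:
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a: list, b: list) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)

p1, p2, p3 = VectorClock(3, 0), VectorClock(3, 1), VectorClock(3, 2)
p1.internal_event()                        # P1: [1, 0, 0]
p2.receive(p1.send())                      # P1: [2, 0, 0] -> P2: [2, 1, 0]
p3.receive(p2.send())                      # P2: [2, 2, 0] -> P3: [2, 2, 1]
p1.internal_event()                        # P1: [3, 0, 0]
print(happened_before([2, 0, 0], [2, 1, 0]))   # True: e1 -> e2
print(concurrent(p1.v, p3.v))                  # True: e3 || e4
```
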
Properties:

  • Causality capture: V(a) < V(b) if and only if a → b
  • Concurrency detection: If neither V(a) < V(b) nor V(b) < V(a), events are concurrent
  • Strong clock condition: Complete characterization of causal relationships

Data structure:

  • Vector of integers, one entry per process

Vector clock comparison:

  • V(a) ≤ V(b) if V(a)[i] ≤ V(b)[i] for all i
  • V(a) < V(b) if V(a) ≤ V(b) and V(a) ≠ V(b)
  • V(a) || V(b) (concurrent) if neither V(a) < V(b) nor V(b) < V(a)

Limitations:

  • O(n) space complexity, where n is the number of processes
  • Message size grows with system size
  • Overhead for maintaining complete causal history

Matrix Clocks (Version Vectors)

An extension of vector clocks that captures not just direct knowledge but also what each process knows about other processes:

Data structure:

  • n×n matrix where n is the number of processes
  • Mi[j, k] represents process i's knowledge of process j's knowledge of process k's events

Properties:

  • Complete information about causal knowledge
  • Enables garbage collection of message logs
  • Used in optimistic replication systems

Limitations:

  • O(n²) space complexity
  • Significant communication overhead

Applications of logical clocks:

  • Causal broadcast: Ensuring messages are delivered respecting causality
  • Distributed snapshots: Capturing consistent global states
  • Distributed debugging: Analyzing event relationships
  • Conflict detection: In optimistic replication and collaborative editing
  • Consistency models: Implementing causal consistency

Logical clocks provide the theoretical foundation for reasoning about event ordering in distributed systems, offering a framework that addresses the fundamental limitations of physical time in distributed environments.

4.3 Leader Election Algorithms

Leader election is a fundamental problem in distributed systems where processes must reach agreement on selecting a single process (the leader) to coordinate activities or make decisions.

```mermaid
sequenceDiagram
    participant A as Node1
    participant B as Node2
    participant C as Node3
    A->>B: Election
    A->>C: Election
    B-->>A: OK
    C-->>A: OK
    B->>C: Election
    C-->>B: OK
    C->>A: Coordinator
    C->>B: Coordinator
```

Use cases for leader election:

  • Centralized control: Coordinating distributed operations
  • Resource arbitration: Managing access to shared resources
  • Load balancing: Distributing work among processes
  • Fault tolerance: Recovering from leader failures
  • Consensus: Facilitating agreement protocols

Desirable properties:

  • Safety: At most one leader is elected at any time
  • Liveness: Eventually a leader is elected if a suitable candidate exists
  • Symmetry breaking: Distinguishing between otherwise identical processes
  • Fault tolerance: Handling process failures during and after election
  • Efficiency: Minimizing message complexity and time

Common leader election algorithms:

Bully Algorithm

The Bully Algorithm elects the process with the highest ID as the leader:

Process (sketched in code after the list):

  1. When a process P detects leader failure or system initialization:
    • P sends ELECTION message to all processes with higher IDs
    • If no responses, P declares itself leader and sends COORDINATOR message
    • If any process responds, P becomes inactive in election
  2. When receiving ELECTION message:
    • Send OK response to sender
    • Start new election if not already in one
  3. When receiving COORDINATOR message:
    • Update leader information

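A compact simulation of these rules in Python, with an `alive` set standing in for failure detection and synchronous calls standing in for ELECTION/OK/COORDINATOR messages; both are assumptions for illustration.

```python
def bully_election(initiator: int, processes: list, alive: set) -> int:
    """Return the ID of the new coordinator under the Bully rules."""
    higher = [p for p in processes if p > initiator]
    responders = [p for p in higher if p in alive]  # ELECTION -> OK replies
    if not responders:
        return initiator   # no higher process alive: declare self coordinator
    # Any responder takes over; recursing from the highest responder models
    # the election it now starts, ending with a COORDINATOR broadcast.
    return bully_election(max(responders), processes, alive)

procs = [1, 2, 3, 4, 5]
print(bully_election(2, procs, alive={1, 2, 3, 4}))  # 5 has crashed -> 4 wins
```
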
Characteristics:

  • Simple implementation
  • O(n²) worst-case message complexity
  • Works in synchronous systems
  • Vulnerable to repeated failures during election

Ring Algorithm

Processes are arranged in a logical ring, with election messages passed around the ring:

Process:

  1. When a process P detects leader failure:
    • P marks itself as participant
    • P sends ELECTION message containing its ID to its successor
  2. When receiving ELECTION message:
    • If message contains P's ID, P becomes leader
    • If ID in message is higher than P's ID, forward message
    • If ID in message is lower than P's ID, replace with P's ID and forward
  3. After completing ring traversal, highest ID process becomes leader
  4. Leader sends COORDINATOR message around ring

Characteristics:

  • O(n) to O(n²) message complexity (depending on variation)
  • Requires stable ring topology
  • More resilient to certain failure patterns than Bully

Paxos Leader Election

Based on the Paxos consensus algorithm, but specialized for leader election:

Process:

  1. Proposer suggests itself as leader with a ballot number (epoch, ID)
  2. Acceptors respond with promises if ballot number is highest seen
  3. If proposer receives majority promises, it sends accept messages
  4. Acceptors acknowledge acceptance if ballot still valid
  5. Process becomes leader when majority accepts

Characteristics:

  • Robust against network partitions
  • Works in asynchronous systems
  • Guaranteed correctness even with partial failures
  • Higher complexity than simpler algorithms

Raft Leader Election

Part of the Raft consensus algorithm, designed for understandability:

Process (the voting rule is sketched after the list):

  1. All nodes start as followers with random election timeout
  2. On timeout, follower becomes candidate and increments term
  3. Candidate votes for itself and requests votes from others
  4. Nodes grant their vote if the candidate's term is at least their own, the candidate's log is at least as up-to-date, and they haven't already voted in this term
  5. Candidate becomes leader if it receives majority votes
  6. Leaders send periodic heartbeats to prevent new elections

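A sketch of the voting rule (steps 3-4) as a single node would apply it; the log up-to-date check is elided here for brevity.

```python
import random

ELECTION_TIMEOUT = (0.15, 0.30)  # seconds; randomized to avoid split votes

class RaftNode:
    def __init__(self):
        self.current_term = 0
        self.voted_for = None
        self.timeout = random.uniform(*ELECTION_TIMEOUT)

    def handle_request_vote(self, candidate_term: int, candidate_id: str) -> bool:
        if candidate_term < self.current_term:
            return False                   # stale candidate: reject
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None          # new term: our vote is available again
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id  # at most one vote per term
            return True
        return False

node = RaftNode()
print(node.handle_request_vote(1, "A"))  # True: first vote in term 1
print(node.handle_request_vote(1, "B"))  # False: already voted this term
```
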
Characteristics:

  • Randomized timeouts prevent split votes
  • Clear role separation (follower, candidate, leader)
  • Integrated with broader consensus mechanism
  • Built-in term numbering for safety

Election in partition-prone systems:

  • Split-brain problem: Network partition can lead to multiple leaders
  • Quorum-based approaches: Require majority agreement to prevent split-brain
  • Fencing tokens: Monotonically increasing identifiers to detect stale leaders
  • Lease mechanisms: Time-bounded leadership grants

Practical considerations:

  • Timeout calibration: Balancing quick detection with false-positive risk
  • Heterogeneous environments: Handling varying process capabilities
  • Network topology: Impact on message delivery and timeouts
  • Stability vs. responsiveness: Preventing frequent re-elections

Leader election algorithms provide critical coordination mechanisms in distributed systems, serving as building blocks for more complex distributed algorithms and ensuring orderly system operation even in the face of failures.


5. Mutual Exclusion

5.1 Centralized Approach

```mermaid
graph TD
    A[Client] -->|Request| C[Coordinator]
    B[Client] -->|Request| C
    C -->|Grant| A
    A -->|Release| C
    C -->|Grant| B
```

The centralized approach to mutual exclusion relies on a single coordinator process that manages access to shared resources, serving as an arbiter for all critical section requests.

Basic algorithm (a coordinator sketch follows the list):

  1. Process P sends a REQUEST message to the coordinator when it wants to enter the critical section
  2. Coordinator either:
    • Grants permission immediately with REPLY if no other process is in the critical section
    • Adds request to a queue if another process is currently in the critical section
  3. After receiving REPLY, process P enters the critical section
  4. When P exits the critical section, it sends RELEASE to the coordinator
  5. Coordinator grants permission to the next process in the queue (if any) with REPLY

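The coordinator's side of this protocol fits in a few lines; in the sketch below, in-process method calls stand in for the REQUEST/REPLY/RELEASE messages.

```python
from collections import deque

class Coordinator:
    """Central arbiter: at most one process holds the critical section."""
    def __init__(self):
        self.holder = None
        self.queue = deque()      # FIFO queue of pending requests

    def request(self, pid) -> bool:
        if self.holder is None:
            self.holder = pid     # REPLY: grant immediately
            return True
        self.queue.append(pid)    # resource busy: queue the request
        return False

    def release(self, pid):
        assert self.holder == pid, "only the holder may RELEASE"
        # Grant to the next queued process, if any (REPLY would be sent here)
        self.holder = self.queue.popleft() if self.queue else None

c = Coordinator()
print(c.request("P1"))   # True: P1 enters the critical section
print(c.request("P2"))   # False: P2 is queued
c.release("P1")          # grant passes to P2
print(c.holder)          # P2
```
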
Message patterns:

  • For each critical section entry: 2 messages (REQUEST, REPLY)
  • For each critical section exit: 1 message (RELEASE)
  • Total per critical section usage: 3 messages

Data structures:

  • Coordinator: Boolean flag indicating if resource is in use, queue of pending requests
  • Process: State variable tracking if request is pending or granted

Queue management policies:

  • FIFO: First-come, first-served (fairness)
  • Priority-based: Higher priority requests served first
  • Deadline-aware: Scheduling based on time constraints

Advantages:

  • Simplicity: Straightforward implementation
  • Low message complexity: O(1) messages per critical section access
  • Fairness: Easy to implement fair scheduling policies
  • Deadlock-free: Centralized control prevents circular wait
  • Low overhead: Minimal state tracking required

Disadvantages:

  • Single point of failure: If coordinator crashes, entire system is affected
  • Performance bottleneck: All requests funnel through one process
  • Vulnerability to coordinator failures: Special recovery procedures needed
  • Network partition sensitivity: Partitions can prevent access to coordinator

Failure handling:

  1. Coordinator failure:

    • Detect coordinator failure (e.g., using timeouts or heartbeats)
    • Run leader election to choose new coordinator
    • Reconstruct queue state (may require processes to resubmit requests)
  2. Process failure:

    • While holding critical section: Requires recovery mechanism
    • Before/after critical section: Coordinator removes from queue if applicable
  3. Message loss:

    • Request lost: Timeout and retry
    • Reply lost: Timeout and retry, with duplicate detection
    • Release lost: Timeout at coordinator to handle stuck critical sections

Practical implementations:

  • Lock servers: Dedicated services for distributed locking
  • Zookeeper: Using ephemeral nodes for locks with automatic release
  • Database locks: Using central database for coordination
  • Redis-based locks: Using atomic operations in Redis

Optimizations:

  • Token passing within queue: Reduce coordinator involvement
  • Batching of requests/releases: Reduce message overhead
  • Hierarchical coordination: Multiple coordinators for scaling

The centralized approach represents the simplest solution to distributed mutual exclusion but trades fault tolerance and scalability for simplicity and low message overhead.

5.2 Distributed Approaches

Distributed approaches to mutual exclusion eliminate the central coordinator, distributing the responsibility of coordination among all participating processes to improve fault tolerance.

```mermaid
graph TD
    A[P1] -->|Request| B[P2]
    A -->|Request| C[P3]
    B -->|Reply| A
    C -->|Reply| A
    A --> CS((Critical Section))
```

Lamport's Distributed Mutual Exclusion Algorithm

Based on Lamport's logical clocks, this algorithm uses timestamps to establish a total ordering of critical section requests:

Algorithm steps:

  1. When process Pi wants to enter critical section:

    • Increment logical clock
    • Send timestamped REQUEST(ts, i) to all processes (including itself)
    • Place request in local request queue
  2. When process Pj receives REQUEST(ts, i):

    • Place request in local request queue
    • Send REPLY to Pi
  3. Process Pi enters critical section when:

    • Its request has the lowest timestamp in its queue
    • It has received a message (REPLY or later timestamped REQUEST) from every other process
  4. After exiting critical section:

    • Remove request from local queue
    • Send RELEASE to all processes
  5. When process Pj receives RELEASE from Pi:

    • Remove Pi's request from local queue

Message analysis:

  • Enter critical section: N REQUEST messages + (N-1) REPLY messages = 2N-1
  • Exit critical section: N-1 RELEASE messages
  • Total per critical section: 3(N-1) messages

Correctness properties:

  • Mutual exclusion: Guaranteed by total ordering of timestamps
  • Deadlock-free: Lowest timestamp always proceeds
  • Starvation-free: FIFO service based on timestamps

Advantages:

  • No single point of failure
  • Symmetric algorithm (all processes run same code)
  • Fair (first-come, first-served based on timestamps)

Disadvantages:

  • High message complexity: O(N) messages per critical section
  • Vulnerability to process failures
  • All processes involved in every critical section access

Ricart and Agrawala's Algorithm

An optimization of Lamport's algorithm that reduces message complexity:

Algorithm steps (the deferral rule is sketched after the list):

  1. When process Pi wants to enter critical section:

    • Increment logical clock
    • Send timestamped REQUEST(ts, i) to all other processes
    • Wait for REPLY from all other processes
  2. When process Pj receives REQUEST(ts, i):

    • If Pj is not interested in critical section or Pi's request has priority (lower timestamp or same timestamp but lower process ID):
      • Send REPLY immediately
    • Otherwise:
      • Defer REPLY until Pj exits critical section
  3. Process Pi enters critical section when:

    • It has received REPLY from all other processes
  4. After exiting critical section:

    • Send deferred REPLY messages to pending requests

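The heart of the algorithm is the deferral decision in step 2; a sketch of just that rule, with message transport abstracted away:

```python
def reply_immediately(my_state: str, my_ts: int, my_id: int,
                      req_ts: int, req_id: int) -> bool:
    """Step 2: reply now unless we hold or want the critical section
    with higher priority (lower timestamp; ties broken by process ID)."""
    if my_state == "RELEASED":   # not interested in the critical section
        return True
    if my_state == "HELD":       # currently inside: defer until exit
        return False
    # my_state == "WANTED": compare (timestamp, id) pairs lexicographically
    return (req_ts, req_id) < (my_ts, my_id)

# P1 requested at ts=5; an incoming request stamped ts=3 has priority:
print(reply_immediately("WANTED", 5, 1, 3, 2))   # True: reply now
print(reply_immediately("WANTED", 5, 1, 7, 2))   # False: defer the reply
```
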
Message analysis:

  • Enter critical section: (N-1) REQUEST messages + (N-1) REPLY messages = 2(N-1)
  • Exit critical section: 0 messages (replies are sent as needed)
  • Total per critical section: 2(N-1) messages

Advantages:

  • Reduced message complexity compared to Lamport's algorithm
  • No explicit RELEASE messages

Disadvantages:

  • Still O(N) message complexity
  • All processes must respond for critical section entry
  • Vulnerable to process failures

Maekawa's Algorithm

A quorum-based approach that significantly reduces message complexity:

Key idea:

  • Each process has a "voting set" (quorum) of processes
  • Any two voting sets must have a non-empty intersection
  • Permission needed only from voting set, not all processes

Voting set properties:

  • Each process belongs to K voting sets
  • Each voting set has K processes
  • Optimal K = √N (approximately)

Algorithm steps:

  1. When process Pi wants to enter critical section:

    • Send REQUEST to all processes in its voting set
    • Wait for GRANT from all processes in voting set
  2. When process Pj receives REQUEST from Pi:

    • If Pj hasn't sent GRANT to anyone:
      • Send GRANT to Pi
    • Otherwise:
      • Queue Pi's request
  3. Process Pi enters critical section when:

    • It has received GRANT from all processes in its voting set
  4. After exiting critical section:

    • Send RELEASE to all processes in voting set
  5. When process Pj receives RELEASE:

    • Send GRANT to next queued request (if any)

Message analysis:

  • Enter critical section: K REQUEST + K GRANT messages = 2K
  • Exit critical section: K RELEASE messages
  • Total per critical section: 3K messages (approximately 3√N)

Advantages:

  • Significantly reduced message complexity: O(√N) instead of O(N)
  • Distributed control

Disadvantages:

  • Complex voting set construction
  • Potential for deadlock in basic algorithm
  • More complex failure handling

Distributed approaches to mutual exclusion eliminate single points of failure but typically require more complex algorithms and higher message counts than centralized approaches.

5.3 Token-Based Approaches

Token-based mutual exclusion algorithms use the possession of a unique token to grant the right to enter the critical section, eliminating the need for request-grant protocols.

```mermaid
graph LR
    A -->|Token| B
    B -->|Token| C
    C -->|Token| D
    D -->|Token| A
```

Basic concept:

  • Only one token exists in the system
  • Only the process holding the token may enter the critical section
  • Token is passed between processes
  • No explicit permission needed—token possession is permission

Token Ring Algorithm

Processes are arranged in a logical ring, with the token circulating around the ring:

Algorithm steps:

  1. Token circulates in a fixed direction around the ring
  2. When process Pi receives the token:
    • If Pi wants to enter critical section:
      • Pi retains token and enters critical section
      • After exiting, Pi passes token to next process in ring
    • If Pi does not want to enter:
      • Pi passes token immediately to next process

Message analysis:

  • One token pass per process in ring
  • Average wait time proportional to ring size
  • Single token message circulating continuously

Advantages:

  • Predictable behavior and bounded waiting time
  • Simple implementation
  • Constant message size

Disadvantages:

  • Continuous token circulation even when no requests
  • Single token creates vulnerability to loss
  • Ring maintenance required

Raymond's Tree-Based Algorithm

Processes are arranged in a spanning tree with the token holder at the root:

Data structures:

  • For each process Pi:
    • HOLDER: pointer to neighbor closer to token (parent in tree)
    • USING: boolean indicating if Pi is in critical section
    • ASKED: boolean indicating if Pi has requested token
    • REQUEST_Q: queue of neighbors who have requested token

Algorithm steps:

  1. When process Pi wants to enter critical section:

    • If Pi holds token:
      • Mark USING = true and enter
    • Else:
      • Set ASKED = true
      • Send REQUEST to HOLDER
  2. When process Pi receives REQUEST from Pj:

    • If Pi holds token and not USING:
      • Send token to Pj
      • Update HOLDER = Pj
    • Else if Pi holds token and USING:
      • Add Pj to REQUEST_Q
    • Else:
      • If not ASKED:
        • Set ASKED = true
        • Send REQUEST to HOLDER
      • Add Pj to REQUEST_Q
  3. When process Pi receives token:

    • Set HOLDER = self
    • If ASKED:
      • Enter critical section (USING = true)
    • Else:
      • If REQUEST_Q not empty:
        • Dequeue Pj from REQUEST_Q
        • Send token to Pj
        • Set HOLDER = Pj
  4. When process Pi exits critical section:

    • Set USING = false
    • If REQUEST_Q not empty:
      • Dequeue Pj from REQUEST_Q
      • Send token to Pj
      • Set HOLDER = Pj

Message analysis:

  • O(log N) messages per critical section request (tree height)
  • No messages when token holder wants to enter
  • Dynamically optimizes paths based on usage patterns

Advantages:

  • More efficient than ring algorithm
  • Adapts to usage patterns
  • Lower average waiting time

Disadvantages:

  • More complex than ring algorithm
  • Tree maintenance required
  • Potential for path stretching

Suzuki-Kasami Broadcast Algorithm

A token-based algorithm using sequence numbers to track requests:

Data structures:

  • In each process Pi:
    • RN[1..n]: request sequence numbers for each process
    • LN[1..n]: sequence numbers of last requests satisfied (stored in token)
    • Token: contains privilege and queue

Algorithm steps:

  1. When process Pi wants to enter critical section:

    • Increment RN[i]
    • Broadcast REQUEST(i, RN[i]) to all processes
  2. When process Pj receives REQUEST(i, sn):

    • Set RN[i] = max(RN[i], sn)
    • If Pj has token and not using it, and RN[i] > LN[i]:
      • Send token to Pi
      • Update HOLDER reference
  3. When process Pi receives token:

    • Enter critical section
    • Update LN[i] = RN[i]
  4. When process Pi exits critical section:

    • For each process Pj where RN[j] > LN[j]:
      • Add Pj to token queue
    • If token queue not empty:
      • Dequeue process Pk
      • Send token to Pk
      • Update LN[k] = RN[k]

Message analysis:

  • Request broadcast: N-1 messages
  • Token transfer: 1 message
  • Best case (if already have token): 0 messages
  • Worst case: N messages

Advantages:

  • Efficient when critical section usage is infrequent
  • No explicit release messages
  • Tolerant to dynamic group membership

Disadvantages:

  • Broadcast overhead for requests
  • Queue management complexity

Token-based approaches offer elegant solutions to mutual exclusion that often perform well in practice, particularly in systems with predictable communication patterns or frequent critical section usage.


6. Resource Management

6.1 Global Scheduling

Global scheduling in distributed systems involves coordinating task execution across multiple nodes to optimize system-wide performance, resource utilization, and other objectives.

```mermaid
graph TD
    S[Scheduler] -->|Assign| W1[Worker]
    S -->|Assign| W2[Worker]
    S -->|Assign| W3[Worker]
    Q[Task Queue] --> S
```

Key goals of global scheduling:

  • Resource utilization: Maximize use of available computational resources
  • Throughput: Increase the number of tasks completed per unit time
  • Response time: Minimize time between task submission and completion
  • Fairness: Ensure equitable access to resources
  • Energy efficiency: Minimize power consumption
  • Deadline satisfaction: Meet time constraints for time-critical tasks

Scheduling dimensions:

  1. Static vs. dynamic scheduling:

    • Static: Assignment decisions made prior to execution
      • Based on a priori knowledge of task characteristics
      • Simpler implementation, lower runtime overhead
      • Less adaptive to changing conditions
    • Dynamic: Assignment decisions made during execution
      • Adapts to current system state and load conditions
      • Higher runtime overhead for decision-making
      • Better handles unpredictable workloads
  2. Centralized vs. distributed scheduling:

    • Centralized: Single scheduler makes all decisions
      • Global visibility of system state
      • Potential bottleneck and single point of failure
      • Simpler to implement optimal algorithms
    • Distributed: Multiple schedulers coordinate
      • Better scalability and fault tolerance
      • Partial system visibility
      • More complex coordination required
  3. Cooperative vs. non-cooperative scheduling:

    • Cooperative: Nodes work together for global objectives
      • System-wide optimization possible
      • Requires trust between nodes
    • Non-cooperative: Nodes optimize for local objectives
      • No trust assumptions required
      • May lead to suboptimal global outcomes

Common scheduling algorithms:

  1. First Come, First Served (FCFS):

    • Tasks executed in order of arrival
    • Simple implementation
    • Can lead to the convoy effect: short tasks stuck behind long ones (quantified in the sketch after this list)
  2. Shortest Job First (SJF):

    • Prioritizes tasks with shortest execution time
    • Minimizes average response time
    • Requires accurate execution time estimates
    • Can lead to starvation of larger tasks
  3. Priority-based scheduling:

    • Tasks assigned priorities based on various criteria
    • Higher priority tasks executed first
    • Can incorporate multiple objectives (deadlines, importance, etc.)
    • May require aging mechanisms to prevent starvation
  4. Fair share scheduling:

    • Resources divided equitably among users or groups
    • Historical usage tracked to ensure long-term fairness
    • May temporarily deviate from fairness for efficiency
  5. Deadline-based scheduling:

    • Tasks scheduled to meet time constraints
    • Earliest Deadline First (EDF): Prioritizes nearest deadlines
    • Rate Monotonic: Fixed priorities for periodic tasks, with shorter periods given higher priority
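
The convoy effect is easy to quantify. The sketch below compares average completion time under FCFS and SJF on a single machine; the job durations are invented purely for illustration.

```python
# Compare average completion time under FCFS and SJF.
# Durations (seconds) are illustrative: one long job arrives first.
jobs = [100, 2, 3, 1]

def avg_completion(order):
    t, total = 0, 0
    for duration in order:
        t += duration          # this job finishes at cumulative time t
        total += t
    return total / len(order)

print(f"FCFS: {avg_completion(jobs):.2f}s")          # ~103s: convoy effect
print(f"SJF : {avg_completion(sorted(jobs)):.2f}s")  # ~29s: short jobs first
```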

Scheduling information requirements:

  • Task characteristics: Execution time, resource requirements, dependencies
  • System state: Current load, available resources, queue lengths
  • Performance metrics: Current throughput, response times, utilization
  • Environmental factors: Network conditions, failure probabilities

Practical considerations:

  • Cost of migration: Overhead of moving tasks between nodes
  • Data locality: Proximity of computation to data
  • Communication patterns: Inter-task communication requirements
  • Heterogeneity: Varying capabilities of execution nodes
  • Reliability: Probability of node failures

Implementation approaches:

  1. Queue-based systems:

    • Central or distributed task queues
    • Workers pull from queues based on capacity
    • Examples: Apache Hadoop, traditional batch systems (a minimal pull-model sketch follows this list)
  2. Market-based approaches:

    • Resources priced based on supply and demand
    • Tasks bid for required resources
    • Price mechanisms drive efficient allocation
  3. Constraint satisfaction:

    • Formulate scheduling as constraint problem
    • Apply constraint solvers to find valid allocations
    • Handles complex requirements and dependencies
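
As a toy illustration of the queue-based pull model, the sketch below uses a single-process queue and threads as a stand-in for a distributed task queue and worker nodes; all names are arbitrary.

```python
import queue
import threading

# Workers pull tasks from a shared queue as their capacity frees up.
task_q = queue.Queue()

def worker() -> None:
    while True:
        task = task_q.get()
        if task is None:          # sentinel: shut this worker down
            task_q.task_done()
            return
        task()                    # run the pulled task
        task_q.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):               # submit ten toy tasks
    task_q.put(lambda n=n: print(f"task {n} done"))

task_q.join()                     # block until every task is processed
for _ in workers:                 # then stop the workers
    task_q.put(None)
for w in workers:
    w.join()
```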

Global scheduling provides the foundation for efficient resource utilization in distributed systems, balancing multiple competing objectives while adapting to the dynamic nature of distributed environments.

6.2 Task Assignment

Task assignment focuses on the specific problem of deciding which processor or node should execute each task in a distributed system, considering factors such as load balancing, communication costs, and resource requirements.

```mermaid
flowchart TB
    T1[Task] -->|CPU Intensive| N1[Node]
    T2[Task] -->|GPU| N2[Node]
    T3[Task] -->|Memory| N3[Node]
```

Core task assignment problems:

  1. Mapping: Assigning tasks to processors
  2. Ordering: Determining execution sequence on each processor
  3. Synchronization: Coordinating dependent task execution

Assignment models:

  1. One-time assignment:

    • Tasks permanently assigned to processors
    • Typical in batch processing systems
    • Focus on initial distribution quality
  2. Dynamic reassignment:

    • Tasks can migrate between processors
    • Adapts to changing system conditions
    • Balances migration costs against benefits

Assignment objectives:

  • Minimizing execution time: Fastest completion of tasks
  • Minimizing communication: Reducing network traffic
  • Load balancing: Even distribution of work
  • Resource matching: Placing tasks where their resource needs are best met
  • Energy efficiency: Minimizing power consumption
  • Reliability: Ensuring task completion despite failures

Task characteristics affecting assignment:

  • Computation requirements: CPU, memory, storage, specialized hardware
  • Communication patterns: Which tasks communicate with each other
  • Precedence constraints: Task dependencies and execution order
  • Divisibility: Whether tasks can be partitioned for parallel execution
  • State requirements: Access to specific data or services

Common assignment strategies:

Graph Partitioning

Models tasks and communication as a graph, then partitions to minimize communication while balancing load:

  • Nodes: Represent tasks or computation units
  • Edges: Represent communication or dependencies
  • Edge weights: Communication volume
  • Node weights: Computation requirements
  • Goal: Partition the graph to minimize total cut edge weight while balancing partition sizes (scored in the sketch after the algorithm list)

Algorithms:

  • Kernighan-Lin algorithm
  • Recursive bisection
  • Spectral partitioning
  • Multilevel partitioning (METIS)
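
The objective these partitioners optimize, minimum cut weight subject to balanced load, fits in a few lines of Python; the graph, weights, and candidate partitions below are invented for illustration.

```python
# Score a 2-way partition: total weight of cut edges plus the load split.
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 4, ("a", "d"): 1}
node_weight = {"a": 3, "b": 2, "c": 3, "d": 2}

def evaluate(part_a):
    # An edge is "cut" when its endpoints land in different partitions.
    cut = sum(w for (u, v), w in edges.items() if (u in part_a) != (v in part_a))
    load_a = sum(node_weight[n] for n in part_a)
    load_b = sum(node_weight.values()) - load_a
    return cut, (load_a, load_b)

print(evaluate({"a", "b"}))  # -> (2, (5, 5)): cut weight 2, balanced load
print(evaluate({"a", "d"}))  # -> (9, (5, 5)): balanced but a far worse cut
```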

Task Clustering

Groups heavily communicating tasks together to minimize communication overhead:

  • Tasks clustered based on communication patterns
  • Clusters assigned to processors as units
  • Trade-offs between parallelism and communication overhead

Approaches:

  • Communication volume-based clustering
  • Dependency-based clustering
  • Dominant sequence clustering

List Scheduling

Assigns prioritized tasks to available processors:

  1. Tasks sorted according to priority metric (critical path, etc.)
  2. Each task assigned to the processor that allows the earliest completion (sketched after the metrics below)

Common priority metrics:

  • Task size (longest first)
  • Communication requirements
  • Number of successors
  • Critical path contribution
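
A minimal greedy list scheduler for independent tasks is sketched below, using the longest-first metric; task names and durations are invented for illustration. For independent tasks, the earliest-available processor is also the one allowing earliest completion.

```python
import heapq

# Greedy list scheduling: take tasks in priority order and place each on
# the processor that frees up soonest.
tasks = [("t1", 4), ("t2", 3), ("t3", 2), ("t4", 2), ("t5", 1)]
tasks.sort(key=lambda t: -t[1])          # priority metric: longest first

free_at = [(0, p) for p in range(3)]     # (time processor becomes free, id)
heapq.heapify(free_at)
for name, duration in tasks:
    t, p = heapq.heappop(free_at)        # earliest-available processor
    print(f"{name} on P{p}: [{t}, {t + duration}]")
    heapq.heappush(free_at, (t + duration, p))
```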

Work Stealing

Dynamic load balancing approach where idle processors "steal" work from busy ones:

  • Each processor maintains a double-ended queue (deque) of ready tasks
  • When a processor's deque is empty, it attempts to steal from a randomly chosen victim
  • Thieves take the oldest tasks from the opposite end of the victim's deque, which tend to be the largest
  • Balances load while minimizing task migrations (see the sketch after the considerations below)

Considerations:

  • Stealing granularity (task size)
  • Victim selection policy
  • Locality awareness
  • Overhead of synchronization
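
A single-threaded simulation shows the mechanics; the task names and round-robin driver are illustrative stand-ins for real runtimes (such as Cilk or Java's ForkJoinPool), which run one worker thread per processor.

```python
import collections
import random

# Each "processor" owns a deque of tasks; idle processors steal the
# oldest task from a randomly chosen victim.
deques = [collections.deque() for _ in range(4)]
deques[0].extend(f"task{i}" for i in range(8))   # all work starts on P0

def next_task(p):
    if deques[p]:
        return deques[p].pop()                   # own work: newest first
    victims = [v for v in range(len(deques)) if v != p and deques[v]]
    if victims:
        return deques[random.choice(victims)].popleft()  # steal the oldest
    return None

executed = []
while any(deques):                               # drive until all work drains
    for p in range(4):
        task = next_task(p)
        if task is not None:
            executed.append((p, task))
print(executed)
```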

Advanced assignment techniques:

  1. Machine learning approaches:

    • Predictive task placement based on historical performance
    • Reinforcement learning for adaptive assignment
    • Classification of tasks into placement categories
  2. Container-based assignment:

    • Containerized tasks with specified resource requirements
    • Orchestration systems (Kubernetes, Docker Swarm) handling placement
    • Resource reservation and isolation
  3. Speculative execution:

    • Running multiple copies of slow tasks
    • Taking result from first to complete
    • Handling stragglers in large-scale systems
  4. Co-scheduling:

    • Coordinated scheduling of communicating tasks
    • Minimizing waiting time due to dependencies
    • Aligning execution slots across processors

Effective task assignment is critical for distributed system performance, requiring careful consideration of task characteristics, system capabilities, and assignment objectives.

6.3 Load Balancing

Load balancing distributes workload across multiple computing resources to optimize resource utilization, maximize throughput, minimize response time, and avoid overload of any single resource.

```mermaid
graph LR
    C1[Client] --> LB[Load Balancer]
    C2[Client] --> LB
    LB --> S1[Server]
    LB --> S2[Server]
    LB --> S3[Server]
```

Goals of load balancing:

  • Resource utilization: Making efficient use of all available resources
  • Throughput maximization: Processing more requests per unit time
  • Response time minimization: Reducing user-perceived delays
  • Avoiding hotspots: Preventing resource exhaustion on individual nodes
  • Scalability: Accommodating growing workloads by adding resources
  • Availability: Handling component failures without service disruption

Types of load balancing:

  1. Static load balancing:

    • Fixed distribution of work based on predetermined rules
    • No adaptation to current system state
    • Examples: Round-robin, weighted distribution, hash-based
    • Low runtime overhead, but doesn't adapt to changing conditions
  2. Dynamic load balancing:

    • Distribution adapts based on current system state
    • Monitors actual load and performance metrics
    • Examples: Least-connections, resource-based, predictive
    • Higher overhead but better handles variable workloads
  3. Centralized load balancing:

    • Single component makes all distribution decisions
    • Complete system visibility
    • Potential single point of failure
    • Examples: Traditional load balancer appliances, central dispatchers
  4. Distributed load balancing:

    • Multiple components cooperate to balance load
    • No single point of failure
    • Partial system visibility
    • Examples: DNS-based distribution, anycast, client-side selection

Common load balancing algorithms:

Static Algorithms

  1. Round-Robin:

    • Tasks distributed in circular order among nodes
    • Simple implementation, no need for system state
    • Assumes homogeneous nodes and tasks
    • Variations: Weighted Round-Robin (accommodates node differences)
  2. Hash-Based:

    • Consistent hashing maps requests to nodes
    • Same request always routed to same node (useful for caching)
    • Minimizes redistribution when nodes added/removed
    • Examples: Consistent hashing in distributed caches (a minimal ring sketch follows this list)
  3. Fixed Mapping:

    • Predetermined assignments based on request type
    • Specializes nodes for specific workloads
    • Enables hardware optimization for specific tasks
    • Inflexible to changing workload characteristics
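
A minimal consistent-hash ring with virtual nodes is sketched below; server names and the key format are arbitrary, and MD5 is used only as a convenient stable hash.

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: keys map to the next node clockwise on a ring."""

    def __init__(self, nodes, vnodes=100):
        # Each node appears vnodes times on the ring to smooth the spread.
        self._ring = sorted(
            (self._h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, self._h(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
print(ring.node_for("user:42"))   # the same key always maps to the same node
```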

Dynamic Algorithms

  1. Least Connections:

    • Tasks sent to node with fewest active connections
    • Adapts to varying request processing times
    • Requires continuous connection tracking
    • Variations: Weighted Least Connections (a selection sketch follows this list)
  2. Shortest Queue:

    • Tasks sent to node with shortest job queue
    • Minimizes waiting time for new tasks
    • Requires queue length monitoring
    • May not account for varying job sizes
  3. Fastest Response:

    • Tasks sent to node with lowest response time
    • Adapts to actual processing capacity
    • Requires performance monitoring
    • Automatically accommodates heterogeneous nodes
  4. Resource-Based:

    • Distribution based on current resource utilization
    • Considers CPU, memory, network, disk metrics
    • Prevents resource exhaustion
    • Requires comprehensive monitoring
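
The core of a dynamic balancer is just a selection rule plus bookkeeping, as in this least-connections sketch; server names and counts are illustrative, and a real balancer would update the counts from its connection table.

```python
# Route each request to the server with the fewest active connections.
active = {"server-a": 0, "server-b": 0, "server-c": 0}

def on_request():
    server = min(active, key=active.get)   # least-connections rule
    active[server] += 1                    # connection opened
    return server

def on_complete(server):
    active[server] -= 1                    # connection closed

s = on_request()
print(f"routed to {s}:", active)
on_complete(s)
```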

Specialized Approaches

  1. Two-Phase Algorithms:

    • Phase 1: Nodes classified as senders or receivers based on load
    • Phase 2: Load transferred from senders to receivers
    • Reduces unnecessary migrations
    • Examples: Diffusion methods, dimension exchange
  2. Agent-Based:

    • Mobile agents collect system state and make decisions
    • Reduces central coordination requirements
    • Can adapt to network topology changes
    • Higher implementation complexity
  3. Auction-Based:

    • Nodes bid for tasks based on their capacity
    • Market mechanisms drive efficient allocation
    • Self-organizing and adaptive
    • Examples: Spawn system, computational economies

Implementation mechanisms:

  1. DNS-based load balancing:

    • Multiple IP addresses returned for single domain
    • Client connects to one of the provided addresses
    • Simple but limited control and responsiveness
  2. Transport layer (L4) load balancing:

    • Distributes based on IP address and port
    • Network Address Translation (NAT) or Direct Server Return (DSR)
    • Efficient but application-unaware
  3. Application layer (L7) load balancing:

    • Distribution based on request content (URLs, headers, cookies)
    • Content-aware routing and sticky sessions
    • More intelligent but higher processing overhead
  4. Global server load balancing (GSLB):

    • Distributes load across multiple geographic locations
    • Considers network proximity and regional availability
    • Improves user experience and disaster resilience

Practical considerations:

  • Session persistence: Keeping user connected to same server
  • Health checking: Detecting and avoiding failed nodes
  • Geographic distribution: Routing users to nearby resources
  • Capacity planning: Ensuring sufficient overall capacity
  • Elasticity: Adding/removing resources based on demand
  • Cost optimization: Balancing performance against resource costs

Load balancing is a critical component of distributed systems, ensuring efficient resource utilization while providing the scalability and reliability needed for modern applications.


Summary and Further Reading

Key Takeaways

  1. Operating System Paradigms:

    • Network OS provides resource sharing while maintaining distinct machine boundaries
    • Distributed OS creates a unified system image across multiple machines
    • Middleware serves as a pragmatic middle ground, enabling integration without replacing existing OS
  2. Communication Models:

    • The OSI model provides a conceptual framework for network protocols
    • RPC and RMI enable remote invocation of procedures and methods
    • Message-oriented and stream-oriented communication trade off discrete exchanges against continuous, timing-sensitive data flow
    • Group communication provides abstractions for coordinating multiple processes
  3. Clock Synchronization:

    • Physical clock synchronization (Cristian's, Berkeley, NTP) aligns actual time values
    • Logical clocks (Lamport, Vector) maintain causal ordering without absolute time
    • Leader election algorithms select coordinators for distributed operations
  4. Mutual Exclusion:

    • Centralized approaches offer simplicity but create single points of failure
    • Distributed approaches (Lamport, Ricart-Agrawala, Maekawa) eliminate central coordinator
    • Token-based solutions provide elegant access control with different topologies
  5. Resource Management:

    • Global scheduling coordinates task execution across nodes for system-wide objectives
    • Task assignment maps tasks to processors, weighing load, communication, and resource fit
    • Load balancing spreads work to avoid hotspots and maximize utilization