Summary of "Lec- 3: Distributed Systems and Distributed Computing"
Summary: Distributed Systems and Distributed Computing
Definitions and core idea
Distributed system: a collection of independent computers that appears to users as a single coherent system. Components cooperate (via message passing) to achieve a common goal.
Distributed computing: using a distributed system to solve computational problems by dividing a problem into tasks that are solved by one or more computers communicating with one another.
Primary purposes and benefits
- Resource sharing and improved utilization (aggregate CPU, memory, storage across machines).
- Scalability: capacity can grow by adding machines.
- Availability and fault tolerance: replication and backups across nodes prevent a single-node failure from bringing down the whole system.
- Concurrency: multiple tasks processed in parallel across nodes.
- Other desirable properties: heterogeneity, openness, transparency.
High-level architecture (layers / components)
- Hardware / Network layer: physical computers, CPU, RAM, storage, and network links.
- Operating System (OS) layer: manages hardware and provides basic services (process and resource management).
- IPC (interprocess communication) primitives: standardized protocols and mechanisms (TCP/IP, UDP, HTTP, etc.) for exchanging control and data messages between processes on different machines; a minimal message-passing sketch follows this list.
- Middleware: builds on OS services to provide a uniform development and deployment environment (higher-level protocols, data formats, APIs).
- Applications / Services: user-facing programs deployed on top of the middleware and OS that use the distributed infrastructure.
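To make the IPC layer concrete, here is a minimal sketch (not from the lecture) of two processes exchanging a request and a reply over TCP in Python. The host, port, and message contents are invented for illustration; real nodes would run on separate machines and typically speak a middleware-defined protocol rather than raw bytes.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 50007  # hypothetical local endpoint for this demo

# Bind and listen before starting the handler thread so the client
# cannot connect before the server is ready.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, PORT))
srv.listen(1)

def serve_one():
    conn, _ = srv.accept()
    with conn:
        data = conn.recv(1024)        # receive a request message
        conn.sendall(b"ack:" + data)  # reply with a response message

t = threading.Thread(target=serve_one)
t.start()

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"task-42")           # send a control/data message
    print(cli.recv(1024).decode())    # prints: ack:task-42

t.join()
srv.close()
```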
How distributed computing solves problems (workflow)
- Assemble a cluster of independent machines and present it to users as a single system.
- Partition the large problem or data set into smaller tasks or chunks.
- Assign tasks to different machines (nodes); each node works on its assigned portion.
- Nodes communicate and coordinate via message passing (IPC) to exchange data, control signals, or intermediate results.
- Aggregate results from nodes to produce the final output visible to the user (single input → single output despite internal distribution); a sketch of this partition/assign/aggregate workflow follows this list.
- Use replication, redundancy, and middleware features so services remain available when individual nodes fail.
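A compact sketch of that partition/assign/aggregate loop, using local worker processes as stand-ins for independent machines. The data set and chunk size are arbitrary illustration values; a real system would ship each chunk to a remote node over the network.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each "node" computes a result for its assigned portion.
    return sum(chunk)

def main():
    data = list(range(1_000_000))                     # the large data set
    chunk_size = 250_000
    chunks = [data[i:i + chunk_size]                  # partition step
              for i in range(0, len(data), chunk_size)]
    with ProcessPoolExecutor(max_workers=4) as pool:  # assign step
        partials = list(pool.map(partial_sum, chunks))
    print(sum(partials))                              # aggregate step

if __name__ == "__main__":
    main()
```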
Distinction: distributed computing vs. parallel computing
- Distributed computing:
  - Multiple independent machines, each with its own local memory.
  - Coordination via message passing across a network.
  - Loosely coupled; typically more fault-tolerant, and capacity scales by adding machines.
- Parallel computing:
  - Multiple processors or cores sharing the same physical memory (a single address space).
  - Coordination via shared-memory mechanisms (locks, atomic operations).
  - Tightly coupled; the machine hosting the shared memory is a single point of failure. (A sketch contrasting the two coordination styles follows this list.)
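One way to see the difference in miniature, assuming Python's multiprocessing module: a Pipe stands in for network message passing (distributed style), while a shared Value stands in for a single address space (parallel style). Real distributed nodes would use sockets across a network instead of a Pipe.

```python
from multiprocessing import Pipe, Process, Value

def msg_worker(conn):
    n = conn.recv()          # message passing: receive input
    conn.send(n * n)         # message passing: send result back

def shm_worker(shared):
    with shared.get_lock():  # shared memory: synchronized in-place update
        shared.value *= shared.value

if __name__ == "__main__":
    # Distributed style: the only coupling is the messages exchanged.
    parent, child = Pipe()
    p = Process(target=msg_worker, args=(child,))
    p.start()
    parent.send(7)
    print(parent.recv())     # prints: 49
    p.join()

    # Parallel style: both processes touch the same memory location.
    v = Value("i", 7)
    q = Process(target=shm_worker, args=(v,))
    q.start()
    q.join()
    print(v.value)           # prints: 49
```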
Historical and practical context / examples
- Architectural evolution: client–server → monolithic → service-oriented architectures (SOA) → microservices.
- Other distributed systems: peer-to-peer systems, massively multiplayer online games (MMOGs).
- The accompanying course/video will cover specific paradigms (client–server, SOA, microservices) in later lectures.
Noted takeaways
- Distributed systems present a single coherent system to users while being implemented across many independent machines.
- Key mechanisms are message passing (IPC) and middleware that abstracts complexity for application developers.
- Distributed computing divides work across nodes; parallel computing shares memory among processors — related but distinct models.
Methodology / step-by-step instructions
- Build or identify the cluster of independent machines (hardware + network).
- Ensure OS-level services and network protocols (TCP/IP, HTTP, etc.) are available.
- Design how to partition the problem/data into tasks small enough to distribute.
- Implement IPC-based communication between tasks (message formats, protocols).
- Use middleware to unify development/deployment and handle concerns like serialization, RPC, discovery, and fault handling.
- Deploy application components across nodes and configure replication/backups for availability.
- Coordinate aggregation of outputs and handle errors/failures (reassign tasks, use replicated copies); a small fault-handling sketch follows this list.
- Monitor and scale by adding or removing nodes as needed.
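As a sketch of the fault-handling step, the snippet below retries a failed task on another node. The node names, run_on_node, and the random failure model are all hypothetical stand-ins for a real remote call with timeout- or heartbeat-based failure detection.

```python
import random

NODES = ["node-a", "node-b", "node-c"]  # invented node names

def run_on_node(node, task):
    # Stand-in for a remote call; randomly fails to model a node crash.
    if random.random() < 0.2:
        raise ConnectionError(f"{node} unreachable")
    return task * task

def run_with_reassignment(task, max_attempts=4):
    for attempt in range(max_attempts):
        node = NODES[attempt % len(NODES)]  # reassign, cycling through nodes
        try:
            return run_on_node(node, task)
        except ConnectionError:
            continue                        # retry on the next node
    raise RuntimeError(f"task {task} failed on all attempts")

results = [run_with_reassignment(t) for t in range(5)]
print(sum(results))  # aggregate once every task has succeeded somewhere
```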
Speakers / sources
- Unnamed lecturer / instructor (video narrator)
- Subtitles were auto-generated from the YouTube video