Distributed Systems Definition Example Essays

For many of us worldwide connectivity through the Internet is as common as being able to send a postcard to anyone anywhere around the world. Moreover, where until recently we were used to having relatively powerful desktop computers for office applications and storage, we are now witnessing that such applications and services are being placed in what has been coined “the cloud,” in turn leading to an increase of much smaller networked devices such as tablet computers. With this in mind, scalability has become one of the most important design goals for developers of distributed systems.

3.4.1 Scalability dimensions

Scalability of a system can be measured along at least three different dimensions (see Neuman [45]):

Size scalability A system can be scalable with respect to its size, meaning that we can easily add more users and resources to the system without any noticeable loss of performance.

Geographical scalability A geographically scalable system is one in which the users and resources may lie far apart, but the fact that communication delays may be significant is hardly noticed.

Administrative scalability An administratively scalable system is one that can still be easily managed even if it spans many independent administrative organizations.

Let us take a closer look at each of these three scalability dimensions.

Size scalability When a system needs to scale, very different types of problems need to be solved. Let us first consider scaling with respect to size. If more users or resources need to be supported, we are often confronted with the limitations of centralized services, although often for very different reasons. For example, many services are centralized in the sense that they are implemented by means of a single server running on a specific machine in the distributed system. In a more modern setting, we may have a group of collaborating servers colocated on a cluster of tightly coupled machines physically placed at the same location. The problem with this scheme is obvious: the server, or group of servers, can simply become a bottleneck when it needs to process an increasing number of requests. To illustrate how this can happen, let us assume that a service is implemented on a single machine. In that case there are essentially three root causes for becoming a bottleneck:
  • The computational capacity, limited by the CPUs

  • The storage capacity, including the transfer rate between CPUs and disks

  • The network between the user and the centralized service

Let us first consider the computational capacity. Just imagine a service for computing optimal routes taking real-time traffic information into account. It is not difficult to imagine that this may be primarily a compute-bound service requiring several (sometimes tens of) seconds to complete a request. If there is only a single machine available, then even a modern high-end system will eventually run into problems if the number of requests increases beyond a certain point.

Likewise, but for different reasons, we will run into problems when having a service that is mainly I/O bound. A typical example is a poorly designed centralized search engine. The problem with content-based search queries is that we essentially need to match a query against an entire data set. Even with advanced indexing techniques, we may still face the problem of having to process a huge amount of data exceeding the main-memory capacity of the machine running the service. As a consequence, much of the processing time will be determined by the relatively slow disk accesses and transfer of data between disk and main memory. Simply adding more or higher-speed disks will prove not to be a sustainable solution as the number of requests continues to increase.

Finally, the network between the user and the service may also be the cause of poor scalability. Just imagine a video-on-demand service that needs to stream high-quality video to multiple users. A video stream can easily require a bandwidth of 8–10 Mbps, meaning that if a service sets up point-to-point connections with its customers, it may soon hit the limits of the network capacity of its own outgoing transmission lines.

Size scalability problems for centralized services can be formally analyzed using queuing theory and making a few simplifying assumptions. At a conceptual level, a centralized service can be modeled as the simple queuing system shown in Fig. 2: requests are submitted to the service where they are queued until further notice. As soon as the process can handle a next request, it fetches it from the queue, does its work, and produces a response. We largely follow Menasce and Almeida [41] in explaining the performance of a centralized service.
In many cases, we may assume that the queue has an infinite capacity, meaning that there is no restriction on the number of requests that can be accepted for further processing. Strictly speaking, this means that the arrival rate of requests is not influenced by what is currently in the queue or being processed. Assuming that the arrival rate of requests is \(\lambda \) requests per second, and that the processing capacity of the service is \(\mu \) requests per second, one can compute that the fraction of time \(p_k\) that there are \(k\) requests in the system is equal to:

$$\begin{aligned} p_k = \left( 1 - \frac{\lambda }{\mu }\right) \left( \frac{\lambda }{\mu }\right) ^k \end{aligned}$$

If we define the utilization\(U\) of a service as the fraction of time that it is busy, then clearly,

$$\begin{aligned} U = \sum _{k > 0} p_k = 1 - p_0 = \frac{\lambda }{\mu } \Rightarrow p_k = (1-U) U^k \end{aligned}$$

We can then compute the average number \(\overline{N}\) of requests in the system as

$$\begin{aligned} \overline{N} = \sum _{k\ge 0} k \cdot p_k = \sum _{k \ge 0} k \cdot (1-U)U^k = (1-U)\sum _{k\ge 0} k\cdot U^k = \frac{(1-U)U}{(1-U)^2} = \frac{U}{1-U}. \end{aligned}$$

What we are really interested in, is the response time \(R\): how long does it take before the service to process a request, including the time spent in the queue. To that end, we need the average throughput \(X\). Considering that the service is “busy” when at least one request is being processed, and that this then happens with a throughput of \(\mu \) requests per second, and during a fraction \(U\) of the total time, we have:

$$\begin{aligned} X = \underbrace{U \cdot \mu }_{ \text{ server } \text{ at } \text{ work }} + \underbrace{(1-U) \cdot 0}_{ \text{ server } \text{ idle }} = \frac{\lambda }{\mu } \cdot \mu = \lambda \end{aligned}$$

Using Little’s formula [57], we can then derive the response time as

$$\begin{aligned} R = \frac{\overline{N}}{X} = \frac{S}{1-U} \Rightarrow \frac{R}{S} = \frac{1}{1-U} \end{aligned}$$

where \(S = \frac{1}{\mu }\), the actual service time. Note that if \(U\) is very small, the response-to-service time ratio is close to 1, meaning that a request is virtually instantly processed, and at the maximum speed possible. However, as soon as the utilization comes closer to 1, we see that the response-to-server time ratio quickly increases to very high values, effectively meaning that the system is coming close to a grinding halt. This is where we see scalability problems emerge. From this simple model, we can see that the only solution is bringing down the service time \(S\).

Geographical scalability Geographical scalability has its own problems. One of the main reasons why it is still difficult to scale existing distributed systems that were designed for local-area networks is that many of them are based on synchronous communication. In this form of communication, a party requesting service, generally referred to as a client, blocks until a reply is sent back from the server implementing the service . More specifically, we often see a communication pattern consisting of many client-server interactions as may be the case with database transactions. This approach generally works fine in LANs where communication between two machines is often at worst a few 100 \({\mu }\mathrm {s}\). However, in a wide-area system, we need to take into account that interprocess communication may be hundreds of milliseconds, three orders of magnitude slower. Building applications using synchronous communication in wide-area systems requires a great deal of care (and not just a little patience), notably with a rich interaction pattern between client and server.

Another problem that hinders geographical scalability is that communication in wide-area networks is inherently much less reliable than in local-area networks. In addition, we also need to deal with limited bandwidth. The effect is that solutions developed for local-area networks cannot always be easily ported to a wide-area system. A typical example is streaming video. In a home network, even when having only wireless links, ensuring a stable, fast stream of high-quality video frames from a media server to a display is quite simple. Simply placing that same server far away and using a standard TCP connection to the display will surely fail: bandwidth limitations will instantly surface, but also maintaining the same level of reliability can easily cause headaches.

Yet another issue that pops up when components lie far apart is the fact that wide-area systems generally have only very limited facilities for multipoint communication. In contrast, local-area networks often support efficient broadcasting mechanisms. Such mechanisms have proven to be extremely useful for discovering components and services, which is essential from a management point of view. In wide-area systems, we need to develop separate services, such as naming and directory services to which queries can be sent. These support services, in turn, need to be scalable as well and in many cases no obvious solutions exist.

Administrative scalability Finally, a difficult, and in many cases open, question is how to scale a distributed system across multiple, independent administrative domains. A major problem that needs to be solved is that of conflicting policies with respect to resource usage (and payment), management, and security.

To illustrate, for many years scientists have been looking for solutions to share their (often expensive) equipment in what is known as a computational grid. In these grids, a global distributed system is constructed as a federation of local distributed systems, allowing a program running on a computer at organization \(\mathsf {A}\) to directly access resources at organization \(\mathsf {B}\).

For example, many components of a distributed system that reside within a single domain can often be trusted by users that operate within that same domain. In such cases, system administration may have tested and certified applications, and may have taken special measures to ensure that such components cannot be tampered with. In essence, the users trust their system administrators. However, this trust does not expand naturally across domain boundaries.

If a distributed system expands to another domain, two types of security measures need to be taken. First, the distributed system has to protect itself against malicious attacks from the new domain. For example, users from the new domain may have only read access to the file system in its original domain. Likewise, facilities such as expensive imagesetters or high-performance computers may not be made available to unauthorized users. Second, the new domain has to protect itself against malicious attacks from the distributed system. A typical example is that of downloading programs such as applets in Web browsers. Basically, the new domain does not know what to expect from such foreign code, and may therefore decide to severely limit the access rights for such code. The problem is how to enforce those limitations.

As a counter example of distributed systems spanning multiple administrative domains that apparently do not suffer from administrative scalability problems, consider modern file-sharing peer-to-peer networks. In these cases, end users simply install a program implementing distributed search and download functions and within minutes can start downloading files. Other examples include peer-to-peer applications for telephony over the Internet such as Skype [10], and peer-assisted audio-streaming applications such as earlier versions of Spotify [35]. What these distributed systems have in common is that end users, and not administrative entities, collaborate to keep the system up and running. At best, underlying administrative organizations such as Internet Service Providers (ISPs) can police the network traffic that these peer-to-peer systems cause, but so far such efforts have not been very effective.

3.4.2 Scaling techniques

Having discussed some of the scalability problems brings us to the question of how those problems can generally be solved. In most cases, scalability problems in distributed systems appear as performance problems caused by limited capacity of servers and network. Simply improving their capacity (e.g., by increasing memory, upgrading CPUs, or replacing network modules) is often a solution, referred to as scaling up. When it comes to scaling out, that is, expanding the distributed system by essentially deploying more machines, there are basically only three techniques we can apply: hiding communication latencies, distribution of work, and replication.

Hiding communication latencies Hiding communication latencies is applicable in the case of geographical scalability. The basic idea is simple: try to avoid waiting for responses to remote-service requests as much as possible. For example, when a service has been requested at a remote machine, an alternative to waiting for a reply from the server is to do other useful work at the requester’s side. Essentially, this means constructing the requesting application in such a way that it uses only asynchronous communication. When a reply comes in, the application is interrupted and a special handler is called to complete the previously issued request. Asynchronous communication can often be used in batch-processing systems and parallel applications in which independent tasks can be scheduled for execution while another task is waiting for communication to complete. Alternatively, a new thread of control can be started to perform the request. Although it blocks waiting for the reply, other threads in the process can continue.

However, there are many applications that cannot make effective use of asynchronous communication. For example, in interactive applications when a user sends a request he will generally have nothing better to do than to wait for the answer. In such cases, a much better solution is to reduce the overall communication, for example, by moving part of the computation that is normally done at the server to the client process requesting the service. A typical case where this approach works is accessing databases using forms. Filling in forms can be done by sending a separate message for each field and waiting for an acknowledgement from the server, as shown in Fig. 3a. For example, the server may check for syntactic errors before accepting an entry. A much better solution is to ship the code for filling in the form, and possibly checking the entries, to the client, and have the client return a completed form, as shown in Fig. 3b.

Partitioning and distribution Another important scaling technique is partition and distribution, which involves taking a component, splitting it into smaller parts, and subsequently spreading those parts across the system. A good example of partition and distribution is the Internet Domain Name System (DNS). The DNS name space is hierarchically organized into a tree of domains, which are divided into nonoverlapping zones, as shown for the original DNS in Fig. 4. The names in each zone are handled by a single name server. Without going into too many details, now one can think of each path name being the name of a host in the Internet, and is thus associated with a network address of that host. Basically, resolving a name means returning the network address of the associated host. Consider, for example, the name \(\mathsf {flits.cs.vu.nl}\). To resolve this name, it is first passed to the server of zone \(\mathsf {Z1}\) (see Fig. 4) which returns the address of the server for zone \(\mathsf {Z2}\), to which the rest of name, \(\mathsf {flits.cs.vu}\), can be handed. The server for \(\mathsf {Z2}\) will return the address of the server for zone \(\mathsf {Z3}\), which is capable of handling the last part of the name and will return the address of the associated host.

This example illustrates how the naming service, as provided by DNS, is distributed across several machines, thus avoiding that a single server has to deal with all requests for name resolution.

As another example, consider the World Wide Web. To most users, the Web appears to be an enormous document-based information system in which each document has its own unique name in the form of a URL. Conceptually, it may even appear as if there is only a single server. However, the Web is physically partitioned and distributed across a few 100 million servers, each handling a number of Web documents. The name of the server handling a document is encoded into that document’s URL. It is only because of this distribution of documents that the Web has been capable of scaling to its current size.

Replication Considering that scalability problems often appear in the form of performance degradation, it is generally a good idea to actually replicate components across a distributed system. Replication not only increases availability, but also helps to balance the load between components leading to better performance. Also, in geographically widely dispersed systems, having a copy nearby can hide much of the communication latency problems mentioned before.

Caching is a special form of replication, although the distinction between the two is often hard to make or even artificial. As in the case of replication, caching results in making a copy of a resource, generally in the proximity of the client accessing that resource. However, in contrast to replication, caching is a decision made by the client of a resource and not by the owner of a resource.

There is one serious drawback to caching and replication that may adversely affect scalability. Because we now have multiple copies of a resource, modifying one copy makes that copy different from the others. Consequently, caching and replication leads to consistency problems.

To what extent inconsistencies can be tolerated depends highly on the usage of a resource. For example, many Web users find it acceptable that their browser returns a cached document of which the validity has not been checked for the last few minutes. However, there are also many cases in which strong consistency guarantees need to be met, such as in the case of electronic stock exchanges and auctions. The problem with strong consistency is that an update must be immediately propagated to all other copies. Moreover, if two updates happen concurrently, it is often also required that updates are processed in the same order everywhere, introducing an additional global ordering problem. To further aggravate problems, combining consistency with other desirable properties such as availability may simply be impossible. The latter is illustrated by the so-called CAP problem that states that combining consistency, availability, and being tolerant to network partitions is not possible [16, 24].

Replication therefore often requires some global synchronization mechanism. Unfortunately, such mechanisms are extremely hard or even impossible to implement in a scalable way, if alone because network latencies have a natural lower bound. Consequently, scaling by replication may introduce other, inherently nonscalable solutions.

Discussion When considering these scaling techniques, one could argue that size scalability is the least problematic from a technical point of view. In many cases, increasing the capacity of a machine will save the day, although perhaps there is a high monetary cost to pay. Geographical scalability is a much tougher problem as network latencies are naturally bound from below. As a consequence, we may be forced to copy data to locations close to where clients are, leading to problems of maintaining copies consistent. Practice shows that combining distribution, replication, and caching techniques with different forms of consistency generally leads to acceptable solutions. Finally, administrative scalability seems to be the most difficult problem to solve, partly because we need to deal with nontechnical issues, such as politics of organizations and human collaboration. The introduction and now widespread use of peer-to-peer technology has successfully demonstrated what can be achieved if end users are put in control [39, 47]. However, peer-to-peer networks are obviously not the universal solution to all administrative scalability problems.

"Distributed application" redirects here. For smart contracts, see Decentralized application.

"Distributed Information Processing" redirects here. For the computer company, see DIP Research.

Distributed computing is a field of computer science that studies distributed systems. A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages.[1] The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.[1] Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.

A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.[2] There are many alternatives for the message passing mechanism, including pure HTTP, RPC-like connectors and message queues.[citation needed]

A goal and challenge pursued by some computer scientists and practitioners in distributed systems is location transparency; however, this goal has fallen out of favour in industry, as distributed systems are different from conventional non-distributed systems, and the differences, such as network partitions, partial system failures, and partial upgrades, cannot simply be "papered over" by attempts at "transparency" (see CAP theorem).[citation needed]

Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers,[3] which communicate with each other by message passing.[4]

Introduction[edit]

The word distributed in terms such as "distributed system", "distributed programming", and "distributed algorithm" originally referred to computer networks where individual computers were physically distributed within some geographical area.[5] The terms are nowadays used in a much wider sense, even referring to autonomous processes that run on the same physical computer and interact with each other by message passing.[4]

While there is no single definition of a distributed system,[6] the following defining properties are commonly used:

  • There are several autonomous computational entities (computers or nodes), each of which has its own local memory.[7]
  • The entities communicate with each other by message passing.[8]

A distributed system may have a common goal, such as solving a large computational problem;[9] the user then perceives the collection of autonomous processors as a unit. Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users.[10]

Other typical properties of distributed systems include the following:

  • The system has to tolerate failures in individual computers.[11]
  • The structure of the system (network topology, network latency, number of computers) is not known in advance, the system may consist of different kinds of computers and network links, and the system may change during the execution of a distributed program.[12]
  • Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.[13]

Parallel and distributed computing[edit]

Distributed systems are groups of networked computers, which have the same goal for their work. The terms "concurrent computing", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them.[14] The same system may be characterized both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel.[15] Parallel computing may be seen as a particular tightly coupled form of distributed computing,[16] and distributed computing may be seen as a loosely coupled form of parallel computing.[6] Nevertheless, it is possible to roughly classify concurrent systems as "parallel" or "distributed" using the following criteria:

  • In parallel computing, all processors may have access to a shared memory to exchange information between processors.[17]
  • In distributed computing, each processor has its own private memory (distributed memory). Information is exchanged by passing messages between the processors.[18]

The figure on the right illustrates the difference between distributed and parallel systems. Figure (a) is a schematic view of a typical distributed system; the system is represented as a network topology in which each node is a computer and each line connecting the nodes is a communication link. Figure (b) shows the same distributed system in more detail: each computer has its own local memory, and information can be exchanged only by passing messages from one node to another by using the available communication links. Figure (c) shows a parallel system in which each processor has a direct access to a shared memory.

The situation is further complicated by the traditional uses of the terms parallel and distributed algorithm that do not quite match the above definitions of parallel and distributed systems (see below for more detailed discussion). Nevertheless, as a rule of thumb, high-performance parallel computation in a shared-memory multiprocessor uses parallel algorithms while the coordination of a large-scale distributed system uses distributed algorithms.[citation needed]

History[edit]

The use of concurrent processes that communicate by message-passing has its roots in operating system architectures studied in the 1960s.[19] The first widespread distributed systems were local-area networks such as Ethernet, which was invented in the 1970s.[20]

ARPANET, the predecessor of the Internet, was introduced in the late 1960s, and ARPANET e-mail was invented in the early 1970s. E-mail became the most successful application of ARPANET,[21] and it is probably the earliest example of a large-scale distributed application. In addition to ARPANET, and its successor, the Internet, other early worldwide computer networks included Usenet and FidoNet from the 1980s, both of which were used to support distributed discussion systems.[citation needed]

The study of distributed computing became its own branch of computer science in the late 1970s and early 1980s. The first conference in the field, Symposium on Principles of Distributed Computing (PODC), dates back to 1982, and its European counterpart International Symposium on Distributed Computing (DISC) was first held in 1985.[citation needed]

Architectures[edit]

Various hardware and software architectures are used for distributed computing. At a lower level, it is necessary to interconnect multiple CPUs with some sort of network, regardless of whether that network is printed onto a circuit board or made up of loosely coupled devices and cables. At a higher level, it is necessary to interconnect processes running on those CPUs with some sort of communication system.[citation needed]

Distributed programming typically falls into one of several basic architectures: client–server, three-tier, n-tier, or peer-to-peer; or categories: loose coupling, or tight coupling.[22]

  • Client–server: architectures where smart clients contact the server for data then format and display it to the users. Input at the client is committed back to the server when it represents a permanent change.
  • Three-tier: architectures that move the client intelligence to a middle tier so that stateless clients can be used. This simplifies application deployment. Most web applications are three-tier.
  • n-tier: architectures that refer typically to web applications which further forward their requests to other enterprise services. This type of application is the one most responsible for the success of application servers.
  • Peer-to-peer: architectures where there are no special machines that provide a service or manage the network resources.[23]:227 Instead all responsibilities are uniformly divided among all machines, known as peers. Peers can serve both as clients and as servers.[citation needed]

Another basic aspect of distributed computing architecture is the method of communicating and coordinating work among concurrent processes. Through various message passing protocols, processes may communicate directly with one another, typically in a master/slave relationship. Alternatively, a "database-centric" architecture can enable distributed computing to be done without any form of direct inter-process communication, by utilizing a shared database.[24]

Applications[edit]

Reasons for using distributed systems and distributed computing may include:

  1. The very nature of an application may require the use of a communication network that connects several computers: for example, data produced in one physical location and required in another location.
  2. There are many cases in which the use of a single computer would be possible in principle, but the use of a distributed system is beneficial for practical reasons. For example, it may be more cost-efficient to obtain the desired level of performance by using a cluster of several low-end computers, in comparison with a single high-end computer. A distributed system can provide more reliability than a non-distributed system, as there is no single point of failure. Moreover, a distributed system may be easier to expand and manage than a monolithic uniprocessor system.[25]

Examples[edit]

Examples of distributed systems and applications of distributed computing include the following:[26]

  • telecommunication networks:
  • network applications:
  • real-time process control:
  • parallel computation:

Theoretical foundations[edit]

Main article: Distributed algorithm

Models[edit]

Many tasks that we would like to automate by using a computer are of question–answer type: we would like to ask a question and the computer should produce an answer. In theoretical computer science, such tasks are called computational problems. Formally, a computational problem consists of instances together with a solution for each instance. Instances are questions that we can ask, and solutions are desired answers to these questions.

Theoretical computer science seeks to understand which computational problems can be solved by using a computer (computability theory) and how efficiently (computational complexity theory). Traditionally, it is said that a problem can be solved by using a computer if we can design an algorithm that produces a correct solution for any given instance. Such an algorithm can be implemented as a computer program that runs on a general-purpose computer: the program reads a problem instance from input, performs some computation, and produces the solution as output. Formalisms such as random access machines or universal Turing machines can be used as abstract models of a sequential general-purpose computer executing such an algorithm.[citation needed]

The field of concurrent and distributed computing studies similar questions in the case of either multiple computers, or a computer that executes a network of interacting processes: which computational problems can be solved in such a network and how efficiently? However, it is not at all obvious what is meant by "solving a problem" in the case of a concurrent or distributed system: for example, what is the task of the algorithm designer, and what is the concurrent or distributed equivalent of a sequential general-purpose computer?[citation needed]

The discussion below focuses on the case of multiple computers, although many of the issues are the same for concurrent processes running on a single computer.

Three viewpoints are commonly used:

Parallel algorithms in shared-memory model
  • All processors have access to a shared memory. The algorithm designer chooses the program executed by each processor.
  • One theoretical model is the parallel random access machines (PRAM) that are used.[27] However, the classical PRAM model assumes synchronous access to the shared memory.
  • Shared-memory programs can be extended to distributed systems if the underlying operating system encapsulates the communication between nodes and virtually unifies the memory across all individual systems.
  • A model that is closer to the behavior of real-world multiprocessor machines and takes into account the use of machine instructions, such as Compare-and-swap (CAS), is that of asynchronous shared memory. There is a wide body of work on this model, a summary of which can be found in the literature.[28][29]
Parallel algorithms in message-passing model
  • The algorithm designer chooses the structure of the network, as well as the program executed by each computer.
  • Models such as Boolean circuits and sorting networks are used.[30] A Boolean circuit can be seen as a computer network: each gate is a computer that runs an extremely simple computer program. Similarly, a sorting network can be seen as a computer network: each comparator is a computer.
Distributed algorithms in message-passing model
  • The algorithm designer only chooses the computer program. All computers run the same program. The system must work correctly regardless of the structure of the network.
  • A commonly used model is a graph with one finite-state machine per node.

In the case of distributed algorithms, computational problems are typically related to graphs. Often the graph that describes the structure of the computer network is the problem instance. This is illustrated in the following example.[citation needed]

An example[edit]

Consider the computational problem of finding a coloring of a given graph G. Different fields might take the following approaches:

Centralized algorithms[citation needed]
  • The graph G is encoded as a string, and the string is given as input to a computer. The computer program finds a coloring of the graph, encodes the coloring as a string, and outputs the result.
Parallel algorithms
  • Again, the graph G is encoded as a string. However, multiple computers can access the same string in parallel. Each computer might focus on one part of the graph and produce a coloring for that part.
  • The main focus is on high-performance computation that exploits the processing power of multiple computers in parallel.
Distributed algorithms
  • The graph G is the structure of the computer network. There is one computer for each node of G and one communication link for each edge of G. Initially, each computer only knows about its immediate neighbors in the graph G; the computers must exchange messages with each other to discover more about the structure of G. Each computer must produce its own color as output.
  • The main focus is on coordinating the operation of an arbitrary distributed system.[citation needed]

While the field of parallel algorithms has a different focus than the field of distributed algorithms, there is a lot of interaction between the two fields. For example, the Cole–Vishkin algorithm for graph coloring[31] was originally presented as a parallel algorithm, but the same technique can also be used directly as a distributed algorithm.

Moreover, a parallel algorithm can be implemented either in a parallel system (using shared memory) or in a distributed system (using message passing).[32] The traditional boundary between parallel and distributed algorithms (choose a suitable network vs. run in any given network) does not lie in the same place as the boundary between parallel and distributed systems (shared memory vs. message passing).

Complexity measures[edit]

In parallel algorithms, yet another resource in addition to time and space is the number of computers. Indeed, often there is a trade-off between the running time and the number of computers: the problem can be solved faster if there are more computers running in parallel (see speedup). If a decision problem can be solved in polylogarithmic time by using a polynomial number of processors, then the problem is said to be in the class NC.[33] The class NC can be defined equally well by using the PRAM formalism or Boolean circuits—PRAM machines can simulate Boolean circuits efficiently and vice versa.[34]

In the analysis of distributed algorithms, more attention is usually paid on communication operations than computational steps. Perhaps the simplest model of distributed computing is a synchronous system where all nodes operate in a lockstep fashion. During each communication round, all nodes in parallel (1) receive the latest messages from their neighbours, (2) perform arbitrary local computation, and (3) send new messages to their neighbors. In such systems, a central complexity measure is the number of synchronous communication rounds required to complete the task.[35]

This complexity measure is closely related to the diameter of the network. Let D be the diameter of the network. On the one hand, any computable problem can be solved trivially in a synchronous distributed system in approximately 2D communication rounds: simply gather all information in one location (D rounds), solve the problem, and inform each node about the solution (D rounds).

On the other hand, if the running time of the algorithm is much smaller than D communication rounds, then the nodes in the network must produce their output without having the possibility to obtain information about distant parts of the network. In other words, the nodes must make globally consistent decisions based on information that is available in their local neighbourhood. Many distributed algorithms are known with the running time much smaller than D rounds, and understanding which problems can be solved by such algorithms is one of the central research questions of the field.[36]

Other commonly used measures are the total number of bits transmitted in the network (cf. communication complexity).[citation needed]

Other problems[edit]

Traditional computational problems take the perspective that we ask a question, a computer (or a distributed system) processes the question for a while, and then produces an answer and stops. However, there are also problems where we do not want the system to ever stop. Examples of such problems include the dining philosophers problem and other similar mutual exclusion problems. In these problems, the distributed system is supposed to continuously coordinate the use of shared resources so that no conflicts or deadlocks occur.

There are also fundamental challenges that are unique to distributed computing. The first example is challenges that are related to fault-tolerance. Examples of related problems include consensus problems,[37]Byzantine fault tolerance,[38] and self-stabilisation.[39]

A lot of research is also focused on understanding the asynchronous nature of distributed systems:

Election

Coordinator election (or leader election) is the process of designating a single process as the organizer of some task distributed among several computers (nodes). Before the task is begun, all network nodes are either unaware which node will serve as the "coordinator" (or leader) of the task, or unable to communicate with the current coordinator. After a coordinator election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task coordinator.[citation needed]

The network nodes communicate among themselves in order to decide which of them will get into the "coordinator" state. For that, they need some method in order to break the symmetry among them. For example, if each node has unique and comparable identities, then the nodes can compare their identities, and decide that the node with the highest identity is the coordinator.[citation needed]

The definition of this problem is often attributed to LeLann, who formalized it as a method to create a new token in a token ring network in which the token has been lost.[43]

Coordinator election algorithms are designed to be economical in terms of total bytes transmitted, and time. The algorithm suggested by Gallager, Humblet, and Spira [44] for general undirected graphs has had a strong impact on the design of distributed algorithms in general, and won the Dijkstra Prize for an influential paper in distributed computing.

Many other algorithms were suggested for different kind of network graphs, such as undirected rings, unidirectional rings, complete graphs, grids, directed Euler graphs, and others. A general method that decouples the issue of the graph family from the design of the coordinator election algorithm was suggested by Korach, Kutten, and Moran.[45]

In order to perform coordination, distributed systems employ the concept of coordinators. The coordinator election problem is to choose a process from among a group of processes on different processors in a distributed system to act as the central coordinator. Several central coordinator election algorithms exist.[46]

Properties of distributed systems[edit]

So far the focus has been on designing a distributed system that solves a given problem. A complementary research problem is studying the properties of a given distributed system.[47][48]

The halting problem is an analogous example from the field of centralised computation: we are given a computer program and the task is to decide whether it halts or runs forever. The halting problem is undecidable in the general case, and naturally understanding the behaviour of a computer network is at least as hard as understanding the behaviour of one computer.[citation needed]

However, there are many interesting special cases that are decidable. In particular, it is possible to reason about the behaviour of a network of finite-state machines. One example is telling whether a given network of interacting (asynchronous and non-deterministic) finite-state machines can reach a deadlock. This problem is PSPACE-complete,[49] i.e., it is decidable, but it is not likely that there is an efficient (centralised, parallel or distributed) algorithm that solves the problem in the case of large networks.

See also[edit]

Notes[edit]

  1. ^ abCoulouris, George; Jean Dollimore; Tim Kindberg; Gordon Blair (2011). Distributed Systems: Concepts and Design (5th Edition). Boston: Addison-Wesley. ISBN 0-132-14301-1. 
  2. ^Andrews (2000). Dolev (2000). Ghosh (2007), p. 10.
  3. ^Godfrey (2002).
  4. ^ abAndrews (2000), p. 291–292. Dolev (2000), p. 5.
  5. ^Lynch (1996), p. 1.
  6. ^ abGhosh (2007), p. 10.
  7. ^Andrews (2000), pp. 8–9, 291. Dolev (2000), p. 5. Ghosh (2007), p. 3. Lynch (1996), p. xix, 1. Peleg (2000), p. xv.
  8. ^Andrews (2000), p. 291. Ghosh (2007), p. 3. Peleg (2000), p. 4.
  9. ^Ghosh (2007), p. 3–4. Peleg (2000), p. 1.
  10. ^Ghosh (2007), p. 4. Peleg (2000), p. 2.
  11. ^Ghosh (2007), p. 4, 8. Lynch (1996), p. 2–3. Peleg (2000), p. 4.
  12. ^Lynch (1996), p. 2. Peleg (2000), p. 1.
  13. ^Ghosh (2007), p. 7. Lynch (1996), p. xix, 2. Peleg (2000), p. 4.
  14. ^Ghosh (2007), p. 10. Keidar (2008).
  15. ^Lynch (1996), p. xix, 1–2. Peleg (2000), p. 1.
  16. ^Peleg (2000), p. 1.
  17. ^Papadimitriou (1994), Chapter 15. Keidar (2008).
  18. ^See references in Introduction.
  19. ^Andrews (2000), p. 348.
  20. ^Andrews (2000), p. 32.
  21. ^Peter (2004), The history of email.
  22. ^"Real Time And Distributed Computing Systems"(PDF). ISSN 2278-0661. Retrieved 2017-01-09. 
  23. ^Vigna P, Casey MJ. The Age of Cryptocurrency: How Bitcoin and the Blockchain Are Challenging the Global Economic Order St. Martin's Press January 27, 2015 ISBN 9781250065636
  24. ^Lind P, Alm M (2006), "A database-centric virtual chemistry system", J Chem Inf Model, 46 (3): 1034–9, doi:10.1021/ci050360b, PMID 16711722. 
  25. ^Elmasri & Navathe (2000), Section 24.1.2.
  26. ^Andrews (2000), p. 10–11. Ghosh (2007), p. 4–6. Lynch (1996), p. xix, 1. Peleg (2000), p. xv. Elmasri & Navathe (2000), Section 24.
  27. ^Cormen, Leiserson & Rivest (1990), Section 30.
  28. ^Herlihy & Shavit (2008), Chapters 2-6.
  29. ^Lynch (1996)
  30. ^Cormen, Leiserson & Rivest (1990), Sections 28 and 29.
  31. ^Cole & Vishkin (1986). Cormen, Leiserson & Rivest (1990), Section 30.5.
  32. ^Andrews (2000), p. ix.
  33. ^Arora & Barak (2009), Section 6.7. Papadimitriou (1994), Section 15.3.
  34. ^Papadimitriou (1994), Section 15.2.
  35. ^Lynch (1996), p. 17–23.
  36. ^Peleg (2000), Sections 2.3 and 7. Linial (1992). Naor & Stockmeyer (1995).
  37. ^Lynch (1996), Sections 5–7. Ghosh (2007), Chapter 13.
  38. ^Lynch (1996), p. 99–102. Ghosh (2007), p. 192–193.
  39. ^Dolev (2000). Ghosh (2007), Chapter 17.
  40. ^Lynch (1996), Section 16. Peleg (2000), Section 6.
  41. ^Lynch (1996), Section 18. Ghosh (2007), Sections 6.2–6.3.
  42. ^Ghosh (2007), Section 6.4.
  43. ^LeLann, G. (1977). "Distributed systems - toward a formal approach,". Information Processing. 77: 155·160. – via Elsevier. 
  44. ^R. G. Gallager, P. A. Humblet, and P. M. Spira (January 1983). "A Distributed Algorithm for Minimum-Weight Spanning Trees"(PDF). ACM Transactions on Programming Languages and Systems. 5 (1): 66–77. doi:10.1145/357195.357200. 
  45. ^Korach, Ephraim; Kutten, Shay; Moran, Shlomo (1990). "A Modular Technique for the Design of Efficient Distributed Leader Finding Algorithms"(PDF). ACM Transactions on Programming Languages and Systems. 12 (1): 84–101. doi:10.1145/77606.77610. 
  46. ^Hamilton, Howard. "Distributed Algorithms". Retrieved 2013-03-03. 
  47. ^http://cstheory.stackexchange.com/questions/10045/major-unsolved-problems-in-distributed-systems
  48. ^http://www.theserverside.com/feature/How-big-data-and-distributed-systems-solve-traditional-scalability-problems
  49. ^Papadimitriou (1994), Section 19.3.

References[edit]

Books
  • Andrews, Gregory R. (2000), Foundations of Multithreaded, Parallel, and Distributed Programming, Addison–Wesley, ISBN 0-201-35752-6 .
  • Arora, Sanjeev; Barak, Boaz (2009), Computational Complexity – A Modern Approach, Cambridge, ISBN 978-0-521-42426-4 .
  • Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. (1990), Introduction to Algorithms (1st ed.), MIT Press, ISBN 0-262-03141-8 .
  • Dolev, Shlomi (2000), Self-Stabilization, MIT Press, ISBN 0-262-04178-2 .
  • Elmasri, Ramez; Navathe, Shamkant B. (2000), Fundamentals of Database Systems (3rd ed.), Addison–Wesley, ISBN 0-201-54263-3 .
  • Ghosh, Sukumar (2007), Distributed Systems – An Algorithmic Approach, Chapman & Hall/CRC, ISBN 978-1-58488-564-1 .
  • Lynch, Nancy A. (1996), Distributed Algorithms, Morgan Kaufmann, ISBN 1-55860-348-4 .
  • Herlihy, Maurice P.; Shavit, Nir N. (2008), The Art of Multiprocessor Programming, Morgan Kaufmann, ISBN 0-12-370591-6 .
  • Papadimitriou, Christos H. (1994), Computational Complexity, Addison–Wesley, ISBN 0-201-53082-1 .
  • Peleg, David (2000), Distributed Computing: A Locality-Sensitive Approach, SIAM, ISBN 0-89871-464-8 .
Articles
  • Cole, Richard; Vishkin, Uzi (1986), "Deterministic coin tossing with applications to optimal parallel list ranking", Information and Control, 70 (1): 32–53, doi:10.1016/S0019-9958(86)80023-7 .
  • Keidar, Idit (2008), "Distributed computing column 32 – The year in review", ACM SIGACT News, 39 (4): 53–54, doi:10.1145/1466390.1466402 .
  • Linial, Nathan (1992), "Locality in distributed graph algorithms", SIAM Journal on Computing, 21 (1): 193–201, doi:10.1137/0221015 .
  • Naor, Moni; Stockmeyer, Larry (1995), "What can be computed locally?"(PDF), SIAM Journal on Computing, 24 (6): 1259–1277, doi:10.1137/S0097539793254571 .
Web sites

Further reading[edit]

Books
  • Attiya, Hagit and Jennifer Welch (2004), Distributed Computing: Fundamentals, Simulations, and Advanced Topics, Wiley-Interscience ISBN 0-471-45324-2.
  • Christian Cachin; Rachid Guerraoui; Luís Rodrigues (2011), Introduction to Reliable and Secure Distributed Programming (2. ed.), Springer, ISBN 978-3-642-15259-7 
  • Coulouris, George; et al. (2011), Distributed Systems: Concepts and Design (5th Edition), Addison-Wesley ISBN 0-132-14301-1.
  • Faber, Jim (1998), Java Distributed Computing, O'Reilly : Java Distributed Computing by Jim Faber, 1998
  • Garg, Vijay K. (2002), Elements of Distributed Computing, Wiley-IEEE Press ISBN 0-471-03600-5.
  • Tel, Gerard (1994), Introduction to Distributed Algorithms, Cambridge University Press 
  • Chandy, Mani; et al., Parallel Program Design 
Articles
  • Keidar, Idit; Rajsbaum, Sergio, eds. (2000–2009), "Distributed computing column", ACM SIGACT News .
  • Birrell, A. D.; Levin, R.; Schroeder, M. D.; Needham, R. M. (April 1982). "Grapevine: An exercise in distributed computing"(PDF). Communications of the ACM. 25 (4): 260–274. doi:10.1145/358468.358487. 
Conference Papers

External links[edit]

(a), (b): a distributed system.
(c): a parallel system.

0 comments

Leave a Reply

Your email address will not be published. Required fields are marked *