Interactive communication of balanced distributions and edited files


Introduction

Interactive communication concerns the transmission required when a sender wants to convey information to a receiver who has some related data. A (recommended but not prerequisite) general overview describes the basic scenario, defines the main concepts, and mentions some results on the effects of interaction. In this outline, we discuss a class of problems that are more likely to arise in practice.

We study problems where the sender's information is "comparable" to that known to the receiver. We call such distributions balanced. They include distributed data-base systems where closely related files reside in different locations and occasionally a file needs to be communicated from one location to another. We provide general results pertaining to all balanced distributions and analyze in detail files that are close in terms of their Hamming distance (transmitted files) and their edit distance (edited files).


Preview of results

As a subclass, we can say more about balanced distributions than about general ones. In particular, for this subclass we can resolve the two main open problems that remain unsolved in general. We show that for all balanced problems, one message requires at most twice the optimal number of bits and three messages are asymptotically optimal.

Perhaps more interestingly, we show that for all balanced distributions the sender needs essentially the same number of bits whether or not he knows the receiver's information in advance. Consider for example two similar files, different in only a few bits, that are stored in two locations. One location (the "sender") wants to communicate its file to the other (the "receiver"). We compare the following scenarios: (1) the sender knows the receiver's file in advance; (2) the sender does not know the receiver's file in advance. In the first scenario, the sender can simply transmit the locations where the two files differ. In the second, the sender cannot determine the difference locations, so it may appear that many more bits are needed. Yet the result says that the number of bits required is essentially the same in both cases.


Outline

In the next few segments we review the general communication scenario and the complexity measures involved. We illustrate these concepts via the League example which was studied extensively in the general overview. We then define balanced distributions and motivate them via several examples, including files that are within a small "distance" from each other. Thereafter, we describe the results and apply them to show for example that similar files can be communicated efficiently. Finally, we mention some open problems.


Review of the communication scenario

In the general communication scenario, a sender knows a random value X while a receiver knows a random value Y and wants to determine X. We are interested in the number of bits that must be transmitted in the worst case to ensure that the receiver can always determine X correctly.

We assume that X and Y are distributed according to a known probability distribution and that the sender and the receiver agree in advance on a communication protocol. Given the actual values of X and Y, the communicators alternate in transmitting messages to each other. Each message is determined by the protocol based on the value known to its transmitter and on previous transmissions.


Bit counting

We let Wm denote the least total number of bits that must be transmitted by the two communicators in the worst case when they are allowed to exchange at most m messages. W1 is therefore the number of bits required with one message, namely, when only the sender can transmit, and W2 is the total number of bits required with at most two messages: the receiver transmits and the sender replies.

Allowing (but not requiring) additional messages can only reduce the number of bits. Hence W1, W2, W3, ... is a non-increasing sequence of nonnegative integers and therefore tends to a limit W, the number of bits that must be transmitted in the worst case when there is no restriction on the number of messages exchanged.


League

It is convenient to illustrate the aforementioned quantities via the following example. There are n teams in a league. Bob knows two teams that played in a game and Alice knows the game's winner. How many bits of information must Alice and Bob exchange in the worst case for Bob to find out who won?

We'll describe two protocols. Both assume that Alice and Bob agree in advance on a log n-bit representation of the n teams.

If only Alice can transmit, she simply transmits the winner's representation, showing that

W1 ≤ log n.

If two messages are allowed, Bob uses log log n bits to describe a location where the two teams differ, and Alice replies with the winning team's bit in that location. Therefore,

W2 ≤ log log n + 1.
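For concreteness, here is a small Python sketch of this two-message protocol (my own illustration; it assumes the teams are numbered 0 through n-1, so each number is its own log n-bit representation):

def bob_message(team_a, team_b):
    # a bit location where the two teams' representations differ (log log n bits)
    return (team_a ^ team_b).bit_length() - 1

def alice_reply(winner, location):
    # the winning team's bit in that location (1 bit)
    return (winner >> location) & 1

def bob_decides(team_a, team_b, bit):
    location = bob_message(team_a, team_b)
    return team_a if (team_a >> location) & 1 == bit else team_b

assert bob_decides(5, 6, alice_reply(6, bob_message(5, 6))) == 6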

This problem is discussed extensively in the general overview which shows that the first protocol is optimal for one-way communication and that the second is optimal with any number of messages. Hence

W1 = log n
and
W2 = W3 = ... = W = log log n + 1.


If the sender knew the receiver's information

In our scenario, the sender knows his random value X, but doesn't know the receiver's information, Y. It is also interesting to consider a simpler scenario where the sender knows both X and Y.

In this easier-to-analyze setup, the sender has strictly more information than the receiver, hence can predict the receiver's replies. Therefore, to minimize communication, the receiver should never transmit. It follows that in this scenario, a single message sent by the sender always achieves the optimal number of bits, which we denote by Ŵ.

Example: When the League problem is viewed under this scenario, the sender knows the two teams that played and which of them won. He can simply indicate whether the winning team is lexicographically lower or higher than the other team. This is clearly the best he can do, hence
Ŵ = 1.

A main part of our investigation involves the comparison of the number of bits required when the sender knows the receiver's data and when he doesn't.


W vs. "W"

One reason to consider the scenario where the sender knows the receiver's information is that it provides a simple lower bound on W.

Any protocol that works when the sender does not know the receiver's data can also be used when he does know it. Therefore, Wm ≥ Ŵ for all communication problems and every m. It follows that for all communication problems,

W ≥ Ŵ.

In general, there can be an arbitrary discrepancy between the number of bits needed when the sender knows the receiver's information and when he doesn't. For example, in the League problem,

Ŵ = 1,
yet
W = log log n + 1,
hence there is an unbounded transmission increase when the sender doesn't know the receiver's information (the game) in advance.

One of this outline's main results is that for all balanced distributions there is almost no difference between Ŵ and W. But before we discuss balanced distributions, we need to get some support and resolve some ambiguity.


The support set

Let p(x,y) denote the probability distribution underlying X and Y. Since we consider the worst-case number of bits and allow no errors, all the Wm's are determined by the support set of X and Y, the set

S = {(x,y) : p(x,y) > 0}

of (x,y) values that can occur. The precise values of p(x,y) over S are irrelevant.

Example: In the League problem we did not need the precise probabilities of games and winners. All we had to know was that every two teams could play and that the winner was one of them. Namely that the support set was the set of all (x,y)'s where y is a pair of teams (a game) and x is one of the two teams in y (the winner).


The receiver's ambiguity

The receiver's ambiguity when he has the value y is

A(y) = |{x : (x,y) is in S}|,

the number of possible x values the sender may know in that case. The receiver's maximum ambiguity is

AR = max {A(y)},
the largest number of x values possible with any given y value.

Example: In the League problem, the receiver knows two teams that played in a game and the sender knows the game's winner. For every game known to the receiver, the sender knows one of two possible winners. Hence, the receiver's ambiguity A(y) is always 2, and so is his maximum ambiguity AR.

Why ambiguity

The receiver's ambiguity is closely related to the number of bits required when the sender knows the receiver's information in advance. It is easy to verify that in this scenario the following simple protocol is optimal.

For every value y the receiver may have, the communicators agree on an ordering of the A(y) values the sender may have. Given the actual values, the sender considers the ordering corresponding to y (recall that we assume the scenario where the sender knows y) and transmits log A(y) bits describing the index of x in that ordering. It is easy to verify that no fewer bits suffice in the worst case, hence
Ŵ = log AR. (1)

Example: Some segments ago we outlined a protocol that achieved Ŵ for the League problem. That protocol is of the type suggested here. For every game known to the receiver, the two participating teams are ordered lexicographically. The sender, knowing the game and the winner, transmits 0 if the first team won and 1 if the second did. As claimed in (1),

Ŵ = 1 = log 2 = log AR.


The sender's ambiguity

Just as we defined the receiver's ambiguity, the sender's ambiguity when he knows the value x is the number A(x) of possible y values the receiver can have in that case. The sender's maximum ambiguity, AS, is the largest of the A(x)'s over all values of x. It is the largest number of y values possible with any given value the sender may have.

While less clear a priori, the sender's ambiguity too plays a role in determining the number of bits required when the sender does not know the receiver's information in advance.


Balanced distributions

A communication problem is balanced if the sender and the receiver have the same maximum ambiguities:

AR = AS.

The league problem is highly unbalanced as AR = 2 (number of possible game winners) while AS = n-1 (number of possible opponents).

On the other hand, several balanced distributions arise in practice.


Why balanced distributions?

In many applications (we will soon encounter some), the sender and the receiver have comparable data that are within a small "distance" from each other.

Since the distance between x and y is the same as that between y and x, the support sets of all those distance distributions are symmetric: (x,y) is in S if and only if (y,x) is in S. Every such symmetric distribution is clearly balanced.

There are therefore several balanced distributions that arise naturally from distance measures. We now describe a few.


Temperatures

A meteorologist wants to convey the temperature in New York to a colleague who knows the temperature in New Jersey. We assume that the temperatures can attain any integer value but are always at most t degrees apart. Formally, the problem is defined by its support set { (x,y) : x, y in Z and |x-y| ≤ t }.

This is a distance distribution, hence it is symmetric, so balanced. The receiver's and sender's maximum ambiguities are

AR = AS = 2t+1.

Note that the largest temperature difference t is a fixed value known in advance and that the protocol is designed with t in mind.

If the sender knows the receiver's temperature y in advance, the number of bits he needs to transmit in the worst case is

"W" = log AR = log (2t+1).

One way to achieve this number is for the sender to transmit the temperature difference x-y which can attain exactly 2t+1 values.

How many bits would the sender need to transmit if he doesn't know y in advance?

For this problem, the same number of bits suffices even when the sender doesn't know the receiver's information in advance. It is easy to verify that each x in the range y-t to y+t has a different remainder when divided by 2t+1. Hence if the sender transmits the remainder x mod (2t+1), the receiver, based on his knowledge of y and that remainder, can uniquely determine x.

For example, if t = 3, x = 19, and y = 17, then the sender transmits 19 mod 7 = 5. It is easy to verify that no other temperature between 17-3 = 14 and 17+3 = 20 has remainder 5 modulo 7, hence the receiver can tell that x = 19.

Since the remainder can attain 2t+1 possible values (from 0 to 2t), the number of bits transmitted is log (2t+1).
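A minimal Python sketch of this modular protocol (an illustration of the scheme just described):

def encode(x, t):
    # the sender transmits the remainder of x modulo 2t+1
    return x % (2 * t + 1)

def decode(y, r, t):
    # exactly one value in the range y-t .. y+t has remainder r modulo 2t+1
    for z in range(y - t, y + t + 1):
        if z % (2 * t + 1) == r:
            return z

assert decode(17, encode(19, 3), 3) == 19   # the example above: t = 3, x = 19, y = 17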

It follows that for temperatures,

W1 = ... = W = Ŵ = log (2t+1).


Cyclic shifts

(To be added later.)


Thickening the plot

W1, W, and Ŵ are not always all equal. We already saw that for the League, W1 = log n, W = log log n + 1, and Ŵ = 1. We now present some balanced distributions where these measures differ.


Hamming distance

The Hamming distance, dH, between two equal-length binary sequences is the number of locations where they differ. For example, dH(010,001)=2. The Hamming distance plays an important role in reliable data communication and storage.

There are 2^n binary sequences of length n, collectively denoted by {0,1}^n. For example, {0,1}^2 = {00, 01, 10, 11}. The Hamming ball of radius t around a sequence x in {0,1}^n is the set of all n-bit sequences at Hamming distance at most t from x. For example, the Hamming ball of radius 2 centered at 0110 consists of 0110 itself, the four sequences 1110, 0010, 0100, and 0111 at distance 1 from it, and the six sequences 0101, 0011, 1111, 0000, 1100, and 1010 at distance 2. The ball's volume, the number of sequences it contains, is easily seen to be

(n choose 0) + (n choose 1) + ... + (n choose t).
It is determined by n and t, but is independent of the actual center point x.
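The volume is easy to compute; a quick check of the radius-2 example above:

from math import comb

def ball_volume(n, t):
    # number of n-bit sequences within Hamming distance t of any fixed center
    return sum(comb(n, k) for k in range(t + 1))

assert ball_volume(4, 2) == 1 + 4 + 6   # the radius-2 ball around 0110 contains 11 sequences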


Transmitted files

An n-bit file is transmitted to two users. Due to transmission errors, a total of at most t bits may be corrupted, namely changed from 0 to 1 or vice versa. Consequently, each user holds an n-bit file and the two files differ in at most t locations. One of the users wants to convey his file to the other. How many bits must be transmitted?

Formally, x and y are in {0,1}^n, and dH(x,y) ≤ t.

Again, this is a distance distribution, hence it's symmetric and balanced. The receiver's and sender's maximum ambiguities equal the volume of the Hamming ball of radius t in {0,1}^n:

AR = AS = (n choose 0) + (n choose 1) + ... + (n choose t).

If the sender knew the receiver's file, the worst-case number of bits he would need to transmit is

Ŵ = log [ (n choose 0) + (n choose 1) + ... + (n choose t) ] .

Basic combinatorial inequalities show that

t (log n - t) ≤ Ŵ ≤ t log n .

When t is fixed and n increases, this number remains within a constant of t log n. A simple scheme almost achieves t log n: the sender transmits t log n bits simply specifying the locations of the differences.
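A sketch of that simple scheme (an illustration, assuming the sender knows the receiver's file y):

from math import ceil, log2

def difference_locations(x, y):
    # the sender, knowing both files, lists the locations where they differ
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]

def bits_needed(n, t):
    # t locations, log n bits each
    return t * ceil(log2(n))

assert difference_locations('0110', '0011') == [1, 3]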

If the sender doesn't know the receiver's file, he can't specify the locations where they differ. How many bits are needed then? We'll resolve this problem after discussing the general results.


Edit distance

The edit distance, de, between two binary sequences is the number of bits that must be deleted from and inserted into one to derive the other. For example, de(0101, 001) = 1 and de(0101, 1010) = 2. Note that the two sequences need not have the same length, and that when they do, the edit distance may be significantly smaller than the Hamming distance.
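Since only insertions and deletions are allowed, de(x,y) equals |x| + |y| minus twice the length of a longest common subsequence. A small sketch verifying the two examples above:

from functools import lru_cache

def de(x, y):
    @lru_cache(maxsize=None)
    def lcs(i, j):                        # longest common subsequence of x[:i] and y[:j]
        if i == 0 or j == 0:
            return 0
        if x[i-1] == y[j-1]:
            return lcs(i-1, j-1) + 1
        return max(lcs(i-1, j), lcs(i, j-1))
    return len(x) + len(y) - 2 * lcs(len(x), len(y))

assert de('0101', '001') == 1 and de('0101', '1010') == 2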

As suggested by its name, the edit distance is often used as an idealized measure of the similarity of two edited files.


Edited files

Of all the problems described in this outline, the one with the most obvious application, and the one that motivated the study of interactive communication, is the following edited-files problem.

Alice and Bob collaborate on a joint book. One day, Alice edits her version of the book. She performs t edit operations, where each operation is either an insertion or a deletion of a bit. She then wants to communicate her new version to Bob. Formally, x and y are binary sequences and de(x,y) ≤ t.

How many bits must be transmitted?


If Alice keeps the original (reference) file

One solution is for Alice to keep the original, or reference, file. When she wants to communicate her new version, she simply transmits the changes from the reference file. Bob's ambiguity in that case is the number of sequences within edit distance t from y, and Alice needs to transmit the logarithm of that number.

As with the Hamming distance, Alice can approach the optimal number of bits by simply specifying the locations of all deletions, insertions of 0's, and insertions of 1's. That requires roughly t log n bits. ("Roughly" because she needs to specify the number of operations of each type, and because the sequence length, and hence the number of possible locations, changes as bits get inserted or deleted.)

But what if Alice loses the reference file?


Why lose the reference file?

There could be several reasons for assuming that Alice doesn't have a reference file.

Bob, for example, may have edited his copy since the two last agreed. In that case, Alice would be sending modifications for the wrong file, resulting in a garbled update. Alice's own copy of the reference file may likewise be out of date, again resulting in a garbled update. To avoid that, both Alice and Bob need to keep the reference file on which they last agreed. But this becomes expensive as the number of collaborators grows: with m users, every two users must keep the file they last agreed on, and a cost of m-1 files per user may be too high.

And while in the above cases the communicators can maintain a reference file, albeit at some cost, there are cases where reference files simply don't exist, such as when Alice and Bob's files are:

Of all the reasons for not keeping a reference file, the one I like best is that it isn't needed: edited files can be communicated almost as efficiently without one. This last result follows from the general results described next.


General results (balanced distributions)

The following segments describe results applying to general balanced distributions.


One message requires at most twice the optimum

For general distributions, one message may require exponentially more bits than the minimum necessary. For balanced distributions, the difference is less dramatic.

For all balanced distributions,
W1 ≤ 2W + 1. (2)
Furthermore, this is essentially the strongest bound possible. For some balanced distributions,

W1 ≥ 2W - 4. (3)
This discrepancy is achieved by projective planes.


Projective planes

(To be added later.)


Two messages may require twice the optimum

For general distributions there is a significant difference between the largest possible communication loss associated with one and two messages. One message may be exponentially wasteful while two messages are within a factor of four from the optimum.

For balanced distributions, the largest increase is essentially the same for one and two messages: both are within twice the optimum. Since W2 ≤ W1, it follows from (2) that two messages are also within twice the optimum. Conversely, for some balanced distributions:

W2 ≥ 2W - o(W). (4)

There is a difference though. Whereas for one message we know of a balanced distribution that attains this ratio (projective planes), for two messages we only know that such a distribution exists.

Open problem: Find a concrete balanced problem where two messages are not asymptotically optimal.


Three messages are asymptotically optimal

For general distributions three messages require at most four times the optimum number of bits and may require at least twice the optimum. The precise ratio is unknown.

For balanced distributions three messages are asymptotically optimal. For all balanced distributions,

W3 ≤ W + 3 log W + 11. (5)

This follows from the following result.


As if the sender knew the receiver's information

Perhaps the most interesting result about balanced distributions is that there is essentially no difference between the number Ŵ of bits required when the sender knows the receiver's information in advance and the number required when he doesn't.

For all balanced distributions,
W3 "W" + 3 log "W" + 11. (6)

Namely, for all balanced distributions, the sender can communicate his information using essentially the same number of bits he would need if he had known the receiver's information in advance. Note also that this can be done using at most three messages.

We demonstrate this result via edited files.


Application to edited files

Alice and Bob collaborate on a 2^25-bit (about four million characters) book. Each performs at most 1,000 edit operations, where an operation is either a deletion or an insertion of a bit. As a result, Alice knows a file x and Bob knows a file y such that

de(x,y) ≤ 2,000.
Alice wants to communicate her file to Bob.

If Alice knows Bob's file, she can describe the locations of all deletions, insertions of 0's, and insertions of 1's. That requires

Ŵ ≈ 2,000 · log 2^25 = 2,000 · 25 = 50,000 bits.

If Alice doesn't know Bob's file, they can use the result in Bound (6) and communicate the file using three messages and a total of at most

W3 ≤ Ŵ + 3 log Ŵ + 11 ≈ 50,000 + 3 log 50,000 + 11 ≤ 50,058 bits.
Therefore at most 58 additional bits are needed when Alice doesn't know Bob's file in advance.
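The arithmetic, spelled out (log denotes the base-2 logarithm throughout):

from math import log2

w_hat = 2000 * 25                    # 2,000 locations, 25 = log 2**25 bits each
w3 = w_hat + 3 * log2(w_hat) + 11    # Bound (6): about 50,057.7
assert w3 <= 50058                   # so at most 58 extra bits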


Practical protocols

Bound (6) says that there are efficient three-message protocols for communicating transmitted files (small Hamming distance) and edited files (small edit distance). Unfortunately, it only shows that these protocols exist, but doesn't tell us what they are.

We now discuss practical protocols for these problems.


Practical protocols for transmitted files

One-way communication of n-bit files that are within Hamming distance t of each other is related to the problem of correcting t errors.

Every one-way protocol can be converted into an error-correcting code, and every linear error-correcting code can be converted into a one-way protocol.

Using BCH codes, one can construct one-way protocols for the Hamming distance problem that require at most t log (n+1) bits.
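To illustrate the conversion in the simplest case, here is a sketch using the [7,4] Hamming code (my own illustration, so n = 7 and t = 1; a file is an integer whose binary digits are the file's bits). The sender transmits the syndrome of his file; the receiver XORs it with his own file's syndrome to find the differing location:

H = [0b1010101, 0b1100110, 0b1111000]   # row r checks the positions whose 1-based index has bit r set

def syndrome(word):
    return [bin(row & word).count('1') % 2 for row in H]

def sender_message(x):
    return syndrome(x)                   # t * log(n+1) = 3 bits

def receiver_decodes(y, s):
    d = [a ^ b for a, b in zip(syndrome(y), s)]   # the syndrome of x XOR y
    position = d[0] + 2 * d[1] + 4 * d[2]         # 0 if x = y, else the differing 1-based position
    return y ^ (1 << (position - 1)) if position else y

x, y = 0b1011001, 0b1011101              # two 7-bit files differing in one location
assert receiver_decodes(y, sender_message(x)) == x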


Practical protocols for edited files

It can be shown that efficient edited-files protocols exist if and only if there are efficient protocols for the following insertions problem.

A sequence x is a supersequence of sequence y, and y is a subsequence of x, if x can be derived from y via insertions of bits. For example, 110010 is a supersequence of 101, but 0100 isn't.

In the t-insertions problem, the receiver knows an n-bit sequence y and the sender knows an (n+t)-bit supersequence x of y.

For example, if n=3 and t=1, then the receiver may know the sequence 010 while the sender has 0101.


The regularity of insertions

Insertions have a property that makes them particularly easy to analyze.

Consider fixed integers t ≤ n. Different n-bit sequences may have different numbers of (n-t)-bit subsequences. For example, 000 has only one 2-bit subsequence (00) while 010 has three (10, 00, and 01). Yet all n-bit sequences have the same number of (n+t)-bit supersequences. For example, 000 has five 4-bit supersequences (0000, 1000, 0100, 0010, and 0001) and so does 010 (0010, 0100, 1010, 0110, and 0101).

Given this equality, it is easy to verify that, like the all-0 sequence, every n-bit sequence has
(n+t choose 0) + (n+t choose 1) + ... + (n+t choose t) (7)
(n+t)-bit supersequences.
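A brute-force check of this regularity (an illustration; it enumerates all (n+t)-bit sequences, so it is only practical for small n):

from itertools import product
from math import comb

def is_subsequence(y, x):
    it = iter(x)
    return all(bit in it for bit in y)   # greedy scan: True iff y is a subsequence of x

def num_supersequences(y, t):
    n = len(y) + t
    return sum(is_subsequence(y, ''.join(bits)) for bits in product('01', repeat=n))

# every 3-bit sequence has (4 choose 0) + (4 choose 1) = 5 supersequences of length 4
assert all(num_supersequences(''.join(y), 1) == comb(4, 0) + comb(4, 1)
           for y in product('01', repeat=3))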


One insertion

We first consider a single insertion. Namely, the receiver has an n-bit sequence y and the sender has an (n+1)-bit sequence x derived from y via one insertion.

From (7), the receiver's ambiguity is

AR = n+2.
Hence, if the sender knows the receiver's sequence, he needs to transmit
Ŵ = log (n+2) bits.

A simple protocol that approaches this number: the sender tells the receiver which bit to insert and where.

In the example above, where y = 010 and x = 0101, if the sender knew the receiver's sequence he would tell him to insert 1 after the third bit location.

However, we assume that the sender doesn't know the receiver's information. All he sees is the sequence 0101, and he doesn't know which bit should be inserted where.


One-insertion protocol

We describe a simple optimal one-insertion protocol.

Let x1, x2, ..., xn+1 denote the bits of the sequence known to the sender.

The sender simply transmits (1·x1 + 2·x2 + ... + (n+1)·xn+1) mod (n+2), namely the remainder of this weighted sum when divided by n+2. It can be shown that, based on y and this remainder, the receiver can always determine the sender's sequence.

For example, take the above example where n = 3 and t = 1, and we assumed that y = 010 and x = 0101. The sender transmits (1·0 + 2·1 + 3·0 + 4·1) mod 5 = 6 mod 5 = 1. It is easy to verify that for all other 4-bit supersequences of 010, namely 0100, 0010, 1010, and 0110, the sender would transmit a different remainder (2, 3, 4, and 0, respectively). Hence the receiver can tell that x = 0101.
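A sketch of the full protocol (my own implementation of the scheme above; the decoder simply tries every possible insertion):

def checksum(x):
    # (1*x1 + 2*x2 + ... + (n+1)*xn+1) mod (n+2)
    return sum((i + 1) * int(b) for i, b in enumerate(x)) % (len(x) + 1)

def decode(y, r):
    # exactly one supersequence of y matches the transmitted remainder
    matches = {y[:i] + b + y[i:]
               for i in range(len(y) + 1) for b in '01'
               if checksum(y[:i] + b + y[i:]) == r}
    assert len(matches) == 1
    return matches.pop()

assert decode('010', checksum('0101')) == '0101'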


Open problems

Several problems remain unsolved.


Summary

We showed that for the worst-case number of bits of balanced distributions:

W1 ≤ 2W + 1. (2)
W1 ≥ 2W - 4. (3)
W2 ≥ 2W - o(W). (4)
W3 ≤ W + 3 log W + 11. (5)

The most interesting general result was that there is essentially no difference between the number of bits required when the sender knows the receiver's data and when he doesn't.

W3 "W" + 3 log "W" + 11. (6)  

Regarding practical protocols, we showed that:

We also mentioned some open problems, including:


Want to know more?

To find out (even) more about interactive communication you can read (coming "soon"):


References

To better understand the material presented in this outline and to see the proofs, you can read the following papers.

Bounds (2), (3), (5), and (6) and the single-insertion protocol appeared in:

Bound (4) is proven in:

The Hamming-distance protocol was issued as a patent for video compression: and was independently rediscovered in: