Benefits of Cache Assignment on Degraded Broadcast Channels

Abstract—Degraded K-user broadcast channels (BC) are studied when receivers are facilitated with cache memories. Lower and upper bounds are derived on the capacity-memory tradeoff, i.e., on the largest rate of reliable communication over the BC as a function of the receivers' cache sizes, and the bounds are shown to match for some special cases. The lower bounds are achieved by two new coding schemes that benefit from non-uniform cache assignment. Lower and upper bounds are also established on the global capacity-memory tradeoff, i.e., on the largest capacity-memory tradeoff that can be attained by optimizing the receivers' cache sizes subject to a total cache memory budget. The bounds coincide when the total cache memory budget is sufficiently small or sufficiently large, characterized in terms of the BC statistics. For small cache memories, it is optimal to assign all the cache memory to the weakest receiver. In this regime, the global capacity-memory tradeoff grows as the total cache memory budget divided by the number of files in the system. In other words, a perfect global caching gain is achievable in this regime and the performance corresponds to a system where all cache contents in the network are available to all receivers. For large cache memories, it is optimal to assign a positive cache memory to every receiver such that the weaker receivers are assigned larger cache memories compared to the stronger receivers. In this regime, the growth rate of the global capacity-memory tradeoff is further divided by the number of users, which corresponds to a local caching gain. Numerical results suggest that a uniform assignment of the total cache memory is suboptimal in all regimes unless the BC is completely symmetric. For erasure BCs, this claim is proved analytically in the regime of small cache sizes.


Shirin Saeedi Bidokhti, Michèle Wigger, and Aylin Yener

I. INTRODUCTION
Storing popular contents at or close to the end users improves the network performance during peak-traffic times. The main challenge is that the contents have to be cached before knowing which files the users will request in peak-traffic periods. A conventional approach is to store popular contents in the cache memories of all the users. This allows the receivers to locally retrieve the contents without burdening the network and attain the so-called local caching gain. Maddah-Ali and Niesen [1] have shown that further caching gains, i.e., the so-called global caching gains, are achievable if different contents are stored at different users in a careful manner.
In particular, reference [1] considered a broadcast scenario with a transmitter that has access to a library of N independent files and with K receivers that are all equipped with individual cache memories of the same size. Communication during off-peak periods, when contents are cached, is assumed error-free and constrained only by the amount of information that can be placed in the cache memories at the receivers. This communication is henceforth called cache placement or placement phase. The subsequent peak-traffic communication is called delivery phase. In this phase, each receiver requests a single file from the library and the transmitter delivers the requested files by communicating over a common noise-free link to all K receivers. Reference [1] has proposed to diversify the placed contents across cache memories so as to allow for coding opportunities during the delivery phase. These coding opportunities make it possible to simultaneously serve multiple receivers in each transmission, providing gains that scale with the total size of all cache memories in the network, i.e., global caching gains. By contrast, the previously reported gains depend only on individual cache sizes and are thus referred to as local caching gains.
The performance metric in [1] is the minimum required delivery rate for given cache sizes, leading to the fundamental quantity of interest, the delivery rate-memory tradeoff. Upper and lower bounds on this tradeoff are provided in [1]. Improved upper bounds (achievability results) have subsequently been presented in [2]–[8] and improved lower bounds (converse results) in [9]–[13]. The common noise-free link model of [1] has also been studied for networks in which receivers have cache memories of different sizes [14]–[16]. Rate-limited links in the delivery phase have been considered in [17].
In this paper, we relax the assumption that delivery takes place over a noise-free link. Instead, we model the delivery phase by a degraded broadcast channel (BC). The class of degraded broadcast channels is a fairly general class that includes practical noisy channel models such as broadcast erasure channels and Gaussian channels. Our model is depicted in Figure 1. A transmitter communicates with receivers 1, . . . , K, which are equipped with cache memories of sizes nM_1, . . . , nM_K, when communication is of blocklength n.
Noisy broadcast channels with caching receivers have been studied in different settings [18]–[30]. For example, references [18]–[20] explore the benefits of coded caching for Gaussian or slow fading BCs when all users have the same cache sizes. References [26]–[30] study the interplay between coded caching with spatial multiplexing (MIMO), channel state information (CSI), or feedback. Most related to the current work are the works in [21]–[24] which focus on erasure BCs with a set of weak receivers that are equipped with cache memories of equal size and a set of strong receivers without cache memories or with smaller cache memories. In accordance with multi-user information theory metrics, performance in these works is measured in terms of the capacity-memory tradeoff, i.e., the largest message rate for which receivers can decode their requested messages reliably as a function of the cache sizes. Lower bounds (achievability results) and upper bounds (converse results) are presented on the capacity-memory tradeoff. The lower bounds are based on joint cache-channel coding schemes where encoders and decoders exploit both the knowledge of the channel statistics and the cache contents. This is in contrast to previous works, e.g., [18], which adopt a separate cache-channel coding architecture. In separate cache-channel coding, the encoders (resp. decoders) consist of (i) a cache encoder (resp. decoder) that only exploits the cache contents and (ii) a channel encoder (resp. decoder) that only exploits the channel statistics; see Figure 2. As the results in [21,22] show, when receivers have different channel statistics and weaker receivers have larger cache sizes, then adopting a joint cache-channel coding architecture can significantly improve performance. Moreover, [21] illustrates that it is beneficial to assign larger cache memories to weaker receivers than to stronger receivers.
Joint cache-channel coding schemes have also been employed for transmission over noisy BCs with caching receivers when the files to be sent are correlated [31,32] or when receivers have different fidelity constraints [24,25,33]. In these applications, improvements are possible even when users have perfectly symmetric channels and cache sizes.
In this work, we consider the problem of efficient cache assignment in degraded broadcast channels and how to code under these assignments. We quantify new caching gains obtained through such assignments and appropriate coding.

A. A Motivating Example
Consider an erasure BC with K = 10 users, where receiver 1 has erasure probability δ_1 = 0.4 and all other receivers 2, . . . , K have erasure probability δ_2 = . . . = δ_K = 0.1. Assuming no cache memories, denote the point-to-point channel capacity of user k by C_{{k}} and the symmetric capacity of the K-user broadcast channel by C_{{1,...,K}} (see Section III for definitions). Now suppose that we are given a total cache size of nM to be distributed across the receivers (with any desired assignment); hence M = M_1 + . . . + M_K. We seek cache assignments, as well as coding schemes, that achieve a high message rate. In particular, we are interested in the global capacity-memory tradeoff, which we define as the largest rate achievable given the total cache budget M.
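As a quick numerical sanity check, the no-cache capacities in this example can be computed directly. The sketch below assumes the standard formulas C_k = 1 − δ_k for a point-to-point erasure channel and 1/Σ_k (1 − δ_k)^{-1} for the symmetric capacity of the degraded erasure BC; the variable names are ours, not the paper's.

```python
import numpy as np

# Parameters from the motivating example: K = 10 users,
# erasure probabilities delta_1 = 0.4 and delta_2 = ... = delta_10 = 0.1.
deltas = np.array([0.4] + [0.1] * 9)

# Point-to-point capacity of user k over an erasure channel: C_k = 1 - delta_k.
C_pp = 1.0 - deltas

# Symmetric capacity of the degraded erasure BC (no caches), assuming the
# well-known capacity region  sum_k R_k / (1 - delta_k) <= 1:
# the largest symmetric rate is  C = 1 / sum_k 1/(1 - delta_k).
C_sym = 1.0 / np.sum(1.0 / (1.0 - deltas))

print(f"C_1 = {C_pp[0]:.3f}, C_2..10 = {C_pp[1]:.3f}")
print(f"symmetric capacity C_{{1,...,K}} = {C_sym:.4f}")
```

For these numbers the symmetric capacity evaluates to 3/35 ≈ 0.0857, visibly dominated by the weak receiver's channel, which is what motivates assigning it the cache memory.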
The traditional approach assigns the same cache size nM/K to each receiver. Instead, we assign cache memories according to the channel strengths. For example, when the total cache rate M is small, we assign all the cache rate to the weak receiver, so M_1 = M. For large total cache rate, in the example at hand, we propose to assign a larger portion to the weak receiver and to distribute the rest uniformly over all strong receivers. The meaning of small and large cache sizes is made precise in Corollary 11 in Section VI. Tables I and II compare the rates that are achievable for this example by the cache assignments and coding schemes of this paper as well as by traditional uniform cache assignments and standard (separate cache-channel) codes. Table I treats the regime of small cache memories and Table II treats the regime of large cache memories. In both tables, the first column presents an upper bound on the capacity-memory tradeoff C(M/K, . . . , M/K) under uniform cache assignment (i.e., when the total cache memory M is assigned uniformly over the K users). The second column shows the rate that is achievable using the cache assignments and coding schemes proposed in this paper. We will show that these rates equal the global capacity-memory tradeoff C(M) in the regimes of small and large cache sizes. The third column considers the same cache assignment as in column 2, but presents the achievable rate R using standard (separate cache-channel) codes. More specifically, in the regime of small cache sizes, a standard BC code is used to communicate to each receiver the part of its requested file that is not in its cache memory. Since in this example only the weak receiver has a cache memory, coded caching (multicasting) is not possible.
Under the cache assignment in Table II (large cache memories), the rate achieved with separate cache-channel coding is small enough that the weak receiver 1 can store all files in its cache memory, thus precluding delivery communication to this receiver. The optimal separation-based strategy is then to perform coded caching with parameter K − 2 (designed for the strong receivers) followed by a standard BC multicast code to those receivers. A comparison of columns 2 and 3 in Tables I and II shows that a smart cache assignment creates new coding opportunities that can be exploited by joint cache-channel coding.
From Table I, column 2, we further observe the following behavior of the global capacity-memory tradeoff C(M). First of all, without cache memory, C(M = 0) equals the largest symmetric rate C_{{1,...,K}} that is achievable to all the receivers in the BC. Now consider the slope with which C(M) increases with M. For small total cache budget M, smart cache assignment and coding (see column 2) achieve a steeper slope than traditional uniform cache assignment (see column 1) as well as separate cache-channel coding (see column 3). In particular, our proposed cache assignment and coding achieve what we call a perfect caching gain, where the capacity-memory tradeoff grows as M/N, i.e., like the total size of all cache memories in the network divided by the number of files N. This is the same performance as if each receiver had access to all cache memories in the network.
From Table II, we observe that for large cache memories, a smart cache assignment increases the capacity to the point that the weak channel of user 1 is no longer limiting in the delivery phase. The gain of additional cache memories is, however, only local, i.e., the capacity-memory tradeoff only grows as M/(K·N). We remark that the results in this paper are not restricted to erasure BCs and hold for general memoryless degraded BCs.

B. Main Contributions and Implications
The main contributions of this paper are as follows.
• New coding schemes: We propose two new joint cache-channel coding schemes for degraded broadcast networks with heterogeneous cache sizes: superposition piggyback-coding and generalized coded caching. In superposition piggyback-coding, we assume a single cache memory at the weakest receiver, and our delivery scheme loads (piggybacks) the information that is intended for the stronger receivers and cached at the weakest receiver onto the information that is communicated to this weakest receiver².
When the rate of the piggybacked information is modest, the decoding at the strong receivers can be done without harming the performance at the weak receiver. The communication to the stronger receivers can thus be viewed as being almost for free. In some sense, piggyback coding provides the stronger receivers virtual access to the weak receiver's cache memory, as if these cache contents were locally present at the stronger receivers.
All receivers gain virtual access to the weakest receiver's cache memory (the only cache memory in this case) and hence a perfect caching gain is achieved. In generalized coded caching, all receivers have cache memories, but weaker receivers have larger cache sizes than stronger receivers. We build the placement and delivery similarly to the coded-caching scheme in [1]. However, by assigning larger cache memories to the weaker users, we create a new coding opportunity: piggyback coding. We piggyback information for the stronger receivers on the communication to the weaker receivers (without harming the weaker receivers' decodability). Using our coding scheme, the amount of virtual cache memory that is provided to the stronger receivers increases compared to the original coded-caching scheme, resulting in improved performance. We show that generalized coded caching is optimal for a specific cache assignment.
• A New Converse Result: We prove a general converse result for degraded BCs with arbitrary cache sizes at the receivers. Our result strictly improves over the existing converse results for degraded BCs in [22,35], and at the time of submission³, also over all previous converse results for the noise-free bit-pipe model [1], [10]–[12].
• Global Capacity-Memory Tradeoff: We study the problem of cache assignment in cache-aided noisy broadcast networks, and derive new upper and lower bounds on the global capacity-memory tradeoff, i.e., on the largest rate that is achievable under an optimized cache assignment.

² Piggyback coding was proposed in a version without superposition coding in [21,22]. This original version can be seen as a simplified version of the coding scheme in [34] (without binning) in the context of Slepian-Wolf coding over broadcast channels.
³ The parallel work [36] slightly improves on this bound for the noise-free bit-pipe model (but does not generalize to noisy channels); see also [37].
The bounds match when the total available cache budget is small or large. For a small total cache budget M, a perfect global caching gain is achievable. For larger cache budgets M, the caching gain diminishes as M increases. Finally, for M larger than a certain threshold, only a local caching gain is possible; i.e., the performance corresponds to a system where all receivers store the same content and no coded delivery is possible. In this case, the global capacity-memory tradeoff grows as M/(K·N), where K denotes the number of receivers. Finally, we demonstrate numerically that the popular approach of assigning equal cache sizes to all receivers is suboptimal over Gaussian and erasure BCs. We further prove this analytically for erasure BCs in the small cache size regime.

C. Notation
Random variables are denoted by uppercase letters, e.g., A, their alphabets by the matching calligraphic font, e.g., 𝒜, and elements of an alphabet by lowercase letters, e.g., a ∈ 𝒜. We also use uppercase letters for deterministic quantities like the rate R, the capacity C, the number of users K, the cache size M, and the number of files in the library N. Vectors are identified by bold font symbols, e.g., a, and matrices by a separate font, e.g., A. We use the shorthand notation A^n for the sequence A_1, . . . , A_n, where n is an integer. The Cartesian product of 𝒜 and 𝒜′ is 𝒜 × 𝒜′, and the n-fold Cartesian product of 𝒜 is 𝒜^n. Further, |𝒜| denotes the cardinality of 𝒜. The notation (a)^+, for a ∈ ℝ, refers to max(0, a).
We will be using the abbreviation i.i.d. for independent and identically distributed.

D. Outline
The remainder of the paper is organized as follows. Section II describes the problem setup. Section III summarizes known results for the scenario where there is no cache memory in the network. The main results of this paper are described in Sections IV-VI. Section VII specializes these results to the examples of erasure and Gaussian BCs, and to the noise-free bit-pipe model. The paper is concluded in Section VIII.

II. PROBLEM DEFINITION
Consider a network with a transmitter and receivers 1, . . . , K. The transmitter has access to a library with N independent messages, W_1, . . . , W_N, each distributed uniformly over the set {1, . . . , 2^{nR}}. Here, R ≥ 0 denotes the rate of transmission and n is the transmission blocklength. We assume that N ≥ K.
Each receiver k ∈ K := {1, . . . , K} is equipped with a cache of size nM_k bits, where M_k ≥ 0. The sizes of the cache memories thus scale linearly in the blocklength n. Communication takes place in two phases. First, in the placement phase, the transmitter chooses caching functions g_k and places the cache content V_k := g_k(W_1, . . . , W_N) in receiver k's cache. This phase takes place in a noiseless fashion.
Without loss of generality, we assume throughout that the K users of the degraded BC are ordered from weakest to strongest.
At the beginning of the delivery phase, each receiver k produces a random demand D_k from the set N := {1, . . . , N} to indicate that it wishes to learn message W_{D_k}. The transmitter and all the receivers are informed about the entire demand vector D := (D_1, . . . , D_K).
Using this information, the transmitter forms the channel input sequence X^n = (X_1, . . . , X_n) as X^n = f(W_1, . . . , W_N, D) for some encoding function f: {1, . . . , 2^{nR}}^N × N^K → X^n. Each receiver k ∈ K observes the outputs Y_k^n := (Y_{k,1}, . . . , Y_{k,n}) of the DMC Γ(y_1, . . . , y_K | x) for inputs X^n. With the previously learned demand vector D, its local cache content V_k, and the channel outputs Y_k^n, it then produces its estimate Ŵ_{D_k} := φ_k(Y_k^n, V_k, D) of the desired message W_{D_k} by means of a decoding function φ_k. We define the probability of error as P_e^(n) := Pr[ ⋃_{k ∈ K} { Ŵ_{D_k} ≠ W_{D_k} } ], where we assume that the random demand vector D has a uniform distribution on N^K; i.e., D = d with probability N^{−K} for every d ∈ N^K. Notice that the probability of error cannot be reduced by using stochastic instead of deterministic caching, encoding, and/or decoding functions. It is thus without loss in optimality that in this paper we assume deterministic functions.
A rate-memory tuple (R, M_1, . . . , M_K) is achievable if for any ε > 0 there exists a sufficiently large blocklength n and caching, encoding, and decoding functions as in (2), (5), and (6) so that P_e^(n) ≤ ε.
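The two-phase operation just defined can be mirrored in a toy script. The following sketch (all names ours; a noiseless toy "channel"; no attempt at optimal rates) shows how caching functions, an encoder, and per-receiver decoders compose so that every receiver recovers its demanded file.

```python
from typing import List, Tuple

# Toy instantiation of the two-phase system: N files, K receivers,
# file payloads are 8-bit strings.
N, K = 4, 2
files = [f"{d:04b}" * 2 for d in range(N)]      # N files of 8 bits each

# Placement phase: caching function g_k maps the whole library to
# receiver k's cache content V_k (here: the first half of every file).
def g(k: int, library: List[str]) -> List[str]:
    return [w[: len(w) // 2] for w in library]

# Delivery phase: encoder f sees the library and the demand vector D and
# emits a "channel input" (here: the uncached halves, concatenated).
def f(library: List[str], demands: Tuple[int, ...]) -> str:
    return "".join(library[d][len(library[d]) // 2 :] for d in demands)

# Decoder phi_k of receiver k combines cache content and channel output.
def phi(k: int, y: str, V: List[str], demands: Tuple[int, ...]) -> str:
    half = len(files[0]) // 2
    return V[demands[k]] + y[k * half : (k + 1) * half]

D = (2, 3)
caches = [g(k, files) for k in range(K)]
x = f(files, D)
assert all(phi(k, x, caches[k], D) == files[D[k]] for k in range(K))
```

The point of the sketch is only the interface: placement sees all messages but not the demands, delivery sees both, and each decoder sees its own cache and channel output.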

Definition 1:
The capacity-memory tradeoff C(M_1, . . . , M_K) is the largest rate R for which the rate-memory tuple (R, M_1, . . . , M_K) is achievable:
C(M_1, . . . , M_K) := sup{R: (R, M_1, . . . , M_K) is achievable}.
Our main goal in this paper is to optimize the cache assignment (M_1, . . . , M_K) so as to attain the largest capacity-memory tradeoff C(M_1, . . . , M_K) under the total cache constraint
M_1 + . . . + M_K ≤ M.
Definition 2: The global capacity-memory tradeoff C(M) is defined as
C(M) := max C(M_1, . . . , M_K), where the maximization is over all cache assignments (M_1, . . . , M_K) satisfying M_1 + . . . + M_K ≤ M.
Remark 1: The global capacity-memory tradeoff depends on the BC law Γ(y_1, . . . , y_K|x) only through its marginal conditional laws Γ_1(y_1|x), . . . , Γ_K(y_K|x). All our results thus also apply to stochastically degraded BCs.

A. Minimum Delivery Rate
Most previous works on caching that modeled the BC as a noise-free bit-pipe, e.g., [1], have fixed the size of the messages to F bits and assumed that delivery takes place over ρ·F channel uses and that each receiver k ∈ K has m_k F bits of cache memory. Delivery rate ρ is then said to be achievable given normalized cache memory sizes m_1, . . . , m_K if there exist caching, encoding, and decoding functions such that the probability of error in (8) tends to 0 as F → ∞. In Section VII-B, we specialize our results to the noise-free bit-pipe channel model and use the notion of delivery rate to compare our results with the state-of-the-art.
It is not difficult to see the following correspondence between the two definitions:

III. PRELIMINARIES: CAPACITIES WITHOUT CACHE MEMORIES
In this section, we recall known results for our network in the special case when there are no cache memories:
M_1 = . . . = M_K = 0. (12)
These results will be utilized in the subsequent sections to state our results for cache-aided broadcast networks. When (12) holds, it is known from [38,39] that it is optimal to have the stronger receivers also decode the messages intended for the weaker receivers. This is done by so-called superposition coding. Here, the worst-case probability of error in (8) is attained for a demand vector d that has all different entries, and the capacity-memory tradeoff C(M_1 = 0, . . . , M_K = 0) is the largest symmetric rate R with which K independent messages can be sent reliably to the K receivers. We thus have C(0, . . . , 0) = C_K, where C_K is found from the capacity region of degraded broadcast channels [38,39]:
C_K = max min_{k ∈ K} I(U_k; Y_k | U_{k−1}), (14)
with U_0 := constant and U_K := X. The maximization in (14) is over all auxiliary random tuples U_1, . . . , U_{K−1}, X, Y_1, . . . , Y_K that satisfy the following Markov chain:
U_1 → U_2 → · · · → U_{K−1} → X → (Y_1, . . . , Y_K) (15a)
and the channel transition law:
(Y_1, . . . , Y_K) ∼ Γ(· , . . . , · | X). (15b)
Denote the alphabet sets of the auxiliary random variables U_1, . . . , U_{K−1} by U_1, . . . , U_{K−1}. Using the Fenchel-Eggleston-Carathéodory theorem [40, Appendix A], one can without loss of generality restrict the cardinalities of these sets.
To present the results in this paper, we will also need the no-cache capacity region of the BC to a subset of the receivers S = {j_1, . . . , j_{|S|}} ⊆ K, where j_1 < j_2 < · · · < j_{|S|}. The no-cache capacity region 𝒞_S is naturally given by the set of all nonnegative rate-tuples (R_1, . . . , R_{|S|}) for which there exist random variables U_1, . . . , U_{|S|−1}, X, Y_{j_1}, . . . , Y_{j_{|S|}} that satisfy (15b) and form the Markov chain
U_1 → U_2 → · · · → U_{|S|−1} → X → (Y_{j_1}, . . . , Y_{j_{|S|}}) (19)
such that the following conditions hold:
R_i ≤ I(U_i; Y_{j_i} | U_{i−1}), i = 1, . . . , |S|, (20)
with U_0 := constant and U_{|S|} := X. We denote by C_S the largest symmetric rate R ≥ 0 in 𝒞_S:
C_S := max{R ≥ 0: (R, . . . , R) ∈ 𝒞_S}. (21)
Notice that C_{{k}} is simply the point-to-point capacity to receiver k, and we will abbreviate it as C_k. By (20) and (21),
C_S = max min_{i ∈ {1,...,|S|}} I(U_i; Y_{j_i} | U_{i−1}), (22)
where the maximization is over all random tuples U_1, . . . , U_{|S|−1}, X, Y_{j_1}, . . . , Y_{j_{|S|}} that satisfy (15b) and (19).
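The symmetric capacity in (14) can be evaluated numerically for simple channels. The sketch below does this for a toy 2-user binary symmetric BC, assuming (as is standard for BSC BCs) a uniform input and a binary symmetric auxiliary U with crossover alpha from U to X; all variable names are ours.

```python
import numpy as np

def h2(p):
    """Binary entropy in bits (inputs clipped away from 0 and 1)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def conv(a, b):
    """Binary convolution a * b = a(1-b) + (1-a)b."""
    return a * (1 - b) + (1 - a) * b

# Toy 2-user degraded BSC broadcast channel: receiver 1 (weak) sees
# BSC(p1), receiver 2 (strong) sees BSC(p2), with p1 > p2.  Superposition
# coding with a symmetric auxiliary U -- X gives
#   R1 <= I(U; Y1)     = 1 - h2(alpha * p1),
#   R2 <= I(X; Y2 | U) = h2(alpha * p2) - h2(p2),
# and the symmetric rate C_2 maximizes the minimum of the two over alpha.
p1, p2 = 0.2, 0.05
alphas = np.linspace(0.0, 0.5, 20001)
R1 = 1 - h2(conv(alphas, p1))
R2 = h2(conv(alphas, p2)) - h2(p2)
C2 = np.max(np.minimum(R1, R2))
print(f"symmetric rate C_2 ~ {C2:.4f} bits/use")
```

At alpha = 0 the weak receiver gets its full point-to-point capacity but the strong receiver gets nothing, and vice versa at alpha = 0.5; the symmetric rate sits at the crossing of the two curves.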

IV. CODING SCHEMES AND LOWER BOUNDS ON THE CAPACITY-MEMORY TRADEOFF
We present three lower bounds on the capacity-memory tradeoff along with coding schemes that achieve them. The first coding scheme and lower bound apply to general cache assignments M_1, . . . , M_K. The second and third ones apply only to specific cache assignments. Nevertheless, the proposed schemes are useful for a broad set of cache assignments by time- (and memory-) sharing different coding schemes, or equivalently, taking convex combinations of different lower bounds.

A. The Local Caching Gain
The simplest way to use receiver cache memories is to store the same information at each and every receiver. This allows the receivers to retrieve this information locally, without transmission over the BC. Global caching gains are not possible under this caching strategy.
Applying the described simple caching strategy to only a part of the cache memory that is of size ∆ ≥ 0, while allowing a smart use of the remaining memory, leads to the following proposition; see also [41, Proposition 1].
Proposition 1 (Local caching gain): For all ∆ > 0 and M_1, . . . , M_K ≥ 0:
C(M_1 + ∆, . . . , M_K + ∆) ≥ C(M_1, . . . , M_K) + ∆/N.
As a consequence, for all ∆_total > 0 and M ≥ 0:
C(M + ∆_total) ≥ C(M) + ∆_total/(K·N).
We will see that this lower bound is tight in certain regimes of operation, namely, when the cache budget is larger than a threshold.
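The arithmetic behind the local caching gain can be checked in one line: if every receiver stores the same ∆/N-fraction of each of the N files, the delivery burden per requested file drops by exactly ∆/N. A minimal illustration with toy numbers (unit-rate files; this mirrors the bookkeeping, not the paper's proof):

```python
# Toy check of the local caching gain: each of the N unit-rate files is
# partially cached, identically, at every receiver.
N = 6          # number of files
R = 1.0        # rate (size) of each file
Delta = 1.8    # extra cache size given to *every* receiver

cached_per_file = Delta / N          # portion of each file every receiver stores
remaining = R - cached_per_file      # delivery burden per requested file
rate_gain = R - remaining            # reduction in required delivery rate

assert abs(rate_gain - Delta / N) < 1e-12
print(f"per-file delivery reduced by Delta/N = {rate_gain:.3f}")
```

Splitting a total budget ∆_total uniformly gives ∆ = ∆_total/K per receiver, which is where the weaker ∆_total/(K·N) slope of the global bound comes from.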

B. Superposition Piggyback-Coding
We generalize the piggyback coding scheme of [22,35], which was specific to erasure BCs, to general degraded BCs by introducing superposition coding. The scheme assigns all the available cache memory to the weakest receiver (receiver 1), and uses a layered superposition code for delivery, see Fig. 3, where each dot represents a codeword. In this superposition code:
• the lowest layer encodes the part of message W_{d_1} (intended for receiver 1) that is not stored in receiver 1's cache memory and the parts of messages W_{d_2}, . . . , W_{d_K} that are stored at receiver 1;
• the k-th lowest layer, for k ∈ {2, . . . , K}, encodes the part of the message W_{d_k} (intended for receiver k) that is not stored in the cache memory of receiver 1.
In particular, receiver 1 (the weakest user) only decodes the lowest layer. This layer encodes a part of message W_{d_1} that is desired at receiver 1 together with parts of messages W_{d_2}, . . . , W_{d_K} that are not desired at receiver 1 but are locally available in its cache memory. As we will see, the cache content allows receiver 1 to achieve the same decoding performance as if the additional messages to the other receivers were not encoded in the lowest layer. In other words, we can encode information desired at the stronger receivers 2, . . . , K in the lowest superposition layer without affecting the decoding performance at the weakest receiver.
1) Lower Bound on the Capacity-Memory Tradeoff: Let (U_1, . . . , U_{K−1}, X) be a random K-tuple that achieves the symmetric capacity C_K, i.e., that solves the optimization problem in (14), and define the cache size M_1^single as in (28).
Theorem 2: Under the cache assignment
M_1 = M_1^single, (26a)
M_k = 0, k ∈ {2, . . . , K}, (26b)
the capacity-memory tradeoff satisfies the lower bound in (27).
Remark 2: Since receivers can always choose to ignore their cache memories, and because the superposition piggyback coding scheme can be time- and memory-shared with a no-caching scheme, Theorem 2 remains valid for all cache sizes M_1 ∈ [0, M_1^single]. We will see in Corollary 6 ahead that (27) then holds with equality. The RHS of (27) coincides with the capacity-memory tradeoff of a scenario where each and every receiver has access to receiver 1's cache memory. Superposition piggyback coding can thus be viewed as a coding technique that virtually provides all stronger receivers access to the weakest receiver's cache memory.
2) Coding Scheme: Let (U_1, . . . , U_{K−1}, X) be a solution to the optimization problem in (14) for which the inequality in (30) is strict. (If no such choice exists, Theorem 2 reduces to C(M_1, . . . , M_K) ≥ C_K and is trivial.) Let ε > 0 be arbitrarily small, and define the rates R^(A) and R^(B) as in (31). The RHS of (31b) is positive by (30). Split each message W_d, d ∈ {1, . . . , N}, into two parts, W_d = (W_d^(A), W_d^(B)), of rates R^(A) and R^(B), so that the total message rate is R = R^(A) + R^(B).
Placement Phase: The transmitter stores the message parts W_1^(B), . . . , W_N^(B) in the cache memory of receiver 1. This is possible by (26a) and the definition of M_1^single in (28).
Delivery Phase: For the transmission in the delivery phase, construct a K-level superposition code C with a cloud center of rate R^(A) + (K−1)R^(B) and satellites of rate R^(A) in levels 2, . . . , K. For the code construction, use a probability distribution induced by the chosen tuple (U_1, . . . , U_{K−1}, X). It will be convenient to arrange the codewords in the cloud center in an array with 2^{nR^(A)} columns and (2^{nR^(B)})^{K−1} rows. The columns are used to encode message W_{d_1}^(A) and the rows to encode the message tuple (W_{d_2}^(B), . . . , W_{d_K}^(B)). The k-th level satellite is used to encode message W_{d_k}^(A). See Figure 3 for an illustration of the code construction.
The transmitter then sends the codeword of C corresponding to these message parts.
Decoding: Receiver k ∈ {2, . . . , K} decodes all messages in levels 1, . . . , k. Recall that its desired message parts W_{d_k}^(A) and W_{d_k}^(B) are encoded in level k and in level 1 (i.e., the cloud center), respectively.
Receiver 1 only has to decode the cloud center, from which it obtains W_{d_1}^(A). To this end, it performs the following steps:
1) It retrieves the message tuple W^(B) := (W_{d_2}^(B), . . . , W_{d_K}^(B)) from its cache memory.
2) It forms the subcodebook C(W^(B)) ⊆ C that contains all level-1 codewords that are "compatible" with the retrieved tuple W^(B), i.e., the row of the array that corresponds to W^(B). Figure 3 illustrates such a subcodebook in red.
3) It decodes the cloud-center codeword within the subcodebook C(W^(B)), which is of rate R^(A), and thereby recovers W_{d_1}^(A).
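The bookkeeping behind the subcodebook C(W^(B)) is plain array indexing: the cached tuple selects one row of the cloud-center array, and only the 2^{nR^(A)} codewords in that row remain to be distinguished. A toy index sketch (sizes and names ours; actual codewords are random sequences):

```python
# Illustrative index bookkeeping for the cloud center of the superposition
# piggyback code.
K = 3                          # users; receiver 1 is the weakest
nRA, nRB = 3, 2                # bits for the W^(A) and W^(B) message parts
cols = 2 ** nRA                # columns encode W^(A)_{d1}
rows = (2 ** nRB) ** (K - 1)   # rows encode (W^(B)_{d2}, ..., W^(B)_{dK})

def row_index(WB: tuple) -> int:
    """Map the cached tuple W^(B) to its row of the cloud-center array."""
    idx = 0
    for w in WB:
        idx = idx * (2 ** nRB) + w
    return idx

# Receiver 1 retrieves W^(B) from its cache and keeps only the matching
# row: the subcodebook C(W^(B)) has `cols` codewords instead of rows*cols,
# i.e. rate R^(A) instead of R^(A) + (K-1) R^(B).
WB = (1, 3)                    # cached parts for receivers 2 and 3
subcodebook = [(row_index(WB), c) for c in range(cols)]

assert len(subcodebook) == cols == 8
assert rows * cols == 128      # full cloud-center size
```

The rate reduction from rows*cols down to cols is exactly what makes the cloud center decodable at the weak receiver despite the piggybacked information.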

C. Generalized Coded-Caching
We generalize the coded-caching scheme of [1] to degraded BCs and to unequal cache sizes. In [1], the authors have proposed a scheme for error-free channels, parametrized by an integer t ∈ [1 : K−1], that can simultaneously communicate to groups of t + 1 users and hence offers global caching gains. In (noisy) broadcast channels, users have different channel statistics. The main idea in this section is to assign larger cache memories to weaker receivers to compensate for their worse channel conditions, and to send generalized XOR-messages to groups of t + 1 receivers at a time. Our cache assignment and delivery scheme are designed such that, in the transmission to any group of t + 1 receivers, each of the involved receivers is served at a rate close to its capacity. This is possible because each receiver has stored all other transmitted messages in its cache memory, and can exploit this knowledge in the decoding. Notice that if separate cache-channel coding were applied, the rate to each receiver would be limited by the capacity of the worst channel.
1) Lower Bound on Capacity-Memory Tradeoff: We will need the following definitions. For each t ∈ K, let G_t denote the collection of all unordered size-t subsets of K, and for each S ∈ G_t define its complement S^c := K \ S. For any given distribution P_X and t = 1, . . . , K−1, define the cache sizes M_1^(t), . . . , M_K^(t) and the rate R^(t) as in (39) and (40). Note that when t = K−1, the denominators of (39) and (40) are equal to 1. Observe moreover that for any given P_X,
M_1^(t) ≥ M_2^(t) ≥ · · · ≥ M_K^(t),
so the weaker a receiver is, the larger the cache memory it is assigned. The choices of M^(t) in (39) and R^(t) in (40) become clear in the description of the coding scheme. In particular, these choices ensure decodability of the (sub-)messages in the different phases of the coding scheme.
Theorem 3: Fix a t ∈ {1, . . . , K−1} and an input distribution P_X, and consider the corresponding cache assignment in (39). Then,
C(M_1, . . . , M_K) ≥ R^(t), (44)
where M_1, . . . , M_K and R^(t) are calculated from P_X as described in (39) and (40).
As we will see in Proposition 7, the inequality in (44) holds with equality for t = K − 1.
We first explain the scheme for the special case of two users.
2) Coding Scheme in the Special Case K = 2 and t = 1: Fix an input distribution P X and a small ε > 0, and define the rates Notice that by the degradedness of the BC: Fix a blocklength n and generate a random codebook C by choosing all entries i.i.d. according to P X . The codebook C is revealed to all terminals of the network. Allocate cache memories to receivers 1 and 2 according to (49). Split each message W d , for d ∈ {1, . . . , N}, into two parts: which are of rates R (A) and R (B) , respectively. In the caching phase, the transmitter stores the parts W (B) 1 , . . . , W (B) N in receiver 1's cache memory and the parts W (A) 1 , . . . , W (A) N in receiver 2's cache memory. This is possible given the cache assignment in (49).
In the delivery phase, the transmitter uses codebook C to send the XOR message in (50) to both receivers using the codeword Note that the subcodebook C (W (B) d2 ) is of rate R (A) , which is smaller than the rate R (B) of the original codebook C.
To estimate W (A) d1 , receiver 1 decodes the XOR message in (50) using an optimal decoding rule for the subcodebook C (W (B) d2 ), and XORs the decoded message with W (B) d2 , which it has stored in its cache memory. The remaining part of its desired message, W (B) d1 , is retrieved from its cache memory. With this scheme, both receivers correctly recover their desired messages W d1 and W d2 whenever they successfully decode the XOR message in (50). Since the rate R (B) of the original codebook C satisfies and the rate R (A) of the subcodebook C (W (B) d2 ) satisfies the probability of decoding error at both receivers tends to 0 as the blocklength n tends to infinity. Letting ε → 0, we conclude that for K = 2 the rate-memory triple with R = I(X; Y 1 ) + I(X; Y 2 ) is achievable. Notice that the weaker receiver 1 is assigned a larger cache memory than the stronger receiver 2: The described scheme can also be applied with a uniform cache assignment M 1 = M 2 = N·R (A) , however at the cost of a decreased achievable rate R = 2·I(X; Y 1 ). In fact, assigning a larger cache memory M 1 to receiver 1 allows the transmitter to convey more information to receiver 2 during the communication to receiver 1.
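The bookkeeping of this two-user scheme is easy to mimic in code. The following toy sketch is our own illustration, not part of the scheme's analysis: it replaces channel coding by error-free delivery of the XOR message, and the byte lengths LEN_A ≤ LEN_B merely play the roles of the rates R (A) ≤ R (B).

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    # XOR two byte strings after zero-padding the shorter one.
    n = max(len(a), len(b))
    a = a.ljust(n, b"\x00"); b = b.ljust(n, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))

N = 4                 # number of files
LEN_A, LEN_B = 3, 5   # part sizes, standing in for R^(A) <= R^(B)
files = {d: (secrets.token_bytes(LEN_A), secrets.token_bytes(LEN_B))
         for d in range(N)}

# Placement: receiver 1 (weaker) caches every B-part, receiver 2 every A-part.
cache1 = {d: files[d][1] for d in range(N)}
cache2 = {d: files[d][0] for d in range(N)}

d1, d2 = 0, 2         # demands of receivers 1 and 2
# Delivery: a single XOR message replaces two separate transmissions.
xor_msg = xor(files[d1][0], files[d2][1])

# Receiver 1: cancel the cached B-part of d2 to get the A-part of d1.
rec1_A = xor(xor_msg, cache1[d2])[:LEN_A]
w1 = (rec1_A, cache1[d1])     # the B-part comes straight from the cache
# Receiver 2: cancel the cached A-part of d1 to get the B-part of d2.
rec2_B = xor(xor_msg, cache2[d1])[:LEN_B]
w2 = (cache2[d2], rec2_B)

assert w1 == files[d1] and w2 == files[d2]
print("both receivers recovered their files")
```

A single transmission of length max(LEN_A, LEN_B) thus serves both receivers, mirroring how the XOR message of rate R (B) serves both users in the scheme.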
3) General Coding Scheme: Fix a positive integer t ∈ {1, . . . , K − 1}. This parameter indicates the number of receivers that cache each part of a file. As in [1], the caching scheme ensures that t + 1 receivers can be served simultaneously in each transmission during the delivery phase. Pick a small number ε > 0 and an input distribution P X . Consider the cache assignment in (39), where mutual informations are calculated with respect to P X .
Split each message W d into (K choose t) independent submessages: where each submessage W d,G (t) is of rate The total message rate is thus Placement Phase: For each d ∈ {1, . . . , N}, store the tuple in the cache memory of receiver k ∈ K. This is possible by (55) and the cache assignment in (39). Delivery Phase: Transmission in the delivery phase takes place in (K choose t+1) subphases. Subphase j ∈ {1, . . . , (K choose t+1)} is of length and transmits messages to the intended receivers in G (t+1) j . For this purpose, the transmitter creates the generalized XOR message which is of rate and generates a codebook C j by drawing all entries i.i.d. according to P X . The transmitter then sends the codeword over the channel. We now describe the decoding. Each receiver k ∈ K can retrieve messages directly from its cache, see (57), and thus only needs to decode the messages For each j ∈ {1, . . . , (K choose t+1)} and k ∈ G (t+1) j , receiver k decodes the message W dk,G (t+1) j \{k} from its subphase-j outputs. Specifically, with the messages stored in its cache memory, it forms the XOR message and extracts a subcodebook C j,k (W XOR,j,k ) from C j that consists of all codewords compatible with W XOR,j,k : It then decodes the XOR message W XOR,G (t+1) j by applying an optimal decoding rule for subcodebook C j,k (W XOR,j,k ) to the subphase-j outputs Y nj k,j , and XORs the resulting message Ŵ XOR,G (t+1) j with W XOR,j,k to obtain After the last subphase, each receiver k ∈ K has decoded all its missing messages in (64), and can thus recover W dk .
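The placement and delivery combinatorics above can be sketched as follows. This is a toy illustration under simplifying assumptions (equal submessage sizes and error-free delivery of each generalized XOR message); in the actual scheme the subphase lengths and rates are matched to the channel via (39) and (40).

```python
from itertools import combinations
import secrets

K, t, N = 4, 2, 6   # users, scheme parameter, files
PART = 4            # bytes per submessage (equal sizes, for simplicity)

def xor(parts):
    out = bytes(PART)
    for p in parts:
        out = bytes(x ^ y for x, y in zip(out, p))
    return out

subsets_t = list(combinations(range(K), t))
# Each file is split into (K choose t) submessages, one per size-t subset.
files = {d: {G: secrets.token_bytes(PART) for G in subsets_t}
         for d in range(N)}
# Placement: receiver k caches, for every file, the submessages whose
# index set contains k.
cache = {k: {(d, G): files[d][G] for d in range(N)
             for G in subsets_t if k in G}
         for k in range(K)}

demand = [0, 1, 2, 3]   # d_k: file requested by receiver k
# Delivery: one generalized XOR message per size-(t+1) subset J.
for J in combinations(range(K), t + 1):
    msg = xor([files[demand[k]][tuple(sorted(set(J) - {k}))] for k in J])
    for k in J:
        # Receiver k cancels the other t terms using its cache ...
        others = xor([cache[k][(demand[j], tuple(sorted(set(J) - {j})))]
                      for j in J if j != k])
        recovered = bytes(x ^ y for x, y in zip(msg, others))
        # ... and obtains its missing submessage W_{d_k, J \ {k}}.
        assert recovered == files[demand[k]][tuple(sorted(set(J) - {k}))]
print("all", len(list(combinations(range(K), t + 1))),
      "subphases decoded correctly")
```

Each receiver in a subset J lacks exactly one term of the XOR (the one indexed by J without itself) and holds all others in its cache, which is why one transmission serves all t + 1 of them.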
The probability that receiver k ∈ G (t+1) j finds an incorrect value for the XOR message W XOR,G (t+1) j tends to 0 as n (and thus n j ) → ∞, because the rate of the subcodebook C j,k satisfies the required limit; see (55) and (58). Letting ε → 0 then establishes Theorem 3.

V. UPPER BOUNDS AND EXACT RESULTS ON THE CAPACITY-MEMORY TRADEOFF
We present a general upper bound on the capacity-memory tradeoff. We further show that the upper bound matches the lower bounds derived in the previous section in certain regimes of cache sizes. In these regimes, we can thus characterize the exact capacity-memory tradeoff.

A. Upper Bounds
Our upper bound is formulated in terms of the following parameters. Consider any set S ⊆ K and represent it by S = {j 1 , . . . , j |S| } as in (18). Define and for k ∈ {2, . . . , |S|}: Theorem 4: There exist random variables X, Y 1 , . . . , Y K and for every receiver set S as in (18) random variables {U S,1 , . . . , U S,|S|−1 } so that the channel law (15b) and the following Markov chain hold: and so that for each S we have The upper bound in Theorem 4 is asymmetric in the different cache sizes M 1 , M 2 , . . . , M K , because the parameters α S,ji are not symmetric. In fact, increasing the cache memories at weaker receivers generally increases the upper bound more than increasing the cache memories at stronger receivers.
The upper bound in Theorem 4 is weakened if the constraints in (69) are ignored for certain receiver sets S, or if in these constraints the input/output random variables X, Y j1 , . . . , Y j |S| are allowed to depend on the receiver set S. Using the latter relaxation, Theorem 4 results in the following corollary.
Corollary 5: Given cache sizes M 1 , . . . , M K ≥ 0, a rate R ≥ 0 is achievable only if for every receiver set S ⊆ K: where C S denotes the no-cache capacity region of the BC to the receivers in S (i.e., assuming no cache memories at the receivers and ignoring all receivers in K\S).
Proof: If R is achievable, then Theorem 4 ensures that for every set S = {j 1 , . . . , j |S| } there exists a set of random variables (U 1 , U 2 , . . . , U |S|−1 , X, Y j1 , . . . , Y j |S| ) satisfying (19)–(20), and hence (70) holds for every set S. Note that the reverse direction does not necessarily hold: even if (70) holds for every S, it is not clear whether the conditions of Theorem 4 are satisfied, because the random variable X found from (70) (for each S) may implicitly depend on S, which is not permissible under condition (68) in Theorem 4.

Remark 3:
The upper bounds of Theorem 4 and Corollary 5 are relaxed when each α S,k is replaced by α̃ S,k , defined in (71): The same holds if each α S,k is replaced by ᾱ S,k , defined in (72). In particular, Corollary 5 recovers the previous upper bound in [

B. Exact Results
By comparing the new upper bounds with the three lower bounds presented in Section IV, the exact expression for C(M 1 , . . . , M K ) can be obtained in some special cases. For example, as the following corollary states, the lower bound achieved by superposition piggyback coding matches the upper bound when only receiver 1 has a cache memory and this cache memory is small.
the capacity-memory tradeoff is characterized exactly. Proof: Achievability follows by Theorem 2 (see also Remark 2), and the converse follows by Corollary 5, where it suffices to consider only the set S = K. In fact, under (73), α K,1 = . . . = α K,K = M 1 /N. The next proposition states that the lower bound attained by generalized coded caching with parameter t = K − 1 matches the upper bound under the corresponding cache assignment in (41). Moreover, any extra cache memory that is uniformly distributed over the K receivers only brings a local caching gain.
Proposition 7: For each k ∈ K, let M (K−1) k be given by (41) when P X is chosen as a maximizer of (74). For any ∆ ≥ 0, we have Proof: See Appendix E.

VI. BOUNDS ON THE GLOBAL CAPACITY-MEMORY TRADEOFF C (M)
The two preceding sections presented lower and upper bounds on the capacity-memory tradeoff for given cache assignments. In this section, we assume that a system designer is given a total cache budget M ≥ 0 that it can distribute arbitrarily across the receivers. We are thus interested in the largest capacity-memory tradeoff optimized over the cache assignments M 1 , . . . , M K subject to a total cache budget M 1 + M 2 + . . . + M K ≤ M. We introduced this quantity as the global capacity-memory tradeoff C (M) in (11). This section presents lower and upper bounds on C (M) for any value of M, as well as exact characterizations of C (M) when M is below one threshold or above another.

A. Lower Bound
Proposition 1 and Theorems 2 and 3 readily yield a lower bound on C (M); see (77). As we will see in Corollary 11 ahead, this lower bound holds with equality when the total cache size M is small or large. Let and where C K is defined in (13) and M single is defined in (25). Also, for any given P X , recall M (t) and R (t) from (39) and (40), and define for t ∈ {1, . . . , K − 1}: (76c) We have proved the achievability of each such pair by proposing a corresponding scheme. Consider two schemes achieving the memory-rate pairs (M (t) , R (t) ) and (M (t′) , R (t′) ). By time-sharing between the two schemes, we can achieve all memory-rate pairs that lie on the line segment connecting (M (t) , R (t) ) and (M (t′) , R (t′) ). The upper convex envelope of all these rate-memory pairs thus lower bounds C (M). This is formalized in Corollary 8 below. Corollary 8: The global capacity-memory tradeoff is lower bounded by: Notice that for any P X : and
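The time-sharing argument amounts to taking the upper convex envelope of finitely many memory-rate points. A minimal sketch of this computation follows; the numerical points are hypothetical placeholders, not values computed from a particular BC.

```python
# Upper convex envelope of achievable (memory, rate) points via time sharing.
def upper_convex_envelope(points):
    """Return the vertices of the upper convex envelope of (M, R) points."""
    pts = sorted(points)
    hull = []
    for p in pts:
        # Pop the last vertex while it lies on or below the chord from
        # the second-to-last vertex to the new point p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) >= (p[0] - x1) * (y2 - y1):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Hypothetical achievable points (M^(t), R^(t)) with decreasing slopes.
points = [(0.0, 0.25), (2.0, 0.40), (5.0, 0.55), (9.0, 0.65)]
env = upper_convex_envelope(points)
print(env)
```

Any point strictly below a chord between two achievable points is dominated by time sharing and therefore dropped from the envelope.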

B. Upper Bounds
Theorem 4 directly yields the following upper bound on the global capacity-memory tradeoff.
Corollary 9: There exist random variables X, Y 1 , . . . , Y K and, for every receiver set S as in (18), random variables U S,1 , . . . , U S,|S|−1 such that where {α S,k } are defined in (67). Evaluating this bound numerically is cumbersome because for each possible subset S the coefficients α S,1 , . . . , α S,|S| have to be computed, and then the optimal choice of U S,1 , . . . , U S,|S|−1 , X needs to be found to evaluate the upper bound in (80). To find upper bounds with simpler closed-form expressions, we loosen the bound by relaxing some of the constraints in (80); by replacing each parameter α S,k in (80) by α̃ S,k or by ᾱ S,k (defined in (71) and (72)); and/or by allowing X, Y j1 , . . . , Y j |S| in (80) to depend on the set S. The following corollary presents such a simpler upper bound. Recall the definitions in (38). Corollary 10: For each t ∈ K: Proof: Fix t ∈ K. For each ℓ = 1, . . . , (K choose t), specialize Corollary 5 to S = G (t) ℓ and relax it by replacing each of its parameters as described above; we obtain Now, averaging (82) over all indices ℓ = 1, . . . , (K choose t) and upper-bounding the sum M 1 + . . . + M K by M yields the desired result.

C. Exact Results
The lower and upper bounds on C (M) presented above match for small and large total cache sizes M. Corollary 11 below states this formally. Recall M single from (25) and define M L as follows: where C avg is defined in (74). Corollary 11: For any positive total cache size M ≤ M single : and for any M ≥ M L : Before presenting the proof of this corollary, let us briefly discuss its implications. For small cache sizes, the entire cache memory should be assigned to the weakest receiver, and the superposition piggyback coding scheme of Section IV-B is optimal. For large total cache sizes M, a careful assignment of the available cache memory is needed. In particular, for M = M L , the generalized coded caching of Section IV-C (with its corresponding cache assignment) is optimal. Remark 4: For small total cache sizes, C (M) grows as M/N. This corresponds to a perfect global caching gain, as if each receiver could access all cache contents in the network locally. For large total cache sizes, the global benefit of the receivers' cache memories is fully exploited. Any additional cache budget exceeding M L should be distributed uniformly among the receivers, and it only offers a local caching gain.
In particular, C (M) then grows as (1/K)·(M/N). This is similar to the insights from [17] (for rate-limited links). For moderate cache sizes, C (M) grows with M/N at a slope equal to: Proof of Corollary 11: The global capacity-memory tradeoff C (M) is upper bounded by the right-hand side of (84); this follows by specializing Corollary 10 to t = K. Equality in (84) for M ≤ M single follows by Theorem 2.
The global capacity-memory tradeoff C (M) is also upper bounded by the right-hand side of (85). To see this, relax Corollary 9 by (i) replacing each parameter α S,k with ᾱ S,k and (ii) considering only the constraints in (80) that correspond to the sets S = {k}, for k ∈ K. Next, average the K resulting inequalities and maximize over the input distribution P X . This yields: Using the definition of M L in (83), where in the first step we noted that each mutual information term appears (K − 1) times in the sum on the LHS of (87).

A. Erasure BCs
We specialize our results to erasure BCs where, at time i, receiver k's output Y k,i equals the channel input X i with probability 1 − δ k and equals an erasure symbol "?" with probability δ k . Without loss of generality, we assume that the erasure probabilities satisfy: For erasure BCs, Moreover, a Bernoulli-1/2 input distribution P X maximizes I(X; Y k ) and I(X; Y k |U ) simultaneously for all k ∈ K and all auxiliaries U that form the Markov chain U − X − Y k . Therefore, Theorem 4 and Corollary 5 coincide. One observes that a smart allocation of the total cache memory M significantly increases the global capacity-memory tradeoff. Analytically, we can prove that for a small total cache size M ≤ M single , any cache assignment that does not allocate all the cache memory to the weakest receiver is suboptimal on the erasure BC. This follows from the achievability in Corollary 11 and Proposition 12 below.
The RHS of (92) is strictly less than C K + M/N unless M = M 1 or δ 1 = . . . = δ K .
Proof: See Appendix F.
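As a numerical illustration of the small-memory regime on the erasure BC, the following sketch evaluates C (M) = C K + M/N using two standard facts: I(X; Y k ) = 1 − δ k under a Bernoulli-1/2 input, and the equal-rate no-cache capacity to a receiver set S equals the harmonic-type expression 1/Σ k∈S 1/(1 − δ k ). The erasure probabilities and file count below are hypothetical example values.

```python
# Erasure BC sketch: equal-rate no-cache capacity and the small-memory
# global capacity-memory tradeoff C*(M) = C_K + M/N (Corollary 11).
deltas = [0.8, 0.5, 0.2]   # erasure probabilities; receiver 1 is weakest
N = 100                    # number of files

def C(S):
    """Equal-rate no-cache capacity of the erasure BC to receivers in S."""
    return 1.0 / sum(1.0 / (1.0 - deltas[k]) for k in S)

C_K = C(range(len(deltas)))
# Small-memory regime: all cache memory at receiver 1 gives a perfect
# global caching gain, so C*(M) grows with slope 1/N.
for M in (0.0, 2.0, 5.0):
    print(f"M = {M:4.1f}:  C*(M) = {C_K + M / N:.4f}")
```

The printed values show the slope-1/N growth that Remark 4 identifies as a perfect global caching gain.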

B. Noise-Free Bit-Pipe
Consider now the noise-free bit-pipe model of [1] with uniform cache assignment. It corresponds to an erasure BC where each receiver has zero erasure probability: δ 1 = . . . = δ K = 0. (93) Consider also the notation introduced in Section II-A. We adopt the system model of [1] with equal cache sizes m 1 = · · · = m K = m and delivery rate ρ.
From the upper bound on C(M 1 , . . . , M K ) in Theorem 4, the following lower bound on the minimum achievable delivery rate ρ can be obtained as a function of the normalized symmetric cache size m: Corollary 13: For the noise-free bit-pipe model in [1], the delivery rate is bounded from below as follows.
Proof: See Appendix G. This lower bound improves on the lower bounds in [1], [10]–[12] that existed at the time this manuscript was submitted. It is within a constant gap of 2.35 of the optimal rate-memory tradeoff [37]. A slightly improved bound has been established in the parallel work [36]. This latter bound, however, is specific to the noise-free bit-pipe model.

C. Gaussian BCs
Finally, we specialize our results to memoryless Gaussian BCs. At time t, the received symbol at receiver k is where X t is the input to the channel and {Z k,t } is an i.i.d. Gaussian process with zero mean and variance σ 2 k > 0. The channel inputs are subject to an average block-power constraint P . Without loss of generality, the receivers are ordered in increasing strength. By [39], for every set S as defined in (18), we have where β 1 , . . . , β |S| form the unique choice of |S| real numbers in [0, 1] that sum to 1 and satisfy In particular, Moreover, given a power constraint P > 0, a zero-mean variance-P Gaussian input distribution maximizes I(X; Y k ) and I(X; Y k |U ) simultaneously for all k ∈ K and all Gaussian auxiliaries U that form the Markov chain U − X − Y k . Therefore, Theorem 4 and Corollary 5 coincide. Also, (99) Figure 5 shows the upper and lower bounds on C (M) in Corollaries 8 and 9. The five blue points indicate the rate-memory points (R (0) , M (0) ), (R single , M single ), (R (1) , M (1) ), (R (2) , M (2) ), and (R (3) , M (3) ) for a zero-mean variance-P Gaussian distribution P X . For comparison, the figure also shows the upper bound in Theorem 4 for a setup with uniform cache assignment M/K across all receivers. We observe that a smart cache assignment provides substantial gains in the capacity-memory tradeoff.
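The equal-rate quantity appearing in (96)–(98) can be evaluated numerically by bisecting over the common rate and computing the superposition power split from the strongest receiver down. The sketch below uses the standard degraded Gaussian BC superposition formula (not code from this paper) and hypothetical noise variances; rates are in bits per channel use.

```python
# Equal-rate capacity of a degraded Gaussian BC via superposition coding.
def equal_rate_capacity(P, sigma2, tol=1e-9):
    """Largest common rate R (bits/use) servable to all receivers."""
    def power_needed(R):
        # Allocate layer powers from the strongest receiver (smallest
        # noise) down; each layer interferes with all weaker receivers.
        g = 2.0 ** (2.0 * R) - 1.0
        interference, total = 0.0, 0.0
        for s2 in sorted(sigma2):
            beta_P = g * (s2 + interference)  # power for this layer
            interference += beta_P
            total += beta_P
        return total
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if power_needed(mid) <= P else (lo, mid)
    return lo

P = 10.0
sigma2 = [4.0, 1.0, 0.25]   # hypothetical noise variances; receiver 1 weakest
print(f"C_K = {equal_rate_capacity(P, sigma2):.4f} bits/use")
```

The feasibility check is monotone in R, so bisection converges to the unique power split of (96)–(98) in which all receivers attain the same rate.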

VIII. SUMMARY AND CONCLUSION
We have provided close upper and lower bounds on the global capacity-memory tradeoff C (M) of degraded BCs. The bounds coincide in the regimes of small and large total cache memories with thresholds depending on the BC statistics.
For small cache memory sizes (characterized in (84)), the global capacity-memory tradeoff is achieved by assigning all the available cache memory to the weakest receiver. In this regime, C (M) grows as M/N, which corresponds to a perfect global caching gain; i.e., all receivers benefit from all the cache contents in the network. This performance is achieved by superposition piggyback coding, which provides every receiver virtual access to the weakest receiver's cache content.
For the regime of moderate M, we proposed a generalized coded caching scheme that uses a particular cache assignment in which the weaker receivers are provided with larger cache sizes. It then simultaneously serves t + 1 receivers in each delivery transmission, where t ∈ {1, . . . , K−1} is the parameter of the scheme (similar to the scheme in [1]). We observed that the larger the total cache budget M, the larger the coded-caching parameter t needs to be chosen. Hence, as M increases, the cache memories have to store more overlapping contents, and the caching gain decreases. In other words, the slope of the rate-memory tradeoff (achieved by generalized coded caching) decreases as the total cache budget M increases. The same behavior is also observed from the upper bound.
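Recall from Section IV-C that the generalized scheme splits each file into (K choose t) submessages and uses (K choose t+1) delivery subphases. The following snippet illustrates how quickly these counts grow with K; the mid-range choice t = K/2 is our own illustrative choice, not one prescribed by the scheme.

```python
from math import comb

# Subpacketization of the generalized coded-caching scheme:
# (K choose t) submessages per file, (K choose t+1) delivery subphases.
for K in (4, 8, 16, 32):
    t = K // 2   # a mid-range choice of the scheme parameter t
    print(f"K={K:2d}, t={t:2d}: {comb(K, t):>10} submessages, "
          f"{comb(K, t + 1):>10} subphases")
```

The exponential growth of these binomial coefficients is precisely the subpacketization burden discussed at the end of this section.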
In the regime of large M (characterized by the threshold in (85)), the caching gain is only local and C (M) grows as M/(KN). In particular, the memory threshold in (85) corresponds to the extreme case t = K − 1. In this case, the generalized coded caching scheme and its corresponding cache assignment are optimal and achieve the global capacity-memory tradeoff. For larger cache memories, it is optimal to first allocate the total cache memory as proposed by our scheme for t = K − 1, and then allocate all the remaining cache memory uniformly across the receivers (storing the same content in those extra portions of the receivers' cache memories).
By examining several examples, we have demonstrated that assigning the total cache memory uniformly across all the receivers can be highly suboptimal over noisy BCs.
From a practical viewpoint, one of the main technical challenges in implementing the proposed generalized coded caching scheme is its high level of subpacketization, i.e., the fact that messages need to be split into a very large number of smaller parts. Recent efforts focus on schemes with low subpacketization levels; see for example [42]–[44] for results on the noise-free bit-pipe model.

APPENDIX A PROOF OF THEOREM 4

Since R is achievable, for each sufficiently large blocklength n and for each demand vector d, there exist K caching functions g (n) k , an encoding function f (n) (· · · , d), and K decoding functions ϕ (n) k (· · · , d) so that the probability of error P e (n) tends to 0 as n → ∞. Recall that P e (n) is the average over all error probabilities P e (n) (d), d ∈ N K . Let X n d denote the input of the degraded BC corresponding to the chosen encoding functions, and let Y n k,d denote the corresponding channel outputs at receiver k.

Lemma 14:
There exist random variables X d , Y 1,d , . . . , Y K,d and, for each set S as in (18), random variables {U S,1,d , . . . , U S,|S|−1,d }, so that given X d = x ∈ X : and for each S: forms a Markov chain and the following |S| inequalities hold: Proof: The proof is inspired by the converse proof for the capacity of degraded BCs without caching [38]. The details are as follows. Since the worst-case error probability is bounded by ε, Fano's inequality yields where the equality follows by the chain rule of mutual information. Similarly, for k ∈ {2, . . . , K}: where (a) uses Fano's inequality as well as the fact that all messages are independent. Recall that the demand vector d has all distinct entries. We next develop the second summands in (104a) and (104b). For the second summand in (104a) we write where T denotes a random variable that is uniformly distributed over {1, . . . , n} and independent of all previously defined random variables. For k ∈ {2, . . . , K − 1}, we expand the second summand in (104b) as: where (a) follows from the degradedness of the outputs and (b) from (106). Similarly, we also have It can be verified that the defined random variables satisfy Conditions (102). Combining this observation with (104)–(115) concludes the proof.
We average the bounds in (103) over the demand vectors. Let N dist K be the set of all (N choose K)·K! K-dimensional demand vectors with all distinct entries. Also, let Q be a uniform random variable over the elements of N dist K . Notice that the random variables defined in (116)–(119) satisfy conditions (15b) and (68) in the theorem. It remains to prove that they also satisfy (69). To this end, we average the inequalities (103) over all the demand vectors in N dist K . Using standard arguments to take care of the averaging random variable Q, and defining α S,1 := 1 we obtain for each S as in (18): Lemma 15: For each set S, the parameters α S,1 , . . . , α S,|S| satisfy the following constraints: Proof: See Appendix B. By (121)–(122) and letting ε → 0, the following intermediate result, which is used in other proofs in this paper, is obtained.

APPENDIX B PROOF OF LEMMA 15
We only prove the lemma for S = K. The proofs for the other sets are similar.
We first prove (122a). Every α K,k is non-negative, because mutual information is non-negative. To prove the upper bound in (122a), we proceed as follows. Let N dist K again be the set of K-dimensional demand vectors that have K distinct entries in {1, . . . , N}; and for each k ∈ {1, . . . , K} and each (k − 1)-dimensional demand vector d̄ = (d 1 , . . . , d k−1 ), define W d̄ := (W d1 , . . . , W d k−1 ). We have: where (a) holds because for each value of d̄ and j there are (N−k choose K−k)·(K − k)! ordered demand vectors d ∈ N dist K with (d 1 , . . . , d k−1 ) = d̄ and with d k = j; (b) holds by the independence of the messages; and (c) holds because for any random tuple (A 1 , . . . , A L ) it holds that where for each positive integer ξ the term (ξ mod K) takes value in {1, . . . , K} so that For each ℓ ∈ {1, . . . , K − 1} and k, k′ ∈ {2, . . . , K} with k′ ≤ k, we write where (a) follows by (125) and (b) by the independence of the messages. Fix a demand vector d ∈ N dist K and sum up the above inequality (127) over all K cyclic shifts d (0) , d (1) , . . . , d (K−1) of d (where for simplicity we relabel the shifts) to obtain: Since the set N dist K can be partitioned into subsets of demand vectors that are cyclic shifts of each other, and all cyclic shifts of a demand vector in N dist K are also in N dist K , we conclude from (128): This proves (122b).
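The counting in step (a) is easy to verify by brute force for small parameters. The following is an independent sanity check, not part of the proof; the specific values of N, K, k, the prefix, and j are arbitrary examples.

```python
from itertools import permutations
from math import comb, factorial

# Check: among all demand vectors with distinct entries, those with a fixed
# prefix (d_1,...,d_{k-1}) and fixed d_k = j number (N-k choose K-k)(K-k)!.
N, K, k = 6, 4, 2
prefix, j = (0,), 3   # fix d_1 = 0 and d_2 = 3 (example values)

dist = list(permutations(range(N), K))      # |N_dist_K| = (N choose K) K!
assert len(dist) == comb(N, K) * factorial(K)

count = sum(1 for d in dist
            if d[:k - 1] == prefix and d[k - 1] == j)
assert count == comb(N - k, K - k) * factorial(K - k)
print(count, "matching demand vectors, as predicted")
```

Enumerating all distinct-entry demand vectors confirms both the size of N dist K and the per-prefix count used in step (a).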
We proceed to prove constraint (122c). For each d ∈ N dist K : So, where (a) holds by the chain rule of mutual information, (b) by the independence and uniform rate of the messages W 1 , . . . , W N and the definition of the set N dist K , which is of size (N choose K)·K!, and (c) by the generalized Han inequality (Proposition 18 below).
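The generalized Han inequality invoked in step (c) states that h m := (1/(L choose m)) Σ |S|=m H(A S )/m is non-increasing in m. A quick numerical sanity check on a randomly generated joint pmf (our own illustration, not part of the proof):

```python
from itertools import combinations
from math import log2
import random

# Verify Han's inequality: h_m = avg over size-m subsets of H(A_S)/m
# is non-increasing in m, for a generic joint pmf of L binary variables.
random.seed(1)
L = 3
raw = [random.random() for _ in range(2 ** L)]
Z = sum(raw)
pmf = {tuple((idx >> i) & 1 for i in range(L)): p / Z
       for idx, p in enumerate(raw)}

def H(S):
    """Joint entropy (bits) of the variables indexed by S."""
    marg = {}
    for a, p in pmf.items():
        key = tuple(a[i] for i in S)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

h = []
for m in range(1, L + 1):
    subs = list(combinations(range(L), m))
    h.append(sum(H(S) for S in subs) / (len(subs) * m))

assert all(h[i] >= h[i + 1] - 1e-12 for i in range(L - 1))
print("h_m sequence:", [round(x, 4) for x in h])
```

The assertion holds for any joint pmf, which is exactly the content of the inequality used in step (c).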
To simplify notation in the following, we define U S,|S| := X.
We now assume that (142b) holds. We show that the new constraints obtained for k = k̄ and for k = k̄ + 1 cannot be more stringent than the tighter of the two original constraints for k = k̄ and k = k̄ + 1.

APPENDIX D PROOF OF REMARK 3
We first prove that the bound in Theorem 4 is loosened when each α S,k is replaced by α̃ S,k . Consider the intermediate Lemma 16 in the proof of Theorem 4, Appendix A. Relax the upper bound in this lemma by replacing, for k = 2, . . . , K, constraint (122a) by α S,k ≥ 0.
Since the constraints (123) are increasing in α S,1 , . . . , α S,|S| , by constraint (122c) we conclude that the relaxed upper bound is loosest for α S,k = α̃ S,k . We now prove that the bound in Theorem 4 is loosened when each α S,k is replaced by ᾱ S,k . Consider again the intermediate Lemma 16 in Appendix A. Relax constraint (122a) by replacing it with α S,k ≥ 0, for all k = 1, . . . , K. Following the steps in [22, Lemma 12], it can be shown that the new constraints are loosest if each α S,k = ᾱ S,k .
This concludes the proof.

APPENDIX E PROOF OF PROPOSITION 7
For ∆ = 0, achievability follows by specializing Theorem 3 to t = K − 1 and to the input distribution P X that maximizes (74). In fact, for this input distribution: For ∆ > 0, achievability follows from Proposition 1. The converse is proved as follows. Consider cache sizes M *(K−1) 1 , . . . , M *(K−1) K as given in (41). Apply Theorem 4, but consider only the constraints (69) corresponding to the sets S = {k}, for k ∈ K. Taking the average over the resulting K constraints establishes that there exists a random variable tuple (X, Y 1 , . . . , Y K ) satisfying (15b) and such that C(M *(K−1) 1 , . . . , M *(K−1) K ) is bounded by the corresponding expression. Maximizing the right-hand side over input distributions P X yields the desired converse.

APPENDIX F PROOF OF PROPOSITION 12
Relax the upper bound in Theorem 4 by considering the constraints (69) only for the set of all receivers S = K, and by replacing each α S,k by α̃ S,k . Specializing the resulting relaxed bound to the erasure BC, one obtains the following upper bound: where the maximization is over the choice of parameters β 1 , β 2 , . . . , β K ≥ 0 satisfying β 1 + . . . + β K ≤ 1.
The upper bound in the proposition is established by solving this maximization problem. In fact, noticing that the bound is increasing in β 1 , β 2 , . . . , β K ≥ 0, and first fixing β 1 and optimizing over the choices of β 2 , . . . , β K ≥ 0 summing to 1−β 1 , we obtain where we used that for erasure BCs

APPENDIX G PROOF OF COROLLARY 13

Notice that the sum α S,1 + . . . + α S,t takes on only two different values, depending on the outcomes of the minimizations defining α S,k . It is either Combining (164) with (165), applying the correspondence ρ = R −1 and m k = M k /R, and setting m 1 = m 2 = . . . = m K = m yields which is equivalent to the bound in the corollary.