Probability and Statistics for Computer Science

JAMES L. JOHNSON
Western Washington University

Wiley-Interscience
A John Wiley & Sons, Inc., Publication

Copyright © 2003, 2008 by John Wiley & Sons, Inc. All rights reserved. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate.
Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic format. For information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data is available.

ISBN 978-0-470-38342-1

10 9 8 7 6 5 4 3 2 1

To my muses Shelley, Jessica, and Jimmy

Contents

Preface

1 Combinatorics and Probability
  1.1 Combinatorics
    1.1.1 Sampling without replacement
    1.1.2 Sampling with replacement
  1.2 Summations
  1.3 Probability spaces and random variables
  1.4 Conditional probability
  1.5 Joint distributions
  1.6 Summary

2 Discrete Distributions
  2.1 The Bernoulli and binomial distributions
  2.2 Power series
  2.3 Geometric and negative binomial forms
  2.4 The Poisson distribution
  2.5 The hypergeometric distribution
  2.6 Summary

3 Simulation
  3.1 Random number generation
  3.2 Inverse transforms and rejection filters
  3.3 Client-server systems
  3.4 Markov chains
    3.4.1 Irreducible aperiodic Markov chains
    3.4.2 Convergence properties
  3.5 Summary

4 Discrete Decision Theory
  4.1 Decision methods without samples
  4.2 Statistics and their properties
  4.3 Sufficient statistics
  4.4 Hypothesis testing
    4.4.1 Simple hypothesis versus simple alternative
    4.4.2 Composite hypotheses
  4.5 Summary

5 Real Line-Probability
  5.1 One-dimensional real distributions
  5.2 Joint random variables
  5.3 Differentiable distributions
  5.4 Summary

6 Continuous Distributions
  6.1 The normal distribution
    6.1.1 The univariate and bivariate normal distributions
    6.1.2 The multivariate normal distribution
  6.2 Limit theorems
    6.2.1 Convergence concepts
    6.2.2 An inversion formula
  6.3 Gamma and beta distributions
  6.4 The χ² and related distributions
  6.5 Computer simulations
  6.6 Summary

7 Parameter Estimation
  7.1 Bias, consistency, and efficiency
  7.2 Normal inference
  7.3 Sums of squares
  7.4 Analysis of variance
  7.5 Linear regression
  7.6 Summary

A Analytical Tools
  A.1 Sets and functions
  A.2 Limits
  A.3 Structure of the real numbers
  A.4 Riemann-Stieltjes integrals
  A.5 Permutations and determinants

B Statistical Tables

Bibliography

Index

Preface

This text develops introductory topics in probability and statistics with particular emphasis on concepts that arise in computer science. It starts with the basic definitions of probability distributions and random variables and elaborates their properties and applications. Statistics is the major application domain for probability theory and consequently merits mention in the title. Unsurprisingly, then, the text treats the most common discrete and continuous distributions and shows how they find use in decision and estimation problems. It also constructs computer algorithms for generating observations from the various distributions.

However, the text has a major subtheme. It develops in a thorough and rigorous fashion all the necessary supporting mathematics. This approach contrasts with that adopted by most probability and statistics texts, which, for economy of space or for fear of mixing presentations of different mathematical sophistication, simply cite supporting results that cannot be proved in the context of the moment.
With careful organization, however, it is possible to develop all the needed mathematics beyond differential and integral calculus and introductory matrix algebra, and this text purports to do just that. Of course, as the book lengthens to accommodate the supporting mathematics, some material from the typical introduction to probability theory must be omitted. I feel the omissions are minor and that all major introductory topics receive adequate attention. Moreover, engagement with the underlying mathematics provides an opportunity to understand probability and statistics at a much deeper level than that afforded by mechanical application of unproved theorems.

Although the presentation is as rigorous as a pure mathematics text, computer science students comprise the book's primary audience. Certain aspects of most computer science curriculums involve probabilistic reasoning, such as algorithm analysis and performance modeling, and frequently students are not sufficiently prepared for these courses. While it is true that most computer science curriculums do require a course in probability and statistics, these courses often fail to provide the necessary depth. This text certainly does not fail in presenting a thorough grounding in elementary probability and statistics. Moreover, it seizes the opportunity to extend the student's command of mathematical analysis. This approach differs from that taken by other probability and statistics texts currently aimed at computer science curriculums. The more rigorous approach does require more work, both from the student and from the instructor, but the rewards are commensurate. The engineering sciences, like computer science, also tend to use texts that place more emphasis on mechanical application of results than on the mathematical derivation of such results. Consequently, engineering science students will also benefit from the deeper presentation afforded by this text.
Nevertheless, the primary audience remains computer science students because many of the illustrative examples are computer science applications. Therefore, from this point forward, I assume that I am addressing a computer science student or instructor.

Computer science students typically follow a traditional curriculum that includes one or two terms of probability and statistics, which follow prerequisite courses in differential and integral calculus and linear algebra. Although these prerequisite courses do introduce limit processes and matrix transformations, they typically emphasize formulas that isolate applications from the underlying theory. For example, if we drain a swimming pool with a sinusoidal cross-section, we can calculate how fast the water level falls without invoking limit operations. We simply set up a standard differential ratio and equate it to the drain flow rate. Why this works is buried in the theory and receives less and less emphasis once a satisfactory collection of calculation templates is available. This text provides an opportunity to reconnect with the theoretical concepts of these prerequisite courses. As it probes deeper into the properties of probability distributions, the text puts these concepts to fruitful use in constructing rigorous proofs.

The book's ambient prose deals with the principal themes and applications of probability, and a sequence of mathematical support modules interrupts this prose at strategic junctures. With some exceptions, these modules appear as needed by the probability concepts under discussion. A reader can omit the modules and still obtain a good grounding in elementary probability and statistics, including philosophical interpretations of probability and ample exercise in the associated numerical techniques. Reading the support modules will, however, strengthen this understanding and will also arouse an appreciation for the mathematics itself. The encapsulation is as follows.
An appendix gathers selected topics from set theory, limit processes, the structure of the real numbers, Riemann-Stieltjes integrals, matrix transformations, and determinants. The treatment first reviews the material at an introductory level. The prepared reader will be familiar with these concepts from previous courses, but the results are nevertheless proved in detail. The less prepared reader will certainly find frequent recourse to the appendix, and the text provides pointers to the appropriate sections. However, even the prepared reader will benefit from the introductory presentations, which serve both as a review of proof technique and as an introduction to the argument style pursued in the main text. Upon completing an introductory review, the appendix then extends the topics as necessary to support the arguments that appear in the main body of the text. Therefore, all chapters depend on the appendix for completeness. Even a reader well grounded in the aforementioned prerequisites can expect to spend some time mastering the specialized tools developed in the appendix.

The appendix, with its eclectic collection of review topics and specialized extensions, provides general mathematical background. There is need, however, for more specific supporting mathematics in connection with particular probabilistic and statistical concepts. Until perhaps halfway through the text, this supporting mathematics appears in mathematical interludes, which occur in each chapter. These interludes introduce particular results that are needed for the first time in that chapter. The first interlude deals with summation techniques, which are useful tools for the combinatoric problems associated with probability over equally likely outcomes. Others treat convergence issues in power series, stability features of Markov matrices, and sufficient statistics.
Before taking up continuous distributions, however, it is appropriate to devote a full chapter to the mathematical issues that arise when one attempts to generalize discrete probability to uncountable sets and to the real line in particular. This chapter is actually a brief introduction to measure theory, and its logical place is just prior to the discussion of the common distributions on the real line. Two further interludes follow in subsequent chapters. They deal with limit theorems for continuous random variables and with decompositions of the sample variance. In short, the text exploits opportunities to introduce the mathematical analysis necessary to establish the basic results of probability theory. Moreover, the presentation clearly considers the mathematical analysis and the probability theory to be of equal importance.

The following sketch shows the dependencies among the chapters, with the understanding that portions of the appendix are prerequisite for any given path. The dashed boxes note the mathematical interludes within the chapters.

[Dependency diagram spanning Chapter 4 (Discrete Decision Theory), Chapter 5 (Probability on the Real Line), Chapter 6 (Continuous Distributions), and Chapter 7 (Parameter Estimation), with dashed boxes for the interludes "Limit theorems" and "Sums of squares."]

The reader can study the appendix in detail to ensure familiarity with all the background mathematics needed in the text, or can start immediately with the probability discussions of Chapter 1 and refer to the appendix as needed. Because of its breadth, the appendix is more difficult to master in its entirety than the mathematical interludes of the introductory chapters. A reader who prefers that the material increase monotonically in difficulty should start with the introductory chapters and digress into the appropriate appendix sections as needed. When using the text to support a course, an instructor should follow a similar path. As noted earlier, the intended audience is computer science students.
Once past colored balls in numbered urns, which constitute the traditional examples in combinatoric problems, the text uses examples that reflect this readership. Client-server performance evaluation, for instance, offers many opportunities for probabilistic analysis. These examples should provide no difficulty for other readers, such as students from the engineering sciences, because the examples make no profound references to advanced concepts, but rather use generally accessible quantities, such as terminal response time, server queue length, error count per 1000 programming statements, or operation count in an algorithm. These examples are no more difficult than those in a more general probability text that ventures beyond the traditional urns, colored balls, and dice.

The requirements of the Computer Science Accreditation Board and the Accreditation Board of Engineering Technology (CSAB/ABET) include a one-semester course in probability and statistics. This text satisfies that requirement. In truth, it is sufficient for a full-year course because it not only develops the traditional introductory probability concepts but also includes considerable material on mathematical reasoning. For a one-semester course, the following selection is appropriate. Note that the topics lie along an acceptable dependency chain in the earlier diagram.

Appendix. Sections as referenced in the items below
Chapter 1. Combinatorics and Probability
Chapter 2. Discrete Distributions
Chapters 3-4. Simulation, Sections 3.1 to 3.3, or Discrete Decision Theory, Sections 4.1 and 4.2
Chapter 6. Continuous Distributions, Sections 6.1, 6.3, and 6.4
Chapter 7. Parameter Estimation, Sections 7.1, 7.2, and 7.4

The one-semester abbreviation is possible because Chapters 3 and 4 present major applications of discrete probability and, in the interest of time, need only be sampled.
Chapter 5 is advanced material that elaborates the difficulties in extending discrete probability to uncountable sample spaces. It is present for logical completeness and to answer the nagging question that occurs to many students: Is the introduction of a sigma-algebra really necessary in the general definition of a probability space? Consequently, the proposed one-semester course omits Chapter 5 with minimal impact on subsequent material. Finally, Chapter 7 undertakes major applications of continuous probability and also admits partial coverage.

At the time of this writing, many computer science curriculums include only a first probability course. However, there is a recognized need for further study, at least in the form of an elective second course, if not in a required sequel to the introductory course. Anticipating that this increased attention will also expose the need for a more complete mathematical treatment of the material, I have provided unusually detailed excursions into supporting topics, such as estimation arguments with limits, properties of power series, and Markov processes. Buttressed by these mathematical excursions, the text provides a thorough introduction to probability and statistics: concepts, techniques, and applications.

Consequently, it offers a continuing discussion of the real-world meaning of probabilities, particularly when the frequency-of-occurrence interpretation becomes somewhat strained. Any science that uses probability must face the interpretation challenge. How can you apply a result that holds only in a probabilistic sense to a particular data set? The text also discusses competing interpretations, such as the credibility-of-belief interpretation, which might appear more appropriate to history or psychology. The goal is, of course, to remain continually in touch with the real-world meaning of the concepts.
Probability as frequency of occurrence over many trials provides the most compelling interpretation of the phenomenon. It is intuitively plausible, for example, that a symmetric coin should have equal chances of landing heads or tails. The text attempts to carry this interpretation as far as possible. Indeed, the first chapter treats the combinatorics arising from symmetric situations, and this treatment serves as a prelude to the formal definitions of discrete probability. As the theory accumulates layer upon layer of reasoning, however, this viewpoint becomes difficult to sustain in certain cases. When testing a hypothesis, for example, we attempt to infer the prevailing state of nature from sampled data. What does it mean to assign a priori probabilities to the possible states? This practice allows statisticians to incorporate expert judgment into the decision rules, but the assigned probabilities do not admit a frequency-of-occurrence interpretation. Rather, they reflect relative strength-of-belief statements about the possible states. As necessary, the text interrupts the technical development to comment on the precise real-world interpretation of the model. Although beautiful as abstract theory, probability and statistics are also rightly praised for their ability to deliver meaningful statements about the real world. Interpreting the precise intent of these statements should be a primary goal of any text.

A trend in modern textbooks, particularly those not addressed specifically to a mathematics curriculum, is to avoid the theorem-proof presentation style. This style can be sterile and detached, in the sense that it provides sparse context for the motivation or application of the theorems. Without the theorem-proof style, on the other hand, arguments lose some precision, and there is a blurring of the line between the general result and its specific applications. I have adopted what I consider a middle ground.
I maintain a running prose commentary on the material, but I punctuate the dialog with frequent theorems. Often, the theorem's proof is a simple statement: "See discussion above." This serves to set off the general results, and it also provides reference points for later developments. The ambient prose remains connected with applications and with the questions that motivate the search for new general results. Plentiful examples, displayed in a contrasting typographical style, play a major role in compensating for the perceived coldness of the theorems. Incidentally, I should say that I do not find the theorems cold, even in isolation. But I am responding to the spirit of the age, which suggests that a theorem wrapped in an example is more digestible than a naked theorem. The theorems also further a second ambition, noted above, which is to involve the reader more extensively in precise mathematical argument.

An aspect of proofs that attracts major criticism is the tendency to display, out of thin air, an expression that magically satisfies all the required constraints and invites the algebraic manipulations necessary to complete the proof. I have tried to avoid this practice by including some explanation of the mysterious expression's origin. I must admit, however, that I am not always successful in this ploy. Sometimes an explanation adds nothing to a careful contemplation of the expression. In such cases, I am tempted to suggest that the reader reflect on the beauty of the expression, note how one part attaches to the known information while another extends toward the desired result, and view the expression as a unifying link, growing naturally from a study of the context of the problem in question. Instead, however, I fall back on the age-old practice: "Consider the following expression ..." The reader should take these words as an invitation to pause and ponder the situation.
In summary, the text develops a main theme of probability and statistics, together with the mathematical techniques needed to support it. Since it is not practical to start with the Peano axioms, there are, however, some prerequisites. Specifically, the text's mathematical level assumes that the reader has mastered differential and integral calculus and has some exposure to matrix algebra. Nevertheless, acknowledging the mechanical fashion in which these subjects are taught these days, the text provides considerable detail in all arguments. It does assume, however, that readers have some familiarity with limiting operations, even if they do not have significant experience with the concept. For example, readers should be comfortable with l'Hôpital's rule for evaluating limits that initially appear to produce indeterminate results. By contrast, the text does develop the theory of absolutely convergent series to the point of justifying the interchange of summation order in double summations. As another example, the readers should be generally conversant with power series, although the text develops this topic sufficiently to justify the term-by-term differentiation that is needed to recover parameters from a random variable's moment generating function. The background appendix and the mathematical interludes should bridge the gap between prerequisite knowledge and that needed to establish all probability and statistical concepts encountered. They also serve to present a self-contained book, which is my preference when learning new material. A reader can always skip sections that are peripheral to the main point, but cannot as easily fill in omissions.

Some expositions consist of step-by-step procedures for solving a probabilistic or statistical problem. That is, they involve algorithms. For example, algorithms appear in sections concerned with computer simulations of probabilistic situations.
In any case, algorithms in this text appear as mutilated C code, in the sense that I vary the standard syntax as necessary to describe the algorithm most clearly to a human reader. For instance, I use only two iterators, the while-loop and the for-loop, and in each case, the indentation serves to delineate the body of statements under repetition. The fragment below, intentionally meaningless to focus attention on the code structure,

    while (X > Y)
        for (j = 1; j <= n; j++)
            Y = Y - t[j];
            t[j] = t[j] + 1.0;
        if (Y < 0.0) Y = 0.0;

must be reformulated as follows to actually compile properly:

    while (X > Y) {
        for (j = 1; j <= n; j++) {
            Y = Y - t[j];
            t[j] = t[j] + 1.0;
        }
        if (Y < 0.0) Y = 0.0;
    }

This practice amounts to omitting the opening and closing braces, and it is surprisingly efficient in reducing the code length. The price, of course, is that the code will not execute as written. I also omit variable declarations and simply use any variables that I need in an untyped manner. So a variable, or a function return, can be a scalar, a vector, a matrix, or a list. A particular usage is always clear from context, and these omissions, like those of scope enclosures, produce more compact code.

In summary, the algorithms in this text are intended for human consumption. The emphasis is on the concept, not the detailed syntax of a programming language. Nevertheless, I have constructed all the algorithms in functional code and executed them to ensure that they perform as promised.

I emphasize that this text requires no C programming background. A C-based style finds frequent use in books and journals as a concise and precise algorithm presentation method for audiences that are not C literate. This widespread usage confirms the opinion that loose C structures are easily accessible to readers who do not have a background knowledge of C. Consequently, a reader with no previous C experience should not feel anxious about the algorithm descriptions in this text.
Although readers must want to understand the algorithm and must expend the time to attend carefully to the iterative processes in the C expression, they will find the algorithm comprehensible. Moreover, explanatory prose accompanies all C-structure descriptions.

Another feature that will be appreciated by today's students is the text's early emphasis on discrete distributions. Most introductory probability books quickly proceed to the normal distribution and the central limit theorem because these tools enable certain computational approximations. I certainly do not wish to imply that the normal distribution or the central limit theorem is not important. These topics are developed in detail in the latter half of the book. However, much reasoning with discrete distributions can take place without replacing the discrete distribution with a more tractable normal distribution. This is especially true today when computers complete the mechanical, but sometimes tedious, computations. Working directly with discrete distributions develops a feeling for the subtleties of probability theory that can be overlooked in a rush to obtain approximate results. The many examples involving discrete distributions will sharpen a reader's capabilities for combinatorial reasoning and will exercise his or her skills in discrete mathematical argument. These skills are especially important in today's scientific world, which grows ever more digital, and therefore ever more discrete.

In keeping with this approach, the text develops simulation techniques and certain aspects of statistical inference directly after introducing a repertoire of discrete distributions but before developing continuous distributions. This means that certain proofs, which admit more elegant expositions with continuous tools, contain detailed arguments with limits.
These arguments are conceptually similar to those associated with calculus (e.g., passing to an integral limit from a summation), but the text nevertheless presents more intermediate steps than one normally finds in a textbook.

All definitions, theorems, and examples share a common numbering sequence within each chapter. To find Theorem 3.14, for example, you can use any definition, theorem, or example to direct your search forward or backward. I find it a considerable annoyance when these items are numbered separately. Of course, this means that there may be a Theorem 3.14 even though there is no Theorem 3.13. I try to make this dissonance more acceptable by placing the number before the theorem, definition, or example. Having located Theorem 3.14, for example, you find that it starts "3.14 Theorem." In this reading, Theorem 3.14 is an item, which happens to be a theorem. There is indeed an item 3.13, which may or may not be a theorem. In the traditional manner, tables and figures exhibit separate numbering sequences within each chapter.

Chapter 1

Combinatorics and Probability

The computer and engineering sciences investigate many nondeterministic phenomena. In computer science, examples include an algorithm's running time, the queue length encountered by a request arriving at a database server, and the delay in transferring a data block from disk to main memory. Currents in semiconductor devices, an important concern in electrical engineering, depend on the energy distributions of charge carriers. Biological evolution appears to involve random mutations in cells' offspring. In any applied science, measurement error adds an uncontrollable variation to the simplest data-gathering activity. Therefore, it is common laboratory practice to average a sequence of measurements to reduce random effects.

We also use probabilistic terms to describe certain ordinary events of everyday life. The chance of rain today is 40%. The probability of heads on a coin toss is 1/2.
The odds that team X will win a sports event are 3:1. Once in every thirty-six throws, approximately, a pair of dice will exhibit the minimum total of two spots on the upturned faces. In all these cases, we understand that some repetitive process generates outcomes that are not deterministic but nevertheless exhibit a rational pattern. For example, each rotation of the planet brings a new day, on which rain may or may not occur. We interpret a 40% chance of rain as meaning that over an extended run of days, 40% of them will be rainy.

In this initial informal discussion, we use the term probability in the ordinary sense implied by the examples above. That is, the probability of an event is its relative frequency of occurrence over many observations. Probability is simply a synonym for chance. The probability of rain is 40%. The chance of rain is 40%. Both mean that similar meteorological conditions in the past have produced rain in 40% of the observations. More exact mathematical definitions appear later in the chapter, and they are consistent with the intuitive meanings discussed here.

When faced with a repeating phenomenon that generates varying outcomes, we can adopt one of two viewpoints. On the one hand, we can argue that the phenomenon does not actually repeat. It only appears to do so because our knowledge is not sufficiently detailed to distinguish between iterations. On the other hand, we can accept the variations and plan accordingly. With the latter approach, we are indifferent as to whether the variations arise because we are not sufficiently informed or because the phenomenon is fundamentally nondeterministic. Consider again the weather forecast that predicts a 40% chance of rain under certain meteorological conditions. The first viewpoint suggests that the meteorological conditions are merely approximations of the true state of nature.
It implies that we should direct our efforts toward attaining complete knowledge of nature, which would allow a prediction with 100% certainty. The second attitude suggests that we should live with the uncertainty. We should lay plans that make provisions for alternative outcomes. Indeed, an adherent of the first viewpoint must frequently follow the second course because complete knowledge of nature is inaccessible. Most persons find it practical to make a decision about carrying an umbrella without waiting for meteorologists to acquire complete knowledge of the weather.

Probability theory provides a rigorous mathematical framework for investigating phenomena that exhibit repeatable patterns, even though individual outcomes may appear random. As Hamlet put it, "Though this be madness, yet there is method in't." Probability theory provides a lens that enables us to see the patterns underlying an apparently chaotic surface, a lens that tames the madness to reveal the method.

Knowledge of repeatable patterns can be very useful, even if there are occasional disappointments. If I know that team X will play an extended sequence of games and that the chances of winning remain 3 to 1 throughout the series, I might find a betting office that will accept the following proposition. I will pay the office $10 for each game. When team X wins a game, the office will pay me $15; when team X loses, I get nothing back. Suppose that the team plays 12 games and, in keeping with the odds, wins 3 for each loss. That is, the team wins 9 games and loses 3. I pay $120, and I receive $135, for a net gain of $15. Of course, the betting office can run this calculation as well as I can, and therefore it would not accept my proposition. In truth, the betting office can run the same calculation only if it also knows that the 3:1 odds are an accurate description of the situation. So informed bets are made only between parties who have different perceptions of the underlying odds.
Long study and experience can produce credible estimates of the odds for a sports event, or for the chance of rain on a given day, but the results cannot be completely accurate. Other probabilistic descriptions, however, can be precise, at least in an idealized situation, because they describe certain favorable combinations as a fraction of all possible outcomes. Only two outcomes, for example, are possible in a fair coin toss, heads and tails. Therefore, the probability of heads in a fair toss is 1/2. Similarly, rolling a pair of dice will produce one of the thirty-six patterns in the grouping that follows, where the ordered pair (x,y) means that the first die shows x spots on its upturned face and the second die shows y.

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

We can infer the probabilities of certain simple events from direct observation of the display. For example, noticing that the outcomes summing to seven are the diagonal entries, running from the lower left to the upper right, we conclude that a seven total has probability 6/36. The calculations for other results are equally straightforward. The probability of a three total is 2/36. The probability that the absolute difference between the two dice is greater than two is 12/36 because the favorable outcomes are the upper-right and lower-left triangles, each containing six entries. Probabilistic quantities appear frequently in the computer and engineering sciences.
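The dice probabilities just computed are small enough to confirm by exhaustive enumeration; a short Python check:

```python
from itertools import product

# All 36 equally likely outcomes (x, y) for a roll of two dice
outcomes = list(product(range(1, 7), repeat=2))

seven = sum(1 for x, y in outcomes if x + y == 7)        # total of seven
three = sum(1 for x, y in outcomes if x + y == 3)        # total of three
spread = sum(1 for x, y in outcomes if abs(x - y) > 2)   # difference > 2

print(seven, three, spread)  # 6 2 12, giving 6/36, 2/36, and 12/36
```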
To the examples noted earlier we can add the time required for a computer to respond to a command from an interactive terminal, the number of errors in a program module, the number of polls to find a receptive device, the time to access a transmission bus, the number of memory bits flipped by cosmic radiation, or the number of compare operations executed by a sort algorithm. In many cases, it is difficult to calculate meaningful probability weights. For example, suppose that you measure the system response time T to a terminal command by directly observing a number of such operations. You then have an empirical estimate of the fraction of cases for which T = 2 seconds. But is it possible to compute this fraction in advance from your knowledge of the system parameters? Such a computation is not likely; it depends on circumstances that are neither repeatable nor easily controlled, such as parallel activity from other system users. By contrast, the number of compare operations in a sort algorithm is subject to analysis. The possible inputs of size N are the permutations of the numbers 1, 2, ..., N, and it is reasonable to assume that all such inputs are equally likely. For a specific input, the algorithm determines exactly the number of compare operations, and we can therefore proceed by direct count to enumerate the fraction of cases for which the count is a particular n. This chapter opens our study of probability by considering just such situations, where we can compute the relevant probabilities by exhaustive counting. Each observation of a nondeterministic phenomenon is called a trial. In symmetric situations, such as coin tosses or dice rolls, we can reasonably assume that each elementary outcome appears with equal frequency in a long sequence of trials. However, as we associate trials in various ways, the equal frequencies quickly disappear.
For example, a tossed coin displays either a head or a tail, and we expect a sequence of tosses to exhibit both results with approximately equal frequencies. But, if we change the definition of a trial to cover all tosses necessary to obtain a head, each observation is a number of tails followed by a head. It is not likely that all such observations occur with equal frequency. Do you feel that 6 tails followed by a head will occur as often as 2 tails followed by a head? This chapter's early sections develop the tools to answer such questions. In the process, we obtain an intuitive appreciation for the probability of an event, such as 6 tails followed by a head, as its relative frequency of occurrence in an extended sequence of observations. In a particular application, the usual difficulty is the systematic counting of the total possible outcomes and of the outcomes favorable to a certain result. Unlike the simple examples cited above, most applications provide a very large field of possible outcomes, and it is not feasible to list them explicitly. Before launching into the details, we take the time to elaborate three counting principles. All are immediately obvious, so you may feel that there is no need to state them explicitly. In complicated situations, however, when you cannot seem to find a starting point for a solution, you might well return to these simple principles. The first principle suggests breaking the mass to be counted into disjoint pieces and then summing the subcounts from these pieces. That is, if you want to count rooms within a building, you first perform subcounts on each floor and then add the subcounts. In applying this principle, you need to ascertain that there is no overlap among the pieces, which could lead to overcounting some elements. The second principle, known as the multiplicative principle, is actually a special case of the first principle.
It again breaks the mass into disjoint pieces, but now such that each piece contains the same number of elements. The total count is then the number of pieces times the count of any one piece. If, for example, you want the number of houses on a rectangular site, you count the houses in one row and multiply by the number of rows. The third principle, known as the pigeonhole principle, states that the allocation of more than n pigeons to n pigeonholes results in multiple occupancy for at least one pigeonhole. If we randomly allocate four balls among three containers, what is the probability that all containers receive at most one ball? By the pigeonhole principle, it is zero. Some container must receive at least two balls. Consider a related question: What is the probability that the leftmost container receives at most one ball? Now, that is a more complicated story, which we now begin.

1.1 Combinatorics

Many situations call for counting the number of ways to select, under specified constraints, objects from a given set. When we say that a set contains n objects, we understand, in keeping with the usual definition of a set, that the objects are distinct. We may wish, nevertheless, to disregard distinctions among certain elements. For example, in a set of 7 balls, we can have 4 reds and 3 blacks. In this case, we can adopt a notation that both captures the elements' distinct identities and emphasizes the two internal classes of red and black. The designation {r1, r2, r3, r4, b1, b2, b3} serves this purpose, but other schemes are equally valid. The best notation varies with the context of the problem at hand. In some manner or other, a problem context always specifies three general constraints and perhaps further particular constraints. The general constraints answer the questions: How many objects are selected? Does the selection order make a difference? After drawing an object and recording the result, do we return the object to the set before drawing the next item? The responses to some questions affect the possible answers to others. For example, if you return objects to the common pool after each selection, then the total number of objects drawn can exceed the pool size.

Comb.   Sequences
abc     abc, acb, bac, bca, cab, cba
abd     abd, adb, bad, bda, dab, dba
acd     acd, adc, cad, cda, dac, dca
bcd     bcd, bdc, cbd, cdb, dbc, dcb
(a) Sampling without replacement

Comb.   Sequences                        Comb.   Sequences
abc     abc, acb, bac, bca, cab, cba     bbc     bbc, bcb, cbb
abd     abd, adb, bad, bda, dab, dba     bcc     bcc, cbc, ccb
acd     acd, adc, cad, cda, dac, dca     bbd     bbd, bdb, dbb
bcd     bcd, bdc, cbd, cdb, dbc, dcb     bdd     bdd, dbd, ddb
aab     aab, aba, baa                    ccd     ccd, cdc, dcc
abb     abb, bab, bba                    cdd     cdd, dcd, ddc
aac     aac, aca, caa                    aaa     aaa
acc     acc, cac, cca                    bbb     bbb
aad     aad, ada, daa                    ccc     ccc
add     add, dad, dda                    ddd     ddd
(b) Sampling with replacement

TABLE 1.1. 3-combinations and their related 3-sequences from a field of 4 symbols

1.1 Definition: An ordered selection of k objects is a k-sequence. An unordered selection of k objects is a k-combination. In the process of sampling with replacement, we note the identity of each object as it is drawn, but we return it to the common pool before the next draw. In the alternative process, sampling without replacement, we do not return the drawn object to the pool. | From the set {a, b, c, d}, we generate four 3-combinations and twenty-four 3-sequences when sampling without replacement. The upper tabulation of Table 1.1 suggests a 6-to-1 relationship between the sequences and the combinations. Sampling with replacement, on the other hand, yields twenty 3-combinations and sixty-four 3-sequences, as illustrated in the lower tabulation. If there is a relationship between a combination and the sequences involving the same choices, it is less apparent. Duplications appear in both sequences and combinations when choosing with replacement, and this complicates the count.
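The counts in Table 1.1 are small enough to confirm directly. A minimal sketch using Python's standard itertools generators, which correspond one-to-one with the four sampling disciplines:

```python
from itertools import combinations, combinations_with_replacement, permutations, product

symbols = "abcd"
# Sampling without replacement from {a, b, c, d}
seqs_wo = list(permutations(symbols, 3))                      # 3-sequences
combs_wo = list(combinations(symbols, 3))                     # 3-combinations
# Sampling with replacement
seqs_w = list(product(symbols, repeat=3))                     # 3-sequences
combs_w = list(combinations_with_replacement(symbols, 3))     # 3-combinations

print(len(seqs_wo), len(combs_wo))  # 24 4, the upper tabulation
print(len(seqs_w), len(combs_w))    # 64 20, the lower tabulation
```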
As noted earlier, we use the term probability for relative frequency of occurrence. Suppose, for example, that we are generating sequences without replacement in the context of Table 1.1. Note that 4 of the 24 possible sequences contain the ordered pair "ab." Consequently, we say that the probability of "ab" occurring as a subsequence is 4/24 = 1/6. This policy agrees with our intuition when all sequences are equally likely. When, in due course, we come to a formal definition of probability, we will find that it conforms with this frequency-of-occurrence notion.

1.1.1 Sampling without replacement

Under sampling without replacement, neither a sequence nor a combination exhibits a duplicate entry. Sequences, however, have more structure because the insertion order is important. Our goal is to find systematic methods for counting sequences and combinations under sampling without replacement, but also under a variety of further constraints. We start with sequences. Suppose that we have seven tiles, labeled with the letters: a b c d e f g. Suppose further that we consider a word to be any 3-sequence. How many such words can we compose with the seven tiles? Possible words are abc, cag, def, fed, and the like. We imagine a repetitive process. On each trial, the process selects three tiles from the initial seven and assembles them in the order selected to form a word. There is no replacement as the tiles are selected. However, all seven tiles are reinstated at the beginning of each trial. With this clarification, the desired count is the number of 3-sequences from a set of seven symbols, where the selection process is without replacement. We apply the multiplicative counting principle that breaks the possibilities into groups of the same size. In particular, a systematic count arises from partitioning the words into seven groups—those beginning with a, those beginning with b, and so forth.
Within the first group, where the first letter is a, we distinguish six subgroups: those with second letter b, those with second letter c, and so forth through those with second letter g. In the second group, where the first letter is b, we again distinguish six subgroups: those with second letter a, those with second letter c, and so forth through those with second letter g. An important observation is that each group divides into the same number of subgroups. Table 1.2 organizes these subdivisions. Within each subgroup, where the first two letters are now specified, there remains a choice of one of the five remaining letters for the last tile. Hence each of the 42 subgroups divides naturally into five further pieces, each distinguished by one of the remaining five letters. So, if we choose three tiles randomly from the group of seven and lay them out in the order chosen, we will form one of exactly 7 · 6 · 5 = 210 three-letter words. In terms of Definition 1.1, this demonstration shows that there are exactly 210 possible ways of choosing, without replacement, a sequence of length 3 from a group of size 7. Now suppose that we want the probability that a three-letter sequence chosen without replacement from a field of seven letters has its first two letters in alphabetical order. In other words, we want the fraction of the 210 sequences that have the first two letters in order.

First letter   Second letter   Possible third letters
a              b c d e f g     cdefg  bdefg  bcefg  bcdfg  bcdeg  bcdef
b              a c d e f g     cdefg  adefg  acefg  acdfg  acdeg  acdef
c              a b d e f g     bdefg  adefg  abefg  abdfg  abdeg  abdef

TABLE 1.2. Organizing the three-letter words from seven distinct tiles

Referring to the organization above, we see that the first group, containing 6 · 5 = 30 sequences, meets the criterion. Since the first letter is a, the first two letters must be in order, regardless of the remaining letters.
Sequences in the second group, however, begin with b, so the first subgroup, where the second letter is a, must be excluded. The second group then contributes 5 · 5 = 25 sequences that meet the criterion. The third group starts with c, so we must exclude its first two subgroups, where the second letter is a or b. Hence, we obtain only 4 · 5 = 20 sequences here. The pattern is clear: Each group must exclude one more subgroup than its predecessor. The total number of sequences with the first two letters in order is then 6 · 5 + 5 · 5 + 4 · 5 + 3 · 5 + 2 · 5 + 1 · 5 + 0 · 5 = 105, which gives a probability of 105/210 = 1/2. Is this result surprising? The number of sequences with the first two letters in order is exactly one-half of the total number of three-letter sequences. A so-called "sanity check" is always advisable. This means that we should try to confirm by some other method that the result is correct or at least reasonable. An obvious check in this case is that the result is a proper fraction, between zero and 1, because it represents the ratio of some number of selected possibilities to the total number of possibilities. So the result is reasonable in this sense. However, we can argue further that it should be 1/2 exactly. Suppose that we reorganize the three-letter sequences into two different groups: G1 contains those sequences with the first two letters in order and G2 contains those with the first two letters out of order. We can define a function f : G1 → G2 by f(xyz) = yxz, where xyz is a three-letter sequence from G1. This means that x < y alphabetically, so yxz is indeed a sequence in G2. This function establishes a one-to-one correspondence between the two sets, so they must have the same number of elements.
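The two counts derived above, 210 words and 105 with the first two letters in alphabetical order, can be confirmed by brute-force enumeration:

```python
from itertools import permutations

tiles = "abcdefg"
words = list(permutations(tiles, 3))          # all 3-letter words, no repeated tiles
in_order = [w for w in words if w[0] < w[1]]  # first two letters in alphabetical order

print(len(words), len(in_order))  # 210 105, so the probability is 1/2
```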
Using the notation |X| to indicate the number of elements in a set, we have that the probability of a three-letter sequence having its first two letters in order is |G1|/|G1 ∪ G2| = |G1|/(2|G1|) = 1/2. With some additional notation, we can extend the example to a general result. Recall that n! means the product of the integers n · (n − 1) · (n − 2) · · · 1 and that we define 0! = 1. The notation P_{n,k} will mean the first k factors in the expansion for n!. 1.2 Definition: For integers n > 0, k ≥ 0, P_{n,k} denotes the kth falling factorial of n. It is the product n · (n − 1) · (n − 2) · · · (n − k + 1). For the moment, we think of the "P" as an abbreviation for "product." As a special boundary case, we define P_{n,0} = 1. | For k > n the expansion contains a zero factor, and therefore P_{n,k} = 0. For 0 < k < n, we have

P_{n,k} = n · (n − 1) · (n − 2) · · · (n − k + 1)
        = [n · (n − 1) · · · (n − k + 1) · (n − k) · · · 2 · 1] / [(n − k) · (n − k − 1) · · · 2 · 1]
        = n!/(n − k)!
P_{n,n} = n · (n − 1) · (n − 2) · · · 2 · 1 = n!.

1.3 Theorem: Let n > 0 and k ≥ 0. Drawing without replacement from a set S containing n symbols, we can construct exactly P_{n,k} different k-sequences. PROOF: If k = 0, then P_{n,k} = 1, and there is indeed just one 0-sequence—the empty sequence. If k > n, then P_{n,k} = 0, but there are no possible k-sequences that we can select without replacement from the pool of n symbols. The theorem is therefore correct for k = 0 and for k > n. For k in the intermediate range, we proceed by induction. For k = 1, we count exactly n 1-sequences, each being one of the symbols from S. Since P_{n,1} = n, the theorem is correct in this base case. If k > 1, divide the possible k-sequences into n groups, each distinguished by its starting symbol. The sequences within a group start with a specified first symbol and continue with some sequence of length k − 1 from the remaining n − 1 symbols.
Hence the number of sequences in the group is the number of continuation sequences, which is P_{n−1,k−1} from the induction hypothesis. The total number of k-sequences is then

n · P_{n−1,k−1} = n · (n − 1)!/(n − k)! = n!/(n − k)! = P_{n,k}. |

1.4 Definition: A rearrangement of the ascending sequence 1, 2, 3, ..., n is called a permutation of these numbers. | An immediate application of Theorem 1.3 shows that there are exactly n! permutations of the numbers 1, 2, 3, ..., n. The next example provides a more extensive application and also suggests a more useful formula—one that deals with combinations rather than sequences.

Lowest card:     2   3   4   5   6   7   8   9  10   J   Q   K   A
In-order hands: 66  55  45  36  28  21  15  10   6   3   1   0   0
Grand total: 286

TABLE 1.3. In-order three-card hands from thirteen spades (see Example 1.5)

1.5 Example: A dealer removes all cards from a standard deck, except the spades, shuffles the remaining cards, and deals you three cards. What is the probability that you receive your cards in a sequence of increasing value? As you receive your cards, you keep them. Therefore, you receive a 3-sequence, chosen without replacement from a set of 13. Theorem 1.3 then says that you receive one of the P_{13,3} = 1716 possible sequences. To count the in-order sequences, we apply the multiplicative counting principle to elaborate several disjoint categories. If the deal starts (2,3), then any of the remaining 11 cards completes an in-order hand. If the deal starts (2,4), then only 10 of the remaining cards complete an in-order hand because we must exclude the 3. Similarly, only 9 cards successfully complete a hand that starts with (2,5).
Thus, the number of in-order hands starting with 2 is 11 + 10 + 9 + · · · + 1 + 0 = 66. Continuing with the hands that start with 3, we organize the in-order hands as shown in Table 1.3, which produces a grand total of 286 in-order hands. The required probability is then 286/1716 = 1/6. So you should receive your three cards in increasing order about once in every six deals. Can you think of a sanity check for why this should be so? Suppose that we ignore card order for the moment. We divide all possible 3-sequences into groups that contain the same cards. Suppose that there are N such groups. Each group contains 3-sequences that are rearrangements of the same three cards, so according to Theorem 1.3, each group contains P_{3,3} = 3! = 6 sequences, and only one of them can be in increasing order. So there must be N sequences in increasing order and a total of 6N sequences all together. The probability of receiving an in-order hand is then N/(6N) = 1/6. □ We have learned that a set of n distinct symbols admits P_{n,k} k-sequences when they are chosen without replacement. These sequences are, of course, ordered displays of k symbols. A combination, however, is an unordered display, which we can visualize as a subset of the original set. Suppose that we want to count instead the number of k-combinations. That is, how many k-combinations can we construct if we choose without replacement from n distinct symbols? We follow the suggestion at the end of Example 1.5 to prove the following theorem. First, however, we introduce a notational convenience. 1.6 Definition: For integers n ≥ 0, k ≥ 0, C_{n,k}, pronounced "n choose k," is defined as P_{n,k}/k!. For the moment, think of the "C" as denoting "choose." | If k > n, then P_{n,k} = 0 and consequently C_{n,k} = 0. If 0 ≤ k ≤ n, then C_{n,k} = n!/[k!(n − k)!]. From this formula, it follows that C_{n,k} = C_{n,n−k} for 0 ≤ k ≤ n. 1.7 Theorem: Let n > 0 and 0 ≤ k ≤ n.
Choosing without replacement, we can construct exactly C_{n,k} different k-combinations from a set of n symbols. PROOF: By Theorem 1.3 the set admits P_{n,k} k-sequences. If we collect together all sequences that are merely different orderings of the same symbols, we partition the P_{n,k} sequences into N collections. Each collection represents a different k-combination, and, moreover, each k-combination must appear as one of the collections. Therefore, N counts the number of k-combinations. How large are these collections? Suppose that C is one of the collections. Every sequence in C contains the same k symbols. The sequences differ only in the ordering of these symbols. We apply Theorem 1.3 again to conclude that C contains exactly P_{k,k} = k! sequences. This same reasoning applies to the other collections, so each of them is of size k!. Then P_{n,k} = N · k!, or N = P_{n,k}/k! = n!/[(n − k)!k!] = C_{n,k}. | The theorem applies when k = 0 because then C_{n,0} = n!/[0!(n − 0)!] = 1 and there is indeed just one subset of size zero, namely the empty set. It is not possible to select a subset with more than n objects, so C_{n,k} = 0 is correct for k > n. A useful interpretation envisions C_{n,k} as the number of patterns obtained by darkening k slots from a field of n. For example, C_{4,2} = 6 produces the six patterns

■■□□   ■□■□   □■■□   ■□□■   □■□■   □□■■

In choosing 2 slots from a field of 4, you may, for example, choose slot 1 and then slot 4, or you may choose slot 4 followed by slot 1. In either case, you get the fourth pattern. The pattern is sensitive only to the particular 2-combination, {1,4}, not to the order in which the slots were chosen. In this sense, exactly 6 patterns are possible. The next example shows how to use this interpretation to count the possibilities from a large field of values. 1.8 Example: You receive 13 cards from a well-shuffled deck. What is the probability that your hand contains 2 or more aces?
Because you keep your cards as you receive them, the process unfolds without replacement. Moreover, you are concerned only with the particular 13-combination that you receive. The order in which you receive the cards is not important. Therefore, the total possible outcomes number C_{52,13}, the number of 13-combinations from the deck of 52. This is the number of patterns over 52 slots in which 13 are darkened. To count the number of 13-combinations that contain exactly two aces, we imagine the 52 cards arranged linearly over 52 slots, but with some further structure. Specifically, we put the four aces over the first four slots (spades, hearts, diamonds, and then clubs). The remaining 48 cards occupy slots 5 through 52, in an arbitrary order. The following diagram illustrates some sample choices that yield a 13-card hand with exactly 2 aces.

(Diagram: three rows of 52 slots each, the four ace slots at the left and the forty-eight non-ace slots at the right, with the chosen slots darkened.)

In the top pattern, aces fill slots 2 and 3, while the remaining 11 cards come from slots 6, 9, 14, 19, 22, 25, 28, 30, 35, 37, and 42. The next pattern takes aces from slots 1 and 4 with the other 11 cards from slots 13, 19, 25, 29, 32, 34, 40, 44, 48, 49, and 51. The last pattern shows aces from slots 2 and 4 and another selection for the remaining cards. In general, a valid pattern arises from any of C_{4,2} subpatterns in the first four slots associated with any of C_{48,11} subpatterns from the remaining slots. Hence, the number of 13-card hands that contain exactly 2 aces is C_{4,2}C_{48,11}. Similarly, the number of hands containing exactly 3 aces is C_{4,3}C_{48,10}. The number for 4 aces is C_{4,4}C_{48,9}. The probability of obtaining a hand with 2 or more aces is then [C_{4,2}C_{48,11} + C_{4,3}C_{48,10} + C_{4,4}C_{48,9}]/C_{52,13} = 0.2573. In terms of the counting principles, we first decided to count separately the hands with 2, 3, and 4 aces, which breaks the field into three disjoint components of varying sizes.
We then invoke the multiplicative principle to count the component with k aces as a regular arrangement of C_{4,k} possible ace combinations by C_{48,13−k} possible combinations for the other 13 − k cards. □ 1.9 Example: A poker hand contains five cards. The hand "two-pair" contains cards of three distinct values. Two cards share one value, another two cards share a second value, and the final card assumes the third value. For example, the 4 of spades and diamonds, the king of clubs and diamonds, and the 7 of clubs constitute a two-pair hand. What is the probability of receiving a two-pair hand in a five-card deal from a well-shuffled deck? The total number of five-card hands is C_{52,5} = 2,598,960. This is the number of 5-combinations from a set of 52. We now need the number of combinations that result in two-pairs. As in the preceding example, we envision the 52 cards aligned linearly over slots, this time in descending value. The four aces occupy the first four slots, then the four kings, then the four queens, and so forth. The final four slots contain the four 2s. We view this arrangement as thirteen consecutive groups of four slots each. Each such group contains cards of a single value. A two-pair hand must involve two cards from each of two distinct groups and a final card from the 44 possibilities outside the two groups. There are C_{13,2} patterns corresponding to the choice of the two groups. The upper bars in the diagram below show one such pattern where the pairs come from the Q-group and the 4-group. The odd card comes from the 9-group.

A | K | Q | J | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2
(Diagram: thirteen groups of four slots each, with bars over the two chosen groups and the chosen slots darkened.)

Within each of the two groups chosen, a pair arises from any of C_{4,2} possible combinations.
Assuming that the suits are arrayed within groups as spades, hearts, diamonds, clubs, the diagram shows the Q-group generating the queen of spades and the queen of diamonds, one of the six possible pairs. The 4-group generates the four of hearts and the four of clubs, again one of six possibilities. Finally, the 44 remaining cards generate the 9 of hearts, one of C_{44,1} possibilities. The diagram exemplifies only one possible pattern. The total number of two-pair patterns is C_{13,2}[C_{4,2}]^2 C_{44,1} = 123,552. The desired probability is then C_{13,2}[C_{4,2}]^2 C_{44,1}/C_{52,5}, which evaluates to 0.0475. □ Following the strategy of the last two examples, you should now be able to verify the probabilities of any poker hand. For a flush, which is five cards of the same suit, it is C_{4,1}C_{13,5}/C_{52,5} = 0.00198. This interpretation of a flush includes a royal flush, which is the ace through ten of the same suit. For a straight, which is five cards of consecutive value, it is C_{9,1}[C_{4,1}]^5/C_{52,5} = 0.00355. This interpretation includes pure straights as well as straight flushes (consecutive values from the same suit) and royal flushes. The strategy in all cases is to construct the favored hand through a series of choices, where the total number of possibilities divides equally across the choices. The desired count is then the product of the number of subdivisions at each step. This process amounts to the repeated application of the multiplicative counting principle. To choose a straight, for example, you first choose the lowest value, which must be one of the nine values 2 through 10. Any higher value for the lowest would not allow four further cards in increasing value. For each such choice, you then have the option of 4 cards (spades, hearts, diamonds, or clubs) for the lowest value. Then you must choose one of 4 cards for the second lowest value. Indeed, for each of the five sequential values, you must choose one of the 4 available suits.
Multiplying the sizes of the subdivisions at each step gives C_{9,1}[C_{4,1}]^5, which leads to the probability calculated above. Although the examples to this point have featured card games, the general ideas of Theorem 1.7 are relevant to a wide range of problems, some of which appear in the following examples. The examples introduce the terms random sample and random selection. For the moment, we adopt intuitive meanings for these terms. A random sample, or selection, of size k means a subset of k objects, selected without replacement by a process that affords each potential k-combination an equal chance of being chosen. For small objects, we achieve a random sample by mixing all the objects in a jar and drawing k of them without looking at the jar. For larger objects, or for objects that have distinguishable shapes or textures, we must imagine a process that removes these clues. 1.10 Example: Suppose that a bin of 100 components contains 4 defective ones. In a random sample of size 6 from this bin, what is the probability that none will be defective? What is the probability that exactly 1 will be defective? Let p_k, k = 0, 1, 2, 3, or 4, be the probability of k defects in the sample. We imagine an alignment with the 4 defective components in the first 4 slots and the properly functioning components in the subsequent 96 slots. A pattern of 6 slots that exhibits no entry from the first four must necessarily choose all 6 from the remaining 96. Consequently, we calculate p_0 as shown below. In a similar manner, a pattern with one defective entry must have 1 slot from the first four and 5 slots from the remaining 96, which explains the p_1 computation. At this point, a sanity check should be stirring in the back of your mind. Between no defects and one defect, we have used up 0.778 + 0.205 = 0.983 of the total probability. There are three remaining possibilities, and they must share the final 0.017 probability.
The remaining calculations follow as shown, and we note that the values add up properly, within round-off error.

p_0 = C_{4,0}C_{96,6}/C_{100,6} = 0.778
p_1 = C_{4,1}C_{96,5}/C_{100,6} = 0.205
p_2 = C_{4,2}C_{96,4}/C_{100,6} = 0.0167
p_3 = C_{4,3}C_{96,3}/C_{100,6} = 4.79 × 10^−4
p_4 = C_{4,4}C_{96,2}/C_{100,6} = 3.83 × 10^−6 □

1.11 Example: Suppose that a 10-person committee contains 5 Republicans and 5 Democrats. Random selection produces a 3-person subcommittee. What is the probability that the Democrats will dominate the subcommittee? Democrats will dominate if they number 2 or 3 on the subcommittee. C_{10,3} = 120 subcommittees are possible. C_{5,2}C_{5,1} = 50 of them will contain exactly 2 Democrats, and C_{5,3}C_{5,0} = 10 of them will contain exactly 3 Democrats. Therefore, Democrats will dominate 60 of the 120 possible subcommittees. The probability of a Democratically dominated committee is 60/120 = 1/2. Is this answer reasonable? A subcommittee with an odd number of members must be dominated by one group or the other. Since Democrats and Republicans are equally represented in the candidates, you should suspect from the symmetry of the situation that each group should have the same chance to dominate the subcommittee. That is, each group's chance should be 1/2. □ The common theme across the last several examples expresses the desired probability as a ratio, in which the numerator is the number of combinations under some constraint and the denominator is the total number of combinations. In the preceding example, the denominator was the number of 3-person combinations from the set of ten, while the numerator was the number of 3-person combinations that contained 2 or more Democrats. In the earlier example, the denominator was the number of 6-component combinations from the field of 100, and the numerator was the number of 6-component combinations with a specified number of defects. With these kinds of problems, the most difficult part is identifying the relevant subsets. Consider the following examples.
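The calculations in Examples 1.10 and 1.11 can be reproduced with Python's math.comb, which evaluates C_{n,k} directly; a minimal sketch:

```python
from math import comb

# Example 1.10: a bin of 100 components with 4 defective, random sample of 6.
# The number of samples with exactly k defects is C(4,k) * C(96,6-k).
counts = [comb(4, k) * comb(96, 6 - k) for k in range(5)]
total = comb(100, 6)
print(sum(counts) == total)       # True: the five cases exhaust all samples
p = [c / total for c in counts]
print([f"{pk:.3g}" for pk in p])  # 0.778, 0.205, 0.0167, 4.79e-4, 3.83e-6

# Example 1.11: 5 Democrats and 5 Republicans, random 3-person subcommittee.
dominate = comb(5, 2) * comb(5, 1) + comb(5, 3) * comb(5, 0)
print(dominate / comb(10, 3))     # 0.5
```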
The first arises in computer science and communications engineering. The second introduces a mathematical tool with widespread application.

Figure 1.1. Node loading in a switching computer network (see Example 1.12): (a) typical message path; (b) traffic density versus location.

1.12 Example: A network has switching computers arrayed in a square pattern as shown in Figure 1.1(a). The computers occupy the grid points (x, y) for integer values of 0 ≤ x ≤ 10 and 0 ≤ y ≤ 10. Computers communicate only with their immediate neighbors. All message traffic arrives at node (0,0) and departs from node (10,10) after passing through some path across the intermediate computers. Moreover, each computer forwards messages either up or to the right, so that each link makes progress toward the final destination at (10,10). Figure 1.1(a) shows a sample valid path. Beyond this restriction, all paths are equally likely. What is the probability that the computer at location (2,6) participates in a message transfer from (0,0) to (10,10)? This question is a particular instance of a more general consideration, which asks how various nodes participate in the overall traffic flow. Figure 1.1(b) gives a graphical depiction of the traffic density, and we now proceed to verify that pattern. A path must proceed from an intermediate node either upward or to the right. It follows that every path must make 10 steps up and 10 steps to the right as it winds from (0,0) to (10,10). Every path, therefore, consists of exactly 20 links: 10 horizontal links and 10 vertical links. Paths differ only in the choice of which 10 links are horizontal. So there are C_{20,10} possible paths from (0,0) to (10,10).
Now, consider a path that passes through node (2,6), such as the one in Figure 1.1(a). The initial segment, from (0,0) to (2,6), must contain 8 links: 2 horizontal and 6 vertical. Different paths will choose different locations for the horizontal links. The path illustrated in the figure proceeds horizontally on links 2 and 6 and vertically on the remaining links. Hence there must be C_{8,2} possible segments from (0,0) to (2,6). Similarly, there must be C_{12,8} possible continuations from (2,6) to (10,10). Since any initial segment can pair with any of these continuations to form a path through (2,6), the number of such complete paths is C_{8,2}C_{12,8}. The required probability is then P(2,6) = C_{8,2}C_{12,8}/C_{20,10} = 0.075. In the general case, node (i,j) participates with probability p(i,j) = C_{i+j,i} C_{20-(i+j),10-i} / C_{20,10}. For a peripheral point, such as (1,9), the probability is 0.00054, while a central point, such as (5,5), participates with probability 0.344. If paths are chosen randomly, the central computer at node (5,5) must bear 34.4% of the traffic, while the peripheral computer at (1,9) sees very little traffic. These results are important because they reveal where to concentrate the fastest computers and the largest message buffers. Table 1.4 shows the probabilities associated with selected nodes.

             i = 0       1       2       3       4       5
    y =  0  1.0000  0.5000  0.2368  0.1053  0.0433  0.0163
         1  0.5000  0.5263  0.3947  0.2477  0.1354  0.0650
         2  0.2368  0.3947  0.4180  0.3483  0.2438  0.1463
         3  0.1053  0.2477  0.3483  0.3715  0.3251  0.2401
         4  0.0433  0.1354  0.2438  0.3251  0.3501  0.3151
         5  0.0163  0.0650  0.1463  0.2401  0.3151  0.3437
         6  0.0054  0.0271  0.0750  0.1500  0.2387  0.3151
         7  0.0015  0.0095  0.0322  0.0779  0.1500  0.2401
         8  0.0004  0.0027  0.0110  0.0322  0.0750  0.1463
         9  0.0001  0.0005  0.0027  0.0095  0.0271  0.0650
        10  0.0000  0.0001  0.0004  0.0015  0.0054  0.0163

TABLE 1.4. Traffic density in a switching network (see Example 1.12)
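The entries of Table 1.4 follow directly from the participation formula; a minimal Python sketch (the helper name p_node is ours) that reproduces a few of them:

```python
from math import comb

# Probability that node (i, j) lies on a randomly chosen monotone path
# from (0, 0) to (10, 10) in the switching network of Example 1.12.
def p_node(i, j):
    return comb(i + j, i) * comb(20 - (i + j), 10 - i) / comb(20, 10)

print(round(p_node(2, 6), 3))  # 0.075
print(round(p_node(5, 5), 3))  # 0.344
print(round(p_node(1, 9), 5))  # 0.00054
```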
The source node is at the upper left corner; the destination node (not shown) is at the lower right. The entries greater than 0.3 show that the traffic load is highest near the source and destination and that a bottleneck persists through the center of the network. Horizontal positions 6-10, which do not appear in the table, are reflections of columns 4-0. That is, column 6 is a copy of column 4, installed upside down, and so forth. The diagonal bottleneck therefore continues to the destination node.

1.13 Example: Verify the binomial theorem:

    (x + y)^n = Σ_{k=0}^{n} C_{n,k} x^k y^{n-k}

To multiply two polynomials, we first multiply every term of the first by every term of the second. For example, (a + 2b - 2c)(b + 2c) = ab + 2ac + 2b^2 + 4bc - 2bc - 4c^2. Then we combine terms to obtain a more compact expression: ab + 2ac + 2b^2 + 2bc - 4c^2. We can view the initial product, before the final simplification, as the sum of all possible two-factor terms in which one factor comes from the first polynomial and the other from the second polynomial. If the two polynomials have n_1 and n_2 terms, respectively, then the product, before simplification, has n_1 n_2 terms. If we now multiply the product by a third polynomial with n_3 terms, we obtain a sum of n_1 n_2 n_3 three-factor terms. In each such term, each of the three polynomials contributes one factor. Applying this observation to (x + y)^n, we get 2^n terms. Each term, moreover, contains n factors, one from each copy of (x + y) in the multiplication. A given term might contain n x-factors and no y-factors and so appear as x^n. It might contain n - 1 x-factors and one y and take the form x^{n-1} y. The sum of the x and y exponents must always be n, so the general form of (x + y)^n, after combining terms, must be Σ_{k=0}^{n} C_k x^k y^{n-k}, for some set of constants C_k. To determine the C_k, we imagine the n factors of (x + y)^n aligned over n slots.
To produce a term x^k y^{n-k}, we must choose an x-factor from k of the n slots and a y-factor from each of the remaining slots. For the case n = 5, k = 2, the number of patterns is, of course, C_{5,2} = 10. That is, (x + y)^5 contains the term 10 x^2 y^3. We can calculate the coefficient associated with other values of k in a similar manner. In general, it is C_{n,k}, and therefore

    (x + y)^n = Σ_{k=0}^{n} C_{n,k} x^k y^{n-k}

The examples to this point illustrate the usefulness of counting the number of combinations available from a set of n objects. There is, however, another viewpoint that extends this concept to a different kind of problem. Consider a set of n objects, some of which are indistinguishable at a higher level of abstraction. The traditional example consists of n balls, of which r are red and b = n - r are black. For the moment, let us assume that n = 5 and r = 3, which means, of course, that b = 2. Because a set can have no duplicates, the balls must be distinguishable through some further characteristic, so we label the red balls as r1, r2, r3 and the blacks as b1, b2. Now if we ask how many 5-sequences we can form, without replacement, from the five balls, Theorem 1.3 immediately answers 5! = 120. However, if we ignore the subscripts and consider only the colors, there will be fewer sequences because some of them will contain the same colors in the same places and differ only in the subscripts. So, how many sequences are possible when only color is considered? Suppose that we group the 120 permutations into N collections, each containing the same color pattern. That is, each sequence in a given collection contains red balls in exactly the same positions, differing only in the subscripts on the red balls. By default, all sequences in a given collection must then contain black balls in exactly the same positions, but again differing in the subscripts.
For example, the sequences in Figure 1.2(a) all lie in one collection. The reds occupy positions 1, 2, and 4, and the three red balls can appear in these slots in any arrangement of {r1, r2, r3}. Similarly, the blacks appear in slots 3 and 5 in any arrangement of {b1, b2}. So any of the 3! = 6 3-sequences of the reds (within the red slots, of course) can associate with any of the 2! = 2 2-sequences of the blacks to form a sequence in the collection. The collection then contains 3! · 2! = 12 sequences. The same analysis shows that each of the N collections contains 12 sequences. Therefore, N = 120/12 = 10, and N counts the number of sequences that are distinguishable only by color. In the general case, we have n balls, of which r are red and b = n - r are black. If we partition the n! sequences into collections such that all members of a collection exhibit the same color pattern, we find r!b! = r!(n - r)! sequences in each collection. This follows because the r reds can be redistributed in the same slots in any of the r! permutations, and the blacks can be redistributed in their slots in any of b! = (n - r)! permutations. The number of distinguishable color patterns is the number of collections: n!/[r!(n - r)!] = C_{n,r}. In the earlier examples, the result C_{n,r} occurred when we arrayed the objects linearly above a set of slots and looked at the slot patterns formed by choosing r objects from the field of n. In the current context, however, we are choosing all the objects, and our interest shifts from which objects are chosen to where the r "special" objects appear. We can still use the slot-pattern technique, with one small variation. We now align the balls' potential positions over the slots. It should be clear that a color sequence is determined as soon as the positions of the three red balls are known; the remaining positions must, by default, contain black balls.
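The count N = 10 is small enough to confirm by brute force; a short Python sketch that enumerates all 5! orderings and collapses each to its color pattern:

```python
from itertools import permutations

# Three distinguishable red balls and two distinguishable black balls.
balls = ["r1", "r2", "r3", "b1", "b2"]

sequences = list(permutations(balls))
patterns = {tuple(b[0] for b in seq) for seq in sequences}  # keep colors only

print(len(sequences))  # 120 sequences in all
print(len(patterns))   # 10 distinct color patterns
```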
For the case n = 5, r = 3, we envision the arrangements of Figure 1.2(b).

Figure 1.2. Color patterns exhibited by 3 red balls in a field of 5: (a) sequences with the same color pattern; (b) possible color patterns.

The slots are aligned with the potential positions of the red balls, not with the balls themselves. The 10 patterns correspond to the possible positions of the red balls, which in turn correspond to the possible color patterns. Since there are C_{5,3} = 10 patterns, there are 10 distinguishable color sequences. This analysis verifies the more direct computation above and provides a further interpretation of the C_{n,k} formula. We can capture the new interpretation in the following theorem. The theorem refers to "classes," which may be colors, as in the example at hand, or any other attribute in the context of a particular problem. In the next example, we interpret class as gender.

1.14 Theorem: Suppose that a set of n objects contains two classes with k objects in the first class and the remaining n - k objects in the second class. Consider n-sequences, chosen without replacement, in which objects of the same class are indistinguishable. The number of such sequences is C_{n,k}.
PROOF: See discussion above.

The interpretation accorded C_{n,k} in this theorem is really the same as that which visualizes C_{n,k} as the number of slot patterns in a field of n under the constraint that k slots must be shaded. Objects of the first class occupy the shaded slots, and by default, objects of the second class occupy the remaining slots. Given that objects within a class are indistinguishable, the number of distinguishable n-sequences is precisely the number of slot patterns.

1.15 Example: Consider a dinner party of 10 people: 5 men and 5 women.
If they randomly choose places at a linear table, what is the probability that there will be a strict alternation of men and women? The positions of the 5 men determine a gender pattern. Of the C_{10,5} = 252 possible patterns, only two exhibit a strict alternation of men and women: one starting with a man and the other starting with a woman. The probability is then 2/252 = 0.00794.

At this point, we have two interpretations of C_{n,k}, which we will call the selection and allocation solutions. It is convenient to add a third interpretation to form the following list.

1. C_{n,k} is the number of subsets of size k drawn from a pool of n objects. Stated differently, it is the number of ways to select k distinguishable objects from a pool of n.

2. C_{n,k} is the number of slot patterns in which k of the n slots are darkened. Equivalently, it is the number of ways to allocate k indistinguishable objects to n containers in such a manner that no container receives more than one object. The darkened slots indicate the chosen containers.

3. C_{n,k} is the number of binary n-vectors whose components sum to k. A binary vector is a vector whose components are zeros and 1s. A binary n-vector whose components sum to k must contain exactly k 1s and n - k zeros. These vectors correspond to slot patterns in which the darkened slots represent 1s. Hence the number of binary n-vectors whose components sum to k is the number of slot patterns in a field of n with the requirement that k are dark. This count is C_{n,k}.

We conclude this section with a generalization of Theorem 1.14. We start by extending the example of the colored balls. Suppose now that the balls come in three colors: red, black, and white. Specifically, suppose that we have a set of n balls, of which k1 are red, k2 are black, and k3 are white. Of course, k1 + k2 + k3 = n. How many n-ball color patterns are possible? Reasoning as before, we have n!
n-sequences when the balls are individually recognizable. We collect sequences that contain the same color balls in the same positions. Suppose that N collections result. Then N is the required number of color patterns. To determine the number of sequences in each collection, consider an arbitrary member of a collection C. This sequence contains k1 red balls in specific positions, and the k1! permutations of the red balls among the fixed set of positions result in new sequences with the same color pattern. So, for each fixed arrangement of the black and white balls, there are k1! sequences that exhibit red balls in identical locations. Of course, each of the k1! sequences exhibits the same color pattern. Continuing this line of reasoning, we see that each of these k1! sequences gives rise to k2! further sequences by permuting the black balls. Consequently, we have k1!k2! sequences that display the same color pattern. Moreover, each of the k1!k2! obtained to this point gives rise to k3! additional sequences by permuting the white balls. Because the chosen collection was arbitrary, we conclude that each collection contains k1! · k2! · k3! of the n! sequences. So N = n!/[k1! · k2! · k3!].

The general case concerns p colors, or classes, which appear k1, k2, ..., kp times in the original set of n objects. Of course, k1 + k2 + ... + kp = n. Theorem 1.14 becomes a special case of the following theorem.

1.16 Theorem: Suppose that a set of n objects contains p classes with ki, 1 ≤ i ≤ p, objects in class i, and n = Σ_{i=1}^{p} ki. Consider n-sequences, chosen without replacement, in which objects of the same class are indistinguishable. The number of such sequences is n!/[k1! · k2! · · · kp!].
PROOF: The discussion above applies equally well to p classes.

1.17 Example: Verify the multinomial expansion:

    (x1 + x2 + ... + xp)^n = Σ_{k1+k2+...+kp=n} (n!/[k1! · k2! · · · kp!]) x1^{k1} x2^{k2} · · · xp^{kp}

The notation under the summation symbol implies that the sum extends over all p-sequences of nonnegative integers, chosen with replacement, that sum to n. We will learn later, in Theorem 1.22, that there are precisely C_{p+n-1,n} summands in this expression, but that detail is not important at this point. Anticipating an induction on p, we denote the left side by L_p^n. The case L_1^n is a straightforward verification, while the case L_2^n reduces to the binomial theorem. Both computations are illustrated below.

    L_1^n = x1^n = Σ_{k1=n} (n!/k1!) x1^{k1}

    L_2^n = (x1 + x2)^n = Σ_{k1=0}^{n} (n!/[k1!(n-k1)!]) x1^{k1} x2^{n-k1} = Σ_{k1+k2=n} (n!/[k1! · k2!]) x1^{k1} x2^{k2}

The final reduction for L_2^n occurs because the pairs (k1, k2) that sum to n must have the form (k1, n-k1). Moreover, as k1 runs from 0 to n, the expression (k1, n-k1) generates all such pairs. If p > 2, we proceed by induction, assuming that

    L_q^n = Σ_{k1+k2+...+kq=n} (n!/[k1! · k2! · · · kq!]) x1^{k1} x2^{k2} · · · xq^{kq}

for q = 1, 2, ..., p-1. We then consider the multinomial of p terms to be a binomial by grouping the first p-1 terms as a unit. At that point we can apply the binomial theorem.

    L_p^n = [(x1 + x2 + ... + x_{p-1}) + xp]^n = Σ_{kp=0}^{n} C_{n,kp} L_{p-1}^{n-kp} xp^{kp}
          = Σ_{kp=0}^{n} (n!/[kp!(n-kp)!]) xp^{kp} Σ_{k1+...+k_{p-1}=n-kp} ((n-kp)!/[k1! · k2! · · · k_{p-1}!]) x1^{k1} x2^{k2} · · · x_{p-1}^{k_{p-1}}

The double sum effectively enumerates all the p-sequences that sum to n, so we can continue the computation as follows.

    L_p^n = Σ_{k1+k2+...+kp=n} (n!/[k1! · k2! · · · kp!]) x1^{k1} x2^{k2} · · · xp^{kp}

This calculation verifies the multinomial expansion, but an alternative approach is worth consideration, one that draws on the interpretation of n!/[k1! · k2! · · · kp!] as the number of color patterns, each of length n, that we can assemble from a group of n = k1 + k2 + ... + kp balls of p different colors. From the discussion of Example 1.13, we know that (x1 + x2 + ... + xp)^n will generate p^n terms, each containing n factors.
Each copy of (x1 + x2 + ... + xp) contributes one of the n factors in each term. Since each copy must draw its contribution from the possibilities x1, x2, ..., xp, we see that each term in the expansion must have the form x1^{k1} x2^{k2} · · · xp^{kp} for some set of nonnegative integers k1, k2, ..., kp that sum to n. The question is: How many terms have the same set of exponents and are thus eligible for combination? Stated differently, we know that the expansion contains terms of the form C x1^{k1} x2^{k2} · · · xp^{kp}, but we do not yet know the value of C associated with a particular exponent set k1, k2, ..., kp. Let k1, k2, ..., kp be a fixed set of nonnegative integers that sum to n. Now identify each of x1, x2, ..., xp with a distinct color. Each of the n copies of (x1 + x2 + ... + xp) now hovers over one of n slots. Each drops one of its colors into the slot below it, under the overall constraint that k1 copies must drop color x1, k2 copies must drop color x2, and so forth. The number of possible color patterns is n!/[k1! · k2! · · · kp!]. But this is also the number of terms in the expansion that involve x1^{k1} x2^{k2} · · · xp^{kp}. That is, every choice that produces an n-factor term with exactly k1 x1-factors, k2 x2-factors, and so forth through kp xp-factors must correspond to k1 slots contributing the value x1, k2 slots contributing the value x2, and so forth. Hence C = n!/[k1! · k2! · · · kp!] for a particular set k1, k2, ..., kp, and the expansion is verified.

Example 1.13 shows that the terms C_{n,k} appear as coefficients in the expansion of (x + y)^n. For this reason, the C_{n,k} = n!/[k!(n - k)!], 0 ≤ k ≤ n, are called binomial coefficients. Similarly, Example 1.17 shows that the terms n!/[k1! · k2! · · · kp!] appear as coefficients in the expansion of (x1 + x2 + ... + xp)^n. The expressions n!/[k1! · k2! · · · kp!], for ki ≥ 0 and k1 + k2 + ... + kp = n, are therefore known as multinomial coefficients.
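The coefficient-counting argument is easy to test mechanically for a small case. The sketch below expands (x1 + x2 + x3)^4 by brute force: it walks the 3^4 raw factor choices, tallies how many produce each exponent set (k1, k2, k3), and checks each tally against n!/[k1! · k2! · k3!] (the variable names are ours, chosen for illustration):

```python
from itertools import product
from math import factorial
from collections import Counter

# Each raw term of (x1 + x2 + x3)^4 corresponds to choosing one of the
# three variables from each of the four factors.
n, p = 4, 3
tally = Counter()
for choice in product(range(p), repeat=n):
    ks = tuple(choice.count(i) for i in range(p))  # exponent set (k1, k2, k3)
    tally[ks] += 1

# Each tally must equal the multinomial coefficient n!/[k1! k2! k3!].
for ks, count in tally.items():
    coef = factorial(n) // (factorial(ks[0]) * factorial(ks[1]) * factorial(ks[2]))
    assert count == coef

print(sum(tally.values()))  # 81 raw terms, i.e., 3^4
print(len(tally))           # 15 distinct exponent sets
```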
1.18 Example: Suppose that we have a collection of objects, S = {x1, x2, ..., xn}, each with two attributes, f(xi) and g(xi). For example, f(xi) might be the size of object xi, while g(xi) might be its weight. In any case, we assume that both f(xi) and g(xi) are nonnegative. The goal is to choose a subset, {x_{i1}, x_{i2}, ..., x_{ik}}, such that the total g-values are as large as possible, subject to the constraint that the total f-values do not exceed a prescribed limit, F. Suppose further that our chosen subset must contain at least 2 objects. The code of Figure 1.3 conducts a search for the best subset. Since this is the first algorithm description in the book, we comment on the C structure's readability, even for a reader with no previous experience in C. Note carefully the convention for marking the scope of while-loop, for-loop, and conditional execution (if) constructions. The body is set off as an indented mass under the control statement. The while-loop controls all statements at the indentation level 1 greater than the while-statement itself [i.e., through the statement X = subset(S, X)]. The while-loop is simply a device for repeatedly executing a step sequence until a specified condition prevails. This example also illustrates the for-loop, which executes a step sequence a prescribed number of times. Executions use consecutive values for an index, which is the variable i in this case. Finally, the code contains if-statements, each with an indented body. The body contains a step sequence that is executed if and only if the attendant condition is true. This algorithm assumes the availability of a subroutine, subset(S, Y), which cyclically returns the subsets of its first argument.
The routine constructs, in a manner that is not of interest here, the next subset beyond that given by the second argument.

    function binpack(S, F)              // inputs are the set S and the f-limit F
        X = phi; Y = phi; y = 0.0       // X cycles through subsets of S, starting with the empty set
                                        // Y stores the best subset to date; y stores its g-value
        while (X ≠ S)
            if (size(X) ≥ 2)
                a = 0.0; b = 0.0
                for (i = 1 to size(X))
                    a = a + f(X[i]); b = b + g(X[i])
                if ((a ≤ F) and (b > y))
                    y = b; Y = X
            X = subset(S, X)
        return Y

Figure 1.3. Algorithm for optimal bin packing (see Example 1.18)

Successive calls eventually recover all subsets, and the last subset to be returned is S itself. For starting the process, we assume that the empty set is available under the name phi. We also assume a routine that returns the number of elements in a set: size(X). X[i] extracts an element from the subset X. How many times does the code call the function f if the input set S is of size 10? How many times if the input set is of size n? When X is a subset of size k, the for-loop makes k calls to f, and this loop executes for all subsets of size 2 or greater. Let t(n) denote the number of calls when the input set is of size n. Since C_{10,k} is the number of subsets of size k, we have the following summation.

    t(10) = Σ_{k=2}^{10} C_{10,k} · k = 45(2) + 120(3) + ... + 1(10) = 5110

For general n, we obtain the following summation.

    t(n) = Σ_{k=2}^{n} C_{n,k} · k = Σ_{k=0}^{n} C_{n,k} · k - C_{n,0} · 0 - C_{n,1} · 1 = n · 2^{n-1} - n

The last simplification involves summing k C_{n,k}, which you may or may not know how to do at this time. Section 1.2 will review these matters, but for the moment, even if you are not able to perform the reduction here, you can easily verify that the general formula generates the correct result for n = 10.

Exercises

1.1 The binary coded decimal (BCD) code assigns a four-bit binary pattern to each digit as follows.
    digit:    0    1    2    3    4    5    6    7    8    9
    code:  0000 0001 0010 0011 0100 0101 0110 0111 1000 1001

This is just one of N possible codes that express the ten digits as distinct four-bit binary patterns. What is N? The BCD code leaves six patterns unused: 1010, 1011, 1100, 1101, 1110, and 1111. Other codes involve different sets of unused patterns, although each such set will always contain six members. Assume that the first step in constructing a code specifies the set of unused patterns. How many choices are available at this step?

1.2 What is the probability of receiving a full-house poker hand? A full-house hand contains 2 cards of one value and 3 cards of a different value.

1.3 What is the probability of a straight-flush poker hand? A straight-flush hand contains 5 cards of consecutive value, all from the same suit.

1.4 In the game of bridge, point cards are aces, kings, queens, and jacks. What is the probability of receiving a 13-card bridge hand containing 3 or more point cards?

1.5 A 5-person committee is formed from a group of 6 men and 7 women. What is the probability that the committee has 3 or more men?

1.6 Seventeen 10-kilohm resistors are inadvertently mixed with a batch of eighty 100-kilohm resistors. Five resistors are drawn from the combined group and wired in series. What is the probability that the series resistance will be less than 500 kilohms?

1.7 If you randomly choose 3 days from the calendar, what is the probability that they all fall in June?

1.8 In a group of 1000 persons, 513 plan to vote Republican and 487 plan to vote Democrat. What is the probability that a random sample of 25 persons will erroneously indicate a Democratic victory?

*1.9 Suppose that 8 balls are tossed toward a linear arrangement of 10 open boxes. What is the probability that the leftmost box receives no balls?

*1.10 You arrive at the gym to find a single free locker, number 4 in a row of 10. You take that locker.
When you finish your workout, you find 6 lockers, including yours, still occupied. What is the probability that the neighboring lockers on either side of yours are now free?

1.11 Calculate Σ_{k1+k2+k3+k4=7} 7!/[k1! · k2! · k3! · k4!]. The notation means the sum over all sequences of four nonnegative integers, chosen with replacement, that add to 7.

1.12 A football fan is sitting high in the stadium and can barely make out the cheerleaders far below on the field. These cheerleaders are arranged in a line and are waving colored fabrics. If 3 cheerleaders have red fabrics, 2 have green fabrics, and 2 have blue fabrics, how many color patterns are discernible by the faraway fan?

1.13 Bob and Alice are dinner guests at a party of eight, 4 male and 4 female. The hostess arranges the guests linearly along a table with the men on one side and the women on the other. What is the probability that Bob and Alice will be facing each other or be within one position of facing each other?

1.14 Six devices d1, d2, ..., d6 connect with a central processor, which polls them in order to identify one that is ready to communicate. Suppose that three are ready to communicate when the polling cycle starts. What is the probability that the central processor issues exactly two polling requests before communicating?

1.15 If you rearrange the letters of "Mississippi," how many distinguishable patterns are possible?

*1.16 In the context of Example 1.12, what fraction of paths from (0,0) to (10,10) never rise above the diagonal? That is, what fraction of paths pass only through nodes (x, y) with x ≥ y?

1.1.2 Sampling with replacement

Let us now consider the situation where repetitions are permitted in choosing from a set of n objects. That is, each chosen object, after being recorded in the result, rejoins the set before another object is drawn. This allows an object to appear multiple times in the k-sequence or k-combination under construction.
It is an easy task to determine the number of k-sequences from a set of size n, but it is somewhat less obvious how to calculate the number of k-combinations. We start with the easy theorem.

1.19 Theorem: Let n > 0 and k > 0. Under sampling with replacement, a set of n objects admits n^k k-sequences.
PROOF: Let S be the set in question, with |S| = n. Let s(n, k) denote the number of k-sequences that arise from S. There is just one empty sequence, and 1 = n^0. Also, there are n = n^1 1-sequences, each consisting of a single object from S. Therefore, s(n, 0) = 1 and s(n, 1) = n. We now proceed by induction on k. Assume that s(n, j) = n^j, for j = 1, 2, ..., k - 1. Now consider all k-sequences arising from S. Divide them into groups according to the first symbol. There are n such groups. Within each group, the sequence following the initial symbol is a (k - 1)-sequence from S because S reacquires the initial symbol before any subsequent choices. By the induction hypothesis, the number of such continuation sequences is s(n, k - 1) = n^{k-1}. Hence there are n groups of n^{k-1} sequences, giving n · n^{k-1} = n^k sequences in total.

1.20 Example: A sequence of heads and tails results from tossing a fair coin 10 times. What is the probability that heads and tails alternate in the sequence? The sequence is one of the 2^10 = 1024 10-sequences arising from the set {H, T} when sampling with replacement. Of these 1024 possibilities, only two provide a strict alternation of heads and tails: the one starting with a head and alternating thereafter, and the one starting with a tail and alternating thereafter.

Figure 1.4. Probability of an equal head-tail split in a sequence of coin tosses (see Example 1.21). Logarithms are base 2.
So the probability is 2/1024 = 0.001953. The coin may be fair, but strict alternation of the two possibilities is not likely.

1.21 Example: A sequence of heads and tails results from tossing a fair coin 10 times. What is the probability that the sequence has exactly five heads in it? The sequence is again one of the 2^10 = 1024 possible 10-sequences that arise from the set {H, T} when sampling with replacement. To determine the number of sequences with exactly 5 heads, we note that such a sequence is fixed once we determine the positions of the 5 heads. Hence, we must choose 5 positions, without replacement, from the set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Theorem 1.7 states that such choices number C_{10,5} = 252. The desired probability is then 252/1024 = 0.246. Again, the coin may be fair, but the chance that such fairness will be manifest as an equal split over 10 tosses is rather smaller than 0.5. Indeed, if we were looking for an even split in 100 tosses, the probability would be C_{100,50}/2^100 = 0.0796. This suggests, correctly it turns out, that the chances of an even split become less as the number of tosses increases. Figure 1.4 provides some credibility for this suggestion. Since the probability of an even head-tail split is zero when the number of coin tosses is odd, all the graphs in Figure 1.4 provide probability points only for even numbers of coin tosses. The upper-left graph shows how the probability of an even split drops rapidly as the number of tosses grows larger, but it appears to level out. Switching to a logarithmic scale on the probability axis, we can discern a bit more detail about the transition, as shown in the graph in the upper right. However, the flattening of the decline rate persists in this representation. Using logarithmic scales on both axes provides the clearest indication of the overall trend, as shown in the lower-right graph.

1.22 Theorem: Let n > 0 and k > 0.
Under sampling with replacement, C_{n+k-1,k} k-combinations arise from a set with n objects.
PROOF: Let the elements of the set be x1, ..., xn. In a given k-combination, each xi is present some integer number of times, say ki ≥ 0. Furthermore, these repetition numbers, provided that their sum is k, completely specify the particular k-combination. So the set {(k1, k2, ..., kn) : ki ≥ 0, Σ ki = k} is in one-to-one correspondence with the k-combinations. Therefore, an equivalent question is: How many such ordered sets of n nonnegative integers add to k? These are n-sequences chosen from {0, 1, 2, 3, ..., k} with replacement. Although we know from Theorem 1.19 that the total number of such n-sequences is (k + 1)^n, not all of these n-sequences sum to k. Instead, consider the following indirect approach to counting just those n-sequences that add to k. Imagine k asterisks and n - 1 vertical bars, arranged linearly. Each such display contains n + k - 1 symbols. The possibilities for n = 4, k = 3 appear in Table 1.5. (The table depicts 3-combinations from the set {a, b, c, d} rather than from {x1, x2, x3, x4} in order to facilitate comparison with Table 1.1. That earlier table breaks out all the 3-combinations from a set of size 4 but does not offer a method for counting them systematically.)

    Pattern   Repetitions  3-comb  |  Pattern   Repetitions  3-comb
    *|*|*|    (1,1,1,0)    abc     |  |**|*|    (0,2,1,0)    bbc
    *|*||*    (1,1,0,1)    abd     |  |*|**|    (0,1,2,0)    bcc
    *||*|*    (1,0,1,1)    acd     |  |**||*    (0,2,0,1)    bbd
    |*|*|*    (0,1,1,1)    bcd     |  |*||**    (0,1,0,2)    bdd
    **|*||    (2,1,0,0)    aab     |  ||**|*    (0,0,2,1)    ccd
    *|**||    (1,2,0,0)    abb     |  ||*|**    (0,0,1,2)    cdd
    **||*|    (2,0,1,0)    aac     |  ***|||    (3,0,0,0)    aaa
    *||**|    (1,0,2,0)    acc     |  |***||    (0,3,0,0)    bbb
    **|||*    (2,0,0,1)    aad     |  ||***|    (0,0,3,0)    ccc
    *|||**    (1,0,0,2)    add     |  |||***    (0,0,0,3)    ddd

TABLE 1.5. Star-bar patterns and 3-combinations from a field of 4 symbols (see Theorem 1.22)
The number of asterisks to the left of the first bar specifies the repetition count for the first symbol from the set. The number of asterisks between the first and second bar specifies the repetition count for the second symbol, and so forth. Because exactly k asterisks appear in each pattern, the sum of the repetition counts as separated by the bars is always k. Each distinct sequence of repetition counts corresponds to a k-combination, and vice versa. The total number of k-combinations is then the number of ways of choosing n − 1 positions for the bars out of the total of n + k − 1 possible positions. Theorem 1.7 gives that value as C_{n+k-1,n-1} = C_{n+k-1,k}, which proves the theorem. ∎
1.23 Theorem: The number of n-sequences (x_1, x_2, x_3, ..., x_n) with nonnegative integer components such that Σ_{i=1}^{n} x_i = k is C_{n+k-1,k}.
PROOF: The proof of Theorem 1.22 establishes a one-to-one correspondence between these n-sequences and the k-combinations from a set of size n. The theorem also asserts that the latter count is C_{n+k-1,k}. ∎
The preceding section noted three interpretations of C_{n,k}, and there are three parallel interpretations for C_{n+k-1,k}. They again involve a selection process, an allocation process, and a count of constrained vectors.
1. C_{n+k-1,k} counts the number of ways to select k distinguishable objects from a pool of n, given that each object may be chosen zero, one, or more times.
2. C_{n+k-1,k} counts the number of ways to allocate k indistinguishable objects to n containers, given that a container may acquire zero, one, or more objects.
3. C_{n+k-1,k} counts the number of n-vectors whose components are nonnegative integers summing to k.
Theorem 1.23 notes the equivalence of the selection solution and the number of n-vectors with nonnegative integer components summing to k. The equivalence of the allocation solution follows because each such n-vector specifies an allocation pattern.
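The correspondence claimed by Theorems 1.22 and 1.23 can be confirmed by brute force for the Table 1.5 case; a small sketch (not part of the text) using the standard library:

```python
from math import comb
from itertools import combinations_with_replacement, product

n, k = 4, 3  # the Table 1.5 case: 3-combinations from {a, b, c, d}

# k-combinations with replacement from an n-set (Theorem 1.22)
combos = list(combinations_with_replacement("abcd", k))
print(len(combos), comb(n + k - 1, k))  # 20 20

# n-tuples of nonnegative integers summing to k (Theorem 1.23)
tuples = [t for t in product(range(k + 1), repeat=n) if sum(t) == k]
print(len(tuples))  # 20
```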
The vector (k_1, k_2, ..., k_n) specifies k_1 objects in the first container, k_2 in the second, and so forth.
In discussing the pigeonhole principle, the chapter's introduction noted the impossibility of randomly allocating four balls among three containers such that all containers receive at most 1 ball. It then asked the probability of an allocation in which the leftmost container receives at most 1 ball. We can now answer that question. Total allocations number C_{3+4-1,4} = 15, while those favoring the leftmost container give it zero or 1 ball. If the leftmost container receives zero balls, the remaining two containers must accommodate four balls, which they can do in C_{2+4-1,4} = 5 ways. If the leftmost container receives 1 ball, the remaining two must receive 3 balls, which can happen in C_{2+3-1,3} = 4 ways. The probability is then (5 + 4)/15 = 0.6 that the leftmost container receives at most 1 ball. In this small example, we can tabulate the possible allocations. In the scheme below, (k_1, k_2, k_3) means that the left container receives k_1 balls, the center container receives k_2, and the right container k_3. Of course, the k_i must sum to four. The asterisks mark the allocations in which the left container receives at most 1 ball.

* (0,0,4)   * (0,3,1)   * (0,2,2)
* (0,4,0)   * (1,3,0)     (2,2,0)
  (4,0,0)     (3,0,1)   * (1,1,2)
* (0,1,3)     (3,1,0)   * (1,2,1)
* (1,0,3)     (2,0,2)     (2,1,1)

When choosing k-sequences or k-combinations with replacement, it is possible for k to be much larger than n, the size of the pool. The following example illustrates this possibility and also investigates the difference between k-sequences and k-combinations in probability calculations.
1.24 Example: Ten tosses of a fair coin produce an ordered pair (h, t), where h is the number of heads over the 10 tosses and t is the number of tails. Of course, h + t = 10. How many ordered pairs can result from the experiment? What is the probability of getting (3,7)?
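The small allocation tabulation above lends itself to a direct enumeration check; a sketch (not part of the text):

```python
from itertools import product

# All allocations (k1, k2, k3) of 4 indistinguishable balls to 3 containers
allocations = [t for t in product(range(5), repeat=3) if sum(t) == 4]
print(len(allocations))  # 15 = C(3+4-1, 4)

# Those in which the left container receives at most 1 ball
favored = [t for t in allocations if t[0] <= 1]
print(len(favored), len(favored) / len(allocations))  # 9  0.6
```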
The elements h and t can assume any of the values from {0, 1, ..., 10}, provided that their sum is 10. According to Theorem 1.23, the number of such 2-sequences that sum to 10 is C_{2+10-1,10} = C_{11,10} = 11. Such a high-powered formula is not necessary in this simple case. We see that all possible (h, t) pairs must be of the form (h, 10 − h), with 0 ≤ h ≤ 10. Hence, there are 11 such pairs. Also, a given (h, t) pair corresponds to a 10-combination from {H,T}, so Theorem 1.22 also provides the correct answer: C_{10+2-1,10} = 11.
Now, since there are 11 possible 10-combinations and just 1 with 3 heads, it might appear that the probability of getting (3,7) is 1/11. This is not correct. When calculating a probability as a fraction of some number of equally likely possibilities, we must check carefully that the denominator equitably represents the field of possibilities. Of the 11 possible 10-combinations, are all equally likely in a physical experiment with ten coins? Does (1,9), in which just 1 head occurs, appear as often as (5,5), in which 5 heads occur? No, these outcomes are not equally likely. What actually occurs are 10-sequences, such as HTTHHHTTHT. By Theorem 1.19, there are 2^{10} = 1024 such sequences, and they are equally likely. Also, since a sequence with 3 heads is determined once the positions of the heads are known, we must have as many such sequences with 3 heads as there are 3-combinations, without replacement, from the positions {1, 2, ..., 10}. Using Theorem 1.7, we then compute

Pr(3,7) = C_{10,3}/2^{10} = 120/1024 = 0.117
Pr(5,5) = C_{10,5}/2^{10} = 252/1024 = 0.246
Pr(1,9) = C_{10,1}/2^{10} = 10/1024 = 0.0098. □

When calculating a probability as a fraction of cases favorable to some constraint out of all possible cases, we must be more careful when the process is sampling with replacement.
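The three probabilities just computed follow from counting 10-sequences; a quick check (not part of the text):

```python
from math import comb

# Probabilities of head/tail counts over 10 tosses come from the 1024
# equally likely 10-sequences, not from the 11 (h, t) pairs.
probs = {(h, 10 - h): comb(10, h) / 2**10 for h in (3, 5, 1)}
for pair in probs:
    print(pair, round(probs[pair], 4))
# (3, 7) 0.1172, (5, 5) 0.2461, (1, 9) 0.0098
```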
In the simpler situation, sampling without replacement, we can normally use either sequences or combinations, as long as we are consistent in the numerator and denominator. As suggested by Table 1.1, each k-combination, without replacement, gives rise to k! k-sequences. Therefore, a solution using k-sequences in numerator and denominator is the same solution using k-combinations, after multiplying numerator and denominator by k!. The value of the fraction does not change. If you want, for example, the probability of receiving 4 hearts when dealt 4 cards from a well-shuffled deck, you can form the appropriate 4-sequence ratio, P_{13,4}/P_{52,4} = 0.00264, or the corresponding 4-combination ratio, C_{13,4}/C_{52,4} = 0.00264. The numerator and denominator of the first fraction are just 4! = 24 times those of the second. Because combinations are simpler and extend more easily to complicated situations, they are more frequently used in situations where the underlying process is sampling without replacement.
In sampling with replacement, each k-combination does not give rise to k! k-sequences; sometimes the number of sequences is much less, as illustrated in Table 1.1. Therefore, the ratio of k-combinations is not the same as the ratio of the corresponding k-sequences. In sampling with replacement, it is normally the ratio of the appropriate k-sequences that corresponds to the reality of the experiment. The following example provides another situation where a simple sanity check shows that a proposed solution involving k-combinations cannot be correct.
1.25 Example: Suppose that 10 people are gathered in a room. What is the probability that two or more persons have the same birthday? By having the same birthday, we mean born on the same day of the year, not necessarily being the same age.
We imagine each person announcing a birthday from the 365 possible choices.
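The equality of the sequence ratio and the combination ratio in the 4-hearts computation can be verified directly; a sketch (not part of the text) using math.perm and math.comb:

```python
from math import comb, perm, factorial

# Probability of 4 hearts in 4 dealt cards: the sequence ratio equals the
# combination ratio, because each 4-combination yields 4! 4-sequences.
seq_ratio = perm(13, 4) / perm(52, 4)
comb_ratio = comb(13, 4) / comb(52, 4)
print(round(seq_ratio, 5), round(comb_ratio, 5))  # 0.00264 0.00264

# Numerator and denominator of the first fraction are 4! times the second's.
assert perm(13, 4) == factorial(4) * comb(13, 4)
assert perm(52, 4) == factorial(4) * comb(52, 4)
```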
The birthdays constitute a 10-combination from the set of days {1, 2, 3, ..., 365}. The 10-combination arises from sampling with replacement because an announcement can duplicate a birthday already used. The number of such 10-combinations is D = C_{365+10-1,10} = C_{374,10}. The combinations corresponding to no two persons having the same birthday are just those that happen to have no duplicate elements. That is, they correspond to the 10-combinations that you could draw without replacement. Hence they number C_{365,10}. The rest of the combinations, N = C_{374,10} − C_{365,10}, then count the combinations for which at least two persons share a birthday. It might appear that the probability of two or more persons having the same birthday is N/D. This is not correct.
Consider the following sanity check. Suppose that the room contains only two persons. You then expect the probability of a shared birthday to be 1/365. (You ask one person for his birthday, and you have 1 chance in 365 that it will match that of the other person.) The reasoning above, however, gives

N/D = (C_{365+2-1,2} − C_{365,2})/C_{365+2-1,2} = (C_{366,2} − C_{365,2})/C_{366,2} = 365/66795 = 1/183.

In the physical experiment, the persons are distinguishable entities. If you line them up and demand birthdays, from left to right, you will extract a 10-sequence from the set {1, 2, 3, ..., 365}, not a 10-combination. Order does count. A different 10-sequence, even if it represents the same 10-combination, corresponds to a different assignment of birthdays. So it is the 10-sequences that are equally likely. Reordering the elements in a 10-combination does not always produce the same number of 10-sequences. For example, if the 10-combination has all distinct elements, then the reordering will produce 10! 10-sequences. If, on the other hand, it has a single element, repeated 10 times, then reordering produces a single 10-sequence. A 10-combination of distinct elements occurs 10! times as often as the 10-combination containing a single repeated entry.
Returning to the field of equally likely possibilities, there are 365^{10} possible 10-sequences, of which P_{365,10} = (365)(364)⋯(356) have no duplicate elements. The following calculation then gives the correct probability:

(365^{10} − (365)(364)⋯(356))/365^{10} = 1 − (365/365)(364/365)(363/365)⋯(356/365) = 0.117.

If the room contains only two persons, the appropriately modified calculation gives the expected result:

(365^2 − (365)(364))/365^2 = 1 − (365/365)(364/365) = 1/365. □

The use of k-sequences or k-combinations arises in a class of problems called occupancy problems. The common abstraction is that k objects must be distributed across n containers. The distribution is with replacement, in the sense that a given container can acquire multiple objects. The containers have distinguishable labels; the objects, however, may or may not be distinguishable.
If the objects are distinguishable, then our interest lies with the k-sequences, chosen with replacement from the container labels. A given k-sequence then tells us which containers contain which objects. In particular, the first item in the sequence is the (label of the) container where the first object resides. The second item denotes the location of the second object, and so forth. If the physical reality of the situation is such that all such k-sequences are equally likely, the consequent probabilities are said to follow Maxwell-Boltzmann statistics. In the examples above, we feel that the distribution of 10 coin tosses into 2 bins (heads or tails) follows Maxwell-Boltzmann statistics. Consequently, a sequence, such as HHTTTHTTHT, is just as likely as any other sequence. However, two counts, such as (H = 4, T = 6) and (H = 2, T = 8), are not equally likely. The same remarks apply to the distribution of 10 birthdays into 365 containers (the calendar days). We must work with the 10-sequences from the calendar to derive probabilities that comport with experience.
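The birthday computation generalizes to any room size; a small sketch (the function name shared_birthday is illustrative, not from the text):

```python
from math import perm

def shared_birthday(m):
    """Probability that at least two of m people share a birthday:
    1 minus the fraction of m-sequences with no duplicate days."""
    return 1 - perm(365, m) / 365**m

print(round(shared_birthday(10), 3))  # 0.117
print(shared_birthday(2))             # 1/365, about 0.00274
```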
There are cases when it is appropriate to use k-combinations as a basis for probability calculations. One such instance is when the sampling protocol is without replacement, as discussed earlier. When sampling with replacement, however, we must analyze the situation to ensure that the k-combinations properly count the sets under consideration. If we are distributing k indistinguishable objects across n bins, a k-combination tells us the number of objects in the first bin, the second bin, and so forth. It provides this information in the following manner. The k-combination contains zero or more repetitions of each of the bin labels. So, as noted earlier, each such k-combination corresponds to exactly one set of repetition counts, k_1, k_2, ..., k_n, that total k. The number of objects in the first bin is k_1. The second bin contains k_2 objects, and so forth. If physical reality is such that these repetition-count sets are equally likely, then probabilities should be ratios of favored counts to all possible counts. If probabilities follow this law, they are said to follow Bose-Einstein statistics. Physicists study certain distributions of particles to labeled energy containers, and they have found, perhaps contrary to intuition, that some particle distributions do obey Bose-Einstein statistics. These examples are beyond the scope of this text. We do, however, use k-combinations to analyze sampling with replacement when the repetition counts themselves participate directly in the probability computations. The following example illustrates the point.
1.26 Example: Suppose that you have available, in unlimited amounts, food packets of 6 types. The types are labeled 1, 2, 3, 4, 5, and 6.
You must make up a shipment of 20 packets, and there is no limit on the number of repetitions of any one type. The weight of a packet depends only on its type. The nutritional value, however, varies when mixed with other packets. That is, some packets are more nutritional when served in combination with other packets. An expert is available to determine the nutritional value of any 20 packets chosen from the 6 types. The expert charges 1 dollar for each consultation when the shipment includes no packets of type 1. Because packets of type 1 require more expensive testing, the expert charges 2 dollars to evaluate a shipment that includes packets of type 1. Your task is to ship 20 packets to the scene of a catastrophe, and you want to maximize the nutrition delivered per unit weight. What is the maximum that you should have to pay the expert to attain your goal?
Consider the algorithm of Figure 1.5, where the subroutine expert(k) returns the nutritional value of a shipment containing k[1] packets of type 1, k[2] packets of type 2, and so forth, through k[6] packets of type 6. We also assume a routine weight(j) that specifies the weight of a packet of type j. The routine next(k) acquires a new assignment of 20 items in some indeterminate manner. Of interest here is simply the fact that all assignments are eventually processed and that the final assignment is the one that terminates the while-loop: (20, 0, 0, 0, 0, 0).

function optimumShipment()
    k = (0, 0, 0, 0, 0, 20);  // initial assignment: 20 of type 6, 0 of types 1-5
    choice = k;
    value = 0.0;
    searching = true;
    while (searching)
        x = 0.0;
        for (j = 1 to 6)
            x = x + k[j] * weight(j);
        y = expert(k) / x;
        if (y > value)
            value = y;
            choice = k;
        if (k = (20, 0, 0, 0, 0, 0))
            searching = false;
        else
            k = next(k);
    return choice;

Figure 1.5. Algorithm to determine optimal food pack (see Example 1.26)
We need to determine how many times the algorithm invokes the routine expert and what fraction of those calls involve packets of type 1. Since the code systematically steps through all possible 6-sequences of nonnegative integers that sum to 20, we can appeal to Theorem 1.23 for the number of such sequences. It is C_{6+20-1,20} = 53130. If the first element of the sequence is zero, then the remaining elements are precisely those 5-sequences that sum to 20, and again by Theorem 1.23, they number C_{5+20-1,20} = 10626. So our expert costs 1 dollar for each of 10626 calls plus 2 dollars for each of 53130 − 10626 = 42504 calls, for a total of 95634 dollars. This is the most we will have to pay the expert because this approach investigates every possible assignment. □
1.27 Example: A hash table is a storage structure that calculates the disk storage location of a record as a function of the record key value. The disk storage areas are usually called buckets, and each bucket can hold some maximum number of records, say b. Suppose that there are N buckets. The calculation that operates on the record key to produce the bucket number is called the hash function. A good hash function spreads the records uniformly across the buckets. This is important because when a bucket fills, any further records assigned to that bucket overflow onto new buckets, beyond the original N. Each bucket in the original collection is potentially the beginning of an overflow chain of additional buckets needed to accommodate such inappropriate assignments.
The hash function, for example, could be f(k) = k mod 100, where k is the key value. The key might be a 7-digit part number, for instance. This hash function then returns the last two digits of the part number, which represent a number in the range 0 to 99. Records are stored in the corresponding buckets.
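The expert-cost accounting can be confirmed with two binomial coefficients; a quick sketch (not part of the text):

```python
from math import comb

# Shipments are 6-tuples of nonnegative integers summing to 20.
total = comb(6 + 20 - 1, 20)           # 53130 consultations in all
without_type1 = comb(5 + 20 - 1, 20)   # 10626 shipments with no type-1 packets
cost = 1 * without_type1 + 2 * (total - without_type1)
print(total, without_type1, cost)  # 53130 10626 95634
```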
If the bucket capacity is b = 5, this hash function may or may not generate overflow chains, depending on the specific part numbers that participate. Even though the system has room for 500 records in the 100 buckets, it could still generate overflows on the first six storage operations if the part numbers of these six records all have the same last two digits.
The appeal of a hash table lies in the subsequent retrieval of the stored records. When presented with a key, the hash function can calculate the bucket that contains the record, which allows us to retrieve just that bucket, instead of sequentially searching through all buckets in the structure. Of course, we are occasionally disappointed to find an overflow chain attached to the home bucket, and that unhappy circumstance necessitates further retrievals. Nevertheless, if the overflows are infrequent, most retrievals will be efficient.
For this example, assume that the hash function does spread the records uniformly across the N buckets in the sense that each bucket is equally likely every time a storage assignment is made. A random hash function would be of small worth, because we must be able to locate the storage bucket deterministically when a later retrieval request presents the key. So we assume that just the storage assignment is made in a uniform random fashion. Thereafter, the function remembers the buckets that have been assigned to particular keys, so it can perform retrievals. This is a bit much to expect of a real hash function, but an analysis of this case allows us to infer the approximate behavior of a real hash function that does manage to spread its storage assignments uniformly across the available buckets. Under this assumption, we want the probability D(p) that a bucket receives exactly p records, including those that might produce overflows, for p = 0, 1, 2, ..., n.
The relevant parameters are: N, the number of buckets; n, the number of records assigned; b, the number of records a bucket can hold without overflowing; and α = n/(Nb), the load factor, which is the fraction of the system capacity (without overflows) occupied by records when all have been assigned. For the case N = 10000, b = 10, the results appear as Table 1.6, and we now proceed to justify these values.

                            Number of records
   p      10000    30000    50000    70000    90000    99000
                              Load factor
            0.1      0.3      0.5      0.7      0.9     0.99
   0     0.3679   0.0498   0.0067   0.0009   0.0001   0.0001
   1     0.3679   0.1494   0.0337   0.0064   0.0011   0.0005
   2     0.1839   0.2240   0.0842   0.0223   0.0050   0.0025
   3     0.0613   0.2240   0.1404   0.0521   0.0150   0.0081
   4     0.0153   0.1680   0.1755   0.0912   0.0337   0.0201
   5     0.0031   0.1008   0.1755   0.1277   0.0607   0.0398
   6     0.0005   0.0504   0.1462   0.1490   0.0911   0.0656
   7     0.0001   0.0216   0.1044   0.1490   0.1171   0.0928
   8     0.0000   0.0081   0.0653   0.1304   0.1318   0.1148
   9     0.0000   0.0027   0.0363   0.1014   0.1318   0.1263
  10     0.0000   0.0008   0.0181   0.0710   0.1186   0.1250
 >10     0.0000   0.0003   0.0137   0.0985   0.2940   0.4045

TABLE 1.6. D(p), the probability of a hash table bucket receiving p records (see Example 1.27)

After receiving an assignment, a bucket remains in the pool of choices for subsequent assignments. So the process draws n samples, with replacement, from the buckets {1, 2, ..., N}. Theorem 1.19 asserts that there are N^n possible n-sequences, and our assumption about the uniform nature of the hash function means that each of these sequences is equally likely to occur. If x is an arbitrarily selected bucket, how many of these sequences assign exactly p records to bucket x? That is, how many sequences contain exactly p elements labeled x? Applying the multiplicative counting principle, we reason that for each of the C_{n,p} ways of choosing locations for the elements labeled x, there are (N − 1)^{n−p} ways of completing the sequence by assigning the remaining n − p locations to any one of the remaining N − 1 buckets that are not labeled x. In other words, the remaining n − p positions form an (n − p)-sequence chosen with replacement from the buckets other than x, and there are N − 1 such buckets. The counting principle then asserts that the number of sequences that contain exactly p elements labeled x is C_{n,p}(N − 1)^{n−p}. The required probability is then

D(p) = C_{n,p}(N − 1)^{n−p}/N^n.

This result is independent of x and is therefore the same for all buckets. Each bucket receives exactly p records with probability D(p). We can therefore interpret D(p) as the fraction of the N buckets that receive exactly p records. Note that these fractions sum properly to 1:

Σ_{p=0}^{n} C_{n,p}(N − 1)^{n−p}/N^n = [1 + (N − 1)]^n/N^n = N^n/N^n = 1.

For p > b, this expression gives the fraction of buckets that overflow. Consider again Table 1.6, which illustrates the situation with 10,000 buckets, each of capacity 10. Each column corresponds to a different number of records stored in the hash structure. These values run from n = 10000 through n = 99000. The corresponding load factors appear below the numbers of records. The column then gives the probability that a bucket will receive p = 0, 1, 2, ..., 10 record assignments. The last row (p > 10) gives the probability that a bucket will receive more than 10 assignments and will therefore overflow. You can see that the probability of overflow is rather small until the load factor exceeds 0.7. The chance of overflow does rise rapidly after this threshold; there is a 40% chance of overflow when the load factor reaches 0.99.
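The formula D(p) = C_{n,p}(N − 1)^{n−p}/N^n can reproduce any entry of Table 1.6; a sketch (the helper name D is illustrative) for the load-factor-0.1 column:

```python
from math import comb

def D(p, n, N):
    """Probability a given bucket receives exactly p of n uniform assignments.
    Exact integer arithmetic before the final division avoids overflow."""
    return comb(n, p) * (N - 1) ** (n - p) / N**n

# First few entries of the n = 10000 column of Table 1.6 (N = 10000 buckets)
N, n = 10000, 10000
col = [round(D(p, n, N), 4) for p in range(4)]
print(col)  # [0.3679, 0.3679, 0.1839, 0.0613]
```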
Figure 1.6 depicts the same information in a graphical format, using interpolation to produce smooth curves for each of the load factors α = 0.1 through α = 0.99. Each curve, therefore, corresponds to one column of data from Table 1.6. Overflow occurs when significant probability exists for p > 10. As you can see, the curves for low load factors, up through about α = 0.7, have reasonably small D(p) values for p > 10.

Figure 1.6. Probability that a hash table bucket receives exactly p records (see Example 1.27)

Hash table manipulations occur often in database applications, and bucket fetching is usually the most time-intensive operation because it involves disk accesses. If a hash table contains no overflow chains, each retrieval requires just 1 bucket fetch. Performance deteriorates when overflow chains appear. We can use the D(p) values just obtained to calculate the average number of bucket fetches per retrieval, assuming that the hash function spreads the records in a uniform manner. □

Exercises
1.17 You receive 5 cards from a standard card deck, but you return each card immediately after noting its value and suit. What is the probability that you receive two or more aces? Should this probability be larger or smaller than the probability of receiving two or more aces when you keep the cards?
1.18 Three dice are thrown simultaneously. What is the probability that the sum of the spots on the upturned faces is less than six?
*1.19 You shuffle a deck of cards and then deal all 52 cards in a row. What is the probability that the display contains no adjacent spades?
1.20 In 10 tosses of a fair coin, what is the probability that the number of heads will be greater than twice the number of tails?
1.21 You receive three cards from a well-shuffled standard deck. You note the values and suits and then return the cards. You then receive two further cards.
What is the probability that you see two or more aces during the course of the experiment?
1.22 From an unlimited supply of pennies, nickels, dimes, and quarters, you are to select 20 coins. How many different combinations are possible? What fraction consists of monetary amounts that are divisible by 5?
1.23 Consider the equation Σ_i x_i = **0, where all the x_i are nonnegative integers. What fraction of the solutions have both x_1 = 0 and x_2 = 0?
1.24 Consider the equation Σ_i x_i = 60, where each x_i satisfies x_i ≥ i. What fraction of the solutions also satisfy x_i ≥ 2i?
*1.25 Suppose that k > n and consider the equation Σ_{i=1}^{n} x_i = k, where the x_i are positive integers. How many solutions exist?
*1.26 Suppose that Σ_{i=1}^{n} x_i = k, where the x_i are nonnegative integers. How many solutions exist?
1.27 How many people must assemble to have a probability greater than one-half that two or more persons share the same birthday? How many people are necessary to raise the probability to 0.8?
1.28 A restaurant occasionally announces that the owner will pick up the tab for any patron whose birthday is tonight. If the owner wants a 50% chance of escaping without giving away any free dinners, what is the largest crowd to which he can make the announcement?
*1.29 (Σ_{k=0}^{∞} β^k)^n = Σ_{k=0}^{∞} a_k β^k for |β| < 1. Find the coefficients {a_k}.
*1.30 Example 1.27 analyzes a hash table with N buckets of capacity b to which n records are allocated in a uniform manner. It computes D(p), the fraction of buckets that receive exactly p records, as D(p) = C_{n,p}(N − 1)^{n−p}/N^n. For large N and n and small p, show that D(p) is approximately (αb)^p e^{−αb}/p!, where α = n/(Nb) is the table load factor.

1.2 Summations
Consider the code segment below, in which we want to count the number of calls to procedure inner:

for (i = 1 to n)
    for (j = i + 1 to n + 1)
        inner(i, j);

When i = 1, j runs from 2 to n + 1, which gives n procedure calls. When i = 2, j runs from 3 to n + 1, producing n − 1 calls.
The number of calls decreases by 1 for each successive value of i, until the final loop, when i = n, produces just 1 call. We can write the total by summing up from 1 to n or by summing down from n to 1. The following manipulation produces the simple expression S = n(n + 1)/2 for the sum:

 S =    1    +    2    +    3    + ⋯ +   n−1   +    n
 S =    n    +   n−1   +   n−2   + ⋯ +    2    +    1
2S = (n + 1) + (n + 1) + (n + 1) + ⋯ + (n + 1) + (n + 1)

In summation notation, we have shown that Σ_{i=1}^{n} i = n(n + 1)/2. In this expression, the right side is called a closed form because it involves no summation symbols. An expression with a summation symbol cannot be expanded without using an ellipsis. The left side, for example, expands as 1 + 2 + 3 + ⋯ + n. The ellipsis (the three dots) is difficult to manipulate in further algebraic simplification. The closed form on the right, by contrast, easily participates in further algebra.
Although this device quickly provides a closed-form expression for Σ i, it is not very helpful for Σ i^2 or Σ i^3, which also occur in counting operations. What we need is a general expression for Σ i^k, for arbitrary k. These sums become increasingly complicated as k grows larger, but there is a systematic approach to the problem. Before launching into a calculation of Σ_{i=1}^{n} i^2, we illustrate a useful transformation called index shifting. This technique is helpful in manipulating indices that are slightly offset from those we would find convenient. For example, consider the following two sums, which we can show to be equal:

Σ_{i=1}^{n} (i + 1)i = Σ_{i=2}^{n+1} i(i − 1).

The left side generates (2)(1), (3)(2), (4)(3), ..., (n + 1)(n), as i runs from 1 to n. But these are exactly the terms generated by the right side as i runs from 2 to n + 1. The same number of terms results when we shift the sum's lower and upper limits up by 1, and we can ensure that the value of each term remains unchanged by adjusting each occurrence of i to i − 1.
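The nested-loop call count can be checked against the closed form n(n + 1)/2; a sketch (the function name calls is illustrative, not from the text):

```python
def calls(n):
    """Count invocations of inner() in the nested loops of the text."""
    count = 0
    for i in range(1, n + 1):           # i = 1 to n
        for j in range(i + 1, n + 2):   # j = i + 1 to n + 1
            count += 1
    return count

for n in (1, 5, 10):
    assert calls(n) == n * (n + 1) // 2
print(calls(10))  # 55
```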
Observing an index shift of this sort, we quickly verify the conditions needed to maintain equality: (1) the lower and upper indices are shifted by the same amount in the same direction, and (2) the first term generated in the transformed expression is the same as the first term in the old expression. The latter condition holds when the summation variable is adjusted by the same amount as the lower and upper limits, but in the opposite direction. In the example above, the lower and upper limits advance by 1, and each occurrence of the summation index decreases by 1. We need not check terms other than the first. If the initial terms of the two expressions are the same, all succeeding terms will follow along in parallel as the summation index increases uniformly on both sides. The following formula, in which the shift s may be positive or negative, generalizes this rule:

Σ_{i=j}^{n} f(i) = Σ_{i=j+s}^{n+s} f(i − s).

Keeping this transformation in mind, consider the following manipulation, which manages to recover the starting expression plus some other terms. The other terms must, of course, be equal to zero. Note that not all transformations are index shifts; some are obtained by breaking off or adding on a term at the sum's extreme.

Σ_{i=1}^{n} i(i − 1) = Σ_{i=2}^{n+1} (i − 1)(i − 2) = Σ_{i=2}^{n+1} i(i − 1) − 2 Σ_{i=2}^{n+1} (i − 1)

We adjust the first summation by breaking out the first and last terms; we adjust the second summation by shifting the index:

Σ_{i=1}^{n} i(i − 1) = −(1)(0) + Σ_{i=1}^{n} i(i − 1) + (n + 1)(n) − 2 Σ_{i=1}^{n} i
                    = Σ_{i=1}^{n} i(i − 1) + (n + 1)(n) − 2 Σ_{i=1}^{n} i

Canceling the left side with its counterpart on the right and transposing the remaining sum gives Σ_{i=1}^{n} i = (n + 1)n/2. Of course, we already know this result from the previous calculation. But it is comforting to get the same answer, and besides, this technique generalizes to calculate Σ_{i=1}^{n} i^k.
1.28 Theorem: Σ_{i=1}^{n} P_{i,k} = P_{n+1,k+1}/(k + 1).
PROOF: To see the general pattern, we repeat the previous discussion with the first three factors of $i!$.

$$\sum_{i=1}^{n} P_{i,3} = \sum_{i=1}^{n} i(i-1)(i-2) = \sum_{i=2}^{n+1} (i-1)(i-2)(i-3)$$
$$= \sum_{i=2}^{n+1} i(i-1)(i-2) - 3\sum_{i=2}^{n+1} (i-1)(i-2)$$
$$= -(1)(0)(-1) + \sum_{i=1}^{n} i(i-1)(i-2) + (n+1)n(n-1) - 3\sum_{i=1}^{n} i(i-1)$$

The left side again cancels with a right term, giving

$$\sum_{i=1}^{n} i(i-1) = \frac{(n+1)n(n-1)}{3}.$$

So we have established the theorem by direct calculation for $k = 1$ and $k = 2$, and we now need to handle the general case.

$$\sum_{i=1}^{n} P_{i,k+1} = \sum_{i=1}^{n} i(i-1)(i-2)\cdots(i-k+1)(i-k) = \sum_{i=2}^{n+1} (i-1)(i-2)\cdots(i-k)(i-k-1)$$
$$= \sum_{i=2}^{n+1} P_{i-1,k}\,[\,i - (k+1)\,] = \sum_{i=2}^{n+1} i\,P_{i-1,k} - (k+1)\sum_{i=2}^{n+1} P_{i-1,k}$$

We adjust the first sum to run from 1 to $n$ by subtracting the summand for $i = 1$ and breaking off the last term:

$$\sum_{i=2}^{n+1} i\,P_{i-1,k} = -(1)(0)(-1)\cdots(1-k) + \sum_{i=1}^{n} i\,P_{i-1,k} + (n+1)P_{n,k} = \sum_{i=1}^{n} P_{i,k+1} + P_{n+1,k+1}.$$

Consequently,

$$\sum_{i=1}^{n} P_{i,k+1} = \sum_{i=1}^{n} P_{i,k+1} + P_{n+1,k+1} - (k+1)\sum_{i=1}^{n} P_{i,k}.$$

Canceling the left side with its right counterpart and dividing by $(k+1)$ gives the desired result. ∎

Using the initial cases already established, we can now compute $\sum_{i=1}^{n} i^k$, for $k = 1, 2, 3, \ldots$. For example, the computation for $\sum_{i=1}^{n} i^2$ is as follows, and the next theorem collects the first several such results.

$$\frac{P_{n+1,3}}{3} = \frac{(n+1)n(n-1)}{3} = \sum_{i=1}^{n} P_{i,2} = \sum_{i=1}^{n} i(i-1) = \sum_{i=1}^{n} (i^2 - i) = \sum_{i=1}^{n} i^2 - \sum_{i=1}^{n} i$$
$$\sum_{i=1}^{n} i^2 = \frac{(n+1)n(n-1)}{3} + \frac{n(n+1)}{2} = \frac{n(n+1)(2n+1)}{6}$$

1.29 Theorem:
$$\sum_{i=1}^{n} i^p = \begin{cases} n(n+1)/2, & p = 1 \\ n(n+1)(2n+1)/6, & p = 2 \\ n^2(n+1)^2/4, & p = 3. \end{cases}$$

PROOF: The proofs for $\sum_{i=1}^{n} i$ and $\sum_{i=1}^{n} i^2$ appear in the discussion above. For $\sum_{i=1}^{n} i^3$ we again proceed from Theorem 1.28.

$$\frac{P_{n+1,4}}{4} = \frac{(n+1)n(n-1)(n-2)}{4} = \sum_{i=1}^{n} i(i-1)(i-2) = \sum_{i=1}^{n} (i^3 - 3i^2 + 2i)$$
$$= \sum_{i=1}^{n} i^3 - \frac{3n(n+1)(2n+1)}{6} + \frac{2n(n+1)}{2}$$

Solving for $\sum i^3$ now gives the desired expression: $\sum_{i=1}^{n} i^3 = n^2(n+1)^2/4$. ∎

1.30 Example: The procedure below accepts a list (i.e., a vector) $A[1..n]$ and sorts it into ascending order. Find $f(n)$, the number of executions of the comparison in line 4.

    function slowsort(A, n)
      for (i = 1 to n-1)
        for (j = i+1 to n)
          if (A[j] < A[i])
            x = A[i];
            A[i] = A[j];
            A[j] = x;

The algorithm loops give the following nested sums, which, after some manipulation, provide an opportunity to involve the first summation formula of Theorem 1.29. Toward the end, the calculation uses the fact that $\sum_{i=1}^{n-1} i = \sum_{i=1}^{n-1} (n-i)$. This is true because the second sum generates the same terms as the first, but in reverse order.

$$f(n) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 1 = \sum_{i=1}^{n-1} (n-i) = \sum_{i=1}^{n-1} i = \frac{(n-1)n}{2} \qquad \Box$$

Besides sums of integers and powers of integers, we also encounter geometric sums and their variations. A plain geometric sum is one in which each term is a constant multiple of its predecessor. We shall see that we can easily convert such sums into closed-form expressions. We shall also find closed-form expressions for certain systematic variations of the plain geometric sum, where the multiple that generates each term from its predecessor is not constant.

First, consider the plain geometric sum $\sum_{i=0}^{n} ar^i$. The first term is $a$, and each remaining term is $r$ times its predecessor. We can expand the sum and employ a trick similar to that used to calculate $\sum_{i=1}^{n} i$. This is one of the few instances where the ellipsis notation allows further manipulation. Letting $S$ represent the sum, the following array immediately implies the subsequent closed form for the geometric sum.

$$\begin{aligned}
S &= a + ar + ar^2 + \cdots + ar^{n-1} + ar^n \\
rS &= \phantom{a + {}} ar + ar^2 + \cdots + ar^{n-1} + ar^n + ar^{n+1} \\
(1-r)S &= a + 0 + 0 + \cdots + 0 + 0 - ar^{n+1}
\end{aligned}$$
$$\sum_{i=0}^{n} ar^i = \frac{a(1 - r^{n+1})}{1 - r} \tag{1.2}$$

Equation 1.2 is valid only when $r \neq 1$, but the sum is particularly easy to evaluate directly when $r = 1$. Now consider the following arrangement, where the sum of the first $n$ integers is complicated with a geometric factor. We immediately add zero at the beginning of the sum in the form $0r^0$. Of course, this leaves the sum unchanged.

$$\sum_{i=1}^{n} ir^i = r + 2r^2 + 3r^3 + \cdots + nr^n = 0r^0 + r + 2r^2 + 3r^3 + \cdots + nr^n = \sum_{i=0}^{n} ir^i$$
$$= r\sum_{i=0}^{n} ir^{i-1} = r\,\frac{d}{dr}\sum_{i=0}^{n} r^i = r\,\frac{d}{dr}\left(\frac{1 - r^{n+1}}{1-r}\right) = \frac{r[\,nr^{n+1} - (n+1)r^n + 1\,]}{(1-r)^2} \tag{1.3}$$

The expression for an infinite geometric sum finds frequent use, and we can obtain it by considering the limiting form of Equation 1.2.

$$\sum_{i=0}^{\infty} r^i = \lim_{n\to\infty} \sum_{i=0}^{n} r^i = \lim_{n\to\infty} \frac{1 - r^{n+1}}{1 - r} = \frac{1}{1-r}, \tag{1.4}$$

provided that $|r| < 1$. The first two sections of Appendix A, which discusses sets, functions, and limits, should be reviewed at this time. The following examples provide further illustrations of limiting summations but assume the background familiarity with limit processes provided by the appendix.

1.31 Example: Evaluate $\sum_{i=0}^{\infty} i/2^i$. For any finite $n$, Equation 1.3 shows that

$$\sum_{i=1}^{n} ir^i = \frac{r[\,nr^{n+1} - (n+1)r^n + 1\,]}{(1-r)^2}.$$

Substituting $r = 1/2$ and adding the inconsequential term for $i = 0$ gives

$$\sum_{i=0}^{n} \frac{i}{2^i} = 2 - \frac{n+2}{2^n}$$
$$\sum_{i=0}^{\infty} \frac{i}{2^i} = \lim_{n\to\infty} \sum_{i=0}^{n} \frac{i}{2^i} = \lim_{n\to\infty} \left(2 - \frac{n+2}{2^n}\right) = 2.$$

This last reduction used l'Hôpital's rule to evaluate the indeterminate form $\infty/\infty$. The rule stipulates that a ratio $f(n)/g(n)$, in which both $f(n)$ and $g(n) \to \infty$, approaches $\lim_{n\to\infty} f'(n)/g'(n)$, provided that the latter is well defined. In the case at hand,

$$\lim_{n\to\infty} \frac{n+2}{2^n} = \lim_{n\to\infty} \frac{1}{2^n \ln 2} = 0. \qquad \Box$$

1.32 Example: Show that the geometric variant $\sum_{i=1}^{\infty} ir^i$ converges for $0 \leq r < 1$. If $r = 0$, all partial sums are also zero, and the convergence is trivial. Consequently, assume that $0 < r < 1$. From Equation 1.3 we have

$$\sum_{i=1}^{n} ir^i = \frac{r[\,nr^{n+1} - (n+1)r^n + 1\,]}{(1-r)^2}.$$
Since $r < 1$, $\lim_{n\to\infty} r^n = 0$, so assuming for the moment that $nr^n$ also goes to zero, we have that

$$\sum_{i=1}^{\infty} ir^i = \lim_{n\to\infty} \sum_{i=1}^{n} ir^i = \frac{r}{(1-r)^2}.$$

To show that $nr^n$ actually does go to zero, note that $nr^n = n/(1/r)^n$, which approaches the indeterminate form $\infty/\infty$. Consequently,

$$\lim_{n\to\infty} nr^n = \lim_{n\to\infty} \frac{n}{(1/r)^n} = \lim_{n\to\infty} \frac{1}{(1/r)^n \ln(1/r)} = \lim_{n\to\infty} \frac{r^n}{\ln(1/r)} = 0. \qquad \Box$$

1.33 Example: Below is a procedure that searches a list $A[1..n]$ for a target $X$. The unsophisticated algorithm simply compares each list element with the target, starting at the first element, until it finds a match or falls off the end of the list. If it finds a match, it returns the index of the matching element; otherwise, it returns a 0.

    function slowsearch(A, n, X)
      for (i = 1 to n)
        if (A[i] == X)
          return i;
      return 0;

Suppose that this procedure runs in an application environment where the target appears in the first position half the time, in the second position one-fourth of the time, and so forth. Over a large number of executions, what is the average number of comparisons of the target against a list value?

If we run the algorithm a large number of times, say $N$, we will find the target in the first position $N/2$ times, in the second position $N/4$ times, and so forth, through finding the target in the $n$th position $N/2^n$ times. The remaining times, which will be $N/2^n$, the target is not in the list. Note that these fractions properly add to $N$.

$$\frac{N}{2} + \frac{N}{4} + \cdots + \frac{N}{2^n} + \frac{N}{2^n} = N\left(\sum_{i=1}^{n} \frac{1}{2^i} + \frac{1}{2^n}\right) = N$$

The last equality follows because Equation 1.2 gives

$$\sum_{i=1}^{n} \frac{1}{2^i} = \sum_{i=0}^{n-1} \frac{1}{2}\left(\frac{1}{2}\right)^i = \frac{1}{2}\left(\frac{1 - (1/2)^n}{1 - (1/2)}\right) = 1 - \frac{1}{2^n}.$$

We are concerned with the number of comparisons in line 3, where the for-loop compares list element $A[i]$ with target $X$. In particular, we want to know the average number of comparisons over a large number of procedure runs $N$.
So we add the comparisons generated in each run and divide the total by $N$. (We will develop a more exact definition of average, as a probability concept, later in the text. For now, let us be content with this intuitive meaning of average.) Because $N/2$ runs find the target in the first position, they each do 1 comparison. $N/4$ runs find the target in the second position, and they therefore do two comparisons. The $N/2^n$ runs that find the target in the last position must do $n$ comparisons. The $N/2^n$ runs that do not find the target in the list also do $n$ comparisons each. So we can calculate our required average $f(n)$ as follows, where the second line invokes Equation 1.3 with $r = 1/2$.

$$f(n) = \frac{1}{N}\left(\frac{N}{2}(1) + \frac{N}{4}(2) + \cdots + \frac{N}{2^n}(n) + \frac{N}{2^n}(n)\right) = \sum_{i=1}^{n} i\left(\frac{1}{2}\right)^i + \frac{n}{2^n}$$
$$= \frac{(1/2)[\,n(1/2)^{n+1} - (n+1)(1/2)^n + 1\,]}{[1 - (1/2)]^2} + \frac{n}{2^n} = 2 - \frac{n+2}{2^n} + \frac{n}{2^n} = 2 - \left(\frac{1}{2}\right)^{n-1}$$

As $n$ becomes large, $f(n)$ approaches 2, which reflects the bias in the application. Recall that most of the time the target is among the early list elements, and it takes, on the average, only about 2 probes to find it. □

The technique used to derive $\sum_{i=0}^{n} ir^i$ is worth remembering because it generalizes to $\sum_{i=0}^{n} i^k r^i$ for other values of $k$. The method involves differentiating a closed-form expression of $\sum_{i=0}^{n} r^i$. The exercises pursue this line of thought. Here, however, we employ the same general approach for expressions involving binomial coefficients. Summing the coefficients themselves is easy.

$$\sum_{k=0}^{n} C_{n,k} = \sum_{k=0}^{n} C_{n,k}\,1^k 1^{n-k} = (1+1)^n = 2^n \tag{1.5}$$

How might we obtain a closed-form expression for $\sum_{k=0}^{n} k \cdot C_{n,k}$? Consider the following differentiable function of $x$.

$$f(x) = (x+1)^n = \sum_{k=0}^{n} C_{n,k}\,x^k 1^{n-k} = \sum_{k=0}^{n} C_{n,k}\,x^k$$
$$f'(x) = n(x+1)^{n-1} = \sum_{k=0}^{n} k\,C_{n,k}\,x^{k-1} = \frac{1}{x}\sum_{k=0}^{n} k\,C_{n,k}\,x^k$$

We let $x = 1$ to obtain

$$\sum_{k=0}^{n} k\,C_{n,k} = n \cdot 2^{n-1}. \tag{1.6}$$

Example 1.18 used this closed-form expression.
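The closed forms above are easy to check numerically. The following sketch (plain Python, using `math.comb` for $C_{n,k}$; the helper name `geometric_variant` is illustrative, not from the text) verifies Equations 1.3, 1.5, and 1.6 for small cases:

```python
from math import comb

def geometric_variant(n, r):
    """Closed form of Equation 1.3: sum of i*r^i for i = 1..n, r != 1."""
    return r * (n * r**(n + 1) - (n + 1) * r**n + 1) / (1 - r)**2

# Equation 1.3 against direct term-by-term evaluation.
n, r = 20, 0.5
direct = sum(i * r**i for i in range(1, n + 1))
assert abs(direct - geometric_variant(n, r)) < 1e-12

# Equations 1.5 and 1.6 for several n.
for n in range(1, 12):
    assert sum(comb(n, k) for k in range(n + 1)) == 2**n                 # Eq 1.5
    assert sum(k * comb(n, k) for k in range(n + 1)) == n * 2**(n - 1)   # Eq 1.6
```

Substituting other values of $r$ and $n$ in the first check also confirms the $f(n) = 2 - (1/2)^{n-1}$ average derived in Example 1.33.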
The technique again generalizes to provide closed-form expressions for $\sum_{k=0}^{n} k^p C_{n,k}$ for higher values of $p$.

The binomial coefficients exhibit a large number of interrelationships. The following two prove useful in upcoming derivations, and the exercises provide an opportunity to investigate further relationships.

1.34 Theorem: (Pascal's identity) For $0 < k < n$, $C_{n,k} = C_{n-1,k} + C_{n-1,k-1}$.
PROOF: Let $S = \{s_1, s_2, \ldots, s_n\}$. Then $C_{n,k}$ counts the number of $k$-combinations available without replacement from $S$. We divide these combinations into two groups: those that contain $s_n$ and those that do not contain $s_n$. Each $k$-combination in the first group contains $s_n$ plus $k-1$ further choices from $s_1, s_2, \ldots, s_{n-1}$. Since there are $C_{n-1,k-1}$ such choices, the first group must contain $C_{n-1,k-1}$ elements. The second group contains $k$-combinations that do not include $s_n$ and must therefore include all $k$ choices from the remaining $s_1, s_2, \ldots, s_{n-1}$. There are $C_{n-1,k}$ such choices, so there are $C_{n-1,k}$ elements in the second group. Adding the two group sizes completes the theorem.

By expressing the right side in factorial terms, we can construct an algebraic proof.

$$C_{n-1,k} + C_{n-1,k-1} = \frac{(n-1)!}{k!(n-k-1)!} + \frac{(n-1)!}{(k-1)!(n-k)!}
= \frac{(n-k)}{(n-k)}\cdot\frac{(n-1)!}{k!(n-k-1)!} + \frac{k}{k}\cdot\frac{(n-1)!}{(k-1)!(n-k)!}$$

Rearranging to force the factor $(n-1)!/[k!(n-k)!]$ into both terms, we have

$$C_{n-1,k} + C_{n-1,k-1} = (n-k)\,\frac{(n-1)!}{k!(n-k)!} + k\,\frac{(n-1)!}{k!(n-k)!} = (n-k+k)\,\frac{(n-1)!}{k!(n-k)!} = \frac{n!}{k!(n-k)!} = C_{n,k}.$$

The first proof is more satisfying for the insight into how to decompose the original $C_{n,k}$ $k$-combinations into two groups. ∎

TABLE 1.7. Pascal's triangle (see Theorem 1.34)

     k →
 n    0   1   2    3    4    5    6    7    8   9  10
 0    1
 1    1   1
 2    1   2   1
 3    1   3   3    1
 4    1   4   6    4    1
 5    1   5  10   10    5    1
 6    1   6  15   20   15    6    1
 7    1   7  21   35   35   21    7    1
 8    1   8  28   56   70   56   28    8    1
 9    1   9  36   84  126  126   84   36    9   1
10    1  10  45  120  210  252  210  120   45  10   1
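Pascal's identity is also the standard way to compute the triangle row by row without factorials; a minimal sketch (the function name `pascal_rows` is an illustrative choice, not from the text):

```python
from math import comb

# Pascal's identity: C(n,k) = C(n-1,k) + C(n-1,k-1) for 0 < k < n.
for n in range(1, 15):
    for k in range(1, n):
        assert comb(n, k) == comb(n - 1, k) + comb(n - 1, k - 1)

def pascal_rows(n_max):
    """Rows 0..n_max of Pascal's triangle, built only from the identity."""
    rows = [[1]]
    for n in range(1, n_max + 1):
        prev = rows[-1]
        # each interior entry is the sum of its two neighbors in the row above
        row = [1] + [prev[k - 1] + prev[k] for k in range(1, n)] + [1]
        rows.append(row)
    return rows

# Row n = 10 matches the last row of Table 1.7.
assert pascal_rows(10)[10] == [comb(10, k) for k in range(11)]
```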
Pascal's triangle arranges the binomial coefficients to emphasize the relationship of Theorem 1.34. As shown in Table 1.7, each row corresponds to a particular $n$ and contains the entries $C_{n,k}$, for $k = 0, 1, \ldots, n$. Each row starts and ends with a 1, and Pascal's identity computes each remaining entry as the sum of its north and northwest neighbors.

1.35 Theorem: Suppose that $m \leq n$. Then $\sum_{j=0}^{k} C_{m,j}\,C_{n-m,k-j} = C_{n,k}$.
PROOF: Recall the convention that $C_{a,b} = 0$ if $b > a$. Hence the sum produces nonzero terms only while $j \leq m$ and $k - j \leq n - m$ [i.e., $j \geq k - (n-m)$]. So although $j$ formally runs from 0 to $k$, some of the early terms can be zero if $k > (n - m)$, and some of the final terms can be zero if $k > m$. Consider a set containing $m$ elements of one type and $n - m$ elements of a second type. That is, consider $S = \{x_1, x_2, \ldots, x_m, y_1, y_2, \ldots, y_{n-m}\}$. Now imagine the set elements aligned over $n$ slots. The situation for $n = 9$, $m = 4$ appears as follows.

$$x_1 \quad x_2 \quad x_3 \quad x_4 \quad y_1 \quad y_2 \quad y_3 \quad y_4 \quad y_5$$

The diagram shows one of the $C_{9,6}$ patterns. In this one, 2 elements come from the prefix (the first four slots), and 4 come from the suffix (the last five slots). There are $C_{4,2}C_{5,4}$ patterns of this subtype. Other subtypes contain 1 element from the prefix and 5 from the suffix, which contributes $C_{4,1}C_{5,5}$ patterns, 3 from the prefix and 3 from the suffix, which contributes $C_{4,3}C_{5,3}$ patterns, and finally, 4 from the prefix and 2 from the suffix, which contributes $C_{4,4}C_{5,2}$ patterns. These correspond to $1 \leq j \leq 4$ in the summation formula of the theorem. Notice that the terms for $j = 0, 5$, and 6 produce zero patterns. When $j = 0$, the suffix does not have enough slots to provide $6 - 0 = 6$ choices, and when $j > 4$, the prefix does not have enough slots to provide 5 or 6 choices.

In the general case, let $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_{n-m}\}$. $X$ then refers to the leftmost $m$ slots, while $Y$ refers to the remaining $n - m$ slots.
We decompose the $C_{n,k}$ patterns into $k + 1$ groups: those with no chosen slots from $X$ and $k$ chosen slots from $Y$, those with 1 slot from $X$ and $k - 1$ slots from $Y$, and so forth, through the final group, which consists of patterns with $k$ slots from $X$ and none from $Y$. These groups are disjoint, and by the second counting principle, their sizes are as follows, assuming that $k - (n-m) \leq j \leq m$. If $j$ is outside this range, the group size is zero.

Group | Slots from X | Slots from Y | Size of group
  0   |      0       |      k       | $C_{m,0}\,C_{n-m,k}$
  1   |      1       |    k - 1     | $C_{m,1}\,C_{n-m,k-1}$
  ⋮   |      ⋮       |      ⋮       | ⋮
  j   |      j       |    k - j     | $C_{m,j}\,C_{n-m,k-j}$
  ⋮   |      ⋮       |      ⋮       | ⋮
  k   |      k       |      0       | $C_{m,k}\,C_{n-m,0}$

The sum of the last column must then be $C_{n,k}$, which completes the theorem's proof. An algebraic proof of this theorem is not as simple as was the case with Pascal's identity. ∎

Exercises

1.31 Obtain a closed-form expression for $\sum_{i=1}^{n} i^4$.
1.32 Obtain a closed-form expression for $\sum_{i=1}^{n} i^2/2^i$.
1.33 Obtain a closed-form expression for $\sum_{i=6}^{n} i^3/3^i$, for $n \geq 6$.
*1.34 Obtain a closed-form expression for $\sum_{i=1}^{\infty} i^3/3^i$.
1.35 Obtain a closed-form expression for $\sum_{k=0}^{n} (-1)^k C_{n,k}$.
1.36 Obtain a closed-form expression for $\sum_{k=0}^{n} k^2 C_{n,k}$.
1.37 Prove Newton's identity: $C_{n,k}C_{k,l} = C_{n,l}C_{n-l,k-l}$, for $0 \leq l \leq k \leq n$.
*1.38 Show that $\sum_{k=0}^{m} C_{n+k,k} = C_{n+m+1,m}$.
*1.39 Show that $\sum_{k=0}^{n} C_{n,k}^2 = C_{2n,n}$.
1.40 Obtain a closed-form expression for $\sum_{k_1+k_2+\cdots+k_p=n} n!/[k_1!\,k_2!\cdots k_p!]$.
1.41 Obtain closed-form expressions for the following:
• $\sum_{k_1+k_2+\cdots+k_p=n} k_1 \cdot n!/[k_1!\,k_2!\cdots k_p!]$

1.3 Probability spaces and random variables

Although nondeterministic, the outcomes of a random process nevertheless obey certain restrictions. In dice experiments, the outcomes are numbers corresponding to the upturned faces. For two dice, the outcomes are the integers 2 through 12. Other integer outcomes are not possible, nor are outcomes such as heads, tails, blue, or green. The term sample space denotes the set of possible outcomes. For experiments with two dice, the sample space is $\{2, 3, \ldots, 12\}$.
For coin-tossing situations, it is {heads, tails}. For card deck processes, it is the 52 individual cards. Clearly, the sample space need not contain numbers. It is convenient, nevertheless, to associate numbers with the outcomes in order to involve them in further computations. A functional association of a number with each possible outcome is a random variable, which we will formally define shortly. At this point, however, we note that the domains of such functions are sample spaces. Of course, the sample space outcomes have varying probabilities of occurrence, and when we wish to emphasize this additional feature, we use the term probability space. The next definition formalizes the concept for the discrete case.

1.36 Definition: A discrete probability space is a countable set $\Omega$ (the sample space), together with a function $\Pr : \Omega \to [0,1]$, such that $\sum_{\omega \in \Omega} \Pr(\omega) = 1$. The elements of $\Omega$ are called outcomes. For $\omega \in \Omega$, the number $\Pr(\omega)$ is the probability of $\omega$. The term probability distribution refers to the overall assignment of varying probabilities to the outcomes. Subsets of $\Omega$ are called events. The probability of an event $E$ is the sum of its constituent probabilities: $\Pr(E) = \sum_{\omega \in E} \Pr(\omega)$. Note that $\Pr(\cdot)$ can refer to an outcome or to an event. The second form represents an extension of the original function to the collection of all subsets of $\Omega$. In any given application, the distinction will always be clear from context. ∎

As the definition implies, probability measures the "size" of sets, where we consider a point to be a singleton set. Consequently, we will find frequent occasion to manipulate sets through unions, intersections, and complements. A quick review of the set theory presented in Appendix A would therefore be appropriate at this point. In any case, we emphasize the notation $A \cup B = A + B$ as alternative expressions for set union. Similarly, $A \cap B = AB$ both represent set intersection. An overbar indicates complement, as in $\bar{A}$.
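Definition 1.36 maps directly onto a small data structure: a dictionary from outcomes to probabilities, with event probability computed as the sum over constituent outcomes. A minimal sketch using the two-dice sample space $\{2, 3, \ldots, 12\}$ from the text (the names `two_dice` and `pr_event` are illustrative choices):

```python
# Build the two-dice space: each of the 36 equally likely (d1, d2) pairs
# contributes probability 1/36 to the outcome d1 + d2.
two_dice = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        s = d1 + d2
        two_dice[s] = two_dice.get(s, 0) + 1 / 36

# Definition 1.36 requires the outcome probabilities to sum to 1.
assert abs(sum(two_dice.values()) - 1) < 1e-12

def pr_event(space, event):
    """Pr(E) = sum of Pr(w) over outcomes w in the event E."""
    return sum(p for w, p in space.items() if w in event)

# The event "roll at most 4" = {2, 3, 4} has probability (1 + 2 + 3)/36.
assert abs(pr_event(two_dice, {2, 3, 4}) - 6 / 36) < 1e-12
```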
Later, when we consider uncountable sample spaces, we will update the probability space definition to include a privileged collection of events, called a sigma-algebra, whose members admit probability definitions. For the countable case under discussion here, we can always take this privileged collection to be the set of all subsets of the sample space. That is, all possible events admit probability definitions through the simple expedient of summing the probabilities of their constituent outcomes.

We write $(\Omega, \Pr(\cdot))$ to denote a discrete probability space. If several spaces are under analysis at the same time, we use subscripts on the probability function to clarify its domain. For example, $(X, \Pr_X(\cdot))$ and $(Y, \Pr_Y(\cdot))$ might occur in the same discussion.

Although Definition 1.36 describes an abstract mathematical object, we need to connect it with real-world situations. For example, suppose that $C_1, C_2, \ldots$ are client requests arriving at a database server. In this context, $\Omega$ can be the number of requests in the waiting queue when client $C_n$ arrives. Individual outcomes are then natural numbers: $\omega \in \{0, 1, 2, \ldots\}$, each with an associated probability. Of course, if $n = 1$, the first client always finds an empty waiting queue, which is hardly a random outcome. For large $n$, however, the waiting queue can contain from zero to $n - 1$ previous clients, depending on the time needed to service their requests. Some requests require long service times, while others need only brief attention. Moreover, the arrival rate of clients depends on a number of external circumstances and appears random to an observer. From these variations, we expect the waiting queue size, as seen by client $n$, to exhibit random lengths. To be specific, we fix $n = 1000$ to give time for the influence of the initially empty queue to dissipate.
Under a frequency-of-occurrence interpretation, what then is the intuitive meaning of $\Pr(2)$? We imagine that the server opens for business with an empty queue at the beginning of each day. At some point in each day, client 1000 arrives and finds some number of requests in the queue. On some days, this number is zero. Sometimes it is 1; sometimes it is 2, and so forth. $\Pr(2)$ is the fraction of days that client $n$ finds two requests in the waiting queue. In general, the frequency-of-occurrence interpretation requires a repeating environment that, on each cycle, provides opportunities for the outcomes to occur. $\Pr(\omega)$ is then the fraction of cycles in which outcome $\omega$ occurs.

In the client-server context, the subset $\{\omega \mid \omega < 5\}$, or more simply $(\omega < 5)$, is an event. Its probability is $\sum_{i=0}^{4} \Pr(i)$, which measures the fraction of cycles in which client 1000 encounters fewer than five waiting requests. It is reasonable to assume that these probabilities are nearly identical for clients 1001, 1002, ....

In keeping with the interpretation that outcomes occur in a repeating environment, we extend the frequency-of-occurrence interpretation to events. If event $A$ contains outcome $\omega$, we say that $A$ occurs whenever $\omega$ occurs. For the client-server situation, the events $(\omega < 5)$, $(\omega \geq 1)$, and $(\omega = 0 \bmod 2)$ all occur simultaneously with the outcome $\omega = 4$. Also, $A \subset B$ means that if $A$ occurs, then $B$ occurs. If $AB = \phi$, the empty set, then $A$ and $B$ cannot occur simultaneously. In this case, $A$ and $B$ are said to be mutually exclusive. If $A + B$ occurs, then at least one of $A$ and $B$ must occur, because the actual outcome $\omega$ must be in one or the other. Since $A\bar{A} = \phi$, if $A$ occurs, then $\bar{A}$ cannot.

Definition 1.36 generalizes the intuitive probability of the first section. There we investigated certain counting mechanisms that enabled us to assign probabilities to events.
Specifically, the assigned probability for event $E$ was the number of outcomes favorable to $E$ divided by the total number of possible outcomes. In these cases, the probability assigned to each individual outcome was $1/N$, where $N$ was the total number of possible outcomes. The new definition merely allows arbitrary assignments to the individual outcomes, provided that each assignment is nonnegative and all assignments sum to 1. The earlier probabilities then become special cases of the current definition. For example, the set of all 5-card hands becomes a probability space with $\Pr(\omega) = 1/C_{52,5}$, for each hand $\omega$.

It is important to note that all outcomes have nonnegative probabilities. Moreover, each event $E$, being a collection of outcomes, has $0 \leq \Pr(E) \leq 1$. Because an event's probability is the sum of the probabilities associated with its constituent outcomes, we have $\Pr(E_1 + E_2 + \cdots) = \sum_i \Pr(E_i)$ when the $E_i$ are pairwise disjoint. This last observation leads to the following simple properties for probability assignments.

1.37 Theorem: Let $(\Omega, \Pr(\cdot))$ be a probability space, in which $A, B$ are events and $\phi$ is the empty event. Then

$$\begin{aligned}
\Pr(\Omega) &= 1 \\
\Pr(\phi) &= 0 \\
\Pr(\bar{A}) &= 1 - \Pr(A) \\
\Pr(A + B) &= \Pr(A) + \Pr(B) - \Pr(AB) \\
\Pr(A + B) &\leq \Pr(A) + \Pr(B).
\end{aligned}$$

PROOF: We have $\Pr(\Omega) = \sum_{\omega \in \Omega} \Pr(\omega) = 1$ directly from the definition. The disjoint union $\Omega = \Omega + \phi$ immediately gives $\Pr(\Omega) = \Pr(\Omega) + \Pr(\phi)$, from which $\Pr(\phi) = 0$ follows. Similarly, $\Omega = A + \bar{A}$ is a disjoint union. Therefore, $1 = \Pr(A) + \Pr(\bar{A})$, which implies that $\Pr(\bar{A}) = 1 - \Pr(A)$. For the remaining assertions, we note that $A + B = A\bar{B} + AB + \bar{A}B$ is the traditional disjoint union of $A$ and $B$. Moreover, $A$ and $B$ admit the following disjoint unions.

$$A = A\bar{B} + AB \qquad\qquad B = AB + \bar{A}B$$

Therefore,

$$\begin{aligned}
\Pr(A + B) &= \Pr(A\bar{B}) + \Pr(AB) + \Pr(\bar{A}B) \\
\Pr(A) &= \Pr(A\bar{B}) + \Pr(AB) \\
\Pr(B) &= \Pr(AB) + \Pr(\bar{A}B).
\end{aligned}$$
Adding the last two equations and identifying the expression for $\Pr(A + B)$ on the right side, we have

$$\Pr(A) + \Pr(B) = \Pr(A\bar{B}) + \Pr(AB) + \Pr(\bar{A}B) + \Pr(AB) = \Pr(A + B) + \Pr(AB).$$

It then follows that $\Pr(A + B) = \Pr(A) + \Pr(B) - \Pr(AB)$. Moreover, since $0 \leq \Pr(AB) \leq 1$, we also have $\Pr(A + B) \leq \Pr(A) + \Pr(B)$. ∎

A discrete probability space is, of course, a model of some repetitive process which produces outcomes $\omega$ with relative frequencies $\Pr(\omega)$. The process could be, for example, ten tosses of a fair coin. The space $\Omega$ then contains all possible 10-sequences from the set $\{H, T\}$. There are $2^{10}$ such outcomes, and they all have the same probability: $\Pr(\omega) = 1/2^{10} = 1/1024$, for all $\omega \in \Omega$. The collection of outcomes with 4 heads constitutes an event, say the event $H_4$. The probability of this event is the sum of the probabilities of its constituent outcomes: $\Pr(H_4) = C_{10,4}/1024 = 0.2051$.

1.38 Definition: Given a discrete probability space $(\Omega, \Pr(\cdot))$, a discrete random variable is a function that maps $\Omega$ into $\mathcal{R}$, the real numbers. ∎

Because we will deal only with discrete random variables for the next several chapters, the unqualified term random variable will mean discrete random variable unless otherwise indicated. Tradition demands capital letters for random variables, but we emphasize that a random variable is, in fact, a function. To further overload the notation, we also use the random variable's name to denote its range. Suppose that $X$ is a random variable. Then $X$ denotes a function in the expressions $X(\omega)$, or $X^{-1}(S)$ where $S \subset \mathcal{R}$, or $X : \Omega \to \mathcal{R}$. However, when used as a set, $X$ refers to the range of $X$. For example, suppose that we have another function that maps range($X$) into $\mathcal{R}$. Using $X$ to denote range($X$), we write $g : X \to \mathcal{R}$, and we also speak of the image of $X$ under $g$ as $g(X)$. The intended meaning should be clear from context.
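The ten-toss event probability above is small enough to confirm by brute-force enumeration of the sample space, summing the (equal) probabilities of the constituent outcomes; a minimal sketch:

```python
from itertools import product
from math import comb

# Enumerate all 2^10 equally likely toss sequences.
outcomes = list(product("HT", repeat=10))
assert len(outcomes) == 1024

# The event H4 collects the outcomes with exactly 4 heads.
h4 = [w for w in outcomes if w.count("H") == 4]
assert len(h4) == comb(10, 4)          # C(10,4) = 210 favorable outcomes

# Pr(H4) is the sum of the constituent outcome probabilities, each 1/1024.
pr_h4 = len(h4) / 1024
assert abs(pr_h4 - 0.2051) < 0.0001
```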
The next example suggests why such a twofold interpretation of $X$ is appropriate. It shows how probability transfers from events in the probability space to numbers in the random variable's range. Indeed, once this transfer is well understood, it is convenient in many cases to deemphasize the original probability space and think of the random variable's range as the relevant outcomes.

1.39 Example: Let $\Omega$ be the set of 10-sequences from $\{H, T\}$, and let $\Pr_\Omega(\omega) = 1/2^{10}$ for each $\omega \in \Omega$. This probability space models ten tosses of a fair coin. Let the random variable $H : \Omega \to \mathcal{R}$ assign $H(\omega)$ the number of heads in the sequence $\omega$. The range of $H$ is $\{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$. For $i$ in this range, $H^{-1}(i) = \{\omega \in \Omega \mid H(\omega) = i\}$ is an event in $\Omega$. We can think of the probability of this event as transferring to the number $i$ in the range. Because there are $C_{10,i}$ 10-sequences that exhibit $i$ heads, we see that this probability is $C_{10,i}/2^{10}$. If $i = 4$, for instance, the probability of $H^{-1}(4)$ is $C_{10,4}/2^{10} = 0.2051$. We can often proceed from this point as if the probability space were $H = \{0, 1, 2, \ldots, 10\}$ with probabilities as follows.

 h | Pr_H(h) = C_{10,h}/2^{10}  |   h | Pr_H(h) = C_{10,h}/2^{10}
 0 | 0.000977                   |   6 | 0.2051
 1 | 0.00977                    |   7 | 0.1172
 2 | 0.04395                    |   8 | 0.04395
 3 | 0.1172                     |   9 | 0.00977
 4 | 0.2051                     |  10 | 0.000977
 5 | 0.2461                     |     |

The table shows that the elements of $H$ are not equally likely, even though the elements of the underlying $\Omega$ all have the same probability. Note that the probabilities of the range elements sum to 1.

$$\sum_{i=0}^{10} \frac{C_{10,i}}{2^{10}} = \frac{1}{2^{10}}\sum_{i=0}^{10} C_{10,i}\,1^i 1^{10-i} = \frac{(1+1)^{10}}{2^{10}} = 1$$

In the new probability space, we speak of $\Pr_H(i)$ or $\Pr(H = i)$ to describe the probability of the event in $\Omega$ that maps to $i$. Similarly, we refer to $\Pr(H < i)$ or $\Pr(i < H < j)$ when we mean the probability of the event in $\Omega$ that maps to numbers less than $i$ or to numbers between $i$ and $j$. It frequently happens that the only events of interest are those with descriptions of this sort. If this is the case, subsequent analysis need not refer to the original probability space. Rather, the analysis works with the new space $(H, \Pr_H(\cdot))$. □

In Example 1.39, $H$ becomes a new discrete probability space because the sum of the elements' probabilities is 1. The next theorem assures us that this will always happen.

1.40 Theorem: Let $X$ be a random variable over the discrete probability space $(\Omega, \Pr_\Omega(\cdot))$. Then $(X, \Pr_X(\cdot))$ is a discrete probability space, with $\Pr_X(x) = \Pr_\Omega(\{\omega \in \Omega \mid X(\omega) = x\})$.
PROOF: Let $\{\omega_1, \omega_2, \ldots\}$ be an enumeration of $\Omega$. Then $\{X(\omega_1), X(\omega_2), \ldots\}$ sequentially elaborates all of $X$, although it may contain some repetitions. ($X$ may map several $\Omega$-elements to the same value.) If we proceed through the sequence, eliminating those elements that duplicate elements already passed, then we obtain an enumeration of $X$. This shows that $X$ is a countable set. Let $X = \{x_1, x_2, \ldots\}$ be that reduced enumeration. The $x_i$ are now all distinct. Now we claim that each $\omega \in \Omega$ belongs to precisely one of the sets $X^{-1}(x_i)$. Indeed, for $i \neq j$, Theorem A.6 asserts $X^{-1}(x_i) \cap X^{-1}(x_j) = X^{-1}(\{x_i\} \cap \{x_j\}) = X^{-1}(\phi) = \phi$. So no $\omega$ can belong to more than one $X^{-1}(x_i)$. Also, if $\omega \in \Omega$, then $X(\omega) = x_i$, for some $i$, which places $\omega \in X^{-1}(x_i)$. It then follows that

$$\sum_i \Pr_X(x_i) = \sum_i \Pr_\Omega(X^{-1}(x_i)) = \sum_{\omega \in \Omega} \Pr_\Omega(\omega) = 1$$

because $X^{-1}(x_1), X^{-1}(x_2), \ldots$ are disjoint subsets whose union is $\Omega$. ∎

In the case where the underlying $\Omega$ consists of numerical outcomes, we can define a random variable with the function $X(\omega) = \omega$. The random variable then retains the same probability distribution as the underlying space. Another consequence of the theorem is that a function of a random variable is another random variable. $Y = g(X)$, for example, maps the range of $X$ to another countable set of real numbers, and the probability transfers from the elements of $X$ to the new range.
That is, $\Pr_Y(Y = y) = \Pr_X(g^{-1}(\{y\}))$. If $g(t) = t^2 - 3$, for instance, we speak interchangeably of the transformed random variable as $g(X)$ or as $X^2 - 3$. A functional expression in a random variable is simply a new random variable.

In view of Theorem 1.40, most analyses start with a statement that a random variable has a particular probability distribution and continue without further reference to the underlying process that generates the actual outcomes. Further consequences depend only on the probabilities attached to the elements in the random variable's range. In Example 1.39, for instance, we say that the random variable $H$ has a binomial distribution as given by the table in that example. The next chapter treats the binomial and other discrete distributions in detail. In circumstances where the underlying outcome space is unknown, we simply work with the transferred probabilities: $\Pr_X : X \to [0,1]$. In this case, it is sometimes convenient to extend the probability function to all of $\mathcal{R}$. We do so by defining $\Pr_X(y) = 0$ for all $y$ outside range($X$). We then have $\Pr_X : \mathcal{R} \to [0,1]$, but the probability function is zero except on a countable subset of $\mathcal{R}$.

1.41 Example: Let $\Omega$ be the collection of possible 5-card hands from a standard card deck. Assume that all hands are equally likely. That is, assume that $\Pr_\Omega(\omega) = 1/C_{52,5}$, for each $\omega \in \Omega$. Let $X$ be the random variable defined by $X(\omega)$ equals 2 times the number of aces plus the number of kings in the hand $\omega$. If, for instance, $\omega$ contains 1 ace and 1 king, then $X(\omega) = 3$. What is the probability distribution of $X$?

The largest value that $X$ can assume is 9, which occurs when $\omega$ contains 4 aces and a king. The smallest value is 0, which occurs when $\omega$ contains no aces or kings. We must first tabulate the events in $\Omega$ that correspond to the inverse images of the numbers 0 through 9.
The techniques of Section 1.1.1 are available to calculate the probabilities of all these inverse images. The results appear in Table 1.8. As a sample calculation, consider the case where $X = 2$. This can arise from two disjoint kinds of hands: those with no aces and two kings and those with 1 ace and no kings. We calculate these values separately. For the case with no aces and two kings, we use the multiplicative counting principle to break the total hands into three successive categories: those that differ in the choice of aces, those that differ in the choice of kings, and those that differ in the final choice from the remaining 44 cards. So the number of hands with no aces and two kings is $C_{4,0}C_{4,2}C_{44,3}$. This gives 79464 hands with no aces and two kings. Similarly, we calculate that 543004 hands have 1 ace and no kings. So $X = 2$ occurs in $79464 + 543004 = 622468$ hands out of $C_{52,5} = 2598960$ possible hands, for a probability of $622468/2598960 = 0.2395066$. The final column of Table 1.8 thus gives the probability distribution of the random variable $X$. □

1.42 Definition: Let $f : \mathcal{R} \to \mathcal{R}$. Let $X$ be a random variable with range $x_1, x_2, \ldots$. The expected value of $f(\cdot)$, denoted $E[f(\cdot)]$ or simply $E[f]$, is the sum $E[f(\cdot)] = \sum_i f(x_i)\Pr_X(x_i)$, provided that the sum converges absolutely. ∎

A series $\sum_i a_i$ converges absolutely if $\sum_i |a_i|$ converges, and we insist that the defining sum in the expected value computation possess this property. Sections A.2 and A.3 in Appendix A review convergence and absolute convergence of series. In view of Theorem A.45, this condition allows the summation to proceed over the range of $X$ in any order. Also note that the definition depends on the random variable in question. If there is any ambiguity, we write $E_X[f(\cdot)]$ to clarify that it is the range of $X$ and the corresponding probabilities that appear in the sum. The following theorem gives two useful properties of the expected value operator.
50 Chapter 1: Combinatorics and Probability l (x) components aces kings X X 0 1 2 3 4 5 6 7 8 9 0 0 0 1 0 1 0 1 2 1 2 1 2 3 2 3 3 4 4 0 1 2 0 3 1 4 2 0 3 1 4 2 0 3 1 2 0 1 Size of components Prn(*-1(*)) = P r ( X = x) C 4 ,oC4,oC 4 4,5 C4,oC 4 ,iC44,4 C 4 ,oC4,2C 4 4,3 C4,iC 4 ,oC44,4 C4,oC*4,3(^44,2 C4,iC 4 ,iC44,3 C4,oC4,4C44,l CV1C4,2(^44,2 = = = = = = 1086008 543004 79464 543004 3784 211904 = 44 = 22704 0.4178625 0.2089313 C4,2C4,oC44,3 C 4 l iC 4 ,3C 4 4 ,i C 4 , 2 C 4 ,iC 4 4 ,2 C 4 ,iC 4 , 4 C 4 4 ,o C 4 l 2C 4 ,2C 4 4 ,i C4,3CI4,oC44,2 C 4 ,2C' 4 ,3C 44l o C 4 ,3C 4 ,iC 4 4 ,i C4,3C,4,2C,4410 C 4l4 C 4 ,o£? 44 ,l C4,4C4,lC44,0 = 79464 = 704 =22704 = 4 = 1584 = 3784 = 24 = 704 = 24 = 44 = 4 0.0393280 0.2395066 0.0829901 0.0090067 0.0020670 0.0002801 0.0000262 0.0000015 T A B L E 1.8. Random variable X is 2 times aces plus kings in a 5-card hand (see Example 1.41) 1.43 Theorem: E[·] is a linear operator. That is, if E[f] and E[g] exist, we have, for constants a, b, E[af + bg] = aE[f] + bE[g}. Also, if h(t) = K, a constant, then E[h) = K. PROOF: Let X be the random variable in question, and let χι,Χί,... enumerate its range. Since E[f] and E[g] exist, their defining sums converge absolutely. Let 00 00 i=l i=l A = Σ l/(^)lPr(*i) and B = £ ^(ιΟΙΡφί). By the triangle inequality, \af(x,) + όρ(^)| < \af{Xi)\ + \bg(Xi)\ = \a\ ■ \f(Xi)\ + \b\ ■ \g(Xi)\. Consequently, n ΣΜΜ+Μχ^ΡτΜ t=l n n < |α| 5 3 |/(χί)|ΡΓ(χ0 + |b| ^ ) |s(*i)|Pr(ari) i=l t=l which shows that the defining sum for E[af + bg] converges absolutely. Con- Section 1.3: Probability spaces and random variables 51 sequently, E[af + bg] exists. Therefore, E[af + bg) = ^ ( ο / ( ϋ ) + bg{Xi)) ■ P r ( n ) i = a Σ fixiWxi) + b Σ g(xi)Pr{Xi) = a £ [ / ] + bE[g\. t i Also, if h{t) = K, then Ε[Λ(Χ)] = Σί κ ' P r ( ^ ) = ^ Σ , Ρ Φ Ο = Λ '· ■ For simple polynomial functions, such as /(<·) = t,g(t) = t2, we frequently write the corresponding expression of the random variable as the argument of the expectation operator. 
That is, we write E[X] for E[f] and E[X²] for E[g]. Slightly more complicated expressions also appear in this abbreviated notation, such as E[(X − a)²] for E[h], where h(t) = (t − a)².

When you know the probability distribution of a random variable, you theoretically know everything about it. Practically, however, the distribution is not as easy to manipulate as certain summary values. Two of these values are the mean and variance, which measure, respectively, the central value, around which the probabilities balance, and the dispersion about that balance point.

1.44 Definition: Let X be a random variable with probability distribution Pr(·). The mean of X, denoted μ or μ_X, and the variance of X, denoted σ² or σ²_X, are defined as follows, provided that the defining expected values exist.

    μ = E[X] = Σ_i x_i · Pr(x_i)
    σ² = E[(X − μ)²] = Σ_i (x_i − μ)² · Pr(x_i),

where x_1, x_2, ... is the range of X. The square root of the variance, denoted σ or σ_X, is the standard deviation of the random variable. ■

Using Theorem 1.43, we can derive a more useful formula for the variance.

    σ² = E[(X − μ)²] = E[X²] − 2μE[X] + μ² = E[X²] − 2μ² + μ² = E[X²] − μ² = (Σ_i x_i² · Pr(x_i)) − μ²    (1.7)

1.45 Example: What are the mean and variance of the random variable X in Example 1.41? Table 1.9 excerpts the probability data from the last column of Table 1.8 and tabulates the terms needed for the mean and variance calculations. From the summations in the table's last line, we calculate μ = 1.154 and σ² = 2.858 − (1.154)² = 1.526. The standard deviation is √1.526 = 1.235. Figure 1.7 shows how the mean represents a balance point for the probabilities, in the sense that the x-axis balances at the mean when loaded at the points x_1, x_2, ... with weights proportional to the corresponding probabilities. The figure also shows two boundary lines, at the mean plus or minus two times the standard deviation; the significance of these lines will become clear after the next two theorems. □
 x_i   Pr(X = x_i)   x_i · Pr(X = x_i)   x_i² · Pr(X = x_i)
  0    0.4178625     0.0                 0.0
  1    0.2089313     0.2089313           0.2089313
  2    0.2395066     0.4790132           0.9580264
  3    0.0829901     0.2489703           0.7469109
  4    0.0393280     0.1573120           0.6292480
  5    0.0090067     0.0450335           0.2251675
  6    0.0020670     0.0124020           0.0744120
  7    0.0002801     0.0019607           0.0137249
  8    0.0000262     0.0002096           0.0016768
  9    0.0000015     0.0000135           0.0001215
  Σ    1.0000000     1.1538461           2.8582193

TABLE 1.9. Sample calculation of the mean and variance (see Example 1.45)

Figure 1.7. The mean as the balance point of probability weights (see Example 1.45)

1.46 Example: Over a wide range of conditions, the operator of an airline reservation terminal has determined that the response time T, rounded to the nearest 1/2 second, follows the distribution given in the first two columns of the table below. The operator obtained these data by appealing to the frequency-of-occurrence meaning of probability. The data reflect the fraction of responses that entail a wait of 0.0, 0.5, ..., 4.0 seconds. What are the mean and variance of T?

  t    Pr(t)   t · Pr(t)   t² · Pr(t)
 0.0   0.15    0.0000      0.0000
 0.5   0.05    0.0250      0.0125
 1.0   0.20    0.2000      0.2000
 1.5   0.30    0.4500      0.6750
 2.0   0.05    0.1000      0.2000
 2.5   0.05    0.1250      0.3125
 3.0   0.10    0.3000      0.9000
 3.5   0.05    0.1750      0.6125
 4.0   0.05    0.2000      0.8000
  Σ    1.00    1.5750      3.7125

The summary line of the third column gives the mean, μ = Σ_i t_i Pr(T = t_i) = 1.5750. The corresponding entry at the bottom of the last column gives E[T²]. The variance is σ² = E[T²] − μ² = 3.7125 − (1.5750)² = 1.232. □

If Y = f(X), we can compute the mean and variance of Y in the probability space of range(Y) or of range(X). Let y_1, y_2, ... be an enumeration of range(Y), and let x_1, x_2, ... enumerate range(X). Range(X) splits into disjoint sets: those elements that map to y_1, those that map to y_2, and so forth.
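The tabular calculations of Examples 1.45 and 1.46 amount to accumulating x·Pr(x) and x²·Pr(x) and applying formula (1.7). A short Python sketch (the function name is ours), run here on the response-time distribution of Example 1.46:

```python
# Mean and variance of a finite discrete distribution via Eq. (1.7):
#   mu = sum of x*Pr(x),   sigma^2 = (sum of x^2*Pr(x)) - mu^2.
def mean_var(dist):
    mu = sum(x * p for x, p in dist.items())
    ex2 = sum(x * x * p for x, p in dist.items())
    return mu, ex2 - mu * mu

# Response-time distribution of Example 1.46.
T = {0.0: 0.15, 0.5: 0.05, 1.0: 0.20, 1.5: 0.30, 2.0: 0.05,
     2.5: 0.05, 3.0: 0.10, 3.5: 0.05, 4.0: 0.05}
mu, var = mean_var(T)   # matches the table: mu = 1.5750, var = 1.2319
```

The same function applied to Table 1.9's distribution reproduces μ = 1.154 and σ² = 1.526.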
These sets are, of course, f⁻¹(y_1), f⁻¹(y_2), ....

    μ_Y = E_Y[Y] = Σ_i y_i Pr_Y(y_i) = Σ_i y_i Pr_X(f⁻¹(y_i)) = Σ_i y_i Σ_{j | f(x_j) = y_i} Pr_X(x_j)
        = Σ_i Σ_{j | f(x_j) = y_i} f(x_j) Pr_X(x_j) = Σ_j f(x_j) Pr_X(x_j) = E_X[f]

The last line follows because the separate sums over the partition components f⁻¹(y_1), f⁻¹(y_2), ... total precisely the sum over range(X). A similar computation gives

    σ²_Y = E_Y[(Y − μ_Y)²] = E_X[(f(·) − μ_Y)²] = E_Y[Y²] − μ²_Y = E_X[f²] − (E_X[f])².

1.47 Example: Suppose that X has the probability distribution below, and let Y = f(X) = 2X² + 6. Find the distribution of Y and its mean and variance.

  x          −2     −1     0      1      2
 Pr(X = x)   0.25   0.20   0.15   0.35   0.05

Under the specified mapping, Y assumes the values 14, 8, and 6. Below we tabulate the inverse images and associated probabilities, which give rise to the mean and variance that follow.

    f⁻¹(14) = {−2, 2} → Pr(Y = 14) = Pr(X = −2) + Pr(X = 2) = 0.30
    f⁻¹(8) = {−1, 1} → Pr(Y = 8) = Pr(X = −1) + Pr(X = 1) = 0.55
    f⁻¹(6) = {0} → Pr(Y = 6) = Pr(X = 0) = 0.15

    μ_Y = 14(0.30) + 8(0.55) + 6(0.15) = 9.50
    σ²_Y = 14²(0.30) + 8²(0.55) + 6²(0.15) − 9.50² = 9.15

Equivalently, we can compute the mean and variance of Y = f(X) using the X distribution.

    μ_Y = f(−2)(0.25) + f(−1)(0.20) + f(0)(0.15) + f(1)(0.35) + f(2)(0.05)
        = 14(0.25) + 8(0.20) + 6(0.15) + 8(0.35) + 14(0.05) = 9.50
    σ²_Y = [f(−2) − 9.5]²(0.25) + [f(−1) − 9.5]²(0.20) + [f(0) − 9.5]²(0.15) + [f(1) − 9.5]²(0.35) + [f(2) − 9.5]²(0.05)
         = (14 − 9.5)²(0.25) + (8 − 9.5)²(0.20) + (6 − 9.5)²(0.15) + (8 − 9.5)²(0.35) + (14 − 9.5)²(0.05) = 9.15 □

Suppose that a random variable X assumes values x_1, x_2, ... with corresponding probabilities p_1 = Pr(x_1), p_2 = Pr(x_2), .... If we repeat the underlying random process a large number of times, say N, and let N_1, N_2, ... be the number of times that X generates, respectively, the values x_1, x_2, ..., then we expect N_1/N ≈ p_1, N_2/N ≈ p_2, and so forth. So the average X value over the course of the N trials is

    (1/N) Σ_k x_k N_k = Σ_k x_k (N_k/N) ≈ Σ_k x_k p_k = E[X] = μ.
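The frequency-of-occurrence connection between the mean and the long-run average is easy to illustrate by simulation. A sketch (the sample size and seed are arbitrary choices of ours): draw many values of X from the distribution of Example 1.47 and average Y = 2X² + 6.

```python
# Simulate N draws from the distribution of Example 1.47 and compare the
# empirical average of Y = 2X^2 + 6 with the mean mu_Y = 9.50.
import random

random.seed(1)
xs = [-2, -1, 0, 1, 2]
ps = [0.25, 0.20, 0.15, 0.35, 0.05]
N = 200_000
draws = random.choices(xs, weights=ps, k=N)
avg_y = sum(2 * x * x + 6 for x in draws) / N
# avg_y lands close to 9.50; the deviation shrinks roughly as 1/sqrt(N)
```

The observed average approximates μ_Y precisely because each value's relative frequency N_k/N approximates its probability p_k, which is the argument made above.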
In this sense, the mean is frequently called the average value of the random variable. The frequency-of-occurrence interpretation provides a strong connection between the mean, which is a mathematical concept, and the average over a long sequence of observations, which is a real-world construction. This connection enables many applications of probability to real-world problems, and we consequently want no ambiguity about the mean. In particular, we want the same mean regardless of how the defining sum is executed, and this invariance comes with the insistence on absolute convergence. The next example illustrates the difficulties that can arise when absolute convergence does not hold. It uses certain properties of convergent series, which may call for a review of Section A.3 in Appendix A.

1.48 Example: Note that

    Σ_{i=1}^n 1/i² = 1 + Σ_{i=2}^n 1/i² ≤ 1 + Σ_{i=2}^n 1/(i(i−1)) = 1 + Σ_{i=2}^n (1/(i−1) − 1/i) = 1 + (1 − 1/n) < 2.

These partial sums form a nondecreasing sequence with upper bound 2. Hence the sum converges. Let A = Σ_{i=1}^∞ 1/i². (A is actually π²/6, but that detail is not necessary for this example.) Now consider the random variable X with distribution Pr(X = i) = 1/(2Ai²), for integer i ≠ 0. Understanding that i = 0 is to be omitted, we sum these weights as follows.

    lim_{M,N→∞} Σ_{i=−M}^{N} 1/(2Ai²) = (1/(2A)) lim_{M,N→∞} (Σ_{i=1}^{M} 1/i² + Σ_{i=1}^{N} 1/i²) = (1/(2A))(A + A) = 1

Thus we have a proper discrete probability distribution. Consider now the expression for the mean, again with the understanding that i = 0 is omitted from the sum.

    μ_X = lim_{M,N→∞} Σ_{i=−M}^{N} i/(2Ai²) = (1/(2A)) lim_{M,N→∞} (Σ_{i=1}^{N} 1/i − Σ_{i=1}^{M} 1/i)

This sum has no definitive limit. (Appendix A investigates this particular sum in the argument accompanying Figure A.1.) By carefully choosing interleaving spans of positive terms with compensating spans of negative terms, we can drive the partial sum back and forth across any prespecified interval.
That is, we can force the partial sums to oscillate continually, with infinitely many high points greater than 1 and infinitely many low points less than −1. The limit of such an oscillating sequence of partial sums does not exist. Since the terms 1/i approach zero in magnitude, each positive or negative excursion uses an increasingly large span of terms, but there are always infinitely many remaining to continue the process. We can even force the partial sums to converge to an arbitrary number. For example, we sum a span of positive terms to drive the partial sum to a value in (42, 43). We follow with a span of negative terms to force it back into the interval (41, 42). Alternating positive spans with negative ones, we successively force the partial sum into the ranges (42, 42 + 1), (42 − 1, 42), (42, 42 + 1/2), (42 − 1/2, 42), (42, 42 + 1/4), (42 − 1/4, 42), .... These partial sums converge to 42.

It is hardly helpful to have a mean when the value depends on the summation order. We intend to interpret the mean in real-world applications as the average value over many observations. It would be rather artificial to suggest that we sum a run of positive observations, balance it with a run of negative observations, and continue in this manner to achieve some prespecified fraction. Rather, we must take the observations as they occur. We are trying to describe and explain the nondeterministic process; we are not trying to control it. The ambiguous behavior of the expected value operation in this example arises because the sum is not absolutely convergent:

    lim_{M,N→∞} Σ_{i=−M}^{N} |i|/(2Ai²) = (1/(2A)) lim_{M,N→∞} (Σ_{i=1}^{M} 1/i + Σ_{i=1}^{N} 1/i) = ∞.

In cases like this, we say that the mean does not exist. It is clear that the variance does not exist either, because the defining sum for E[X²] is not absolutely convergent. □

The variance measures dispersion about the mean, with a larger variance corresponding to a larger dispersion.
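The rearrangement argument of Example 1.48 can be demonstrated concretely. The sketch below (ours; the target value and step count are arbitrary, and the constant factor 1/(2A) is dropped since it only rescales the target) greedily interleaves positive terms 1/i and negative terms −1/i so that the partial sums approach a chosen target.

```python
# Rearrange the conditionally convergent series of terms ±1/i so that its
# partial sums approach an arbitrary target, as in Example 1.48.
# The 1/(2A) factor from the distribution is omitted; it merely rescales.

def rearranged_partial_sum(target, steps):
    total = 0.0
    pos, neg = 1, 1            # next unused positive / negative index
    for _ in range(steps):
        if total <= target:    # below target: spend a positive term 1/pos
            total += 1.0 / pos
            pos += 1
        else:                  # above target: spend a negative term -1/neg
            total -= 1.0 / neg
            neg += 1
    return total

s = rearranged_partial_sum(3.0, 100_000)
# s hovers near 3.0; each crossing overshoots by at most the last term used,
# and those terms shrink toward zero as both indices grow
```

Every term of the series is used exactly once, yet the limit is whatever target we choose: precisely the ambiguity that absolute convergence rules out.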
For example, we can construct the probability distribution of a random variable X by placing probability 1/2 at x = 1 and 1/2 at x = −1. The mean and variance are, respectively, (1)(1/2) + (−1)(1/2) = 0 and (1 − 0)²(1/2) + (−1 − 0)²(1/2) = 1. Now construct a competing random variable Y by placing probability 1/2 at y = 10 and 1/2 at y = −10. The mean is again (10)(1/2) + (−10)(1/2) = 0, but the variance is now much larger: (10 − 0)²(1/2) + (−10 − 0)²(1/2) = 100. To continue the mechanical analogy, if the mean corresponds to the center of gravity, the balance point, then the variance corresponds to the moment of inertia. The following theorems show how the variance places limits on how far from the mean we can find significant probability masses.

1.49 Theorem: (Markov's inequality) Let X be a nonnegative random variable with mean μ. Then, for x > 0, Pr(X ≥ x) ≤ μ/x.
PROOF: Let x_1, x_2, ... be an enumeration of range(X). By assumption, x_i ≥ 0 for all i.

    μ = E[X] = Σ_i x_i Pr(x_i) = Σ_{i | x_i < x} x_i Pr(x_i) + Σ_{i | x_i ≥ x} x_i Pr(x_i)
A smaller value results if we omit the first summation.

    μ ≥ Σ_{i | x_i ≥ x} x_i Pr(x_i) ≥ x (Σ_{i | x_i ≥ x} Pr(x_i)) = x · Pr(X ≥ x)

Dividing by x establishes the theorem. ■

1.50 Theorem: (Chebyshev's inequality) Let X be a random variable with mean μ and standard deviation σ > 0. Then, for any r > 0, Pr(μ − rσ < X < μ + rσ) ≥ 1 − 1/r².
PROOF: The result is, of course, trivial for r ≤ 1. In this case 1 − 1/r² ≤ 0, and the probability of any event must be nonnegative. Hence, we expect interesting applications of this theorem to involve r > 1. In any case, let Y = (X − μ_X)²/σ²_X. Then

    μ_Y = E_Y[Y] = E_X[(X − μ_X)²/σ²_X] = (1/σ²_X) E_X[(X − μ_X)²] = 1.

Because Y ≥ 0, we can apply Markov's inequality to obtain Pr_Y(Y ≥ r²) ≤ 1/r². Then

    1/r² ≥ Pr_Y(Y ≥ r²) = 1 − Pr_Y(Y < r²) = 1 − Pr_X((X − μ_X)² < (rσ_X)²) = 1 − Pr_X(μ_X − rσ_X < X < μ_X + rσ_X).

Transposing terms gives the expression asserted by the theorem. ■

1.51 Example: Continuing Examples 1.41 and 1.45, we verify Markov's and Chebyshev's inequalities. Recall that we compute the random variable X as twice the number of aces plus the number of kings in a five-card hand. We have computed μ = 1.154 and σ = 1.235. Reading the required probabilities from Table 1.8, we construct the following probability values and their Markov bounds.

  x   Pr(X ≥ x)   bound (μ/x)        x   Pr(X ≥ x)   bound (μ/x)
  1   0.5821375   1.154              6   0.0023748   0.192
  2   0.3732062   0.577              7   0.0003078   0.165
  3   0.1336996   0.385              8   0.0000277   0.144
  4   0.0507095   0.289              9   0.0000015   0.128
  5   0.0113815   0.231             10   0           0.115

As you can see, the actual Pr(X ≥ x) is always less than the corresponding Markov bound, μ/x. Indeed, it is usually very much less, which suggests that the Markov bound is not very precise. The Chebyshev inequality states that Pr(μ − 2σ < X < μ + 2σ) = Pr(−1.316 < X < 3.624) ≥ 1 − (1/2)² = 0.75. A glance at Figure 1.7 shows that this span includes X = 0, 1, 2, 3, for a total probability of 0.949, which again beats the bound by a large margin. Despite this example, the Chebyshev bound cannot be improved in the general case. One of the exercises pursues this idea. □

Figure 1.8. A cumulative distribution function

We introduce one final aspect of a random variable in this section, the cumulative distribution function. This new description of the random variable turns out to be equivalent to its probability distribution. However, it is more useful in certain circumstances, such as the computer simulations of Chapter 3.

1.52 Definition: Let X be a random variable with range x_1 < x_2 < x_3 < .... The cumulative distribution function of X, denoted F(x) or F_X(x), is defined for all real x by F(x) = Pr(X ≤ x) = Σ_{i | x_i ≤ x} Pr(x_i).
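The checks of Example 1.51 can be automated. The sketch below (ours) rebuilds the distribution of Table 1.8 from the hypergeometric counts, then confirms Markov's bound at every x and Chebyshev's bound at r = 2.

```python
# Verify Markov's and Chebyshev's inequalities for the card-hand variable
# of Examples 1.41/1.45/1.51: X = 2*(aces) + (kings) in a 5-card hand.
from math import comb, sqrt

total = comb(52, 5)
pmf = {x: 0.0 for x in range(10)}
for a in range(5):                        # aces in the hand (at most 4)
    for k in range(min(4, 5 - a) + 1):    # kings in the hand
        ways = comb(4, a) * comb(4, k) * comb(44, 5 - a - k)
        pmf[2 * a + k] += ways / total

mu = sum(x * p for x, p in pmf.items())                       # 1.154
sigma = sqrt(sum(x * x * p for x, p in pmf.items()) - mu**2)  # 1.235

# Markov: Pr(X >= x) <= mu/x for every x > 0.
for x in range(1, 10):
    tail = sum(p for xi, p in pmf.items() if xi >= x)
    assert tail <= mu / x

# Chebyshev, r = 2: Pr(mu - 2*sigma < X < mu + 2*sigma) >= 0.75.
inside = sum(p for xi, p in pmf.items()
             if mu - 2 * sigma < xi < mu + 2 * sigma)
assert inside >= 0.75   # the actual probability is about 0.949
```

As the example observes, both bounds hold with plenty of room to spare on this distribution, even though neither can be tightened in general.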
■ If, for instance, X has the probability distribution in the third row below, we can calculate F(x_i) by summing the values at or to the left of x_i, as shown in the fourth row.

  i            1     2      3      4      5      6      7      8
  x_i          2     3      5      8      10     12     13     15
  Pr(X = x_i)  0.1   0.15   0.2    0.05   0.1    0.08   0.12   0.2
  F(x_i)       0.1   0.25   0.45   0.50   0.60   0.68   0.80   1.00

For other values of x, F(x) remains constant at the value of the largest x_i that does not exceed x. That is, F(x) = F(x_i) for x_i ≤ x < x_{i+1}. Also, F(x) = 0 for x < x_1, and if there is a largest value in the range of X, say x_n, then F(x) = 1.0 for x ≥ x_n. Figure 1.8 shows this staircase behavior of F(x) for the current example. Note that F(x) is continuous from the right, even at the jump points. That is, lim_{y→x+} F(y) = F(x). Recall that the notation y → x+ means that y approaches x through values greater than x. Except at points in range(X), F(x) is also continuous from the left.

1.53 Example: Suppose that a disk server provides file access services for a collection of clients. The server takes a fixed 10 milliseconds to read or to write a disk sector, which is a fixed-length character string. The server maintains a buffer, which can hold at most two requests. Therefore, if client requests arrive during the 10-millisecond interval while the server is busy with a prior request, the new requests, up to a maximum of 2, will not be lost. Empirical studies for this particular client-server arrangement indicate that the probability of k client requests arriving during any 10-millisecond interval is given by (0.9)^k e^{−0.9}/k!. What are the mean and variance of this distribution? What is the probability that the server will lose a client request if it starts its 10-millisecond service interval with an empty buffer? What size buffer would reduce this probability below 0.05? Let N be the random variable that counts the number of arrivals during a 10-millisecond interval.
We then have Pr(N = k) = (0.9)^k e^{−0.9}/k!. Since Σ_{k=0}^∞ z^k/k! is the expansion for e^z, we see that

    Σ_{k=0}^∞ Pr(N = k) = e^{−0.9} Σ_{k=0}^∞ (0.9)^k/k! = e^{−0.9} e^{0.9} = 1.

Therefore, N is indeed a discrete random variable with range 0, 1, 2, .... Computation of the mean and variance uses the summation techniques of the preceding section.

    μ = Σ_{k=0}^∞ k Pr(N = k) = e^{−0.9} Σ_{k=1}^∞ (0.9)^k/(k−1)! = (0.9) e^{−0.9} Σ_{k=1}^∞ (0.9)^{k−1}/(k−1)! = (0.9) e^{−0.9} e^{0.9} = 0.9

    σ² = E[N²] − μ² = (Σ_{k=0}^∞ k² Pr(N = k)) − (0.9)²
       = −(0.9)² + e^{−0.9} Σ_{k=1}^∞ k (0.9)^k/(k−1)!
       = −(0.9)² + (0.9) e^{−0.9} Σ_{k=1}^∞ ((k−1) + 1)(0.9)^{k−1}/(k−1)!
       = −(0.9)² + (0.9)² e^{−0.9} Σ_{k=2}^∞ (0.9)^{k−2}/(k−2)! + (0.9) e^{−0.9} Σ_{k=1}^∞ (0.9)^{k−1}/(k−1)!
       = −(0.9)² + (0.9)² + (0.9) = 0.9

From the formula Pr(N = k) = (0.9)^k e^{−0.9}/k!, we can calculate the cumulative distribution function, F(x), at the first several jump points. To four decimal places, the probability is zero for k > 6.

  k           0        1        2        3        4        5        6
  Pr(N = k)   0.4066   0.3659   0.1647   0.0494   0.0111   0.0020   0.0003
  F(k)        0.4066   0.7725   0.9372   0.9866   0.9977   0.9997   1.0000

Since the server starts its service interval with an empty buffer, it will not lose a client request as long as no more than 2 such requests arrive in the 10-millisecond interval. The probability of 2 or fewer arrivals is F(2) = 0.9372. Hence, the probability is 0.9372 that the server will not lose a request. Equivalently, the probability is 1.0 − 0.9372 = 0.0628 that it will lose a request.

In the sense that the mean of a random variable reflects its average value, we expect an average of 0.9 requests to arrive while the server is busy. Because fewer than 1 request arrives, on the average, during the time necessary to service a prior request, we see that the server has a chance of staying ahead of its clients. Moreover, there are enough buffers to cover the average number of incoming requests during the service interval. Nevertheless, there is still a 6.3% chance of losing a request through buffer overflow. To decrease this value below 0.05, we must be able to buffer sufficient requests, say n, such that
To decrease this value below 0.05, we must be able to buffer sufficient requests, say n, such that Section 1.3: Probability spaces and random variables 59 F(n) > 0.95. Consulting our tabulation above, we see that we must have n > 3. That is, we need 1 additional buffer. Of course, if the server starts its 10-millisecond cycle with a partially filled buffer, it will incur a higher probability of an overflow. Analysis of that situation will have to wait until we have developed more tools. The probability distribution of this example is called a Poisson distribution, and it occurs frequently in client-server performance calculations. The next chapter takes up the Poisson distribution in detail. It also develops less tedious methods for determining the mean and variance. Q Exercises 1.42 Let ( Ω , Ρ Γ ( · ) ) be a discrete probability space, in which A, B,C are events. Prove the following expression of the probability of the union, and deduce the proper expression for Pr(U"=1.<4j). Ρτ(Α + B + C) = Pt(A) + Pr(B) + Pr(C) -Pi(AB) - PT(AC) - Pi(BC) + Pr(ABC). * 1.43 Suppose that Ai C A2 C A3 C · · · is an event sequence. Prove that Pr(U~ =1 i4„) = lim,,-*«, Pt(An). *1.44 Suppose that B\ D B2 D B3 D · · · is an event sequence. Prove that Pr(n~ = 1 B n ) = l i m n - ^ P r i B n ) . 1.45 Two dice are rolled. Let X be the sum of the spots on the upturned faces. What is the range of X? What are the corresponding probabilities? What are μ and σ 2 ? 1.46 You receive 5 cards from a well-shuffled deck. Let X be twice the number of queens plus the number of jacks in the hand. What is probability distribution of X? What are μ and σ 2 ? What is the relationship between this problem and that discussed in Examples 1.41 and 1.45? 1.47 As in the preceding problem, you receive 5 cards from a well-shuffled deck. The hand contains nq queens and n, jacks. Let Y = (nq - n,) 2 . What is the probability distribution of Y? 
What are μ_Y and σ²_Y?

1.48 In 5 tosses of a fair coin, let X be the number of heads minus the number of tails. What is the probability distribution of X? What is the cumulative distribution function? What are μ and σ²?

1.49 In a game, you receive three cards from a well-shuffled deck. You then receive $10 if the hand contains an ace and a face card. The face cards are the kings, queens, and jacks. All cards are then returned to the deck before the next game begins. How much would you be willing to pay, per game, to play a large number of hands?
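The buffer-sizing computation of Example 1.53 reduces to accumulating the Poisson probabilities until the cumulative distribution first reaches the desired threshold. A sketch (function names are ours):

```python
# Buffer sizing for Example 1.53: find the smallest n with F(n) >= 0.95,
# where Pr(N = k) = (0.9**k) * exp(-0.9) / k!  (Poisson, mean 0.9).
from math import exp, factorial

def poisson_pmf(k, lam=0.9):
    return lam**k * exp(-lam) / factorial(k)

def smallest_buffer(threshold=0.95, lam=0.9):
    cdf, n = 0.0, 0
    while True:
        cdf += poisson_pmf(n, lam)   # cdf is now F(n)
        if cdf >= threshold:
            return n, cdf
        n += 1

n, cdf = smallest_buffer()
# n = 3: F(2) = 0.9372 falls short of 0.95, while F(3) = 0.9866 suffices,
# matching the conclusion that one additional buffer slot is needed
```

The same loop answers any threshold or arrival rate; this kind of cumulative-sum search reappears in the simulation techniques of Chapter 3.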
