Log in using OpenID

This page intentionally left blank
Probability and Statistics
for Computer Science
This page intentionally left blank
Probability and Statistics
for Computer Science
Western Washington University
Copyright © 2003, 2008 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hobokcn, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without cither the prior
written permission of the Publisher, or authorization through payment of the appropriate pcr-copy fee to
the Copyright Clearance Center, Inc., 222 Rosewood Drive. Danvcrs, MA 01923, (978) 750-8400, fax
(978) 750-4470, or on the web at Requests to the Publisher for permission should
be addressed to the Permissions Department, John Wiley & Sons. Inc., 111 River Street, Hobokcn, NJ
07030, (201) 748-6011, fax (201) 748-6008, or online at
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategics contained herein may not be
suitable for your situation. You should consult with a professional where appropriate. Neither the
publisher nor author shall be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States at
(317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may
not be available in electronic format. For information about Wiley products, visit our web site at
Library of Congress Cutaloging-in-Publication Data is available.
ISBN 978-0-470-38342-1
10 9 8 7 6 5 4
3 2 I
To my muses
Shelley, Jessica, and Jimmy
This page intentionally left blank
Combinatorics and Probability
1.1 Combinatorics
1.1.1 Sampling without replacement
1.1.2 Sampling with replacement
1.2 Summations
1.3 Probability spaces and random variables
1.4 Conditional probability
1.5 Joint distributions
1.6 Summary
Discrete Distributions
2.1 The Bernoulli and binomial distributions
2.2 Power series
2.3 Geometric and negative binomial forms
2.4 The Poisson distribution
2.5 The hypergeometric distribution
2.6 Summary
3.1 Random number generation
3.2 Inverse transforms and rejection filters
3.3 Client-server systems
3.4 Markov chains
3.4.1 Irreducible aperiodic Markov chains
3.4.2 Convergence properties
3.5 Summary
Discrete Decision Theory
4.1 Decision methods without samples
4.2 Statistics and their properties
4.3 Sufficient statistics
4.4 Hypothesis testing
4.4.1 Simple hypothesis versus simple alternative
4.4.2 Composite hypotheses
Real Line-Probability
5.1 One-dimensional real distributions
5.2 Joint random variables
5.3 Differentiable distributions
5.4 Summary
Continuous Distributions
6.1 The normal distribution
6.1.1 The univariate and bivariate normal distributions . . . . 437
6.1.2 The multivariate normal distribution
6.2 Limit theorems
6.2.1 Convergence concepts
6.2.2 An inversion formula
6.3 Gamma and beta distributions
6.4 The χ 2 and related distributions
6.5 Computer simulations
6.6 Summary
Parameter Estimation
7.1 Bias, consistency, and efficiency
7.2 Normal inference
7.3 Sums of squares
7.4 Analysis of variance
7.5 Linear regression
7.6 Summary
A Analytical Tools
A.l Sets and functions
A.2 Limits
A.3 Structure of the real numbers
A.4 Riemann-Stieltjes integrals
A.5 Permutations and determinants
B Statistical Tables
This text develops introductory topics in probability and statistics with particular emphasis on concepts that arise in computer science. It starts with the
basic definitions of probability distributions and random variables and elaborates their properties and applications. Statistics is the major application
domain for probability theory and consequently merits mention in the title.
Unsurprisingly, then, the text treats the most common discrete and continuous distributions and shows how they find use in decision and estimation
problems. It also constructs computer algorithms for generating observations
from the various distributions.
However, the text has a major subtheme. It develops in a thorough and
rigorous fashion all the necessary supporting mathematics. This approach
contrasts with that adopted by most probability and statistics texts, which for
economy of space or for fear of mixing presentations of different mathematical
sophistication, simply cite supporting results that cannot be proved in the
context of the moment. With careful organization, however, it is possible to
develop all the needed mathematics beyond differential and integral calculus
and introductory matrix algebra, and this text purports to do just that.
Of course, as the book lengthens to accommodate the supporting mathematics, some material from the typical introduction to probability theory must
be omitted. I feel the omissions are minor and that all major introductory
topics receive adequate attention. Moreover, engagement with the underlying
mathematics provides an opportunity to understand probability and statistics at a much deeper level than that afforded by mechanical application of
unproved theorems.
Although the presentation is as rigorous as a pure mathematics text,
computer science students comprise the book's primary audience. Certain
aspects of most computer science curriculums involve probabilistic reasoning,
such as algorithm analysis and performance modeling, and frequently students
are not sufficiently prepared for these courses. While it is true that most
computer science curriculums do require a course in probability and statistics,
these courses often fail to provide the necessary depth. This text certainly
does not fail in presenting a thorough grounding in elementary probability
and statistics. Moreover, it seizes the opportunity to extend the student's
command of mathematical analysis. This approach is different than that taken
by other probability and statistics texts currently aimed at computer science
curriculums. The more rigorous approach does require more work, both from
the student and from the instructor, but the rewards are commensurate.
The engineering sciences, like computer science, also tend to use texts
that place more emphasis on mechanical application of results than on the
mathematical derivation of such results. Consequently, engineering science
students will also benefit from the deeper presentation afforded by this text.
Nevertheless, the primary audience remains computer science students because many of the illustrative examples are computer science applications.
Therefore, from this point forward, I assume that I am addressing a computer
science student or instructor.
Computer science students typically follow a traditional curriculum that
includes one or two terms of probability and statistics, which follow prerequisite courses in differential and integral calculus and linear algebra. Although
these prerequisite courses do introduce limit processes and matrix transformations, they typically emphasize formulas that isolate applications from the
underlying theory. For example, if we drain a swimming pool with a sinusoidal
cross-section, we can calculate how fast the water level falls without invoking
limit operations. We simply set up a standard differential ratio and equate it
to the drain flow rate. Why this works is buried in the theory and receives
less and less emphasis once a satisfactory collection of calculation templates is
available. This text provides an opportunity to reconnect with the theoretical
concepts of these prerequisite courses. As it probes deeper into the properties
of probability distributions, the text puts these concepts to fruitful use in
constructing rigorous proofs.
The book's ambient prose deals with the principal themes and applications of probability, and a sequence of mathematical support modules interrupts this prose at strategic junctures. With some exceptions, these modules
appear as needed by the probability concepts under discussion. A reader can
omit the modules and still obtain a good grounding in elementary probability and statistics, including philosophical interpretations of probability and
ample exercise in the associated numerical techniques. Reading the support
modules will, however, strengthen this understanding and will also arouse an
appreciation for the mathematics itself.
The encapsulation is as follows. An appendix gathers selected topics
from set theory, limit processes, the structure of the real numbers, RiemannStieltjes integrals, matrix transformations, and determinants. The treatment
first reviews the material at an introductory level. The prepared reader will
be familiar with these concepts from previous courses, but the results are nevertheless proved in detail. The less prepared reader will certainly find frequent
recourse to the appendix, and the text provides pointers to the appropriate
sections. However, even the prepared reader will benefit from the introductory
presentations, which serve both as a review of proof technique and as an introduction to the argument style pursued in the main text. Upon completing
an introductory review, the appendix then extends the topics as necessary to
support the arguments that appear in the main body of the text. Therefore,
all chapters depend on the appendix for completeness. Even a reader well
grounded in the aforementioned prerequisites can expect to spend some time
mastering the specialized tools developed in the appendix.
The appendix, with its eclectic collection of review topics and specialized
extensions, provides general mathematical background. There is need, however, for more specific supporting mathematics in connection with particular
probabilistic and statistical concepts. Until perhaps halfway through the text,
this supporting mathematics appears in mathematical interludes, which occur
in each chapter. These interludes introduce particular results that are needed
for the first time in that chapter. The first interlude deals with summation
techniques, which are useful tools for the combinatoric problems associated
with probability over equally likely outcomes. Others treat convergence issues
in power series, stability features of Markov matrices, and sufficient statistics.
Before taking up continuous distributions, however, it is appropriate to devote a full chapter to the mathematical issues that arise when one attempts to
generalize discrete probability to uncountable sets and to the real line in particular. This chapter is actually a brief introduction to measure theory, and
its logical place is just prior to the discussion of the common distributions on
the real line. Two further interludes follow in subsequent chapters. They deal
with limit theorems for continuous random variables and with decompositions
of the sample variance. In short, the text exploits opportunities to introduce
the mathematical analysis necessary to establish the basic results of probability theory. Moreover, the presentation clearly considers the mathematical
analysis and the probability theory to be of equal importance.
The following sketch shows the dependencies among the chapters, with
the understanding that portions of the appendix are prerequisite for any given
path. The dashed boxes note the mathematical interludes within the chapters.
Chapter 5
Probability on the Real Line
Discrete Decision Theory
Chapter 6 ■
Continuous Distributions
Chapter 7
Parameter Estimation
- * » Limit theorems
>| Sums of squares |
The reader can study the appendix in detail to ensure familiarity with
all the background mathematics needed in the text, or can start immediately
with the probability discussions of Chapter 1 and refer to the appendix as
needed. Because of its breadth, the appendix is more difficult to master in
its entirety than the mathematical interludes of the introductory chapters.
A reader who prefers that the material increase monotonically in difficulty
should start with the introductory chapters and digress into the appropriate
appendix sections as needed. When using the text to support a course, an
instructor should follow a similar path.
As noted earlier, the intended audience is computer science students.
Once past colored balls in numbered urns, which constitute the traditional
examples in combinatoric problems, the text uses examples that reflect this
readership. Client-server performance evaluation, for instance, offers many
opportunities for probabilistic analysis. These examples should provide no
difficulty for other readers, such as students from the engineering sciences,
because the examples make no profound references to advanced concepts, but
rather use generally accessible quantities, such as terminal response time,
server queue length, error count per 1000 programming statements, or operation count in an algorithm. These examples are no more difficult than those
in a more general probability text that ventures beyond the traditional urns,
colored balls, and dice.
The requirements of the Computer Science Accreditation Board and
the Accreditation Board of Engineering Technology (CSAB/ABET) include
a one-semester course in probability and statistics. This text satisfies that
requirement. In truth, it is sufficient for a full-year course because it not
only develops the traditional introductory probability concepts but also includes considerable material on mathematical reasoning. For a one-semester
course, the following selection is appropriate. Note that the topics lie along
an acceptable dependency chain in the earlier diagram.
Appendix. Sections as referenced in the items below
Chapter 1. Combinatorics and Probability
Chapter 2. Discrete Distributions
Chapter 3-4. Simulation, Sections 3.1 to 3.3, or Discrete Decision Theory, Sections 4.1 and 4.2
Chapter 6. Continuous distributions, Sections 6.1, 6.3, and 6.4
Chapter 7. Parameter Estimation, Sections 7.1, 7.2, and 7.4
The one-semester abbreviation is possible because Chapters 3 and 4
present major applications of discrete probability and, in the interest of time,
need only be sampled. Chapter 5 is advanced material that elaborates the
difficulties in extending discrete probability to uncountable sample spaces. It
is present for logical completeness and to answer the nagging question that occurs to many students: Is the introduction of a sigma-algebra really necessary
in the general definition of a probability space? Consequently, the proposed
one-semester course omits Chapter 5 with minimal impact on subsequent material. Finally, Chapter 7 undertakes major applications of continuous probability and also admits partial coverage.
At the time of this writing, many computer science curriculums include
only a first probability course. However, there is a recognized need for further
study, at least in the form of an elective second course, if not in a required
sequel to the introductory course. Anticipating that this increased attention
will also expose the need for a more complete mathematical treatment of the
material, I have provided unusally detailed excursions into supporting topics,
such as estimation arguments with limits, properties of power series, and
Markov processes.
Buttressed by these mathematical excursions, the text provides a thorough introduction to probability and statistics—concepts, techniques, and
applications. Consequently, it offers a continuing discussion of the real-world
meaning of probabilities, particularly when the frequency-of-occurrence interpretation becomes somewhat strained. Any science that uses probability must
face the interpretation challenge. How can you apply a result that holds only
in a probabilistic sense to a particular data set? The text also discusses competing interpretations, such as the credibility-of-belief interpretation, which
might appear more appropriate to history or psychology. The goal is, of
course, to remain continually in touch with the real-world meaning of the
Probability as frequency of occurrence over many trials provides the
most compelling interpretation of the phenomenon. It is intuitively plausible, for example, that a symmetric coin should have equal chances of landing
heads or tails. The text attempts to carry this interpretation as far as possible. Indeed, the first chapter treats the combinatorics arising from symmetric
situations, and this treatment serves as a prelude to the formal definitions of
discrete probability. As the theory accumulates layer upon layer of reasoning,
however, this viewpoint becomes difficult to sustain in certain cases. When
testing a hypothesis, for example, we attempt to infer the prevailing state of
nature from sampled data. What does it mean to assign a priori probabilities to the possible states? This practice allows statisticians to incorporate
expert judgment into the decision rules, but the assigned probabilities do not
admit a frequency-of-occurrence interpretation. Rather, they reflect relative
strength-of-belief statements about the possible states. As necessary, the text
interrupts the technical development to comment on the precise real-world interpretation of the model. Although beautiful as abstract theory, probability
and statistics are also rightly praised for their ability to deliver meaningful
statements about the real world. Interpreting the precise intent of these statements should be a primary goal of any text.
A trend in modern textbooks, particularly those not addressed specifically to a mathematics curriculum, is to avoid the theorem-proof presentation
style. This style can be sterile and detached, in the sense that it provides
sparse context for the motivation or application of the theorems. Without
the theorem-proof style, on the other hand, arguments lose some precision,
and there is a blurring of the line between the general result and its specific
applications. I have adopted what I consider a middle ground. I maintain a
running prose commentary on the material, but I punctuate the dialog with
frequent theorems. Often, the theorem's proof is a simple statement: "See
discussion above." This serves to set off the general results, and it also provides reference points for later developments. The ambient prose remains
connected with applications and with the questions that motivate the search
for new general results.
Plentiful examples, displayed in a contrasting typographical style, play
a major role in compensating for the perceived coldness of the theorems.
Incidentally, I should say that I do not find the theorems cold, even in isolation.
But I am responding to the spirit of the age, which suggests that a theorem
wrapped in an example is more digestible than a naked theorem.
The theorems also further a second ambition, noted above, which is
to involve the reader more extensively in precise mathematical argument. An
aspect of proofs that attracts major criticism is the tendency to display, out of
thin air, an expression that magically satisfies all the required constraints and
invites the algebraic manipulations necessary to complete the proof. I have
tried to avoid this practice by including some explanation of the mysterious
expression's origin. I must admit, however, that I am not always successful in
this ploy. Sometimes an explanation adds nothing to a careful contemplation
of the expression. In such cases, I am tempted to suggest that the reader reflect.
on the beauty of the expression, note how one part attaches to the known
information while another extends toward the desired result, and view the
expression as a unifying link, growing naturally from a study of the context of
the problem in question. Instead, however, I fall back on the age-old practice:
"Consider the following expression — " The reader should take these words
as an invitation to pause and ponder the situation.
In summary, the text develops a main theme of probability and statistics,
together with the mathematical techniques needed to support it. Since it
is not practical to start with the Peano axioms, there are, however, some
prerequisites. Specifically, the text's mathematical level assumes that the
reader has mastered differential and integral calculus and has some exposure to
matrix algebra. Nevertheless, acknowledging the mechanical fashion in which
these subjects are taught these days, the text provides considerable detail in
all arguments. It does assume, nevertheless, that readers have some familiarity
with limiting operations, even if they do not have significant experience with
the concept. For example, readers should be comfortable with l'Hôpital's rule
for evaluating limits that initially appear to produce indeterminate results. By
contrast, the text does develop the theory of absolutely convergent series to the
point of justifying the interchange of summation order in double summations.
As another example, the readers should be generally conversant with
power series, although the text develops this topic sufficiently to justify the
term-by-term differentiation that is needed to recover parameters from a
random variable's moment generating function. The background appendix
and the mathematical interludes should bridge the gap between prerequisite
knowledge and that needed to establish all probability and statistical concepts
encountered. They also serve to present a self-contained book, which is my
preference when learning new material. A reader can always skip sections
that are peripheral to the main point, but cannot as easily fill in omissions.
Some expositions consist of step-by-step procedures for solving a probabilistic or statistical problem. That is, they involve algorithms. For example,
algorithms appear in sections concerned with computer simulations of probabilistic situations. In any case, algorithms in this text appear as mutilated C
code, in the sense that I vary the standard syntax as necessary to describe the
algorithm most clearly to a human reader. For instance, I use only two iterators, the while-loop and the for-loop, and in each case, the indentation serves
to delineate the body of statements under repetition. The left fragment below, intentionally meaningless to focus attention on the code structure, must
be reformulated as shown on the right to actually compile properly,
while (X > Y)
while (X > Y) {
for (j = 1; j <=; j++)
for (j = 1; j <=; j++) {
Y = Y - t[i];
Y = Y - t[j];
t[j] = t[j] + 1.0;
t[j] = t[j] + 1.0;
if (Y < 0.0)
Y = 0.0;
if (Y < 0.0)
Y = 0.0;
This practice amounts to omitting the opening and closing braces, and
it is surprisingly efficient in reducing the code length. The price, of course,
is that the code will not execute as written. I also omit variable declarations
and simply use any variables that I need in an untyped manner. So a variable,
or a function return, can be a scalar, a vector, a matrix, or a list. A particular
usage is always clear from context, and these omissions, like those of scope
enclosures, produce more compact code. In summary, the algorithms in this
text are intended for human consumption. The emphasis is on the concept,
not the detailed syntax of a programming language. Nevertheless, I have
constructed all the algorithms in functional code and executed them to ensure
that they perform as promised.
I emphasize that this text requires no C programming background. A
C-based style finds frequent use in books and journals as a concise and precise algorithm presentation method for audiences that are not C literate. This
widespread usage confirms the opinion that loose C structures are easily accessible to readers who do not have a background knowledge of C. Consequently,
a reader with no previous C experience should not feel anxious about the algorithm descriptions in this text. Although readers must want to understand
the algorithm and must expend the time to attend carefully the iterative
processes in the C expression, they will find the algorithm comprehensible.
Moreover, explanatory prose accompanies all C-structure descriptions.
Another feature that will be appreciated by today's students is the text's
early emphasis on discrete distributions. Most introductory probability books
quickly proceed to the normal distribution and the central limit theorem because these tools enable certain computational approximations. I certainly
do not wish to imply that the normal distribution or the central limit theorem is not important. These topics are developed in detail in the latter
half of the book. However, much reasoning with discrete distributions can
take place without replacing the discrete distribution with a more tractable
normal distribution. This is especially true today when computers complete
the mechanical, but sometimes tedious, computations. Working directly with
discrete distributions develops a feeling for the subtleties of probability theory
that can be overlooked in a rush to obtain approximate results. The many
examples involving discrete distributions will sharpen a reader's capabilities
for combinatorial reasoning and will exercise his or her skills in discrete mathematical argument. These skills are especially important in today's scientific
world, which grows ever more digital, and therefore ever more discrete.
In keeping with this approach, the text develops simulation techniques
and certain aspects of statistical inference directly after introducing a repetoire
of discrete distributions but before developing continuous distributions. This
means that certain proofs, which admit more elegant expositions with continuous tools, contain detailed arguments with limits. These arguments are
conceptually similar to those associated with calculus (e.g., passing to an
integral limit from a summation), but the text nevertheless presents more
intermediate steps than one normally finds in a textbook
All definitions, theorems, and examples share a common numbering sequence within each chapter. To find Theorem 3.14, for example, you can use
any definition, theorem, or example to direct your search forward or backward.
I find it a considerable annoyance when these items are numbered separately.
Of course, this means that there may be a Theorem 3.14 even though there is
no Theorem 3.13. I try to make this dissonance more acceptable by placing
the number before the theorem, definition, or example. Having located Theorem 3.14, for example, you find that it starts "3.14 Theorem
" In this
reading, Theorem 3.14 is an item, which happens to be a theorem. There is
indeed an item 3.13, which may or may not be a theorem. In the traditional
manner, tables and figures exhibit separate numbering sequences within each
Chapter 1
Combinatorics and
The computer and engineering sciences investigate many nondeterministic
phenomena. In computer science, examples include an algorithm's running
time, the queue length encountered by a request arriving at a database server,
and the delay in transferring a data block from disk to main memory. Currents in semiconductor devices, an important concern in electrical engineering,
depend on the energy distributions of charge carriers. Biological evolution appears to involve random mutations in cells' offspring. In any applied science,
measurement error adds an uncontrollable variation to the simplest datagathering activity. Therefore, it is common laboratory practice to average a
sequence of measurements to reduce random effects.
We also use probabilistic terms to describe certain ordinary events of
everyday life. The chance of rain today is 40%. The probability of heads on
a coin toss is 1/2. The odds that team X will win a sports event are 3 : 1 .
Once in every thirty-six throws, approximately, a pair of dice will exhibit
the minimum total of two spots on the upturned faces. In all these cases,
we understand that some repetitive process generates outcomes that are not
deterministic but nevertheless exhibit a rational pattern. For example, each
rotation of the planet brings a new day, on which rain may or may not occur.
We interpret a 40% chance of rain as meaning that over an extended run of
days, 40% of them will be rainy.
In this initial informal discussion, we use the term probability in the ordinary sense implied by the examples above. That is, the probability of an
event is its relative frequency of occurrence over many observations. Probability is simply a synonym for chance. The probability of rain is 40%. The
chance of rain is 40%. Both mean that similar meteorological conditions in
the past have produced rain in 40% of the observations. More exact mathematical definitions appear later in the chapter, and they are consistent with
the intuitive meanings discussed here.
When faced with a repeating phenomenon that generates varying out1
Chapter 1: Combinatorics and Probability
comes, we can adopt one of two viewpoints. On the one hand, we can argue
that the phenomenon does not actually repeat. It only appears to do so
because our knowledge is not sufficiently detailed to distinguish between iterations. On the other hand, we can accept the variations and plan accordingly.
With the latter approach, we are indifferent as to whether the variations arise
because we are not sufficiently informed or because the phenomenon is fundamentally nondeterministic. Consider again the weather forecast that predicts
a 40% chance of rain under certain meteorological conditions. The first viewpoint suggests that the meteorological conditions are merely approximations
of the true state of nature. It implies that we should direct our efforts toward attaining complete knowledge of nature, which would allow a prediction
with 100% certainty. The second attitude suggests that we should live with
the uncertainty. We should lay plans that make provisions for alternative
outcomes. Indeed, an adherent of the first viewpoint must frequently follow
the second course because complete knowledge of nature is inaccessible. Most
persons find it practical to take a decision about carrying an umbrella without
waiting for meteorologists to acquire complete knowledge of the weather.
Probability theory provides a rigorous mathematical framework for investigating phenomena that exhibit repeatable patterns, even though individual outcomes may appear random. As Hamlet put it, "Though this be
madness, yet there is method in't." Probability theory provides a lens that
enables us to see the patterns underlying an apparently chaotic surface, a lens
that tames the madness to reveal the method.
Knowledge of repeatable patterns can be very useful, even if there are
occasional disappointments. If I know that team X will play an extended
sequence of games and that the chances of winning remain 3 to 1 throughout
the series, I might find a betting office that will accept the following proposition. I will pay the office $10 for each game. When team X wins a game, the
office will pay me $15; when team X loses, I get nothing back. Suppose that
the team plays 12 games and in keeping with the odds, wins 3 for each loss.
That is, the teams wins 9 games and loses 3. I pay $120, and I receive $135,
for a net gain of $15. Of course, the betting office can run this calculation as
well as I can, and therefore it would not accept my proposition. In truth, the
betting office can run the same calculation only if it also knows that the 3:1
odds are an accurate description of the situation. So informed bets are made
only between parties who have different perceptions of the underlying odds.
Long study and experience can produce credible estimates of the odds
for a sports event, or for the chance of rain on a given day, but the results cannot be completely accurate. Other probabilistic descriptions, however, can be
precise, at least in an idealized situation, because they describe certain favorable combinations as a fraction of all possible outcomes. Only two outcomes,
for example, are possible in a fair coin toss, heads and tails. Therefore, the
probability of heads in a fair toss is 1/2.
Similarly, rolling a pair of dice will produce one of the thirty-six patterns
in the grouping that follows, where the ordered pair (x,y) means that the first
Chapter 1: Combinatorics and Probability
die shows x spots on its upturned face and the second die shows y. We can
infer the probabilities of certain simple events from direct observation of the
For example, noticing that the
(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
outcomes summing to seven are the di(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
agonal entries, running from the lower
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
left to the upper right, we conclude that
i4-1) (4>2) (4>3) (4-4) (4*5) (4-6)
a seven total has probability 6/36. The
j 5 · 1 ) j 5 ' 2 j [ 5 ' 3 j [5-4j j 5 'fj [5-6j
calculations for other results are equally
* ' ' *■ ' ' * ' ' ^ ' ' ^ ' ° ' * ' '
straightforward. The probability of a three total is 2/36. The probability
that the absolute difference between the two dice is greater than two is 12/36
because the favorable outcomes are the upper-right and lower-left triangles,
each containing six entries.
Probabilistic quantities appear frequently in the computer and engineering sciences. To the examples noted earlier we can add the time required for a
computer to respond to a command from an interactive terminal, the number
of errors in a program module, the number of polls to find a receptive device,
the time to access a transmission bus, the number of memory bits flipped by
cosmic radiation, or the number of compare operations executed by a sort
algorithm. In many cases, it is difficult to calculate meaningful probability
weights. For example, suppose that you measure the system response time T
to a terminal command by directly observing a number of such operations.
You then have an empirical estimate of the fraction of cases for which T — 2
seconds. But is it possible to compute this fraction in advance from your
knowledge of the system parameters? Such a computation is not likely; it
depends on circumstances that are neither repeatable nor easily controlled,
such as parallel activity from other system users. By contrast, the number of
compare operations in a sort algorithm is subject to analysis. The possible
inputs of size N are the permutations of the numbers 1,2,... ,N, and it is
reasonable to assume that all such inputs are equally likely. For a specific
input, the algorithm determines exactly the number of compare operations,
and we can therefore proceed by direct count to enumerate the fraction of
cases for which the count is a particular n. This chapter opens our study
of probability by considering just such situations, where we can compute the
relevant probabilities by exhaustive counting.
Each observation of a nondeterministic phenomenon is called a trial.
In symmetric situations, such as coin tosses or dice rolls, we can reasonably
assume that each elementary outcome appears with equal frequency in a long
sequence of trials. However, as we associate trials in various ways, the equal
frequencies quickly disappear. For example, a tossed coin displays either a
head or a tail, and we expect a sequence of tosses to exhibit both results with
approximately equal frequencies. But, if we change the definition of a trial to
cover all tosses necessary to obtain a head, each observation is a number of
tails followed by a head. It is not likely that all such observations are equally
likely. Do you feel that 6 tails followed by a head will occur as often as 2
Chapter 1: Combinatorics and Probability
tails followed by a head? This chapter's early sections develop the tools to
answer such questions. In the process, we obtain an intuitive appreciation for
the probability of an event, such as 6 tails followed by a tail, as its relative
frequency of occurrence in an extended sequence of observations.
In a particular application, the usual difficulty is the systematic counting
of the total possible outcomes and of the outcomes favorable to a certain
result. Unlike the simple examples cited above, most applications provide
a very large field of possible outcomes, and it is not feasible to list them
explicitly. Before launching into the details, we take the time to elaborate
three counting principles. All are immediately obvious, so you may feel that
there is no need to state them explicitly. In complicated situations, however,
when you cannot seem to find a starting point for a solution, you might
well return to these simple principles. The first principle suggests breaking
the mass to be counted into disjoint pieces and then summing the subcounts
from these pieces. That is, if you want to count rooms within a building, you
first perform subcounts on each floor and then add the subcounts. In applying
this principle, you need to ascertain that there is no overlap among the pieces,
which could lead to overcounting some elements. The second principle, known
as the multiplicative principle, is actually a special case of the first principle.
It again breaks the mass into disjoint pieces, but now such that each piece
contains the same number of elements. The total count is then the number
of pieces times the count of any one piece. If, for example, you want the
number of houses on a rectangular site, you count the houses in one row and
multiply by the number of rows. The third principle, known as the pigeonhole
principle, states that the allocation of more than n pigeons to n pigeonholes
results in multiple occupancy for at least one pigeonhole. If we randomly
allocate four balls among three containers, what is the probability that all
containers receive at most one ball? By the pigeonhole principle, it is zero.
Some container must receive at least two balls. Consider a related question:
What is the probability that the leftmost container receives at most one ball?
Now, that is a more complicated story, which we now begin.
Many situations call for counting the number of ways to select, under specified
constraints, objects from a given set. When we say that a set contains n
objects, we understand, in keeping with the usual definition of a set, that
the objects are distinct. We may wish, nevertheless, to disregard distinctions
among certain elements. For example, in a set of 7 balls, we can have 4 reds
and 3 blacks. In this case, we can adopt a notation that both captures the
elements' distinct identities and emphasizes the two internal classes of red and
black. The designation {ΓΙ,Γ2,Γ 3 ,Γ4,61,62.^3} serves this purpose, but other
schemes are equally valid. The best notation varies with the context of the
problem at hand.
In some manner or other, a problem context always specifies three gen-
Section 1.1: Combinatorics
Comb. Sequences
abc abc, acb, bac, bca, cab, cba
abd abd, adb, bad, bda, dab, dba
acd acd, adc, cad, cda, dac, dca
bed bed, bde, cbd, cdb, dbc, deb
(a) Sampling without replacement
Comb. Sequences
abc, acb, bac, bca, cab, cba
66c 66c, 6c6, c66
abd, adb, bad, bda, dab, dba
6cc 6cc, c6c, cc6
bbd bbd,bdb,dbb
acd, adc, cad, cda, dac, dca
bed, bde, cbd, cdb, dbc, deb bdd bdd, dbd, ddb
ccd ccd, ede, dec
edd edd, ded, ddc
abb, bab, bba
aaa aaa
aac, aca, caa
ace, cac, cca
666 666
ccc ccc
ddd ddd
add, dad, dda
(b) Sampling with replacement
T A B L E 1.1. 3-combinations and their related 3-sequences from a
field of 4 symbols
eral constraints and perhaps further particular constraints. The general constraints answer the questions: How many objects are selected? Does the
selection order make a difference? After drawing an object and recording the
result, do we return the object to the set before drawing the next item? The
responses to some questions affect the possible answers to others. For example, if you return objects to the common pool after each selection, then the
total number of objects drawn can exceed the pool size.
1.1 Definition: An ordered selection of k objects is a k-sequence. An unordered selection of k objects is a k-combination. In the process of sampling
with replacement, we note the identity of each object as it is drawn, but we
return it to the common pool before the next draw. In the alternative process, sampling without replacement, we do not return the drawn object to the
pool. |
From the set {a,b,c, d), we generate four 3-combinations and twentyfour 3-sequences when sampling without replacement. The upper tabulation
of Table 1.1 suggests a 6-to-l relationship between the sequences and the
combinations. Sampling with replacement, on the other hand, yields twenty 3combinations and sixty-four 3-sequences, as illustrated in the lower tabulation.
If there is a relationship between a combination and the sequences involving
the same choices, it is less apparent. Duplications appear in both sequences
and combinations when choosing with replacement, and this complicates the
As noted earlier, we use the term probability for relative frequency of
occurrence. Suppose, for example, that we are generating sequences with-
Chapter 1: Combinatorics and Probability
out replacement in the context of Table 1.1. Note that 4 of the 24 possible
sequences contain the ordered pair "ab." Consequently, we say that the probability of "ab" occurring as a subsequence is 4/24 = 1/6. This policy agrees
with our intuition when all sequences are equally likely. When, in due course,
we come to a formal definition of probability, we will find that it conforms
with this frequency-of-occurrence notion.
Sampling without replacement
Under sampling without replacement, neither a sequence nor a combination
exhibits a duplicate entry. Sequences, however, have more structure because
the insertion order is important. Our goal is to find systematic methods for
counting sequences and combinations under sampling without replacement,
but also under a variety of further constraints.
We start with sequences. Suppose that we have seven tiles, labeled with
the letters: a b c d e f g. Suppose further that we consider a word to be
any 3-sequence. How many such words can we compose with the seven tiles?
Possible words are abc, cag, def, fed, and the like. We imagine a repetitive
process. On each trial, the process selects three tiles from the initial seven and
assembles them in the order selected to form a word. There is no replacement
as the tiles are selected. However, all seven tiles are reinstated at the beginning
of each trial. With this clarification, the desired count is the number of 3sequences from a set of seven symbols, where the selection process is without
replacement. We apply the multiplicative counting principle that breaks the
possibilities into groups of the same size. In particular, a systematic count
arises from partitioning the words into seven groups—those beginning with a,
those beginning with b, and so forth. Within the first group, where the first
letter is a, we distinguish six subgroups: those with second letter b, those with
second letter c, and so forth through those with second letter g. In the second
group, where the first letter is b, we again distinguish six subgroups: those
with second letter a, those with second letter c, and so forth through those
with second letter g. An important observation is that each group divides
into the same number of subgroups. Table 1.2 organizes these subdivisions.
Within each subgroup, where the first two letters are now specified,
there remains a choice of one of the five remaining letters for the last tile.
Hence each of the 42 subgroups divides naturally into five further pieces, each
distinguished by one of the remaining five letters. So, if we choose three tiles
randomly from the group of seven and lay them out in the order chosen, we
will form one of exactly 7 · 6 · 5 = 210 three-letter words.
In terms of Definition 1.1, this demonstration shows that there are exactly 210 possible ways of choosing, without replacement, a sequence of length
3 from a group of size 7. Now suppose that we want the probability that a
three-letter sequence chosen without replacement from a field of seven letters
has its first two letters in alphabetical order. In other words, we want the fraction of the 210 sequences that have the first two letters in order. Referring
Section 1.1: Combinatorics
First letter
Second letter
Possible third letters
cd e fg
b d e fg
b cefg
b cd fg
bcd eg
bcd ef
cd e fg
a d e fg
a ce fg
a cd fg
a cdeg
a c d ef
bd e fg
a d efg
a b e fg
a b d fg
a bdeg
a bd ef
T A B L E 1.2. Organizing the three-letter words from seven distinct tiles
to the organization above, we see that the first group, containing 6 · 5 = 30
sequences, meets the criterion. Since the first letter is a, the first two letters
must be in order, regardless of the remaining letters. Sequences in the second
group, however, begin with b, so the first subgroup, where the second letter is
a, must be excluded. The second group then contributes 5 ■ 5 = 25 sequences
that meet the criterion. The third group starts with c, so we must exclude its
first two subgroups, where the second letter is a or b. Hence, we obtain only
4 · 5 = 20 sequences here. The pattern is clear: Each group must exclude one
more subgroup than its predecessor. The total number of sequences with the
first two letters in order is then 6 · 5 + 5 ■ 5 + 4 · 5 + 3 · 5 + 2 · 5 + 1 · 5 + 0 · 5 = 105,
which gives a probability of 105/210 = 1/2.
Is this result surprising? The number of sequences with the first two letters in order is exactly one-half of the total number of three-letter sequences.
A so-called "sanity check" is always advisable. This means that we should
try to confirm by some other method that the result is correct or at least
reasonable. An obvious check in this case is that the result is a proper fraction, between zero and 1, because it represents the ratio of some number
of selected possibilities to the total number of possibilities. So the result is
reasonable in this sense. However, we can argue further that it should be
1/2 exactly. Suppose that we reorganize the three-letter sequences into two
different groups: Gx contains those sequences with the first two letters in
order and G2 contains those with the first two letters out of order. We can
define a function / : G\ —> G2 by f(xyz) — yxz, where xyz is a three-letter
Chapter 1: Combinatorics and Probability
sequence from G\. This means that x < y alphabetically, so yxz is indeed
a sequence in G2. This function establishes a one-to-one correspondence between the two sets, so they must have the same number of elements. Using
the notation \X\ to indicate the number of elements in a set, we have that
the probability of a three-letter sequence having its first two letters in order
i s | G 1 | / | G 1 U G 2 | = |G 1 |/(2|G 1 |) = l / 2 .
With some additional notation, we can extend the example to a general
result. Recall that n! means the product of the integers n · (n - 1) · (n — 2) · · · 1
and that we define 0! = 1. The notation Pn>k will mean the first fc factors in
the expansion for n!.
1.2 Definition: For integers n > 0, fc > 0, Ρ η ,* denotes the fcth falling
factorial of n. It is the product n · (n — 1) · (n - 2) · · · (n - fc + 1). For the
moment, we think of the " P " as an abbreviation for "product." As a special
boundary case, we define Pn,o = 1-1
For k > n the expansion contains a zero factor, and therefore Pn<k = 0.
For 0 < k < n, we have
Pn,k = n · (n - 1) · (n - 2) · · · (n - fc + 1)
_ n ■ (n - 1) · ■ · (n - fc + 1) · (n - fc) ■ ■ · 2 · 1
(n - fc)!
Pn,n = n · (n - 1) · (n - 2) · · · 2 · 1 = n!.
1.3 T h e o r e m : Let n > 0 and fc > 0. Drawing without replacement from a set
S containing n symbols, we can construct exactly P n ^ different fc-sequences.
PROOF: If fc = 0, then P„,fc = 1, and there is indeed just one 0-sequence—the
empty sequence. If fc > n, then Pn<k = 0, but there are no possible fcsequences that we can select without replacement from the pool of n symbols.
The theorem is therefore correct for fc = 0 and for fc > n. For fc in the
intermediate range, we proceed by induction.
For fc = 1, we count exactly n 1-sequences, each being one of the symbols
from S. Since P nj i = n, the theorem is correct in this base case. If fc >
1, divide the possible fc-sequences into n groups, each distinguished by its
starting symbol. The sequences within a group start with a specified first
symbol and continue with some sequence of length fc - 1 from the remaining
n - 1 symbols. Hence the number of sequences in the group is the number
of continuation sequences, which is Pn-i,k-i from the induction hypothesis.
The total number of fc-sequences is then
n(n - 1)!
_ n
1.4 Definition: A rearrangement of the ascending sequence 1,2,3,... ,n is
called a permutation of these numbers. |
An immediate application of Theorem 1.3 shows that there are exactly
n! permutations of the numbers 1,2,3,... ,n. The next example provides a
more extensive application and also suggests a more useful formula—one that
deals with combinations rather than sequences.
Section 1.1: C o m b i n a t o r i c s
T A B L E 1.3. In-order three-card hands from thirteen spades (see Example 1.5)
1.5 E x a m p l e : A dealer removes all cards from a standard deck, except the spades,
shuffles the remaining cards, and deals you three cards. What is the probability
that you receive your cards in a sequence of increasing value?
As you receive your cards, you keep them. Therefore, you receive a 3-sequence,
chosen without replacement from a set of 13. Theorem 1.3 then says that you receive
one of the Pi3,3 — 1716 possible sequences. To count the in-order sequences, we
apply the multiplicative counting principle to elaborate several disjoint categories.
If the deal starts (2,3), then any of the remaining 11 cards completes an in-order
hand. If the deal starts (2,4), then only 10 of the remaining cards complete an
in-order hand because we must exclude the 3. Similarly, only 9 cards successfully
complete a hand that starts with (2, 5). Thus, the number of in-order hands starting
with 2 is 11 + 10 + 9 + . .. + 1 + 0 = 66. Continuing with the hands that start with 3,
we organize the in-order hands as shown in Table 1.3, which produces a grand total
of 286 in-order hands. The required probability is then 286/1716 = 1/6. So you
should receive your three cards in increasing order about once in every six deals.
Can you think of a sanity check for why this should be so?
Suppose that we ignore card order for the moment. We divide all possible 3sequences into groups that contain the same cards. Suppose that there are N such
Chapter 1: Combinatorics and Probability
groups. Each group contains 3-sequences that are rearrangements of the same three
cards, so according to Theorem 1.3, each group contains ^3,3 = 3! = 6 sequences,
and only one of them can be in increasing order. So there must be N sequences
in increasing order and a total of 6N sequences all together. The probability of
receiving an in-order hand is then N/(6N) = 1/6. Π
We have learned that a set of n distinct symbols admits Pn,k ^-sequences
when they are chosen without replacement. These sequences are, of course,
ordered displays of k symbols. A combination, however, is an unordered display, which we can visualize as a subset of the original set. Suppose that
we want to count instead the number of fc-combinations. That is, how many
fc-combinations can we construct if we choose without replacement from n distinct symbols? We follow the suggestion at the end of Example 1.5 to prove
the following theorem. First, however, we introduce a notational convenience.
1.6 Definition: For integers n > 0, k > 0, Cn,k, pronounced "n choose k," is
defined as Pn,k/k\. For the moment, think of the "C" as denoting "choose." |
If k > n, then Pn>k = 0 and consequently Cn,k = 0. If 0 < k < n, then
C„,fc = n\/[k)(n — k)}]. From this formula, it follows that Cn,k = C„tTl-k for
0 < k < n.
1.7 Theorem: Let n > 0 and 0 < k < n. Choosing without replacement, we
can construct exactly Cn,k different fc-combinations from a set of n symbols.
PROOF: By Theorem 1.3 the set admits Pn,k fc-sequences. If we collect together all sequences that are merely different orderings of the same symbols,
we partition the Pn,k sequences into N collections. Each collection represents
a different fc-combination, and, moreover, each fc-combination must appear as
one of the collections. Therefore, N counts the number of fc-combinations.
How large are these collections? Suppose that C is one of the collections.
Every sequence in C contains the same k symbols. The sequences differ only
in the ordering of these symbols. We apply Theorem 1.3 again to conclude
that C contains exactly Pk,k = k\ sequences. This same reasoning applies to
the other collections, so each of them is of size k\. Then Pn k = N ■ k\, or
N = Pntk/k\ = n\/[(n-k)lk\}
= Cn,k.i
The theorem applies when k = 0 because then C„,o = n!/[0!(n - 0)!] = 1
and there is indeed just one subset of size zero, namely the empty set. It is
not possible to select a subset with more than n objects, so Cn,k = 0 is correct
for k > n.
A useful interpretation envisions Cn<k as the number
1 1 ^^»^
of patterns obtained by darkening k slots from a field of n.
' ^^__^
For example, C ^ = 6 produces the patterns to the right.
* M \ *m
In choosing 2 slots from a field of 4, you may, for example,
} \ } "* '
choose slot 1 and then slot 4, or you may choose slot
4 followed by slot 1. In either case, you get the fourth pattern. The pattern
is sensitive only to the particular 2-combination, {1,4}, not to the order in
which the slots were chosen. In this sense, exactly 6 patterns are possible. The
next example shows how to use this interpretation to count the possibilities
from a large field of values.
Section 1.1: Combinatorics
1.8 Example: You receive 13 cards from a well-shuffled deck. What is the probability that your hand contains 2 or more aces? Because you keep your cards as
you receive them, the process unfolds without replacement. Moreover, you are
concerned only with the particular 13-combination that you receive. The order in
which you receive the cards is not important. Therefore, the total possible outcomes number Cs2,i3, the number of 13-combinations from the deck of 52. This is
the number of patterns over 52 slots in which 13 are darkened. To count the number
of 13-combinations that contain exactly two aces, we imagine the 52 cards arranged
linearly over 52 slots, but with some further structure. Specifically, we put the four
aces over the first four slots (spades, hearts, diamonds, and then clubs). The remaining 48 cards occupy slots 5 through 52, in an arbitrary order. The following
diagram illustrates some sample choices that yield a 13-card hand with exactly 2
■ aces
i non-aces -
M I M M I M 1 1 1 M I M I M Γ Ε M M l y M l l i y II
y M MIM y u n y l i n y N Éiy y
In the top pattern, aces fill slots 2 and 3, while the remaining 11 cards come from
slots 6, 9, 14, 19, 22, 25, 28, 30, 35, 37, and 42. The next pattern takes aces from
slots 1 and 4 with the other 11 cards from slots 13, 19, 25, 29, 32, 34, 40, 44, 48,
49, and 51. The last pattern shows aces from slots 2 and 4 and another selection for
the remaining cards. In general, a valid pattern arises from any of d,2 subpatterns
in the first four slots associated with any of C48.11 subpatterns from the remaining
slots. Hence, the number of 13-card hands that contain exactly 2 aces is C4.2C48.11·
Similarly, the number of hands containing exactly 3 aces is C4.3C48.10· The number
for 4 aces is C4,4C48,9· The probability of obtaining a hand with 2 or more aces is
then [C4,2C48,n + C4.3C48.10 + C4.4C48.9l/C52,13 = 0.2573.
In terms of the counting principles, we first decided to count separately the
hands with 2,3, and 4 aces, which breaks the field into three disjoint components of
varying sizes. We then invoke the multiplicative principle to count the component
with k aces as a regular arrangement of C4.1t possible ace combinations by Cig.u-jt
possible combinations for the other 11 cards. D
1.9 Example: A poker hand contains five cards. The hand "two-pair" contains
cards of three distinct values. Two cards share one value, another two cards share
a second value, and the final card assumes the third value. For example, the 4 of
spades and diamonds, the king of clubs and diamonds, and the 7 of clubs constitute
a two-pair hand. What is the probability of receiving a two-pair hand in a five-card
deal from a well-shuffled deck?
The total number of five-card hands is Cs2,s = 2, 598,960. This is the number
of 5-combinations from a set of 52. We now need the number of combinations that
result in two-pairs. As in the preceding example, we envision the 52 cards aligned
linearly over slots, this time in descending value. The four aces occupy the first four
slots, then the four kings, then the four queens, and so forth. The final four slots
contain the four 2s. We view this arrangement as thirteen consecutive groups of
four slots each. Each such group contains cards of a single value. A two-pair hand
must involve two cards from each of two distinct groups and a final card from the
44 possibilities outside the two groups. There are Ci3,2 patterns corresponding to
Chapter 1: Combinatorics and Probability
the choice of the two groups. The upper bars in the diagram below show one such
pattern where the pairs come from the Q-group and the 4-group. The odd card
comes from the 9-group.
K | Q | J | 1 0 , 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 ,
1111111 M 11 n 11
Ι ι ι ι Ι ι ι ι ί w l i n l i i i l y t i n ιΐιι i l i i i l i n l a J i n l i n l
Within each of the two groups chosen, a pair arises from any of d,2 possible combinations. Assuming that the suits are arrayed within groups as spades, hearts,
diamonds, clubs, the diagram shows the Q-group generating the queen of spades
and the queen of diamonds, one of the six possible pairs. The 4-group generates
the four of hearts and the four of clubs, again one of six possibilities. Finally, the
44 remaining cards generate the 9 of hearts, one of Ci4,i possibilities. The diagram
exemplifies only one possible pattern. The total number of two-pair patterns is
Ci3,2C"4,22C44.i = 123,552. The desired probability is then Ci3,2[C4,2]2C44,i/C"52,5,
which evaluates to 0.0475. D
Following the strategy of the last two examples, you should now be able
to verify the probabilities of any poker hand. For a flush, which is five cards
of the same suit, it is C4,16*13,5/052,5 = 0.00198. This interpretation of a
flush includes a royal flush, which is the ace through ten of the same suit. For
a straight, which is five cards of consecutive value, it is C9 i i[C 4il ] 5 /C52,5 =
0.00355. This interpretation includes pure straights as well as straight flushes
(consecutive values from the same suit) and royal flushes. The strategy in all
cases is to construct the favored hand through a series of choices, where the
total number of possibilities divides equally across the choices. The desired
count is then the product of the number of subdivisions at each step. This
process amounts to the repeated application of the multiplicative counting
principle. To choose a straight, for example, you first choose the lowest value,
which must be one of the nine values 2 through 10. Any higher value for
the lowest would not allow four further cards in increasing value. For each
such choice, you then have the option of 4 cards (spades, hearts, diamonds,
or clubs) for the lowest value. Then you must choose one of 4 cards for the
second lowest value. Indeed, for each of the five sequential values, you must
choose one of the 4 available suits. Multiplying the sizes of the subdivisions
at each step gives Cg,i[C4,i]5, which leads to the probability calculated above.
Although the examples to this point have featured card games, the general ideas of Theorem 1.7 are relevant to a wide range of problems, some of
which appear in the following examples. The examples introduce the terms
random sample and random selection. For the moment, we adopt intuitive
meanings for these terms. A random sample, or selection, of size k means a
subset of k objects, selected without replacement by a process that affords
each potential fc-combination an equal chance of being chosen. For small objects, we achieve a random sample by mixing all the objects in a jar and
drawing k of them without looking at the jar. For larger objects, or for objects that have distinguishable shapes or textures, we must imagine a process
that removes these clues.
Section 1.1: Combinatorics
1.10 Example: Suppose that a bin of 100 components contains 4 defective ones.
In a random sample of size 6 from this bin, what is the probability that none will
be defective? What is the probability that exactly 1 will be defective?
Let pk,k = 0,1,2,3, or 4, be the probability of k defects in the sample. We
imagine an alignment with the 4 defective components in the first 4 slots and the
properly functioning components in the subsequent 96 slots. A pattern of 6 slots
that exhibits no entry from the first four must necessarily choose all 6 from the
remaining 96. Consequently, we calculate po as shown below. In a similar manner,
a pattern with one defective entry must have 1 slot from the first four and 5 slots
from the remaining 96, which explains the pi computation. At this point, a sanity
check should be stirring in the back of your mind. Between no defects and one
defect, we have used up 0.778 + 0.205 = 0.983 of the total probability. There are
three remaining possibilities, and they must share the final 0.017 probability. The
remaining calculations follow as shown, and we note that the values add up properly,
within round-off error.
po =
C 4 ,oC96,6
= 0.778
p2 =
Ci l Cgfi 5
pi =
C 4 ,2^96,4
pi =
n m
= 0.0167
= 3.22 x 10~ 7
1.11 Example: Suppose that a 10-person committee contains 5 Republicans and
5 Democrats. Random selection produces a 3-person subcommittee. What is the
probability that the Democrats will dominate the subcommittee?
Democrats will dominate if they number 2 or 3 on the subcommittee. Cio,3 =
120 subcommittees are possible. Cs^Cs.i = 50 of them will contain exactly 2
Democrats, and Cs^Cs.o = 10 of them will contain exactly 3 Democrats. Therefore,
Democrats will dominate 60 of the 120 possible subcommittees. The probability of
a Democratically dominated committee is 60/120 = 1/2.
Is this answer reasonable? A subcommittee with an odd number of members
must be dominated by one group or the other. Since Democrats and Republicans
are equally represented in the candidates, you should suspect from the symmetry
of the situation that each group should have the same chance to dominate the
subcommittee. That is, each group's chance should be 1/2. D
The common theme across the last several examples expresses the desired probability as a ratio, in which the numerator is the number of combinations under some constraint and the denominator is the total number of
combinations. In the preceding example, the denominator was the number of
3-person combinations from the set of ten, while the numerator was the number of 3-person combinations that contained 2 or more Democrats. In the
earlier example, the denominator was the number of 6-component combinations from the field of 100, and the numerator was the number of 6-component
combinations with a specified number of defects. With these kinds of problems, the most difficult part is identifying the relevant subsets. Consider the
following examples. The first arises in computer science and communications
engineering. The second introduces a mathematical tool with widespread application.
Chapter 1: Combinatorics and Probability
10+ + + + +
+ + + + + *-♦
9 +++ +
+ ++++ t +I » +
8 +++++
7 +++ +
. 6+ +
5 + ++ + + + + + +
4 + <>—Ik + + + + + +
2 + -- + + + + + + +
0 1 2 3 4 5 6 7 8
(a) Typical message path
(b) Traffic density versus location
Figure 1.1. Node loading in a switching computer network (see Example 1.12)
1.12 E x a m p l e : A network has switching computers arrayed in a square pattern
as shown in Figure 1.1(a). The computers occupy the grid points (x,y) for integer
values of 0 < x < 10 and 0 < y < 10. Computers communicate only with their
immediate neighbors. All message traffic arrives at node (0,0) and departs from
node (10,10) after passing through some path across the intermediate computers.
Moreover, each computer forwards messages either up or to the right, so that each
link makes progress toward the final destination at (10,10). Figure 1.1(a) shows a
sample valid path. Beyond this restriction, all paths are equally likely. What is the
probability that the computer at location (2,6) participates in a message transfer
from (0,0) to (10,10)?
This question is a particular instance of a more general consideration, which
asks how various nodes participate in the overall traffic flow. Figure 1.1(b) gives a
graphical depiction of the traffic density, and we now proceed to verify that pattern.
A path must proceed from an intermediate node either upward or to the right.
It follows that every path must make 10 steps up and 10 steps to the right as it
winds from (0,0) to (10,10). Every path, therefore, consists of exactly 20 links: 10
horizontal links and 10 vertical links. Paths differ only in the choice of which 10
links are horizontal. So there are C2o,io possible paths from (0,0) to (10,10). Now,
consider a path that passes through node (2,6), such as the one in Figure 1.1(a).
The initial segment, from (0,0) to (2, 6) must contain 8 links: 2 horizontal and 6
vertical. Different paths will choose different locations for the horizontal links. The
path illustrated in the figure proceeds horizontally on links 2 and 6 and vertically on
the remaining links. Hence there must be Cs,2 possible segments from (0,0) to (2,6).
Similarly, there must be Ci2,8 possible continuations from (2,6) to (10,10). Since
any initial segment can pair with any of these continuations to form a path through
(2,6), the number of such complete paths is C8,2Ci2,8· The required probability is
then P(2,6) = CB,2CI2,S/C2O,IO = 0.075.
Node (i,j) participates with probability p ( i j) = Ci+j,iCi0-(i+}),io-i/C20,io
the general case. For a peripheral point, such as (1,9), the probability is 0.00054,
while a central point, such as (5,5), participates with probability 0.344. If paths are
chosen randomly, the central computer at node (5,5) must bear 34.4% of the traffic,
while the peripheral computer at (1,9) sees very little traffic. These results are
important because they reveal where to concentrate the fastest computers and the
largest message buffers. Table 1.4 shows the probabilities associated with selected
Section 1.1: Combinatorics
y- 0
ι = 0
T A B L E 1.4. Traffic density in a switching network (see Example 1.12)
nodes. The source node is at the upper left corner; the destination node (not shown)
is at the lower right. The boldface entries highlight the probabilities greater than
0.3 and show that the traffic load is highest near the source and destination and that
a bottleneck persists through the center of the network. Horizontal positions 6-10,
which do not appear in the table, are reflections of columns 4-0. That is, column 6
is a copy of column 4, installed upside down, and so forth. The diagonal bottleneck
therefore continues to the destination node. Q
1.13 Example: Verify the binomial theorem: (x + y)n = J2l=0C„ikxkyn~k
To multiply two polynomials, we first multiply every term of the first by every
term of the second. For example, (a + 26 — 2c)(6+2c) = a6+2ac+26 2 + 46c — 2bc — Ac2.
Then we combine terms to obtain a more compact expression: ab + 2ac + 2b2 + 2bc —
4c 2 . We can view the initial product, before the final simplification, as the sum of
all possible two-factor terms in which one factor comes from the first polynomial
and the other from the second polynomial. If the two polynomials have ni and ri2
terms, respectively, then the product, before simplification, has n\ri2 terms. If we
now multiply the product by a third polynomial with 713 terms, we obtain a sum
of r i i o n s three-factor terms. In each such term, each of the three polynomials
contributes one factor. Applying this observation to (x + y)n, we get 2 n terms.
Each term, moreover, contains n factors, one from each copy of (x + y) in the
multiplication. A given term might contain n i-factors and no y-factors and so
appear as i n . It might contain n — 1 x-factors and one y and take the form xn~ly.
The sum of the x and y exponents must always be n, so the general form of (x + y)n,
after combining terms, must be 53£_ 0 CkXkyn~k, for some set of constants CkTo determine the Ck, we imagine the n factors of (x + y)n aligned over n slots.
To produce a term x yn~ , we must choose an x-factor from k of the
n slots and a y-factor from each of the remaining slots. The diagram
to the right illustrates the point for the case n = 5,fc= 2. The number
of patterns is, of course, Cs,2 = 10. That is, (x + y)s contains the
term lOx y . We can calculate the coefficient associated with other
values of k in a similar manner. In general, it is C„ *, and therefore
The examples to this point illustrate the usefulness of counting the number of combinations available from a set of n objects. There is, however,
another viewpoint that extends this concept to a different kind of problem.
Chapter 1: Combinatorics and Probability
Consider a set of n objects, some of which are indistinguishable at a higher
level of abstraction. The traditional example consists of n balls, of which r
are red and b = n — r are black. For the moment, let us assume that n = 5
and r = 3, which means, of course, that 6 = 2. Because a set can have no
duplicates, the balls must be distinguishable through some further characteristic, so we label the red balls as r i , r 2 , Γ3 and the blacks as 6i, 62. Now if we
ask how many 5-sequences we can form, without replacement, from the five
balls, Theorem 1.3 immediately answers 5! = 120. However, if we ignore the
subscripts and consider only the colors, there will be fewer sequences because
some of them will contain the same colors in the same places and differ only
in the subscripts. So, how many sequences are possible when only color is
Suppose that we group the 120 permutations into N collections, each
containing the same color pattern. That is, each sequence in a given collection
contains red balls in exactly the same positions, differing only in the subscripts
on the red balls. By default, all sequences in a given collection must then
contain black balls in exactly the same positions, but again differing in the
For example, the sequences in Figure 1.2(a) all lie in one collection. The
reds occupy positions 1,2, and 4, and the three red balls can appear in these
slots in any arrangement of {ri,r2,r3}. Similarly, the blacks appear in slots
3 and 5 in any arrangement of {61,62}. So any of the 3! = 6 3-sequences of
the reds (within the red slots, of course) can associate with any of the 2! = 2
2-sequences of the blacks to form a sequence in the collection. The collection
then contains 3! · 2! = 12 sequences, as exemplified by the display above. The
same analysis shows that each of the JV collections contains 12 sequences.
Therefore, N = 120/12 = 10, and N counts the number of sequences that are
distinguishable only by color.
In the general case, we have n balls, of which r are red and b = n — r are
black. If we partition the n\ sequences into collections such that all members of
a collection exhibit the same color pattern, we find r!6! = r\(n — r)\ sequences
in each collection. This follows because the r reds can be redistributed in the
same slots in any of the r! permutations, and the blacks can be redistributed in
their slots in any of 6! = (n — r)! permutations. The number of distinguishable
color patterns is the number of collections: n\/[r\(n — r)!] = Cn,r.
In the earlier examples, the result C„, r occurred when we arrayed the
objects linearly above a set of slots and looked at the slot patterns formed
by choosing r objects from the field of n. In the current context, however,
we are choosing all the objects, and our interest shifts from which objects
are chosen to where the r "special" objects appear. We can still use the
slot-pattern technique, with one small variation. We now align the balls'
potential positions over the slots. It should be clear that a color sequence
is determined as soon as the positions of the three red balls are known; the
remaining positions must, by default, contain black balls.
For the case n = 5, r = 3, we envision the arrangements of Figure 1.2(b).
Section 1.1: Combinatorics
fclΓ3 <>2
Γ 2 fc
r2 Π fci
Γ3 62
ri Γ 2
Γι T-2 &2 r3
r-3 *>2
r2 61
Γ3 ri b2 r2 61
Γ3 Π 01 r2 <>2
Γ3 r 2 62 ri 61
Γ3 r2 il n 62
(a) Sequences with same color pattern
Γ2 Γ3 <»1 Γι 62
(b) Possible color patterns
Figure 1.2. Color patterns exhibited by 3 red balls in a field of 5
The slots are aligned with the potential positions of the red balls, not with the
balls themselves. The 10 patterns correspond to the possible positions of the
red balls, which in turn correspond to the possible color patterns. Since there
are Cs,3 = 10 patterns, there are 10 distinguishable color sequences. This
analysis verifies the more direct computation above and provides a further
interpretation of the Cn>k formula.
We can capture the new interpretation in the following theorem. The
theorem refers to "classes," which may be colors, as in the example at hand,
or any other attribute in the context of a particular problem. In the next
example, we interpret class as gender.
1.14 T h e o r e m : Suppose that a set of n objects contains two classes with k
objects in the first class and the remaining n — k objects in the second class.
Consider n-sequences, chosen without replacement, in which objects of the
same class are indistinguishable. The number of such sequences is Cn<kPROOF: See discussion above. |
The interpretation accorded Cn^ in this theorem is really the same as
that which visualizes Cn^ as the number of slot patterns in a field of n under
the constraint that k slots must be shaded. Objects of the first class occupy the
shaded slots, and by default, objects of the second class occupy the remaining
slots. Given that objects within a class are indistinguishable, the number of
distinguishable n-sequences is precisely the number of slot patterns.
1.15 Example: Consider a dinner party of 10 people—5 men and 5 women. If they
randomly choose places at a linear table, what is the probability that there will be
a strict alternation of men and women?
The positions of the 5 men determines a gender pattern. Of the Cio.s = 252
possible patterns, only two exhibit a strict alternation of men and women: one
starting with a man and the other starting with a woman. The probability is then
2/252 = 0.00862. D
At this point, we have two interpretations of Cn,k, which we will call the
selection and allocation solutions. It is convenient to add a third interpretation to form the following list.
1. Cn,k is the number of subsets of size k drawn from a pool of n objects.
Stated differently, it is the number of ways to select k distinguishable
Chapter 1: Combinatorics and Probability
objects from a pool of n.
2. Cn<k is the number of slot patterns in which k of the n slots are darkened.
Equivalently, it is the number of ways to allocate k indistinguishable
objects to n containers in such a manner that no container receives more
than one object. The darkened slots indicate the chosen containers.
3. Cn,k is the number of binary rc-vectors whose components sum to k.
A binary vector is a vector whose components are zeros and Is. A binary
n-vector whose components sum to k must contain exactly k Is and n — k
zeros. These vectors correspond to slot patterns in which the darkened slots
represent Is. Hence the number of binary n-vectors whose components sum
to k is the number of slot patterns in a field of n with the requirement that k
are dark. This count is Cn^We conclude this section with a generalization of Theorem 1.14. We
start by extending the example of the colored balls. Suppose now that the
balls come in three colors: red, black, and white. Specifically, suppose that
we have a set of n balls, of which k\ are red, k2 are black, and fc3 are white.
Of course, ki + k2 + k3 = n. How many n-ball color patterns are possible?
Reasoning as before, we have n! ?i-sequences when the balls are individually recognizable. We collect sequences that contain the same color balls
in the same positions. Suppose that N collections result. Then TV is the
required number of color patterns. To determine the number of sequences in
each collection, consider an arbitrary member of a collection C. This sequence
contains ki red balls in specific positions, and the A^ ! permutations of the red
balls among the fixed set of positions result in new sequences with the same
color pattern. So, for each fixed arrangement of the black and white balls,
there are ki\ sequences that exhibit red balls in identical locations. Of course,
each of thefci! sequences exhibits the same color pattern. Continuing this line
of reasoning, we see that each of these ki ! sequences gives rise to k2 ! further sequences by permuting the black balls. Consequently, we have k\ \k2 ! sequences
that display the same color pattern. Moreover, each of the ki\-k2\ obtained to
this point gives rise to k3 ! additional sequences by permuting the white balls.
Because the chosen collection was arbitrary, we conclude that each collection
contains k\\ ■ k2\ ■ £3! of the n! sequences. So N = ΐΐ!/[λ:ι! · k2\ ■ Â;3!].
The general case concerns p colors, or classes, which appear k\, k2, ■.. , kp
times in the original set of 7; objects. Of course, ki + k2 + ... + kp = n.
Theorem 1.14 becomes a special case of the following theorem.
1.16 T h e o r e m : Suppose that a set of n objects contains p classes with
ki, 1 < i < p, objects in class i. 11 — 5Zi=i ki. Consider ?i-sequences, chosen
without replacement, in which objects of the same class are indistinguishable.
The number of such sequences is n\/[k\! · k2\ ■ · · kp\].
P R O O F : The discussion above applies equally well to p classes. |
Section 1.1: C o m b i n a t o r i c s
1.17 E x a m p l e : Verify the multinomial expansion:
(I1 + Ia + ... + *„)» =
( ^ ! . ^ ! . · ν χ ' ' ) χ ' 2 , "xkp"'
The notation under the summation symbol implies that the sum extends over all
p-sequences of nonnegative integers, chosen with replacement, that sum to n. We
will learn later, in Theorem 1.22, that there are precisely C p +„_i, n summands in
this expression, but that detail is not important at this point.
Anticipating an induction on p, we denote the left side by L£. The case L" is
a straightforward verification, while the case LJ reduces to the binomial theorem.
Both computations are illustrated below.
n _ « !
n _
Jfcl = n
- 2^
> x*
> z*
The final reduction for LS occurs because the pairs (ki, ta) which sum to n must have
the form (fci,n —fci).Moreover, as kx runs from 0 to n, the expression (fci,n — fcj)
generates all such pairs.
If p > 2, we proceed by induction, assuming that
1 =
for q = 1, 2 , . . . ,p— 1. We then consider the multinomial of p terms to be a binomial
by grouping the first p — 1 terms as a unit. At that point we can apply the binomial
Lp = [(ii + X2 + · ■ ■ + xP-i) + i p j " = y ^
kl+k2 + ... + kl,_l=n-k,,
The double sum effectively enumerates all the p-sequences that sum to n, so we can
continue the computation as follows.
Λ- p
kp\{n - fcp)! * , ! · 1 , ! . . . 1 Η !
iki!-fc 2 !---A„-i!-ik„!
* „ - t *„
"-1 P
This calculation verifies the multinomial expansion, but an alternative approach is
worth consideration, one that draws on the interpretation of n\/[ki\ ■ta!· · ■ta-!]as
the number of color patterns, each of length n, that we can assemble from a group of
n = ta + t a + ·. . +ta·balls of p different colors. From the discussion of Example 1.13,
we know that (xi + X2 + ■ ■ · + i p ) " will generate p" terms, each containing n factors.
Chapter 1: Combinatorics and Probability
Each copy of ( n 4- X2 + · · · + xp) contributes one of the n factors in each term.
Since each copy must draw its contribution from the possibilities 1 1 , 1 2 , . . . , xp, we
see that each term in the expansion must have the form i / a ^ 2 ' ' ' XP f° r some set
of nonnegative integers ta, ta,... ,kp that sum to n. The question is: How many
terms have the same set of exponents and are thus eligible for combination? Stated
differently, we know that the expansion contains terms of the form Cxl1x22 · · · Xp1',
but we do not yet know the value of C associated with a particular exponent set
ta,ta,·· · , ta-·
Let ta, ta, · · · , kp be a fixed set of nonnegative integers that sum to n. Now
identify each of x i , X2,... , xp with a distinct color. Each of the n copies of (xi +
X2 + ■ ■ ■ + Xp) now hovers over one of n slots. Each drops one of its colors into
the slot below it, under the overall constraint that ta copies must drop color 11, ta
copies must drop color 12, and so forth. The number of possible color patterns is
n!/[ta-'-ta! · · ■ kp\]. But this is also the number of terms in the expansion that involve
x 1 ' z 2 2 · · · xpr. That is, every choice that produces an n-factor term with exactly
k\ xi-factors, ta X2-factors, and so forth through kp x p -factors must correspond to
ta slots contributing the value x i , ta slots contributing the value X2, and so forth.
Hence C = n!/[ta! ■ ta' · · · ta·!], for a particular set ta, ta, · · · , ta-, an< ^ ' n e expansion
is verified. D
Example 1.13 shows that the terms Cnyk appear as coefficients in the
expansion of (x + y) n . For this reason, the Cn<k = n\/[k\(n — k)\], 1 < k < n,
are called binomial coefficients. Similarly, Example 1.17 shows that the terms
nl/f/ci'-fo! ' · -kpl] appear as coefficients in the expansion of (x\+x2+- ■ .+xp)n.
The expressions n\/[ki! · fo! · · ■ kp\], for kt > 0 and k\ +fc2+ · · · + kp = n, are
therefore known as multinomial coefficients.
1.18 Example: Suppose that we have a collection of objects, S = {χι,Χ2,... , i n } ,
each with two attributes, / ( x ; ) and g(xi). For example, / ( x ; ) might be the size of
object x,·, while g{xi) might be its weight. In any case, we assume that both / ( x i )
and g{xi) are nonnegative. The goal is to choose a subset, {x;,, Xi 2 ,... , n , , } , such
that the total g-values are as large as possible, subject to the constraint that the
total /-values do not exceed a prescribed limit, F. Suppose further that our chosen
subset must contain at least 2 objects. The code of Figure 1.3 conducts a search for
the best subset.
Since this is the first algorithm description in the book, we comment on the
C structure's readability, even for a reader with no previous experience in C. Note
carefully the convention for marking the scope of while-loop, for-loop, and conditional execution (if) constructions. The body is set off as an indented mass under
the control statement. The while-loop controls all statements at the indentation
level 1 greater than the while-statement itself [i.e., through the statement X = subset(S, X)]. The while-loop is simply a device for repeatedly executing a step sequence
until a specified condition prevails. This example also illustrates the for-loop, which
executes a step sequence a prescribed number of times. Executions use consecutive
values for an index, which is the variable i in this case. Finally, the code contains
if-statements, each with an indented body. The body contains a step sequence that
is executed if and only if the attendant condition is true.
This algorithm assumes the availability of a subroutine, subset(S, Y), which
cyclically returns the subsets of its first argument. The routine constructs, in a
manner that is not of interest here, the next subset beyond that given by the second
Section 1.1: Combinatorics
function binpack (S, F)
X = phi;
Y = phi; y = 0.0
while (X φ S)
if (size(X) > 2)
a = 0.0; b =
for (i = 1 to
/ / inputs are the set S and the / - l i m i t F
// Λ
X <.y<.i»
cycles through
imuugci SUUICLS
subsets ui
of JS,, starting with the empty set
/ / Y stores the best subset to date; y stores its g-value
a = a + f(X[i]); b = b + g(X[i]);
if ((a < F) and (b > y))
y = b; Y = X;
X = subset(S, X);
return Y;
Figure 1.3. Algorithm for optimal bin packing (see Example 1.18)
argument. Successive calls eventually recover all subsets, and the last subset to
be returned is S itself. For starting the process, we assume that the empty set is
available under the name phi. We also assume a routine that returns the number of
elements in a set: size(X). X[i] extracts an element from the subset X.
How many times does the code call the function / if the input set S is of size
10? How many times if the input set is of size n? When X is a subset of size k, the
for-loop makes k calls to f, and this loop executes for all subsets of size 2 or greater.
Let t(n) denote the number of calls when the input set is of size n. Since C\o,k is
the number of subsets of size k, we have the following summation.
((10) = ]TCio,fc ' k = 45(2) + 120(3) + . . . + 1(10) = 5110
For general n, we obtain the following summation.
'■(") =
$ >
C'„,o · 0 — C„,i · 1 = n ■ 2"
The last simplification involves summing kC„,k, which you may or may not know
how to do at this time. Section 1.2 will review these matters, but for the moment,
even if you are not able to perform the reduction here, you can easily verify that
the general formula generates the correct result for n = 10. D
1.1 The binary coded decimal (BCD) code assigns a four-bit binary pattern
to each digit as follows.
8 |9
0000 0001 0010 0011 0100 0101 0110 0111 100011001
This is just one of N possible codes that express the ten digits as distinct
four-bit binary patterns. What is iV? The BCD code leaves six patterns
Chapter 1: Combinatorics and Probability
unused: 1010,1011,1100,1101,1110, and 1111. Other codes involve different sets of unused patterns, although each such set will always contain
six members. Assume that the first step in constructing a code specifies
the set of unused patterns. How many choices are available at this step?
1.2 What is the probability of receiving a full-house poker hand? A fullhouse hand contains 2 cards of one value and 3 cards of a different
1.3 What is the probability of straight-flush poker hand? A straight-flush
hand contains 5 cards of consecutive value, all from the same suit.
1.4 In the game of bridge, point cards are aces, kings, queens, and jacks.
What is the probability of receiving a 13-card bridge hand containing 3
or more point cards?
1.5 A 5-person committee is formed from a group of 6 men and 7 women.
What is the probability that the committee has 3 or more men?
1.6 Seventeen 10-kilohm resistors are inadvertently mixed with a batch of
eighty 100-kilohm resistors. Five resistors are drawn from the combined
group and wired in series. What is the probability that the series resistance will be less than 500-kilohms?
1.7 If you randomly choose 3 days from the calendar, what is the probability
that they all fall in June?
1.8 In a group of 1000 persons, 513 plan to vote Republican and 487 plan
to vote Democrat. What is the probability that a random sample of 25
persons will erroneously indicate a Democratic victory?
*1.9 Suppose that 8 balls are tossed toward a linear arrangement of 10 open
boxes. What is the probability that the leftmost box receives no balls?
*1.10 You arrive at the gym to find a single free locker, number 4 in a row
of 10. You take that locker. When you finish your workout, you find
6 lockers, including yours, still occupied. What is the probability that
the neighboring lockers on either side of yours are now free?
1.11 Calculate £ f c +k2+k3+k4=7 7!/[A:i! · fo! · fo! · A4!]. The notation means
the sum over all sequences of four nonnegative integers, chosen with
replacement, that add to 7.
1.12 A football fan is sitting high in the stadium and can barely make out the
cheerleaders far below on the field. These cheerleaders are arranged in
a line and are waving colored fabrics. If 3 cheerleaders have red fabrics,
2 have green fabrics, and 2 have blue fabrics, how many color patterns
are discernible by the faraway fan?
Section 1.1: Combinatorics
1.13 Bob and Alice are dinner guests at a party of eight, 4 male and 4 female.
The hostess arranges the guests linearly along a table with the men on
one side and the women on the other. What is the probability that Bob
and Alice will be facing each other or be within one position of facing
each other?
1.14 Six devices d\,d,2,... ,d$ connect with a central processor, which polls
them in order to identify one that is ready to communicate. Suppose
that three are ready to communicate when the polling cycle starts. What
is the probability that the central processor issues exactly two polling
requests before communicating.
1.15 If you rearrange the letters of "Mississippi," how many distinguishable
patterns are possible?
*1.16 In the context of Example 1.12, what fraction of paths from (0,0) to
(10,10) never rise above the diagonal? That is, what fraction of paths
pass only through nodes (x,y) with x > y?
Sampling with replacement
Let us now consider the situation where repetitions are permitted in choosing
from a set of n objects. That is, each chosen object, after being recorded
in the result, rejoins the set before another object is drawn. This allows an
object to appear multiple times in the fc-sequence or /^-combination under
construction. It is an easy task to determine the number of ^-sequences from
a set of size n, but it is somewhat less obvious how to calculate the number
of /c-combinations. We start with the easy theorem.
1.19 T h e o r e m : Let n > 0 and k > 0. Under sampling with replacement, a
set of n objects admits nk fc-sequences.
PROOF: Let 5 be the set in question. \S\ = n. Let s(n,k) denote the
number of fc-sequences that arise from S. There is just one empty sequence,
and 1 = n°. Also, there are n = n1 1-sequences, each consisting of a single
object from S. Therefore, s(n,0) = 1 and s(n, 1) = n. We now proceed
by induction on k. Assume that s{n,j) = n J , for j = 1,2,... ,k - 1. Now
consider all fc-sequences arising from S. Divide them into groups according to
the first symbol. There are n such groups. Within each group, the sequence
following the initial symbol is a (k — l)-sequence from S because S reacquires
the initial symbol before any subsequent choices. By the induction hypothesis,
the number of such continuation sequences is s(n, k - 1) = nk~1. Hence there
are n groups of nk~1 sequences, giving n · nk~1 = nk sequences in total. I
1.20 Example: A sequence of heads and tails results from tossing a fair coin 10
times. What is the probability that heads and tails alternate in the sequence?
The sequence is one of the 2 10 = 1024 10-sequences arising from the set {H, T}
when sampling with replacement. Of these 1024 possibilities, only two provide a
strict alternation of heads and tails—the one starting with a head and alternating
Chapter 1: Combinatorics and Probability
= 04
1-4 ^
Number of tosses
Number of tosses
log (Number of tosses)
S 0.3
log (Number of tosses)
Figure 1. 4. Probability of an equal head-tail split in a sequence of coin
tosses, (see Example 1.21). Logarithms are base 2.
thereafter, and the one starting with a tail and alternating thereafter. So the probability is 2/1024 = 0.001953. The coin may be fair, but strict alternation of the two
possibilities is not likely. U
1.21 Example: A sequence of heads and tails results from tossing a fair coin 10
times. What is the probability that the sequence has exactly five heads in it?
The sequence is again one of 2 10 = 1024 possible 10-sequences that arise
from the set {H,T} when sampling with replacement. To determine the number
of sequences with exactly 5 heads, we note that such a sequence is fixed once we
determine the positions of the 5 heads. Hence, we must choose 5 positions, without
replacement, from the set {1,2,3,4,5,6,7,8,9,10}. Theorem 1.7 states that such
choices number Cio.s = 252. The desired probability is then 252/1024 = 0.246.
Again, the coin may be fair, but the chance that such fairness will be manifest as
an equal split over 10 tosses is rather smaller than 0.5. Indeed, if we were looking
for an even split in 100 tosses, the probability would be Cioo.5o/2100 = 0.0796. This
suggests, correctly it turns out, that the chances of an even split become less as the
number of tosses increases. Figure 1.4 provides some credibility for this suggestion.
Since the probability of an even head-tail split is zero when the number of coin
tosses is odd, all the graphs in Figure 1.4 provide probability points only for even
numbers of coin tosses. The upper-left graph shows how the probability of an even
split drops rapidly as the number of tosses grows larger, but it appears to level out.
Switching to a logarithmic scale on the probability axis, we can discern a bit more
detail about the transition, as shown in the graph in the upper right. However, the
flattening of the decline rate persists in this representation. Using logarithmic scales
on both axes provides the clearest indication of the overall trend, as shown in the
lower-right graph. Π
1.22 T h e o r e m : Let n > 0 and k > 0. Under sampling with replacement,
Cn+k-\,k fc-combinations arise from a set with n objects.
P R O O F : Let the elements of the set be x\,...
, xn. In a given ^-combination,
Section 1.1: Combinatorics
Pattern Repetitions
*l*l*l (1,1,1,0)
*l*ll* (1,1,0,1)
1 * 1 * 1 * (0,1,1,1)
* * l * l l (2,1,0,0)
* l * * l l (1,2,0,0)
* * l l * l (2,0,1,0)
* l l * * l (1,0,2,0)
» » I I I * (2,0,0,1)
• I I I * * (1,0,0,2)
Pattern Repetitions
1 * * 1 * 1 (0,2,1,0)
1 * 1 * * 1 (0,1,2,0)
1**11* (0,2,0,1)
1 * 1 1 * * (0,1,0,2)
1 1 * * 1 * (0,0,2,1)
1 1 * 1 * * (0,0,1,2)
* * * l l l (3,0,0,0)
1 * * * 1 1 (0,3,0,0)
1 1 * * * 1 (0,0,3,0)
I I I * * * (0,0,0,3)
TABLE 1.5. Star-bar patterns and 3-combinations from a field of 4
symbols (see Theorem 1.22)
each Xi is present some integer number of times, say fc\ > 0. Furthermore,
these repetition numbers, provided that their sum is fc, completely specify the
particular fc-combination. So the set {{ki,k2,...
, kn)\ki > 0, Σ ki = A;} is in
one-to-one correspondence with the fc-combinations. Therefore, an equivalent
question is: How many such ordered sets of n nonnegative integers add to fc?
These are n-sequences chosen from { 0 , 1 , 2 , 3 , . . . ,fc} with replacement. Although we know from Theorem 1.19 that the total number of such n-sequences
is (fc + l ) n , not all of these n-sequences sum to fc.
Instead, consider the following indirect approach to counting just those
n-sequences that add to fc. Imaginefcasterisks and n—1 vertical bars, arranged
linearly. Each such display contains n + k — 1 symbols. The possibilities for
n — 4,fc= 3 appear as Table 1.5. (The table depicts 3-combinations from the
set {abed} rather than from {£1,2:2,13,£4} in order to facilitate comparison
with Table 1.1. That earlier table breaks out all the 3-combinations from a
set of size 4 but does not offer a method for counting them systematically.)
The number of asterisks to the left of the first bar specifies the repetition
count for the first symbol from the set. The number of asterisks between the
first and second bar specifies the repetition count for the second symbol, and
so forth. Because exactly fc asterisks appear in each pattern, the sum of
the repetition counts as separated by the bars is always fc. Each distinct
sequence of repetition counts corresponds to a fc-combination, and vice versa.
The total number of fc-combinations is then the number of ways of choosing
n — 1 positions for the bars out of the total of n + fc — 1 possible positions.
Theorem 1.7 gives that value as Cn+k-i,n-i
= Cn+k-i,k, which proves the
theorem. |
1.23 T h e o r e m : The number of n-sequences (x\,X2,X3,.. ■ ,xn) with nonnegative integer components such that 53"_ 0 £i = fc is Cn+k-i,kP R O O F : The proof of Theorem 1.22 establishes a one-to-one correspondence
between these n-sequences and the fc-combinations from a set of size n. The
theorem also asserts that the latter count is Cn+k-i,k- I
The preceding section noted three interpretations of Cn,k, and there are
Chapter 1: Combinatorics and Probability
three parallel interpretations for Cn+k-i,k- They again involve a selection
process, an allocation process, and a count of constrained vectors.
1. Cn+k-i,k counts the number of ways to select k distinguishable objects
from a pool of n, given that each object may be chosen zero, one, or
more times.
2. C„+fc_i,fc counts the number of ways to allocate k indistinguishable objects to n containers, given that a container may acquire zero, one, or
more objects.
3. Cn+k-\,k counts the number of n-vectors whose components are nonnegative integers summing to k.
Theorem 1.23 notes the equivalence of the selection solution and the number
of n-vectors with nonnegative integer components summing to k. The equivalence of the allocation solution follows because each such n-vector specifies
an allocation pattern. The vector (k\,k2,.·· ,kn) specifies fcj objects in the
first container, &2 in the second, and so forth.
In discussing the pigeonhole principle, the chapter's introduction noted
the impossibility of randomly allocating four balls among three containers such
that all containers receive at most 1 ball. It then asked the probability of an
allocation in which the leftmost container receives at most 1 ball. We can
now answer that question. Total allocations number £3+4-1,4 = 15, while
those favoring the leftmost container give it zero or 1 ball. If the leftmost
container receives zero balls, the remaining two containers must accommodate
four balls, which they can do in C 2 +4-i,4 = 5 ways. If the leftmost container
receives 1 ball, the remaining two must receive 3 balls, which can happen in
^2+3-1,3 = 4 ways. The probability is then (5 + 4)/15 = 0.6 that the leftmost
container receives at most 1 ball.
In this small example, we can tabulate * (0,0,4) * (0,3,1) * (0,2,2)
the possible allocations. In the scheme to the * (0,4,0) * (1,3,0) (2,2,0)
right, {ki,k2,ks) means that the left container
(4,0,0) (3,0,1) * (1,1,2)
receives ki balls, the center container receives * (0,1,3) (3,1,0) * (1,2,1)
&2, and the right container fc3. Of course, the * (1,0,3) (2,0,2) (2,1,1)
ki must sum to four. The asterisks mark the allocations in which the left
container receives at most 1 ball.
When choosing fc-sequences or fc-combinations with replacement, it is
possible for k to be much larger than n, the size of the pool. The following
example illustrates this possibility and also investigates the difference between
fc-sequences and fc-combinations in probability calculations.
1.24 Example: Ten tosses of a fair coin produces an ordered pair (h,t), where h
is the number of heads over the 10 tosses and t is the number of tails. Of course,
h + t = 10. How many ordered pairs can result from the experiment? What is the
probability of getting (3, 7)?
The elements h and t can assume any of the values from { 0 , 1 , . . . , 10}, provided that their sum is 10. According to Theorem 1.23, the number of such 2sequences that sum to 10 is C2+io-i,io = Cn,io = 11. Such a high-powered formula
Section 1.1: Combinatorics
is not necessary in this simple case. We see that all possible (ft, t) pairs must be
of the form (ft, 10 — ft), with 0 < ft < 10. Hence, there are 11 such pairs. Also, a
given (ft, t) pair corresponds to a 10-combination from {H,T}, so Theorem 1.22 also
provides the correct answer: Cio+2-1,10 = 11·
Now, since there are 11 possible 10-combinations and just 1 with 3 heads, it
might appear that the probability of getting (3,7) is 1/11. This is not correct. When
calculating a probability as a fraction of some number of equally likely possibilities,
we must check carefully that the denominator equitably represents the field of possibilities. Of the 11 possible 10-combinations, are all equally likely in a physical
experiment with ten coins? Does (1,9), in which just 1 head occurs, appear as often
as (5,5), in which 5 heads occur? No, these outcomes are not equally likely.
What actually occurs are 10-sequences, such as HTTHHHTTHT.
By Theorem 1.19, there are 2 10 = 1024 such sequences., and they are equally likely.
Also, since a sequence with 3 heads is determined once the positions of the heads
are known, we must have as many such sequences with 3 heads as there are 3combinations, without replacement, from the positions { 1 , 2 , . . . , 10}. Using Theorem 1.7, we then compute
P 37
'( - > = W C ' ° ' 3 = W = 0117
ΡΓ 5 5 =
< · > w °- = w = 0·246
P ' 0 . » ) = ^ » . . = - ^ - = 0.0098.0
When calculating a probability as a fraction of cases favorable to some
constraint out of all possible cases, we must be more careful when the process is sampling with replacement. In the simpler situation, sampling without
replacement, we can normally use either sequences or combinations, as long
as we are consistent in the numerator and denominator. As suggested by Table 1.1, each fc-combination, without replacement, gives rise tofc!fc-sequences.
Therefore, a solution using fc-sequences in numerator and denominator is the
same solution using fc-combinations, after multiplying numerator and denominator by fc!. The value of the fraction does not change. If you want, for example, the probability of receiving 4 hearts when dealt 4 cards from a well-shuffled
deck, you can form the appropriate 4-sequence ratio, ^13,4/^52,4 = 0.00264,
or the corresponding 4-combination ratio, Ciz^/C$2,4 = 0.00264. The numerator and denominator of the first fraction are just 4! = 24 times those
of the second. Because combinations are simpler and extend more easily to
complicated situations, they are more frequently used in situations where the
underlying process is sampling without replacement.
In sampling with replacement, each fc-combination does not. give rise to
fc!fc-sequences;sometimes the number of sequences is much less, as illustrated
in Table 1.1. Therefore, the ratio of fc-combinations is not the same as the ratio
of the corresponding fc-sequences. In sampling with replacement, it is normally
the ratio of the appropriate fc-sequences that corresponds to the reality of the
experiment. The following example provides another situation where a simple
Chapter 1: Combinatorics and Probability
sanity check shows that a proposed solution involvingfc-combinationscannot
be correct.
1.25 Example: Suppose that 10 people are gathered in a room. What is the
probability that two or more persons have the same birthday? By having the same
birthday, we mean born on the same day of the year, not necessarily being the same
We imagine each person announcing a birthday from the 365 possible choices.
The birthdays constitute a 10-combination from the set of days: {1, 2, 3 , . . . , 365}.
The 10-combination arises from sampling with replacement because an announcement can duplicate a birthday already used. The number of such 10-combinations
is D = C365+io-i,io· The combinations corresponding to no two persons having the
same birthday are just those that happen to have no duplicate elements. That is,
they correspond to the 10-combinations that you could draw without replacement.
Hence they number C365,io· The rest of the combinations, N = £374,10 — C*365,i0i
then count the sequences for which at least two persons share a birthday. It might
appear that the probability of two or more persons having the same birthday is
N/D. This is not correct.
Consider the following sanity check. Suppose that the room contains only two
persons. You then expect the probability of a shared birthday to be 1/365. (You
ask one person for his birthday, and you have 1 chance in 365 that it will match
that of the other person.) The reasoning above, however, gives
C365+2-1.2 — <?365,2 _
In the physical experiment, the persons are distinguishable entities. If you line
them up and demand birthdays, from left to right, you will extract a 10-sequence
from the set { 1 , 2 , 3 , . . . , 365}, not a 10-combination. Order does count. A different
10-sequence, even if it represents the same 10-combination, corresponds to a different
assignment of birthdays. So it is the 10-sequences r that are equally likely. Reordering
the elements in a 10-combination does not always produce the same number of 10sequences. For example, if the 10-combination has all distinct elements, then the
reordering will produce 10! 10-sequences. If, on the other hand, it has a single
element, repeated 10 times, then reordering produces a single 10-sequence. A 10combination of distinct elements occurs 10! times as often as the 10-combination
containing a single repeated entry.
Returning to the field of equally likely possibilities, there are 365 10 possible
10-sequences, of which Ρίβί,ιο have no duplicate elements. The following calculation
then gives the correct probability.
365 10 - (365)(364)... (356) _
365 364 363
365 ' 365 ' 365 " 365
_ 0
n m
1 1 7
If the room contains only two persons, the appropriately modified calculation gives
the expected result:
365 2 - (365)(364) _
_ _365_ _364_ _ _ 1 _
365 ' 365 ~ 365
The use offc-sequencesorfc-combinationsarises in a class of problems
called occupancy problems. The common abstraction is that k objects must
Section 1.1: Combinatorics
be distributed across n containers. The distribution is with replacement, in
the sense that a given container can acquire multiple objects. The containers
have distinguishable labels; the objects, however, may or may not be distinguishable.
If the objects are distinguishable, then our interest lies with the ksequences, chosen with replacement from the container labels. A given ksequence then tells us which containers contain which objects. In particular,
the first item in the sequence is the (label of) the container where the first
object resides. The second item denotes the location of the second object,
and so forth.
If the physical reality of the situation is such that all such ^-sequences
are equally likely, the consequent probabilities are said to follow MaxwellBoltzmann statistics. In the examples above, we feel that the distribution of
10 coin tosses into 2 bins (heads or tails) follows Maxwell-Boltzmann statistics.
Consequently, a sequence, such as HHTTTHTTHT,
is just as likely as any
other sequence. However, two counts, such as (H = 4,T = 6) and (H =
2,T = 8) are not equally likely. The same remarks apply to the distribution
of 10 birthdays into 365 containers (the calendar days). We must work with
the 10-sequences from the calendar to derive probabilities that comport with
There are cases when it is appropriate to use fe-combinations as a basis for probability calculations. One such instance is when the sampling
protocol is without replacement, as discussed earlier. When sampling with
replacement, however, we must analyze the situation to ensure that the kcombinations properly count the sets under consideration. If we are distributing indistinguishable objects across n bins, a ^-combination tells us the
number of objects in the first bin, the second bin, and so forth. It provides
this information in the following manner. The fc-combination contains zero or
more repetitions of each of the bin labels. So, as noted earlier, each such kcombination corresponds to exactly one set of repetition counts, k\, fc2,... , kn
that total k. The number of objects in the first bin is fci. The second bin
contains &2 objects, and so forth.
If physical reality is such that these repetition-count sets are equally
likely, then probabilities should be ratios of favored counts to all possible
counts. If probabilities follow this law, they are said to follow Bose-Einstein
statistics. Physicists study certain distributions of particles to labeled energy
containers, and they have found, perhaps contrary to intuition, that some
particle distributions do obey Bose-Einstein statistics. These examples are
beyond the scope of this text. We do, however, use fc-combinations to analyze
sampling with replacement when the repetition counts themselves participate
directly in the probability computations. The following examples illustrates
the point.
1.26 Example: Suppose that you have available, in unlimited amounts, food packets of 6 types. The types are labeled 1,2,3,4,5, and 6. You must make up a
shipment of 20 packets, and there is no limit on the number of repetitions of any
C h a p t e r 1: C o m b i n a t o r i c s a n d P r o b a b i l i t y
function optimumShipment ()
k = (0, 0, 0, 0, 0, 20);
/ / initial assignment: 20 of type 6, 0 of types 1-5
choice = k; value = 0.0; searching = true;
while (searching)
x = 0.0;
for (j = 1 to 6)
x = x + k[j] * weight (j);
y — expert (k) / x;
if (y > value)
value = y; choice = k;
if (k = (20, 0, 0, 0, 0, 0))
searching = false;
k = next(k);
return choice;
F i g u r e 1.5. Algorithm to determine optimal food pack (see Example 1.26)
one type. The weight of a packet depends only on its type. The nutritional value,
however, varies when mixed with other packets. That is, some packets are more
nutritional when served in combination with other packets. An expert is available
to determine the nutritional value of any 20 packets chosen from the 6 types. The
expert charges 1 dollar for each consultation when the shipment includes no packets
of type 1. Because packets of type 1 require more expensive testing, the expert
charges 2 dollars to evaluate a shipment that includes packets of type 1. Your task
is to ship 20 packets to the scene of a catastrophe, and you want to maximize the
nutrition delivered per unit weight. What is the maximum that you should have to
pay the expert to attain your goal?
Consider the algorithm of Figure 1.5, where the subroutine expert (k) returns
the nutritional value of a shipment containing k[l] packets of type 1, k[2] packets of
type 2, and so forth, through k[6] packets of type 6. We also assume a routine weight
(j) that specifies the weight of a packet of type j . The routine next(k) acquires a new
assignment of 20 items in some indeterminate manner. Of interest here is simply
the fact that all assignments are eventually processed and that the final assignment
is the one that terminates the while-loop: (20, 0, 0, 0, 0, 0).
We need to determine how many times the algorithm invokes the routine
expert and what fraction involve packets of type 1. Since the code systematically
steps through all possible 6-sequences of nonnegative integers that sum to 20, we
can appeal to Theorem 1.23 for the number of such sequences. It is Ce+20-1,20 =
53130. If the first element of the sequence is zero, then the remaining elements
are precisely those 5-sequences that sum to 20, and again by Theorem 1.23, they
number Cs+20-1,20 = 10626. So our expert costs 1 dollar for each of 10626 calls plus
2 dollars for each of 53130 - 10626 = 42504 calls, for a total of 95634 dollars. This
is the most we will have to pay the expert because this approach investigates every
possible assignment. Q
1.27 E x a m p l e : A hash table is a storage structure that calculates the disk storage
location of a record as a function of the record key value. The disk storage areas
Section 1.1: Combinatorics
are usually called buckets, and each bucket can hold some maximum number of
records, say b. Suppose that there are N buckets. The calculation that operates on
the record key to produce the bucket number is called the hash function. Λ good
hash function spreads the records uniformly across the buckets. This is important
because when a bucket fills, any further records assigned to that bucket overflow
onto new buckets, beyond the original N. Each bucket in the original collection
is potentially the beginning of an overflow chain of additional buckets needed to
accommodate such inappropriate assignments.
The hash function, for example, could be f(k) — k mod 100, where k is the
key value. The key might be a 7-digit part number, for instance. This hash function
then returns the last two digits of the part number, which represent a number in
the range 0 to 99. Records are stored in the corresponding buckets. If the bucket
capacity is 6 = 5, this hash function may or may not generate overflow chains,
depending on the specific part numbers that participate. Even though the system
has room for 500 records in the 100 buckets, it could still generate overflows on the
first six storage operations if the part numbers of these six records all have the same
last two digits.
The appeal of a hash table lies in the subsequent retrieval of the stored records.
When presented with a key, the hash function can calculate the bucket that contains the record, which allows us to retrieve just that bucket, instead of sequentially
searching through all buckets in the structure. Of course, we are occasionally disappointed to find an overflow chain attached to the home bucket, and that unhappy
circumstance necessitates further retrievals. Nevertheless, if the overflows are infrequent, most retrievals will be efficient.
For this example, assume that the hash function does spread the records uniformly across the N buckets in the sense that each bucket is equally likely every time
a storage assignment is made. A random hash function would be of small worth,
because we must be able to locate the storage bucket deterministically when a later
retrieval request presents the key. So we assume that just the storage assignment is
made in a uniform random fashion. Thereafter, the function remembers the buckets
that have been assigned to particular keys, so it can perform retrievals. This is a
bit much to expect of a real hash function, but an analysis of this case allows us to
infer the approximate behavior of a real hash function that does manage to spread
its storage assignments uniformly across the available buckets.
Under this assumption, we want the probability D(p) that a bucket receives
exactly p records, including those that might produce overflows, for p = 0 , 1 , 2 , . . . , n.
The relevant parameters are: TV, the number of buckets; n, the number of records
assigned; 6, the number of records a bucket can hold without overflowing; and
Q = n/Nb, the load factor, which is the fraction of the system capacity (without
overflows) occupied by records when all have been assigned. For the case N =
10000, b = 10, the results appear as Table 1.6, and we now proceed to justify these
After receiving an assignment, a bucket remains in the pool of choices for
subsequent assignments. So the process draws n samples, with replacement, from the
buckets { 1 , 2 , . . . , N}. Theorem 1.19 asserts that there are N" possible n-sequences,
and our assumption about the uniform nature of the hash function means that each
of these sequences is equally likely to occur. If i is an arbitrarily selected bucket,
how many of these sequences assign exactly p records to bucket x? That is, how
many sequences contain exactly p elements labeled i ? Applying the multiplicative
Chapter 1: Combinatorics and Probability
Number o! records
Load factor
T A B L E 1.6. D(p), the probability of a hash table bucket receiving p
records (see Example 1.27)
counting principle, we reason that for each of the C„,p ways of choosing locations
for the elements labeled x, there are (N — l ) n _ p ways of completing the sequence
by assigning the remaining n — p locations to any one of the remaining (N — 1)
buckets that are not labeled x. In other words, the remaining n — p positions form a
(n — p)-sequence chosen with replacement from the buckets other than x, and there
are (N — 1) such buckets. The counting principle then asserts that the number of
sequences that contain exactly p elements labeled i is Cn,p{N — l ) n - p . The required
probability is then D(p) = C„,P(N — l ) n - p / N n . This result is independent of x and
is therefore the same for all buckets. Each bucket receives exactly p records with
probability D(p). We can therefore interpret D(p) as the fraction of the N buckets
that receive exactly p records. Note that these fractions sum properly to 1.
y . Cn,p{N - 1 ) " - " _
E ^ o g - . p l P ( A f - I ) - " _ [l + ( J V - l ) j "
For p > b, this expression gives the fraction of buckets that overflow. Consider again
Table 1.6, which illustrates the situation with 10,000 buckets, each of capacity 10.
Each column corresponds to a different number of records stored in the hash structure. These values run from n = 10000 through n = 99000. The corresponding load
factors appear below the number of records. The column then gives the probability
that a bucket will receive p = 0 , 1 , 2 , . . . ,10 record assignments. The last value,
shown in boldface type gives the probability that a bucket will receive more than 10
assignments and will therefore overflow. You can see that that probability of overflow is rather small until the load factor exceeds 0.7. The chance of overflow does
rise rapidly after this threshold; there is a 40% chance of overflow when the load
factor reaches 0.99. Figure 1.6 depicts the same information in a graphical format,
using interpolation to produce smooth curves for each of the load factors a = 0.1
through a = 0.99. Each curve, therefore, corresponds to one column of data from
Table 1.6. Overflow occurs when significant probability exists for p > 10. As you
Section 1.1: Combinatorics
Population, p
Figure 1.6. Probability that a hash table bucket receives exactly p
records (see Example 1.27)
can see, the curves for low load factors, up through about a = 0.7, have reasonably
small D(p) values, for p > 10.
Hash table manipulations occur often in database applications, and bucket
fetching is usually the most time-intensive operation because it involves disk accesses. If a hash table contains no overflow chains, each retrieval requires just 1
bucket fetch. Performance deteriorates when overflow chains appear. We can use
the D(p) values just obtained to calculate the average number of bucket fetches
per retrieval, assuming that the hash function spreads the records in a uniform
manner. Π
1.17 You receive 5 cards from a standard card deck, but you return each card
immediately after noting its value and suit. What is the probability
that you receive two or more aces? Should this probability be larger
or smaller than the probability of receiving two or more aces when you
keep the cards?
1.18 Three dice are thrown simultaneously. What is the probability that the
sum of the spots on the upturned faces is less than six?
*1.19 You shuffle a deck of cards and then deal all 52 cards in a row. What is
the probability that the display contains no adjacent spades?
1.20 In 10 tosses of a fair coin, what is the probability that the number of
heads will be greater than twice the number of tails?
Chapter 1: Combinatorics and Probability
1.21 You receive three cards from a well-shuffled standard deck. You note
the values and suits and then return the cards. You then receive two
further cards. What is the probability that you see two or more aces
during the course of the experiment?
1.22 FVom an unlimited supply of pennies, nickels, dimes, and quarters, you
are to select 20 coins. How many different combinations are possible?
What fraction consists of monetary amounts that are divisible by 5?
1.23 Consider the equation Σ»=ι xi — **0, where all the Xi are nonnegative
integers. What fraction of the solutions have both xi = 0 and x-j = 0?
1.24 Consider the equation £Z i=1 Xi = 60, where each Xi satisfies Xi > i.
What fraction of the solutions also satisfy Xi > 2i?
*1.25 Suppose that k > n and consider the equation ΣΓ=ι χ> = ^' where the
Xi are positive integers. How many solutions exist?
*1.26 Suppose that Σ7=ι Xi — ^' where the Xi are nonnegative integers. How
many solutions exist?
1.27 How many people must assemble to have a probability greater than
one-half that two or more persons share the same birthday? How many
people are necessary to raise the probability to 0.8?
1.28 A restaurant occasionally announces that the owner will pick up the
tab for any patron whose birthday is tonight. If the owner wants a 50%
chance of escaping without giving away any free dinners, what is the
largest crowd to which he can make the announcement?
*1.29 &?=oßk)n
= ΣΤ^ο^β11 for \β\ < 1. Find the coefficients {ak}.
*1.30 Example 1.27 analyzes a hash table with N buckets of capacity b to
which n records are allocated in a uniform manner. It computes D(p),
the fraction of buckets that receive exactly p records as D(p) = Cn<p{Nl)n~p/Nn.
For large N and n and small p, show that D(p) is approximately (ab)pe~ah/p\, where a = n/(Nb) is the table load factor.
Consider the code segment to the right, in
for (i = 1 to n)
which we want to count the number of calls
for (j = i + 1 to n + 1)
to procedure inner. When i = 1, j runs from 2
inner(i, j);
to n + 1, which gives n procedure calls. When
i = 2, j runs from 3 to n + 1, producing n - 1 calls. The number of calls
decreases by 1 for each successive value of i, until the final loop, when i — n,
produces just 1 call. We can write the total by summing up from 1 to n or by
summing down from n to 1. The following manipulation produces the simple
Section 1.2: Summations
expression S = n(n + l ) / 2 for the sum.
S =
+ ... + n-1
S =
+ n-1
+ n-2
+ ... +
2S = (n + 1) + (n + 1) + (n + 1) + . . . + (n + 1) + (n + 1)
In summation notation, we have shown that $Z"=1 i = n(n + l ) / 2 . In
this expression, the right side is called a closed form because it involves no
summation symbols. An expression with a summation symbol cannot be
expanded without using an ellipsis. The left side, for example, expands as
1 + 2 + 3 + · · · + n. The ellipsis (the three dots) is difficult to manipulate
in further algebraic simplification. The closed-form on the right, by contrast,
easily participates in further algebra.
Although this device quickly provides a closed-form expression for Y^i,
it is not very helpful for Σ i 2 or Σ i3, which also occur in counting operations.
What we need is a general expression for Y^ik, for arbitrary k. These sums
become increasingly complicated as k grows larger, but there is a systematic
approach to the problem. Before launching into a calculation of £ " = 1 i2,
we illustrate a useful transformation called index shifting. This technique is
helpful in manipulating indices that are slightly offset from those we would
find convenient. For example, consider the following two sums, which we can
show to be equal.
£(t + i)i =
The left side generates (2)(1), (3)(2), (4)(3),... , (n + l)(?i), as i runs from 1
to n. But these are exactly the terms generated by the right side as i runs
from 2 to n + 1. The same number of terms result when we shift the sum's
lower and upper limits up by 1, and we can ensure that the value of each term
remains unchanged by adjusting each occurrence of i to i — 1. Observing an
index shift of this sort, we quickly verify the conditions needed to maintain
equality: (1) the lower and upper indices are shifted by the same amount in the
same direction, and (2) the first term generated in the transformed expression
is the same as the first term in the old expression. The latter condition holds
when the summation variable is adjusted by the same amount as the lower
and upper limits, but in the opposite direction. In the example above, the
lower and upper limits advance by 1, and each occurrence of the summation
index decreases by 1. We need not check terms other than the first. If the
initial terms of the two expressions are the same, all succeeding terms will
follow along in parallel as the summation index increases uniformly on both
sides. The following formula, in which the shift s may be positive or negative,
generalizes this rule.
Keeping this transformation in mind, consider the following manipulation,
which manages to recover the starting expression plus some other terms. The
Chapter 1: Combinatorics and Probability
other terms must, of course, be equal to zero. Note that not all transformations are index shifts; some are obtained by breaking off or adding on a term
at the sum's extreme.
^ i ( i - l ) = £ ( i - i)(i - 2) = £*(*-!)-2 £ ( i - l )
We adjust the first summation by breaking out the first and last terms; we
adjust the second summation by shifting the index.
Σ i(i - 1) = -(1)(0) + Σ i(i - 1) + (n + l)(n) - 2 ^
= 53»(i-l) + (n+l)(n)-2 5 ] i
Canceling the left side with its counterpart on the right and transposing the
remaining sum gives 5Z"=1 i = (n + \)n/2. Of course, we already know this
result from the previous calculation. But it is comforting to get the same
answer, and besides, this technique generalizes to calculate Σ"=ι **'■
1.28 Theorem: £ ? = 1 P,,fc = P n + 1 , f e + i/(fc+ 1).
P R O O F : To see the general pattern, we repeat the previous discussion with
the first three factors of i\.
5 > . , 3 = £ i ( i - l)(t - 2) = £ ( i - l ) ( t - 2 ) ( t - 3 )
= 53 i(< - l)(i _ 2) - 3 53(i - l)(i - 2)
= -(1)(0)(-1) + 5 3 i(i - l)(i - 2) + (n + l)n(n - 1)
The left side again cancels with a right term, giving
So we have established the theorem by direct calculation for k = 1 and k = 2,
and we now need to handle the general case.
53 p,,fc+1 = 53 i(i -1)(< - 2) · ■ · (i - k + m - k)
S e c t i o n 1.2: S u m m a t i o n s
Pi.k+i = £ ( ' - !)(* - 2) · ■ · (i - *)(» - k - 1)
= Σ
i-iA* - ( +1)] = Σ ^*- ·* "
Σ <->.*
We adjust the first sum to run from 1 to n by subtracting the summand for i
= 1 and breaking off the last term:
= -(l)(0)(-l)---(l-k)
+ Y>iPi.1,k + (n + l)Pn,k
2_^Pi,k + l +
Σ P iit+1 = Σ '.*+l + n+l,fc+l - (* + 1) Σ
Canceling the left side with its right counterpart and dividing by (fc+ 1) gives
the desired result. |
Using the initial cases already established, we can now compute £3" =1 ik,
for k = 1,2,3, — For example, the computation for ΣΓ=ι i2 is as follows,
and the next theorem collects the first several such results.
(71 + l)n(n
__ V ^ p
= Σ * - 1 > = Σ<*2-<> = Σ < 2 - Σ
.2 _ (n + l ) ? t ( n - 1)
n(n + l) _ n ( n + l ) ( 2 n + l )
[ n ( n + l)/2,
1.29 T h e o r e m : Σ * Ρ = 1 n(n + l)(2n + l)/6, p = 2
p = 3.
P R O O F : The proofs for £™=i i and 5Z"=1 i appear in the discussion above.
For Σ " = 1 i 3 we again proceed from Theorem 1.28.
ii =
= ll
^.(. _
g) =
^(., _ ^
_ · Α .3 _ 3 n ( n + l)(2u + l)
~ 2^1
2n(n + 1)
C h a p t e r 1: C o m b i n a t o r i c s a n d P r o b a b i l i t y
Solving for £ ] i 3 now gives the desired expression: Σ™ = 1 *3 = " 2 ( n + l ) 2 / 4 · I
1.30 E x a m p l e : The procedure to the right below accepts a list (i.e., a vector)
i4[l..n] and sorts it into ciscending order. Find f(n), the number of executions of
the comparison in line 4.
function slowsort(A, n)
The algorithm loops give the following nested
sums, which after some manipulation, provide an opfor (i = 1 to n - 1)
portunity to involve the first summation formula of
for (j = i + 1 to n)
Theorem 1.29. Toward the end, the calculation uses
if (Aß] < A[iJ)
the fact that 2Γ=ι * = ΣΓ=ι ( n — *)■ This is true
x = A[i];
because the second sum generates the same terms as
A[i] = Aij];
the first, but in reverse order.
Aß] = x;
i=l j=i+l
i= \
Besides sums of integers and powers of integers, we also encounter geometric sums and their variations. A plain geometric sum is one in which each
term is a constant multiple of its predecessor. We shall see that we can easily
convert such sums into closed-form expressions. We shall also find closed-form
expressions for certain systematic variations of the plain geometric sum, where
the multiple that generates each term from its predecessor is not constant.
First, consider the plain geometric sum ΣΓ=ο a r ' · The n r s t t e r m ' s a>
and each remaining term is r times its predecessor. We can expand the sum
and employ a trick similar that used to calculate Σ " = 1 *· This ls o n e 0 I " t n e
few instances where the ellipsis notation allows further manipulation. Letting
5 represent the sum, the following array immediately implies the subsequent
closed form for the geometric sum.
S = a + ar + ar2 + ... + arn~l
rS =
ar + ar2 + ... + arn~l
( l - r ) S = a + 0 + 0 + ... +
+ arn
+ arn +
+ 0 -
Equation 1.2 is valid only when r φ 1, but the sum is particularly easy to
evaluate directly when r = 1. Now consider the following arrangement, where
the sum of the first n integers is complicated with a geometric factor. We
immediately add zero at the beginning of the sum in the form Or0. Of course,
this leaves the sum unchanged.
Ύ^ ir> = r + 2r2 + 3r 3 + · · ■ + nrn - Or0 + r + 2r 2 + 3r 3 + · · · + nrn
= Lir
= {
= r~dr- {~T=T- )
+ l)r
+ l\
- (n + l ) r n + 1]
„ oX
Section 1.2: Summations
The expression for an infinite geometric sum finds frequent use, and we can
obtain it by considering the limiting form of Equation 1.2.
1 - rn+l
V = n->oo
lim V
r l = n-»oo
lim -i—
1 —Γ
= ~1 —- Γ,
provided that \r\ < 1. The first two sections of Appendix A, which discusses
sets, functions, and limits, should be reviewed at this time. The following
examples provide further illustrations of limiting summations but assume the
background familiarity with limit processes provided by the appendix.
1.31 Example: Evaluate E ^ o * / 2 '
For any finite n, Equation 1.3 shows that
■A . i
- (n + l ) r n + 1]
Σ"· =
Substituting r = 1/2 and adding the inconsequential term for i — 0 gives
n+ 2
- ?
Σ" _L
limè-f- lim ( 2 -^±l) = 2.
This last reduction used l'Hôpital's rule to evaluate the indeterminate form oo/oo.
The rule stipulates that a ratio !{n)/g{n), in which both / ( n ) and g(n) —> oo,
approaches limn-»oo f'(n)/g'(n),
provided that the latter is well defined. In the case
at hand,
n +2
2" In 2
1.32 Example: Show that the geometric variant Y2°l.i ir' converges for 0 < r < 1.
If r = 0, all partial sums are also zero, and the convergence is trivial. Consequently, assume that 0 < r < 1. From Equation 1.3 we have
^ .
- (n + l)rn + 1}
. „ . , . „
Since r < 1, linin-nx, r " = 0, so assuming for the moment that nr" also goes to zero,
we have that
= lim > ir' = —
To show that nr" actually does go to zero, note that nrn = n / ( l / r ) " , which approaches the indeterminate form oo/oo. Consequently,
lim nrn = lim , ■ ,„ = lim
n-too (l/r)
η-,αο (1/Τ)
, / Τ
= lim
. . . = 0. G
n-»oo I l l ( l / r )
Chapter 1: Combinatorics and Probability
1.33 Example: To the right below is a procedure that searches a list /l[l..n] for
a target X. The unsophisticated algorithm simply compares each list element with
the target, starting at the first element, until it finds a match or falls off the end of
the list. If it finds a match, it returns the index of the matching element; otherwise,
it returns a 0.
Suppose that this procedure runs in an applicaj n { slowsearch(A, n, X)
tion environment where the target appears in the first
for i\ = 1 to n)
position half the time, in the second position one-fourth
jf //Wji _ _ χ )
of the time, and so forth. Over a large number of exereturn i;
cutions, what is the average number of comparisons of
return 0'
the target against a list value?
If we run the algorithm a large number of times, say
N, we will find the target in the first position N/2 times, in the second position N/4
times, and so forth, through finding the target in nth position JV/2n times. The
remaining times, which will be N/2", the target is not in the list. Note that these
fractions properly add to TV.
( N
( ^ - Χ Ί Γ · · · -τ-) -2^
= Ν
f 1
The last equality follows because Equation 1.2 gives
2 "
^ 2
r r r
2 ^— V 2 y
2 n "*" 2 I 1 - (1/2) )
I /
We are concerned with the number of comparisons in line 3, where the forloop compares list element A[i] with target X. In particular, we want to know the
average number of comparisons over a large number of procedure runs N. So we
add the comparisons generated in each run and divide the total by N. (We will
develop a more exact definition of average, as a probability concept, later in the
text. For now, let us be content with this intuitive meaning of average.) Because
N/2 runs find the target in the first position, they each do 1 comparison. N/4 runs
find the target in the second position, and they therefore do two comparisons. The
N/2n runs that find the target in the last position must do n comparisons. The
ΛΓ/2" runs that do not find the target in the list also do n comparisons each. So
we can calculate our required average f(n) as follows, where the second line invokes
Equation 1.3 with r = 1/2.
1 f N
= ΊΓ ( Ί Γ ( 1 ) + T" (2) + ■ ■ ' + -£" (n) + "F- (n) J
" , γ - , Υ ΐ Υ - "
2 n " ' " 4 - \2J
* 2n
, ( 1 / 2 ) [ η ( 1 / 2 Γ + 1 - ( η + 1 ) ( 1 / 2 ) " + 1)
[1 - (1/2)]*
-7-'"[4-œ"-(4)'-(4·)" *·]-«(-»
As n becomes large, f(n) approaches 2, which reflects the bias in the application.
Recall that most of the time the target is among the early list elements, and it takes,
on the average, only about 2 probes to find it. D
Section 1.2: Summations
The technique used to derive Y^=0 ir' is worth remembering because it
generalizes to ]Γ]Γ=ο ikr* for other values of k. The method involves differentiating a closed-form expression of 5Z"_0r*. The exercises pursue this line of
thought. Here, however, we employ the same general approach for expressions
involving binomial coefficients. Summing the coefficients themselves is easy.
£ Cn,k = £ Cn,klkln-k = (1 + l) n = 2"
How might we obtain a closed-form expression for £2£_0 k · Cntk? Consider
the following differentiate function of x.
f{x) = Or + l) n = Σ Cn>kxk\n-k = 52 Cn,iXkik
fc=0 fc=0
f'(x) = n ^ + l)"- 1 =J2kC»<kxk~1
fc=0 fc=0
We let x = 1 to obtain
5 > C „ , * =η·2"- 1 .
Example 1.18 used this closed-form expression. The technique again generalizes to provide closed-form expressions for Σ*=ο kpCn,k for higher values of
PThe binomial coefficients exhibit a large number of interrelationships.
The following two prove useful in upcoming derivations, and the exercises
provide an opportunity to investigate further relationships.
1.34 Theorem: (Pascal's identity) For 0 < k < n,Cnik =
Cn-i,k+Cn-itk-iP R O O F : Let S = { s i , s 2 , · · · ,sn}.
Then Cn,k counts the number of fccombinations available without replacement from S. We divide these combi. nations into two groups: those that contain sn and those that do not contain
s„. Each ^-combination in the first group contains s„ plus k—l further choices
from s i , S 2 , . . . , s n - i - Since there are Cn-i,k-i such choices, the first group
must contain Cn-i,k-i elements. The second group contains ^-combinations
that do not include sn and so must therefore include all k choices from the remaining si, S2,. · · , «n-i· There are Cn-\ik such choices, so there are Cn-\ik
elements in the second group. Adding the two group sizes completes the
By expressing the right side in factorial terms, we can construct an
algebraic proof.
. r
W-l,fc + O n _ i t _ i -
-ΓΤ7Γ 7ΤΓ +
(n -k)
(n - 1)!
(n - k) fc!(n - k - 1)!
k_ _ j n -_!)!_
k (k - l)!(n - k)\
Chapter 1: Combinatorics and Probability
k — ►
T A B L E 1.7. Pascal's triangle (see Theorem 1.34)
Rearranging to force the factor (n — l)!/[fc!(n —fc)!]into both terms, we have
W - i , * + W_i,fc_i — (n -
ΓΤΓ + ^
fc!(n - fc)!
= (?i - fc + fc)
fc!(n - fc)!
fc!(7i-fc)! ~
= C n ,fc·
The first proof is more satisfying for the insight into how to decompose the
original Cn,k fc-combinations into two groups. I
Pascal's triangle arranges the binomial coefficients to emphasize the relationship of Theorem 1.34. As shown in Table 1.7, each row corresponds to
a particular n and contains the entries <?„_*, for fc = 0 , 1 , . . . ,n. Each row
starts and ends with a 1, and Pascal's identity computes each remaining entry
as the sum of its north and northwest neighbors.
1.35 T h e o r e m : Suppose that m < n. Then > J C m ) j C n .
PROOF: Recall the convention that Ca,b = 0 if 6 > a. Hence the sum produces
nonzero terms only while j <m and fc — j <n — m [i.e., j >fc— (n — m)]. So
although j formally runs from 0 to fc, some of the early terms can be zero if
fc > (n - m), and some of the final terms can be zero if fc > m.
Consider a set containing m elements of one type and n — m elements of
a second type. That is, consider 5 = {x\,X2, ■ ■ ■ ,xm,yi,y2, ■ ■ ■ ,2/n-m}· Now
imagine the set elements aligned over n slots. The situation for n = 9,??i = 4
appears as follows.
The diagram shows one of Cg,6 patterns. In this one, 2 elements come from
the prefix (the first four slots), and 4 come from the suffix (the last five slots).
There are C ^ C s ^ patterns of this subtype. Other subtypes contain 1 element
Section 1.2: Summations
from the prefix and 5 from the suffix, which contributes Ci^C5^ patterns,
3 from the prefix and 3 from the suffix, which contributes C^^C^,^ patterns,
and finally, 4 from the prefix and 2 from the suffix, which contributes Cj^Cs^
patterns. These correspond to 1 < j < 4 in the summation formula of the
theorem. Notice that the terms for j = 0,5, and 6 produce zero patterns.
When j'· — 0, the suffix does not have enough slots to provide 6 - 0 = 6
choices, and when j > 4, the prefix does not have enough slots to provide 5
or 6 choices.
In the general case, let X = {x\,.. ■ ,xm} and Y = {yi,... ,yn-m}· X
then refers to the leftmost m slots, while Y refers to the remaining n — m
slots. We decompose the Cn,k patterns into k + 1 groups: those with no
chosen slots from X and k chosen slots from Y, those with 1 slot from X and
λ' - 1 slots from V, and so forth, through the final group, which consists of
patterns with k slots from X and none from Y. These groups are disjoint,
and by the second counting principle, their sizes are as follows, assuming that
k — (n — m) < j < m. If j is outside this range, the group size is zero.
U) Slots
from X Slots from Y
Size of group
C/m.lC'n — m,k— 1
(-•m.JtOn — m,0
The sum of the last column must then be Cn,k, which completes the theorem's
proof. An algebraic proof of this theorem is not as simple as was the case with
Pascal's identity. |
1.31 Obtain a closed-form expression for J27=i *4·
1.32 Obtain a closed-form expression for $2" =1 î 2 /2'·
1.33 Obtain a closed-form expression for ^ " _ 6 ? 3 / 3 * , for n > 6.
*1.34 Obtain a closed-form expression for ΣΖι
■ o» ·
1.35 Obtain a closed-form expression for X)fc =0 ( - l)' : Cn,fc·
1.36 Obtain a closed-form expression for £ £ _ 0 fc2C„,fc.
1.37 Prove Newton's identity: CnikCk,i = Cn,iCn-i,k-i, for 0 < I < k < n.
*1.38 Show that £ £ L 0 Cn+k,k = Cn+m+\,m*1.39 Show that
1.40 Obtain a closed-form expression for ]Γ^ ] + λ
_ n n\/[k\\k2l. · · · kp\].
Chapter 1: Combinatorics and Probability
1.41 Obtain closed-form expressions for the following:
• Efc1+ic2+...+k„=n*;i ■-jrT^TTTfcT
Probability spaces and random variables
Although nondeterministic, the outcomes of a random process nevertheless
obey certain restrictions. In dice experiments, the outcomes are numbers corresponding to the upturned faces. For two dice, the outcomes are the integers
2 through 12. Other integer outcomes are not possible, nor are outcomes
such as heads, tails, blue, or green. The term sample space denotes the set
of possible outcomes. For experiments with two dice, the sample space is
{ 2 , 3 , . . . ,12}. For coin-tossing situations, it is {heads,tails}. For card deck
processes, it is the 52 individual cards. Clearly, the sample space need not
contain numbers. It is convenient, nevertheless, to associate numbers with
the outcomes in order to involve them in further computations. A functional
association of a number with each possible outcome is a random variable,
which we will formally define shortly. At this point, however, we note that
the domains of such functions are sample spaces. Of course, the sample space
outcomes have varying probabilities of occurrence, and when we wish to emphasize this additional feature, we use the term probability space. The next
definition formalizes the concept for the discrete case.
1.36 Definition: A discrete probability space is a countable set Ω (the sample
space), together with a function Pr : Ω -¥ [0,1], such that Σ ω 6 Ω Pr(w) = 1.
The elements of Ω are called outcomes. For ω € Ω, the number Pr(w) is
the probability of u>. The term probability distribution refers to the overall
assignment of varying probabilities to the outcomes. Subsets of Ω are called
events. The probability of an event E is the sum of its constituent probabilities: Pr(i?) = Σ ω β Ε ^ 1 ^ ) · Note that P r ( ) can refer to an outcome or to
an event. The second form represents an extension of the original function to
the collection of all subsets of Ω. In any given application, the distinction will
always be clear from context. |
As the definition implies, probability measures the "size" of sets, where
we consider a point to be a singleton set. Consequently, we will find frequent
occasion to manipulate sets through unions, intersections, and complements.
A quick review of the set theory presented in Appendix A would therefore be
appropriate at this point. In any case, we emphasize the notation A U B =
A + B as alternative expressions for set union. Similarly, A Π B = AB both
represent set intersection. An overbar indicates complement, as in A.
Later, when we consider uncountable sample spaces, we will update
the probability space definition to include a privileged collections of events,
called a sigma-algebra, whose members admit probability definitions. For
Section 1.3: Probability spaces and random variables
the countable case under discussion here, we can always take this privileged
collection to be the set of all subsets of the sample space. That is, all possible
events admit probability definitions through the simple expedient of summing
the probabilities of their constituent outcomes.
We write ( Ω , Ρ Γ ( · ) ) to denote a discrete probability space. If several
spaces are under analysis at the same time, we use subscripts on the probability function to clarify its domain. For example, (X, Prx(·)) and (Y, Pry (·))
might occur in the same discussion.
Although Definition 1.36 describes an abstract mathematical object, we
need to connect it with real-world situations. For example, suppose that
C i , C 2 , . . . are client requests arriving at a database server. In this context,
Ω can be the number of requests in the waiting queue when client Cn arrives.
Individual outcomes are then natural numbers: ω € { 0 , 1 , 2 , . . . } , each with
an associated probability. Of course, if n — 1, the first client always finds
an empty waiting queue, which is hardly a random outcome. For large n,
however, the waiting queue can contain from zero to n — 1 previous clients,
depending on the time needed to service their requests. Some requests require
long service times, while others need only brief attention. Moreover, the
arrival rate of clients depends on a number of external circumstances and
appears random to an observer. From these variations, we expect the waiting
queue size, as seen by client n, to exhibit random lengths.
To be specific, we fix n = 1000 to give time for the influence of the
initially empty queue to dissipate. Under a frequency-of-occurrence interpretation, what then is the intuitive meaning of Pr(2)? We imagine that the
server opens for business with an empty queue at the beginning of each day.
At some point in each day, client 1000 arrives and finds some number of requests in the queue. On some days, this number is zero. Sometimes it is 1;
sometimes it is 2, and so forth. Pr(2) is the fraction of days that client n finds
two requests in the waiting queue.
In general, the frequency-of-occurrence interpretation requires a repeating environment that, on each cycle, provides opportunities for the outcomes
to occur. Pr(w) is then the fraction of cycles in which outcome ω occurs. In
the client-server context, the subset {ω\ω < 5}, or more simply (ω < 5), is
an event. Its probability is J2i=0 Pr(i), which measures the fraction of cycles
in which the client 1000 encounters fewer than five waiting requests. It is
reasonable to assume that these probabilities are nearly identical for clients
In keeping with the interpretation that outcomes occur in a repeating
environment, we extend the frequency-of-occurrence interpretation to events.
If event A contains outcome ω, we say that A occurs whenever ω occurs. For
the client-server situation, the events (ω < 5), (ω > 1), and (ω = 0 mod 2) all
occur simultaneously with the outcome ω = 4. Also, A C D means that if A
occurs, then B occurs. If AB = φ, the empty set, then A and B cannot occur
simultaneously. In this case, A and B are said to be mutually exclusive. If
A + B occurs, then at least one of A and B must octur, because the actual
Chapter 1: Combinatorics and Probability
outcome ω must be in one or the other. Since AA = φ, if A occurs, then A
Definition 1.36 generalizes the intuitive probability of the first section.
There we investigated certain counting mechanisms that enabled us to assign
probabilities to events. Specifically, the assigned probability for event E was
the number of outcomes favorable to E divided by the total number of possible
outcomes. In these cases, the probability assigned to each individual outcome
was 1/iV, where N was the total number of possible outcomes. The new
definition merely allows arbitrary assignments to the individual outcomes,
provided that each assignment is nonnegative and all assignments sum to 1.
The earlier probabilities then become special cases of the current definition.
For example, the set of all 5-card hands becomes a probability space with
Pr(u>) = 1/C52.5, for each hand ω. It is important to note that all outcomes
have nonnegative probabilities. Moreover, each event E, being a collection of
outcomes, has 0 < Pr(£J) < 1. Because an event's probability is the sum of
the probabilities associated with its constituent outcomes, we have Pr(E\ +
E2 + . ■ ■ ) = Σ ί Pr(-^») when the Ei are pairwise disjoint. This last observation
leads to the following simple properties for probability assignments.
1.37 Theorem: Let ( Ω , Ρ Γ ( · ) ) be a probability space, in which A,B are
events and φ is the empty event. Then
ΡΓ(Ω) = 1
Ρτ(φ) = 0
Ρτ(Α) = 1 - Ρ Γ ( Λ )
Ρτ{Α + B) = Pr(A) + Pr(2?) - Pv(AB)
Pr(A + B) < Pr(;4) + Pr(J0).
PROOF: We have ΡΓ(Ω) = Σ ω 6 Ω P l " M = ! directly from the definition. The
disjoint union Ω = Ω + φ immediately gives ΡΓ(Ω) = ΡΓ(Ω) + Pr(<£), from
which Ρτ(φ) = 0 follows. Similarly, Ω = A + A is a disjoint union. Therefore,
1 = Pr(A) + Pr(A), which implies that Ρτ{Α) = 1 - Pr(^4). For the remaining
assertions, we note that A + B = AB + AB + AB is the traditional disjoint
union of A and B. Moreover, A and B admit the following disjoint unions.
A = AB + AB
B = AB + ÄB
Pr(A + B) = Pi{AB) + P r ( ^ S ) + Pr(AB)
Pi{A) = Pv{AB) + Pr(AB)
Pr(B) = Pr(AB) + Pr(AB).
Adding the last two equations and identifying the expression for Pr(^4 + B)
on the right side, we have
Pr(i4) + Pr(fl) = Ρΐ{ΑΒ) +
+ Pr(ÄB) + Pr(Aß)
= Ρτ{ΑΒ) + Ρτ(Α + B).
Section 1.3: Probability spaces and random variables
It then follows that Pr(A + B) = Ρτ{Α) + PT(B) - Ρτ{ΑΒ). Moreover, since
0 < Pr(AB) < 1, we also have PT(A + B) < Ρτ{Α) + P r ( ß ) . |
A discrete probability space is, of course, a model of some repetitive
process which produces outcomes ω with relative frequencies Pr(u>). The
process could be, for example, ten tosses of a fair coin. The space Ω then
contains all possible 10-sequences from the set {H, T}. There are 2 10 such
outcomes, and they all have the same probability: Pr(u>) = 1/210 = 1/1024,
for all ω e Ω. The collection of outcomes with 4 heads constitutes an event,
say the event Ht. The probability of this event is the sum of the probabilities
of its constituent outcomes: Pr(/f 4 ) = Cio,4/1024 = 0.2051.
1.38 Definition: Given a discrete probability space, (Ω,Pr(·)), a discrete
random variable is a function that maps Ω into 72, the real numbers. |
Because we will deal only with discrete random variables for the next
several chapters, the unqualified term random variable will mean discrete
random variable unless otherwise indicated. Tradition demands capital letters
for random variables, but we emphasize that a random variable is, in fact, a
function. To further overload the notation, we also use the random variable's
name to denote its range. Suppose that A" is a random variable. Then Λ'
denotes a function in the expressions Χ{ω), or X~1(S) where S C 72, or
X : Ω -* 72. However, when used as a set, A' refers to the range of X. For
example, suppose that we have another function that maps range(X) into 72.
Using X to denote range(AT), we write g : X -¥ 72, and we also speak of the
image of X under g as g(X). The intended meaning should be clear from
context. The next example suggests why such a twofold interpretation of X is
appropriate. It shows how probability transfers from events in the probability
space to numbers in the random variable's range. Indeed, once this transfer
is well understood, it is convenient in many cases to deemphasize the original
probability space and think of the random variable's range as the relevant
1.39 Example: Let Ω be the set of 10-sequences from {H,T}, and let Prn(w) =
1/2 for each ω 6 Ω. This probability space models ten tosses of a fair coin. Let
the random variable H : Ω —¥ TZ assign Η(ω) the number of heads in the sequence
ω. The range of H is {0,1,2,3,4,5,6,7,8,9,10}. For i in this range, H~l{i) = {ω €
Q | Η(ω) = i} is an event in Ω. We can think of the probability of this event as
transferring to the number i in the range. Because there are Cio.i 10-sequences that
exhibit i heads, we see that this probability is C?io,i/210. If i = 4, for instance, the
probability of H~i(4) = Cio,4/2 10 = 0.2051. We can often proceed from this point
as if the probability space were H = { 0 , 1 , 2 , . . . , 10} with probabilities as follows.
Pr«(/i) = C 10 ,/./2 10
PrH(h) = C 1 0 ,h/2 1 0
The table shows that the elements of H are not equally likely, even though the ele-
C h a p t e r 1: C o m b i n a t o r i c s a n d P r o b a b i l i t y
ments of the underlying Ω all have the same probability. Note that the probabilities
of the range elements sum to 1.
i > 0 , / 2 1 0 = -acritc">^10-'
= lis-O +1)10 = i
In the new probability space, we speak of Pr//(i) or Ρτ(Η = i) to describe the
probability of the event in Ω that maps to i. Similarly, we refer to Ρτ(Η < i)
or Pr(i < H < j) when we mean the probability of the event in Ω that maps to
numbers less than i or to numbers between i and j . It frequently happens that the
only events of interest are those with descriptions of this sort. If this is the case,
subsequent analysis need not refer to the original probability space. Rather, the
analysis works with the new space (H, Pr#(·)). D
In Example 1.39, H becomes a new discrete probability space because
the sum of the elements' probabilities is 1. The next theorem assures us that
this will always happen.
1.40 Theorem: Let X be a random variable over the discrete probability space ( Ω , Ρ Γ Ω ( · ) ) . Then (Χ,Ρϊχ(-)) is a discrete probability space, with
Ρτχ(χ) = ΡνΩ({ωΕΓί \ Λ » = *}).
PROOF: Let {ω\,ω2,...} be an enumeration of Ω. Then {Χ(ω\), Χ(ω2), ■ ■ ■}
sequentially elaborates all of λ', although it may contain some repetitions. (X
may map several Ω-elements to the same value.) If we proceed through the
sequence, eliminating those elements that duplicate elements already passed,
then we obtain an enumeration of X. This shows that X is a countable
set. Let X = {x\,X2, ■ ■ ■} be that reduced enumeration. The X{ are now all
Now we claim that each ω € Ω belongs to precisely one of the sets
À' - 1 (xj). Indeed, for i Φ j , Theorem A.6 asserts X~1(xi) Π X~l{x:)
X~1({xi} Π {XJ}) = Χ~ι(φ) = φ. So no ω can belong to more than one
Also, if ω € Ω, then X{u>) = x t , for some i, which places ω €
It then follows that
^ P r x O r . ) =Y/PrQ(X-l(xi))
= £
ΡνΩ{ω) = 1
because X~l(xi),X~l(x2),.
■ ■ are disjoint subsets whose union is Ω. |
In the case where the underlying Ω consists of numerical outcomes, we
can define a random variable with the function Χ{ω) = ω. The random
variable then retains the same probability distribution as the underlying space.
Another consequence of the theorem is that a function of a random variable
is another random variable. Y = g(X), for example, maps the range of X
to another countable set of real numbers, and the probability transfers from
the elements of X to the new range. That is, P r y ( F = y) = Prx(g _ 1 ({y})).
If g(t) = t2 — 3, for instance, we speak interchangeably of the transformed
random variable as g{X) or as X2 - 3. A functional expression in a random
variable is simply a new random variable.
In view of Theorem 1.40, most analyses start with a statement that a
random variable has a particular probability distribution and continue with-
Section 1.3: Probability spaces and random variables
out further reference to the underlying process that generates the actual outcomes. Further consequences depend only on the probabilities attached to the
elements in random variable's range. In Example 1.39, for instance, we say
that the random variable H has a binomial distribution as given by the table
in that example. The next chapter treats the binomial and other discrete distributions in detail. In circumstances where the underlying outcome space is
unknown, we simply work with the transferred probabilities: Prx : Λ' —► [0,1].
In this case, it is sometimes convenient to extend the probability function to
all of 71. We do so by defining Prx(y) = 0 for all y outside range(X). We
then have Prx : H -¥ [0,1], but the probability function is zero except on a
countable subset of H.
1.41 Example: Let Ω be the collection of possible 5-card hands from a standard
card deck. Assume that all hands are equally likely. That is, assume that Pro(u;) =
1/C52.5, for each ω £ Ω. Let X be the random variable defined by Χ(ω) equals 2
times the number of aces plus the number of kings in the hand u>. If, for instance,
u> contains 1 ace and 1 king, then X(u>) = 3. What is the probability distribution
of X?
The largest value that X can assume is 9, which occurs when u> contains 4 aces
and a king. The smallest value is 0, which occurs when ω contains no aces or kings.
We must first tabulate the events in Ω that correspond to the inverse images of the
numbers 0 through 9. The techniques of Section 1.1.1 are available to calculate the
probabilities of all these inverse images. The results appear in Table 1.8.
As a sample calculation, consider the case where X = 2. This can arise from
two disjoint kinds of hands: those with no aces and two kings and those with 1 ace
and no kings. We calculate these values separately. For the case with no aces and two
kings, we use the multiplicative counting principle to break the total hands into three
successive categories: those that differ in the choice of aces, those that differ in the
choice of kings, and those that differ in the final choice from the remaining 44 cards.
So the number of hands with no aces and two kings is C^^CA,2(^44,3. This gives
79464 hands with no aces and two kings. Similarly, we calculate that 543004 hands
have 1 ace and no kings. So λ' = 2 occurs in 79464 -I- 543004 = 622468 hands out of
C52,5 = 2598960 possible hands for a probability of 622468/2598960 = 0.2395066.
The final column of Table 1.8 thus gives the probability distribution of the random
variable λ'. Π
1.42 Definition: Let f : 1Z -> H. Let X be a random variable with range
xi,X2. — The expected value of / ( · ) , denoted E[f(·)] or simply E[f], is the
sum £[/(·)] = 52j /(:Ti)Prx(:r;), provided that the sum converges absolutely. |
A series Σί^ converges absolutely if £ ; lflil converges, and we insist
that the defining sum in the expected value computation possess this property. Sections A.2 and A.3 in Appendix A review convergence and absolute
convergence of series. In view of Theorem A.45, this condition allows the
summation to proceed over the range of X in any order. Also note that the
definition depends on the random variable in question. If there is any ambiguity, we write Εχ [/(·)] to clarify that it is the range of A" and the corresponding
probabilities that appear in the sum. The following theorem gives two useful
properties of the expected value operator.
Chapter 1: Combinatorics and Probability
(x) components
Size of components
= P r ( X = x)
C 4 ,oC4,oC 4 4,5
C4,oC 4 ,iC44,4
C 4 ,oC4,2C 4 4,3
C4,iC 4 ,oC44,4
C4,iC 4 ,iC44,3
= 22704
C 4 l iC 4 ,3C 4 4 ,i
C 4 , 2 C 4 ,iC 4 4 ,2
C 4 ,iC 4 , 4 C 4 4 ,o
C 4 l 2C 4 ,2C 4 4 ,i
C 4 ,2C' 4 ,3C 44l o
C 4 ,3C 4 ,iC 4 4 ,i
C 4l4 C 4 ,o£? 44 ,l
= 79464
= 704
= 4
= 1584
= 3784
= 24
= 704
= 24
= 44
T A B L E 1.8. Random variable X is 2 times aces plus kings in a 5-card
hand (see Example 1.41)
1.43 Theorem: E[·] is a linear operator. That is, if E[f] and E[g] exist, we
have, for constants a, b,
E[af + bg] = aE[f] + bE[g}.
Also, if h(t) = K, a constant, then E[h) = K.
PROOF: Let X be the random variable in question, and let χι,Χί,...
enumerate its range. Since E[f] and E[g] exist, their defining sums converge
absolutely. Let
A = Σ l/(^)lPr(*i) and B = £ ^(ιΟΙΡφί).
By the triangle inequality,
\af(x,) + όρ(^)| < \af{Xi)\ + \bg(Xi)\ = \a\ ■ \f(Xi)\ + \b\ ■ \g(Xi)\.
< |α| 5 3 |/(χί)|ΡΓ(χ0 + |b| ^ ) |s(*i)|Pr(ari)
which shows that the defining sum for E[af + bg] converges absolutely. Con-
Section 1.3: Probability spaces and random variables
sequently, E[af + bg] exists. Therefore,
E[af + bg) = ^ ( ο / ( ϋ ) + bg{Xi)) ■ P r ( n )
= a Σ fixiWxi)
+ b Σ g(xi)Pr{Xi) = a £ [ / ] + bE[g\.
Also, if h{t) = K, then Ε[Λ(Χ)] = Σί κ ' P r ( ^ ) = ^ Σ , Ρ Φ Ο = Λ '· ■
For simple polynomial functions, such as /(<·) = t,g(t) = t2, we frequently write the corresponding expression of the random variable as the
argument of the expectation operator. That is, we write E[X] for E[f] and
£[A 2 ] for E[g]. Slightly more complicated expressions also appear in this
abbreviated notation, such as E[{X - o)2] for E[h] where h(t) = (i - a) 2 .
When you know the probability distribution of a random variable, you
theoretically know everything about it. Practically, however, the distribution
is not as easy to manipulate as certain summary values. Two of these values
are the mean and variance, which measure, respectively, the central value,
around which the probabilities balance, and the dispersion about that balance
1.44 Definition: Let X be a random variable with probability distribution
Pr(). The mean of X, denoted μ or μχ, and the variance of A', denoted σ2 or
σ2χ, are defined as follows, provided that the defining expected values exist.
μ = £?[X] = Ç i i - P r ( i i )
σ = Ε[(Χ - μ)2} = Σ(*< - /*)2 · ΡΦΟ,
where £1,2:2, · · ■ is the range of A'. The square root of the variance, denoted
σ or σχ, is the standard deviation of the random variable. I
Using Theorem 1.43, we can derive a more useful formula for the variance.
σ2 = E[(X - μ)2\ = E[X2} - 2μΕ[Χ] + μ2 = E[X2} - 2μ2 + μ2
= E[X2} -ß2=(j2xl·
Pr(ïi) J - μ2
1.45 Example: What are the mean and variance of the random variable A' in
Example 1.41?
Table 1.9 excerpts the probability data from the last column of Table 1.8
and tabulates the terms needed for the mean and variance calculations. From the
summations in the table's last line, we calculate μ = 1.154 and a2 = 2.858 —
(1.154)2 = 1.526. The standard deviation is v/l.526 = 1.235. F igure 1.7 shows how
the mean represents a balance point for the probabilities, in the sense that the x-axis
balances at the mean when loaded at the points xi, X2,... with weights proportional
to the corresponding probabilities. The figure also shows two boundary lines, at the
Chapter 1: Combinatorics and Probability
Pr(X = x;) i. ■ Pr(X = Xi) x'f-Pr(X
= Xi)
T A B L E 1.9. Sample calculation of the mean and variance (see Example 1.45)
Figure 1.7. The mean as the balance point of probability weights (see
Example 1.45)
mean plus or minus two times the standard deviation.
will become clear after the next two theorems. Q
1.46 Example: Over a wide range of conditions, the operator of an airline reservation terminal has determined that the response time T,
rounded to the nearest 1/2 second, follows the
distribution given in the first two columns to the
right. The operator obtained these data by appealing to the frequency-of-occurrence meaning
of probability. The data reflect the fraction of
responses that entail a wait of 0.0,0.5,... ,4.0
seconds. What are the mean and variance of T? >;
The significance of these lines
t Pr(i)
0.0 0.15
0.5 0.05
1.0 0.20
1.5 0.30
2.0 0.05
2.5 0.05
3.0 0.10
3.5 0.05
4.0 0.05
18.0 1.00
t ■ Pr(()il
■ Pr(t)
Section 1.3: Probability spaces and random variables
The summary line of the third column gives the mean, μ = Σ <;Pr(T = ti) =
1.5750. The corresponding entry at the bottom of the last column gives E[T2]. The
variance is σ2 = E[T2} - μ2 = 3.7125 - (1.5750)2 = 1.232. D
If Y = f(X), we can compute the mean and variance of Y in the probability space range(V) or in range(X). Let 2/1,1/2,-·· be an enumeration of
range(F), and let xlyX2,..enumerate range(X). R a n g e ^ ) splits into disjoint sets: those elements that map to j/i, those that map to 2/2 > and so forth.
These sets are, of course,
ßY=EY[Y] =
Y;yiprY{yi) = sryt?rx{ri{yi))
= Σ
sryi £
/(^i)Prx(^) = Σ
/(^)Prx(^) =
The last line follows because the separate sums over the partition components / _ 1 (3/ι),/ _ 1 (ΐ/2)ι· · · total precisely the sum over range(A'), A similar
computation gives
σγ = Ey[(Y - μγγ]
= EY[Y2} - μ\ = Ex[f*} -
1.47 Example: Suppose that X has the probability distribution below, and let
Y = f(X) — iX1 + 6. Find the distribution of Y and its mean and variance.
- 2 - 1
Pr(X = x) 0.25 0.20 0.15 0.35 0.05
Under the specified mapping, Y assumes the values 14,8, and 6. Below we tabulate
the inverse images and associated probabilities, which give rise to the mean and
variance that follow.
y-»(14) = {-2, 2} -> Pr(Y = 14) = Pr(X = -2) + Pr(X = 2) = 0.30
Y~\8) = {-1,1} -> Pr(y = 8) = Pr(X = -1) + Pr(X = 1) = 0.55
y - ' ( 6 ) = {0}
-> Pr(V = 6) = Pr(.Y = 0) = 0.15
μγ = 14(0.30) + 8(0.55) + 6(0.15) = 9.50
σγ = 14 2 (0.30)+8 2 (0.55)+ 62(0.15) - 9.502 =9.15
Equivalently, we can compute the mean and variance of Y = f(X) using the X
ßY = /(-2)(0.25) + /(-l)(0.20) + /(0)(0.15)+/(l)(0.35) + /(2)(0.05)
= 14(0.25) + 8(0.20) + 6(0.15) + 8(0.35) + 14(0.05) = 9.50
°Y = [/(-2)-9.5] 2 (0.25) + [/(-l)-9.5] 2 (0.20) + [/(0)-9.5] 2 (0.15)
+ [/(l) - 9.5]2(0.35) + [/(2) - 9.5]2(0.05)
= (14 - 9.5)2(0.25) + (8 - 9.5)2(0.20) + (6 - 9.5)2(0.15) + (8 - 9.5)2(0.35)
+ (14-9.5) 2 (0.05) = 9.15D
Chapter 1: Combinatorics and Probability
Suppose that a random variable X assumes values χχ,χτ,...
with corresponding probabilities px = Pr(xi),p2 = Pr(#2)>
If we repeat the underlying random process a large number of times, say N, and let iVi,JV2, ■ ■ ■
be the number of times that X generates, respectively, the values
then we expect Νχ/Ν « Ρι,Ν?/Ν « p2, and so forth. So the average X value
over the course of the N trials is
= ^2xk~-Kj2xkPk
= Ε[Χ] = μ.
In this sense, the mean is frequently called the average value of the random
variable. The frequency-of-occurrence interpretation provides a strong connection between the mean, which is a mathematical concept, and the average
over a long sequence of observations, which is a real-world construction. This
connection enables many applications of probability to real-world problems,
and we consequently want no ambiguity about the mean. In particular, we
want the same mean regardless of how the defining sum is executed, and this
invariance comes with the insistence on absolute convergence. The next example illustrates the difficulties that can arise when absolute convergence does
not hold. It uses certain properties of convergent series, which may call for a
review of Section A.3 in Appendix A.
1.48 Example: Note that
£ £ . , + t£s. + ±^-'+±(Th--T)
1+ 1
These partial sums form a nondecreasing sequence with upper bound 2. Hence
the sum converges. Let A = Σ°Ζγ 1/i' · A is actually π /6, but that detail is not
necessary for this example. Now consider the random variable λ' with distribution
Pr(X = i) = l/(2Ai2), for integer i φ 0. Understanding that i = 0 is to be omitted,
we sum these weights as follows.
·= -Λί
i= l
1 \
= 1 Ζ ^ + Λ) = ι
Thus we have a proper discrete probability distribution. Consider now the expression
for the mean, again with the understanding that i = 0 is omitted from the sum.
^—' 1Ai
l i m
Λ1.ΛΓ-»οο ^—'
This sum has no definitive limit. (Appendix A investigates this particular sum in the
argument accompanying Figure A.l.) By carefully choosing interleaving spans of
Section 1.3: Probability spaces a n d r a n d o m variables
positive terms with compensating spans of negative terms, we can drive the partial
sum back and forth across any prespecified interval. That is, we can force the partial
sums to oscillate continually with infinitely many high points greater than 1 and
infinitely many low points less than — 1. The limit of such an oscillating sequence of
partial sums does not exist. Since the terms 1/i are approaching zero in magnitude,
each positive or negative excursion uses an increasingly large span of terms, but there
are always infinitely many remaining to continue the process. We can even force the
partial sums to converge to an arbitrary number. For example, we sum a span of
positive terms to drive the partial sum to a value in (42,43). We follow with a span
of negative terms to force it back into the interval (41,42). Alternating positive
spans with negative ones, we successively force the partial sum into the ranges
(42,42+ 1), ( 4 2 - 1, 42), (42,42 + 1/2), (42 - 1/2, 42), (42,42 + 1/4), ( 4 2 - 1 / 4 , 42) · · ·.
These partial sums converge to 42.
It is hardly helpful to have a mean when the value depends on the summation
order. We intend to interpret the mean in real-world applications as the average
value over many observations. It would be rather artificial to suggest that we sum
a run of positive observations, balance it with a run of negative observations, and
continue in this manner to achieve some prespecified fraction. Rather, we must
take the observations as they occur. We are trying to describe and explain the
nondeterministic process; we are not trying to control it. The ambiguous behavior of
the expected value operation in this example arises because the sum is not absolutely
2A M,iv-*oo ί—ι
2A AÎ,JV->OO ^
*—> ι
In cases like this, we say that the mean does not exist. It is clear that the variance does not exist either, because the defining sum for S'fA'2] is not absolutely
convergent. Q
T h e variance measures dispersion about the mean, with a larger variance corresponding to a larger dispersion. For example, we can construct
the probability distribution of a random variable X by placing probability
1/2 at x = 1 and 1/2 at x = — 1. T h e mean and variance are, respectively,
( l ) ( l / 2 ) + (—l)(l/2) = 0 and (1 - 0 ) 2 ( l / 2 ) + ( - 1 - 0 ) 2 ( l / 2 ) = 1. Now construct a competing random variable Y by placing probability 1/2 at y = 10
and 1/2 at y = - 1 0 . T h e mean is again ( 1 0 ) ( l / 2 ) + (—10)(l/2) = 0, b u t
the variance is now much larger: (10 - 0 ) 2 ( l / 2 ) + ( - 1 0 - 0 ) 2 ( l / 2 ) = 100.
To continue the mechanical analogy, if the mean corresponds to the center of
gravity, the balance point, then the variance corresponds to the moment of
inertia. T h e following theorems show how the variance places limits on how
far from the mean we can find significant probability masses.
1.49 T h e o r e m : (Markov's inequality) Let X be a nonnegative random variable with mean μ. Then, for x > 0, P r ( X > x) < μ/χ.
PROOF: Let x\,X2,...
Xi > 0 for all i.
μ = E[X} = ^Xi?t(xi)=
be an enumeration of range(A').
By assumption,
Chapter 1: Combinatorics and Probability
A smaller value results if we omit the first summation.
μ >
χ Ρτ χ
ί (ύ
Pr{xi) j = x ■ Pr(X > x)
Dividing by x establishes the theorem. |
1.50 Theorem: (Chebyshev's inequality) Let X be a random variable with
mean μ and standard deviation σ > 0. Then, for any r > 0, ΡΓ(μ - τσ < X <
μ + τσ) > 1 - 1 / r 2 .
PROOF: The result is, of course, trivial for r < 1. In this case 1 - 1/r2 < 0,
and the probability of any event must be nonnegative. Hence, we expect
interesting applications of this theorem to involve r > 1. In any case, let
μγ = Εγ[Υ} =
Because Y > 0, we can apply Markov's inequality to obtain P r y ( F > r 2 ) <
1/r 2 . T h e n
- p - > P r y ( y > r 2 ) = 1 - PvY(Y
< r 2) = 1 - P r x f
( X
= 1 - Ρ Γ χ ( μ χ - rax
< X < μχ +
Transposing terms gives the expression asserted by the theorem. |
1.51 Example: Continuing Examples 1.41 and 1.45, we verify Markov's and Chebyshev's inequalities. Recall that we compute the random variable X as twice the
number of aces plus the number of kings in a five-card hand. We have computed
μ = 1.154 and a — 1.235. Reading the required probabilities from Table 1.8, we
construct the following probability values and their Markov bounds.
Pr(X > x)
bound (μ/χ)
Pr(X > x)
As you can see, the actual PT(X > x) is always less than the corresponding
Markov bound, μ/χ. Indeed, it is usually very much less, which suggests that the
Markov bound is not very precise. The Chebyshev inequality states that Ρτ(μ — 2σ <
X < μ + 2σ) = Pr(-1.316 < X < 3.624) > 1 - (1/2) 2 = 0.75. A glance at Figure 1.7
shows that this span includes X = 0,1, 2, 3 for a total probability of 0.949, which
again beats the bound by a large margin. Despite this example, the Chebyshev
bound cannot be improved in the general case. One of the exercises pursues this
idea. 0
Section 1.3: Probability spaces a n d r a n d o m variables
Figure 1.8. A cumulative distribution function
We introduce one final aspect of a random variable in this section, the
cumulative distribution function. This new description of the random variable
turns out to be equivalent to its probability distribution. However, it is more
useful in certain circumstances, such as the computer simulations of Chapter 3.
1.52 Definition: Let X be a random variable with range X\ < X2 < x* < · ■ · ·
The cumulative distribution function of Λ', denoted F(x) or Fx(x), is defined
for all real x by
F(x) = Pr(X <x) =
Pr(x t ). I
If, for instance, X has the probability distribution in the third row below,
we can calculate F(xi) by summing the values at or to the left of ι,, as shown
in the fourth row.
Pr(X = a)
For other values of x, F(x) remains constant at the value of the largest .τ;
that does not exceed x. That is, F(x) = F(x,), for x, < x < xl+i. Also,
F(x) = 0 for x < xi, and if there is a largest value in the range of X, say xn,
then F(x) = 1.0 for x > xn. Figure 1.8 shows this staircase behavior of F(x)
for the current example. Note that F(x) is continuous from the right, even
at the jump points. That is, limy_+I+ F(y) = F(x). Recall that the notation
y -¥ x+ means that y approaches x through values greater than x. Except at
points in range(X), F(x) is also continuous from the left.
1.53 E x a m p l e : Suppose that a disk server provides file access services for a collection of clients. The server takes a fixed 10 milliseconds to read or to write a
disk sector, which is a fixed-length character string. The server maintains a buffer,
which can hold at most two requests. Therefore, if client requests arrive during the
10-millisecond interval while the server is busy with a prior request, the new requests, up to a maximum of 2, will not be lost. Empirical studies for this particular
client-server arrangement indicate that the probability of k client requests arriving
during any 10-millisecond interval is given by (0.9)*e -0 ' 9 /fc!. What are the mean
Chapter 1: Combinatorics and Probability
and variance of this distribution? What is the probability that the server will lose
a client request if it starts its 10-millisecond service interval with an empty buffer?
What size buffer would reduce this probability below 0.05?
Let N be the random variable that counts the number of arrivals during a 10millisecond interval. We then have Pr(N = k) = (0.9) f c e - a 9 /W· Since Σ Π = ο χ * Α !
is the expansion for ez, we see that
f; Pr(N = k) = e"0·9 Σ,-Qjf- = e-° V
= 1-
Therefore, N is indeed a discrete random variable with range 0 , 1 , 2 , . . . . Computation of the mean and variance uses the summation techniques of the preceding
fc = l
°· 9 Σ ( 0 ΐΓ' = °-9e"°9 Σ-τ-= 0 - 9e " 0 - 9e °' 9 = °·9
σ2 = E[N2} - μ2 = Π Γ fc2Pr(JV = k) ) - (0.9)
= -(0.9) 2 + e - 0 · 9 V Ü +
= -(0.9) + (0.9)e-°·
A:(0.9) fc
+ (0.9)e-°- X:
= -(0.9) + (0.9)£[JV) + (0.9)e -0 - 9 e° 9 = -(0.9) 2 + (0.9) 2 + (0.9) = 0.9
From the formula Pr(W = k) = (0.9)*e-°-9/fc!, we can calculate the cumulative
distribution function, F(x), at the first several jump points. To four decimal places,
the probability is zero for k > 6.
= k)
Since the server starts its service interval with an empty buffer, it will not lose a client
request as long as no more than 2 such requests arrive in the 10-millisecond interval.
The probability of 2 or fewer arrivals is F(2) = 0.9372. Hence, the probability
is 0.9372 that the server will not lose a request. Equivalently, the probability is
1.0 — 0.9372 = 0.0628 that it will lose a request. In the sense that the mean of a
random variable reflects its average value, we expect an average of 0.9 requests to
arrive while the server is busy. Because fewer than 1 request arrives, on the average,
during the time necessary to service a prior request, we see that the server has a
chance of staying ahead of its clients. Moreover, there are enough buffers to cover
the average number of incoming requests during the service interval. Nevertheless,
there is still a 6.3% chance of losing a request through buffer overflow. To decrease
this value below 0.05, we must be able to buffer sufficient requests, say n, such that
Section 1.3: Probability spaces and random variables
F(n) > 0.95. Consulting our tabulation above, we see that we must have n > 3.
That is, we need 1 additional buffer. Of course, if the server starts its 10-millisecond
cycle with a partially filled buffer, it will incur a higher probability of an overflow.
Analysis of that situation will have to wait until we have developed more tools.
The probability distribution of this example is called a Poisson distribution,
and it occurs frequently in client-server performance calculations. The next chapter
takes up the Poisson distribution in detail. It also develops less tedious methods for
determining the mean and variance. Q
1.42 Let ( Ω , Ρ Γ ( · ) ) be a discrete probability space, in which A, B,C are
events. Prove the following expression of the probability of the union,
and deduce the proper expression for Pr(U"=1.<4j).
Ρτ(Α + B + C) = Pt(A) + Pr(B) + Pr(C)
-Pi(AB) - PT(AC)
- Pi(BC) +
* 1.43 Suppose that Ai C A2 C A3 C · · · is an event sequence. Prove that
Pr(U~ =1 i4„) = lim,,-*«, Pt(An).
*1.44 Suppose that B\ D B2 D B3 D · · · is an event sequence. Prove that
Pr(n~ = 1 B n ) = l i m n - ^ P r i B n ) .
1.45 Two dice are rolled. Let X be the sum of the spots on the upturned faces.
What is the range of X? What are the corresponding probabilities?
What are μ and σ 2 ?
1.46 You receive 5 cards from a well-shuffled deck. Let X be twice the number
of queens plus the number of jacks in the hand. What is probability
distribution of X? What are μ and σ 2 ? What is the relationship between
this problem and that discussed in Examples 1.41 and 1.45?
1.47 As in the preceding problem, you receive 5 cards from a well-shuffled
deck. The hand contains nq queens and n, jacks. Let Y = (nq - n,) 2 .
What is the probability distribution of Y? What are μγ and σ\Ί
1.48 In 5 tosses of a fair coin, let X be the number of heads minus the
number of tails. What is the probability distribution of XI What is the
cumulative probability distribution? What are μ and σ 2 ?
1.49 In a game you receive three cards from a well-shuffled deck. You then
receive $10 if the hand contains an ace and a face card. The face cards
are the kings, queens, and jacks. All cards are then returned to the deck
before the next game begins. How much would you be willing to pay,
per game, to play a large number of hands?
Report inappropriate content