Next: About this document ...
Notes on runtime and elements of sorting
One can argue that the single key notion in all of computer science is
that of massive repetition. That is, computers (when properly
programmed) are great at doing things over and over, and that is where
their enormous utility comes from.
This means that lines of code can get executed over and over, instead
of just once. (Imagine what it would be like if we had to program a
separate line for each execution of an instruction -- to do 1000 steps
would take 1000 lines of code!)
On the other hand, if execution of (such a massively repetitious)
program required unacceptably large amounts of time, or space, etc,
then the utility would not be so enormous after all. Thus we are
usually interested in the "efficiency" of a program, as well as that
it solve the problem we are interested in. There are several concerns
hidden in the notion of efficiency, among them being space (how much
memory is needed), time (how long does it take to run to completion),
and coding-time (how hard is it to design/write/debug the program).
Our main concern here will be that of time, usually called
RUNTIME. However, this does not usually mean the literal clock time
measured as the program runs; this is for many reasons. For one thing,
that depends on lots of factors in addition to the code itself, eg:
the compiler, the operating system, the hardware, not to mention the
data. The latter is an especially interesting issue; we normally
expect the runtime to increase if the size of the dataset increases,
although this is not necessarily the case. However, this suggests
that we think of runtime as a function T_p (n), where n is a measure
of the amount of data and p is the program. This notation also
suggests that we are abstracting away from any physically exact time
measure since the other factors above (compiler, OS, hardware) are not
specified. But then we need to decide on what "time" measure we are
using.
We will aim, as a first approximation, at something like the total
number of line executions that occur until completion; this depends on
the data and on the program, and nothing else. For many purposes this
is too complicated, however, and we can often find a useful (second)
approximation (to this first approximation). For instance, for very
many sorting algorithms -- of which we will see two below, and more
later -- counting the number of executions of data comparisons
[Boolean statements such as (x < y) ] gives an excellent measure of
the overall time performance. Here "excellent" means this: even
though the actual count we come up with this way may not tell us about
the actual clock time that will pass, and even though it may not even
tell us the total number of line executions, we do learn about how
time performance depends on data size.
There is an implicit understanding here that, although different
instructions may differ in how long they take to execute, these
differences should not be very large. Thus a simple assignment
(x <-- y ) statement may take longer than one that has evaluations
embedded in it ( x <-- (w + r)/z ); but as long as we are confident
that no instructions differ in clock time by more than a constant
factor (say, 10) then our second approximations are still useful.
Thus if we find that, for one program, p, that solves problem P,
T_p (n) = n^2, whereas for program q, T_q (n) = 30n. In this case, we'd
conclude that q is the "better" program since it responds to more
data with a proportionate increase in runtime, whereas p magnifies a data
increase quadratically, eg:
T_p (10) = 100 and T_q(10) = 300
but
T_p (100) = 10000 and T_q(100) = 3000
and worse:
T_p (10^10) = 10^20 and T_q(10^10) = 30*10^10
Thus the runtime for q grows reasonably with data size -- it is
reasonable that if we have enough memory to accommodate ten times as
much data, then we can pay for ten times as much speed (or are willing
to wait ten times as long). Similarly for 10^10. But no one can
afford to wait 10^20 --even if the units are nanoseconds -- since that
is 10^11 seconds, about 3000 years! Note that 10^10 nanoseconds is,
by contrast, only 10 seconds.
The lesson here is, in part, that a multiplicative factor, even a
seemingly large one such as 300, is relatively unimportant compared
to a (seemingly small) power (such as 2). In other words, a linear
runtime function T(n) is generally much preferred over a quadratic
one, especially if we envision using larger and larger data sets.
This kind of issue has to do with the ``order of growth'' of
functions, and we will study it soon.
Now back to massive repetitions:
There are at least two standard ways to program massive repetition:
(i) iterations, ie looping, as in For-loops, While-loops, etc
(ii) recursion
Each of these programming methodologies leads to a characteristic
mathematical formulation for runtime. For iterations, the formulation
involves summations; and for recursion it involves recurrence
formulas.
Example 1: Insertion Sort
Given: Array A[1:n] of numbers
1 for i <- 2 to n do
2 newval <- A[i] **store new value for comparisons**
3 j <- i-1
4 while [j > 0] and [newval < A[j]] do
5 A[j+1] <- A[j] **shift A[j] right to make space**
6 j <- j-1
7 A[j+1] <- newval **stick newval into open spot**
Each line L takes a certain amount of time t_L to execute, and each line is
executed a certain number of times e_L that may depend on the data in
array A. So the total runtime for Insertion Sort is
Runtime(InsSrt) = Sum_L=1_to_7 (t_L * e_L)
Let us suppose that the differences in the t_L values is not great,
and that they are all approximately equal to 1 basic time unit. Then
we get
Runtime(InsSrt) = Sum_L=1_to_7 (e_L)
Can we figure out the values of the e_L, ie, how many times each line
is executed?
Line 1 is executed exactly n times, since i is assigned values 2
through n+1, but the last time it is out of bounds. Lines 2, 3, and 7
are each executed exactly n-1 times. Lines 4, 5, and 6 are the tricky
ones, since the number of time they are executed depend on whether the
conditions in line 4 are true, and that depends on the data. That is,
how much ``work'' is done (how many line executions are done) depends
crucially on how the line 4 test comparison [newval < A[j]] comes out,
since that test determines whether yet more comparisons are done with
newval (if the test comes out true) or not.
There are three interesting cases:
1. The ``best'' case, ie when the e_L's are as small as possible.
This happens when the test always comes out false, so line 4 is
executed only once for each newval, and lines 5 and 6 not at
all. But this case is the one in which the data are already sorted
to start with! And in this case, e_4 = 1+1+...+1 = n-1, ie once for each
newval = 2,3,...,n. Thus in the best case, the runtime is
*linear*, proportional to the amount n of data.
2. The ``worst'' case, ie when the e_L's are as large as possible.
This happens when the test always comes out true, so line 4 is
executed i times, and lines 5 and 6 i-1 times each. Note that j is
initially equal to i-1, and j will take on all values from i-1 down to
0 in line 4, if the newval test is always true. This case is the
one where the data start out *reverse* sorted. And in this case,
e_4 = 2+3+...+n = n(n+1)/2 - 1. (Why is this last formula
correct?) In the worst case, then, the runtime is quadratic in the
amount of data.
3. The ``average'' case, ie with the most typical e_L values is now
rather interesting, since we don't yet know whether it will turn
out that, typically, the average runtime is more like the best case
(good), or more like the worst case (not so good). We shall see,
after the following.
We will make a simplifying assumption: that all we need to really
count is the number of times the newval test [newval < A[j]] is
executed. For this, after all -- THE NUMBER OF DATA COMPARISONS -- is
what makes the big difference in overall work in insertion sort, as we
have seen. We now use the expression T(n) to mean the NUMBER OF DATA
COMPARISONS made on n data items (for whatever sorting algorithm we are
talking about -- in this case it is of course insertion sort). But we
will still often sloppily refer to T(n) as the runtime.
Let us now rephrase the above observations, but just for data
comparisons:
The best-case runtime (counting only data comparisons) is now clearly
n-1, since each new card will be compared only to the one before
it. There are n-1 cards that are "new" after the first one, so we get
a sum of 1, n-1 times. This is good, in that it is a linear function,
and so only grows proportionally to the size n of the data set. This
occurs, however, when the data is already sorted, so there is no point
in running a sorting algorithm!
The worst-case runtime occurs when the data is reverse-sorted, so each
new card has to be compared to all the cards before it. That is, the
second card is compared to the first (1 comparison), then the 3rd to
the first two, etc: 1+2+3+...+(n-1) = (n-1)n/2. This is quadratic in
n and thus grows with the square of the amount of data. If we increase
the data by a factor of 100, the runtime increases by a factor of
10,000.
But perhaps this poor quadratic performance is rare (eg only when the
data is reverse sorted); perhaps most of the time the performance is
better than quadratic. For this we need an "average-case" runtime
analysis.
An intuitive but imprecise argument is this: When we are about to
place card i in the correct position among the previous i-1 cards, it
has equal chance of belonging in any of the i possible positions from
1 to i, so on average it ought to go in the middle position i/2
(assuming i is even). This then gives (roughly) (i-1)/2 comparisons,
since we need to compare it to (roughly) half the first i-1 cards.
Summing for all i starting with 2, we get 1/2 + 2/2 + 3/2 + ... + (n-1)/2,
ie, 1/2 [1 + 2 + ... + (n-1)] = (n-1)n/4. If this is more or less
correct reasoning, then the average case runtime is about 1/2 that of
the worst-case, ie still quadratic and thus not very good.
We will do it again more precisely later, using probabilistic machinery.
Example 2: MergeSort
Given: Array A[left,right]
1 if left < right then
2 mid <- (left + right)/2
3 MergeSort(A,left,mid)
4 MergeSort(A,mid+1,right)
5 Merge(A,left,mid,right)
In the above, MergeSort is a recursive procedure, calling itself in
lines 3 and 4, and Merge is a subroutine whose code must be given
elsewhere; what Merge does is to take two already-sorted sublists
A[left,mid] and A{mid+1,right] (that have been returned/sorted by the
two calls in lines 3 and 4) and ``merges'' them into one large sorted list
A[left,right]. We assume for simplicity that n is an exact power of
2, so that mid is a whole number.
There are many ways to design such a Merge, but a particularly easy
one is this: imagine each sorted sublist written verically, with the
large elements at the top. Compare the two tops and pop off the
larger into the rightmost position of an initially empty new list;
repeat this process until only one element is left so no more
comparisons are possible. This takes exactly n-1 comparisons.
Here the best-case, worst-case, and average-case are all the same,
since MergeSort performs exactly the same number of comparisons no
matter what the data items are, ie it depends only on the size n of
the data set. And we get a precise ``recurrence equation'' -- at
least in the case where n is an exact power of 2 -- for the runtime
T(n) of MergeSort:
T(n) = 2T(n/2) + n - 1
This is because T(n/2) is the runtime for the left half-array, and
also for the right half-array! (And n-1 is the runtime for Merge.)
We will need to find a method for solving recurrence formulas. For now
we leave ourselves in a state of suspense: Is MergeSort quadratic
(like Insertion Sort in the average case) or perhaps linear (like
Insertion Sort's best case), or perhaps something different from both
of these?
Some key LESSONS of this material are:
a. iterative code leads to a summation expression for runtime
b. recursive code leads to a recurrence expression for runtime
c. T(n) is a runtime measure based on key lines (such as data
comparisons in our two examples) that control overall
performance.
d. probabilities may be needed to determine average case runtimes
e. order of growth helps us compare runtimes
*** THE FOLLOWING WILL BE PRESENTED *AFTER* WE STUDY PROBABILITIES ***
Here is a more precise treatment of the average case for insertion sort.
Let C be the random variable giving the number of comparisons when
Insertion Sort is run on an input string of n particular items. We
assume that the input string of n items is provided by some "fair"
process (an experiment) that gives equal probability to all n!
permutations of the string. This means that when item i is being
processed to see where it belongs relative to the previous i-1 items,
there is an equal chance that its proper place is in each of the i
possible positions (in position 1, position 2, ... , position i. So
the probability that it belongs in any position j (where 1 <= j <= i)
is 1/i. Note that this is simply another way of saying that the
probability distribution is uniform for positions, for a given card i.
The outcomes (the sample space elements, or elementary events) are the
n! possible repositionings into an ordered list (smallest to largest
-- and as we noted above, we assume this to have a uniform probability
distribution). For each such outcome, C gives the number of
comparisons that were made to get that outcome (by Insertion Sort).
What we want to calculate is the expected value E(C) -- this will be
the average number of comparisons. But it is hard to calculate this
directly. Instead we use a trick.
Note that C = C1 + ... + Cn, where each Ci is the comparison count for
the ith item as it is being processed to see which position j it
belongs in. Recall that a new card (card i) might end up all the way
in front (ie in position 1, and this will take i-1 comparisons), if it
is the smallest value so far, all the way to the back (position i, and
this will take only 1 comparison), or anywhere in between. And E(C) =
E(C1) + ... + E(Cn). So we need to find E(Ci) = Sum xPr(Ci=x) for
each i, and this will be easier to do, that working with E(C) directly.
We know that x can take on any integer value between 1 and i-1, so
E(Ci) = Sum_k=1..(i-1) kPr(Ci=k). The case of k=i-1 is special, since
it corresponds to item j belonging in either position 1 or position 2
(each of these positions requires comparison of card i with all
i-1 previous cards). The other values of k each correspond to a unique
position. So we write
E(Ci) = (i-1) Pr(C1=i-1) + Sum_k=1..(i-2) kPr(Ci=k).
But Pr(Ci=i-1) is just 2/i (1/i for each of the two positions). And
Pr(Ci=k) = 1/i for all the other values of k. We end up then with
E(Ci) = (i-1)2/i + [1/i + 2/i + 3/i + ... + (i-2)/i], which is just
(i-1)2/i + [1 + 2 + 3 + ... + (i-2)]/i = (i-1)/i + [1 + 2 + ... + (i-1)]/i
= (i-1)/i + (1-i)i/(2i) = (i-1)/i + (i-1)/2.
Finally, E(C) = Sum E(Ci) for i=1...n :
E(C) = Sum_i=1..n [(i-1)/i + (i-1)/2] = [Sum_i=1..n (i-1)/i] + (n-1)n/4.
What does this extra [Sum (i-1)/i] contribute, that we did not see in
our initial rough argument above? This can be rewritten as Sum_i=1..n
[1 - 1/n] = n - Sum_i=1..n 1/n; the latter expression is a famous sum,
called the harmonic series: 1/1 + 1/2 + 1/3 + 1/4 + ... + 1/n . It
can be approximated by integrals, and is found to be between lg(n) and
lg(n) + 1. This is a very low-order term compared to n^2. Thus our
final result is that the average-case runtime for Insertion Sort is
indeed quadratic in its highest-order term: (n^2)/4 + 3n/4 - O(lg n)
where the big-O means something that behaves like lg n -- we'll study
this concept shortly.
We have two debts to pay here: we need to prove the above
approximation fact about the harmonic series, and we need to get a
clearer understanding of the "behavior" of functions, eg their
higher-order terms. Both of these will be covered soon.
Don Perlis
2003-09-16