Binary Search (done right)

Binary search is one of the most famous algorithms in computer science. It solves a simple but important task: given a sorted list of values, it finds the position of a target value (or determines that the value is not present). Because the data is sorted, binary search is fast: by repeatedly discarding the half that cannot contain the answer, it halves the remaining search space at each step, requiring only about \(log_2(n)\) comparisons in the worst case.

Although the idea can be explained in a few sentences, correct implementations are surprisingly error-prone. The most common failures are:

Boundary semantics that are not precisely defined (for example, whether the upper index is included or excluded), which leads to off-by-one mistakes.
Updates that do not guarantee progress, causing infinite loops when the search interval stops shrinking.
Midpoint computations that overflow in fixed-width integer arithmetic (for large indices), producing invalid midpoints and out-of-bounds access.

For that reason, a reliable binary search is best presented not as a “pattern to memorize,” but as a small program derived from a clear specification, with explicit invariants and a termination argument.

Problem and specification

Let \(A\) be an array of length \(n\), indexed \(0,1,\dots,n-1\), sorted in nondecreasing order:

\[ A[0] \le A[1] \le \dots \le A[n-1]. \]

Given a search key \(x\), the task is to return an index \(i\) such that \(A[i]=x\), if such an index exists; otherwise report “not found.” Some APIs refine “not found” into an insertion position (the index where \(x\) can be inserted while preserving sortedness). Both behaviors are straightforward once the search boundaries are defined rigorously.

Representing the search range

A binary search maintains a set of indices where \(x\) could still appear. Making this set explicit is the simplest way to avoid ambiguous stop conditions. A particularly clean convention is the half-open interval \([L,R)\):

\(L\) is included, \(R\) is excluded; candidates are indices \(L, L+1, \dots, R-1\).
The interval is empty exactly when \(L=R\).
Bounds naturally satisfy \(0 \le L \le R \le n\).

This convention leads to a simple loop condition and updates that can be proven to shrink the search range on every unsuccessful iteration.

Calculating the midpoint

Each iteration selects a midpoint \(m\) inside the current candidate interval \([L,R)\). Mathematically, the most natural choice is

\[ m := \left\lfloor \frac{L+R}{2} \right\rfloor. \]

In exact integer arithmetic, this choice ensures \(L \le m < R\) whenever \(L < R\), and splits the interval into two parts. In programming languages with fixed-width integers, however, the expression (L + R) may overflow when indices become large. Overflow can produce a negative midpoint or other invalid values, leading to out-of-bounds access or language-specific undefined behavior.

The standard remedy is to compute the same value without forming the potentially overflowing sum:

\[ m := L + \left\lfloor \frac{R-L}{2} \right\rfloor. \]

This is algebraically equivalent over the integers: \[ L + \left\lfloor \frac{R-L}{2} \right\rfloor = \left\lfloor \frac{2L + (R-L)}{2} \right\rfloor = \left\lfloor \frac{L+R}{2} \right\rfloor, \] but it avoids adding two potentially large indices. Since \(0 \le R-L \le n\), the subtraction is safe whenever the indices themselves are valid.

Some languages offer additional midpoint idioms based on binary shifting. For nonnegative integer indices \(L, R\), the expression (L + R) >>> 1 (Java or JavaScript) computes the unsigned average. The operator >>> is a logical right shift: it shifts in zeros at the top and therefore corresponds to unsigned division by 2. Even if L + R overflows the signed int range, Java’s two’s-complement wraparound yields the sum modulo \(2^{32}\), and the logical shift divides that unsigned 32-bit value by 2. For \(0 \le L, R \le 2^{31}-1\), the true sum satisfies \(0 \le L+R \le 2^{32}-2\), so the unsigned computation produces exactly \(\left\lfloor\frac{L+R}{2}\right\rfloor\) as a value in \([0,2^{31}-1]\).

In C and C++, a closely related pattern is to cast to an unsigned type before adding and shifting, for example: ((uint32_t)L + (uint32_t)R) >> 1. This makes the overflow behavior well-defined (modulo \(2^{32}\)) and the right shift an unsigned division by 2. That said, the subtraction-based midpoint L + (R - L)/2 is generally the most portable and the easiest to justify across languages.

Correctness by invariants

The algorithm can be derived directly from statements that remain true throughout the loop. These statements also document what the variables mean, which is essential for avoiding “plausible” but wrong boundary updates.

Loop invariants

Maintain the following invariants throughout the search:

Bounds: \(0 \le L \le R \le n\).
Exclusion to the left: for all \(i\) with \(0 \le i < L\), \(A[i] < x\).
Exclusion to the right: for all \(i\) with \(R \le i < n\), \(A[i] > x\).

Together these invariants state: if \(x\) exists in the array, it must appear at some index \(i\) in \([L,R)\).

Initialization

Set \(L := 0\), \(R := n\). The bounds invariant holds immediately. The exclusion invariants are vacuously true: there are no indices less than \(0\), and no indices greater than or equal to \(n\).

Midpoint placement

While \(L < R\), define \[ m := L + \left\lfloor\frac{R-L}{2}\right\rfloor. \] This guarantees \(L \le m < R\), so \(m\) always lies within the candidate interval.

Update rules

Compare \(A[m]\) with \(x\):

If \(A[m] < x\), then for every \(i \le m\), sortedness implies \(A[i] \le A[m] < x\). No index up to and including \(m\) can hold \(x\), so update \(L := m+1\).
If \(A[m] > x\), then for every \(i \ge m\), sortedness implies \(A[i] \ge A[m] > x\). No index from \(m\) onward can hold \(x\), so update \(R := m\).
If \(A[m] = x\), return \(m\).

Termination and total correctness

Termination (progress argument)

Let \(d := R-L\) be the length of the candidate interval. As long as the loop continues, \(d\) is a positive integer. If the loop does not return, the interval strictly shrinks:

If \(A[m] < x\), the new interval is \([m+1,R)\) and \[ d' = R-(m+1) < R-L = d, \] because \(m \ge L\).
If \(A[m] > x\), the new interval is \([L,m)\) and \[ d' = m-L < R-L = d, \] because \(m < R\).

Since \(d\) is a nonnegative integer that decreases strictly on each unsuccessful iteration, the loop must terminate. This rules out the classic infinite-loop failure mode caused by updates that fail to eliminate the midpoint.

Partial correctness (returned index is correct)

The algorithm returns only when it observes \(A[m]=x\). Therefore any returned index is correct by inspection.

Total correctness (not found is correct)

If the loop terminates without returning, then \(L=R\) and the candidate interval is empty. The exclusion invariants imply:

all indices \(i < L\) satisfy \(A[i] < x\), and
all indices \(i \ge L\) satisfy \(A[i] > x\) (since \(R=L\)).

Every index \(i \in \{0,\dots,n-1\}\) falls into one of these two sets, so \(A[i]\neq x\) for all \(i\). Hence reporting “not found” is correct.

Reference implementations

Find any occurrence

# Precondition: A is sorted in nondecreasing order.
# Returns: an index i with A[i] == x, or -1 if not found.
def binary_search_any(A, x):
  L = 0
  R = length(A)   # candidate interval [L, R)

  while L < R:
      m = L + (R - L) // 2   # integer division with //; overflow-safe midpoint
      if A[m] < x:
          L = m + 1
      else if A[m] > x:
          R = m
      else:
          return m

  return -1

Returning an insertion position when not found

When \(x\) is absent, the final value \(L\) is the unique position where inserting \(x\) preserves the order relative to all elements proven less than \(x\). Many APIs encode this insertion point as a negative return value to distinguish it from valid indices. A common encoding is -(L + 1).

# Returns: index if found; otherwise returns -(insertionPoint + 1).
def binary_search_with_insertion(A, x):
  L = 0
  R = length(A)

  while L < R:
      m = L + (R - L) // 2
      if A[m] < x:
          L = m + 1
      else if A[m] > x:
          R = m
      else:
          return m

  # not found: insertion point is L
  return -(L + 1)

Notes on midpoint variants in practice

Portable choice: m = L + (R - L) / 2.
Java idiom: m = (L + R) >>> 1 (logical shift, unsigned average for nonnegative indices).
C/C++ idiom (when using fixed-width integers): m = ((uint32_t)L + (uint32_t)R) >> 1. This relies on unsigned arithmetic and produces the floor of the unsigned average; the subtraction-based midpoint remains the most generally defensible.

Boundary searches: lower_bound and upper_bound

Arrays may contain duplicates. In that setting, returning “any” matching index is sometimes inadequate; the most robust interface is to compute boundaries. Two standard functions are:

lower_bound(x): the smallest index \(i\) such that \(A[i] \ge x\), or \(n\) if no such index exists.
upper_bound(x): the smallest index \(i\) such that \(A[i] > x\), or \(n\) if no such index exists.

Both can be expressed as binary searches on a monotone predicate over indices. The same half-open interval mechanics apply, and the same midpoint computation should be used.

lower_bound

# Returns the first index i with A[i] >= x; returns n if all A[i] < x.
def lower_bound(A, x):
  L = 0
  R = length(A)    # [L, R)

  while L < R:
      m = L + (R - L) // 2
      if A[m] < x:
          L = m + 1
      else:
          R = m

  return L

upper_bound

# Returns the first index i with A[i] > x; returns n if all A[i] <= x.
def upper_bound(A, x):
  L = 0
  R = length(A)    # [L, R)

  while L < R:
      m = L + (R - L) // 2
      if A[m] <= x:
          L = m + 1
      else:
          R = m

  return L

These boundary functions provide precise building blocks:

Existence: \(x\) exists iff \(i=\text{lower\_bound}(x)\) satisfies \(i < n\) and \(A[i]=x\).
First occurrence: \(i=\text{lower\_bound}(x)\) is the first index with value \(x\) (when \(x\) exists).
Last occurrence: \(j=\text{upper\_bound}(x)-1\) is the last index with value \(x\) (when \(x\) exists).
Count: \(\text{upper\_bound}(x)-\text{lower\_bound}(x)\).

Runtime analysis

Let \(d_k = R_k - L_k\) be the candidate interval length after \(k\) iterations. Initially \(d_0 = n\). When the loop does not return, the midpoint choice implies the new interval length is at most half of the old one:

\[ d_{k+1} \le \left\lfloor \frac{d_k}{2} \right\rfloor. \]

By induction, \[ d_k \le \left\lfloor \frac{n}{2^k} \right\rfloor. \] Termination occurs when \(d_k = 0\), which is guaranteed once \(2^k > n\), i.e. \[ k > \log_2 n. \] Therefore the number of iterations in the worst case is bounded by \(\lceil \log_2 n \rceil + 1\) for \(n \ge 1\) (and \(0\) iterations for \(n=0\)).

Each iteration performs a constant amount of work (one midpoint computation and a constant number of comparisons), so the running time is \(O(\log n)\). The algorithm uses \(O(1)\) additional space.

References

Bentley1986J. Bentley (1986/2000) Programming Pearls. Addison-Wesley.
Bloch2006J. Bloch (2006) “Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken.”
Knuth1998D. E. Knuth (1998) The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley.
Cormen2009T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein (2009) Introduction to Algorithms (3rd ed.). MIT Press.