If you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you’re spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice. – [Donald Knuth](https://en.wikiquote.org/wiki/Donald_Knuth)

## Introduction

This section will contain some very basic computer science background. Since the target audience of this lecture series has little to no training in this area, I will attempt to fill in some gaps (admittedly very poorly).

### Numerical analysis

It should come as no surprise that computers store data as sequences of 1’s and 0’s. For example, non-negative integers are represented by their base-2 expansions (e.g., 7 is represented by 111 and 8 by 1000). Since we can only allocate a finite amount of space for each integer (e.g., 16, 32, or 64 bits), we can only hold a finite range of integers (e.g., $-2^{63}\leq i \leq 2^{63}-1$ for 64-bit signed integers; see two’s complement). This means that one has to avoid applying operations that would take us out of this range. Arithmetic operations that generate values outside of their allowed range are said to *overflow*.
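To see what overflow looks like, here is a small sketch in Python. Python’s own integers are arbitrary-precision and never overflow, so the helper `wrap_int64` (a name made up for this illustration) simulates what 64-bit two’s-complement hardware does by reducing modulo $2^{64}$:

```python
def wrap_int64(x):
    """Reduce an integer into the signed 64-bit range, the way
    two's-complement hardware would on overflow."""
    return (x + 2**63) % 2**64 - 2**63

print(wrap_int64(2**63 - 1))  # largest representable value: 9223372036854775807
print(wrap_int64(2**63))      # one past the top wraps around to -9223372036854775808
```

Note how the result of the overflowing operation is not merely inaccurate but has the wrong sign entirely.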

Problems with the finite representations of integers are dwarfed by the problems arising from the finite representation of real numbers. The (IEEE) standard way to finitely represent a real number is as a *floating point number*. The idea is to approximate a real number $x$ by a non-negative integer $\alpha$, an integer $\beta$, and a sign bit $\epsilon$ so that $x\approx (-1)^\epsilon \cdot \alpha \cdot 2^{\beta}$. Since, using a fixed number of bits, we can only represent finitely many integers, there are only finitely many such floating point numbers.
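We can recover such a decomposition for a Python `float` (an IEEE double, where $\alpha$ has 53 bits) using the standard-library function `math.frexp`; the variable names below mirror the notation above:

```python
import math

x = 6.5
m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
alpha = int(abs(m) * 2**53)   # the integer significand (53 bits for a double)
beta = e - 53
epsilon = 0 if x >= 0 else 1
print(alpha, beta)            # 7318349394477056 -50
assert x == (-1)**epsilon * alpha * 2**beta
```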

The finiteness of this representation leads to some obvious consequences as well as some non-obvious ones.

- Since the exponent $\beta$ is bounded above, there is a bound for the range of real numbers that we can encode. This leads to the same type of overflow problems as for integers.
- Since the exponent $\beta$ is bounded below, there is a smallest positive real number we can encode in this representation. This leads to a new set of problems called *underflow*. For example, $e^{-n}=0$ for $n\gg 0$. As a consequence, $e^{-n}\cdot e^{n}\neq e^0=1$ for large $n$. If we instead expanded this expression from the middle then we would obtain: $$e\cdot (e\cdot (\cdots(e\cdot e^{-1})\cdot e^{-1})\cdots e^{-1})=1.$$ *So multiplication of floating point numbers fails to be associative.*
- Since $\alpha$ is a bounded positive integer, we can only encode a finite amount of precision. This implies that for certain (large) values of $m$ and $n$, $$\sum_{i=1}^{2^n} 2^{-n} + 2^m = 2^m+1,$$ if we apply the operations from left to right, while the sum is $2^m$ if we apply them from right to left. *So addition of floating point numbers fails to be associative.*
- We cannot encode fractions precisely, so $1/n \cdot n\neq 1$ in general. On older calculators one can witness this first hand and see that $1/3\cdot 3$ “$=$” $0.9999999999$.
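These failures are easy to witness directly. A quick check in Python (whose `float` is an IEEE double):

```python
import math

# Addition is not associative: the two groupings round differently.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right)   # 0.6000000000000001 0.6

# Underflow: e**-800 is below the smallest positive double, so it becomes 0.0,
# and no later multiplication can recover the lost information.
print(math.exp(-800))                    # 0.0
print(math.exp(-800) * math.exp(100))    # 0.0, though the true value e**-700 is representable
```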

#### Remark

I sometimes find it helpful to visualize all of the floating point numbers placed on the real line as a sequence of dots. These dots cluster heavily near 0 and gradually spread farther and farther apart as we move away from the origin until we meet the size bounds. Beyond these points there are no longer any dots.
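One can measure the gaps between these dots directly with `math.ulp` (available in Python 3.9+), which returns the distance from a float to the next representable one:

```python
import math

# The spacing between adjacent floats grows with their magnitude.
for x in [1e-300, 1.0, 1e16, 1e300]:
    print(x, math.ulp(x))

# Near 1.0 the dots are about 2.2e-16 apart; near 1e16 they are exactly 2.0 apart,
# so integers of that size are no longer all representable.
```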

Since both the encoding of real numbers as floating point numbers and the floating point operations themselves contribute errors^{1}, care needs to be taken in the design of floating point algorithms to ensure that these errors do not grow out of control; otherwise things may go seriously wrong.

#### Remark

In probability theory and, especially, statistical inference, it is very common to take iterated products of probabilities. Since many of these probabilities are very small, there is a high risk of numerical underflow as we calculate the product.

One way to avoid this underflow is to use the identity $$ \prod_{i=1}^n x_i = \exp\left(\sum_{i=1}^n \log x_i\right). $$ Working with the logarithms avoids the numerical underflow in the product. To ensure that the sum of quantities of varying magnitude is as accurate as possible, we can first sort the values by increasing magnitude and then add them in that order.
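For example, multiplying a few hundred small probabilities underflows to $0.0$, while the sum of their logarithms stays comfortably in range. A sketch in Python (the probability values are made up for illustration):

```python
import math

probs = [1e-5, 2e-7, 3e-4] * 40   # 120 hypothetical event probabilities

direct = 1.0
for p in probs:
    direct *= p                   # the true product is about 1.3e-609: underflow

# Sum the logs from smallest magnitude to largest for best accuracy.
log_prob = sum(sorted(math.log(p) for p in probs))

print(direct)    # 0.0
print(log_prob)  # about -1402, the exact log of the product, easily representable
```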

### Algorithm Analysis

Computer scientists and engineers are very concerned with space and computational efficiency. Making fair comparisons between two implementations of algorithms can be tricky. Implementation details of essentially identical algorithms can result in significant changes in runtime performance and memory costs (e.g., consider a program written in Python 2.6 versus one written in Python 3.0).

An easier comparison can be made by asking how the memory and runtime requirements of an algorithm change as we increase the size of the input. For example, the standard grade school algorithm for multiplying two $n\times n$ matrices requires $n^3$ multiplications and $n^2(n-1)$ addition operations^{2}. What is important to realize about this algorithm is not the exact count of the operations, or how much faster addition is compared to multiplication; it is that if we replace $n$ with $100n$ then we will require roughly $10^6$ times as many operations!

At this scale the addition operations and their relative speed in comparison to multiplication do not matter as much as this observation. The dominant feature is that the time to multiply two $n\times n$ matrices grows cubically in $n$.
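We can check the operation counts above with a direct schoolbook implementation. This is a sketch in Python with explicit counters (the function name is made up for this illustration, and in practice one would use an optimized library instead):

```python
def matmul_counted(A, B):
    """Schoolbook product of two n-by-n matrices, counting arithmetic ops."""
    n = len(A)
    muls = adds = 0
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = A[i][0] * B[0][j]       # first product of the dot product
            muls += 1
            for k in range(1, n):
                total += A[i][k] * B[k][j]  # one mul and one add per remaining term
                muls += 1
                adds += 1
            C[i][j] = total
    return C, muls, adds

n = 10
A = [[float(i + j) for j in range(n)] for i in range(n)]
_, muls, adds = matmul_counted(A, A)
print(muls, adds)  # 1000 900, i.e. n**3 multiplications and n**2*(n-1) additions
```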

One way to encode the essential features of the growth of functions is with Big O-notation.

#### Definition

We say^{3} $f(n)= O(g(n))$ if there exists an $M>0$ such that for all sufficiently large $n$, $$|f(n)|\leq M|g(n)|.$$
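To make the definition concrete: $3n^2+5n = O(n^2)$, with witness $M=4$, since $5n\leq n^2$ once $n\geq 5$. A quick empirical check of this claim in Python:

```python
def f(n):
    return 3 * n**2 + 5 * n

M, n0 = 4, 5  # claim: |f(n)| <= M * n**2 for all n >= n0
assert all(abs(f(n)) <= M * n**2 for n in range(n0, 100_000))

# No such bound of the form M * n exists for any M,
# since f(n) / n = 3*n + 5 grows without bound; so f(n) is not O(n).
```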

For most big data tasks, any algorithm whose runtime is not $O(n \log n)$ requires some attention (especially attention to how one can parallelize the computation across a cluster). More importantly, once the data is too large to fit into a single computer's memory, this entire approach to comparing algorithms breaks down (as do the standard implementations of the algorithms). In this case, one has to think very carefully about how to break the task into separate pieces. This is all complicated by the fact that moving data between computers is much slower than moving data within a single computer's memory.

- One way to reduce the errors in floating point operations is to use more bits for the representations, do the operation, and then truncate. ↩
- It turns out this is not asymptotically optimal. For example see here. ↩
- I would like to, on the record, state my disapproval of the standard usage of the equal-sign in the expression $f(n)=O(g(n))$. I can see no reasonable interpretation of the equality symbol here. ↩