\( \newcommand{\ket}[1]{\vert#1\rangle} \newcommand{\bra}[1]{\langle#1\vert} \newcommand{\bigket}[1]{\bigl\vert#1\bigr\rangle} \newcommand{\bigbra}[1]{\bigl\langle#1\bigr\vert} \newcommand{\class}[1]{\mathrm{#1}} \newcommand{\problem}[1]{\mathrm{#1}} \renewcommand{\natural}{\mathbb{N}} \newcommand{\real}{\mathbb{R}} \newcommand{\complex}{\mathbb{C}} \newcommand{\ip}[2]{\left\langle#1, #2 \right\rangle} \newcommand{\Tr}{\mathop{\mathrm{Tr}}\nolimits} \newcommand{\tr}{\mathop{\mathrm{tr}}\nolimits} \newcommand{\abs}[1]{\left\lvert#1 \right\rvert} \newcommand{\norm}[1]{\left\lVert#1 \right\rVert} \newcommand{\floor}[1]{\left\lfloor#1 \right\rfloor} \newcommand{\X}{\mathcal{X}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\A}{\mathcal{A}} \newcommand{\B}{\mathcal{B}} \newcommand{\E}{\mathop{\mathbb{E}}} \newcommand{\Var}{\mathop{\mathrm{Var}}} \newcommand{\dif}{\mathrm{d}} \newcommand{\eps}{\epsilon} \newcommand{\sign}{\mathrm{sign}} \newcommand{\poly}{\mathrm{poly}} \newcommand{\polylog}{\mathrm{polylog}} \newcommand{\negl}{\mathrm{negl}} \newcommand{\church}[1]{\overline{#1}} \newcommand{\defeq}{\stackrel{\text{def}}{=}} \newcommand{\pair}[2]{\langle#1, #2\rangle} \newcommand{\tuple}[1]{\langle#1\rangle} \newcommand{\red}{\rightarrow} \newcommand{\reds}{\twoheadrightarrow} \newcommand{\betared}{\rightarrow_{\beta}} \newcommand{\betareds}{\twoheadrightarrow_{\beta}} \newcommand{\betaeq}{=_{\beta}} \newcommand{\betaetared}{\rightarrow_{\beta\eta}} \newcommand{\betaetareds}{\twoheadrightarrow_{\beta\eta}} \newcommand{\parared}{\rightarrow_{\|}} \newcommand{\parareds}{\twoheadrightarrow_{\|}} \newcommand{\desc}[1]{\langle#1 \rangle} \newcommand{\adv}{\mathcal{A}} \newcommand{\dis}{\mathcal{D}} \newcommand{\labelsty}[1]{\mathrm{#1}} \newcommand{\Enc}{\labelsty{Enc}} \newcommand{\Dec}{\labelsty{Dec}} \newcommand{\plain}{\labelsty{p}} \newcommand{\ciper}{\labelsty{c}} \newcommand{\PRG}{\labelsty{PRG}} \newcommand{\Commit}{\labelsty{Com}} \newcommand{\oracle}{\mathcal{O}} \newcommand{\Sim}{\labelsty{Sim}} \newcommand{\comp}{\labelsty{c}} \newcommand{\view}{\labelsty{view}} \newcommand\TIME{\class{TIME}} \newcommand\SPACE{\class{SPACE}} \newcommand\SIZE{\class{SIZE}} \newcommand\NTIME{\class{NTIME}} \newcommand\NSPACE{\class{NSPACE}} \newcommand\BQP{\class{BQP}} \newcommand\BPP{\class{BPP}} \newcommand\QMA{\class{QMA}} \renewcommand\P{\class{P}} \newcommand\Ppoly{\class{P{/}poly}} \newcommand\NP{\class{NP}} \newcommand\NPC{\class{NP}\text{-complete}} \newcommand\coNP{\class{coNP}} \newcommand\MA{\class{MA}} \newcommand\AM{\class{AM}} \newcommand\QCMA{\class{QCMA}} \newcommand\EXP{\class{EXP}} \newcommand\NEXP{\class{NEXP}} \newcommand\PP{\class{PP}} \newcommand\GapP{\class{GapP}} \newcommand\IP{\class{IP}} \newcommand\QIP{\class{QIP}} \newcommand\PSPACE{\class{PSPACE}} \newcommand\NPSPACE{\class{NPSPACE}} \newcommand\MIP{\class{MIP}} \newcommand\RE{\class{RE}} \newcommand\QMAM{\class{QMAM}} \newcommand\PH{\class{PH}} \renewcommand\L{\class{L}} \newcommand\NL{\class{NL}} \newcommand\coNL{\class{coNL}} \newcommand\NLC{\class{NL}\text{-complete}} \newcommand\PPAD{\class{PPAD}} \newcommand\SharpP{\#\class{P}} \newcommand\perm{\mathop{\mathrm{perm}}\nolimits} \newcommand\pathfunc{\mathrm{path}} \newcommand\SELF{\problem{SELF}} \newcommand\HALT{\problem{HALT}} \newcommand\MINTM{\problem{MIN}_\text{TM}} \newcommand\ACCTM{\problem{ACC}_\text{TM}} \newcommand\TQBF{\problem{TQBF}} \newcommand\PAL{\problem{PAL}} \newcommand\PATH{\problem{PATH}} \newcommand\LONGPATH{\problem{LONGEST\text{-}PATH}} 
\newcommand\SHORTPATH{\problem{SHORTEST\text{-}PATH}} \newcommand\HAMPATH{\problem{HAM\text{-}PATH}} \newcommand\HAMCYCLE{\problem{HAM\text{-}CYCLE}} \newcommand\INDSET{\problem{IND\text{-}SET}} \newcommand\MINCUT{\problem{MIN\text{-}CUT}} \newcommand\MAXFLOW{\problem{MAX\text{-}FLOW}} \newcommand\MAXCUT{\problem{MAX\text{-}CUT}} \newcommand\MAXCUTAPP{\problem{MAX\text{-}CUT\text{-}APPROX}} \newcommand\SAT{\problem{SAT}} \newcommand\PERM{\problem{PERMANENT}} \newcommand\THREESAT{3\text{-}\problem{SAT}} \newcommand\TWOSAT{2\text{-}\problem{SAT}} \newcommand\TWOUNSAT{2\text{-}\problem{UNSAT}} \newcommand\NAESAT{\problem{NAESAT}} \newcommand\REACHABILITY{\problem{REACHABILITY}} \newcommand\THREECOL{3\text{-}\problem{COLORING}} \newcommand\CLIQUE{\problem{CLIQUE}} \newcommand\COL{\problem{COLORING}} \newcommand\COMPOSITE{\problem{COMPOSITE}} \newcommand\PRIME{\problem{PRIME}} \newcommand\TSP{\problem{TSP}} \newcommand\VC{\problem{VERTEX\text{-}COVER}} \newcommand\FACTOR{\problem{FACTOR}} \newcommand\GI{\problem{GRAPH\text{-}ISO}} \newcommand\SHORTPROOF{\problem{SHORT\text{-}PROOF}} \newcommand\UHALT{\problem{UNARY\text{-}HALT}} \newcommand\GG{\problem{GENERAL\text{-}GEOGRAPHY}} \newcommand\FG{\problem{FORMULA\text{-}GAME}} \)

Introduction to Theoretical Computer Science

Zhengfeng Ji

Lecture 1: Introduction and Lambda Calculus

About This Course

Welcome

This is an incomplete set of notes prepared by Zhengfeng Ji for the course. Please do not distribute.

History

First time ever!

Theoretical Computer Science

  • Theoretical computer science studies computation and, more generally, information processing from a mathematical perspective.

    What is computation?

    How efficiently can we solve a problem, and why are some problems much harder to solve than others?

    The importance of rigorous, logical, and mathematical thinking.

  • Theory A and B: Algorithms and complexity vs. formal methods and logic

    We will cover topics from both sides.

Why Should You Care About TCS?

Galileo's tenet: The book of nature is written in the language of mathematics.

Logic and mathematics are the languages for computer science.

  • Foundation of computer science: the science part of our department

    CS without theory is like engineering without physics.

  • CS, mathematics, and physics

As stated in the ITCS (Introduction to Theoretical Computer Science) book: computation and information play a role in the 21st century similar to the role that energy and matter played in the 20th century.

A concrete example: It from Qubit initiative supported by the Simons Foundation, which aims to understand quantum field theory and gravity from an information theory perspective.

CS concepts such as \(\NP\)-completeness are gifts to mathematics, physics, economics, and other fields.

Computer science primarily deals with finite structures, whereas mathematics is more inclined towards the study of infiniteness. However, computer scientists often explore exponentially large finite scenarios, making the discussion non-trivial.

  • Even if you are not going to choose theory as your future research direction:

    TCS offers interesting perspectives on the nature of computation and the inherent complexity of problems.

    Computation is not solely about writing programs to solve specific problems. Developing the ability to think about computation theoretically can increase your potential to contribute creatively to the field.

    By including TCS in your computer science education, you become more well-rounded and educated in the field.

    Learning about TCS in computer science is comparable to studying classic poems in a Chinese literature class.

    It is crucial to balance between being capable in practical applications and being knowledgeable about the underlying principles and theories.

Lecturer

  • Zhengfeng Ji

  • East Main Building 8-307

  • My research

    Quantum computing, theory of computing.

Q & A

  • Online: WeChat group, learn.tsinghua.edu.cn, email.

  • Office hour: TBA

  • Your TA: Xingjian Li (lxj22@mails.tsinghua.edu.cn)

Assessment

  1. Homework. (20%)

    Emphasize homework: it is meant to reinforce your understanding of the material and prepares you for the exams!

    Rules.

  2. Midterm. (20%)

  3. Final Exam. (60%)

Textbook

  1. Introduction to Theoretical Computer Science (Boaz Barak)

  2. Introduction to the Theory of Computation (Michael Sipser)

Helpful videos online:

  1. Great Ideas in Theoretical Computer Science by Ryan O'Donnell, CMU

  2. Theory of Computation by Michael Sipser, MIT

Lecture notes will be available at http://itcs.finite-dimensinoal.space after each lecture.

Tips for the Students

  • The subject could be challenging for many of you.

    1. It has a different style and a different set of requirements.

    2. It covers a wide range of topics, from NP-completeness reductions to \(\lambda\)-calculus to the analysis of probabilistic algorithms and protocols.

  • Pause and think about the problem or theorem by yourself before you see the proof.

    Barak: Even a five-minute attempt to solve the problem by yourself will significantly improve your understanding of the topic.

  • Get involved and give us feedback!

Topics

  1. Computational Models
    1. Lambda Calculus
    2. Recursion Theorems
    3. Gödel's Incompleteness Theorem
  2. Complexity Theory
    1. Time Complexity
    2. P vs NP
    3. Space Complexity
    4. Hierarchy Theorems
    5. Non-Uniform Computation
  3. Probabilistic Computation
    1. Probability Theory for Computer Science
    2. Randomized Algorithms
    3. The Class BPP
  4. Cryptography
    1. Encryption and Pseudorandomness
    2. Zero-Knowledge
  5. Logic and Computation
    1. Natural Deduction
    2. Propositions as Types

Lambda Calculus

Is there a functional theory where functions and their arguments are at the same level?

And if so, what use is it?

See more history in Dana Scott's talk

Motivation

  • We have studied Turing machines in FLA where we also mentioned there are many equivalent models of computation.

    The \(\lambda\) calculus is another way to define computable functions.

    \(\lambda\) calculus to CS is like algebra to mathematics.

  • While Turing machines are not used for practical computation, the \(\lambda\) calculus has inspired functional programming languages such as LISP, ML and Haskell, and indirectly the development of many other programming languages as well.

  • The \(\lambda\) operator.

    At the core of the \(\lambda\) calculus is a way to define anonymous functions.

    Function \(f(x) = x \times x\) is represented as \(\lambda x. x \times x\).

    For example, in Python we can define the squaring function using

    lambda x: x*x
  • The study of the \(\lambda\)-calculus is mainly about the set of terms and equations between the terms.

    Both concepts are defined inductively.

    Structural induction.

Proof Theory Notation

  • During the course, we define many sets (also relations) using inductive definitions.

    We will usually use the following notation from proof theory:

    \[\begin{prooftree} \AXC{$P_1\in \mathcal{S}$} \AXC{$\cdots$} \AXC{$P_n\in \mathcal{S}$} \LeftLabel{(name)} \RightLabel{ (side condition)} \TIC{$C\in \mathcal{S}$} \end{prooftree} \]
  • It means that if the premises \(P_1, \ldots, P_n \in \mathcal{S}\) hold satisfying the side conditions, then the conclusion is also in \(\mathcal{S}\).

  • When there are no premises, the rule is sometimes called an axiom.

  • The side conditions can be arbitrary statements, whereas the premises only assert that each \(P_i\) is in \(\mathcal{S}\).

  • When we define the set \(\mathcal{S}\) in this way, we mean that \(\mathcal{S}\) is the least set for which all the rules hold.

Terms

  • The objects of study in the \(\lambda\)-calculus are the terms of the language, and ideas of convertibility or equality between pairs of these terms.

  • The terms are remarkably simple yet unexpectedly powerful objects.

  • Assume the existence of a countable set of variables.

    A term of the untyped \(\lambda\)-calculus is a finite string made up of the symbols \(\lambda\), \((\), \()\), \(.\), and variable identifiers.

    The set of terms \(\Lambda\) is defined by the following Backus-Naur Form (BNF) grammar:

    \[\begin{equation*} s ::= x \mid (\lambda x. s) \mid (s s) \end{equation*} \]

    Or we can define \(\Lambda\) inductively as

    \[\begin{align*} \text{(variable)}\; & \frac{}{x\in \Lambda}\; (x \in \mathcal{V})\\ \text{(abstraction)}\; & \frac{s\in \Lambda}{(\lambda x. s) \in \Lambda}\; (x \in \mathcal{V})\\ \text{(application)}\; & \frac{s \in \Lambda\quad t \in \Lambda}{(s t) \in \Lambda} \end{align*} \]

    Variable, abstraction, application.

  • Note: We use the notation \((st)\), rather than \(s(t)\), to denote the application of a function \(s\) to an argument \(t\).

    This may take some time to get familiar with.
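To make the grammar concrete, here is a minimal sketch in Python (the same language as the lambda x: x*x example above) that represents terms as nested tuples and prints them fully parenthesized. The representation and the helper names var, abs_, app, and show are illustrative choices, not part of the calculus.

    # A term is a nested tuple:
    #   ('var', x)        a variable x (a string)
    #   ('abs', x, body)  an abstraction (lambda x. body)
    #   ('app', s, t)     an application (s t)
    def var(x):        return ('var', x)
    def abs_(x, body): return ('abs', x, body)   # 'abs' is a Python built-in, hence abs_
    def app(s, t):     return ('app', s, t)

    def show(term):
        """Render a term with full parentheses, following the BNF grammar."""
        tag = term[0]
        if tag == 'var':
            return term[1]
        if tag == 'abs':
            return '(\\' + term[1] + '. ' + show(term[2]) + ')'
        return '(' + show(term[1]) + ' ' + show(term[2]) + ')'

    # The term (lambda x. (lambda y. (x z))) from the Subterms example below.
    example = abs_('x', abs_('y', app(var('x'), var('z'))))
    print(show(example))   # (\x. (\y. (x z)))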

Subterms

  • Can be defined inductively.

  • \(\lambda x. (\lambda y. (xz))\)

    \((xz)\) is an application subterm.

    \((\lambda y. (xz))\) is an abstraction subterm.

Abbreviations

  • Parentheses and \(.\) are used to disambiguate the structure of expressions.

  • In practice, we minimize the use of parentheses to avoid notational clutter:

    1. Outermost parentheses can be omitted.
    2. A sequence of applications associates to the left. So \(x y z\) means \(((xy)z)\).
    3. The body of a lambda abstraction (the part after the dot) extends as far to the right as possible. So \(\lambda x. st\) means \(\lambda x.(st)\) not \((\lambda x.s) t\).
    4. Nested abstractions associate to the right and can be collected under a single \(\lambda\) sign. So \(\lambda xy.xz\) means \((\lambda x.(\lambda y.(xz)))\).
    5. When an abstraction occurs on the right of an application, the parentheses around it may be omitted. \(z \lambda xy.xz\) means \((z(\lambda x.(\lambda y.(xz))))\).

Free and Bound Variables

  • Free variables

    The set of free variables \(\text{FV}(s)\) is defined inductively:

    \[\begin{align*} \text{FV}(x) &= \{x\}\\ \text{FV}(st) &= \text{FV}(s) \cup \text{FV}(t)\\ \text{FV}(\lambda x.s) &= \text{FV}(s) \setminus \{x\}. \end{align*} \]
  • A variable \(x\) is free in a term \(s\) if \(x \in \text{FV}(s)\).

  • An occurrence of a variable \(x\) is bound if it is inside a subterm \(\lambda x. u\) of \(s\).

  • As \(x\) can occur multiple times, it could be the case that some occurrences of \(x\) are free while other occurrences are bound.

  • A term \(s\) is called closed if it has no free variables, \(\text{FV}(s) = \emptyset\).

    Closed terms are also sometimes called combinators.

    Some standard combinators are named using boldface letters:

    \[\begin{align*} \mathbf{i} & \equiv \lambda x.x\\ \mathbf{k} & \equiv \lambda xy.x\\ \mathbf{s} & \equiv \lambda xyz.xz(yz)\\ \mathbf{t} & \equiv \lambda xy.x\\ \mathbf{f} & \equiv \lambda xy.y\\ \mathbf{\Omega} & \equiv (\lambda x.xx)(\lambda x.xx)\\ \end{align*} \]
  • Since the names of bound variables are not important, we can rename them without changing the meaning of a term. But we should not rename a bound variable to be the same as a free variable!
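The three defining equations for \(\text{FV}\) above translate line by line into a short recursive function over the nested-tuple term representation used in the earlier Python sketch (again, only an illustration of the definition).

    def free_variables(term):
        """FV(x) = {x}; FV(s t) = FV(s) | FV(t); FV(lambda x. s) = FV(s) - {x}."""
        tag = term[0]
        if tag == 'var':                      # ('var', x)
            return {term[1]}
        if tag == 'app':                      # ('app', s, t)
            return free_variables(term[1]) | free_variables(term[2])
        return free_variables(term[2]) - {term[1]}   # ('abs', x, s)

    # (lambda x. (lambda y. (x z))) is not closed: z occurs free.
    term = ('abs', 'x', ('abs', 'y', ('app', ('var', 'x'), ('var', 'z'))))
    print(free_variables(term))   # {'z'}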

Alpha-Conversion

  • \(\lambda\) is a binding operator

    A bound variable is similar to the variable \(x\) in the summation \(\sum_x f(x)\): it is a dummy variable that can be renamed without changing the meaning of the term.

    Two terms \(s\) and \(t\) are said to be \(\alpha\)-convertible, written \(s \equiv_\alpha t\), if one can derive the same term from both purely by renaming bound variables to fresh variables.

    We sometimes omit \(\alpha\) and write \(\equiv\) only.

  • We consider terms that are \(\alpha\)-convertible to each other to be identical at the syntactic level.

Beta-Conversion

  • Application in \(\lambda\) calculus means substitution.

  • We now give another type of conversion to define what \(\lambda\) does with the argument.

  • The term \(\lambda x.s\) behaves like a function as if \(s\) were parameterized by \(x\).

  • \(\beta\)-convertibility is a binary relation, which we write infix as \(\beta\); it is defined by the scheme

    \[\begin{equation*} (\lambda x.s) t\;\beta\;s[t/x] \end{equation*} \]

    for any terms \(s,t\) where \(s[t/x]\) means the result of substituting \(t\) for every free occurrence of \(x\) in \(s\).

  • Note that \(s[t/x]\) is not itself a term; it denotes the term obtained by performing the substitution.

  • Substitution may be defined inductively.

    \[\begin{align*} x[t/x] & \equiv t,\\ y[t/x] & \equiv y \quad (y \not\equiv x),\\ (su)[t/x] & \equiv (s[t/x])(u[t/x]),\\ (\lambda y.s) [t/x] & \equiv \lambda y. (s[t/x]) \text{ assuming } y \not\equiv x \text{ and } y\not\in \text{FV}(t) \end{align*} \]

    Two complications:

    1. Substitution should only apply to free occurrences of \(x\) in \(s\).

    2. Avoid unintended capture of free variables!

      For example, \(s \equiv \lambda x.yx\) and \(t \equiv \lambda z.xz\).

      What is the result of substituting \(t\) for \(y\) in \(s\)?

      \(\lambda x.(\lambda z.xz)x\) is not correct.

      Rename \(x\) to \(x'\) first in \(s\).

      \(\lambda x'.(\lambda z.xz)x'\) is the correct substitution.

  • \((\lambda xy.yz)[yy/z] \equiv \lambda xu.u(yy)\) (note that the parentheses around \(yy\) are needed here).
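Because the two complications above are easy to get wrong, here is a hedged Python sketch of capture-avoiding substitution over the same nested-tuple terms; the fresh helper, which renames by appending primes, is just one illustrative way to pick fresh variables.

    def free_variables(t):
        if t[0] == 'var': return {t[1]}
        if t[0] == 'app': return free_variables(t[1]) | free_variables(t[2])
        return free_variables(t[2]) - {t[1]}              # ('abs', x, body)

    def fresh(name, avoid):
        """Return a variant of name not in avoid, e.g. x, x', x'', ..."""
        while name in avoid:
            name += "'"
        return name

    def substitute(s, t, x):
        """Return s[t/x]: replace every free occurrence of x in s by t, avoiding capture."""
        tag = s[0]
        if tag == 'var':
            return t if s[1] == x else s
        if tag == 'app':
            return ('app', substitute(s[1], t, x), substitute(s[2], t, x))
        y, body = s[1], s[2]                              # ('abs', y, body)
        if y == x:
            return s                                      # x is bound here, so nothing to do
        if y in free_variables(t):
            # Rename the bound variable first to avoid capturing a free variable of t.
            z = fresh(y, free_variables(t) | free_variables(body))
            body, y = substitute(body, ('var', z), y), z
        return ('abs', y, substitute(body, t, x))

    # The example from the notes: substitute t = (lambda z. x z) for y in s = (lambda x. y x).
    s = ('abs', 'x', ('app', ('var', 'y'), ('var', 'x')))
    t = ('abs', 'z', ('app', ('var', 'x'), ('var', 'z')))
    print(substitute(s, t, 'y'))
    # ('abs', "x'", ('app', ('abs', 'z', ('app', ('var', 'x'), ('var', 'z'))), ('var', "x'")))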

Equational Theory

  • In logic, a theory is a set of formulae closed under some notion of provability.

  • In the formal theories of the \(\lambda\)-calculus, the only formulae are equations between terms.

    '\(s = t\)' for terms \(s\) and \(t\).

  • We write \(\mathcal{T} \vdash s = t\) to mean \(s=t\) is a formula in the theory \(\mathcal{T}\).

The Standard Theory in \(\lambda\)-Calculus

  • The standard theory \(\lambda\beta\) of the \(\lambda\)-calculus is defined by the rules:

    \[\begin{align*} \text{(reflexivity)}\quad & \frac{}{s=s} \\ \text{(symmetry)}\quad & \frac{s=t}{t=s}\\ \text{(transitivity)}\quad & \frac{s=t\quad t=u}{s=u}\\ \text{(application)}\quad & \frac{s=s'\quad t=t'}{st = s't'}\\ \text{(abstraction)}\quad & \frac{s=t}{\lambda x.s = \lambda x.t}\\ (\beta) \quad & \frac{\phantom{\lambda}}{(\lambda x.s)t=s[t/x]} \end{align*} \]
  • A formal proof of an equation in \(\lambda\beta\) is a proof tree constructed by the rules.

    For example,

    \[\begin{prooftree} \AXC{} \LeftLabel{($\beta$)} \UIC{$(\lambda xy.x)p = \lambda y.p$} \AXC{} \LeftLabel{(refl)} \UIC{$q=q$} \LeftLabel{(app)} \BIC{$(\lambda xy.x)pq = (\lambda y.p)q$} \AXC{} \LeftLabel{($\beta$)} \UIC{$(\lambda y.p)q = p$} \LeftLabel{(trans)} \BIC{$(\lambda xy.x)pq = p$} \LeftLabel{(sym)} \UIC{$p = (\lambda xy.x)pq$} \end{prooftree} \]

    is the proof that \(\lambda\beta \vdash p = (\lambda xy.x) pq\).

  • The proof tree looks tedious and we may write a linearized proof of the same fact: \(\lambda \beta \vdash (\lambda xy.x) pq = (\lambda y.p)q = p\).

Eta Rule

  • Another important theory is the \(\lambda\beta\eta\) theory with one additional rule

    \[\begin{align*} (\eta) \quad \frac{}{\lambda x.sx = s} \; (x\not\in \text{FV}(s)) \end{align*} \]
  • This rule does not follow from the \(\beta\) rule (hopefully we will see a proof of this), but it is justified if we apply both sides to another term \(t\).

  • The principle of extensionality: if \(M\) and \(M'\) define the same function, then \(M\) and \(M'\) are equal.

  • If we wish to take an extensional viewpoint then the \(\eta\)-convertible terms should be equated.

  • A \(\lambda\)-theory \(\mathcal{T}\) is one that is closed under all the rules of \(\lambda\beta\).

  • Is \(\mathcal{T}\) consistent? (That is, does it avoid equating all terms? Yes for \(\lambda\beta\)!)

A Fixed-Point Theorem

Consider the fixed-points of the functions \(f(x)=x^2\), \(f(x) = x\), and \(f(x) = x+1\).

For terms \(f\) and \(u\), \(u\) is said to be a fixed-point of \(f\) (in \(\lambda\beta\)) if \(\lambda\beta \vdash fu = u\).

(First Recursion Theorem). Let \(f\) be a term. Then there is a term \(u\) which is a fixed-point of \(f\).

In fact, the fixed-point can be computed in \(\lambda\) calculus itself by using fixed-point combinators. A fixed-point combinator is a closed term \(s\) such that for all terms \(f\), \(f(sf) = sf\). In other words, \(sf\) is a fixed-point of \(f\) for all terms \(f\).

The power of (untyped) \(\lambda\)-calculus: A function can even be its own argument!

There are many fixed-point combinators; two important examples are:

  1. Curry's 'paradoxical' Y combinator:

    \[\begin{equation*} \mathbf{y} \equiv \lambda f.(\lambda x. f(xx))(\lambda x. f(xx)). \end{equation*} \]
  2. Turing's fixed-point combinator:

    \[\begin{equation*} \mathbf{\Theta} \equiv (\lambda xy.y(xxy))(\lambda xy.y(xxy)). \end{equation*} \]
  3. Proof that \(\mathbf{y}\) is a fixed-point combinator!

    Easy!

  4. Do you know the company named Y Combinator?

Extra: Deriving the Y Combinator

  • Say \(f\) is a given term and we want to solve \(X = fX\) to get a fixed-point of \(f\).

    This is not a valid definition of \(X\), as \(X\) appears on the right-hand side.

  • We replace the \(X\) on the right with a variable \(x\) and write \(X' x = f x\).

    However, this does not yet help us find the fixed-point.

  • One weird trick is all we need: use-mention trick, mockingbird, or self-reference.

    Consider \(X'' x = f (x x)\); now we can take \(X = X'' X''\) and have \(X = X'' X'' = f(X'' X'') = f X\)!

    So \(X'' = \lambda x. f(x x)\) and the fixed-point is \(X = X'' X'' = (\lambda x. f(xx))(\lambda x. f(xx))\).

  • Finally, we bind \(f\) to get the Y combinator!

  • Another way to derive the Y combinator is to use the diagonal argument.

Lecture 2: Programming in Lambda Calculus

Lambda Calculus II

Beta-Reduction

  • Motivations

  • Substitution implements the computation

    Substitution is asymmetric, and it is natural to consider the reduction part, \((\lambda x.s) t \betared s[t/x]\).

    The computational drive of \(\lambda\)-calculus.

  • For \(\beta\)-conversion, the reduction part is called \(\beta\)-reduction.

    It is also a binary relation but is not an equivalence relation anymore.

Compatible Closure

  • For any binary relation \(R\) on terms, the one-step \(R\)-reduction is the compatible closure of \(R\), an induced relation denoted \(\red_R\) and defined as follows.

    \[\begin{align*} (R)\; & \frac{}{s \red_R t}\quad (\tuple{s,t} \in R)\\ (\text{Reduction on the left})\; & \frac{s \red_R t}{su \red_R tu}\\ (\text{Reduction on the right})\; & \frac{s \red_R t}{us \red_R ut}\\ (\text{Reduction under abstraction})\; & \frac{s \red_R t}{\lambda x.s \red_R \lambda x.t}\\ \end{align*} \]
  • \(t\) in \(s \red_R t\) is called the reduct of \(s\).

  • The relation \(\reds_R\) is the reflexive, transitive closure of \(\red_R\).

  • The relation \(=_R\) is the reflexive, symmetric, transitive closure of \(\red_R\).

    When \(s =_R t\), we say that \(s\) is \(R\)-convertible to \(t\).

  • We are mostly interested in the \(\beta\)-reduction \(\betared\) defined by relation \(\beta\):

    \[\begin{equation*} (\lambda x.s) t\; \beta\; s[t/x]. \end{equation*} \]
  • One step of \(\beta\)-reduction is denoted as \(\betared\), and the reflexive and transitive closure is denoted \(\betareds\).

Examples

  1. Example 1

    \[\begin{equation*} \begin{split} \mathbf{s} \mathbf{k} \mathbf{k} & \equiv (\lambda x y z. (xz)(yz)) \mathbf{k} \mathbf{k}\\ & \betared (\lambda y z.(\mathbf{k} z)(yz))\, \mathbf{k}\\ & \betared \lambda z.(\mathbf{k} z)(\mathbf{k} z)\\ & \equiv \lambda z.((\lambda xy.x)z)(\mathbf{k} z)\\ & \betared \lambda z.(\lambda y.z)(\mathbf{k} z)\\ & \betared \lambda z.z\\ & \equiv \mathbf{i}. \end{split} \end{equation*} \]
  2. Example 2: \(\mathbf{\Omega} \betared \mathbf{\Omega}\)

  3. Note: \(\mathbf{\Theta} f \betareds f(\mathbf{\Theta} f)\) and \(\mathbf{y} f \betaeq f(\mathbf{y} f)\).

Normal Form

A term \(s\) is in normal form if there is no term \(t\) such that \(s \betared t\).

A term \(s\) has a normal form if there is a term \(t\) in normal form with \(s \betareds t\). Equivalently, we say that \(s\) is normalizable.

A term \(s\) is called strongly normalizable if there does not exist an infinite sequence \(t_1, t_2, \ldots\) such that \(s \betared t_1 \betared t_2 \betared \cdots\).

  • Examples

    \(\lambda x. x\) is in normal form.

    \((\lambda x. x) (\lambda x. x)\) has a normal form and is strongly normalizable.

    \(\mathbf{\Omega}\) is not even normalizable as it reduces to itself.

    \(\mathbf{f}\, \mathbf{\Omega}\, \mathbf{i}\) is normalizable but not strongly normalizable.

Reduction Strategy

  • A fundamental theorem about \(\beta\)-reductions is the following theorem.

    (Reduction Theorem). Leftmost reduction always finds the normal form if it exists.

    Leftmost reduction: always contract the redex \((\lambda x.u)v\) that is leftmost among all redexes in \(s\) (measured by its starting position).

  • The proof is complicated.

  • It is related to the call-by-value/call-by-name distinction in programming languages.
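As a rough illustration of leftmost (normal-order) reduction, the sketch below contracts the leftmost redex repeatedly until a normal form is found or a step bound runs out; it repeats compact versions of the substitution helpers from the earlier sketch so that it runs on its own. The step bound is only a safeguard, since a normal form need not exist.

    # One-step leftmost (normal-order) beta-reduction on nested-tuple terms.
    def fv(t):
        if t[0] == 'var': return {t[1]}
        if t[0] == 'app': return fv(t[1]) | fv(t[2])
        return fv(t[2]) - {t[1]}

    def subst(s, t, x):                      # capture-avoiding s[t/x], as before
        if s[0] == 'var': return t if s[1] == x else s
        if s[0] == 'app': return ('app', subst(s[1], t, x), subst(s[2], t, x))
        y, body = s[1], s[2]
        if y == x: return s
        if y in fv(t):
            z = y
            while z in fv(t) | fv(body): z += "'"
            body, y = subst(body, ('var', z), y), z
        return ('abs', y, subst(body, t, x))

    def step(t):
        """Contract the leftmost redex of t, or return None if t is in normal form."""
        if t[0] == 'app':
            if t[1][0] == 'abs':                       # the whole term is the leftmost redex
                return subst(t[1][2], t[2], t[1][1])
            left = step(t[1])
            if left is not None: return ('app', left, t[2])
            right = step(t[2])
            return None if right is None else ('app', t[1], right)
        if t[0] == 'abs':
            body = step(t[2])
            return None if body is None else ('abs', t[1], body)
        return None                                    # a variable is in normal form

    def normalize(t, limit=1000):
        """Leftmost reduction finds the normal form whenever one exists."""
        for _ in range(limit):
            nxt = step(t)
            if nxt is None:
                return t
            t = nxt
        raise RuntimeError('no normal form reached within the step limit')

    # f Omega i from the Normal Form examples: leftmost reduction reaches i,
    # even though reducing inside Omega forever would never terminate.
    w = ('abs', 'x', ('app', ('var', 'x'), ('var', 'x')))
    omega = ('app', w, w)
    f = ('abs', 'x', ('abs', 'y', ('var', 'y')))
    i = ('abs', 'x', ('var', 'x'))
    print(normalize(('app', ('app', f, omega), i)) == i)   # True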

Church-Rosser Property

Would different reduction strategies lead to different normal forms?

  • Let \(\red\) be a binary relation and \(\reds\) its reflexive transitive closure.

    1. The relation \(\red\) satisfies the diamond property if \(s \red u\) and \(s \red v\) implies \(u \red t\) and \(v \red t\) for some \(t\).

    2. Relation \(\red\) is Church-Rosser if \(\reds\) satisfies the diamond property.

    3. The relation \(\red\) has the unique normal form property if \(s \reds t_1\) and \(s \reds t_2\) for normal forms \(t_1\) and \(t_2\) implies that \(t_1 \equiv t_2\).

Simple Observations

If \(\red\) is Church-Rosser, then it has the unique normal form property.

Easy.

If \(\red\) satisfies the diamond property, then \(\reds\) satisfies the diamond property.

Proof by picture.

Suppose \(\red\) is a relation with the Church-Rosser property and \(=\) is its reflexive, symmetric, transitive closure. Then \(s = t\) if and only if there is some \(u\) such that \(s \reds u\) and \(t \reds u\).

Proof by picture.

What does \(s = t\) mean in the picture?

  • A corollary is that if \(\red\) has Church-Rosser property and \(s\) and \(t\) are normal forms for \(\red\) satisfying \(s = t\), then \(s \equiv t\).

  • \(\eta\) rule does not follow from \(\beta\)

    Terms \(x\) and \(\lambda y.xy\) are not \(\beta\)-convertible as they are normal forms that are not \(\alpha\)-equivalent.

Church-Rosser for \(\beta\)-Reduction

  • A fundamental result of \(\lambda\)-calculus is that \(\beta\)-reduction has the Church-Rosser property.

    (Church and Rosser, 1936). \(\beta\)-reduction has the Church-Rosser property.

  • Note: \(\betared\) does not have the diamond property.

    You may duplicate a term in one step and then need two steps to cancel it.

    \[\begin{align*} s & \equiv (\lambda x.xx)((\lambda x.y) z),\\ u & \equiv ((\lambda x.y) z)((\lambda x.y) z),\\ v & \equiv (\lambda x.xx) y. \end{align*} \]

Proof Strategy

  • How do we prove Church-Rosser (CR)?!

    One possible idea is to prove something between diamond and CR in Figure (b).

    Unfortunately, this does not work as (b) does not imply (c) in general!

    Infinite brick tiling picture

  • The proof we show here is given by Martin-Löf and Tait.

    We define a parallel version of \(\beta\)-reduction.

    The deduction rules:

    The parallel \(\beta\)-reduction \(\parared\) is defined by the rules

    \[\begin{align*} \text{(refl)}\; & \frac{}{s \parared s}\\ \text{(abs)}\; & \frac{s \parared s'}{\lambda x. s \parared \lambda x. s'}\\ \text{(\(\|\)-app)}\; & \frac{s \parared s'\quad t \parared t'}{ st \parared s't'}\\ \text{(\(\|\)-$\beta$)}\; & \frac{s \parared s'\quad t \parared t'}{ (\lambda x. s)t \parared s'[t'/x]} \end{align*} \]

Proof

We prove two results about \(\parared\):

  1. It has the diamond property!
  2. \(\betareds\) is the transitive closure of \(\parared\).

Part 2 is easy to show and follows from the observation that \(\betared \;\subseteq\; \parared \;\subseteq\; \betareds\).

For Part 1, we want to prove that if \(s \parared u\) and \(s \parared v\), then there is \(t\) such that \(u \parared t\) and \(v \parared t\).

We prove the claim using an induction on \(s\) and a case analysis on the rules showing \(s \parared u\) and \(s \parared v\).

  1. Case 1. \(s \parared u\) is by (refl) and \(s \parared v\) is arbitrary. In this case, \(s \equiv u\) and we can take \(t \equiv v\). We can similarly handle the case when \(s \parared v\) uses (refl).

  2. Case 2. \(s \parared u\) is by (\(\|\)-app) and \(s \parared v\) is also by (\(\|\)-app). So \(s \equiv s_1s_2\), \(u \equiv u_1u_2\) and \(v \equiv v_1v_2\) with \(s_i \parared u_i\) and \(s_i \parared v_i\) for \(i=1, 2\). By the induction hypothesis, we have \(t_1, t_2\) such that \(u_i \parared t_i\) and \(v_i \parared t_i\) for \(i=1, 2\). We can take \(t \equiv t_1t_2\), and the claim follows by rule (\(\|\)-app).

  3. Case 3. \(s \parared u\) is by (\(\|\)-app) and \(s \parared v\) is by (\(\|\)-\(\beta\)). In this case, \(s\) can be reduced using both the parallel application and parallel beta rule. This means that it has form \(s \equiv (\lambda x.s_1)s_2\). Similarly \(u = (\lambda x.u_1)u_2\) (this is not obvious), \(v = v_1[v_2/x]\) and \(s_i \parared u_i\), \(s_i \parared v_i\) for \(i=1,2\). Apply the induction hypothesis, and we have \(t_1, t_2\) such that \(u_i \parared t_i\) and \(v_i \parared t_i\) for \(i=1,2\). We take \(t = t_1[t_2/x]\). \(u \parared t\) by the (\(\|\)-\(\beta\)) rule. We claim that \(v \equiv v_1[v_2/x] \parared t\), which we prove in the Substitution Lemma below.

    The case where the first reduction uses the parallel beta rule and the second uses the parallel application is similar.

  4. Case 4. Both use (\(\|\)-\(\beta\)). Easy.

  5. Case 5. If \(s \parared u\) is by (abs), then \(s \parared v\) is by (abs) too. Easy.

(Substitution Lemma). If \(s \parared s'\) and \(t \parared t'\), then \(s[t/x] \parared s'[t'/x]\).

Omit the proof of the lemma in class. The proof is again an induction on the structure of \(s\).

Lambda Calculus as a Programming Language

Lambda calculus is the foundation for many functional programming languages.

Lambda calculus can encode both data and programs that operate on the data without adding any additional syntax or axioms.

Program vs data and von Neumann architecture.

Booleans

  • We use the following combinators to encode true and false,

    \[\begin{align*} \mathbf{t} &= \lambda xy.x\\ \mathbf{f} &= \lambda xy.y, \end{align*} \]

    instead of introducing new keywords as in other languages.

    Define \(\mathbf{and} = \lambda ab.ab\mathbf{f}\).

  • It is easy to verify

    \[\begin{align*} \mathbf{and}\, \mathbf{t} \mathbf{t} & \betareds \mathbf{t},\\ \mathbf{and}\, \mathbf{t} \mathbf{f} & \betareds \mathbf{f},\\ \mathbf{and}\, \mathbf{f} \mathbf{t} & \betareds \mathbf{f},\\ \mathbf{and}\, \mathbf{f} \mathbf{f} & \betareds \mathbf{f}. \end{align*} \]
  • Note that \(\mathbf{t}\) and \(\mathbf{f}\) are normal forms, so we can say that a term such as \(\mathbf{and}\, \mathbf{t} \mathbf{t}\) evaluates to \(\mathbf{t}\).

  • We say that \(\mathbf{and}\) represents the boolean function and. It is understood that this encoding is with respect to the particular encoding of true and false.

  • There are other ways to encode and other than \(\lambda ab.ab\mathbf{f}\).

    Another possibility is \(\lambda ab.bab\).

  • We can define \(\mathbf{ite} = \lambda x.x\). Show that

    \[\begin{align*} \mathbf{ite}\, \mathbf{t} s t & \betareds s,\\ \mathbf{ite}\, \mathbf{f} s t & \betareds t. \end{align*} \]

    Note that you can always omit \(\mathbf{ite}\).
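Since Python has first-class anonymous functions, the boolean encodings transcribe directly into curried lambdas. The names true_, false_, and_, ite, and as_bool below are illustrative Python identifiers only; as_bool decodes a Church boolean by applying it to the Python values True and False.

    # Church booleans, transcribed from the definitions above (curried, one argument at a time).
    true_  = lambda x: lambda y: x              # t   = \x y. x
    false_ = lambda x: lambda y: y              # f   = \x y. y
    and_   = lambda a: lambda b: a(b)(false_)   # and = \a b. a b f
    ite    = lambda x: x                        # ite = \x. x

    as_bool = lambda b: b(True)(False)          # decode a Church boolean

    print(as_bool(and_(true_)(true_)))          # True
    print(as_bool(and_(true_)(false_)))         # False
    print(ite(true_)('then')('else'))           # then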

Natural Numbers

  • If \(f\) and \(x\) are terms, and \(n\ge 0\) is a natural number, we write \(f^n x\) for the term \(f(f(\cdots(f x)\cdots))\) where \(f\) occurs \(n\) times.

  • For each natural number \(n\), define a term \(\church{n}\) as

    \[\begin{equation*} \church{n} \equiv \lambda fx.f^n x. \end{equation*} \]

    It is the \(n\)-th Church numeral encoding natural number \(n\) in lambda calculus.

    The first few Church numerals are

    \[\begin{align*} \church{0} &\equiv \lambda fx.x,\\ \church{1} &\equiv \lambda fx.fx,\\ \church{2} &\equiv \lambda fx.f(fx),\\ \church{3} &\equiv \lambda fx.f(f(fx)),\\ & \cdots \end{align*} \]

    This particular way of encoding the natural numbers is due to Alonzo Church.

    This may be more natural if you recall the Peano axioms for natural numbers.

    1. Zero is a natural number.
    2. Every natural number has a successor in the natural numbers.
    3. Zero is not the successor of any natural number.
    4. If the successor of two natural numbers is the same, then the two original numbers are the same.
    5. If a set contains zero and contains the successor of every number in it, then the set contains all the natural numbers.

    \(f\) and \(x\) play the roles of the successor function and zero, respectively.

  • Note that \(\church{0}\) is the same as \(\mathbf{f}\), which is a neat coincidence.

  • We should know beforehand whether to interpret the result as a boolean value or a numeral in untyped lambda calculus.

Successor, Addition, and Multiplication

  • The successor function can be defined as \(\mathbf{succ} = \lambda n f x.f(n f x)\).

    To see this, we check that

    \[\begin{align*} \mathbf{succ}\, \church{n} & \equiv (\lambda n f x.f(n f x)) \church{n}\\ & \betared \lambda f x.f(\church{n} f x)\\ & \betareds \lambda f x.f(f^n x)\\ & \equiv \church{n+1}. \end{align*} \]

    Thus, we have proved that the term \(\mathbf{succ}\) represents the successor function when applied to a numeral.

  • We can also define addition and multiplication as

    \[\begin{align*} \mathbf{add} &\equiv \lambda n m f x . n f(m f x),\\ \mathbf{mult} &\equiv \lambda n m f. n (m f). \end{align*} \]
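As a quick sanity check, the numerals and these arithmetic terms can likewise be transcribed into curried Python lambdas; church and to_int are illustrative helpers that build a numeral and decode it by applying it to the ordinary successor.

    # Church numerals and arithmetic, transcribed from the definitions above.
    church = lambda n: lambda f: lambda x: x if n == 0 else f(church(n - 1)(f)(x))

    succ = lambda n: lambda f: lambda x: f(n(f)(x))               # \n f x. f (n f x)
    add  = lambda n: lambda m: lambda f: lambda x: n(f)(m(f)(x))  # \n m f x. n f (m f x)
    mult = lambda n: lambda m: lambda f: n(m(f))                  # \n m f. n (m f)

    to_int = lambda numeral: numeral(lambda k: k + 1)(0)          # decode by counting

    print(to_int(succ(church(3))))              # 4
    print(to_int(add(church(2))(church(5))))    # 7
    print(to_int(mult(church(3))(church(4))))   # 12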

Currying

  • In the definition of \(\mathbf{mult}\), we used the idea of Currying.

  • For traditional functions of two or more arguments, say \(f(x, y)\), we must have all the arguments ready to evaluate \(f(x, y)\).

  • In \(\lambda\)-calculus, the expression \(\lambda x y. \mathbf{add}\, x\, y\) can take a single argument \(\church{a}\) and \(\beta\)-reduces to

    \[\begin{equation*} \lambda y. \mathbf{add}\, \church{a}\, y, \end{equation*} \]

    which is a function.

  • Similarly, for the Church numeral \(\church{m} = \lambda fx.f^m x\), \(\church{m} f\) reduces to the term \(\lambda x.f^m x\) representing the function taking \(x\) to \(f^m x\).

  • This method is known as Currying.

  • Uncurrying is the dual process: it turns a function \(f\) that returns another function \(g\) into a single function that takes the arguments of both \(f\) and \(g\) at once.

General Functions

  • More generally, suppose \(f: \natural^k \to \natural\) is a \(k\)-ary function on the natural numbers. We say that a lambda term \(s\) represents \(f\) if for all \(n_1, n_2, \ldots, n_k \in \natural\),

    \[\begin{equation*} s\, \church{n_1} \cdots \church{n_k} \betareds \church{f(n_1, \ldots, n_k)}. \end{equation*} \]
  • This definition easily generalizes to boolean functions or functions of other data types.

Examples

  1. Prove that \(\mathbf{iszero} = \lambda n x y. n (\lambda z.y) x\) represents the boolean function that evaluates to true if and only if the input is \(0\).

    \[\begin{align*} \mathbf{iszero} \, \church{0} & \equiv (\lambda n x y. n (\lambda z.y) x) (\lambda f x.x)\\ & \betared \lambda x y.(\lambda f x.x)(\lambda z.y) x\\ & \betareds \lambda x y.x\\ & \equiv \mathbf{t} \end{align*} \]

    For \(n \ge 1\), we have

    \[\begin{align*} \mathbf{iszero} \, \church{n} & \equiv (\lambda n x y. n (\lambda z.y) x) \church{n}\\ & \betared \lambda x y.\church{n}(\lambda z.y) x\\ & \betareds \lambda x y.y\\ & \equiv \mathbf{f} \end{align*} \]
  • You will work out the \(\exp\) function and the \(\mathrm{pred}\) function in your homework:

    \[\begin{equation*} \mathrm{pred}(n) = \begin{cases} 0 & \text{ if } n = 0,\\ n-1 & \text{ otherwise.} \end{cases} \end{equation*} \]
  • In fact, Church believed for a while that it was impossible until his student Kleene found a solution. (In fact, Kleene said he found the solution while having his wisdom teeth pulled, so his trick for defining the predecessor function is sometimes referred to as the wisdom teeth trick.)

    Important for showing equivalence between recursive functions and \(\lambda\)-calculus.
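The iszero computation above can also be replayed with curried Python lambdas; the identifiers below are only illustrative, and as_bool decodes the resulting Church boolean.

    # The iszero term from the example above.
    church = lambda n: lambda f: lambda x: x if n == 0 else f(church(n - 1)(f)(x))

    iszero = lambda n: lambda x: lambda y: n(lambda z: y)(x)   # \n x y. n (\z. y) x

    as_bool = lambda b: b(True)(False)
    print(as_bool(iszero(church(0))))   # True
    print(as_bool(iszero(church(3))))   # False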

Fix-Point Combinators in Use

  • How do we define the factorial function?

  • Using recursion, we can define

    \[\begin{align*} \mathbf{fact}\, n & := \mathbf{ite}\, (\mathbf{iszero}\, n)\, \church{1}\,\\ &\qquad (\mathbf{mult}\, n\, (\mathbf{fact}\, (\mathbf{pred}\, n))). \end{align*} \]
  • However, \(\lambda\)-calculus does not allow us to write recursive function definitions directly.

  • So we rewrite the equation as

    \[\begin{align*} \mathbf{fact} & := \lambda n. \mathbf{ite}\, (\mathbf{iszero}\, n)\, \church{1}\,\\ &\qquad (\mathbf{mult}\, n\, (\mathbf{fact}\, (\mathbf{pred}\, n))),\\ \mathbf{fact} & := (\lambda f. \lambda n. \mathbf{ite}\, (\mathbf{iszero}\, n)\, \church{1}\,\\ &\qquad (\mathbf{mult}\, n\, (f\, (\mathbf{pred}\, n))))\, \mathbf{fact}.\\ \end{align*} \]

    Choose \(\mathbf{fact}\) to be the fixed-point of the term \(\mathbf{fac} \equiv \lambda f. \lambda n. \mathbf{ite}\, (\mathbf{iszero}\, n)\, \church{1}\, (\mathbf{mult}\, n\, (f\, (\mathbf{pred}\, n)))\).

    That is, we can solve this fixed-point equation (up to \(\beta\)-equivalence) by letting \(\mathbf{fact} \equiv \mathbf{y}\, \mathbf{fac}\) where \(\mathbf{y}\) is the Y combinator!

  • Example: \(\mathbf{fact}\, \church{2}\)

    \[\begin{equation*} \begin{split} \mathbf{fact}\, \church{2} & \equiv \mathbf{y}\,\mathbf{fac}\, \church{2}\\ & = \mathbf{fac} (\mathbf{y}\, \mathbf{fac})\, \church{2}\\ & \betareds \mathbf{ite}\, (\mathbf{iszero}\, \church{2})\, \church{1}\, (\mathbf{mult}\, \church{2}\, (\mathbf{y}\, \mathbf{fac}\, (\mathbf{pred}\, \church{2})))\\ & \betareds \mathbf{mult}\, \church{2}\, (\mathbf{y}\, \mathbf{fac}\, \church{1})\\ & \equiv \mathbf{mult}\, \church{2}\, (\mathbf{fact}\, \church{1}). \end{split} \end{equation*} \]

    Using the two main theorems about \(\lambda\)-calculus, we know that \(\mathbf{fact}\, \church{2}\) evaluates to the correct answer in the normal form.
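The same construction can be tried in Python, with one caveat worth hedging: Python evaluates arguments eagerly (call by value), so Curry's y combinator would loop forever; the eta-expanded variant below, often called the Z combinator, delays the self-application and behaves like y under strict evaluation. For readability the sketch uses ordinary integers and if/else in place of Church numerals and ite.

    # Z combinator: a fixed-point combinator that works under call-by-value.
    z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

    # fac, the term whose fixed point is the factorial function.
    fac = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

    fact = z(fac)        # fact = z fac, a fixed point of fac
    print(fact(2))       # 2
    print(fact(5))       # 120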

Lecture 3: Church-Turing Thesis

Data Structures in \(\lambda\)-Calculus

Data Structures: Pair and Tuples

  • If \(s,t\) are \(\lambda\) terms, we define \(\pair{s}{t}\) to be the term \(\lambda x.x s t\).

    Alternatively, we can write \(\mathbf{pair} \equiv \lambda ab.\lambda x. ((x a) b)\), which encloses the pair of \(a\) and \(b\) in function \(x\).

    You can think of \(\pair{s}{t}\) as a syntactic sugar for \(\mathbf{pair}\, st\) or \(\lambda x. x s t\).

  • Why is it a pair?

  • Define terms \(\pi_1 \equiv \mathbf{fst} \equiv \lambda p. p\, \mathbf{t}\) and \(\pi_2 \equiv \mathbf{snd} \equiv \lambda p. p\, \mathbf{f}\).

    \[\begin{align*} \mathbf{fst}\, \pair{s}{t} & \equiv (\lambda p. p\, \mathbf{t}) \lambda x. x s t\\ & \betared (\lambda x. x s t)\, \mathbf{t}\\ & \betared \mathbf{t}\, s t\\ & \betareds s. \end{align*} \]
  • Similarly \(\mathbf{snd}\, \pair{s}{t} \betareds t\).

  • For terms \(s_1, s_2, \ldots, s_k\), we can define the tuple \(\langle s_1, s_2, \ldots, s_k \rangle\) of them as the term \(\lambda x.x s_1 \cdots s_k\). The \(i\)-th projection \(\pi_i^k \equiv \lambda p. p(\lambda x_1 \cdots x_k . x_i)\).
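In curried Python lambdas the pair encoding looks as follows; pair, fst, and snd are illustrative names, and the projections work exactly by the reduction shown above.

    # Pairs as functions: <a, b> = \x. x a b, with projections via the Church booleans.
    true_  = lambda x: lambda y: x
    false_ = lambda x: lambda y: y

    pair = lambda a: lambda b: lambda x: x(a)(b)   # pair = \a b x. x a b
    fst  = lambda p: p(true_)                      # fst  = \p. p t
    snd  = lambda p: p(false_)                     # snd  = \p. p f

    p = pair('left')('right')
    print(fst(p))   # left
    print(snd(p))   # right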

Data Structures: Lists

  • We write \(\mathbf{nil}\) to denote the empty list.

    We write \(h :: t\) for the list whose first element is \(h\) and \(t\) is the tail (the remaining list).

    For example, \(1 :: (2 :: (3 :: \mathbf{nil}))\) is a list of three elements.

  • Assuming the convention that \(::\) associates to the right, we can omit the parenthesis.

  • Corresponding terms

    \[\begin{align*} \mathbf{nil} & \equiv \lambda x. \mathbf{t}\\ h :: t & \equiv \mathbf{cons}\, h\, t \text{ where } \mathbf{cons} \equiv \mathbf{pair}. \end{align*} \]

    We consider \(1 :: 2 :: 3 :: \mathbf{nil}\) as the syntactic sugar for \(\mathbf{cons}\, 1\, (\mathbf{cons}\, 2\, (\mathbf{cons}\, 3\, \mathbf{nil}))\).

    Note: The definition is not unique. For example, you may also take \(\mathbf{nil} \equiv \mathbf{f}\), but the form of \(\mathbf{empty}\) would be different.

  • Head, tail and empty

    \[\begin{align*} \mathbf{head} & \equiv \mathbf{fst},\\ \mathbf{tail} & \equiv \mathbf{snd},\\ \mathbf{empty} & \equiv \lambda l. l\, \lambda h t. \mathbf{f}. \end{align*} \]

    You may check that \(\mathbf{empty} \, \mathbf{nil} \betareds \mathbf{t}\) and \(\mathbf{empty} \, (h :: t) \betareds \mathbf{f}\).

List Operations

  • \(\mathbf{map}, \mathbf{reduce}, \mathbf{filter}\).

    Given a list \(l = x_0 :: x_1 :: \cdots :: x_{n-1} :: \mathbf{nil}\) and a function \(f\), \(\mathbf{map}\, l\, f\) applies \(f\) to all entries in the list \(l\) to obtain a new list \(l' = f(x_0) :: f(x_1) :: \cdots :: f(x_{n-1}) :: \mathbf{nil}\).

  • For a Boolean function \(f\), \(\mathbf{filter}\, l\, f\) returns the list \(l'\) containing only elements \(x_i\) such that \(f x_i \betareds \mathbf{t}\).

  • The term \(\mathbf{reduce}\) takes a list \(l\), an operation \(f\) taking two arguments, and a lambda term \(z\) which we think of as the neutral element of the operation \(f\).

    Draw pictures

  • We can build \(\mathbf{map}\) and \(\mathbf{filter}\) from \(\mathbf{reduce}\) and build \(\mathbf{reduce}\) using fixed-point combinators.

    \(\mathbf{map} \equiv \lambda lf . \mathbf{reduce}\, l\, (\lambda ab . \mathbf{cons}\, (f\,a)\, b)\, \mathbf{nil}\).

  • These terms motivated Google's MapReduce framework for processing big data sets.

  • Define \(\mathbf{reduce}\) recursively.

    First, we want \(\mathbf{reduce}\, \mathbf{nil}\, f\, z = z\). Then, we want to set \(\mathbf{reduce}\, (h :: t)\, f\, z\) to \(f\, h\, (\mathbf{reduce}\, t\, f\, z)\). So, we may define \(\mathbf{reduce}\) as the fixed-point of

    \[\begin{equation*} \mathbf{red} \equiv \lambda s l f z. \mathbf{ite}\, (\mathbf{empty}\, l)\, z\, (f\, (\mathbf{head}\, l)\, (s\, (\mathbf{tail}\, l)\, f\, z)). \end{equation*} \]

    Finally, \(\mathbf{reduce} \equiv \mathbf{y}\, \mathbf{red}\) completes the construction.
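Below is a hedged Python sketch of the whole construction: Church-encoded lists, reduce obtained as a fixed point (using the Z combinator, the call-by-value variant of y), and map built from reduce. Because Python is strict, the two branches of the conditional in red must be wrapped in thunks (lambda: ...) and forced after selection; in the \(\lambda\)-calculus itself this is exactly what leftmost reduction takes care of. The helper to_list is only for decoding the result.

    # Church-encoded lists with reduce and map, following the definitions above.
    true_  = lambda x: lambda y: x
    false_ = lambda x: lambda y: y

    pair = lambda a: lambda b: lambda x: x(a)(b)
    fst, snd = (lambda p: p(true_)), (lambda p: p(false_))

    nil   = lambda x: true_                            # nil   = \x. t
    cons  = pair                                       # cons  = pair
    head, tail = fst, snd
    empty = lambda l: l(lambda h: lambda t: false_)    # empty = \l. l (\h t. f)

    # Z combinator (call-by-value fixed-point combinator).
    z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

    # red, with the two branches delayed as thunks because Python is strict.
    red = lambda s: lambda l: lambda f: lambda zero: (
        empty(l)(lambda: zero)(lambda: f(head(l))(s(tail(l))(f)(zero)))())
    reduce_ = z(red)                                   # reduce = y red

    map_ = lambda l: lambda f: reduce_(l)(lambda a: lambda b: cons(f(a))(b))(nil)

    def to_list(l):
        """Decode a Church list into a Python list, for inspection only."""
        out = []
        while not empty(l)(True)(False):
            out.append(head(l))
            l = tail(l)
        return out

    xs = cons(1)(cons(2)(cons(3)(nil)))                # the list 1 :: 2 :: 3 :: nil
    print(reduce_(xs)(lambda a: lambda b: a + b)(0))   # 6
    print(to_list(map_(xs)(lambda a: a * a)))          # [1, 4, 9]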

Functional Programming

  • Advantages:

    1. Immutability helps eliminate side effects
    2. First-class and higher-order functions
    3. Declarative style
    4. Concurrency made easier
  • It's possible to realize all the \(\lambda\) terms mentioned above in Python.

  • Yet, it is recommended that you learn a fully functional programming language, such as Scheme, Haskell, or Scala.

Church-Turing Thesis

What is Computation

Hilbert, Gödel, Church, Turing

  • Hilbert's 10th problem (1900)

    Devise a mechanical process determining in a finite number of operations whether a multivariate polynomial with integer coefficients has an integer solution.

    It was answered in the negative by Matiyasevich, Robinson, Davis, and Putnam in 1970.

    The analogous question, whether the polynomial has a real root, is decidable by a result of Tarski proved in 1951.

  • Hilbert's Entscheidungsproblem (Decision Problem, 1928)

    Given a sentence in first-order logic, give an effectively calculable procedure for determining if it's provable.

    “We must know, we will know.”

  • We have defined functions to be computable by

    1. Recursive functions (Kurt Gödel)
    2. \(\lambda\)-calculus (Alonzo Church)
    3. Turing machines (Alan Turing)
    4. Cellular automaton, Conway's game of life (Simple rules in a distributed system can create complex behavior)
    5. RAM
    6. Python, C

    The Veritasium video Math's Fundamental Flaw is recommended. It talks about cellular automata, Turing completeness, and Gödel's incompleteness theorem.

  • The Turing machine was quickly accepted by many pioneers as the definition of computation, so game over!

    Why is the Turing machine special among all equivalent models?

    The Turing machine captures the intuitive notion of computation!

    “Computing is normally done by writing certain symbols on paper. We may suppose that this paper is divided into squares like a child's arithmetic book. The behavior of the [human] computer at any moment is determined by the symbols which he is observing, and of his 'state of mind' at that moment… We may suppose that in a simple operation not more than one symbol is altered. We compare a man in the process of computing … to a machine which is only capable of a finite number of configurations … The machine is supplied with a ‘tape’ (the analogue of paper) … divided into sections (called ‘squares’) each capable of bearing a ‘symbol’”, Alan Turing, 1936.

  • The Church-Turing thesis claims this is the only sensible definition of computable functions!

    “[In 1934], Church had been speculating, and finally definitely proposed, that the \(\lambda\)-definable functions are all the effectively calculable functions … When Church proposed this thesis, I sat down to disprove it … but, quickly realizing that [my approach failed], I became overnight a supporter of the thesis.”, Stephen Kleene, 1979.

  • No candidate computing device, including quantum computers, is a serious challenge to the Church-Turing thesis.

    Extended Church-Turing thesis.

Turing Machines Revisited

  • A Turing machine consists of

    1. A finite control unit
    2. A 1-D tape, divided into cells; each cell holds one symbol from a finite alphabet.
    3. A head positioned at one of the cells
  • Initial configuration:

    Input is a finite-length string over the input alphabet; it is placed on the tape. All other cells, extending indefinitely to the left and right, hold the blank symbol. The head is positioned at the leftmost cell of the input.

  • How a TM computes

    The transition is a function of the state of the machine, and the tape symbol scanned. In one move, it will

    1. Change the state.
    2. Write a symbol in the cell scanned.
    3. Move the head one cell left or right.

    More formally, the computation is governed by the transition function \(\delta: Q \times \Sigma \to Q \times \Sigma \times \{\text{L},\text{R}\}\).

    It will perform the above step-by-step computation until it reaches the accept or reject state.

    Accepting/rejecting configurations are halting configurations.

  • Most of the time, we use high-level descriptions of a TM for simplicity.
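For concreteness, here is a small hedged sketch of a single-tape TM simulator; the dictionary delta plays the role of the transition function \(\delta: Q \times \Sigma \to Q \times \Sigma \times \{\text{L},\text{R}\}\), and the toy machine at the end (which flips bits and accepts at the first blank) is an invented illustration, not an example from the lecture.

    # A minimal single-tape Turing machine simulator (a sketch only).
    def run_tm(delta, start, accept, reject, input_string, blank='_', max_steps=10_000):
        tape = dict(enumerate(input_string))   # sparse tape; unwritten cells hold the blank
        state, head = start, 0                 # initial configuration: head on the leftmost input cell
        for _ in range(max_steps):
            if state in (accept, reject):      # halting configurations
                cells = range(min(tape, default=0), max(tape, default=0) + 1)
                return state, ''.join(tape.get(i, blank) for i in cells)
            symbol = tape.get(head, blank)
            state, written, move = delta[(state, symbol)]
            tape[head] = written
            head += 1 if move == 'R' else -1
        raise RuntimeError('no halting configuration reached within the step limit')

    # A toy machine: scan right, flipping 0s and 1s, and accept at the first blank.
    delta = {
        ('q0', '0'): ('q0', '1', 'R'),
        ('q0', '1'): ('q0', '0', 'R'),
        ('q0', '_'): ('qa', '_', 'R'),
    }
    print(run_tm(delta, 'q0', 'qa', 'qr', '0110'))   # ('qa', '1001_')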

Important Facts About Turing Machines

  1. We can encode Turing machines as binary strings (or integers).

    They are the codes for Turing machines.

    The description \(\desc{M}\) of a Turing machine \(M\) is an effective encoding of \(M\) as a string that is easy to parse.

  2. A configuration (or instantaneous description) of a TM \(M\) at a certain time step is the collection of information recording

    1. tape content,
    2. current state, and,
    3. head position.

    It is convenient to encode it as a string \(\alpha\) in \(\Gamma^*\) where \(\Gamma = (Q \times \Sigma) \cup \Sigma\).

  3. There is a universal Turing machine \(U\) that, given the description of another machine \(\desc{M}\) and string \(w\) as input, simulates the computation of \(M\) on input \(w\) (for \(t\) steps).

  4. Variants of TMs are equivalent in terms of computability.

    Multiple-tape TMs

  5. Nondeterministic TMs

    A nondeterministic TM accepts if there exists an accepting branch.

    Let \(M_N\) be a nondeterministic TM. Then there is a deterministic TM \(M_D\) such that \(L(M_N) = L(M_D)\).

    Proof idea: Use a two-tape machine; one of its tapes maintains a queue of configurations of \(M_N\). Breadth-first search on the nondeterministic computation tree.

Undecidability Review

  • A language or problem is Turing-recognizable if it is the collection of strings \(L(M)\) accepted by a Turing machine \(M\).

  • A language or problem is decidable if there is a machine recognizing it and the machine always halts.

  • Given two TM descriptions \(\desc{M_1}\) and \(\desc{M_2}\), do they act the same on all inputs?

  • Does it print Hello World!?

    # A hypothetical subroutine that searches for a counterexample to
    # Goldbach's conjecture and returns only if one is found.
    GoldbachCounterexample()
    print("Hello World!")
  • Just run it?

  • Halting problem and diagonal argument

    \(\HALT = \{ \tuple{M, w} \mid M \text{ halts on input } w\}\).

    Suppose there is a machine \(H\) that decides the Halting problem.

    Define machine \(D\) = “On input \(\desc{M}\), the encoding of a Turing machine:
    1. Simulate \(H\) on input \(\tuple{M, \desc{M}}\);
    2. Loop if \(H\) accepts;
    3. Halt if \(H\) rejects.”

    So \(D(\desc{M})\) loops if \(M(\desc{M})\) halts and \(D(\desc{M})\) halts if \(M(\desc{M})\) loops.

    Now the question is whether \(D\) halts on input \(\desc{D}\)?!

    It is called Cantor's diagonal argument.

Reduction

The idea of reduction can be used to prove undecidability of more problems.

Problem \(A\) reduces to \(B\) means that it's possible to decide \(A\) using an algorithm for \(B\) as a subroutine.

We write \(A \le_T B\) to mean \(A\) is Turing reducible to \(B\).

If \(B\) is decidable, then \(A\) is also decidable.

To prove a negative result about computation (that a certain problem is uncomputable), we usually design an algorithm (the reduction).

  • \(\problem{Post\ Correspondence}\) problem

  • \(\problem{Mortal\ Matrix}\) problem

    Given two \(15 \times 15\) integer matrices \(A\) and \(B\) (or six \(3 \times 3\) matrices), is it possible to multiply them together in some order, with repetitions allowed, to obtain the zero matrix?

    Shown to be undecidable (in 2014 for \(15 \times 15\)).

  • Wang tiles

    Given a set of Wang tiles, decide if it can be used to tile the plane.

    Robert Berger, a student of Wang, proved in 1966 that there is a transformation taking any Turing machine into a set of Wang tiles that tiles the plane if and only if the Turing machine does not halt.

Lambda Calculus is Turing Complete

  • A computational model is said to be Turing-complete or computationally universal if it can simulate any Turing machine.

  • Suppose \(M\) is a Turing machine. We will define a \(\lambda\)-term \(\mathbf{simulate}_M\) simulating the machine \(M\) in the sense that for every list \(\alpha\) representing a possible initial configuration of the machine, \(\mathbf{simulate}_M\, \alpha \betareds \alpha_{\mathrm{final}}\) if the machine terminates with final configuration \(\alpha_{\mathrm{final}}\).

    What happens if the machine does not halt?

    The leftmost \(\beta\)-reduction does not lead to a normal form!

  • The key idea for the simulation is to keep track of the configurations of the machine \(M\) on input \(w\).

  • We use the list data structure in \(\lambda\)-calculus to represent a configuration in \(\Gamma^*\).

    Suppose the current configuration is

    \[\begin{equation*} \alpha = \gamma_0 :: \gamma_1 :: \cdots :: \gamma_{k-1} :: \mathbf{nil}, \end{equation*} \]

    and the next configuration is

    \[\begin{equation*} \alpha' = \gamma'_0 :: \gamma'_1 :: \cdots :: \gamma'_{k'-1} :: \mathbf{nil}. \end{equation*} \]

    Note that \(k'\) differs from \(k\) by at most \(1\).

    As computation is local, every entry \(\gamma'_j\) in the next configuration \(\alpha'\) is a function of \(\gamma_{j-1}\), \(\gamma_j\), and \(\gamma_{j+1}\) (corner cases need special care).

    That is, there is a constant size function computing \(\gamma'_j\) given \(\gamma_{j-1}, \gamma_j, \gamma_{j+1}\) as input. We claim there is a term \(\mathbf{local}_M\) that represents the computation of the function using a constant number of \(\mathbf{and}\) and \(\mathbf{not}\).

    \[\begin{equation*} \mathbf{local}_M\, \tuple{\gamma_{j-1}, \gamma_j, \gamma_{j+1}} \betareds \gamma'_j. \end{equation*} \]

    You can use, for example, the Shannon expansion formula.

    It is also easy to come up with a term \(\mathbf{pack}\) that shifts and packs \(\alpha\) to get the list \(\cdots :: \tuple{\gamma_{j-1}, \gamma_j, \gamma_{j+1}} :: \cdots\). Similarly, there is a term \(\mathbf{terminate}\) that checks whether a configuration is terminating or not.

    Then we define \(\mathbf{next}_M \equiv \lambda l. \mathbf{map}\, (\mathbf{pack}\, l)\, \mathbf{local}_M\).

    Note that \(\mathbf{next}_M\) depends on \(M\) only.

    Finally, consider terms

    \[\begin{equation*} \mathbf{sim}_M \equiv \lambda sl. (\mathbf{terminate}\, l)\, l\, (s\, (\mathbf{next}_M\, l)), \end{equation*} \]

    and \(\mathbf{simulate}_M \equiv \mathbf{y}\, \mathbf{sim}_M\).

    Check that if \(M\) does not halt, the term \(\mathbf{simulate}_M\) does not have a normal form.

  • Note: It is also possible to simulate the computation of \(\lambda\)-calculus on a Turing machine by performing the leftmost \(\beta\)-reductions.

Homework Problem Discussions

  1. Homework 1, Problem 5
  2. Predecessor

Extra: Why is the Predecessor Harder?

  • Schema of iteration

    \[\begin{align*} f\, 0 & \equiv c,\\ f\, (n+1) & \equiv g (f\, n). \end{align*} \]

    The value of the current iteration depends on the value of the previous iteration.

    That's the case for \(\mathbf{succ}\) and \(\mathbf{exp}\) etc.

    Using Church numerals, \(f \equiv \lambda n. n g c\).

  • Schema of primitive recursion

    \[\begin{align*} f\, 0 & \equiv c,\\ f\, (n+1) & \equiv h\, n\, (f\, n). \end{align*} \]

    Using a pair, we can define a new function \(f'\, n = \pair{n}{f\, n}\) to convert it to iteration.

    Primitive recursive functions correspond to programs whose loops are for loops.
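A short Python sketch may make the pairing trick tangible: iterate plays the role of a Church numeral (apply \(g\) to \(c\) exactly \(n\) times), and primitive recursion is recovered by iterating on the pairs \(\pair{n}{f\, n}\). The function names are illustrative; the homework asks you to carry the same idea out inside the \(\lambda\)-calculus.

    # Iteration:  f(0) = c,  f(n+1) = g(f(n)).  With Church numerals this is f = \n. n g c.
    def iterate(g, c, n):
        value = c
        for _ in range(n):
            value = g(value)
        return value

    # Primitive recursion:  f(0) = c,  f(n+1) = h(n, f(n)).
    # The pairing trick iterates on pairs (n, f(n)) so the step function also sees n.
    def prim_rec(h, c, n):
        step = lambda p: (p[0] + 1, h(p[0], p[1]))
        return iterate(step, (0, c), n)[1]

    # Example: factorial is primitive recursive with c = 1 and h(n, r) = (n + 1) * r.
    print(prim_rec(lambda n, r: (n + 1) * r, 1, 5))   # 120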

Lecture 4: Advanced Computability

Kleene's Recursion Theorem

“No one shall expel us from the paradise that Cantor has created.” — David Hilbert

“A Swiss pocket knife for computability.” — Neil D. Jones

Motivation

  • The recursion theorem is an advanced mathematical result in the theory of computability.

  • The proof of the recursion theorem is easy to follow but not to come up with.

  • How did Kleene even formulate the theorem?

    Kleene was a student of Alonzo Church, whose pet invention was the \(\lambda\)-calculus. It is probably motivated by the fixed-point combinators in \(\lambda\)-calculus.

    Indeed, we will see a fixed-point theorem called Rogers's fixed-point theorem, which is equivalent to Kleene's recursion theorem and can be proven by directly mimicking the working of the Y combinator!

  • It connects to mathematical logic, the theory of self-reproducing systems, and even computer viruses.

Self-Reproduction

  • Factory of factory

    A car factory must be more complex than the cars it produces.

    The claim seems true because the factory itself has the car's design within it, in addition to the designs of all the manufacturing robots.

  • Can machines construct replicas of themselves?

Self-Printing Machines

  • Can we design a Turing machine that ignores its input and prints a copy of its own description?

  • We start with an easy lemma.

    There is a computable function \(q: \Sigma^* \to \Sigma^*\), where if \(w\) is any string, \(q(w)\) is the description of a Turing machine \(P_w\) that prints out \(w\) and then halts.

    This is easy once one understands the requirements.

    Turing machine \(Q\) = “On input \(w\):
    1. Constructs Turing machine \(P_w\) = ‘On any input:
      1. Erase input;
      2. Write \(w\) on the tape;
      3. Halt.’
    2. Output \(\desc{P_w}\).”
  1. The Construction Outline of SELF

    We now construct a TM \(\SELF\) that prints its code on the tape following Sipser's textbook.

    • Turing machine \(\SELF\) contains two parts \(A\) and \(B\), which we can think of as two separate procedures that go into \(\SELF\).

    • \(A\) runs first and passes the control to \(B\) once completed.

    • \(A\) uses the machine \(P_{\desc{B}}\), described by \(q(\desc{B})\).

      So machine \(A\) depends on machine \(B\), and we cannot complete its description without specifying the details of machine \(B\).

    • Construction for \(B\)

      1. Note that we cannot define it to be \(q(\desc{A})\) as it will be circular.

      2. Instead, we define \(B\) using a different strategy.

        \(B\) computes \(A\) from the output that \(A\) produces.

      3. If \(B\) has access to \(\desc{B}\), then it can call \(q\) and obtain \(\desc{A}\).

        Where does this \(\desc{B}\) come from? It's printed on the tape when \(A\) finishes.

  2. Full Construction

    In summary, we take machine \(A = P_{\desc{B}}\), and machine \(B\) = “On input \(\desc{M}\), where \(M\) is a portion of a TM:

    1. Compute \(q(\desc{M})\);
    2. Combine the result with \(\desc{M}\) to make a complete TM;
    3. Print the description of this TM and halt.”

    If we now run \(\SELF\), we observe the following behavior:

    1. First \(A\) runs, and it prints \(\desc{B}\) on the tape;
    2. Then \(B\) starts. It looks at the tape and finds its input, \(\desc{B}\).
    3. \(B\) calculates \(q(\desc{B}) = \desc{A}\) and combines that with \(\desc{B}\) into a TM description, \(\desc{\SELF}\).
    4. \(B\) prints this description and halts.
    • We can implement this construction in any (Turing-complete) programming language to obtain a program that outputs a copy of itself.

      s='s=%r;print(s%%s)';print(s%s)
    • We can even do so in natural language!

      Print out two copies of the following, the second one in quotes: “Print out two copies of the following, the second one in quotes:”
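For illustration, here is a sketch (not from Sipser's construction verbatim) unpacking the one-line Python quine above into the two-part \(A\)/\(B\) structure: the first uncommented line plays the role of \(A = P_{\desc{B}}\), writing the text of \(B\) on the “tape” as data, and the second plays the role of \(B\), reconstructing \(A\) from that data via repr and then printing itself. The comments are for exposition only; the actual quine is just the two uncommented lines.

# Part A (= P_<B>): write the text of part B as data.
b = 'print("b = " + repr(b)); print(b)'
# Part B: reprint A by applying repr to the data, then print B itself.
print("b = " + repr(b)); print(b)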

Kleene's Recursion Theorem

Let \(T\) be a TM that computes a function \(t: \Sigma^* \times \Sigma^* \to \Sigma^*\). There exists a TM \(R\) computing function \(r: \Sigma^* \to \Sigma^*\) such that for every \(w\), \(r(w) = t(\desc{R}, w)\).

  • Note that we can explicitly construct \(R\) from \(T\)!

  • The proof of the recursion theorem is similar to the construction of \(\SELF\), with several minor modifications.

    We construct a TM \(R\) in three parts \(A\), \(B\), \(T\) where \(T\) is given in the statement of the theorem.
    1. Here \(A\) is the TM \(P_{\desc{BT}}\) described by \(q(\desc{BT})\). To preserve the input \(w\), we redesign \(q\) so that \(P_{\desc{BT}}\) writes its output after the input \(w\) on the tape. So after \(A\) finishes, the tape contains \(w\desc{BT}\).
    2. \(B\) is the machine that finds the \(\desc{BT}\) part on the tape and runs \(q\) to obtain \(\desc{A} = q(\desc{BT})\). Then \(B\) combines \(A\), \(B\), and \(T\) into a single machine and obtains its description \(\desc{ABT} = \desc{R}\). \(B\) then transforms the content of the tape to \(\tuple{R,w}\) and passes control to \(T\).
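To make the construction concrete, here is a hypothetical sketch (an illustration, not part of the formal proof) of the same idea for Python source strings: given the source of a two-argument function t(code, w), kleene() assembles the source of a program \(R\) whose output on input \(w\) is \(t(\desc{R}, w)\). The %r/%% trick plays the role of \(q\), and the last lines of the assembled program play the roles of \(B\) and \(T\). All names here are made up for the sketch.

def kleene(t_src):
    """Given the source t_src of a function t(code, w), return the source of a
    program R such that running R with command-line input w prints t(<R>, w)."""
    body = t_src + "\ns = %r\nimport sys\nprint(t(s %% s, sys.argv[1]))\n"
    return body % body          # plays the role of q: fill in the program's own text as data

# Example t: report the length of R's own source together with the input.
T_SRC = "def t(code, w):\n    return str(len(code)) + ' ' + w"
r_src = kleene(T_SRC)

# Simulate "run R on input w" without writing a file.
import sys, io, contextlib

def run(src, w):
    old_argv, sys.argv = sys.argv, ["R", w]
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(src, {})
    sys.argv = old_argv
    return buf.getvalue().strip()

print(run(r_src, "hello"))   # prints t(<R>, "hello"): the length of R's source, then "hello"
print(len(r_src))            # the same length, computed directly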

Applications

  • The recursion theorem states that Turing machines can obtain their own description and then go on to compute with it.

    If you are designing a machine \(M\), you can include the phrase obtain own description \(\desc{M}\) in the informal description of \(M\)'s algorithm.

  • At first glance, this capability may seem useful only for tasks such as making a machine that prints a copy of itself.

  • But, as we will demonstrate, it is a handy technique for solving many problems in the theory of computability.

  1. SELF Redesigned

    Machine \(\SELF\) can be redesigned as follows.

    Machine \(\SELF\) = “On any input:

    1. Obtain, via the recursion theorem, own description \(\desc{\SELF}\).
    2. Print \(\desc{\SELF}\).”
  2. Undecidability

    \(\ACCTM = \{\desc{M, w} \mid M \text{ accepts } w \}\) is undecidable.

    It is Turing-recognizable by universal TMs.

    Assume to the contrary that there is a TM \(H\) that decides \(\ACCTM\).

    Consider machine \(B\): “On input \(w\):

    1. Obtain, via the recursion theorem, own description \(\desc{B}\).
    2. Run \(H\) on input \(\desc{B, w}\).
    3. Do the opposite of what \(H\) says. That is, accept if \(H\) rejects and reject if \(H\) accepts.”

    \(H\) cannot perform as required on \(\desc{B,w}\): if \(H\) accepts \(\desc{B,w}\), then \(B\) rejects \(w\), and vice versa. So \(H\) cannot decide \(\ACCTM\).

  3. Minimal Machine

    • The description length of machine \(M\) is the number of symbols in the description \(\desc{M}\).

    • \(M\) is minimal if no Turing machine equivalent to \(M\) has a shorter description.

      \[\begin{equation*} \MINTM = \{ \desc{M} \mid M \text{ is a minimal TM } \}. \end{equation*} \]
    • \(\MINTM\) is not Turing-recognizable.

      Assume to the contrary that some TM \(E\) enumerates \(\MINTM\) and we will obtain a contradiction.

      We construct the following TM \(C\) = “On input \(w\):
      1. Obtain, via the recursion theorem, own description \(\desc{C}\).
      2. Run the enumerator \(E\) until machine \(D\) appears with a longer description than that of \(C\).
      3. Simulate \(D\) on input \(w\).”

      Because \(\MINTM\) is infinite, \(E\)'s list must contain a TM with a longer description than C's description.

      Therefore, Step 2 of machine \(C\) eventually terminates with some machine \(D\) that is longer than \(C\).

      Machine \(C\) simulates \(D\), and so is equivalent to it.

      Because \(C\) is shorter than \(D\) and is equivalent to it, \(D\) cannot be minimal.

      But \(D\) appears on the list that \(E\) produces, a contradiction.

      Let \(A\) be an infinite subset of \(\MINTM\). Is \(A\) Turing-recognizable?

  4. Rogers's Fixed-Point Theorem

    Let \(t:\Sigma^* \to \Sigma^*\) be a computable function. Then there is a Turing machine \(F\) for which \(t(\desc{F})\) describes a machine equivalent to \(F\).

    Here, we assume strings of invalid descriptions correspond to machines that reject immediately.

    Consider machine \(F\) = “On input \(w\):

    1. Obtain, via the recursion theorem, own description \(\desc{F}\).
    2. Compute \(t(\desc{F})\) to obtain the description of a TM \(G\).
    3. Simulate \(G\) on \(w\).”

    The rest of the proof is now straightforward.

    Note: There is a proof of Rogers's fixed-point theorem that directly implements the Y combinator!

Gödel's Incompleteness Theorems

Proving Gödel's famous Incompleteness Theorem is easy if you use computer science.

There are inherent limitations in mathematical reasoning.

‘This statement cannot be proved.’

Euclidean geometry and the fifth postulate (in Playfair's form): Through a given point \(P\) not on a line \(L\), there is one and only one line in the plane of \(P\) and \(L\) which does not meet \(L\).

Is it a consequence of the other four axioms?

Logic Review

  • Logic is a branch of mathematics that studies mathematical reasoning.

  • You should have seen propositional logic and first-order logic.

  • First-order logic

    1. Logical connectives \(\land\), \(\lor\), \(\neg\), and \(\rightarrow\);
    2. Quantifiers \(\forall, \exists\);
    3. Variables \(x, y, \ldots\);
    4. The equality symbol \(=\);
    5. Constants \(c_1, c_2, \ldots\);
    6. Relations \(R_1, R_2, \ldots, R_k\);
    7. Functions \(f_1, f_2, \ldots, f_r\).
  • Well-formed formula

    1. \(\forall q \exists p \forall x,y \bigl[ x,y > 1 \rightarrow xy \ne p \bigr]\).
    2. \(\forall a,b,c,n\bigl[ (a,b,c>0 \land n>2) \rightarrow a^n+b^n \ne c^n \bigr]\). (Fermat's Last Theorem and Andrew Wiles)
    3. \(\forall q\exists p \forall x,y \bigl[ p > q \land (x,y>1 \rightarrow (xy \ne p \land xy \ne p+2)) \bigr]\). (Twin Prime Conjecture and Yitang Zhang)
  • A formula with no free variables is called a sentence.

Truth and Interpretations (Models)

Truth is defined by the semantics.

  • An interpretation \(\mathcal{U}\) consists of

    1. A non-empty set \(U\) to represent the universe,
    2. For every constant \(c\), assign an element \(c^{\mathcal{U}} \in U\),
    3. For every \(m\)-ary predicate \(R\), assign an \(m\)-ary relation \(R^{\mathcal{U}} \subseteq U^m\),
    4. For every \(n\)-ary \(f\), assign an \(n\)-ary function \(f^{\mathcal{U}} : U^n \to U\).
  • If \(\varphi\) is a sentence, it is either true or false in a given model \(\mathcal{U}\).

  • We call \(\varphi\) a tautology if it is true in all models.

    \(\exists y \forall x \bigl[ x = R(y) \bigr] \rightarrow \forall x,y \bigl[ x = y\bigr]\).

Proofs and Deduction Calculus

Proofs are syntactic deductions.

There are many proof systems for first-order logic, including Hilbert-style deductive systems, natural deduction, the sequent calculus, etc.

  • Axioms

    1. \(A \lor \neg A\) is a tautology for all sentences \(A\).

      Or we write

      \[\begin{prooftree} \AXC{} \UIC{$A \lor \neg A$} \end{prooftree} \]
    2. \(\cdots\)

    There are not finitely many axioms; rather, there are finitely many axiom schemas, each standing for infinitely many axioms.

  • Deduction rules

    From \(A\) and \(A \rightarrow B\) we can deduce \(B\) (modus ponens).

    \[\begin{prooftree} \AXC{$A$} \AXC{$A \to B$} \BIC{$B$} \end{prooftree} \]

    From \(P(c)\) we can deduce \(\forall x\, P(x)\) (universal generalization), provided the constant \(c\) is arbitrary, i.e., it does not appear in any assumption.

    \[\begin{prooftree} \AXC{$P(c)$} \UIC{$\forall x\, P(x)$} \end{prooftree} \]
  • Example: Hilbert System

    \[\begin{align*} \mathbf{Ax}_1 \quad & A \to (B \to A),\\ \mathbf{Ax}_2 \quad & (A \to (B \to C)) \to ((A \to B) \to (A \to C)),\\ \mathbf{Ax}_3 \quad & (\neg A \to \neg B) \to (B \to A). \end{align*} \]
  • Proof and computation

    A proof system is an algorithm \(V\) that satisfies

    1. Effectiveness. For every statement \(x\) and proof string \(w\), \(V(x, w)\) halts with output either \(0\) or \(1\).
    2. Soundness. If \(x\) is not true, then for every \(w\), \(V(x, w) = 0\).

    Completeness. True statements can be proved: one of \(S\) and \(\neg S\) can always be proved.

Gödel's Completeness Theorem

  • Is every tautology provable?

    Yes!

    Gödel's PhD thesis

  • A fundamental theorem (which you may have seen previously)

Typical Use of First-Order Logic

  • In mathematics, we care more about the truth of a fixed model.

    • Start with some axioms that are true in the model/interpretation.

    • Derive true statements from the axioms.

  • Example: Arithmetic of \(\natural\) (Peano axioms, first-order version)

    Constants: \(0\)

    Functions: \(\mathrm{S}(x)\), \(\mathrm{Plus}(x,y)\), \(\mathrm{Times}(x,y)\).

    Axioms:

    1. \(\forall x\, (\mathrm{S} x \ne 0)\),
    2. \(\forall x, y\, (\mathrm{S} x = \mathrm{S} y \to x = y)\),
    3. \(\cdots\)
  • Set theory (Zermelo-Fraenkel, ZFC)

    1. Axiom of Extensionality: If \(X\) and \(Y\) have the same elements, then \(X=Y\)

      \(\forall u\, [u \in X \leftrightarrow u \in Y ] \to X = Y\).

    2. \(\cdots\)

    C is for the infamous axiom of choice.

    AC is independent of ZF (P. Cohen).

    ZFC: Standard set theory can be used to model and prove almost anything in mathematics.

  • Principia Mathematica

    Bertrand Russell and Alfred North Whitehead

    《数学原理》 (its Chinese title)

    Doable but tedious! It took them hundreds of pages to prove \(1 + 1 = 2\).

    Can computers help?!

Gödel's Incompleteness Theorems

Any mathematical proof system that is sufficiently expressive and has computable axioms cannot be both complete and consistent.

Consistent: Cannot prove both \(S\) and \(\neg S\).

A Proof-Searching Approach to Attack the Halting Problem

  • Consider the Halting problem \(\HALT_{\varepsilon} = \{ \tuple{M} \mid M \text{ halts on the empty input}\}\).

  • Idea: Search for proofs that the machine \(M\) on \(w\) halts or loops.

    Lemma: If \(M\) halts on input \(w\), there is a mathematical (ZFC) proof of this fact. The proof simply writes down and verifies the configuration history:

    1. The configuration at time step \(1\) is the starting configuration \(q_0 \alpha_0 \alpha_1 \cdots\);
    2. The configuration at each time step \(t\) follows from the previous configuration \(\alpha^{(t-1)}\) and the transition function of \(M\);
    3. The configuration at the final step \(T\) is in a halting state.
  • Idea: Combine the proof-search approach with diagonal argument

  • Define Turing machine \(D\)

    \(D\) = “On any input:

    1. Obtain own description \(\desc{D}\) by recursion theorem.
    2. For all \(k = 1, 2, \ldots\) and all strings \(P\) of length \(k\):
      1. (Halt Proof Check). Check if \(P\) is a valid ZFC proof for the statement ‘\(D\) halts’. If yes, then \(D\) enters a loop-forever state.
      2. (Loop Proof Check). Check if \(P\) is a valid ZFC proof for the statement ‘\(D\) loops’. If yes, halt.”
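A sketch of \(D\) in Python (illustrative only), assuming a hypothetical proof checker is_valid_zfc_proof(statement, proof) and an encoding of candidate proofs as binary strings; neither is a real library, and both names are made up for this sketch.

from itertools import count, product

ALPHABET = "01"    # assumption: candidate proofs are encoded as binary strings

def D(is_valid_zfc_proof, own_description):
    """Search all candidate proofs by length; do the opposite of whatever is proved."""
    halts_stmt = f"machine {own_description} halts on the empty input"
    loops_stmt = f"machine {own_description} loops on the empty input"
    for k in count(1):
        for letters in product(ALPHABET, repeat=k):
            P = "".join(letters)
            if is_valid_zfc_proof(halts_stmt, P):
                while True:        # a proof of 'D halts' was found: loop forever
                    pass
            if is_valid_zfc_proof(loops_stmt, P):
                return             # a proof of 'D loops' was found: halt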

    If ZFC proves ‘\(D\) loops’, then the machine \(D\) will find this proof and \(D\) actually halts. By the above lemma, ZFC also proves ‘\(D\) halts’, which is a contradiction to consistency.

    Note: We used the assumption that ZFC is consistent twice. In addition to its appearance in concluding the contradiction, it is also used to argue that \(D\) actually halts. If ZFC is not consistent, there may be a proof of ‘\(D\) halts’ that appears earlier than the proof of ‘\(D\) loops’, in which case \(D\) enters the loop-forever state instead.

    If ZFC proves ‘\(D\) halts’, then the machine \(D\) will eventually enter the loop-forever state, say after \(T\) steps! In that case, the statement ‘\(D\) loops’ can be proved in ZFC by writing down the configuration history for \(T+1\) steps (after which \(D\) stays in the loop-forever state). Again a contradiction to consistency.

  • ZFC cannot be both complete and consistent.

    Either ‘\(D\) loops’ or ‘\(D\) halts’ is true, yet neither is provable, assuming ZFC is consistent!

    We are done proving Gödel's First Incompleteness Theorem.

  • Remark: Adding this true statement as a new axiom does not solve the problem!

  • Remark: Peano arithmetic (PA) over the natural numbers is also powerful enough (to encode and reason about TMs)!

    Gödel numbering: assigns a unique natural number to each symbol and well-formed formula of some formal language.

    ‘Statement \(g\) has no proof’ has Gödel number \(g\)!

Gödel's Second Incompleteness Theorem

  • Which statement is true, ‘\(D\) halts’ or ‘\(D\) loops’?

    ‘\(D\) halts’ cannot be true. If it were true, there would be a ZFC proof of it by the lemma, and then, by the definition of \(D\), \(D\) would loop. Contradiction.

  • Wait! Can we write down the above argument that ‘\(D\) loops’?

    Is it a proof for ‘\(D\) loops’?

  • No! What we can prove from the above argument is only that ‘ZFC is consistent’ implies ‘\(D\) loops’, because the argument, and in particular the analysis of \(D\)'s behavior, relied on the assumption that ZFC is consistent.

    ZFC proves ‘ZFC consistent \(\to\) \(D\) loops’.

    But ZFC cannot prove ‘\(D\) loops’ as we have already proved.

  • So ZFC cannot prove its own consistency!

    Assume ZFC is consistent. Then ‘ZFC consistent’ is a true statement that it cannot prove.

    Like the Halting problem for undecidability, it is a concrete/interesting statement for the incompleteness theorem!

Lecture 5: Time Complexity

Time Complexity

We will embark on the study of computational complexity.

Computational models (deterministic, nondeterministic, randomized), resources (time, space), and problems.

Time since the Big Bang: \(13.8\) billion years or \(4.36 \times 10^{17}\) seconds.

Palindrome Example

  • Consider the palindrome problem \(\PAL = \{w \mid w^R = w\}\).

  • Examples:

    Dammit I'm mad!

    Dr. Awkward & Olson in Oslo, a palindromic novel numbering 31,594 words (or approximately 104,000 letters) by Lawrence Levine

    《西江月·咏梅》 (“The Moon over the West River: Ode to the Plum Blossom”, a palindromic ci poem)

    〔宋〕苏轼 (Su Shi, Song dynasty)

    马趁香微路远,沙笼月淡烟斜。 渡波清彻映妍华。 倒绿枝寒凤挂。

    挂凤寒枝绿倒,华妍映彻清波。 渡斜烟淡月笼沙。 远路微香趁马。

  • You have seen this problem/language in the FLA course.

    It cannot be solved by a finite automaton, but it can be solved with the full notion of an algorithm or a programming language (Python/C).

  • What is the time complexity of the palindrome problem?

Measuring the Time Complexity Using TMs

Consider the \(\PAL\) Turing machine \(P\) = “On input \(x\):

  1. Read the first letter, remember what it is, replace it with blank;
  2. Move to the end of the input and check if the last letter matches the first; Reject if it does not match; Replace it with blank otherwise;
  3. If the string is empty and has not found a mismatch, accept;
  4. Otherwise, move to the beginning of the string, and repeat from step 1.”
  • It is natural to measure the time complexity of a TM by the number of steps it takes on an input.

  • It depends on the length of input \(x\), and we usually reserve \(n\) for this.

    The running time of a TM is a function of the input length.

  • If \(x\) is really a palindrome, we need (basically)

    \((n+1) + n + \cdots + 1 = \frac{1}{2}n^2 + \frac{3}{2}n + 1\).

  • If \(x\) is not a palindrome, it depends on the input and may stop much earlier.

    The running time of an algorithm \(\mathcal{A}\) is a function \(T_{\mathcal{A}} : \natural \to \natural\) defined as

    \[\begin{equation*} T_{\mathcal{A}}(n) = \max_{\substack{\text{ instances } x \text{ of}\\ \text{length } n}} \text{steps}(\mathcal{A}, x). \end{equation*} \]

    So it is a worst-case definition.

  • We may omit the subscript \(\mathcal{A}\) when the algorithm under investigation is clear from the context and write \(T(n) = \frac{1}{2}n^2 + \frac{3}{2}n + 1\) as the running time of the \(\PAL\) Turing machine \(P\).

  • Why worst-case?

    1. Other variants such as average-case complexity and smoothed analysis are seriously considered.

      Smoothed analysis of algorithms by Spielman and Teng, 2002.

    2. But the worst-case complexity is the most important, and probably the first thing you need to understand when analyzing an algorithm.

      It gives a guarantee for all possible instances.

      It is sometimes hard to define what the distribution of random inputs should be.

Big-O Notations

  • We'd like to focus on the big picture of the running time.

    How does it scale as a function of \(n\)?

    In other types of computation models, it is not that clear what a step means.

  • So we need the right level of abstraction and hide unimportant details.

    For the \(T(n) = \frac{1}{2}n^2 + \frac{3}{2}n + 1\) case, the main message of the running time is that it is a quadratic function, proportional to \(n^2\).

  • \(O(\cdot)\), \(\Omega(\cdot)\), and \(\Theta(\cdot)\):

    1. \(T(n)\) is \(O(n^2)\) if \(T(n)\) is at most proportional to \(n^2\) as \(n\) grows;
    2. \(T(n)\) is \(\Omega(n^2)\) if \(T(n)\) is at least proportional to \(n^2\) as \(n\) grows;
    3. \(T(n)\) is \(\Theta(n^2)\) if \(T(n)\) is proportional to \(n^2\) as \(n\) grows.
  • Examples

    \(\frac{1}{2}n^2 + \frac{3}{2}n + 1\) is \(\Theta(n^2)\), \(O(n^2)\), \(O(n^3)\), \(\Omega(n^2)\), \(\Omega(n)\) and is not \(O(n)\) or \(\Omega(2^n)\).

  • Formal definition

    Let \(f\) and \(g\) be functions \(f, g: \natural \to \natural\). Say that \(f(n) = O(g(n))\) if there are positive integers \(c\) and \(n_0\) such that for every integer \(n \ge n_0\), \(f(n) \le c g(n)\).

    Let \(f\) and \(g\) be functions \(f, g: \natural \to \natural\). Say that \(f(n) = \Omega(g(n))\) if there are a positive real number \(c\) and a positive integer \(n_0\) such that for every integer \(n \ge n_0\), \(f(n) \ge c g(n)\).

    We say that \(f(n) = \Theta(g(n))\) if \(f(n)\) is \(O(g(n))\) and \(\Omega(g(n))\).

  • Example functions, each of which is big-\(O\) of the next:

    \(O(1), O(\log^* n), O(\log\log n), O(\log n), O(\sqrt{n}), O \bigl( \frac{n}{\log(n)} \bigr)\),

    \(O(n), O(n \log n), O(n^{1.5}), O(n^3), O(n^{\log n})\),

    \(O(2^n), O(n!), O(2^{2^n})\).

Recurrence Relation

\(T(n) = a T(n/b) + f(n)\)

Examples:

  1. Binary search, \(T(n) = T(n/2) + O(1)\), \(T(n) = O(\log n)\)
  2. Merge sort, \(T(n) = 2T(n/2) + O(n)\), \(T(n) = O(n \log n)\).

See Master Theorem.
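As a concrete instance of the second recurrence, here is a standard merge sort sketch (not part of the notes) whose structure matches \(T(n) = 2T(n/2) + O(n)\): two recursive calls on halves plus a linear-time merge.

def merge_sort(a):
    """Sort list a: two recursive calls on halves (2T(n/2)) plus a linear merge (O(n))."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):     # the O(n) merge step
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))           # [1, 2, 5, 5, 6, 9]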

Intrinsic Time Complexity

  • Given a problem we can ask for its intrinsic time complexity.

    The running time of the fastest algorithm up to \(\Theta(\cdot)\) and fixing the model of computation.

  • Claim: Any algorithm solving \(\PAL\) uses at least \(n-1\) steps.

    Suppose algorithm \(\mathcal{A}\) solves it in \(n-2\) steps. Then for input \(x = 0^n\) there are two positions \(i\) and \(j\) that \(\mathcal{A}\) never reads; at least one of them, say \(i\), is not the middle position. Flipping \(x[i]\) turns \(x\) into a non-palindrome, yet \(\mathcal{A}\)'s run, and hence its answer, stays the same, a contradiction.

Time Complexity Class

Define the time complexity class \(\TIME(t(n))\) as the collection of decision problems solvable by a TM with running time \(O(t(n))\).

Time Complexity for Programs

def TwoFingerPal(x):
    """Two-pointer check: compare symbols from the two ends, moving inward."""
    lo = 0
    hi = len(x) - 1
    while lo < hi:
        if x[lo] != x[hi]:
            return False
        lo = lo + 1
        hi = hi - 1
    return True

def OneLinePal(x):
    """Compare the string with its reverse."""
    return x == x[::-1]

What is the time complexity for this algorithm?

Is it \(\Theta(n)\) or …?

  • Time complexity is model dependent.

    A two-tape TM could do it in \(O(n)\) steps.

    For single-tape TM, it requires \(\Omega(n^2)\) steps.

  • Two subtle questions for considering the time complexity of the python code:

    1. How do we measure the cost of accessing x[lo] and x[hi]?

    2. How do we measure the cost of \(+1\) and \(-1\) (when the numbers could be as large as \(n\))?

  1. RAM Model

    • One limitation of TMs is that we can only access one location of tape where the head is.

    • Actual computers often provide Random Access Memory (RAM) which can be thought of as a large array (memory).

      The computational model that models access to such a memory is the RAM machine.

    • The additional ability of a RAM machine is that the machine can read/write an indexed entry of the memory in one step.

      Usually, the memory of a RAM machine is an array of unbounded size where each cell can store a single word.

      In addition to the memory array, a RAM machine also contains a constant number of registers \(r_0, \ldots, r_{k-1}\), each of which contains a single word.

      1. Data movement: Move data between a cell of memory and a register.

        For example, in one step, a RAM machine can load into register \(r_j\) from the location of the memory specified by \(r_i\).

      2. Computation: Performed on the registers.

    • It is closer to real machines.

      Pros: Reasonably realistic and relatively simple.

      Cons: It's not easy to define precisely.

    • We write \(\TIME_{\text{RAM}}(t(n))\) for class of problems decidable in time \(O(t(n))\) on a RAM machine.

    • It is equivalent to the TM model up to polynomial overhead.

Class \(\P\)

  • The class of problems that can be solved by deterministic TMs in polynomial time.

    \(\P = \bigcup_{c>0} \TIME(n^c)\).

    Adopting polynomial worst-case performance as our criterion of efficiency results in an elegant and useful theory that says something meaningful about practical computation, and would be impossible without this simplification.

  • There are efficient computations that are not polynomial, and polynomial computations that are not efficient in practice!

    \(n^{100}\) vs. \(n^{\log n}\).

  • It captures the notion of efficient computation in most of the cases.

    Usually, the exponent can be reduced to a small number.

    Extended Church-Turing Thesis

  • \(\P\) is stable: All “reasonable” computational models are equivalent if we only care about the distinction between polynomial and exponential.

    Polynomial bounds are closed under addition and multiplication.

  • Exponential time

    \(\EXP = \bigcup_{c>0} \TIME(2^{n^c})\).

Time Hierarchy Theorem

If you have more time, you can solve more problems.

Theorem Statement

For any time-constructible function \(t: \natural \to \natural\), there is a language \(A\) that is decidable in \(O(t(n))\) time but not decidable in time \(o(t(n) / \log t(n))\).

In other words \(\TIME\bigl( o(t(n) / \log t(n))\bigr) \subsetneq \TIME(t(n))\).

Remarks

  • Remark 1: Why time-constructible (nice)?

    A function \(t(n) \ge n \log(n)\) is called time-constructible if there is a TM taking \(1^n\) as input and computes the binary representation of \(t(n)\) in time \(O(t(n))\).

    Examples: \(n \log n\), \(n \sqrt{n}\), and \(2^n\) are time-constructible. \(\sqrt{n}\) or \(n\) is not.

    Space-constructible functions can be defined similarly.

    Examples: \(\log n\), \(n^2\) are space-constructible. \(\log\log n\) is not. Yet, there is a space constructible function in \(O(\log\log n)\).

  • Remark 2: The logarithmic gap!

    This comes from the overhead from the \(t\)-step simulation on the single-tape TMs. If this simulation were improved, we could improve the theorem accordingly.

  • Remark 3: In Barak's textbook, the difference is said to be arbitrary; this subtle difference comes from the fact that the running time there is defined for a variant of the RAM model.

Proof Outline

  • The following \(O(t(n))\) time algorithm \(D\) decides a language \(A\) that is not decidable in \(o(t(n)/ \log t(n))\) time.

    \(D\) = “On input \(w\):

    1. Compute the length \(n\) of \(w\).
    2. Compute \(t(n)\) using time-constructibility and store the value \(t(n)/ \log t(n)\) in a binary counter.
    3. Parse \(w\) as \(\desc{M}10^*\) for some TM \(M\); if \(w\) is not of this form, reject.
    4. Simulate \(M\) on \(w\). Decrement this counter before each step. If the counter ever hits \(0\), reject.
    5. If \(M\) accepts, then reject. If \(M\) rejects, then accept.”
  • Note: Why do we use inputs of the form \(\desc{M}10^*\)? This is a padding technique that lets every machine \(M\) appear as inputs of arbitrarily large length, so that we can choose the constant \(n_0\) in the asymptotic comparison.

    First of all, we claim that \(D\) runs in time \(O(t(n))\).

    1. This is true for Step 1 because \(t(n) \ge n \log n\).
    2. For Step 2, this follows from the time-constructibility of \(t\).
    3. For Step 3, Easy.
    4. For Step 4, we can simulate the machine using the multi-track technique. We use three tracks: the first track for \(M\)'s tape, the second track for keeping the state, head symbol, and transition function of \(M\) close to the head, and the third track for holding the counter, which has length \(O(\log(t(n)/ \log t(n))) = O(\log t(n))\). Each simulated step costs \(O(\log t(n))\), and the number of steps is \(t(n) / \log t(n)\). So the total time is still \(O(t(n))\).

    Next, we prove that the language defined by the machine \(D\) above does not have any algorithm of running time \(o(t(n)/ \log t(n))\). Assume to the contrary that TM \(M\) decides it in time \(g(n) = o(t(n) / \log t(n))\).

    Say \(D\)'s simulation of \(M\) takes at most \(d\, g(n)\) counted steps for some constant \(d\). Choose \(n \ge n_0\) large enough that \(d\, g(n) \le t(n)/\log t(n)\), so the counter does not run out.

    Then \(D\), on the input \(\desc{M}10^{k}\) of this length \(n\), completes the simulation and outputs differently from \(M\), contradicting the assumption that \(M\) decides \(D\)'s language.

Corollaries

  • An easy corollary is that for all constants \(c_1 > c_0 \ge 1\), \(\TIME(n^{c_0}) \subsetneq \TIME(n^{c_1})\).

    And, similarly, \(\P \subsetneq \EXP\).

The Gap Theorem

  • There is an increasing computable function \(g: \natural \to \natural\) such that

    \[\begin{equation*} \TIME(g(n)) = \TIME(2^{g(n)}). \end{equation*} \]
  • It means that the time-constructible condition is necessary for the time hierarchy theorem.

  • Proof Outline

    Consider \(T_m(n)\), the maximum time machine \(m\) takes on input of size \(n\). Note that \(T_m(n)\) could be undefined if \(m\) does not halt on some input of length \(n\).

    Define \(g(0) = 1\). For \(n \ge 1\), define \(g(n) = j\), where \(j\) is the smallest number such that \(j > g(n-1)\) and, for all \(m < n\), \(T_{m}(n)\) is not in the interval \([j, 2^j]\).

    One can prove that \(g\) is a computable function and that, for every \(m\), at least one of the following holds:

    1. \(T_m(n) \le g(n)\) for all but finitely many \(n\);
    2. \(T_m(n) > 2^{g(n)}\) for infinitely many \(n\).

    This means that no machine has time complexity between \(g(n)\) and \(2^{g(n)}\).

Lecture 6: Problems in P and NP

Computational Problems

“An algorithm is a finite answer to an infinite number of questions.” — Stephen Kleene

  • Computational problem: A family of problem instances with increasing size.

  • Decision problems and search problems

Graph Problems

  • Graph \(G = (V, E)\).

    Reserve \(n\) to denote the number of vertices \(\abs{V}\).

    Directed edge \((u,v)\) or undirected edge \(\{u,v\}\) which can also be denoted as \(u \to v\) or \(u \sim v\).

  • Graphs can be represented either in the adjacency list or adjacency matrix representation.

    Input size is measured by \(n\) (and sometimes also \(m = \abs{E}\)).

  • Graphs are ubiquitous in CS.

    • Roads and route planning
    • Web
    • Social network
    • Correlations in data
  • We focus on undirected graphs for simplicity.

  1. Shortest Path and Reachability

    Given \(G = (V, E)\) and \(s,t \in V\), the \(\SHORTPATH\) problem computes the length of the shortest path from \(s\) to \(t\). The simpler \(\REACHABILITY\) problem asks whether such a path exists.

    • There can be exponentially many paths from \(s\) to \(t\).

      We have an efficient way to find the shortest one by breadth-first search (BFS).

    • More generally, Dijkstra's algorithm can find the shortest path in weighted graphs, where we assign a weight \(w:E \to [0, \infty)\) to each edge. The complexity of Dijkstra's algorithm is \(\Theta(\abs{E} + \abs{V} \log \abs{V})\) (when using the Fibonacci-heap min-priority queue of Fredman and Tarjan).
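A short sketch (illustrative; the adjacency-dict representation is just one choice) of the BFS approach to \(\SHORTPATH\) on unweighted graphs:

from collections import deque

def bfs_shortest_path(G, s, t):
    """G maps each vertex to an iterable of neighbors; return dist(s, t) or None."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            return dist[u]
        for v in G[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None    # t is unreachable from s

G = {"s": ["a", "b"], "a": ["t"], "b": ["a"], "t": []}
print(bfs_shortest_path(G, "s", "t"))    # 2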

  2. Longest Path

    The \(\LONGPATH\) problem finds the length of the longest simple path between a given pair of vertices \(s\) and \(t\) in a given graph \(G\).

    • Simple path: a path with no repeating vertices.

      Again there may be exponentially many paths we need to consider.

      For this problem, however, we do not know of any significant improvement over the brute force enumeration algorithm.

    • The best-known algorithms for the longest path problem take \(O(c^n)\) time for some constant \(c > 1\). (The best record is \(c \approx 1.65\)).

    • It is a generalization of the famous Hamiltonian path problem \(\HAMPATH\), which asks whether there is a simple path that visits all \(n\) vertices.

    • Hamiltonian cycle problem \(\HAMCYCLE\)

    • Traveling salesman problem \(\TSP\)

      Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?

  3. Min-Cut

    • Definitions

      Given a graph \(G = (V,E)\), a cut of \(G\) is a subset \(S \subseteq V\) such that \(S\) is neither empty nor \(V\).

      The edges cut by \(S\) are those edges that have one vertex in \(S\) and the other in \(V \setminus S\).

      For two vertices \(s,t \in V\), an \(s,t\)-cut is a cut such that \(s \in S\) and \(t \in V \setminus S\).

      Computing the minimum \(s,t\)-cut (\(\MINCUT\) problem) is useful in many applications since minimum cuts often correspond to bottlenecks.

    • We can obtain a polynomial-time algorithm for computing the min-cut using the Max-Flow Min-Cut Theorem.

    • The Ford-Fulkerson algorithm computes a flow using incremental improvements.

      Finding the improvement is a \(\REACHABILITY\) problem.

    • Computing flows in polynomial time also follows by using a much more general tool known as linear programming (LP).

      \[\begin{equation*} \begin{split} \text{Maximize:}\, & c^T x\\ \text{Subject to:}\, & A x \le b, x \ge 0\\ \end{split} \end{equation*} \]
    • LP formulation of the \(\MAXFLOW\) problem

      A flow on a graph \(G\) of \(m\) edges can be modeled as a real vector \(x \in \real^m\) where for every edge \(e = (u, v)\), \(x_e\) corresponds to the flow on \(e\) from \(u\) to \(v\). For all \(u, v\in V\) and \(e'=(v, u)\), \(x_{e'} = - x_e\). So, for each edge \(e\) we assign variable \(x_e \in [-1,1]\).

      The (flow-balancing) constraints, where \(\pi_1(e)\) denotes the first endpoint (tail) of edge \(e\), are

      \[\begin{equation*} \sum_{e: \pi_1(e) = v} x_e = 0\quad \forall v\in V\setminus \{s,t\} \end{equation*} \]

      The objective function we want to maximize is \(\sum_{e: \pi_1(e) = s} x_e\).

    • Luckily, there are polynomial-time algorithms for solving linear programming, and hence we can solve \(\MAXFLOW\) and also \(\MINCUT\) in polynomial time.

    • Note: In a recent breakthrough, an almost-linear-time algorithm for the \(\MAXFLOW\) problem was discovered. See Maximum Flow and Minimum-Cost Flow in Almost-Linear Time by Li Chen, Rasmus Kyng, Yang P. Liu, Richard Peng, Maximilian Probst Gutenberg, and Sushant Sachdeva.

  4. Max-Cut

    • \(\MAXCUT\) arises in VLSI design, and has some surprising relation to analyzing the Ising model in statistical physics.

      The Ising model on graph \(G = (V, E)\) has the Hamiltonian

      \[\begin{equation*} H(x) = \sum_{ u \sim v} J_{uv} x_u x_v, \end{equation*} \]

      which represents the energy of the system when the variables \(x\) take value in \(\{\pm 1\}^{V}\).

      William Rowan Hamilton, Hamiltonian path, Hamiltonian mechanics, Quaternions

    • Finding ground energy of the Hamiltonian corresponds to computing the \(\MAXCUT\) of \(G\) with weights \(\{J_{uv}\}\).

      It is easy to show \(H(x) = J - 2\, \mathrm{CUT}(x)\), where \(J = \sum_{u \sim v} J_{uv}\) and \(\mathrm{CUT}(x)\) is the total weight of the edges cut by the partition defined by \(x\).

    • It is one of the favorite problems for some theorists because of its connection to approximation algorithms and the Unique Games Conjecture (UGC).

      Goemans and Williamson's algorithm has approximation ratio \(\alpha \approx 0.878\).

      If UGC is true, this is the best possible approximation ratio for \(\MAXCUT\) where approximation ratio is the size of the solution cut divided by the optimal max-cut size.

    • It is easy to design an approximation algorithm with ratio \(\alpha = 0.5\), which we will see later.

Logic Problems

  1. SAT

    • Given a conjunctive normal form (CNF) logic formula, determine whether it is satisfiable or not.

    • Example of a \(\SAT\) input, in the DIMACS CNF format, for \((x_1 \lor x_2 \lor \neg x_4) \land (x_3 \lor \neg x_4) \land (x_1 \lor x_2 \lor x_3)\).

      c This line is a comment.
      p cnf 4 3
      1 2 -4 0
      3 -4 0
      1 2 3 0
      

      The input length of a \(\THREESAT\) instance with \(n\) variables and \(m\) clauses is \(O(m \log n)\).

    • Applications

      • Industrial optimization
      • Manufacturing planning
      • Circuit synthesis
      • Software verification
      • Air-traffic control
      • Automated proofs
    • \(k\)-\(\SAT\) problem

    • \(\TWOSAT\) is easy.

      The reason is that we can think of a clause \(\ell_i \lor \ell_j\), where \(\ell_i, \ell_j\) are literals, as the implication \(\neg \ell_i \rightarrow \ell_j\) (and, symmetrically, \(\neg \ell_j \rightarrow \ell_i\)). The problem then reduces to a simple reachability problem on a directed graph with \(2n\) vertices; a small code sketch appears at the end of this subsection.

    • \(\THREESAT\) is hard to solve. The best-known algorithm runs in time roughly \({1.3}^n\), by Paturi, Pudlak, Saks and Zane.

    • Note: SAT solvers are intensively studied and commercial software can sometimes solve instances of up to 1M variables.

      \(\SAT\) solvers were used to find the proof of the Boolean Pythagorean Triples Theorem (Heule, Kullmann, and Marek, 2016).

      1. The set \(\{1, 2, \ldots, 7824\}\) can be partitioned into two parts so that both do not contain \(a,b,c\) such that \(a^2 + b^2 = c^2\).
      2. The same claim for \(\{1, 2, \ldots, 7825\}\) is false.

      200 terabyte proof later compressed to 68 gigabytes!

      Watch Terence Tao talk on machine-assisted proofs.
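Returning to the \(\TWOSAT\) bullet above, here is a small sketch (deliberately inefficient, for clarity) of the implication-graph test: a clause \((a \lor b)\) contributes edges \(\neg a \to b\) and \(\neg b \to a\), and the formula is unsatisfiable exactly when some \(x\) and \(\neg x\) can reach each other.

from collections import defaultdict

def two_sat(n, clauses):
    """clauses: pairs of nonzero ints; literal +i means x_i and -i means ¬x_i."""
    g = defaultdict(list)
    for a, b in clauses:
        g[-a].append(b)
        g[-b].append(a)

    def reaches(u, v):                      # simple DFS reachability
        seen, stack = {u}, [u]
        while stack:
            w = stack.pop()
            if w == v:
                return True
            for x in g[w]:
                if x not in seen:
                    seen.add(x)
                    stack.append(x)
        return False

    return all(not (reaches(i, -i) and reaches(-i, i)) for i in range(1, n + 1))

print(two_sat(2, [(1, 2), (-1, 2), (-2, 1), (-1, -2)]))   # False: unsatisfiable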

  2. TQBF

    • The \(\SAT\) problem can be written as \(\exists x_1, x_2, \ldots, x_n\, \phi(x_1, x_2, \ldots, x_n)\).

    • A generalization of it is the \(\TQBF\) (Totally Quantified Boolean Formula) problem where there are alternating \(\exists\) and \(\forall\) quantifiers before \(\phi\).

      \(\exists x_1 \forall x_2 \exists x_3 \cdots Q x_n\, \phi(x_1, x_2, \ldots, x_n)\) where \(Q\) is either \(\exists\) or \(\forall\).

    • It is a harder problem than \(\SAT\).

Mathematics Problems

  • Prime, Multiplication, and Factoring

    \(\PRIME\) is in \(\P\)! Agrawal, Kayal, and Saxena 2002.

  • Linear equation vs. quadratic equation

  • Determinant vs. permanent

    For square matrix \(A\), the determinant is defined as

    \[\begin{equation*} \det(A) = \sum_{\pi \in S_n} \sign(\pi) \prod_{i \in [n]} A_{i,\pi(i)}. \end{equation*} \]

    Even though the above expression is a summation of \(n!\) terms, there is an efficient algorithm that computes \(\det(A)\) given \(A\) as input (e.g., Gaussian elimination).

    The permanent is defined similarly, but without the sign function. For a square matrix \(A\), the permanent is defined as

    \[\begin{equation*} \perm(A) = \sum_{\pi \in S_n} \prod_{i \in [n]} A_{i,\pi(i)}. \end{equation*} \]

    The permanent of a matrix is a natural quantity, and has been studied in several contexts including combinatorics and graph theory. It also arises in physics where it can be used to describe the quantum state of multiple Boson particles.

    No efficient algorithms are known!

    In Geometric Complexity Theory, proving that computing the permanent cannot be efficiently reduced to computing determinants is considered to be a major milestone for the program.

Class NP

  • Many problems are hard.

    • \(\HAMPATH\)
    • \(\MAXCUT\)
    • \(\THREESAT\)
    • \(\cdots\)
  • Why have we been unsuccessful in finding polynomial-time algorithms for these problems?

    We don't know.

  • We would like some understanding on the reason for this behavior.

    But whether this “cliff” between the easy and hard problem is a real phenomenon or a reflection of our ignorance is still an open question.

  • One remarkable discovery concerning this question shows that the complexities of many problems are linked.

    This discovery is known as \(\NP\)-completeness theory.

Efficient Verifiability

  • The \(\HAMPATH\) problem.

    Recall that a Hamiltonian path between \(s\) and \(t\) is a path from \(s\) to \(t\) that goes through every vertex of the graph exactly once.

  • The \(\HAMPATH\) problem has a feature called polynomial verifiability which is important for understanding its complexity.

    Even though we do not know how to solve the \(\HAMPATH\) problem efficiently, it is (obviously) easy to verify whether a given path is Hamiltonian!

    How about the complement \(\overline{\HAMPATH}\) of \(\HAMPATH\)? That is, can you find a certificate that a graph does not have a Hamiltonian path?

  • The \(\SAT\) problem has a similar situation.

    Easy to verify, hard to solve.

  • Another polynomial-verifiable problem is the complement of \(\PRIME\) called \(\COMPOSITE\), deciding whether there are numbers \(p,q > 1\) such that \(n = pq\) for the given number \(n\).

    How about \(\PRIME\)? Can you certify that a number is prime without resorting to AKS02? (Homework)

  • Proof system and computation (efficient version)

    A verifier for problem \(A\) is an algorithm \(V\), where \(A = \{x \mid V \text{ accepts } \pair{x}{w} \text{ for some string } w\}\).

    Problem \(A\) is polynomial-time verifiable if it has a polynomial-time verifier, whose time complexity is measured in terms of the length of input \(x\).

  • The string \(w\) is the witness or proof that \(x\) is in \(A\).

    The proof \(w\) has length at most polynomial in \(n = \abs{x}\), as otherwise the polynomial-time verifier would not have time to read the whole proof.

  • \(\NP\) is a notion of easiness.

    Informally, \(\NP\) is the class of problems that have polynomial-time verifiers.
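To make the definition concrete, here is a sketch of a polynomial-time verifier for \(\HAMPATH\) (the graph is given as an adjacency dictionary; the witness \(w\) is the claimed path):

def verify_hampath(G, s, t, path):
    """G maps each vertex to a set of out-neighbors; accept iff path is a Hamiltonian s-t path."""
    n = len(G)
    return (len(path) == n and len(set(path)) == n        # every vertex exactly once
            and path[0] == s and path[-1] == t
            and all(v in G[u] for u, v in zip(path, path[1:])))   # consecutive edges exist

G = {1: {2}, 2: {3}, 3: {1, 4}, 4: set()}
print(verify_hampath(G, 1, 4, [1, 2, 3, 4]))   # True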

Nondeterministic Polynomial-Time

  • Why is it called \(\NP\)?

    \(\NP\) stands for nondeterministic polynomial time and is derived from an alternative characterization of the class \(\NP\) using nondeterministic TMs.

    The nondeterministic time class \(\NTIME(t(n))\) is the collection of problems that can be decided by an \(O(t(n))\) time nondeterministic Turing machine.

  • Example: an NTM \(H\) for \(\HAMPATH\).

    \(H\) = “On input \(\tuple{G, s, t}\), where \(G\) is a directed graph and \(s,t\) are nodes in \(G\):

    1. Nondeterministically write down a list of numbers \(u_1, u_2, \ldots, u_n\), where \(n\) is the number of vertices in \(G\).
    2. Reject if there are repetitions in the list.
    3. Reject if the conditions \(s = u_1\) and \(t = u_n\) do not hold.
    4. Reject if there is an \(i\) between \(1\) and \(n − 1\) such that \((u_i, u_{i+1})\) is not an edge of \(G\).
    5. Accept.”
  • Equivalence

    A problem is in \(\NP\) if and only if it is decided by some nondeterministic polynomial-time Turing machine. That is,

    \[\begin{equation*} \NP = \bigcup_{c \ge 0} \NTIME (n^c). \end{equation*} \]

Problems in NP

  • \(\NP\) is a large class of important problems we care about.

    Reducibility Among Combinatorial Problems, by Richard Karp 1972, considered 21 natural combinatorial problems in \(\NP\).

    Computers and Intractability: A Guide to the Theory of NP-Completeness, a textbook by Michael Garey and David S. Johnson published in 1979, listed more than 300 NP-hard computational problems.

  • Example problems in \(\NP\) but not known in \(\P\):

    \(\LONGPATH\), \(\HAMPATH\), \(\THREESAT\), \(\MAXCUT\), \(\TSP\), \(\FACTOR\)

  • Problems we haven't mentioned

    1. The \(\VC\) problem

    2. The \(\CLIQUE\) problem

      A \(k\)-clique of graph \(G\) is a complete subgraph of \(G\) with \(k\) vertices.

      \[\begin{equation*} \CLIQUE = \{ (G, k) \mid G \text{ has a } k\text{-clique} \}. \end{equation*} \]
    3. The \(k\)-\(\COL\) problem

      A graph \(G\) is \(k\)-colorable if it is possible to color the vertices of \(G\) with \(k\) colors so that the two endpoints of every edge receive different colors.

    4. Graph isomorphism problem \(\GI\)

P, NP, coNP

  • Relation with \(\P\):

    • \(\P \subseteq \NP \subseteq \EXP\)

    • \(\P\) vs. \(\NP\) problem

    Is verifying the proof much easier than coming up with the proof?

    It is one of the seven Millennium Prize Problems.

    Most researchers believe \(\P \ne \NP\).

  • Class \(\coNP\)

    Problems where no-instances have certificates. That is, if \(x \not\in A\), there is a witness \(y\) of this fact.

    A decision problem \(A\) is in \(\coNP\) if there is a polynomial-time verifier \(V\) such that \(x \in A\) if and only if \(\forall y, V(x, y) = 1\).

    \(\coNP = \{A \mid \overline{A} \in \NP\}\).

    Example: Given a CNF formula \(\varphi\), is it a tautology?

  • \(\NP\) vs \(\coNP\).

    Draw the picture.

    The problem \(\FACTOR\) asking whether \(n\) has a factor less than \(k\) is in \(\NP \cap \coNP\).

    To show it is in \(\coNP\), it suffices to present the prime number factorization of \(n\).

Lecture 7: NP Completeness

NP-Completeness and Cook-Levin Theorem

One of the most beautiful observations of computational complexity theory is that some problems in \(\NP\) are the hardest among all problems in \(\NP\). If you can solve one of them in polynomial time, you can solve all problems in \(\NP\) in polynomial time.

Importance

  1. On the theoretical side, to resolve the \(\P\) vs. \(\NP\) conjecture, it now suffices to focus on one particular problem.

  2. On the practical side, prevent wasting time searching for a polynomial time algorithm to solve a particular problem known to be \(\NPC\).

NP-Completeness

A problem \(B\) is \(\NPC\) if two conditions hold

  1. \(B\) is in \(\NP\).
  2. Every \(A \in \NP\) is polynomial-time reducible to \(B\).

We say \(B\) is \(\NP\)-hard if only the second condition is required to hold (the containment \(B \in \NP\) may fail or is simply not checked).

  • Polynomial-time reducibility

    Problem \(A\) is polynomial-time (mapping) reducible to problem \(B\), written \(A \le_P B\), if there is a polynomial-time computable function \(f: \Sigma^* \to \Sigma^*\) such that \(\forall x\, [x \in A \leftrightarrow f(x) \in B]\).

    The function \(f\) is called the polynomial-time reduction of \(A\) to \(B\).

  • It is called polynomial-time many–one reduction (or Karp reduction) in some other textbooks.

    If \(A \le_P B\) and \(B \in \P\), then \(A \in \P\).

Cook-Levin Theorem

\(\SAT\) is \(\NP\)-complete.

We already know from the previous lecture that \(\SAT \in \NP\). So it suffices to prove that every \(A \in \NP\) reduces to \(\SAT\).

Let \(A\) be a problem in \(\NP\). We present a polynomial-time reduction from \(A\) to \(\SAT\) as follows.

By definition, there is a verifier \(V\) for \(A\) such that

  1. \(V\) is a TM that takes input \(\tuple{x,y}\) and runs in time \(\poly(\abs{x})\), and
  2. There exists \(y\) such that \(V\) accepts \(\tuple{x,y}\) if and only if \(x \in A\).

We consider the sequence of configurations of machine \(V\) on input \(\tuple{x,y}\).

Suppose an upper bound of the running time of machine \(V\) is \(\ell = n^c\) for large \(n\). We can truncate the configurations and represent them as strings in \(\Gamma^\ell\) where \(\Gamma = (Q \times \Sigma) \cup \Sigma\).

For simplicity, assume that the size of \(\Gamma\) is \(2^k\) for some integer \(k\), and a configuration for each time step is encoded as a binary string in \(\{0,1\}^{\ell k}\).

We construct a CNF formula \(\varphi_x\) as follows. Similar to the proof that \(\lambda\)-calculus is Turing-complete, the computation of \(\gamma^{(t+1)}_{j}\) from \(\gamma^{(t)}_{j-1}, \gamma^{(t)}_j, \gamma^{(t)}_{j+1}\) is given by a fixed function \(F_V\). The input of \(F_V\) is \(\{0,1\}^{3k}\), and the output is \(\{0,1\}^{k}\).

Every Boolean function \(f:\{0,1\}^k \to \{0,1\}\) can be written as a CNF formula of size \(O(k 2^k)\) by excluding all possible inputs \(x\) such that \(f(x) = 0\). So the condition that \(\gamma^{(t+1)}_{j} = F_V \bigl( \gamma^{(t)}_{j-1}, \gamma^{(t)}_j, \gamma^{(t)}_{j+1} \bigr)\) can be expressed as \(k\) such \(F_V\) formulas \(\varphi^{(t)}_j = \bigwedge_{i=1}^k \varphi^{(t)}_{j,i}\).

Finally, the CNF formula we use consists of three parts:

  1. A CNF formula \(\varphi_{\text{init}}\) that checks that the input is \(x\).
  2. The \(F_V\) formulas \(\varphi^{(t)}_{j}\) checking the propagation of the computation is correct.
  3. A CNF formula \(\varphi_{\text{final}}\) that checks that the configuration \(\gamma^{(\ell)}\) is accepting.

It has the form \(\varphi_x = \varphi_{\text{init}} \land \Bigl( \bigwedge_{t,j} \varphi^{(t)}_{j} \Bigr) \land \varphi_{\text{final}}\).

The variables in the formula are all the bits representing the configurations.

It is easy to see that \(x \in A\) if and only if \(V\) accepts \(x\) for some \(y\), if and only if \(\varphi_x \in \SAT\).
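The clause-level encoding used above, writing an arbitrary function \(f:\{0,1\}^k \to \{0,1\}\) as a CNF by excluding every falsifying assignment, is easy to sketch in code; the literal convention \(+i/-i\) for \(x_i/\neg x_i\) is just for illustration.

from itertools import product

def to_cnf(f, k):
    """Return a list of clauses (literal +i means x_i, -i means ¬x_i) equivalent to f."""
    cnf = []
    for a in product([0, 1], repeat=k):
        if f(*a) == 0:
            # this clause is falsified exactly by the assignment a
            cnf.append([-(i + 1) if bit else (i + 1) for i, bit in enumerate(a)])
    return cnf

maj = lambda x, y, z: int(x + y + z >= 2)      # example: majority of three bits
print(to_cnf(maj, 3))
# [[1, 2, 3], [1, 2, -3], [1, -2, 3], [-1, 2, 3]]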

Variants of Satisfiability

  • Many variants of \(\SAT\) are \(\NPC\).

    \(\THREESAT\) is \(\NPC\).

    Easy.

    \(\problem{MAX2SAT}\) and \(\NAESAT\) are \(\NPC\).

    \(\NAESAT\) (not-all-equal SAT) is a SAT problem with the additional requirement that no clause is allowed to have all literals assigned True.

    See Schaefer's dichotomy theorem.

NP-Complete Reductions

Now that we know \(\THREESAT\) is \(\NPC\), we can prove that more problems are \(\NPC\) by reducing \(\THREESAT\) to them.

Vertex Cover

\(\VC\) is \(\NPC\).

We prove that \(\THREESAT \le_{P} \VC\).

  • Local reductions

    Play with small graphs.

    Consider the vertex cover of an edge graph and a triangle graph.

    We need to encode the True/False values of the variables in the CNF formula, i.e., the nondeterministic choices of the values of the variables.

  • Construction

    Given a \(\THREESAT\) formula \(\varphi = \bigwedge_{j=1}^m C_j\) of \(n\) variables, where \(C_j = \ell_{j,1} \lor \ell_{j,2} \lor \ell_{j,3}\), we want to construct an instance of the \(\VC\) problem.

    For each variable \(x_i\), we introduce two variable vertices labeled by \(x_i\) and \(\neg x_i\) and connect them with an edge.

    For each clause \(C_j\), we introduce three clause vertices labeled by \(\ell_{j,1}\), \(\ell_{j,2}\), and \(\ell_{j,3}\), forming a triangle.

    Finally, we connect each clause vertex \(\ell_{j,r}\) to the corresponding variable vertex. For example, if \(\ell_{j,r} = \neg x_i\), \(\ell_{j,r}\) is connected to variable vertex \(\neg x_i\).

    The resulting graph \(G\) has \(3m + 2n\) vertices, and we ask if there is a vertex cover of size \(2m + n\).

  • Proof

    1. If \(\varphi\) is satisfiable, then \(G\) has a vertex cover of size \(2m + n\).

    2. If \(G\) has a vertex cover of size \(2m + n\), there is a satisfying assignment for \(\varphi\).
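The construction above, written as code (a sketch; vertices are tagged tuples, and clauses are triples of nonzero ints with \(+i/-i\) for \(x_i/\neg x_i\)):

def threesat_to_vc(clauses, n):
    """Return (edges, k): the graph has a vertex cover of size k iff the 3SAT instance is satisfiable."""
    edges = []
    for i in range(1, n + 1):                          # variable gadget: edge (x_i, ¬x_i)
        edges.append((("v", i), ("v", -i)))
    for j, (a, b, c) in enumerate(clauses):            # clause gadget: a triangle
        ca, cb, cc = ("c", j, 1), ("c", j, 2), ("c", j, 3)
        edges += [(ca, cb), (cb, cc), (cc, ca)]
        for lit, node in ((a, ca), (b, cb), (c, cc)):  # connect to the matching variable vertex
            edges.append((node, ("v", lit)))
    return edges, 2 * len(clauses) + n

edges, k = threesat_to_vc([(1, 2, -4), (3, -4, 1), (1, 2, 3)], 4)
print(len(edges), k)    # 22 10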

Independent Set and Clique

  • An independent set of a graph \(G\) is a set of vertices forming a subgraph of \(G\) with no edges.

    It is easy to prove that \(\VC \le_P \INDSET\) as \(S\) is a vertex cover if and only if \(V\setminus S\) is an independent set.

    The reduction is therefore simply \(\tuple{G, k}\) to \(\tuple{G, n -k}\) where \(n\) is the number of vertices in \(G\).

  • \(S\) is an independent set of \(G\) if and only if \(S\) is a clique of complement graph \(G^c\).

    \(\INDSET \le_P \CLIQUE\).

3-Coloring

\(\THREECOL\) is \(\NPC\).

Consider the OR gadget. Details left as homework.

Hamiltonian Cycle

\(\HAMCYCLE\) is \(\NPC\).

  • It is easy to see that \(\HAMCYCLE\) is in \(\NP\).

  • We prove the theorem by reducing \(\VC\) to \(\HAMCYCLE\).

  • Let \(\tuple{G, k}\) be an instance of \(\VC\). We will define a graph \(H\) such that \(G\) has a vertex cover of size \(k\) if and only if \(H\) has a Hamiltonian cycle.

  • Construction of \(H\)

    Play with the Hamiltonian cycle problem on simple graphs.

    If there is a vertex \(v\) of degree two, the cycle must visit \(v\) after visiting one of its neighbors.

    For a vertex of degree larger than two, we need non-deterministic choices of the outgoing edge.

    Consider the following gadget graph, there are three ways to go through all the vertices in the gadget.

    Place a copy of the gadget on every edge \(e\) in \(G\). We call this copy \(H_e\), with vertices \(a\) and \(b\) forming the left side and \(c\) and \(d\) the right side. If \(e = \{u, v\}\), the left side is close to \(u\) and corresponds to vertex \(u \in V(G)\). Similarly, the right side corresponds to vertex \(v\) in \(G\).

    Use the first way of traversing (I) the gadget \(H_e\) to represent choosing both vertices on the edge in the vertex cover problem. Use the second and third (II and III) to represent choosing one of the vertices in the vertex cover.

    Connect the sides of the gadgets \(H_e\) that correspond to the same vertex \(u\) in \(G\) in an arbitrary order. Each vertex in \(G\) now corresponds to several linked gadgets with two open edges around the vertex.

    Give an example.

    Introduce \(k\) vertices \(v_1, \ldots, v_k\) in \(H\) that each connects to all the open edges.

    This is the construction of \(H\), which is obviously in polynomial time.

  • Completeness

    If \(G\) has a vertex cover of size \(k\), we show that \(H\) has a Hamiltonian cycle.

    Start with \(v_1\), go to the first set of gadgets corresponding to a marked vertex in the cover, and traverse the gadgets depending on whether the other vertex in the edge is marked or not. If the other vertex is marked, use traversal (I). Otherwise, use (II) or (III). Go to \(v_2\) and the second set of gadgets of another marked vertex, and so on.

    Go back to \(v_1\) in the end.

  • Soundness

    If \(H\) has a Hamiltonian cycle, there is a vertex cover of size \(k\) for \(G\).

    All Hamiltonian cycles go through \(v_1, \ldots, v_k\) exactly once.

    Between the visits to \(v_j\) and \(v_{j+1}\), the cycle can go around at most one vertex in \(G\).

    This gives us a set of at most \(k\) vertices of \(G\), which we claim to be a valid vertex cover of \(G\): otherwise, if there were an uncovered edge \(e\) in \(G\), the vertices in \(H_e\) would not be visited by the cycle.

  • It is easy to reduce \(\HAMPATH\) to \(\HAMCYCLE\) and vice versa.

Max Cut

\(\MAXCUT\) is \(\NPC\).

  • \(\MAXCUT \in \NP\) is easy.

  • We reduce \(\NAESAT\) to \(\MAXCUT\).

    Choosing the starting point could simplify the reduction.

    It is an art, with no general recipe.

  • Reduction

    We use the fact that a triangle has a max-cut of size \(2\).

    Let \(C = \bigwedge_{j=1}^m C_j\) be an instance of \(\NAESAT\) with variables \(x_1, x_2, \ldots, x_n\).

    The graph \(G\) has \(2n\) vertices labeled by \(x_i\) and \(\neg x_i\).

    For each clause \(C_j = \ell_{j,1} \lor \ell_{j,2} \lor \ell_{j,3}\), form a triangle on the vertices corresponding to \(\ell_{j,1}\), \(\ell_{j,2}\), and \(\ell_{j,3}\).

    We want to use the cut to define the true assignment of variables \(x_i\), so it is ideal that \(x_i\) and \(\neg x_i\) are on different sides of the max-cut. To do so, we add \(n_i\) copies of the edges \((x_i, \neg x_i)\) where \(n_i\) is the number of times \(x_i\) or \(\neg x_i\) occurs in the instance \(C\).

    Set the threshold \(k = 5m\).

    Give an example.

    A cut of size \(5m\) must take \(3m\) edges from the variable edges between \(x_i\) and \(\neg x_i\) (there are \(\sum_i n_i = 3m\) of them), and the remaining \(2m\) must come from the triangles. So every triangle is cut with exactly two edges, meaning that the literals in each clause are neither all True nor all False.

  • Completeness and soundness

    If \(C\) is NAE-satisfiable, define the cut by putting the literals assigned True on one side and those assigned False on the other.

    If \(G\) has a cut of size at least \(k\), then the \(\NAESAT\) problem has a satisfying assignment.

Summary

  • An \(\NP\) reduction consists of four parts:

    1. Description of the reduction \(R\).
    2. Efficiency of the reduction.
    3. Completeness: if we are reducing \(A\) to \(B\) and \(x \in A\), then \(R(x) \in B\).
    4. Soundness: If \(x \not\in A\), then \(R(x) \not\in B\).

NP Intermediate

If \(\P \ne \NP\), there are problems of intermediate hardness between \(\P\) and \(\NP\)-complete.

Ladner's Theorem

There are two different proofs for this theorem. The one we will present is due to Russell Impagliazzo.

  • Idea: Apply padding to make the \(\SAT\) problem easier.

    For computable function \(Z: \natural \to \natural\) consider the padded \(\SAT\) problem \(\SAT_Z\), where

    \[\begin{equation*} \SAT_Z = \bigl\{ \varphi 1 0^{n^{Z(n)}} \mid \varphi \in \SAT \text{ and } \abs{\varphi} = n \bigr\}. \end{equation*} \]

    \(Z\) needs to be a slowly growing function so that we can argue three things:

    1. \(\SAT_Z\) is in \(\NP\).
    2. \(\SAT_Z\) is not in \(\P\) unless \(\P = \NP\).
    3. \(\SAT_Z\) is not \(\NPC\).
  • Definition of \(Z\)

    Enumerate all TMs \(M_1, M_2, \ldots\). Without loss of generality, assume that \(M_i\) has running time at most \(n^i\) and each machine has infinitely many encodings appearing in this sequence.

    Define \(Z(1) = 1\). Suppose \(Z(n-1) = i\). If \(M_i\) correctly decides membership in \(\SAT_Z\) for all \(x\) of length at most \(\log n\), define \(Z(n) = i\). Otherwise, define \(Z(n) = \min \{i+1, \log\log n\}\).

    Note that we require \(Z(n)\) to grow slower than \(\log\log n\) so that the time complexity of \(M_i\) is at most polynomial in \(n\).

  • Proofs

    Claim 1. Function \(Z(n)\) and \(n^{Z(n)}\) are computable in time polynomial in \(n\). Hence, \(\SAT_Z\) is in \(\NP\).

    Note that \(Z(n)\) is recursively defined. Solve the recurrence relation.

    Claim 2. \(Z\) is bounded if and only if \(\SAT_Z \in \P\).

    If \(\SAT_Z \in \P\), then there will be a fixed TM \(M_i\) solving the problem. This would imply that \(Z(n)\) is constant for large \(n\) by the definition of \(Z\).

    Consider the case when \(Z\) is constant, say \(Z(n)=i\), for large \(n\). In this situation, we claim that \(M_i\) solves the problem \(\SAT_Z\) in time \(n^i\). Otherwise, if there is an input \(x\), which \(M_i\) does not solve correctly, then for \(n > \max \bigl\{ 2^{\abs{x}}, 2^{2^{i+1}} \bigr\}\), we would have \(Z(n) \ne i\), a contradiction.

    We conclude that \(Z\) is unbounded and hence \(\SAT_Z\) is not in \(\P\): otherwise the padding would be polynomial and \(\SAT_Z\) would still be \(\NPC\), which together with \(\SAT_Z \in \P\) gives \(\P = \NP\), contradicting our assumption that \(\P \ne \NP\).

    Claim 3. \(\SAT_Z\) is not \(\NPC\). Suppose it is, then there is a polynomial time reduction \(R\) such that \(\phi \in \SAT \leftrightarrow R(\phi) \in \SAT_Z\).

    Since \(Z\) is unbounded, the padding is super-polynomial, so the reduction maps a \(\SAT\) instance \(\phi\) to a \(\SAT\) instance \(\varphi\) of asymptotically smaller length! Applying the reduction repeatedly shrinks the instance down to constant size, which would put \(\SAT\) in \(\P\), again contradicting \(\P \ne \NP\).

Lecture 8: What If P = NP?

“The evidence in favor of [P \(\ne\) NP] and [ its algebraic counterpart ] is so overwhelming, and the consequences of their failure are so grotesque, that their status may perhaps be compared to that of physical laws rather than that of ordinary mathematical conjectures.” — Volker Strassen, laudation for Leslie Valiant, 1986.

What if P = NP

  • It is said that the \(\P\) vs. \(\NP\) problem is the great open question of computer science.

    But why is it so important?

  • What if \(\P = \NP\) but the best algorithm for \(\THREESAT\) takes \(n^{1000}\) time?

    This is a valid possibility, and a few leading TCS figures, like D. Knuth, took the possibility \(\P = \NP\) more seriously than others.

  • We will consider two extreme scenarios:

    1. \(\THREESAT\) is super easy: it has an \(O(n)\) or \(O(n^2)\) time algorithm with reasonable constant hiding in the big-O notation (say \(\le 10^6\).)
    2. \(\THREESAT\) is very hard: it is exponentially hard and cannot be solved faster than \(2^{\eps n}\) for some constant \(\eps > 0\) not too small (say \(\ge 10^{−6}\)).
  • The fastest known algorithm for \(\THREESAT\) has a time complexity of \(2^{0.38n}\) for formulas of \(n\) variables.

  • What if \(\P = \NP\) and the algorithm is practical?

Search-To-Decision Reductions

  • The ability to answer ‘yes/no’ questions of \(\NP\) problems leads to the ability to search for the solutions to the problem.

  • \(\THREESAT\) example

    Suppose algorithm \(D\) decides whether a formula \(\varphi(x_1, x_2, \ldots, x_n)\) is satisfiable or not.

    Consider algorithm \(S\) = “On input \(\varphi\)

    1. For \(i = 1, 2, \cdots, n\) do:
      1. Use \(D\) to test whether \(\varphi(a_1, \ldots, a_{i-1}, 0, x_{i+1}, \ldots, x_n)\) is satisfiable or not.
      2. If so, set \(a_i = 0\). Otherwise, set \(a_i = 1\).
    2. Check if \(a = (a_1, a_2, \ldots, a_n)\) is a correct answer. Output \(a\) if it is.”
  • The general case is similar.

  • If \(\P = \NP\), then for every decision problem \(A \in \NP\) with verifier \(V\), there is polynomial-time algorithm \(\text{Search}_V\) for its corresponding search problem \(A_{\text{search}}\), whose task is to find a witness \(y\) such that \(V(x,y) = 1\) given \(x\) as input.

    For example, you can find the cut as a solution to the \(\MAXCUT\) problem, the Hamiltonian path of a graph etc., all in polynomial time.
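A sketch of the search-to-decision loop for \(\THREESAT\) described above; decide is any decision oracle for satisfiability under a partial assignment (here instantiated, purely for demonstration, by brute force, but with \(\P = \NP\) it would be a fast algorithm).

from itertools import product

def search(clauses, n, decide):
    """Return a satisfying assignment {i: 0/1} or None, using only yes/no queries."""
    a = {}
    if not decide(clauses, a):
        return None
    for i in range(1, n + 1):
        a[i] = 0
        if not decide(clauses, a):   # not satisfiable with x_i = 0 under the current prefix
            a[i] = 1                 # ... so some satisfying assignment has x_i = 1
    return a

def brute_decide(clauses, partial):  # stand-in oracle, exponential time
    n = max(abs(l) for c in clauses for l in c)
    free = [i for i in range(1, n + 1) if i not in partial]
    for bits in product([0, 1], repeat=len(free)):
        a = {**partial, **dict(zip(free, bits))}
        if all(any((a[abs(l)] == 1) == (l > 0) for l in c) for c in clauses):
            return True
    return False

phi = [(1, 2, -3), (-1, 2, 3), (-2, -3, 1)]
print(search(phi, 3, brute_decide))  # {1: 0, 2: 0, 3: 0}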

Optimization

  • The above trick does not guarantee the solutions to the \(\NP\) problems are optimal.

    For example, how do we find the longest path, instead of finding a path whose length is above a certain threshold \(k\), in a given graph?

  • In general, you can use binary search to solve all \(\NP\) problems with an optimization flavor like the \(\LONGPATH\) problem.

  • Linear programming vs. integer programming

    We have mentioned that LP problems have polynomial-time algorithms.

    Yet, if one requires that the variables take \(0,1\) values instead of real values, the problem is \(\NPC\).

    So if \(\P = \NP\), we would have polynomial-time algorithms for integer programming problems too. Such algorithms would have many important applications in machine learning, engineering, and operations research.

  • If \(\P = \NP\), one can efficiently search through an exponential space and find not just the “local” mountain but the tallest peak.

  1. Supervised Learning

    In supervised learning, one is given sample-label pairs \((x_i, y_i)\) for \(i=1, 2, \ldots, m\), \(x_i \in \{0,1\}^n\) and \(y_i \in \{0,1\}\).

    The goal is to find a hypothesis \(h:\{0,1\}^n \to \{0,1\}\) such that when we have a new input \(x\), we may use \(y = h(x)\) to predict its label.

    The idea in supervised learning is to follow Occam's razor: the simplest hypothesis that explains the data is likely to be correct.

    One popular approach to model this is to pick some fairly simple function \(H:\{0,1\}^{k+n} \to \{0,1\}\) and think of the first \(k\) bits of \(H\)'s input as parameters.

    For example, we can think of the first \(k\) inputs of \(H\) as the weights and connections of some neural network that will be applied to the remaining \(n\) inputs.

    In a natural approach called empirical risk minimization, one tries to find parameters \(\theta = (\theta_1, \ldots, \theta_k) \in \{0,1\}^k\) that minimize the number of errors made by \(x \mapsto H(\theta, x)\). That is, the problem is to minimize over the parameter space of \(\theta\) for the loss function

    \[\begin{equation*} \text{LOSS}(\theta) = \sum_{i=1}^m \abs{H(\theta, x_i) - y_i}. \end{equation*} \]

    Hence if \(\P = \NP\), we can solve the optimization for supervised learning in great generality.

    Training could then be done as easily as inference.
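    As a toy illustration (not part of the lecture), here is a Python sketch of empirical risk minimization by exhaustive search over the \(2^k\) parameter strings; the hypothesis class `H` (a parity of selected input bits) is an ad hoc choice.

    from itertools import product

    def H(theta, x):
        # Hypothetical hypothesis class: theta selects which input bits are XORed.
        return sum(t & xi for t, xi in zip(theta, x)) % 2

    def erm(samples, k):
        # Exhaustive empirical risk minimization over all 2^k parameter strings.
        # (The point of P = NP would be doing this search in polynomial time.)
        best_theta, best_loss = None, None
        for theta in product([0, 1], repeat=k):
            loss = sum(abs(H(theta, x) - y) for x, y in samples)
            if best_loss is None or loss < best_loss:
                best_theta, best_loss = theta, loss
        return best_theta, best_loss

    samples = [((1, 0, 1), 0), ((1, 1, 0), 1), ((0, 1, 1), 1), ((0, 0, 0), 0)]
    print(erm(samples, k=3))   # a zero-loss parameter setting exists here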

  2. Implications to Cryptography

    If \(\P = \NP\), most protocols in cryptography would become insecure!

    This is similar to the supervised learning problem: seeing enough encrypted data would allow one to recover the secret key.

Find Proofs for Mathematical Problems

  • Recall that a proof system is defined by a TM \(V\) that checks whether \(w\) is a proof of \(x\).

  • The proof checking procedure is usually very efficient, finishing the checking in \(O(n)\) or at most \(O(n^2)\) time.

    The existence of an \(n\)-page proof for a statement \(x\) implies that there exists a string \(w\) of length \(O(n)\) encoding the proof.

  • Thus, if \(\P = \NP\), then despite Gödel’s incompleteness theorems, we can still automate mathematics since a computer can find proofs for us.

  • Consider the \(\SHORTPROOF\) problem whose task is to answer yes if statement \(x\) has a proof \(w\) of length at most \(m\) when given \(\tuple{x, 1^m}\) as input.

  • Frankly speaking, if the shortest proof for some statement requires a terabyte, then human mathematicians will never find this proof either.

    For this reason, Gödel felt that whether \(\SHORTPROOF\) has a polynomial time algorithm is of great interest. He wrote in a letter to John von Neumann in 1956 (note this is before the concept of \(\NP\) or even “polynomial time” was formally defined):

    “One can obviously easily construct a Turing machine, which for every formula \(F\) in first order predicate logic and every natural number \(n\), allows one to decide if there is a proof of \(F\) of length \(n\) (length = number of symbols). Let \(\psi(F, n)\) be the number of steps the machine requires for this and let \(\varphi(n) = \max_F \psi(F,n)\). The question is how fast \(\varphi(n)\) grows for an optimal machine. One can show that \(\varphi(n) \ge k \cdot n\) [for some constant \(k > 0\)]. If there really were a machine with \(\varphi(n) \sim k \cdot n\) (or even \(\sim k n^2\)), this would have consequences of the greatest importance. Namely, it would obviously mean that in spite of the undecidability of the Entscheidungsproblem, the mental work of a mathematician concerning Yes-or-No questions could be completely replaced by a machine. After all, one would simply have to choose the natural number \(n\) so large that when the machine does not deliver a result, it makes no sense to think more about the problem.” — Kurt Gödel

    For many reasonable proof systems, \(\SHORTPROOF\) is \(\NPC\), and if \(\P = \NP\), we can decide whether a mathematical statement \(x\) has a short proof in polynomial time.

    Gödel can be considered the first person to formulate the \(\P\) vs. \(\NP\) question. Unfortunately, the letter was only discovered in 1988.

    Gödel’s Lost Letter and P=NP, a blog by R. J. Lipton.

Quantifier Elimination

Previous discussions were based on solving problems in \(\NP\) under the assumption \(\P = \NP\). Interestingly, we may even solve problems not known to be in \(\NP\)!

  • Problem \(A\) in \(\NP\) can be characterized as follows.

    Deciding if \(x \in A\) is equivalent to deciding if the following statement is true.

    \[\begin{equation*} \exists y\, [ V(x, y) = 1 ]. \end{equation*} \]
  • Let us consider more general formulas, for example, whether the following statement is true.

    \[\begin{equation*} \exists y\in \{0,1\}^{p(\abs{x})}\, \forall z \in \{0,1\}^{q(\abs{x})}\, [ V(x, y, z) = 1 ]. \end{equation*} \]

    Note there is one more layer of alternating quantification.

    More generally, we consider formulas with a constant number of alternations.

    Problems characterized by such formulas form a complexity class known as the polynomial hierarchy, \(\PH\).

  • Examples

    Consider the following problem related to the \(\problem{CIRCUIT\text{-}MINIMIZATION}\) problem.

    Given a classical circuit \(C\), decide if there is an equivalent circuit \(C'\) that can be described by a string of length at most \(s\).

    \[\begin{equation*} \exists C'\in \{0,1\}^s\, \forall x\in \{0,1\}^n\, [C(x) = C'(x)]. \end{equation*} \]

    Other examples include modeling the winning strategy within a constant number of moves.

  • It turns out that if \(\P = \NP\) we can solve such problems efficiently as well.

    We will only consider the simple case of two quantifiers, and the general proof is a simple induction.

    To decide if

    \[\begin{equation*} \exists y\, \forall z\, [ V(x, y, z) = 1 ], \end{equation*} \]

    we consider problems \(\varphi_{x, y}\) for fixed \(y\), deciding if

    \[\begin{equation*} \forall z\, [ V(x, y, z) = 1 ]. \end{equation*} \]

    The negation \(\overline{\varphi_{x, y}}\) of \(\varphi_{x, y}\) is

    \[\begin{equation*} \exists z\, [ V(x, y, z) = 0 ], \end{equation*} \]

    which is a problem in \(\NP\); therefore, there is a polynomial-time algorithm \(S\) that solves it given \(\tuple{x,y}\) as input. That is, for all \(x,y\), \(S\) outputs \(1\) if \(\exists z\, [ V(x, y, z) = 0 ]\) (equivalently, \(\varphi_{x, y}\) is false), and \(S\) outputs \(0\) if \(\varphi_{x, y}\) is true.

    Hence, the original problem is equivalent to

    \[\begin{equation*} \exists y\, [ S(x,y) = 0 ], \end{equation*} \]

    which is again a statement in \(\NP\), and by the assumption that \(\P = \NP\), it is solvable in \(\P\).

Approximate Counting

The complexity of counting solutions and \(\SharpP\).

  • Suppose that \(C\) is a circuit of size polynomial in \(n\) with inputs \(x = x_1, \ldots, x_n\).

  • How many \(x\) satisfy \(C(x) = 1\)?

    Given the description of the circuit \(C\), what is the size of the set \(S = \{ x \mid C(x) = 1\}\)?

  • This is a problem known to be complete for the complexity class \(\SharpP\), and computing the exact answer is at least as hard as \(\NP\).

  • Stockmeyer's counting theorem

    Even if \(\P = \NP\), we would not know how to compute \(\abs{S}\) exactly, but Stockmeyer's counting theorem shows how to approximate it in that case.

Implications

  • We can solve some problems (a huge number of them) faster!

    Moreover, it is a shift of paradigm of how we think about computation.

  • A fast algorithm for \(\NP\) problems would bring about a new type of understanding.

    Compared with problems for which we do know efficient algorithms, the problems we cannot solve efficiently usually lack a “closed-form formula” or other, more constructive descriptions.

    Such new understanding would be very fruitful regardless of its computational utility.

“If P=NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it's found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffett.” — Scott Aaronson

Relativization and Limits of Diagonalization

So far, the diagonalization method has been very successful in showing many results in this course. Can we prove \(\P \ne \NP\) using the diagonalization method?

Relativization is a barrier for this approach!

Relativization

  • Computation assisted by the ability to solve certain problems for free.

  • Consider a TM with the extra ability to solve \(\SAT\) problems in a single step.

    We do not ask how this is accomplished but think of this extra ability as a black box.

  • The black box is often called an oracle.

    A polynomial-time machine with a \(\SAT\) oracle can solve any \(\NP\) problem, regardless of whether \(\P\) equals \(\NP\).

    Such a machine is said to be computing relative to the \(\SAT\) problem.

  • In general, the oracle can correspond to any problem, not just \(\SAT\).

    An oracle for problem \(A\) is a device that can answer membership questions for \(A\). An oracle TM \(M^A\) is a modified TM with the extra capability of querying the oracle \(A\). That is, \(M^A\) has an oracle tape, and in one step, \(M\) is informed whether the string on the oracle tape is in \(A\) or not. Define \(\P^A\) to be the class of problems solvable by a polynomial-time oracle Turing machine using oracle \(A\). Define \(\NP^A\) similarly, using nondeterministic polynomial-time oracle machines.

  • Example

    \[\begin{align*} \NP & \subseteq \P^{\SAT},\\ \coNP & \subseteq \P^{\SAT}. \end{align*} \]

Oracle Separation of \(\P\) and \(\NP\)

There is an oracle \(A\) such that \(\P^A \ne \NP^A\).

  • Key idea: use oracles to hide structures.

  • We need to construct (or program) an oracle \(A\) so that the claim is true.

    That is, we need a problem \(L_A\) such that \(L_A \in \NP^A\) and \(L_A \not\in \P^A\).

Consider \(L_A = \{ x \mid \exists w \in A \text{ such that } \abs{x} = \abs{w} \}\).

It is easy to see \(L_A \in \NP^A\): to verify \(x \in L_A\), the machine nondeterministically guesses \(w\) of length \(\abs{x}\) and checks that \(w\) is in \(A\) using the oracle.

We will program \(A\) so that \(L_A \not\in \P^A\).

Let \(M_1, M_2, \ldots\) be a list of all polynomial-time oracle TMs. For simplicity, we assume \(M_i\) runs in time \(n^i\).

Our construction of \(A\) ensures that for all \(i\), \(M_i^A\) does not decide the problem \(L_A\) correctly.

Suppose \(A\) is partially specified and we have achieved our goal for all \(M_j\) with \(j<i\). We now work with \(M_i\).

Choose \(n\) so that

  1. \(2^n > n^i\).
  2. \(n\) is larger than the length of all strings queried before.

We show how to extend the definition of \(A\) so that \(M_i^A\) answers the question “\(1^n \in L_A\)?” incorrectly. Run \(M_i\) on input \(1^n\); when it queries \(A\), if the queried string is already defined (queried before), \(A\) replies accordingly, and if the queried string is new, \(A\) reports ‘no’.

After \(M_i^A\) finishes, if \(M_i^A\) accepts, we program \(A\) so that all strings of length \(n\) are assigned ‘no’. If \(M_i^A\) rejects, we program \(A\) so that one of the strings of length \(n\) not queried by \(M_i\) is assigned ‘yes’. This is possible as \(M_i^A\) queries at most \(n^i < 2^n\) strings.

This ensures that \(M_i^A\) answers \(1^n\) incorrectly. So \(L_A \not\in \P^A\) for the oracle \(A\) constructed above.

Limits of Diagonalization

  • Oracle \(B\)

    There is an oracle \(B\) such that \(\P^B = \NP^B\).

    This direction is actually easier once we have all the tools ready, which we will discuss later in the course.

    Note: We can choose \(B\) to be \(\TQBF\).

    What does it say when putting both results together?

  • In diagonalization, one TM simulates another and then behaves differently.

    The simulation can be modified easily if both machines are oracle TMs.

    So any statement provable for TMs using only diagonalization will generalize to oracle TMs.

    In particular, if there is a diagonalization proof for \(\P \ne \NP\), then the same proof will show \(\P^B \ne \NP^B\), which is a contradiction.

Lecture 9: Space Complexity

Space Complexity

Space complexity considers the complexity of computational problems in terms of the amount of space or memory they require.

Space complexity shares many of the features of time complexity and serves as a further way of classifying problems according to their computational hardness.

SPACE and NSPACE

  • For functions \(f(n) \ge n\), define

    \[\begin{equation*} \begin{split} \SPACE(f(n)) = & \{ A \mid A \text{ is solvable by a TM using }\\ & \quad O(f(n)) \text{ space} \},\\ \NSPACE(f(n)) = & \{ A \mid A \text{ is solvable by an NTM using }\\ & \quad O(f(n)) \text{ space} \}. \end{split} \end{equation*} \]
  • Example: \(\SAT\) can be solved by a linear space algorithm.

    TM \(M\) = “On input \(\desc{\varphi}\), where \(\varphi\) is a Boolean formula:

    1. For each truth assignment to the variables \(x_1, x_2, \ldots , x_n\) of \(\varphi\):
      1. Evaluate \(\varphi\) on that truth assignment.
      2. Accept if evaluated to True.
    2. Reject.”
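    A direct Python rendering of machine \(M\) (a sketch, not from the lecture): each candidate assignment reuses the same working memory, so the space used is linear in the number of variables (plus the formula itself).

    from itertools import product

    def sat_linear_space(clauses, n):
        # Try every truth assignment, reusing the same O(n) working memory each time.
        for assignment in product([False, True], repeat=n):
            if all(any(assignment[abs(l) - 1] == (l > 0) for l in clause)
                   for clause in clauses):
                return True   # accept
        return False          # reject

    # (x1 or x2) and (not x1 or not x2): satisfiable
    print(sat_linear_space([[1, 2], [-1, -2]], 2))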
  • Time vs. space

    1. \(\TIME(t(n)) \subseteq \SPACE(t(n))\),
    2. \(\SPACE(t(n)) \subseteq \TIME(2^{O(t(n))})\).

    Define the complexity class

    \[\begin{equation*} \PSPACE = \bigcup_{k \ge 0} \SPACE(n^k) \end{equation*} \]

    So \(\P \subseteq \PSPACE\).

    \(\NP \subseteq \PSPACE\).

    Prove that \(\THREESAT\) is in \(\PSPACE\).

    \(\coNP \subseteq \PSPACE\).

    Because \(\class{coPSPACE} = \class{PSPACE}\).

Polynomial Space

PSPACE Completeness

  • Definition

    1. \(B\) is in \(\PSPACE\).
    2. Every \(A\) in \(\PSPACE\) reduces to \(B\) in polynomial time.

    Reductions should be as weak as possible.

  • The \(\TQBF\) problem

    \(\phi_1 = \exists x \forall y\, \bigl[ (\neg x \lor y) \land (x \lor \neg y) \bigr]\).

    \(\phi_1 \not\in \TQBF\)

    \(\SAT\) is a special case of \(\TQBF\).

    The number of alternations depends on the number of variables.

  • Containment in \(\PSPACE\)

    \(\TQBF\) is in \(\PSPACE\).

    Consider TM \(S\) = “On input \(\desc{\varphi}\):
    1. If \(\varphi\) has no quantifiers, then \(\varphi\) has no variables. It is either True or False. Output accordingly.
    2. If \(\varphi = \exists x\, \psi\), then evaluate \(\psi\) with \(x\) equals True and False recursively. Accept if either case accepts. Reject otherwise.
    3. If \(\varphi = \forall x\, \psi\), then evaluate \(\psi\) with \(x\) equals True and False recursively. Accept if both accept. Reject otherwise.”

    The space complexity of \(S\): Each recursive level uses constant extra space.

    So \(\TQBF \in \SPACE(n) \subseteq \PSPACE\).
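    A small Python sketch of machine \(S\) (illustrative only): the quantifier prefix is given as a list and the quantifier-free part as a function of a full assignment.

    def eval_tqbf(prefix, matrix, assignment=()):
        # prefix: list of ('E', var) or ('A', var); matrix: function of an assignment dict.
        if not prefix:
            return matrix(dict(assignment))
        quant, var = prefix[0]
        results = (eval_tqbf(prefix[1:], matrix, assignment + ((var, b),))
                   for b in (False, True))
        return any(results) if quant == 'E' else all(results)

    # phi_1 = exists x forall y [(not x or y) and (x or not y)]
    phi1 = lambda a: (not a['x'] or a['y']) and (a['x'] or not a['y'])
    print(eval_tqbf([('E', 'x'), ('A', 'y')], phi1))   # False: phi_1 is not in TQBF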

  • Completeness

    \(\TQBF\) is \(\PSPACE\)-complete

    Let \(A\) be a problem in \(\PSPACE\) decided by a TM using space at most \(n^k\). We will give a polynomial time reduction \(f\) mapping \(A\) to \(\TQBF\).

    Plan: Design a formula \(\varphi_{M,w}\) to express the condition that \(M\) accepts \(w\). Use the tableau of the configurations of \(M\) on \(w\).

    The tableau has \(n^k\) columns and at most \(d^{n^k}\) rows where \(d\) is the alphabet size.

    If we use Cook-Levin directly, the formula would be too big!

    We will do this recursively. For configurations \(c_i\) and \(c_j\) and a number \(b\), construct a formula \(\varphi_{c_i, c_j, b}\) expressing that \(c_i\) can reach \(c_j\) in at most \(b\) steps, written \(c_i \stackrel{b}{\to} c_j\). So we have

    \[\begin{equation*} \varphi_{c_i,c_j,b} = \exists c_m\, \bigl[ \varphi_{c_i, c_m, b/2} \land \varphi_{c_m, c_j, b/2} \bigr ]. \end{equation*} \]

    Do this recursively until the value \(b\) is \(1\), at which point we can use the method like in Cook-Levin. Finally, choose the formula \(\varphi_{M, w} = \varphi_{c_\text{init}, c_\text{final}, t}\) where \(t = d^{n^k}\).

    There is an issue, however, as each recursion step doubles the number of subformulas.

    The abbreviation trick!

    We replace the \(\land\) part with

    \[\begin{equation*} \forall(c_g, c_h) \in \{(c_i, c_m), (c_m, c_j)\}\, \bigl[ \varphi_{c_g, c_h, b/2} \bigr]. \end{equation*} \]

    Note that \(\forall (x \in S)\, [\phi]\) is equivalent to \(\forall x\, [x \in S \rightarrow \phi]\), and so each recursion increases the length of the formula by \(O(n^k)\).

    The number of recursion levels is \(\log d^{n^k} = O(n^k)\).

  • \(\TQBF\) remains \(\PSPACE\)-complete even if the inner formula without any quantifiers is a 3-CNF.

    Run the \(\THREESAT\) \(\NP\)-hardness reduction on the quantifier-free inner formula, which can be thought of as a verifier checking whether the input is a correct witness.

Savitch's Theorem

(Savitch) For any function \(f: \natural \to \natural\) where \(f(n) \ge n\),

\[\begin{equation*} \NSPACE(f(n)) \subseteq \SPACE(f^2(n)). \end{equation*} \]

As a corollary, \(\PSPACE = \NPSPACE\).

  • We have shown \(\NP \subseteq \PSPACE\), so nondeterminism can somehow be simulated space-efficiently.

  • A naive approach: trying all the branches of the NTM's computation.

    However, a branch that uses \(f(n)\) space may run for \(2^{O(f(n))}\) steps and each step may be a nondeterministic choice.

    Exploring the branches sequentially requires remembering all the choices made on a particular branch in order to find the next branch.

    Therefore, this naive approach may use \(2^{O(f(n))}\) space.

  • We can use the same recursion trick as the proof that \(\TQBF\) is complete for \(\PSPACE\).

  • Yieldability problem

    Given two configurations \(c_1\) and \(c_2\) of the NTM \(N\) and a number \(t\), decide whether \(N\) can go from \(c_1\) to \(c_2\) within \(t\) steps; each configuration can be written down in \(O(f(n))\) space.

    Define TM \(Y\) = “On input \(c_1\), \(c_2\) and \(t\):

    1. If \(t = 1\), check whether \(c_1 = c_2\) or whether \(c_1\) goes to \(c_2\) in one step according to the rules of \(N\). If so, accept. Reject otherwise.
    2. If \(t > 1\), then for each configuration \(c_m\) of \(N\) using space \(O(f(n))\):
      1. Run \(Y\) on input \(\tuple{c_1, c_m, t/2}\).
      2. Run \(Y\) on input \(\tuple{c_m, c_2, t/2}\).
      3. Accept if both runs accept.
    3. If we haven't accepted yet, reject.”

    This algorithm needs space to store the recursion stack.

    Each level of the recursion uses \(O(f(n))\) space to store a configuration.

    The depth of the recursion is \(\log t\), where \(t\) is the maximum time that the nondeterministic machine \(N\) may use on any branch.

    We have \(t = 2^{O(f(n))}\), so \(\log t = O(f(n))\). Hence \(Y\) is a deterministic TM of space complexity \(O(f^2(n))\).
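    A Python sketch of the yieldability recursion on an abstract configuration graph; `step(c1, c2)` stands in for “\(c_1\) yields \(c_2\) in one step of \(N\)” and `configs` enumerates all configurations, both hypothetical stand-ins for this illustration.

    def yieldable(c1, c2, t, configs, step):
        # Can c1 reach c2 within t steps?  The recursion depth is log t, and each
        # level stores only the middle configuration c_m it is currently trying.
        if t <= 1:
            return c1 == c2 or step(c1, c2)
        half = (t + 1) // 2
        return any(yieldable(c1, cm, half, configs, step) and
                   yieldable(cm, c2, t - half, configs, step)
                   for cm in configs)

    # Toy configuration graph: a path 0 -> 1 -> 2 -> 3.
    step = lambda a, b: b == a + 1
    print(yieldable(0, 3, 4, range(4), step))   # True
    print(yieldable(0, 3, 2, range(4), step))   # False: three steps are needed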

  • Application: Construction of the oracle \(B\) for \(\P^B = \NP^B\)

    Take \(B = \TQBF\).

    \(\NP^{\TQBF} \subseteq \NPSPACE \subseteq \PSPACE \subseteq \P^{\TQBF}\).

    The first containment holds because each \(\TQBF\) oracle call can be answered directly in polynomial space.

    The second containment is Savitch's theorem, and the last holds because \(\TQBF\) is \(\PSPACE\)-complete.

Games and Space Complexity

Connections between games and complexity

  • Formula game

    Given an instance of \(\TQBF\)

    \[\begin{equation*} \varphi = \exists x_1 \forall x_2 \exists x_3 \cdots (\forall/\exists) x_n\, [ (\cdots) \land \cdots \land (\cdots)] \end{equation*} \]

    We consider a two-player game called the formula game with players \(\exists\) and \(\forall\)

    The \(\exists\) Player assigns values to the \(\exists\) quantified variables and the \(\forall\) Player assigns values to the \(\forall\) variables.

    They choose the values in the order of the quantifiers in the formula.

    \(\exists\) Player wins if the formula evaluates to True.

    Define problem

    \[\begin{equation*} \begin{split} & \FG = \{ \desc{\varphi} \mid \text{Player $\exists$ has}\\ & \quad \text{a winning strategy for the formula game of } \varphi\}. \end{split} \end{equation*} \]
  • \(\FG\) is \(\PSPACE\)-complete

    Player \(\exists\) has a winning strategy in the formula game for \(\varphi\) if and only if \(\varphi\) is true.

    That is \(\FG = \TQBF\).

    Player \(\exists\) has a winning strategy in the game if she can choose some assignment to \(x_1\) so that for any assignment to \(x_2\) chosen by Player \(\forall\), there is a choice for \(x_3\) and so on, such that the formula evaluates to True.

    This is equivalent to the condition that the formula is true by the \(\exists\) and \(\forall\) semantics in logic.

  • The word-chain game (接龙游戏), or the Generalized Geography game.

    Given a directed graph \(G\), two players take turns choosing vertices. Each player must choose an out-neighbor of the vertex just chosen by the opponent and cannot choose any vertex that has been chosen before. A player who cannot make a move loses.

    \[\begin{equation*} \begin{split} & \GG = \{ \tuple{G, a} \mid \text{ Player 1 has}\\ & \quad \text{a winning strategy when starting from vertex } a\}. \end{split} \end{equation*} \]
  • Complexity of \(\GG\)

    The \(\GG\) problem is \(\PSPACE\)-complete.

    1. The \(\GG\) problem is in \(\PSPACE\).

    2. \(\GG\) is \(\PSPACE\)-hard.

    We reduce \(\FG\) to it.

    First, in the left part, we have diamonds to force the players to take turns. We cannot add outgoing edges to them.

    On the right-hand side, we have a node \(c\) at the top with the clauses \(c_j\) as its children.

    We link each clause to the True or False side of the variable diamonds on the left, according to the literals it contains.

    The node \(c\) must be chosen by the \(\exists\) player (add extra nodes if necessary).

    Player \(\forall\) then tries to choose a clause \(c_j\) that is not satisfied.

    Player \(\exists\) shows why \(c_j\) is actually satisfied.

  • Complexity of Chess and Go

    \(\PSPACE\)-hard for \(n \times n\) boards. Sometimes the games are even harder.

Logarithmic Space

When working with logarithmic space, we must change the machine model, as the work space is too small even to hold the input.

L and NL

Consider TMs with two tapes: the input tape is read-only and does not count toward the space used; only the read/write work tape counts.

Define complexity classes \(\L = \SPACE(\log n)\) and \(\NL = \NSPACE(\log n)\).

  • Examples

    1. \(\PAL\) is in \(\L\).

    2. Define \(\PATH\) to be the reachability problem for directed graphs. \(\PATH\) is in \(\NL\).

      The machine can nondeterministically guess the next vertex in the path.

      You will also need to keep a counter of the number of steps taken, so the machine can reject after \(n\) steps and avoid looping; the counter needs only \(O(\log n)\) space.

  • \(\L \subseteq \NL\).

Basic Facts

\(\L \subseteq \P\).

Savitch's theorem works for \(\NL\).

\(\NL \subseteq \SPACE(\log^2 n)\).

\(\NL \subseteq \P\).

Take a nondeterministic machine \(M\) that decides \(A\) in space \(O(\log n)\).

Build the configuration graph \(G_{M,w}\): its nodes are all configurations of \(M\) on \(w\), with an edge \(c_i \to c_j\) whenever \(c_i\) yields \(c_j\) in one step.

Claim. TM \(M\) accepts \(w\) if and only if the configuration graph \(G_{M, w}\) has a path from \(c_{\text{init}}\) to \(c_{\text{acc}}\).

Polynomial-time algorithm \(T\) for \(A\) = “On input \(w\),

  1. Compute \(G_{M,w}\).
  2. Accept if there is a path from \(c_{\text{init}}\) to \(c_{\text{acc}}\).”

NL Completeness

  • Definition

We cannot use polynomial-time reductions here!

We need to define log-space reducibility.

A log-space transducer is a TM with three tapes:

  1. A read-only input of size \(n\),
  2. A read/write work tape of size \(O(\log n)\),
  3. A write-only output tape.

Problem \(A\) is log-space reducible to \(B\), written \(A \le_L B\), if \(A\) is mapping reducible to \(B\) by a log space transducer.

  • Reductions in log space.

    If \(A \le_L B\) and \(B \in \L\) then \(A \in \L\).

  • The trivial approach of computing all of \(f(w)\) on input \(w\) and then running the decider for \(B\) does not work.

    It would produce a string that is too long!

  • The idea is that we don't have to store the whole \(f(w)\) and the decider only needs the symbols in \(f(w)\) one at a time.

    Compute the symbols in \(f(w)\) only at the time we need them.

    Each time, we start the transducer again from the beginning and ignore all output except the symbol at the desired position.

    This is OK, as we are considering space complexity!
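    A small Python sketch of this recomputation trick (names are illustrative): instead of storing \(f(w)\), the transducer is rerun from the beginning each time the decider for \(B\) needs the next symbol.

    def nth_output_symbol(transducer, w, i):
        # Rerun the transducer from scratch and keep only the i-th output symbol.
        for pos, symbol in enumerate(transducer(w)):
            if pos == i:
                return symbol
        return None   # past the end of f(w)

    def duplicate(w):
        # A toy transducer computing f(w) = ww, emitted one symbol at a time.
        for _ in range(2):
            for symbol in w:
                yield symbol

    # The decider reads f(w) symbol by symbol, never storing the whole string.
    w, i = "abc", 0
    while True:
        s = nth_output_symbol(duplicate, w, i)
        if s is None:
            break
        print(i, s)
        i += 1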

PATH is NL-Complete

  • We already know \(\PATH \in \NL\)

  • \(\NL\) reduces to \(\PATH\)

    The claim is not so surprising as we have already seen the configuration graph for \(\NL\) problems.

    Let \(f(w) = \tuple{G, s, t}\), where \(G\) is the configuration graph \(G_{M,w}\) and \(s = c_{\text{init}}\) and \(t = c_{\text{acc}}\).

    We need to show that there is a log-space transducer \(T\) computing \(f\).

    TM \(T\) = “On input \(\tuple{M, w}\):

    1. Go through all pairs of possible configurations \(c_i, c_j\) of \(M\) in some order.
    2. Output the edge \((c_i, c_j)\) if \(c_i \to c_j\) is a legal move. (Each configuration has size \(O(\log n)\), as it consists mainly of the work-tape contents of the \(\NL\) machine \(M\).)
    3. Output \(c_{\text{init}}\) and \(c_{\text{acc}}\).”

NL = coNL

(Immerman–Szelepcsényi) \(\NL = \coNL\).

Space Hierarchy Theorem

  • It is similar to the time hierarchy theorem.

    For any space-constructible function \(f: \natural \to \natural\), there is a language \(A\) that is decidable in \(O(f(n))\) space but not decidable in space \(o(f(n))\).

    \(\L \subsetneq \PSPACE\).

  • The proof is similar to that of the time hierarchy theorem.

  • \(\L \subseteq \NL \subseteq \P \subseteq \NP \subseteq \PH \subseteq \P^{\SharpP} \subseteq \PSPACE\)

    We know \(\NL \ne \PSPACE\): by Savitch's theorem \(\NL \subseteq \SPACE(\log^2 n)\), which is a proper subclass of \(\PSPACE\) by the space hierarchy theorem (a diagonalization argument).

Complexity Theory's Waterloo

Lecture A: Probability Theory for Computer Science

Basics of Probability Theory

History of Probability Theory

  • Probability theory is now a fully developed area of mathematics with a wide range of applications in many areas of research, including computer science.

  • Probability theory was invented to understand problems in gambling.

    The French nobleman and gambler Antoine Gombaud (Chevalier de Méré)

    • Getting at least one “6” in four throws of a die.

    • Getting at least one double-six in \(24\) throws of two dice.

    • Known as de Méré's problem.

  • Unfinished game problem

    Two gamblers, Alice and Bob, are playing a head-and-tail game. The rule is that anyone who first wins \(s\) times gets all the bets. At the moment when Alice wins \(a\) times and Bob wins \(b\) times, the game has to stop for some reason. How should they split the bet?

    • An example \(s=3\), \(a=2\), \(b=1\).

    • He couldn't solve it, so he asked Blaise Pascal and Pierre de Fermat.

    • This led to the early development of probability theory.

      So, probability theory is born from the study of gambling!

Sample Space and Events

  • In TCS, we are mostly concerned with probability over finite sample spaces.

  • Example: The outcome of tossing \(n\) coins as a bit string \(x \in \{0,1\}^n\).

    The sample space is \(S = \{0,1\}^n\), the set of possible outcomes in the random experiment.

    Each string \(x\) occurs with probability \(2^{-n}\).

    In general, it could be any finite set (or even infinite set).

  • An event is a subset \(A\) of the sample space \(S = \{0,1\}^n\).

  • The probability of event \(A\), denoted as \(\Pr(A)\), is the probability that the randomly chosen \(x\) will be in \(A\).

    More generally, it is

    \[\begin{equation*} \Pr(A) = \sum_{x \in A} \Pr(x). \end{equation*} \]
  • The picture you should have for probability is that we assign each possibility a non-negative real number.

    More generally, we could consider the general case where for each \(x \in S\), the probability \(\Pr(x)\) is a real number in \([0,1]\) satisfying the completeness condition

    \[\begin{equation*} \sum_{x\in S} \Pr(x) = 1. \end{equation*} \]

    The completeness condition says that we always assign \(1\) to the sample space \(S\) as an event.

Examples

  • Parity of uniform random coins

    For every \(n > 0\),

    \[\begin{equation*} \Pr_{x\sim \{0,1\}^n} \biggl( \sum_{i=1}^n x_i \text{ is even} \biggr) = \frac{1}{2}. \end{equation*} \]
  • The unfinished game as a program

    c = RandInt(2)
    if (c == 1):
        print('Alice wins.')
    else:
        c = RandInt(2)
        if (c == 1):
            print('Alice wins.')
        else:
            print('Bob wins.')
  • Split the bet based on the probability of winning.

    \[\begin{equation*} \Pr(\text{Alice wins}) = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}. \end{equation*} \]

Simple Facts

  • If \(A \subseteq B\) then \(\Pr(A) \le \Pr(B)\).

  • What is the probability of \(\Pr(A \cup B)\)?

    \[\begin{equation*} \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B). \end{equation*} \]
  • It is not always true that \(\Pr(A \cup B) = \Pr(A) + \Pr(B)\), but the following inequality is always true

    \[\begin{equation*} \Pr(A \cup B) \le \Pr(A) + \Pr(B). \end{equation*} \]
  • This inequality is known as the union bound.

    \[\begin{equation*} \Pr \biggl(\bigcup_{j=1}^m A_j \biggr) \le \sum_{j=1}^m \Pr(A_j). \end{equation*} \]

Conditional Probabilities

  • Conditioning on event \(A\) is like assuming \(A\) happens for sure!

    So you can take \(A\) as the sample space temporarily.

    \[\begin{equation*} \Pr(B|A) = \frac{\Pr(A \cap B)}{\Pr(A)}. \end{equation*} \]
  • Chain rule (in words)

    “For events \(A\) and \(B\) to occur, first \(A\) must occur (with \(\Pr(A)\)), then \(B\) must occur given \(A\) occurred.”

    \[\begin{align*} \Pr(A \cap B) & = \Pr(A) \cdot \Pr(B|A),\\ \Pr(A \cap B \cap C) & = \Pr(A) \cdot \Pr(B|A) \cdot \Pr(C | A \cap B). \end{align*} \]
  • Law of total probability

    For all events \(A, B\)

    \[\begin{equation*} \Pr(B) = \Pr(A) \Pr(B|A) + \Pr(A^c) \Pr(B|A^c). \end{equation*} \]
  • Example: Let's provide an alternative proof of the even parity probability.

    Let \(A_k\) be the event that the first \(n-1\) bits sum to \(k\), and let \(B\) be the event that all \(n\) random bits have even parity.

    Conditioned on \(A_k\), the sum of all \(n\) bits is \(k\) or \(k+1\) with equal probability, and exactly one of the two is even, so \(\Pr(B|A_k) = 1/2\) for every \(k\).

    By the law of total probability,

    \[\begin{equation*} \Pr(B) = \sum_k \Pr(A_k) \Pr(B | A_k) = \frac{1}{2} \sum_k \Pr(A_k) = \frac{1}{2}. \end{equation*} \]

Independence

Two events \(A\) and \(B\) are independent if \(\Pr(A \cap B) = \Pr(A) \Pr(B)\).

  • If \(\Pr(A) \ne 0\), this is equivalent to \(\Pr(B|A) = \Pr(B)\).

  • The events \(A\) and \(B\) in the above example are independent.

  • In the random program picture, independence usually follows from the fact that \(A\) and \(B\) depend on disjoint parts of the program.

  • Multiple independent events

    Independence of \(A_1, A_2, \ldots, A_k\) requires that for every subset \(K\) of \(\{1, 2, \ldots, k\}\),

    \[\begin{equation*} \Pr \biggl(\bigcap_{j\in K} A_j \biggr) = \prod_{j \in K} \Pr(A_j). \end{equation*} \]

Birthday Paradox

There are \(m\) students in a room. What's the probability that they all have different birthdays?

  • Simplification: we exclude Feb. 29th, so the number of days \(N = 365\).

  • What's your guess?

  • Consider the more general case of \(N\) days.

    Let \(A_j\) be the event that the \(j\)-th student's birthday differs from all previous birthdays.

  • The probability is

    \[\begin{equation*} \begin{split} & \Pr(\text{no collision})\\ =\, & \Pr(A_1) \Pr(A_2|A_1) \cdots \Pr(A_m|A_1 \cap A_2 \cap \cdots \cap A_{m-1})\\ =\, & 1 \cdot (1 - 1/N) \cdot (1 - 2/N) \cdots (1 - (m-1)/N) \end{split} \end{equation*} \]
  • You can plot the probability as a function of \(m\).

    For \(N=365\), you will need \(m\) to be around \(23\) for the probability to be close to \(1/2\).

    Number \(m\)  Probability of a collision
    1 0.0%
    5 2.7%
    10 11.7%
    20 41.1%
    23 50.7%
    30 70.6%
    40 89.1%
    50 97.0%
    60 99.4%
  • More generally, when \(m \approx 1.18 \sqrt{N}\), the probability is around \(1/2\).

    Trick: We can approximate \(1+x\) using \(e^x\) for all real \(x\) with a small absolute value.

    \[\begin{equation*} \begin{split} \Pr(\text{no collision}) & \approx \exp \Bigl(-\frac{1}{N}\sum_{j=0}^{m-1} j \Bigr)\\ & = \exp\Bigl(-\frac{m(m-1)}{2N}\Bigr). \end{split} \end{equation*} \]
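    A quick Python check (illustrative) of the exact product against the \(e^x\) approximation; the values agree with the table above.

    import math

    def birthday_no_collision(m, N=365):
        # Exact product (1 - 1/N)(1 - 2/N)...(1 - (m-1)/N).
        p = 1.0
        for j in range(m):
            p *= 1 - j / N
        return p

    for m in (10, 23, 50):
        exact = 1 - birthday_no_collision(m)              # collision probability
        approx = 1 - math.exp(-m * (m - 1) / (2 * 365))   # the e^x approximation
        print(m, round(exact, 3), round(approx, 3))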

Applications of the Birthday Paradox

  • Cryptographic hash functions

    • 1991: MD5 (Rivest)
    • 1993: SHA-0 (NSA)
    • 1995: SHA-1 (NSA)
    • 2001: SHA-2 (NSA)
  • Birthday attack

    SHA-1 outputs 160 bits, so finding a collision for it is like a birthday paradox with \(N=2^{160}\).

  • Xiaoyun Wang's work breaks SHA-1.

    Xiaoyun Wang, Yiqun Lisa Yin, Hongbo Yu, Finding Collisions in the Full SHA-1, Advances in Cryptology-CRYPTO 2005, LNCS 3621, 2005, pp. 17-36 (Best Paper Award).

Random Variables and Expectations

Random Variable

  • An event is a yes/no classification of elements in the sample space \(S\).

  • A random variable is any function \(X: S \to \real\) that maps every possible outcome to a real number.

  • Example:

    \[\begin{equation*} X: x \mapsto \sum_{i=1}^n x_i, \end{equation*} \]

    is a random variable that represents the Hamming weight of \(n\) coin flip outcomes.

  • Functions of random variables are again random variables; for example, the sum

    \[\begin{equation*} X = X_1 + X_2 + \cdots + X_k. \end{equation*} \]

Expectation and Linearity of Expectation

  • Expectation is the mean value

    \[\begin{equation*} \E(X) = \sum_{u\in S} p(u) X(u) \end{equation*} \]
  • Linearity

    For all random variables \(X_j\) for \(j=1, 2, \ldots, k\),

    \[\begin{equation*} \E\Bigl(\sum_j X_j\Bigr) = \sum_j \E(X_j). \end{equation*} \]
  • Expectation of the product of independent random variables

    If \(X_j\) are independent random variables,

    \[\begin{equation*} \E\Bigl(\prod_j X_j\Bigr) = \prod_j \E(X_j). \end{equation*} \]

Examples

  • Binomial distribution

    • Bernoulli trial

      Success with probability \(p\).

      Failure with probability \(q = 1-p\).

    • A random variable \(X\) representing the number of successes of the Bernoulli trials follows the binomial distribution, written as \(X \sim \mathrm{Binomial}(n,p)\) with parameters \(n \in \natural\) and \(p \in [0,1]\).

      \(\Pr(X = k) = \binom{n}{k}\, p^k(1-p)^{n-k}\).

    • Expectation \(\E(X) = np\).

    • Variance \(\Var(X) = npq\).

  • Geometric distribution \(X \sim \mathrm{Geometric}(p)\).

    • The number of trials to have one success.

      \(\Pr(X = k) = (1-p)^{k-1} p\)

    • Expectation \(\E(X) = \frac{1}{p}\).

    • Variance \(\Var(X) = \frac{1-p}{p^2}\).

  • Gaussian distribution

    Even though we mostly focus on discrete distributions, Gaussian distribution is so important that you need to know it!

    • Probability density function of \(X \sim N(0,1)\).

      \[\begin{equation*} \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}. \end{equation*} \]
    • Rotational invariance of multiple Gaussians.

      The distribution of the random vector \(X = (X_1, X_2, \ldots, X_k)\), where the \(X_j\) are independent standard Gaussians, is invariant under rotations.

      Independent samples of Gaussian form a multi-dimensional Gaussian.

      It is a useful fact from which several important results about Gaussian follow.

      First, we can use it to prove

      \[\begin{equation*} \int_{-\infty}^{\infty} e^{-x^2/2} \mathrm{d}x = \sqrt{2\pi}. \end{equation*} \]

      Second, it shows that the sum of independent Gaussian random variables is again Gaussian.

Coupon Collector's Problem

  • There are \(n\) different kinds of coupons.

  • How many random coupons do you need to collect until you can have all \(n\) coupons?

    Let \(X\) be the # of coupons you will need.

  • What is the expectation of \(X\)?

  • Write \(X\) as some summation

    \(X_j\) is the number of additional coupons drawn to go from \(j-1\) distinct coupons to \(j\) distinct coupons.

    \(X = X_1 + X_2 + \cdots + X_n\).

  • When you have \(i-1\) different coupons, the probability of getting a new one in the next step is \(1 - (i-1)/n\).

  • So we have \(X_i \sim \text{Geometric}(1-(i-1)/n)\) and

    \[\begin{equation*} \E(X_i) = \frac{n}{n-i+1}. \end{equation*} \]

    Therefore, by the linearity of expectation

    \[\begin{equation*} \E(X) = n (1 + 1/2 + \cdots + 1/n) = n H_n, \end{equation*} \]

    where \(H_n\) is the \(n\)-th harmonic number, approximately \(\ln n\).
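    A short Python check (illustrative) of \(\E(X) = nH_n\) against a simulation.

    import random

    def expected_coupons(n):
        # n * H_n, by the linearity of expectation.
        return n * sum(1 / i for i in range(1, n + 1))

    def simulate_coupons(n):
        # Draw uniformly random coupons until all n kinds have been seen.
        seen, draws = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            draws += 1
        return draws

    n = 50
    print(expected_coupons(n))                                    # about 224.96
    print(sum(simulate_coupons(n) for _ in range(1000)) / 1000)   # close to the expectation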

Concentration and Tail Bounds

How likely will a random variable be far from its expectation?

If we toss \(n\) random coins, the number of heads is concentrated around \(n/2\).

It can be written as \(X = X_1 + X_2 + \cdots + X_n\) where the \(X_i\)'s are independent Bernoulli trials.

Chernoff bounds are a collection of bounds for a quantitative analysis of the above concentration.

Markov Inequality

If \(X\) is a non-negative random variable, then for all \(t > 0\),

\[\begin{equation*} \Pr (X \ge t) \le \frac{\E(X)}{t}. \end{equation*} \]

Consider the indicator function. Draw the graph.

  • Example

    As we have computed that the expectation of the coupon collector problem is \(n H_n\), we know that the probability that it takes more than \(100 n H_n\) coupons is at most \(1/100\)!

  • However, it does not give nice bounds for the random coin concentration problem.

Chebyshev Inequality

  • How concentrated a random variable \(X\) is when we know the variance of \(X\).

  • Variance

    \[\begin{equation*} \Var(X) = \E\bigl( (X - \E(X))^2 \bigr). \end{equation*} \]

    For independent \(X_i\) where \(i=1, 2, \ldots, n\), \(Y = \sum_{i=1}^n X_i\), we have

    \[\begin{equation*} \Var(Y) = \sum_{i=1}^n \Var(X_i). \end{equation*} \]
  • Standard deviation

    \[\begin{equation*} \sigma(X) = \sqrt{\Var(X)}. \end{equation*} \]
  • Chebyshev inequality

    Suppose a random variable \(X\) has mean value \(\mu\). Then for any \(t>0\),

    \[\begin{equation*} \Pr(\abs{X - \mu} \ge t) \le \frac{\Var(X)}{t^2}. \end{equation*} \]

    Consider random variable \(Y = (X - \mu)^2\). Then, we have \(\E(Y) = \Var(X)\) by definition. By Markov inequality, the probability that \(Y \ge t^2\) is at most \(\E(Y) / t^2\). The proof is complete by observing that \(Y \ge t^2\) is equivalent to \(\abs{X - \mu} \ge t\).

  • Example

    Number \(k\) Percentage within \(k\) standard deviations (at least)
    \(2\) \(75\%\)
    \(3\) \(89\%\)
    \(4\) \(93.75\%\)
  • Chebyshev is ideal for problems with pairwise independence as it only relies on the variance (the second moment).

  • This can be generalized to any polynomial of \(\abs{X - \mu}\), and so by the Markov inequality,

    \[\begin{equation*} \Pr(\abs{X - \mu} \ge t) \le \frac{\E(\abs{X - \mu}^k)}{t^k}. \end{equation*} \]

    To use this bound, you will need to bound the higher moments of the random variable, which can be complicated in most cases.

Chernoff Bound

  • We could even apply the same idea to functions other than polynomials!

    \[\begin{equation*} \varphi(\lambda) = \E \bigl(e^{\lambda(X-\mu)}\bigr). \end{equation*} \]
  • This leads to a powerful tool known as the Chernoff bound.

    There are many variants of Chernoff bound that are suitable for different scenarios.

  • Uniform coin flips.

    Consider \(S_n = X_1 + X_2 + \cdots + X_n\) where each \(X_i\) is a random coin flip. We have \(\mu = \E(S_n) = n/2\).

    Shift to have zero mean. \(S_n - n/2 = \sum_j (2X_j-1)/2 = \sum_j Y_j/2\) for \(Y_j = 2X_j-1\).

    You may think of the random walk on a line.

    \[\begin{equation*} \begin{split} \Pr(S_n - \mu \ge t) & = \Pr \Bigl(\exp \Bigl(\lambda \sum_j Y_j \Bigr) \ge e^{2\lambda t}\Bigr)\\ & = \Pr \Bigl(\prod_j e^{\lambda Y_j} \ge e^{2\lambda t} \Bigr)\\ & \le \frac{\E \prod_j e^{\lambda Y_j}}{e^{2\lambda t}}\\ & = \frac{(\E e^{\lambda Y_1})^n}{e^{2\lambda t}}. \end{split} \end{equation*} \]

    Now, we need to bound

    \[\begin{equation*} \begin{split} \E e^{\lambda Y_1} & = \frac{e^{\lambda} + e^{-\lambda}}{2}\\ & = 1 + \frac{\lambda^2}{2!} + \frac{\lambda^4}{4!} + \frac{\lambda^6}{6!} + \cdots\\ & \le e^{\lambda^2/2}. \end{split} \end{equation*} \]

    Continuing the calculation, we have

    \[\begin{equation*} \begin{split} \Pr(S_n - n/2 \ge t) & \le \exp(\lambda^2 n/2 - 2 \lambda t)\\ & = \exp(-2t^2/n), \end{split} \end{equation*} \]

    by choosing the best \(\lambda = 2t/n\).

  • Other versions

    We give several variants of Chernoff bounds covering most of the use cases.

    Let \(X_1, X_2, \ldots, X_n\) be independent random variables such that \(X_i \in [a_i,b_i]\). Define \(X = \sum_{i=1}^n X_i\) and \(\mu = \E (X)\). The Hoeffding bound says that, for all \(t \ge 0\),

    \[\begin{equation*} \begin{split} \Pr (X \le \mu -t) & \le \exp \biggl(- \frac{2 t^2}{\sum_j (b_j-a_j)^2}\biggr),\\ \Pr (X \ge \mu + t) & \le \exp \biggl(- \frac{2 t^2}{\sum_j (b_j-a_j)^2}\biggr). \end{split} \end{equation*} \]

    (Multiplicative-form Chernoff Bound). Consider random variables \(X_i \in [0,1]\) and \(X = \sum_j X_j\). Then, for every \(\varepsilon > 0\),

    \[\begin{equation*} \begin{split} \Pr(X \le (1-\varepsilon) \mu) & \le \exp \biggl(-\frac{\varepsilon^2}{2}\mu\biggr),\\ \Pr(X \ge (1+\varepsilon) \mu) & \le \exp \biggl(-\frac{\varepsilon^2}{2+\varepsilon}\mu\biggr).\\ \end{split} \end{equation*} \]

    Note the extra \(\varepsilon\) in the second inequality.

Sampling Theorem

Suppose \((X_j)_{j=1}^n\) is a sequence of independent Bernoulli trials with success probability \(p\). If we draw \(n\) independent samples, we can estimate \(p\) by computing the average \(\hat{p} = \sum_j X_j / n\).

The Chernoff bound says that, for \(n \ge 3 \ln (2/\delta) / \varepsilon^2\), \(\abs{\hat{p} - p} \le \varepsilon\) holds with probability at least \(1 - \delta\).
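A small Python sketch of the sampling procedure (illustrative): the sample count follows the bound above, and `random.random() < p` plays the role of one Bernoulli trial.

    import math, random

    def estimate(p, eps, delta):
        # Number of samples suggested by the bound, then an empirical estimate of p.
        n = math.ceil(3 * math.log(2 / delta) / eps ** 2)
        samples = [1 if random.random() < p else 0 for _ in range(n)]
        return n, sum(samples) / n

    n, p_hat = estimate(p=0.3, eps=0.05, delta=0.01)
    print(n, p_hat)   # |p_hat - 0.3| <= 0.05 except with probability about 0.01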

Probabilistic Method

The Probabilistic Method, by Noga Alon and Joel H. Spencer.

  • We will discuss the first example in the book regarding Paul Erdős' bound on the Ramsey number.

    Ramsey Theorem. In any two-coloring of the edges of the complete graph on six vertices, \(K_6\), there must be a monochromatic triangle \(K_3\).

  • More generally, \(R(k,l)\) is the smallest number of vertices \(n\) such that any red/blue coloring of the edges of \(K_n\) contains either a red \(K_k\) or a blue \(K_l\).

    (Erdős 1947) If \(\binom{n}{k} \cdot 2^{1 - \binom{k}{2}} < 1\), then \(R(k, k) > n\). Thus \(R(k,k)> \floor{2^{k/2}}\) for all \(k\ge 3\).

    Consider a random coloring of the graph \(K_n\) where each edge is colored red or blue with equal probability independently.

    For each vertex subset \(R\) of size \(k\), let \(A_R\) be the event that the corresponding subgraph is monochromatic. Clearly \(\Pr(A_R) = 2^{1 - \binom{k}{2}}\).

    By the union bound the probability of at least one of the \(A_R\) events occurring is at most \(\binom{n}{k} \cdot 2^{1 - \binom{k}{2}} < 1\).

    Thus, with positive probability, no events \(A_R\) occur.

    This means there is a coloring of \(K_n\) without a monochromatic \(K_k\); that is, \(R(k,k) > n\). For \(k\ge 3\) and \(n = \floor{2^{k/2}}\), we have

    \[\begin{equation*} \binom{n}{k}2^{1 - \binom{k}{2}} < \frac{2^{1+k/2}}{k!} \cdot \frac{n^k}{2^{k^2/2}} < 1. \end{equation*} \]
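    A quick Python check (illustrative) of the Erdős bound: for \(n = \floor{2^{k/2}}\), the quantity \(\binom{n}{k}\, 2^{1 - \binom{k}{2}}\) is indeed below \(1\).

    from math import comb, floor

    for k in (3, 5, 10, 20):
        n = floor(2 ** (k / 2))
        value = comb(n, k) * 2 ** (1 - k * (k - 1) // 2)
        print(k, n, value)   # always < 1, so R(k, k) > n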

Lecture B: Randomized Computation

“Does randomness solve problems? YES, randomness solves problems. BUT the randomness (probably) doesn't have to be random to solve problems.” — Russell Impagliazzo

Randomized Algorithms

Randomized Algorithms, by Rajeev Motwani and Prabhakar Raghavan, 1995

A randomized algorithm outputs the correct value with good probability on every possible input.

Randomized Searching Algorithm

  • Problem

    Input: a string \(x = x_1 x_2 \cdots x_n\) with exactly \(n/4\) positions equal to \(1\).

    Output: One such position.

  • Algorithms

    Deterministic algorithm \(D\) = “On input \(x\): For \(i=1, 2, \ldots, n\), return \(i\) if \(x_i = 1\).”

    The worst case is when \(x = 0^{3n/4}1^{n/4}\).

    In fact, for all deterministic algorithms, the time complexity is \(\Omega(n)\)!

    Randomized algorithm \(R\) = “On input \(x\): Repeat the following

    1. Sample \(i = \mathrm{RandInt}(n)\);
    2. Return \(i\) if \(x_i = 1\).”
  • When the success probability of one trial is \(p\), the running time of the algorithm is a random variable distributed according to \(\mathrm{Geometric}(p)\), and so the expected running time is

    \[\begin{equation*} \sum_{t=1}^\infty (1-p)^{t-1} p\, t = \frac{1}{p}. \end{equation*} \]

    The expected time complexity is constant for all possible inputs of length \(n\).

Matrix Multiplication Checking

  • Problem

    Input: Matrices \(A, B, C\) of size \(n \times n\).

    Output: ‘Yes’ if and only if \(AB = C\).

  • Algorithms

    The direct approach is to compute the product \(AB\) and check if it is the same as \(C\).

    This has complexity \(n^{\omega}\), where \(\omega=3\) for the straightforward implementation and \(\omega=2.37\cdots\) is the current record.

    Freivalds' algorithm \(F\) = “On input \(\tuple{A, B, C}\):

    1. Repeat the following \(k\) times
      1. Generate a random vector \(v \in \{0,1\}^n\);
      2. Compute \(d = A(Bv) - Cv\);
      3. Return ‘No’ if \(d \ne 0^n\);
    2. Return ‘Yes’.”
  • Analysis

    In \(O(k n^2)\) time, the algorithm can verify the matrix product with a probability of failure less than \(2^{-k}\).

    If \(AB=C\), the algorithm always returns ‘Yes,’ which is correct.

    If \(AB \ne C\), then \(D = AB - C\) is a non-zero matrix; let \(D_{i,j}\) be a non-zero entry.

    The \(i\)-th entry of \(d\) can be written as

    \[\begin{equation*} d_i = \sum_k D_{i,k} v_k = D_{i,j} v_j + \sum_{k: k\ne j} D_{i,k} v_k = D_{i,j} v_j + s. \end{equation*} \]

    The probability \(d_i = 0\) can be written as

    \[\begin{equation*} \begin{split} \Pr(d_i = 0) & = \Pr(d_i = 0 | s = 0) \Pr(s = 0) +\\ &\quad \Pr(d_i = 0 | s \ne 0) \Pr(s \ne 0) \end{split} \end{equation*} \] \[\begin{align*} \Pr(d_i = 0 | s = 0) & = \Pr( v_j = 0 ) = \frac{1}{2}\\ \Pr(d_i = 0 | s \ne 0 ) & = \Pr( v_j = 1 \land D_{i,j} = -s )\\ & \le \Pr(v_j = 1) = \frac{1}{2} \end{align*} \]

    So, we have

    \[\begin{equation*} \begin{split} \Pr(d = 0^n) & \le \Pr(d_i = 0) \\ & \le \frac{1}{2}(\Pr(s = 0) + \Pr(s \ne 0)) = \frac{1}{2}. \end{split} \end{equation*} \]
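    A compact Python sketch of Freivalds' algorithm (illustrative) using plain lists as matrices; each trial costs \(O(n^2)\) time.

    import random

    def freivalds(A, B, C, k=20):
        # Check AB = C with k random 0/1 vectors.
        n = len(A)
        for _ in range(k):
            v = [random.randint(0, 1) for _ in range(n)]
            Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            ABv = [sum(A[i][j] * Bv[j] for j in range(n)) for i in range(n)]
            Cv = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
            if ABv != Cv:
                return False   # AB != C for sure
        return True            # wrong only with probability at most 2^(-k)

    A, B = [[1, 2], [3, 4]], [[0, 1], [1, 0]]
    print(freivalds(A, B, [[2, 1], [4, 3]]))   # True: this is AB
    print(freivalds(A, B, [[2, 1], [4, 4]]))   # almost surely False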

Max-Cut Approximation

  • Problem

    Input: a graph \(G = (V, E)\) of \(n\) vertices.

    Task: find a cut of the graph \(G\) that cuts as many edges as possible.

  • The \(\MAXCUT\) problem is an \(\NPC\) problem.

  • One way to attack \(\NP\)-hard problems is to consider approximate solutions.

    In the \(\MAXCUTAPP\) problem, for example, our task is to find a cut \(C\) whose size is not far from the optimal one \(C^*\).

    If \(\text{size}_C \ge \alpha\, \text{size}_{C^*}\), we say that \(C\) is an \(\alpha\)-approximation of the max-cut.

  • A deterministic greedy algorithm.

    Let \(E_i = \{\{k,i\}\in E \mid k < i\}\), \(C_0 = \{1\}\), and \(C_1 = \emptyset\). At step \(i\), we put \(i\) in \(C_0\) or \(C_1\), whichever cuts more of the edges in \(E_i\). Since the sets \(E_i\) partition \(E\), it is easy to show that this deterministic algorithm gives a \(1/2\)-approximation.

  • A randomized algorithm.

    A cut is a \(0{/}1\) assignment of the vertices.

    Algorithm. Assign random \(0{/}1\) numbers \(x_u\) to each vertex \(u \in V\).

    \[\begin{equation*} \E(\text{size}_C) = \E \sum_{\{u, v\} \in E} 1_{x_u \ne x_v} = \frac{1}{2} \abs{E} \ge \frac{1}{2} \text{size}_{C^*}. \end{equation*} \]
  • Note that the randomized algorithm outputs large cuts in expectation.

    A technique called the “method of conditional expectations” can be used to obtain an algorithm that always produces a large cut.

    For this to work, you will need to be able to efficiently compute quantities like

    \[\begin{equation*} \E(\text{size}_C(x_1, \ldots, x_i, X_{i+1}, \ldots X_{n})), \end{equation*} \]

    where \(x_1, \ldots x_i\) are choices already made and \(X_{i+1}, \ldots, X_n\) are the remaining random bits.
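    A Python sketch (illustrative) of the randomized cut, together with a greedy pass in the spirit of the method of conditional expectations: each vertex is placed on the side that cuts more of the edges to already-placed vertices.

    import random

    def random_cut(n, edges):
        # Random side for each vertex; in expectation at least half the edges are cut.
        side = [random.randint(0, 1) for _ in range(n)]
        return side, sum(1 for u, v in edges if side[u] != side[v])

    def greedy_cut(n, edges):
        # Derandomized pass: fix vertices one by one, never decreasing the
        # conditional expectation of the cut size.
        side = [None] * n
        for i in range(n):
            gain = [0, 0]
            for u, v in edges:
                j, other = (u, v) if u == i else (v, u) if v == i else (None, None)
                if j is not None and other < i:
                    gain[1 - side[other]] += 1
            side[i] = 0 if gain[0] >= gain[1] else 1
        return side, sum(1 for u, v in edges if side[u] != side[v])

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # a 4-cycle plus a chord
    print(random_cut(4, edges))
    print(greedy_cut(4, edges))   # cuts at least half of the 5 edges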

Derandomize It!

  • The above randomized algorithm uses randomness that has exponentially many possibilities.

  • If we can reduce the randomness to a polynomial number of possibilities, we can derandomize the algorithm!

  • Universal hash functions.

    Hash functions map elements in a large universe to a small range. We often consider a family of hash functions \(\mathcal{H} = \{ h : U \to R \}\).

    Universal hash functions are a family of functions with the random-like property while the size of the family is small. We can use a small seed to choose hash functions from the family.

  • Pairwise independent hash functions.

    A family \(\mathcal{H} = \{h : U \to R\}\) is said to be pairwise independent if for any two distinct elements \(x_1, x_2 \in U\), and any two (possibly equal) values \(y_1, y_2 \in R\),

    \[\begin{equation*} \Pr_{h \in \mathcal{H}} \bigl( h(x_1) = y_1 \text{ and } h(x_2) = y_2 \bigr) = \frac{1}{\abs{R}^2}. \end{equation*} \]

    It is a very powerful tool!

    It follows easily that

    \[\begin{equation*} \Pr_{h \in \mathcal{H}} (h(x_1) = y_1) = \frac{1}{\abs{R}} \end{equation*} \]

    and so the two events \(h(x_i)= y_i\) are indeed independent, justifying the name.

  • Construction of pairwise independent bits.

    For all \(x \in \{0,1\}^k\), \(h(x) = a \cdot x + b \pmod{2}\), where \(a\in \{0,1\}^k\) and \(b \in \{0,1\}\).

    The size of this family is \(2^{k+1}\).

  • Prove that this is a pairwise independent hash function family for \(R = \{0,1\}\).

  • Use it to derandomize the random algorithm for \(\MAXCUTAPP\).

    Key: The algorithm only checks pairwise information.

  • The derandomized algorithm is highly parallelizable as the value for each vertex only depends on \(a,b\), and not the value of other vertices!

  • The algorithm considers \(2n\) different cuts that are independent of the input graph. Magically, one of them has size at least \(\abs{E}/2\).

    Derandomization is a powerful idea!
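    A Python sketch of the derandomization (illustrative): the pairwise independent bits \(h(x) = a \cdot x + b \pmod 2\) define the cut, and all \(2^{k+1}\) seeds \((a, b)\) are enumerated; the vertex encoding is ad hoc.

    from itertools import product

    def h(a, b, x):
        # Pairwise independent bit: h(x) = a . x + b (mod 2).
        return (sum(ai * xi for ai, xi in zip(a, x)) + b) % 2

    def derandomized_max_cut(n, edges):
        # One of the 2^(k+1) cuts has size >= |E|/2, since the expected size is |E|/2.
        k = max(1, (n - 1).bit_length())
        encode = lambda u: tuple((u >> i) & 1 for i in range(k))
        best = (-1, None)
        for a in product([0, 1], repeat=k):
            for b in (0, 1):
                side = [h(a, b, encode(u)) for u in range(n)]
                size = sum(1 for u, v in edges if side[u] != side[v])
                best = max(best, (size, side))
        return best

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    print(derandomized_max_cut(4, edges))   # at least 3 of the 5 edges are cut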

Primality

Primality testing works by checking a property that is satisfied by prime numbers.

  • Fermat's little theorem

    \(a^{n} \equiv a \pmod{n}\) if \(n\) is prime.

    So \(a\) is a witness for the compositeness of \(n\) if \(a^n \not\equiv a \pmod{n}\); such a witness certifies that \(n\) is composite without revealing any factor of \(n\).

  • The converse of Fermat's little theorem is false!

    Even if \(a^n \equiv a \pmod{n}\) for all \(a\), \(n\) is not necessarily prime.

    Such numbers are called Carmichael numbers.

    The smallest example is \(n = 561\).

  • Rabin-Miller algorithm

    The algorithm works by extending the test of Fermat's little theorem.

    Consider an odd prime number \(n\) and write \(n - 1 = 2^k q\) with \(q\) odd. Then either \(a^q \equiv 1 \pmod{n}\), or the sequence \((a^q, a^{2q}, \ldots, a^{2^kq}) \pmod{n}\) ends with \(1\) and the value just before the first \(1\) is \(-1 \pmod{n}\).

    The proof is simple. If \(x^2 \equiv 1 \pmod{n}\), then \(n | (x+1)(x-1)\). So, if \(n\) is prime, we have \(x \equiv \pm 1 \pmod{n}\).

  • Analysis

    If \(n\) is prime, the answer is ‘yes’ for all \(a\).

    If \(n\) is composite, Rabin proved that the answer is ‘yes’ for at most \(1/4\) fraction of \(a\).
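    A standard Python sketch of the Rabin-Miller test as described above; the built-in `pow(a, q, n)` performs modular exponentiation.

    import random

    def is_probable_prime(n, k=20):
        # Write n - 1 = 2^r * q with q odd, then test k random bases a.
        if n < 2:
            return False
        if n % 2 == 0:
            return n == 2
        r, q = 0, n - 1
        while q % 2 == 0:
            r, q = r + 1, q // 2
        for _ in range(k):
            a = random.randrange(2, n - 1) if n > 4 else 2
            x = pow(a, q, n)
            if x == 1 or x == n - 1:
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False   # a witnesses that n is composite
        return True            # composite only with probability at most 4^(-k)

    print(is_probable_prime(561), is_probable_prime(2 ** 61 - 1))   # False True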

  • In 2002, Agrawal, Kayal, and Saxena ``derandomized’’ a randomized polynomial testing algorithm, showing primality testing is also in \(\P\).

    Yet, randomized algorithms are preferred for primality testing in practice as they are much more efficient.

BPP

  • What is the mathematical model for randomized computations?

  • Is it more powerful than deterministic computation?

Definition

Intuitively, randomized polynomial-time algorithms are polynomial-time algorithms that may occasionally call the \(\mathrm{RandInt}(\cdot)\) API.

A probabilistic Turing machine is a type of nondeterministic Turing machine in which each nondeterministic step is called a coin-flip step and has two legal next moves. We assign probability \(2^{-k}\) to each branch \(b\) of the machine's computation, where \(k\) is the number of coin-flip steps that occur along \(b\).

The probability of the machine accepting the input is defined as

\[\begin{equation*} \Pr(M \text{ accepts } w) = \sum_{b:b \text{ is accepting}} \Pr(b). \end{equation*} \]

The error probability of a machine for deciding a language \(A\) is \(\eps\) if

  1. \(\Pr(M \text{ accepts }w) \ge 1 - \eps\) if \(w\in A\), and
  2. \(\Pr(M \text{ accepts }w) \le \eps\) if \(w\not\in A\).

Note: Error probability \(\eps\) can be a function of the input size \(n\).

\(\BPP\) is the class of languages decided by probabilistic polynomial-time Turing machines with an error probability of at most \(1/3\).

The Definition of BPP and NP

Like \(\NP\), \(\BPP\) has an alternative characterization.

A decision problem \(A\) is in \(\BPP\) if and only if there is a polynomial-time verifier \(V\) such that for all \(x\), \(x \in A\) if and only if

\[\begin{equation*} \Pr_{r} \bigl(V(x, r) = 1 \bigr) \ge \frac{2}{3}. \end{equation*} \]

Error Reduction

We can reduce the error probability of randomized algorithms!

Any decision problem \(A \in \BPP\) has, for every polynomial \(p\), a polynomial-time randomized algorithm whose error probability is at most \(2^{-p(n)}\), where \(n\) is the input size.

Apply the Chernoff bound or the Sampling Theorem!

Circuits for BPP Problems

  • TMs vs. circuits

    For a finite function \(g:\{0,1\}^n \to \{0,1\}\) and a number \(s\), \(g \in \SIZE_n(s)\) if there is a circuit of at most \(s\) NAND gates computing \(g\).

  • To compare TMs and circuits, we must extend the class \(\SIZE_n(s)\) from finite functions to functions with unbounded input lengths.

    Consider \(F : \{0,1\}^* \to \{0,1\}\) and \(T : \natural \to \natural\). For every \(n\), define the restriction of \(F\) to input size \(n\) as

    \[\begin{equation*} F_{\restriction n} (x) = F(x) \text{ for } x\in \{0,1\}^n. \end{equation*} \]
  • We say that \(F\) is non-uniformly computable in at most \(T(n)\) size denoted \(F \in \SIZE(T)\) if there is a sequence \(C_0, C_1, C_2, \ldots\) of NAND circuits such that:

    1. For every \(n\), \(C_n\) computes function \(F_{\restriction n}\).
    2. For every sufficiently large \(n\), \(C_n\) has at most \(T(n)\) gates.

    The non-uniform analog of \(\P\) is the class \(\Ppoly\) defined

    \[\begin{equation*} \Ppoly = \bigcup_{c \in \natural} \SIZE(n^c). \end{equation*} \]
  • \(\P\) vs. \(\Ppoly\)

    1. \(\P\): One TM for all inputs.
    2. \(\Ppoly\): For every input size \(n\), there is a different circuit \(C_n\).

    \(\P \subsetneq \Ppoly\); in fact, \(\Ppoly\) even contains undecidable problems (for example, a unary encoding of the halting problem).

  • We use error reduction to make sure that the error is at most \(0.1 \cdot 2^{-n}\).

    Suppose \(A \in \BPP\). There is a verifier \(V(x,y)\) such that for all \(x\)

    \[\begin{equation*} \Pr_{y} \bigl( V(x,y) \ne A(x) \bigr) \le 0.1 \cdot 2^{-\abs{x}}. \end{equation*} \]

    By the union bound, for each input length \(n\) there is a single string \(y^*\) such that \(V(x, y^*) = A(x)\) for all \(x\) of length \(n\).

    Therefore, hardwiring \(y^*\), there is a circuit of at most \(\poly(n)\) gates for problem \(A\) on inputs of length \(n\).

  • This proves that \(\BPP \subseteq \Ppoly\).

  • Yet, it does not prove \(\BPP = \P\) as we don't know how to compute \(y^*\) efficiently.

P = BPP if P = NP

  • Sipser–Gács Theorem

    Two major conjectures in complexity theory:

    1. \(\P = \NP\) is false.
    2. \(\P = \BPP\) is true.

    We will prove \(\P = \NP\) implies \(\P = \BPP\), so at least one of them is true!

  • Why is it non-trivial?

    \(\P = \NP\) would enable us to guess a good solution.

    But even if \(F(x) = 0\), there may still be a random string \(r\) such that \(V(x,r) = 1\).

  • The proof follows the “quantifier elimination” technique assuming \(\P = \NP\).

    Write \(F \in \BPP\) as a formula \(\forall u \exists v P(u, v)\).

    1. We perform error reduction so that if \(F(x) = 0\), the set \(S\) of random strings making \(V\) accept is tiny; if \(F(x)=1\), the set \(S\) is very large.

    2. We consider shifts of \(S\), \(S \oplus s\).

      In the “no” case, no polynomially many shifts \(s_1, s_2, \ldots, s_k\) can cover all of \(\{0,1\}^m\).

      On the other hand, in the “yes” case, such a polynomial number of shifts \(s_j\) does exist.

    Now, notice that the covering can be written as a quantified formula: for all \(y\), there exist \(s\in S\) and \(i\), such that \(y = s \oplus s_i\).

  • The hard part is to prove the statement for the “yes” case.

    Let \(S \subseteq \{0,1\}^m\) be a set of size at least \(2^{m-1}\). Then there exist \(s_1, s_2, \ldots, s_{2m}\) such that the shifts \(S \oplus s_j\) cover all of \(\{0,1\}^m\).

    We use the probabilistic method, and choose \(s_j\)'s randomly.

    For a point \(z \in \{0,1\}^m\), Let \(\mathrm{BAD}_z\) be the event that \(z\) is not covered by the randomly chosen shifted sets.

    We want to show \(\Pr(\bigcup_z \mathrm{BAD}_z) < 1\), which follows from the union bound if \(\Pr(\mathrm{BAD}_z) < 2^{-m}\). It implies that, over all the possible choices of \(s_j\)'s, at least one choice covers the whole set \(\{0,1\}^m\).

    Let \(\mathrm{BAD}^i_z\) be the event that \(z\) is not in \(S \oplus s_i\).

    \[\begin{equation*} \Pr\bigl( \mathrm{BAD}_z \bigr ) = \Pr \biggl( \bigcap_{i=1}^{2m} \mathrm{BAD}^i_z \biggr) \le \frac{1}{2^{2m}} < \frac{1}{2^m}, \end{equation*} \]

    where the first inequality uses the independence of the events \(\mathrm{BAD}^i_z\) for different \(i\) together with \(\Pr(\mathrm{BAD}^i_z) \le 1/2\), which holds since \(\abs{S} \ge 2^{m-1}\).

Hardness versus Randomness

(Impagliazzo-Wigderson). Suppose there is a problem \(L\) in \(\class{E} = \TIME(2^{O(n)})\) and a fixed \(\delta > 0\) such that, for all sufficiently large \(n\), the restriction \(L_n\) is a Boolean function with circuit complexity at least \(2^{\delta n}\). Then \(\P = \BPP\).

Relation With P, NP, EXP

  • Simple fact: If an \(\NP\)-hard problem \(F\) is in \(\BPP\), then \(\NP\) is contained in \(\BPP\).

  • We know \(\P \ne \EXP\) by the time hierarchy theorem.

  • We know \(\BPP \subseteq \EXP\).

  • We do not know if \(\BPP\) is also different from \(\EXP\). We do not even know if \(\BPP\) is different from \(\NEXP\)!

  • Possibilities

    1. Expected: \(\P = \BPP \subsetneq \NP \subseteq \EXP\).
    2. Extreme \(\BPP\): \(\P \subsetneq \NP \subseteq \BPP = \EXP\).
    3. Extreme \(\P\): \(\P = \NP = \BPP \subsetneq \EXP\).