\( \newcommand{\ket}[1]{\vert#1\rangle} \newcommand{\bra}[1]{\langle#1\vert} \newcommand{\bigket}[1]{\bigl\vert#1\bigr\rangle} \newcommand{\bigbra}[1]{\bigl\langle#1\bigr\vert} \newcommand{\class}[1]{\mathrm{#1}} \newcommand{\problem}[1]{\mathrm{#1}} \renewcommand{\natural}{\mathbb{N}} \newcommand{\real}{\mathbb{R}} \newcommand{\complex}{\mathbb{C}} \newcommand{\ip}[2]{\left\langle#1, #2 \right\rangle} \newcommand{\Tr}{\mathop{\mathrm{Tr}}\nolimits} \newcommand{\tr}{\mathop{\mathrm{tr}}\nolimits} \newcommand{\abs}[1]{\left\lvert#1 \right\rvert} \newcommand{\norm}[1]{\left\lVert#1 \right\rVert} \newcommand{\floor}[1]{\left\lfloor#1 \right\rfloor} \newcommand{\X}{\mathcal{X}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\A}{\mathcal{A}} \newcommand{\B}{\mathcal{B}} \newcommand{\E}{\mathop{\mathbb{E}}} \newcommand{\Var}{\mathop{\mathrm{Var}}} \newcommand{\dif}{\mathrm{d}} \newcommand{\eps}{\epsilon} \newcommand{\sign}{\mathrm{sign}} \newcommand{\poly}{\mathrm{poly}} \newcommand{\polylog}{\mathrm{polylog}} \newcommand{\negl}{\mathrm{negl}} \newcommand{\church}[1]{\overline{#1}} \newcommand{\defeq}{\stackrel{\text{def}}{=}} \newcommand{\pair}[2]{\langle#1, #2\rangle} \newcommand{\tuple}[1]{\langle#1\rangle} \newcommand{\red}{\rightarrow} \newcommand{\reds}{\twoheadrightarrow} \newcommand{\betared}{\rightarrow_{\beta}} \newcommand{\betareds}{\twoheadrightarrow_{\beta}} \newcommand{\betaeq}{=_{\beta}} \newcommand{\betaetared}{\rightarrow_{\beta\eta}} \newcommand{\betaetareds}{\twoheadrightarrow_{\beta\eta}} \newcommand{\parared}{\rightarrow_{\|}} \newcommand{\parareds}{\twoheadrightarrow_{\|}} \newcommand{\desc}[1]{\langle#1 \rangle} \newcommand{\adv}{\mathcal{A}} \newcommand{\dis}{\mathcal{D}} \newcommand{\labelsty}[1]{\mathrm{#1}} \newcommand{\Enc}{\labelsty{Enc}} \newcommand{\Dec}{\labelsty{Dec}} \newcommand{\plain}{\labelsty{p}} \newcommand{\ciper}{\labelsty{c}} \newcommand{\PRG}{\labelsty{PRG}} \newcommand{\Commit}{\labelsty{Com}} \newcommand{\oracle}{\mathcal{O}} \newcommand{\Sim}{\labelsty{Sim}} \newcommand{\comp}{\labelsty{c}} \newcommand{\view}{\labelsty{view}} \newcommand\TIME{\class{TIME}} \newcommand\SPACE{\class{SPACE}} \newcommand\SIZE{\class{SIZE}} \newcommand\NTIME{\class{NTIME}} \newcommand\NSPACE{\class{NSPACE}} \newcommand\BQP{\class{BQP}} \newcommand\BPP{\class{BPP}} \newcommand\QMA{\class{QMA}} \renewcommand\P{\class{P}} \newcommand\Ppoly{\class{P{/}poly}} \newcommand\NP{\class{NP}} \newcommand\NPC{\class{NP}\text{-complete}} \newcommand\coNP{\class{coNP}} \newcommand\MA{\class{MA}} \newcommand\AM{\class{AM}} \newcommand\QCMA{\class{QCMA}} \newcommand\EXP{\class{EXP}} \newcommand\NEXP{\class{NEXP}} \newcommand\PP{\class{PP}} \newcommand\GapP{\class{GapP}} \newcommand\IP{\class{IP}} \newcommand\QIP{\class{QIP}} \newcommand\PSPACE{\class{PSPACE}} \newcommand\NPSPACE{\class{NPSPACE}} \newcommand\MIP{\class{MIP}} \newcommand\RE{\class{RE}} \newcommand\QMAM{\class{QMAM}} \newcommand\PH{\class{PH}} \renewcommand\L{\class{L}} \newcommand\NL{\class{NL}} \newcommand\coNL{\class{coNL}} \newcommand\NLC{\class{NL}\text{-complete}} \newcommand\PPAD{\class{PPAD}} \newcommand\SharpP{\#\class{P}} \newcommand\perm{\mathop{\mathrm{perm}}\nolimits} \newcommand\pathfunc{\mathrm{path}} \newcommand\SELF{\problem{SELF}} \newcommand\HALT{\problem{HALT}} \newcommand\MINTM{\problem{MIN}_\text{TM}} \newcommand\ACCTM{\problem{ACC}_\text{TM}} \newcommand\TQBF{\problem{TQBF}} \newcommand\PAL{\problem{PAL}} \newcommand\PATH{\problem{PATH}} \newcommand\LONGPATH{\problem{LONGEST\text{-}PATH}} 
\newcommand\SHORTPATH{\problem{SHORTEST\text{-}PATH}} \newcommand\HAMPATH{\problem{HAM\text{-}PATH}} \newcommand\HAMCYCLE{\problem{HAM\text{-}CYCLE}} \newcommand\INDSET{\problem{IND\text{-}SET}} \newcommand\MINCUT{\problem{MIN\text{-}CUT}} \newcommand\MAXFLOW{\problem{MAX\text{-}FLOW}} \newcommand\MAXCUT{\problem{MAX\text{-}CUT}} \newcommand\MAXCUTAPP{\problem{MAX\text{-}CUT\text{-}APPROX}} \newcommand\SAT{\problem{SAT}} \newcommand\PERM{\problem{PERMANENT}} \newcommand\THREESAT{3\text{-}\problem{SAT}} \newcommand\TWOSAT{2\text{-}\problem{SAT}} \newcommand\TWOUNSAT{2\text{-}\problem{UNSAT}} \newcommand\NAESAT{\problem{NAESAT}} \newcommand\REACHABILITY{\problem{REACHABILITY}} \newcommand\THREECOL{3\text{-}\problem{COLORING}} \newcommand\CLIQUE{\problem{CLIQUE}} \newcommand\COL{\problem{COLORING}} \newcommand\COMPOSITE{\problem{COMPOSITE}} \newcommand\PRIME{\problem{PRIME}} \newcommand\TSP{\problem{TSP}} \newcommand\VC{\problem{VERTEX\text{-}COVER}} \newcommand\FACTOR{\problem{FACTOR}} \newcommand\GI{\problem{GRAPH\text{-}ISO}} \newcommand\SHORTPROOF{\problem{SHORT\text{-}PROOF}} \newcommand\UHALT{\problem{UNARY\text{-}HALT}} \newcommand\GG{\problem{GENERAL\text{-}GEOGRAPHY}} \newcommand\FG{\problem{FORMULA\text{-}GAME}} \)

Introduction to Theoretical Computer Science

Zhengfeng Ji

Lecture 1: Introduction and Lambda Calculus

About This Course

Welcome

This is an incomplete set of notes prepared by Zhengfeng Ji for the course. Please do not distribute.

History

First time ever!

Theoretical Computer Science

  • Theoretical computer science studies computation and, more generally, information processing from a mathematical perspective.

    What is computation?

    How efficiently can we solve a problem, and why are some problems much harder to solve than others?

    The importance of rigorous, logical, and mathematical thinking.

  • Theory A and B: Algorithms and complexity vs. formal methods and logic

    We will cover topics from both sides.

Why Should You Care About TCS?

Galileo's tenet: The book of nature is written in the language of mathematics.

Logic and mathematics are the languages for computer science.

  • Foundation of computer science: the science part of our department

    CS without theory is like engineering without physics.

  • CS, mathematics, and physics

As stated in the ITCS (Introduction to Theoretical Computer Science) book: computation and information play a role in the 21st century similar to the role that energy and matter played in the 20th century.

A concrete example: It from Qubit initiative supported by the Simons Foundation, which aims to understand quantum field theory and gravity from an information theory perspective.

CS concepts such as \(\NP\)-completeness are gifts to mathematics, physics, economics, and other fields.

Computer science primarily deals with finite structures, whereas mathematics is more inclined towards the study of infiniteness. However, computer scientists often explore exponentially large finite scenarios, making the discussion non-trivial.

  • Even if you are not going to choose theory as your future research direction:

    TCS offers interesting perspectives on the nature of computation and the inherent complexity of problems.

    Computation is not solely about writing programs to solve specific problems. Developing the ability to think about computation theoretically can increase your potential to contribute creatively to the field.

    By including TCS in your computer science education, you become more well-rounded and educated in the field.

    Learning about TCS in computer science is comparable to studying classic poems in a Chinese literature class.

    It is crucial to balance between being capable in practical applications and being knowledgeable about the underlying principles and theories.

Lecturer

  • Zhengfeng Ji

  • East Main Building 8-307

  • My research

    Quantum computing, theory of computing.

Q & A

  • Online: WeChat group, learn.tsinghua.edu.cn, email.

  • Office hour: TBA

  • Your TA: Xingjian Li (lxj22@mails.tsinghua.edu.cn)

Assessment

  1. Homework. (20%)

    Emphasize homework: it is meant to reinforce your understanding of the material and prepares you for the exams!

    Rules.

  2. Midterm. (20%)

  3. Final Exam. (60%)

Textbook

  1. Introduction to Theoretical Computer Science (Boaz Barak)

  2. Introduction to the Theory of Computation (Michael Sipser)

Helpful videos online:

  1. Great Ideas in Theoretical Computer Science by Ryan O'Donnell, CMU

  2. Theory of Computation by Michael Sipser, MIT

Lecture notes will be available at http://itcs.finite-dimensinoal.space after each lecture.

Tips for the Students

  • The subject could be challenging for many of you.

    1. It has a different style and a different set of requirements.

    2. It covers a wide range of topics, from NP-completeness reductions to \(\lambda\)-calculus to the analysis of probabilistic algorithms and protocols.

  • Pause and think about the problem or theorem by yourself before you see the proof.

    Barak: Even a five-minute attempt to solve the problem by yourself will significantly improve your understanding of the topic.

  • Get involved and give us feedback!

Topics

  1. Computational Models
    1. Lambda Calculus
    2. Recursion Theorems
    3. Gödel's Incompleteness Theorem
  2. Complexity Theory
    1. Time Complexity
    2. P vs NP
    3. Space Complexity
    4. Hierarchy Theorems
    5. Non-Uniform Computation
  3. Probabilistic Computation
    1. Probability Theory for Computer Science
    2. Randomized Algorithms
    3. The Class BPP
  4. Cryptography
    1. Encryption and Pseudorandomness
    2. Zero-Knowledge
  5. Logic and Computation
    1. Natural Deduction
    2. Propositions as Types

Lambda Calculus

Is there a functional theory where functions and their arguments are at the same level?

And if so, what use is it?

See more history in Dana Scott's talk

Motivation

  • We have studied Turing machines in FLA where we also mentioned there are many equivalent models of computation.

    The \(\lambda\) calculus is another way to define computable functions.

    \(\lambda\) calculus to CS is like algebra to mathematics.

  • While Turing machines are not used for practical computation, the \(\lambda\) calculus has inspired functional programming languages such as LISP, ML and Haskell, and indirectly the development of many other programming languages as well.

  • The \(\lambda\) operator.

    At the core of the \(\lambda\) calculus is a way to define anonymous functions.

    Function \(f(x) = x \times x\) is represented as \(\lambda x. x \times x\).

    For example, in Python we can define the squaring function using

    lambda x: x*x
  • The study of the \(\lambda\)-calculus is mainly about the set of terms and equations between the terms.

    Both concepts are defined inductively.

    Structural induction.

Proof Theory Notation

  • During the course, we define many sets (also relations) using inductive definitions.

    We will usually use the following notation from proof theory:

    \[\begin{prooftree} \AXC{$P_1\in \mathcal{S}$} \AXC{$\cdots$} \AXC{$P_n\in \mathcal{S}$} \LeftLabel{(name)} \RightLabel{ (side condition)} \TIC{$C\in \mathcal{S}$} \end{prooftree} \]
  • It means that if the premises \(P_1, \ldots, P_n \in \mathcal{S}\) hold satisfying the side conditions, then the conclusion is also in \(\mathcal{S}\).

  • When there are no premises, the rule is sometimes called an axiom.

  • The side conditions can be arbitrary statements, whereas the premises only assert that each \(P_i\) is in \(\mathcal{S}\).

  • When we define the set \(\mathcal{S}\) in this way, we mean that \(\mathcal{S}\) is the least set for which all the rules hold.

Terms

  • The objects of study in the \(\lambda\)-calculus are the terms of the language, and ideas of convertibility or equality between pairs of these terms.

  • The terms are remarkably simple yet unexpectedly powerful objects.

  • Assume the existence of a countable set of variables.

    A term of the untyped \(\lambda\)-calculus is a finite string made up of the symbols \(\lambda\), \((\), \()\), \(.\), and variable identifiers.

    The set of terms \(\Lambda\) is defined by the following Backus-Naur Form (BNF) grammar:

    \[\begin{equation*} s ::= x \mid (\lambda x. s) \mid (s s) \end{equation*} \]

    Or we can define \(\Lambda\) inductively as

    \[\begin{align*} \text{(variable)}\; & \frac{}{x\in \Lambda}\; (x \in \mathcal{V})\\ \text{(abstraction)}\; & \frac{s\in \Lambda}{(\lambda x. s) \in \Lambda}\; (x \in \mathcal{V})\\ \text{(application)}\; & \frac{s \in \Lambda\quad t \in \Lambda}{(s t) \in \Lambda} \end{align*} \]

    Variable, abstraction, application.

  • Note: We use the notation \((st)\), rather than \(s(t)\), to denote the application of a function \(s\) to an argument \(t\).

    This may take some time to get familiar with.
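To make the grammar concrete, here is a minimal sketch in Python (the same language as the lambda x: x*x example above) that represents terms as nested tuples and prints them fully parenthesized. The representation and the helper names var, abs_, app, and show are illustrative choices, not part of the calculus.

    # A term is a nested tuple:
    #   ('var', x)        a variable x (a string)
    #   ('abs', x, body)  an abstraction (lambda x. body)
    #   ('app', s, t)     an application (s t)
    def var(x):        return ('var', x)
    def abs_(x, body): return ('abs', x, body)   # 'abs' is a Python built-in, hence abs_
    def app(s, t):     return ('app', s, t)

    def show(term):
        """Render a term with full parentheses, following the BNF grammar."""
        tag = term[0]
        if tag == 'var':
            return term[1]
        if tag == 'abs':
            return '(\\' + term[1] + '. ' + show(term[2]) + ')'
        return '(' + show(term[1]) + ' ' + show(term[2]) + ')'

    # The term (lambda x. (lambda y. (x z))) from the Subterms example below.
    example = abs_('x', abs_('y', app(var('x'), var('z'))))
    print(show(example))   # (\x. (\y. (x z)))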

Subterms

  • Can be defined inductively.

  • \(\lambda x. (\lambda y. (xz))\)

    \((xz)\) is an application subterm.

    \((\lambda y. (xz))\) is an abstraction subterm.

Abbreviations

  • Parentheses and \(.\) are used to disambiguate the structure of expressions.

  • In practice, we minimize the use of parentheses to avoid notational clutter:

    1. Outermost parentheses can be omitted.
    2. A sequence of applications associates to the left. So \(x y z\) means \(((xy)z)\).
    3. The body of a lambda abstraction (the part after the dot) extends as far to the right as possible. So \(\lambda x. st\) means \(\lambda x.(st)\) not \((\lambda x.s) t\).
    4. Nested abstractions associate to the right and can be collected under a single \(\lambda\) sign. So \(\lambda xy.xz\) means \((\lambda x.(\lambda y.(xz)))\).
    5. When an abstraction occurs on the right of an application, the parentheses around it may be omitted. \(z \lambda xy.xz\) means \((z(\lambda x.(\lambda y.(xz))))\).

Free and Bound Variables

  • Free variables

    The set of free variables \(\text{FV}(s)\) is defined inductively:

    \[\begin{align*} \text{FV}(x) &= \{x\}\\ \text{FV}(st) &= \text{FV}(s) \cup \text{FV}(t)\\ \text{FV}(\lambda x.s) &= \text{FV}(s) \setminus \{x\}. \end{align*} \]
  • A variable \(x\) is free in a term \(s\) if \(x \in \text{FV}(s)\).

  • An occurrence of a variable \(x\) is bound if it is inside a subterm \(\lambda x. u\) of \(s\).

  • As \(x\) can occur multiple times, it could be the case that some occurrences of \(x\) are free while other occurrences are bound.

  • A term \(s\) is called closed if it has no free variables, \(\text{FV}(s) = \emptyset\).

    Closed terms are also sometimes called combinators.

    Some standard combinators are named using boldface letters:

    \[\begin{align*} \mathbf{i} & \equiv \lambda x.x\\ \mathbf{k} & \equiv \lambda xy.x\\ \mathbf{s} & \equiv \lambda xyz.xz(yz)\\ \mathbf{t} & \equiv \lambda xy.x\\ \mathbf{f} & \equiv \lambda xy.y\\ \mathbf{\Omega} & \equiv (\lambda x.xx)(\lambda x.xx)\\ \end{align*} \]
  • Since the names of bound variables are not important, we can rename them without changing the meaning of a term. But we should not rename a bound variable to be the same as a free variable!
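The three defining equations for \(\text{FV}\) above translate line by line into a short recursive function over the nested-tuple term representation used in the earlier Python sketch (again, only an illustration of the definition).

    def free_variables(term):
        """FV(x) = {x}; FV(s t) = FV(s) | FV(t); FV(lambda x. s) = FV(s) - {x}."""
        tag = term[0]
        if tag == 'var':                      # ('var', x)
            return {term[1]}
        if tag == 'app':                      # ('app', s, t)
            return free_variables(term[1]) | free_variables(term[2])
        return free_variables(term[2]) - {term[1]}   # ('abs', x, s)

    # (lambda x. (lambda y. (x z))) is not closed: z occurs free.
    term = ('abs', 'x', ('abs', 'y', ('app', ('var', 'x'), ('var', 'z'))))
    print(free_variables(term))   # {'z'}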

Alpha-Conversion

  • \(\lambda\) is a binding operator

    A bound variable is similar to the variable \(x\) in the summation \(\sum_x f(x)\): it is a dummy variable that can be renamed without changing the meaning of the term.

    Two terms \(s\) and \(t\) are said to be \(\alpha\)-convertible, written \(s \equiv_\alpha t\), if one can derive the same term from both purely by renaming bound variables to fresh variables.

    We sometimes omit \(\alpha\) and write \(\equiv\) only.

  • We consider terms that are \(\alpha\)-convertible to each other to be identical at the syntactic level.

Beta-Conversion

  • Application in \(\lambda\) calculus means substitution.

  • We now give another type of conversion to define what \(\lambda\) does with the argument.

  • The term \(\lambda x.s\) behaves like a function as if \(s\) were parameterized by \(x\).

  • \(\beta\)-convertibility is a binary relation, which we write infix as \(\beta\); it is defined by the scheme

    \[\begin{equation*} (\lambda x.s) t\;\beta\;s[t/x] \end{equation*} \]

    for any terms \(s,t\) where \(s[t/x]\) means the result of substituting \(t\) for every free occurrence of \(x\) in \(s\).

  • Note that \(s[t/x]\) is not itself a term; it denotes the term obtained by performing the substitution.

  • Substitution may be defined inductively.

    \[\begin{align*} x[t/x] & \equiv t,\\ y[t/x] & \equiv y \quad (y \not\equiv x),\\ (su)[t/x] & \equiv (s[t/x])(u[t/x]),\\ (\lambda y.s) [t/x] & \equiv \lambda y. (s[t/x]) \text{ assuming } y \not\equiv x \text{ and } y\not\in \text{FV}(t) \end{align*} \]

    Two complications:

    1. Substitution should only apply to free occurrences of \(x\) in \(s\).

    2. Avoid unintended capture of free variables!

      For example, \(s \equiv \lambda x.yx\) and \(t \equiv \lambda z.xz\).

      What is the result of substituting \(t\) for \(y\) in \(s\)?

      \(\lambda x.(\lambda z.xz)x\) is not correct.

      Rename \(x\) to \(x'\) first in \(s\).

      \(\lambda x'.(\lambda z.xz)x'\) is the correct substitution.

  • \((\lambda xy.yz)[yy/z] \equiv \lambda xu.u(yy)\) (note that the parentheses around \(yy\) are needed here).
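Because the two complications above are easy to get wrong, here is a hedged Python sketch of capture-avoiding substitution over the same nested-tuple terms; the fresh helper, which renames by appending primes, is just one illustrative way to pick fresh variables.

    def free_variables(t):
        if t[0] == 'var': return {t[1]}
        if t[0] == 'app': return free_variables(t[1]) | free_variables(t[2])
        return free_variables(t[2]) - {t[1]}              # ('abs', x, body)

    def fresh(name, avoid):
        """Return a variant of name not in avoid, e.g. x, x', x'', ..."""
        while name in avoid:
            name += "'"
        return name

    def substitute(s, t, x):
        """Return s[t/x]: replace every free occurrence of x in s by t, avoiding capture."""
        tag = s[0]
        if tag == 'var':
            return t if s[1] == x else s
        if tag == 'app':
            return ('app', substitute(s[1], t, x), substitute(s[2], t, x))
        y, body = s[1], s[2]                              # ('abs', y, body)
        if y == x:
            return s                                      # x is bound here, so nothing to do
        if y in free_variables(t):
            # Rename the bound variable first to avoid capturing a free variable of t.
            z = fresh(y, free_variables(t) | free_variables(body))
            body, y = substitute(body, ('var', z), y), z
        return ('abs', y, substitute(body, t, x))

    # The example from the notes: substitute t = (lambda z. x z) for y in s = (lambda x. y x).
    s = ('abs', 'x', ('app', ('var', 'y'), ('var', 'x')))
    t = ('abs', 'z', ('app', ('var', 'x'), ('var', 'z')))
    print(substitute(s, t, 'y'))
    # ('abs', "x'", ('app', ('abs', 'z', ('app', ('var', 'x'), ('var', 'z'))), ('var', "x'")))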

Equational Theory

  • In logic, a theory is a set of formulae closed under some notion of provability.

  • In the formal theories of the \(\lambda\)-calculus, the only formulae are equations between terms.

    '\(s = t\)' for terms \(s\) and \(t\).

  • We write \(\mathcal{T} \vdash s = t\) to mean \(s=t\) is a formula in the theory \(\mathcal{T}\).

The Standard Theory in \(\lambda\)-Calculus

  • The standard theory \(\lambda\beta\) of the \(\lambda\)-calculus is defined by the rules:

    \[\begin{align*} \text{(reflexivity)}\quad & \frac{}{s=s} \\ \text{(symmetry)}\quad & \frac{s=t}{t=s}\\ \text{(transitivity)}\quad & \frac{s=t\quad t=u}{s=u}\\ \text{(application)}\quad & \frac{s=s'\quad t=t'}{st = s't'}\\ \text{(abstraction)}\quad & \frac{s=t}{\lambda x.s = \lambda x.t}\\ (\beta) \quad & \frac{\phantom{\lambda}}{(\lambda x.s)t=s[t/x]} \end{align*} \]
  • A formal proof of an equation in \(\lambda\beta\) is a proof tree constructed by the rules.

    For example,

    \[\begin{prooftree} \AXC{} \LeftLabel{($\beta$)} \UIC{$(\lambda xy.x)p = \lambda y.p$} \AXC{} \LeftLabel{(refl)} \UIC{$q=q$} \LeftLabel{(app)} \BIC{$(\lambda xy.x)pq = (\lambda y.p)q$} \AXC{} \LeftLabel{($\beta$)} \UIC{$(\lambda y.p)q = p$} \LeftLabel{(trans)} \BIC{$(\lambda xy.x)pq = p$} \LeftLabel{(sym)} \UIC{$p = (\lambda xy.x)pq$} \end{prooftree} \]

    is the proof that \(\lambda\beta \vdash p = (\lambda xy.x) pq\).

  • The proof tree looks tedious and we may write a linearized proof of the same fact: \(\lambda \beta \vdash (\lambda xy.x) pq = (\lambda y.p)q = p\).

Eta Rule

  • Another important theory is the \(\lambda\beta\eta\) theory with one additional rule

    \[\begin{align*} (\eta) \quad \frac{}{\lambda x.sx = s} \; (x\not\in \text{FV}(s)) \end{align*} \]
  • This rule does not follow from the \(\beta\) rule (hopefully we will see a proof of this), but it is justified if we apply both sides to another term \(t\).

  • The principle of extensionality: if \(M\) and \(M'\) define the same function, then \(M\) and \(M'\) are equal.

  • If we wish to take an extensional viewpoint then the \(\eta\)-convertible terms should be equated.

  • A \(\lambda\)-theory \(\mathcal{T}\) is one that is closed under all the rules of \(\lambda\beta\).

  • Is \(\mathcal{T}\) consistent? (That is, does it avoid equating all terms? Yes for \(\lambda\beta\)!)

A Fixed-Point Theorem

Consider the fixed-points of the functions \(f(x)=x^2\), \(f(x) = x\), and \(f(x) = x+1\).

For terms \(f\) and \(u\), \(u\) is said to be a fixed-point of \(f\) (in \(\lambda\beta\)) if \(\lambda\beta \vdash fu = u\).

(First Recursion Theorem). Let \(f\) be a term. Then there is a term \(u\) which is a fixed-point of \(f\).

In fact, the fixed-point can be computed in \(\lambda\) calculus itself by using fixed-point combinators. A fixed-point combinator is a closed term \(s\) such that for all terms \(f\), \(f(sf) = sf\). In other words, \(sf\) is a fixed-point of \(f\) for all terms \(f\).

The power of (untyped) \(\lambda\)-calculus: A function can even be its own argument!

There are many fixed-point combinators; two important examples are:

  1. Curry's 'paradoxical' Y combinator:

    \[\begin{equation*} \mathbf{y} \equiv \lambda f.(\lambda x. f(xx))(\lambda x. f(xx)). \end{equation*} \]
  2. Turing's fixed-point combinator:

    \[\begin{equation*} \mathbf{\Theta} \equiv (\lambda xy.y(xxy))(\lambda xy.y(xxy)). \end{equation*} \]
  3. Proof that \(\mathbf{y}\) is a fixed-point combinator!

    Easy!

  4. Do you know the company named Y Combinator?

Extra: Deriving the Y Combinator

  • Say \(f\) is a given term and we want to solve \(X = fX\) to get a fixed-point of \(f\).

    This is not a valid definition of \(X\), as \(X\) appears on the right-hand side.

  • We replace the \(X\) on the right with a variable \(x\) and write \(X' x = f x\).

    However, this does not yet help us find the fixed-point.

  • One weird trick is all we need: use-mention trick, mockingbird, or self-reference.

    Consider \(X'' x = f (x x)\); now we can take \(X = X'' X''\) and have \(X = X'' X'' = f(X'' X'') = f X\)!

    So \(X'' = \lambda x. f(x x)\) and the fixed-point is \(X = X'' X'' = (\lambda x. f(xx))(\lambda x. f(xx))\).

  • Finally, we bind \(f\) to get the Y combinator!

  • Another way to derive the Y combinator is to use the diagonal argument.

Lecture 2: Programming in Lambda Calculus

Lambda Calculus II

Beta-Reduction

  • Motivations

  • Substitution implements the computation

    Substitution is asymmetric, and it is natural to consider the reduction part, \((\lambda x.s) t \betared s[t/x]\).

    The computational drive of \(\lambda\)-calculus.

  • For \(\beta\)-conversion, the reduction part is called \(\beta\)-reduction.

    It is also a binary relation but is not an equivalence relation anymore.

Compatible Closure

  • For any binary relation \(R\) on terms, the one-step \(R\)-reduction is the compatible closure of \(R\), an induced relation denoted \(\red_R\) and defined as follows.

    \[\begin{align*} (R)\; & \frac{}{s \red_R t}\quad (\tuple{s,t} \in R)\\ (\text{Reduction on the left})\; & \frac{s \red_R t}{su \red_R tu}\\ (\text{Reduction on the right})\; & \frac{s \red_R t}{us \red_R ut}\\ (\text{Reduction under abstraction})\; & \frac{s \red_R t}{\lambda x.s \red_R \lambda x.t}\\ \end{align*} \]
  • \(t\) in \(s \red_R t\) is called the reduct of \(s\).

  • The relation \(\reds_R\) is the reflexive, transitive closure of \(\red_R\).

  • The relation \(=_R\) is the reflexive, symmetric, transitive closure of \(\red_R\).

    When \(s =_R t\), we say that \(s\) is \(R\)-convertible to \(t\).

  • We are mostly interested in the \(\beta\)-reduction \(\betared\) defined by relation \(\beta\):

    \[\begin{equation*} (\lambda x.s) t\; \beta\; s[t/x]. \end{equation*} \]
  • One step of \(\beta\)-reduction is denoted as \(\betared\), and the reflexive and transitive closure is denoted \(\betareds\).

Examples

  1. Example 1

    \[\begin{equation*} \begin{split} \mathbf{s} \mathbf{k} \mathbf{k} & \equiv (\lambda x y z. (xz)(yz)) \mathbf{k} \mathbf{k}\\ & \betared (\lambda y z.(\mathbf{k} z)(yz))\, \mathbf{k}\\ & \betared \lambda z.(\mathbf{k} z)(\mathbf{k} z)\\ & \equiv \lambda z.((\lambda xy.x)z)(\mathbf{k} z)\\ & \betared \lambda z.(\lambda y.z)(\mathbf{k} z)\\ & \betared \lambda z.z\\ & \equiv \mathbf{i}. \end{split} \end{equation*} \]
  2. Example 2: \(\mathbf{\Omega} \betared \mathbf{\Omega}\)

  3. Note: \(\mathbf{\Theta} f \betareds f(\mathbf{\Theta} f)\) and \(\mathbf{y} f \betaeq f(\mathbf{y} f)\).

Normal Form

A term \(s\) is in normal form if there is no term \(t\) such that \(s \betared t\).

A term \(s\) has a normal form if there is a term \(t\) in normal form with \(s \betareds t\). Equivalently, we say that \(s\) is normalizable.

A term \(s\) is called strongly normalizable if there does not exist an infinite sequence \(t_1, t_2, \ldots\) such that \(s \betared t_1 \betared t_2 \betared \cdots\).

  • Examples

    \(\lambda x. x\) is in normal form.

    \((\lambda x. x) (\lambda x. x)\) has a normal form and is strongly normalizable.

    \(\mathbf{\Omega}\) is not even normalizable as it reduces to itself.

    \(\mathbf{f}\, \mathbf{\Omega}\, \mathbf{i}\) is normalizable but not strongly normalizable.

Reduction Strategy

  • A fundamental theorem about \(\beta\)-reductions is the following theorem.

    (Reduction Theorem). Leftmost reduction always finds the normal form if it exists.

    Leftmost reduction: always contract the redex \((\lambda x.u)v\) that is leftmost among all redexes in \(s\) (measured by its starting position).

  • The proof is complicated.

  • It is related to the call-by-value/call-by-name distinction in programming languages.
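As a rough illustration of leftmost (normal-order) reduction, the sketch below contracts the leftmost redex repeatedly until a normal form is found or a step bound runs out; it repeats compact versions of the substitution helpers from the earlier sketch so that it runs on its own. The step bound is only a safeguard, since a normal form need not exist.

    # One-step leftmost (normal-order) beta-reduction on nested-tuple terms.
    def fv(t):
        if t[0] == 'var': return {t[1]}
        if t[0] == 'app': return fv(t[1]) | fv(t[2])
        return fv(t[2]) - {t[1]}

    def subst(s, t, x):                      # capture-avoiding s[t/x], as before
        if s[0] == 'var': return t if s[1] == x else s
        if s[0] == 'app': return ('app', subst(s[1], t, x), subst(s[2], t, x))
        y, body = s[1], s[2]
        if y == x: return s
        if y in fv(t):
            z = y
            while z in fv(t) | fv(body): z += "'"
            body, y = subst(body, ('var', z), y), z
        return ('abs', y, subst(body, t, x))

    def step(t):
        """Contract the leftmost redex of t, or return None if t is in normal form."""
        if t[0] == 'app':
            if t[1][0] == 'abs':                       # the whole term is the leftmost redex
                return subst(t[1][2], t[2], t[1][1])
            left = step(t[1])
            if left is not None: return ('app', left, t[2])
            right = step(t[2])
            return None if right is None else ('app', t[1], right)
        if t[0] == 'abs':
            body = step(t[2])
            return None if body is None else ('abs', t[1], body)
        return None                                    # a variable is in normal form

    def normalize(t, limit=1000):
        """Leftmost reduction finds the normal form whenever one exists."""
        for _ in range(limit):
            nxt = step(t)
            if nxt is None:
                return t
            t = nxt
        raise RuntimeError('no normal form reached within the step limit')

    # f Omega i from the Normal Form examples: leftmost reduction reaches i,
    # even though reducing inside Omega forever would never terminate.
    w = ('abs', 'x', ('app', ('var', 'x'), ('var', 'x')))
    omega = ('app', w, w)
    f = ('abs', 'x', ('abs', 'y', ('var', 'y')))
    i = ('abs', 'x', ('var', 'x'))
    print(normalize(('app', ('app', f, omega), i)) == i)   # True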

Church-Rosser Property

Would different reduction strategies lead to different normal forms?

  • Let \(\red\) be a binary relation and \(\reds\) its reflexive transitive closure.

    1. The relation \(\red\) satisfies the diamond property if \(s \red u\) and \(s \red v\) implies \(u \red t\) and \(v \red t\) for some \(t\).

    2. Relation \(\red\) is Church-Rosser if \(\reds\) satisfies the diamond property.

    3. The relation \(\red\) has the unique normal form property if \(s \reds t_1\) and \(s \reds t_2\) for normal forms \(t_1\) and \(t_2\) implies that \(t_1 \equiv t_2\).

Simple Observations

If \(\red\) is Church-Rosser, then it has the unique normal form property.

Easy.

If \(\red\) satisfies the diamond property, then \(\reds\) satisfies the diamond property.

Proof by picture.

Suppose \(\red\) is a relation with the Church-Rosser property and \(=\) is its reflexive, symmetric, transitive closure. Then \(s = t\) if and only if there is some \(u\) such that \(s \reds u\) and \(t \reds u\).

Proof by picture.

What does \(s = t\) mean in the picture?

  • A corollary is that if \(\red\) has Church-Rosser property and \(s\) and \(t\) are normal forms for \(\red\) satisfying \(s = t\), then \(s \equiv t\).

  • \(\eta\) rule does not follow from \(\beta\)

    Terms \(x\) and \(\lambda y.xy\) are not \(\beta\)-convertible as they are normal forms that are not \(\alpha\)-equivalent.

Church-Rosser for \(\beta\)-Reduction

  • A fundamental result of \(\lambda\)-calculus is that \(\beta\)-reduction has the Church-Rosser property.

    (Church and Rosser, 1936). \(\beta\)-reduction has the Church-Rosser property.

  • Note: \(\betared\) does not have the diamond property.

    You may duplicate a term in one step and then need two steps to cancel it.

    \[\begin{align*} s & \equiv (\lambda x.xx)((\lambda x.y) z),\\ u & \equiv ((\lambda x.y) z)((\lambda x.y) z),\\ v & \equiv (\lambda x.xx) y. \end{align*} \]

Proof Strategy

  • How do we prove Church-Rosser (CR)?!

    One possible idea is to prove something between diamond and CR in Figure (b).

    Unfortunately, this does not work as (b) does not imply (c) in general!

    Infinite brick tiling picture

  • The proof we show here is given by Martin-Löf and Tait.

    We define a parallel version of \(\beta\)-reduction.

    The deduction rules:

    The parallel \(\beta\)-reduction \(\parared\) is defined by the rules

    \[\begin{align*} \text{(refl)}\; & \frac{}{s \parared s}\\ \text{(abs)}\; & \frac{s \parared s'}{\lambda x. s \parared \lambda x. s'}\\ \text{(\(\|\)-app)}\; & \frac{s \parared s'\quad t \parared t'}{ st \parared s't'}\\ \text{(\(\|\)-$\beta$)}\; & \frac{s \parared s'\quad t \parared t'}{ (\lambda x. s)t \parared s'[t'/x]} \end{align*} \]

Proof

We prove two results about \(\parared\):

  1. It has the diamond property!
  2. \(\betareds\) is the transitive closure of \(\parared\).

Part 2 is easy to show and follows from the observation that \(\betared \;\subseteq\; \parared \;\subseteq\; \betareds\).

For Part 1, we want to prove that if \(s \parared u\) and \(s \parared v\), then there is \(t\) such that \(u \parared t\) and \(v \parared t\).

We prove the claim using an induction on \(s\) and a case analysis on the rules showing \(s \parared u\) and \(s \parared v\).

  1. Case 1. \(s \parared u\) is by (refl) and \(s \parared v\) is arbitrary. In this case, \(s \equiv u\) and we can take \(t \equiv v\). We can similarly handle the case when \(s \parared v\) uses (refl).

  2. Case 2. \(s \parared u\) is by (\(\|\)-app) and \(s \parared v\) is also by (\(\|\)-app). So \(s \equiv s_1s_2\), \(u \equiv u_1u_2\) and \(v \equiv v_1v_2\) with \(s_i \parared u_i\) and \(s_i \parared v_i\) for \(i=1, 2\). By the induction hypothesis, we have \(t_1, t_2\) such that \(u_i \parared t_i\) and \(v_i \parared t_i\) for \(i=1, 2\). We can take \(t \equiv t_1t_2\), and the claim follows by rule (\(\|\)-app).

  3. Case 3. \(s \parared u\) is by (\(\|\)-app) and \(s \parared v\) is by (\(\|\)-\(\beta\)). In this case, \(s\) can be reduced using both the parallel application and parallel beta rule. This means that it has form \(s \equiv (\lambda x.s_1)s_2\). Similarly \(u = (\lambda x.u_1)u_2\) (this is not obvious), \(v = v_1[v_2/x]\) and \(s_i \parared u_i\), \(s_i \parared v_i\) for \(i=1,2\). Apply the induction hypothesis, and we have \(t_1, t_2\) such that \(u_i \parared t_i\) and \(v_i \parared t_i\) for \(i=1,2\). We take \(t = t_1[t_2/x]\). \(u \parared t\) by the (\(\|\)-\(\beta\)) rule. We claim that \(v \equiv v_1[v_2/x] \parared t\), which we prove in the Substitution Lemma below.

    The case where the first reduction uses the parallel beta rule and the second uses the parallel application is similar.

  4. Case 4. Both use (\(\|\)-\(\beta\)). Easy.

  5. Case 5. If \(s \parared u\) is by (abs), then \(s \parared v\) is by (abs) too. Easy.

(Substitution Lemma). If \(s \parared s'\) and \(t \parared t'\), then \(s[t/x] \parared s'[t'/x]\).

Omit the proof of the lemma in class. The proof is again an induction on the structure of \(s\).

Lambda Calculus as a Programming Language

Lambda calculus is the foundation for many functional programming languages.

Lambda calculus can encode both data and programs that operate on the data without adding any additional syntax or axioms.

Program vs data and von Neumann architecture.

Booleans

  • We use the following combinators to encode true and false,

    \[\begin{align*} \mathbf{t} &= \lambda xy.x\\ \mathbf{f} &= \lambda xy.y, \end{align*} \]

    instead of introducing new keywords as in other languages.

    Define \(\mathbf{and} = \lambda ab.ab\mathbf{f}\).

  • It is easy to verify

    \[\begin{align*} \mathbf{and}\, \mathbf{t} \mathbf{t} & \betareds \mathbf{t},\\ \mathbf{and}\, \mathbf{t} \mathbf{f} & \betareds \mathbf{f},\\ \mathbf{and}\, \mathbf{f} \mathbf{t} & \betareds \mathbf{f},\\ \mathbf{and}\, \mathbf{f} \mathbf{f} & \betareds \mathbf{f}. \end{align*} \]
  • Note that \(\mathbf{t}\) and \(\mathbf{f}\) are normal forms, so we can say that a term such as \(\mathbf{and}\, \mathbf{t} \mathbf{t}\) evaluates to \(\mathbf{t}\).

  • We say that \(\mathbf{and}\) represents the boolean function and. It is understood that this encoding is with respect to the particular encoding of true and false.

  • There are other ways to encode and other than \(\lambda ab.ab\mathbf{f}\).

    Another possibility is \(\lambda ab.bab\).

  • We can define \(\mathbf{ite} = \lambda x.x\). Show that

    \[\begin{align*} \mathbf{ite}\, \mathbf{t} s t & \betareds s,\\ \mathbf{ite}\, \mathbf{f} s t & \betareds t. \end{align*} \]

    Note that you can always omit \(\mathbf{ite}\).
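Since Python has first-class anonymous functions, the boolean encodings transcribe directly into curried lambdas. The names true_, false_, and_, ite, and as_bool below are illustrative Python identifiers only; as_bool decodes a Church boolean by applying it to the Python values True and False.

    # Church booleans, transcribed from the definitions above (curried, one argument at a time).
    true_  = lambda x: lambda y: x              # t   = \x y. x
    false_ = lambda x: lambda y: y              # f   = \x y. y
    and_   = lambda a: lambda b: a(b)(false_)   # and = \a b. a b f
    ite    = lambda x: x                        # ite = \x. x

    as_bool = lambda b: b(True)(False)          # decode a Church boolean

    print(as_bool(and_(true_)(true_)))          # True
    print(as_bool(and_(true_)(false_)))         # False
    print(ite(true_)('then')('else'))           # then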

Natural Numbers

  • If \(f\) and \(x\) are terms, and \(n\ge 0\) is a natural number, we write \(f^n x\) for the term \(f(f(\cdots(f x)\cdots))\) where \(f\) occurs \(n\) times.

  • For each natural number \(n\), define a term \(\church{n}\) as

    \[\begin{equation*} \church{n} \equiv \lambda fx.f^n x. \end{equation*} \]

    It is the \(n\)-th Church numeral encoding natural number \(n\) in lambda calculus.

    The first few Church numerals are

    \[\begin{align*} \church{0} &\equiv \lambda fx.x,\\ \church{1} &\equiv \lambda fx.fx,\\ \church{2} &\equiv \lambda fx.f(fx),\\ \church{3} &\equiv \lambda fx.f(f(fx)),\\ & \cdots \end{align*} \]

    This particular way of encoding the natural numbers is due to Alonzo Church.

    This may be more natural if you recall the Peano axioms for natural numbers.

    1. Zero is a natural number.
    2. Every natural number has a successor in the natural numbers.
    3. Zero is not the successor of any natural number.
    4. If the successor of two natural numbers is the same, then the two original numbers are the same.
    5. If a set contains zero and contains the successor of every number in it, then the set contains all the natural numbers.

    \(f\) and \(x\) play the roles of the successor function and zero, respectively.

  • Note that \(\church{0}\) is the same as \(\mathbf{f}\), which is a neat coincidence.

  • We should know beforehand whether to interpret the result as a boolean value or a numeral in untyped lambda calculus.

Successor, Addition, and Multiplication

  • The successor function can be defined as \(\mathbf{succ} = \lambda n f x.f(n f x)\).

    To see this, we check that

    \[\begin{align*} \mathbf{succ}\, \church{n} & \equiv (\lambda n f x.f(n f x)) \church{n}\\ & \betared \lambda f x.f(\church{n} f x)\\ & \betareds \lambda f x.f(f^n x)\\ & \equiv \church{n+1}. \end{align*} \]

    Thus, we have proved that the term \(\mathbf{succ}\) represents the successor function when applied to a numeral.

  • We can also define addition and multiplication as

    \[\begin{align*} \mathbf{add} &\equiv \lambda n m f x . n f(m f x),\\ \mathbf{mult} &\equiv \lambda n m f. n (m f). \end{align*} \]
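As a quick sanity check, the numerals and these arithmetic terms can likewise be transcribed into curried Python lambdas; church and to_int are illustrative helpers that build a numeral and decode it by applying it to the ordinary successor.

    # Church numerals and arithmetic, transcribed from the definitions above.
    church = lambda n: lambda f: lambda x: x if n == 0 else f(church(n - 1)(f)(x))

    succ = lambda n: lambda f: lambda x: f(n(f)(x))               # \n f x. f (n f x)
    add  = lambda n: lambda m: lambda f: lambda x: n(f)(m(f)(x))  # \n m f x. n f (m f x)
    mult = lambda n: lambda m: lambda f: n(m(f))                  # \n m f. n (m f)

    to_int = lambda numeral: numeral(lambda k: k + 1)(0)          # decode by counting

    print(to_int(succ(church(3))))              # 4
    print(to_int(add(church(2))(church(5))))    # 7
    print(to_int(mult(church(3))(church(4))))   # 12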

Currying

  • In the definition of \(\mathbf{mult}\), we used the idea of Currying.

  • For traditional functions of two or more arguments, say \(f(x, y)\), we must have all the arguments ready to evaluate \(f(x, y)\).

  • In \(\lambda\)-calculus, the expression \(\lambda x y. \mathbf{add}\, x\, y\) can take a single argument \(\church{a}\) and \(\beta\)-reduces to

    \[\begin{equation*} \lambda y. \mathbf{add}\, \church{a}\, y, \end{equation*} \]

    which is a function.

  • Similarly, for the Church numeral \(\church{m} = \lambda fx.f^m x\), \(\church{m} f\) reduces to the term \(\lambda x.f^m x\) representing the function taking \(x\) to \(f^m x\).

  • This method is known as Currying.

  • Uncurrying is the dual process: it turns a function \(f\) that returns another function \(g\) into a single function that takes the arguments of both \(f\) and \(g\) at once.

General Functions

  • More generally, suppose \(f: \natural^k \to \natural\) is a \(k\)-ary function on the natural numbers. We say that a lambda term \(s\) represents \(f\) if for all \(n_1, n_2, \ldots, n_k \in \natural\),

    \[\begin{equation*} s\, \church{n_1} \cdots \church{n_k} \betareds \church{f(n_1, \ldots, n_k)}. \end{equation*} \]
  • This definition easily generalizes to boolean functions or functions of other data types.

Examples

  1. Prove that \(\mathbf{iszero} = \lambda n x y. n (\lambda z.y) x\) represents the boolean function that evaluates to true if and only if the input is \(0\).

    \[\begin{align*} \mathbf{iszero} \, \church{0} & \equiv (\lambda n x y. n (\lambda z.y) x) (\lambda f x.x)\\ & \betared \lambda x y.(\lambda f x.x)(\lambda z.y) x\\ & \betareds \lambda x y.x\\ & \equiv \mathbf{t} \end{align*} \]

    For \(n \ge 1\), we have

    \[\begin{align*} \mathbf{iszero} \, \church{n} & \equiv (\lambda n x y. n (\lambda z.y) x) \church{n}\\ & \betared \lambda x y.\church{n}(\lambda z.y) x\\ & \betareds \lambda x y.y\\ & \equiv \mathbf{f} \end{align*} \]
  • You will work out the \(\exp\) function and the \(\mathrm{pred}\) function in your homework:

    \[\begin{equation*} \mathrm{pred}(n) = \begin{cases} 0 & \text{ if } n = 0,\\ n-1 & \text{ otherwise.} \end{cases} \end{equation*} \]
  • In fact, Church believed for a while that it was impossible until his student Kleene found a solution. (In fact, Kleene said he found the solution while having his wisdom teeth pulled, so his trick for defining the predecessor function is sometimes referred to as the wisdom teeth trick.)

    Important for showing equivalence between recursive functions and \(\lambda\)-calculus.
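The iszero computation above can also be replayed with curried Python lambdas; the identifiers below are only illustrative, and as_bool decodes the resulting Church boolean.

    # The iszero term from the example above.
    church = lambda n: lambda f: lambda x: x if n == 0 else f(church(n - 1)(f)(x))

    iszero = lambda n: lambda x: lambda y: n(lambda z: y)(x)   # \n x y. n (\z. y) x

    as_bool = lambda b: b(True)(False)
    print(as_bool(iszero(church(0))))   # True
    print(as_bool(iszero(church(3))))   # False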

Fix-Point Combinators in Use

  • How do we define the factorial function?

  • Using recursion, we can define

    \[\begin{align*} \mathbf{fact}\, n & := \mathbf{ite}\, (\mathbf{iszero}\, n)\, \church{1}\,\\ &\qquad (\mathbf{mult}\, n\, (\mathbf{fact}\, (\mathbf{pred}\, n))). \end{align*} \]
  • However, \(\lambda\)-calculus does not allow us to write recursive function definitions directly.

  • So we rewrite the equation as

    \[\begin{align*} \mathbf{fact} & := \lambda n. \mathbf{ite}\, (\mathbf{iszero}\, n)\, \church{1}\,\\ &\qquad (\mathbf{mult}\, n\, (\mathbf{fact}\, (\mathbf{pred}\, n))),\\ \mathbf{fact} & := (\lambda f. \lambda n. \mathbf{ite}\, (\mathbf{iszero}\, n)\, \church{1}\,\\ &\qquad (\mathbf{mult}\, n\, (f\, (\mathbf{pred}\, n))))\, \mathbf{fact}.\\ \end{align*} \]

    Choose \(\mathbf{fact}\) to be the fixed-point of the term \(\mathbf{fac} \equiv \lambda f. \lambda n. \mathbf{ite}\, (\mathbf{iszero}\, n)\, \church{1}\, (\mathbf{mult}\, n\, (f\, (\mathbf{pred}\, n)))\).

    That is, we can solve this fixed-point equation (up to \(\beta\)-equivalence) by letting \(\mathbf{fact} \equiv \mathbf{y}\, \mathbf{fac}\) where \(\mathbf{y}\) is the Y combinator!

  • Example: \(\mathbf{fact}\, \church{2}\)

    \[\begin{equation*} \begin{split} \mathbf{fact}\, \church{2} & \equiv \mathbf{y}\,\mathbf{fac}\, \church{2}\\ & = \mathbf{fac} (\mathbf{y}\, \mathbf{fac})\, \church{2}\\ & \betareds \mathbf{ite}\, (\mathbf{iszero}\, \church{2})\, \church{1}\, (\mathbf{mult}\, \church{2}\, (\mathbf{y}\, \mathbf{fac}\, (\mathbf{pred}\, \church{2})))\\ & \betareds \mathbf{mult}\, \church{2}\, (\mathbf{y}\, \mathbf{fac}\, \church{1})\\ & \equiv \mathbf{mult}\, \church{2}\, (\mathbf{fact}\, \church{1}). \end{split} \end{equation*} \]

    Using the two main theorems about \(\lambda\)-calculus, we know that \(\mathbf{fact}\, \church{2}\) evaluates to the correct answer in the normal form.
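The same construction can be tried in Python, with one caveat worth hedging: Python evaluates arguments eagerly (call by value), so Curry's y combinator would loop forever; the eta-expanded variant below, often called the Z combinator, delays the self-application and behaves like y under strict evaluation. For readability the sketch uses ordinary integers and if/else in place of Church numerals and ite.

    # Z combinator: a fixed-point combinator that works under call-by-value.
    z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

    # fac, the term whose fixed point is the factorial function.
    fac = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

    fact = z(fac)        # fact = z fac, a fixed point of fac
    print(fact(2))       # 2
    print(fact(5))       # 120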

Lecture 3: Church-Turing Thesis

Data Structures in \(\lambda\)-Calculus

Data Structures: Pair and Tuples

  • If \(s,t\) are \(\lambda\) terms, we define \(\pair{s}{t}\) to be the term \(\lambda x.x s t\).

    Alternatively, we can write \(\mathbf{pair} \equiv \lambda ab.\lambda x. ((x a) b)\), which encloses the pair of \(a\) and \(b\) in function \(x\).

    You can think of \(\pair{s}{t}\) as a syntactic sugar for \(\mathbf{pair}\, st\) or \(\lambda x. x s t\).

  • Why is it a pair?

  • Define terms \(\pi_1 \equiv \mathbf{fst} \equiv \lambda p. p\, \mathbf{t}\) and \(\pi_2 \equiv \mathbf{snd} \equiv \lambda p. p\, \mathbf{f}\).

    \[\begin{align*} \mathbf{fst}\, \pair{s}{t} & \equiv (\lambda p. p\, \mathbf{t}) \lambda x. x s t\\ & \betared (\lambda x. x s t)\, \mathbf{t}\\ & \betared \mathbf{t}\, s t\\ & \betareds s. \end{align*} \]
  • Similarly \(\mathbf{snd}\, \pair{s}{t} \betareds t\).

  • For terms \(s_1, s_2, \ldots, s_k\), we can define the tuple \(\langle s_1, s_2, \ldots, s_k \rangle\) of them as the term \(\lambda x.x s_1 \cdots s_k\). The \(i\)-th projection \(\pi_i^k \equiv \lambda p. p(\lambda x_1 \cdots x_k . x_i)\).
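In curried Python lambdas the pair encoding looks as follows; pair, fst, and snd are illustrative names, and the projections work exactly by the reduction shown above.

    # Pairs as functions: <a, b> = \x. x a b, with projections via the Church booleans.
    true_  = lambda x: lambda y: x
    false_ = lambda x: lambda y: y

    pair = lambda a: lambda b: lambda x: x(a)(b)   # pair = \a b x. x a b
    fst  = lambda p: p(true_)                      # fst  = \p. p t
    snd  = lambda p: p(false_)                     # snd  = \p. p f

    p = pair('left')('right')
    print(fst(p))   # left
    print(snd(p))   # right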

Data Structures: Lists

  • We write \(\mathbf{nil}\) to denote the empty list.

    We write \(h :: t\) for the list whose first element is \(h\) and \(t\) is the tail (the remaining list).

    For example, \(1 :: (2 :: (3 :: \mathbf{nil}))\) is a list of three elements.

  • Assuming the convention that \(::\) associates to the right, we can omit the parenthesis.

  • Corresponding terms

    \[\begin{align*} \mathbf{nil} & \equiv \lambda x. \mathbf{t}\\ h :: t & \equiv \mathbf{cons}\, h\, t \text{ where } \mathbf{cons} \equiv \mathbf{pair}. \end{align*} \]

    We consider \(1 :: 2 :: 3 :: \mathbf{nil}\) as the syntactic sugar for \(\mathbf{cons}\, 1\, (\mathbf{cons}\, 2\, (\mathbf{cons}\, 3\, \mathbf{nil}))\).

    Note: The definition is not unique. For example, you may also take \(\mathbf{nil} \equiv \mathbf{f}\), but the form of \(\mathbf{empty}\) would be different.

  • Head, tail and empty

    \[\begin{align*} \mathbf{head} & \equiv \mathbf{fst},\\ \mathbf{tail} & \equiv \mathbf{snd},\\ \mathbf{empty} & \equiv \lambda l. l\, \lambda h t. \mathbf{f}. \end{align*} \]

    You may check that \(\mathbf{empty} \, \mathbf{nil} \betareds \mathbf{t}\) and \(\mathbf{empty} \, (h :: t) \betareds \mathbf{f}\).

List Operations

  • \(\mathbf{map}, \mathbf{reduce}, \mathbf{filter}\).

    Given a list \(l = x_0 :: x_1 :: \cdots :: x_{n-1} :: \mathbf{nil}\) and a function \(f\), \(\mathbf{map}\, l\, f\) applies \(f\) to all entries in the list \(l\) to obtain a new list \(l' = f(x_0) :: f(x_1) :: \cdots :: f(x_{n-1}) :: \mathbf{nil}\).

  • For a Boolean function \(f\), \(\mathbf{filter}\, l\, f\) returns the list \(l'\) containing only elements \(x_i\) such that \(f x_i \betareds \mathbf{t}\).

  • The term \(\mathbf{reduce}\) takes a list \(l\), an operation \(f\) taking two arguments, and a lambda term \(z\) which we think of as the neutral element of the operation \(f\).

    Draw pictures

  • We can build \(\mathbf{map}\) and \(\mathbf{filter}\) from \(\mathbf{reduce}\) and build \(\mathbf{reduce}\) using fixed-point combinators.

    \(\mathbf{map} \equiv \lambda lf . \mathbf{reduce}\, l\, (\lambda ab . \mathbf{cons}\, (f\,a)\, b)\, \mathbf{nil}\).

  • These terms motivated Google's MapReduce framework for processing big data sets.

  • Define \(\mathbf{reduce}\) recursively.

    First, we want \(\mathbf{reduce}\, \mathbf{nil}\, f\, z = z\). Then, we want to set \(\mathbf{reduce}\, (h :: t)\, f\, z\) to \(f\, h\, (\mathbf{reduce}\, t\, f\, z)\). So, we may define \(\mathbf{reduce}\) as the fixed-point of

    \[\begin{equation*} \mathbf{red} \equiv \lambda s l f z. \mathbf{ite}\, (\mathbf{empty}\, l)\, z\, (f\, (\mathbf{head}\, l)\, (s\, (\mathbf{tail}\, l)\, f\, z)). \end{equation*} \]

    Finally, \(\mathbf{reduce} \equiv \mathbf{y}\, \mathbf{red}\) completes the construction.
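Below is a hedged Python sketch of the whole construction: Church-encoded lists, reduce obtained as a fixed point (using the Z combinator, the call-by-value variant of y), and map built from reduce. Because Python is strict, the two branches of the conditional in red must be wrapped in thunks (lambda: ...) and forced after selection; in the \(\lambda\)-calculus itself this is exactly what leftmost reduction takes care of. The helper to_list is only for decoding the result.

    # Church-encoded lists with reduce and map, following the definitions above.
    true_  = lambda x: lambda y: x
    false_ = lambda x: lambda y: y

    pair = lambda a: lambda b: lambda x: x(a)(b)
    fst, snd = (lambda p: p(true_)), (lambda p: p(false_))

    nil   = lambda x: true_                            # nil   = \x. t
    cons  = pair                                       # cons  = pair
    head, tail = fst, snd
    empty = lambda l: l(lambda h: lambda t: false_)    # empty = \l. l (\h t. f)

    # Z combinator (call-by-value fixed-point combinator).
    z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

    # red, with the two branches delayed as thunks because Python is strict.
    red = lambda s: lambda l: lambda f: lambda zero: (
        empty(l)(lambda: zero)(lambda: f(head(l))(s(tail(l))(f)(zero)))())
    reduce_ = z(red)                                   # reduce = y red

    map_ = lambda l: lambda f: reduce_(l)(lambda a: lambda b: cons(f(a))(b))(nil)

    def to_list(l):
        """Decode a Church list into a Python list, for inspection only."""
        out = []
        while not empty(l)(True)(False):
            out.append(head(l))
            l = tail(l)
        return out

    xs = cons(1)(cons(2)(cons(3)(nil)))                # the list 1 :: 2 :: 3 :: nil
    print(reduce_(xs)(lambda a: lambda b: a + b)(0))   # 6
    print(to_list(map_(xs)(lambda a: a * a)))          # [1, 4, 9]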

Functional Programming

  • Advantages:

    1. Immutability helps eliminate side effects
    2. First-class and higher-order functions
    3. Declarative style
    4. Concurrency made easier
  • It's possible to realize all the \(\lambda\) terms mentioned above in Python.

  • Yet, it is recommended that you learn a fully functional programming language, such as Scheme, Haskell, or Scala.

Church-Turing Thesis

What is Computation

Hilbert, Gödel, Church, Turing

  • Hilbert's 10th problem (1900)

    Devise a mechanical process determining in a finite number of operations whether a multivariate polynomial with integer coefficients has an integer solution.

    It was answered in the negative by Matiyasevich, Robinson, Davis, and Putnam in 1970.

    The analogous question, whether the polynomial has a real root, is decidable by a result of Tarski proved in 1951.

  • Hilbert's Entscheidungsproblem (Decision Problem, 1928)

    Given a sentence in first-order logic, give an effectively calculable procedure for determining if it's provable.

    “We must know, we will know.”

  • We have defined functions to be computable by

    1. Recursive functions (Kurt Gödel)
    2. \(\lambda\)-calculus (Alonzo Church)
    3. Turing machines (Alan Turing)
    4. Cellular automaton, Conway's game of life (Simple rules in a distributed system can create complex behavior)
    5. RAM
    6. Python, C

    The Veritasium video Math's Fundamental Flaw is recommended. It talks about cellular automata, Turing completeness, and Gödel's incompleteness theorem.

  • The Turing machine was quickly accepted by many pioneers as the definition of computation, so game over!

    Why is the Turing machine special among all equivalent models?

    The Turing machine captures the intuitive notion of computation!

    “Computing is normally done by writing certain symbols on paper. We may suppose that this paper is divided into squares like a child's arithmetic book. The behavior of the [human] computer at any moment is determined by the symbols which he is observing, and of his 'state of mind' at that moment… We may suppose that in a simple operation not more than one symbol is altered. We compare a man in the process of computing … to a machine which is only capable of a finite number of configurations … The machine is supplied with a ‘tape’ (the analogue of paper) … divided into sections (called ‘squares’) each capable of bearing a ‘symbol’”, Alan Turing, 1936.

  • The Church-Turing thesis claims this is the only sensible definition of computable functions!

    “[In 1934], Church had been speculating, and finally definitely proposed, that the \(\lambda\)-definable functions are all the effectively calculable functions … When Church proposed this thesis, I sat down to disprove it … but, quickly realizing that [my approach failed], I became overnight a supporter of the thesis.”, Stephen Kleene, 1979.

  • No candidate computing device, including quantum computers, is a serious challenge to the Church-Turing thesis.

    Extended Church-Turing thesis.

Turing Machines Revisited

  • A Turing machine consists of

    1. A finite control unit
    2. A 1-D tape, divided into cells; each cell holds one symbol from a finite alphabet.
    3. A head positioned at one of the cells
  • Initial configuration:

    Input is a finite-length string over the input alphabet; it is placed on the tape. All other cells, extending indefinitely to the left and right, hold the blank symbol. The head is positioned at the leftmost cell of the input.

  • How a TM computes

    The transition is a function of the state of the machine, and the tape symbol scanned. In one move, it will

    1. Change the state.
    2. Write a symbol in the cell scanned.
    3. Move the head one cell left or right.

    More formally, the computation is governed by the transition function \(\delta: Q \times \Sigma \to Q \times \Sigma \times \{\text{L},\text{R}\}\).

    It will perform the above step-by-step computation until it reaches the accept or reject state.

    Accepting/rejecting configurations are halting configurations.

  • Most of the time, we use high-level descriptions of a TM for simplicity.
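For concreteness, here is a small hedged sketch of a single-tape TM simulator; the dictionary delta plays the role of the transition function \(\delta: Q \times \Sigma \to Q \times \Sigma \times \{\text{L},\text{R}\}\), and the toy machine at the end (which flips bits and accepts at the first blank) is an invented illustration, not an example from the lecture.

    # A minimal single-tape Turing machine simulator (a sketch only).
    def run_tm(delta, start, accept, reject, input_string, blank='_', max_steps=10_000):
        tape = dict(enumerate(input_string))   # sparse tape; unwritten cells hold the blank
        state, head = start, 0                 # initial configuration: head on the leftmost input cell
        for _ in range(max_steps):
            if state in (accept, reject):      # halting configurations
                cells = range(min(tape, default=0), max(tape, default=0) + 1)
                return state, ''.join(tape.get(i, blank) for i in cells)
            symbol = tape.get(head, blank)
            state, written, move = delta[(state, symbol)]
            tape[head] = written
            head += 1 if move == 'R' else -1
        raise RuntimeError('no halting configuration reached within the step limit')

    # A toy machine: scan right, flipping 0s and 1s, and accept at the first blank.
    delta = {
        ('q0', '0'): ('q0', '1', 'R'),
        ('q0', '1'): ('q0', '0', 'R'),
        ('q0', '_'): ('qa', '_', 'R'),
    }
    print(run_tm(delta, 'q0', 'qa', 'qr', '0110'))   # ('qa', '1001_')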

Important Facts About Turing Machines

  1. We can encode Turing machines as binary strings (or integers).

    They are the codes for Turing machines.

    The description \(\desc{M}\) of a Turing machine \(M\) is an effective encoding of \(M\) as a string that is easy to parse.

  2. A configuration (or instantaneous description) of a TM \(M\) at a certain time step is the collection of information recording

    1. tape content,
    2. current state, and,
    3. head position.

    It is convenient to encode it as a string \(\alpha\) in \(\Gamma^*\) where \(\Gamma = (Q \times \Sigma) \cup \Sigma\).

  3. There is a universal Turing machine \(U\) that, given the description of another machine \(\desc{M}\) and string \(w\) as input, simulates the computation of \(M\) on input \(w\) (for \(t\) steps).

  4. Variants of TMs are equivalent in terms of computability.

    Multiple-tape TMs

  5. Nondeterministic TMs

    A nondeterministic TM accepts if there exists an accepting branch.

    Let \(M_N\) be a nondeterministic TM. Then there is a deterministic TM \(M_D\) such that \(L(M_N) = L(M_D)\).

    Proof idea: Use a two-tape machine; one of its tapes maintains a queue of configurations of \(M_N\). Breadth-first search on the nondeterministic computation tree.

Undecidability Review

  • A language or problem is Turing-recognizable if it is the collection of strings \(L(M)\) accepted by a Turing machine \(M\).

  • A language or problem is decidable if there is a machine recognizing it and the machine always halts.

  • Given two TM descriptions \(\desc{M_1}\) and \(\desc{M_2}\), do they act the same on all inputs?

  • Does it print Hello World!?

    # A hypothetical subroutine that searches for a counterexample to
    # Goldbach's conjecture and returns only if one is found.
    GoldbachCounterexample()
    print("Hello World!")
  • Just run it?

  • Halting problem and diagonal argument

    \(\HALT = \{ \tuple{M, w} \mid M \text{ halts on input } w\}\).

    Suppose there is a machine \(H\) that decides the Halting problem.

    Define machine \(D\) = “On input \(\desc{M}\), the encoding of a Turing machine:
    1. Simulate \(H\) on input \(\tuple{M, \desc{M}}\);
    2. Loop if \(H\) accepts;
    3. Halt if \(H\) rejects.”

    So \(D(\desc{M})\) loops if \(M(\desc{M})\) halts and \(D(\desc{M})\) halts if \(M(\desc{M})\) loops.

    Now the question is whether \(D\) halts on input \(\desc{D}\)?!

    It is called Cantor's diagonal argument.

Reduction

The idea of reduction can be used to prove undecidability of more problems.

Problem \(A\) reduces to \(B\) means that it's possible to decide \(A\) using an algorithm for \(B\) as a subroutine.

We write \(A \le_T B\) to mean \(A\) is Turing reducible to \(B\).

If \(B\) is decidable, then \(A\) is also decidable.

To prove a negative result about computation (that a certain problem is uncomputable), we usually design an algorithm (the reduction).

  • \(\problem{Post\ Correspondence}\) problem

  • \(\problem{Mortal\ Matrix}\) problem

    Given two \(15 \times 15\) integer matrices \(A\) and \(B\) (or six \(3 \times 3\) matrices), is it possible to multiply them together in some order, with repetitions allowed, to obtain the zero matrix?

    Shown to be undecidable (in 2014 for \(15 \times 15\)).

  • Wang tiles

    Given a set of Wang tiles, decide if it can be used to tile the plane.

    Robert Berger, a student of Wang, proved in 1966 that there is a transformation taking any Turing machine into a set of Wang tiles that tiles the plane if and only if the Turing machine does not halt.

Lambda Calculus is Turing Complete

  • A computational model is said to be Turing-complete or computationally universal if it can simulate any Turing machine.

  • Suppose \(M\) is a Turing machine. We will define a \(\lambda\)-term \(\mathbf{simulate}_M\) simulating the machine \(M\) in the sense that for every list \(\alpha\) representing a possible initial configuration of the machine, \(\mathbf{simulate}_M\, \alpha \betareds \alpha_{\mathrm{final}}\) if the machine terminates with final configuration \(\alpha_{\mathrm{final}}\).

    What happens if the machine does not halt?

    The leftmost \(\beta\)-reduction does not lead to a normal form!

  • The key idea for the simulation is to keep track of the configurations of the machine \(M\) on input \(w\).

  • We use the list data structure in \(\lambda\)-calculus to represent a configuration in \(\Gamma^*\).

    Suppose the current configuration is

    \[\begin{equation*} \alpha = \gamma_0 :: \gamma_1 :: \cdots :: \gamma_{k-1} :: \mathbf{nil}, \end{equation*} \]

    and the next configuration is

    \[\begin{equation*} \alpha' = \gamma'_0 :: \gamma'_1 :: \cdots :: \gamma'_{k'-1} :: \mathbf{nil}. \end{equation*} \]

    Note that \(k'\) differs from \(k\) by at most \(1\).

    As computation is local, every entry \(\gamma'_j\) in the next configuration \(\alpha'\) is a function of \(\gamma_{j-1}\), \(\gamma_j\), and \(\gamma_{j+1}\) (corner cases need special care).

    That is, there is a constant size function computing \(\gamma'_j\) given \(\gamma_{j-1}, \gamma_j, \gamma_{j+1}\) as input. We claim there is a term \(\mathbf{local}_M\) that represents the computation of the function using a constant number of \(\mathbf{and}\) and \(\mathbf{not}\).

    \[\begin{equation*} \mathbf{local}_M\, \tuple{\gamma_{j-1}, \gamma_j, \gamma_{j+1}} \betareds \gamma'_j. \end{equation*} \]

    You can use, for example, the Shannon expansion formula.

    It is also easy to come up with a term \(\mathbf{pack}\) that shifts and packs \(\alpha\) to get the list \(\cdots :: \tuple{\gamma_{j-1}, \gamma_j, \gamma_{j+1}} :: \cdots\). Similarly, there is a term \(\mathbf{terminate}\) that checks whether a configuration is terminating or not.

    Then we define \(\mathbf{next}_M \equiv \lambda l. \mathbf{map}\, (\mathbf{pack}\, l)\, \mathbf{local}_M\).

    Note that \(\mathbf{next}_M\) depends on \(M\) only.

    Finally, consider terms

    \[\begin{equation*} \mathbf{sim}_M \equiv \lambda sl. (\mathbf{terminate}\, l)\, l\, (s\, (\mathbf{next}_M\, l)), \end{equation*} \]

    and \(\mathbf{simulate}_M \equiv \mathbf{y}\, \mathbf{sim}_M\).

    Check that if \(M\) does not halt, the term \(\mathbf{simulate}_M\) does not have a normal form.

  • Note: It is also possible to simulate the computation of \(\lambda\)-calculus on a Turing machine by performing the leftmost \(\beta\)-reductions.

Homework Problem Discussions

  1. Homework 1, Problem 5
  2. Predecessor

Extra: Why is the Predecessor Harder?

  • Schema of iteration

    \[\begin{align*} f\, 0 & \equiv c,\\ f\, (n+1) & \equiv g (f\, n). \end{align*} \]

    The value of the current iteration depends on the value of the previous iteration.

    That's the case for \(\mathbf{succ}\) and \(\mathbf{exp}\) etc.

    Using Church numerals, \(f \equiv \lambda n. n g c\).

  • Schema of primitive recursion

    \[\begin{align*} f\, 0 & \equiv c,\\ f\, (n+1) & \equiv h\, n\, (f\, n). \end{align*} \]

    Using a pair, we can define a new function \(f'\, n = \pair{n}{f\, n}\) to convert it to iteration.

    Primitive recursive functions correspond to programs whose loops are for loops.
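A short Python sketch may make the pairing trick tangible: iterate plays the role of a Church numeral (apply \(g\) to \(c\) exactly \(n\) times), and primitive recursion is recovered by iterating on the pairs \(\pair{n}{f\, n}\). The function names are illustrative; the homework asks you to carry the same idea out inside the \(\lambda\)-calculus.

    # Iteration:  f(0) = c,  f(n+1) = g(f(n)).  With Church numerals this is f = \n. n g c.
    def iterate(g, c, n):
        value = c
        for _ in range(n):
            value = g(value)
        return value

    # Primitive recursion:  f(0) = c,  f(n+1) = h(n, f(n)).
    # The pairing trick iterates on pairs (n, f(n)) so the step function also sees n.
    def prim_rec(h, c, n):
        step = lambda p: (p[0] + 1, h(p[0], p[1]))
        return iterate(step, (0, c), n)[1]

    # Example: factorial is primitive recursive with c = 1 and h(n, r) = (n + 1) * r.
    print(prim_rec(lambda n, r: (n + 1) * r, 1, 5))   # 120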

Lecture 4: Advanced Computability

Kleene's Recursion Theorem

“No one shall expel us from the paradise that Cantor has created.” — David Hilbert

“A Swiss pocket knife for computability.” — Neil D. Jones

Motivation

  • The recursion theorem is an advanced mathematical result in the theory of computability.

  • The proof of the recursion theorem is easy to follow but not to come up with.

  • How did Kleene even formulate the theorem?

    Kleene was a student of Alonzo Church, whose pet invention was the \(\lambda\)-calculus. It is probably motivated by the fixed-point combinators in \(\lambda\)-calculus.

    Indeed, we will see a fixed-point theorem called Rogers's fixed-point theorem, which is equivalent to Kleene's recursion theorem and can be proven by directly mimicking the working of the Y combinator!

  • It connects to mathematical logic, the theory of self-reproducing systems, and even computer viruses.

Self-Reproduction

  • Factory of factory

    A car factory must be more complex than the cars it produces.

    The claim seems true because the factory itself has the car's design within it, in addition to the designs of all the manufacturing robots.

  • Can machines construct replicas of themselves?

Self-Printing Machines

  • Can we design a Turing machine that ignores its input and prints a copy of its own description?

  • We start with an easy lemma.

    There is a computable function \(q: \Sigma^* \to \Sigma^*\), where if \(w\) is any string, \(q(w)\) is the description of a Turing machine \(P_w\) that prints out \(w\) and then halts.

    This is easy once one understands the requirements.

    Turing machine \(Q\) = “On input \(w\):
    1. Constructs Turing machine \(P_w\) = ‘On any input:
      1. Erase input;
      2. Write \(w\) on the tape;
      3. Halt.’
    2. Output \(\desc{P_w}\).”
  1. The Construction Outline of SELF

    We now construct a TM \(\SELF\) that prints its code on the tape following Sipser's textbook.

    • Turing machine \(\SELF\) contains two parts \(A\) and \(B\), which we can think of as two separate procedures that go into \(\SELF\).

    • \(A\) runs first and passes the control to \(B\) once completed.

    • \(A\) uses the machine \(P_{\desc{B}}\), described by \(q(\desc{B})\).

      So machine \(A\) depends on machine \(B\), and we cannot complete its description without specifying the details of machine \(B\).

    • Construction for \(B\)

      1. Note that we cannot define it to be \(q(\desc{A})\) as it will be circular.

      2. Instead, we define \(B\) using a different strategy.

        \(B\) computes \(A\) from the output that \(A\) produces.

      3. If \(B\) has access to \(\desc{B}\), then it can call \(q\) and obtain \(\desc{A}\).

        Where does this \(\desc{B}\) come from? It's printed on the tape when \(A\) finishes.

  2. Full Construction

    In summary, we take machine \(A = P_{\desc{B}}\), and machine \(B\) = “On input \(\desc{M}\), where \(M\) is a portion of a TM:

    1. Compute \(q(\desc{M})\);
    2. Combine the result with \(\desc{M}\) to make a complete TM;
    3. Print the description of this TM and halt.”

    If we now run \(\SELF\), we observe the following behavior:

    1. First \(A\) runs, and it prints \(\desc{B}\) on the tape;
    2. Then \(B\) starts. It looks at the tape and finds its input, \(\desc{B}\).
    3. \(B\) calculates \(q(\desc{B}) = \desc{A}\) and combines that with \(\desc{B}\) into a TM description, \(\desc{\SELF}\).
    4. \(B\) prints this description and halts.
    • We can implement this construction in any (Turing-complete) programming language to obtain a program that outputs a copy of itself.

      s='s=%r;print(s%%s)';print(s%s)
    • We can even do so in natural language!

      Print out two copies of the following, the second one in quotes: “Print out two copies of the following, the second one in quotes:”
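For illustration, here is a sketch (not from Sipser's construction verbatim) unpacking the one-line Python quine above into the two-part \(A\)/\(B\) structure: the first uncommented line plays the role of \(A = P_{\desc{B}}\), writing the text of \(B\) on the “tape” as data, and the second plays the role of \(B\), reconstructing \(A\) from that data via repr and then printing itself. The comments are for exposition only; the actual quine is just the two uncommented lines.

# Part A (= P_<B>): write the text of part B as data.
b = 'print("b = " + repr(b)); print(b)'
# Part B: reprint A by applying repr to the data, then print B itself.
print("b = " + repr(b)); print(b)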

Kleene's Recursion Theorem

Let \(T\) be a TM that computes a function \(t: \Sigma^* \times \Sigma^* \to \Sigma^*\). There exists a TM \(R\) computing function \(r: \Sigma^* \to \Sigma^*\) such that for every \(w\), \(r(w) = t(\desc{R}, w)\).

  • Note that we can explicitly construct \(R\) from \(T\)!

  • The proof of the recursion theorem is similar to the construction of \(\SELF\), with several minor modifications.

    We construct a TM \(R\) in three parts \(A\), \(B\), \(T\) where \(T\) is given in the statement of the theorem.
    1. Here \(A\) is the TM \(P_{\desc{BT}}\) described by \(q(\desc{BT})\). To preserve the input \(w\), we redesign \(q\) so that \(P_{\desc{BT}}\) writes its output after the input \(w\) on the tape. So after \(A\) finishes, the tape contains \(w\desc{BT}\).
    2. \(B\) is the machine that finds the \(\desc{BT}\) part on the tape and runs \(q\) to obtain \(\desc{A} = q(\desc{BT})\). Then \(B\) combines \(A\), \(B\), and \(T\) into a single machine and obtains its description \(\desc{ABT} = \desc{R}\). \(B\) then transforms the content of the tape to \(\tuple{R,w}\) and passes control to \(T\).
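To make the construction concrete, here is a hypothetical sketch (an illustration, not part of the formal proof) of the same idea for Python source strings: given the source of a two-argument function t(code, w), kleene() assembles the source of a program \(R\) whose output on input \(w\) is \(t(\desc{R}, w)\). The %r/%% trick plays the role of \(q\), and the last lines of the assembled program play the roles of \(B\) and \(T\). All names here are made up for the sketch.

def kleene(t_src):
    """Given the source t_src of a function t(code, w), return the source of a
    program R such that running R with command-line input w prints t(<R>, w)."""
    body = t_src + "\ns = %r\nimport sys\nprint(t(s %% s, sys.argv[1]))\n"
    return body % body          # plays the role of q: fill in the program's own text as data

# Example t: report the length of R's own source together with the input.
T_SRC = "def t(code, w):\n    return str(len(code)) + ' ' + w"
r_src = kleene(T_SRC)

# Simulate "run R on input w" without writing a file.
import sys, io, contextlib

def run(src, w):
    old_argv, sys.argv = sys.argv, ["R", w]
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(src, {})
    sys.argv = old_argv
    return buf.getvalue().strip()

print(run(r_src, "hello"))   # prints t(<R>, "hello"): the length of R's source, then "hello"
print(len(r_src))            # the same length, computed directly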

Applications

  • The recursion theorem states that Turing machines can obtain their own description and then go on to compute with it.

    If you are designing a machine \(M\), you can include the phrase obtain own description \(\desc{M}\) in the informal description of \(M\)'s algorithm.

  • At first glance, this capability may seem useful only for tasks such as making a machine that prints a copy of itself.

  • But, as we will demonstrate, it is a handy technique for solving many problems in the theory of computability.

  1. SELF Redesigned

    Machine \(\SELF\) can be redesigned as follows.

    Machine \(\SELF\) = “On any input:

    1. Obtain, via the recursion theorem, own description \(\desc{\SELF}\).
    2. Print \(\desc{\SELF}\).”
  2. Undecidability

    \(\ACCTM = \{\desc{M, w} \mid M \text{ accepts } w \}\) is undecidable.

    It is Turing-recognizable by universal TMs.

    Assume to the contrary that there is a TM \(H\) that decides \(\ACCTM\).

    Consider machine \(B\): “On input \(w\):

    1. Obtain, via the recursion theorem, own description \(\desc{B}\).
    2. Run \(H\) on input \(\desc{B, w}\).
    3. Do the opposite of what \(H\) says. That is, accept if \(H\) rejects and reject if \(H\) accepts.”

    \(H\) cannot perform as required on \(\desc{B,w}\): if \(H\) accepts \(\desc{B,w}\), then \(B\) rejects \(w\), and vice versa. So \(H\) cannot decide \(\ACCTM\).

  3. Minimal Machine

    • The description length of machine \(M\) is the number of symbols in the description \(\desc{M}\).

    • \(M\) is minimal if no Turing machine equivalent to \(M\) has a shorter description.

      \[\begin{equation*} \MINTM = \{ \desc{M} \mid M \text{ is a minimal TM } \}. \end{equation*} \]
    • \(\MINTM\) is not Turing-recognizable.

      Assume to the contrary that some TM \(E\) enumerates \(\MINTM\) and we will obtain a contradiction.

      We construct the following TM \(C\) = “On input \(w\):
      1. Obtain, via the recursion theorem, own description \(\desc{C}\).
      2. Run the enumerator \(E\) until machine \(D\) appears with a longer description than that of \(C\).
      3. Simulate \(D\) on input \(w\).”

      Because \(\MINTM\) is infinite, \(E\)'s list must contain a TM with a longer description than C's description.

      Therefore, Step 2 of machine \(C\) eventually terminates with some machine \(D\) that is longer than \(C\).

      Machine \(C\) simulates \(D\), and so is equivalent to it.

      Because \(C\) is shorter than \(D\) and is equivalent to it, \(D\) cannot be minimal.

      But \(D\) appears on the list that \(E\) produces, a contradiction.

      Let \(A\) be an infinite subset of \(\MINTM\). Is \(A\) Turing-recognizable?

  4. Rogers's Fixed-Point Theorem

    Let \(t:\Sigma^* \to \Sigma^*\) be a computable function. Then there is a Turing machine \(F\) for which \(t(\desc{F})\) describes a machine equivalent to \(F\).

    Here, we assume strings of invalid descriptions correspond to machines that reject immediately.

    Consider machine \(F\) = “On input \(w\):

    1. Obtain, via the recursion theorem, own description \(\desc{F}\).
    2. Compute \(t(\desc{F})\) to obtain the description of a TM \(G\).
    3. Simulate \(G\) on \(w\).”

    The rest of the proof is now straightforward.

    Note: There is a proof of Rogers's fixed-point theorem that directly implements the Y combinator!

Gödel's Incompleteness Theorems

Proving Gödel's famous Incompleteness Theorem is easy if you use computer science.

There are inherent limitations in mathematical reasoning.

‘This statement cannot be proved.’

Euclidean geometry and the fifth postulate (in Playfair's form): Through a given point \(P\) not on a line \(L\), there is one and only one line in the plane of \(P\) and \(L\) which does not meet \(L\).

Is it a consequence of the other four axioms?

Logic Review

  • Logic is a branch of mathematics that studies mathematical reasoning.

  • You should have seen propositional logic and first-order logic.

  • First-order logic

    1. Logical connectives \(\land\), \(\lor\), \(\neg\), and \(\rightarrow\);
    2. Quantifiers \(\forall, \exists\);
    3. Variables \(x, y, \ldots\);
    4. The equality symbol \(=\);
    5. Constants \(c_1, c_2, \ldots\);
    6. Relations \(R_1, R_2, \ldots, R_k\);
    7. Functions \(f_1, f_2, \ldots, f_r\).
  • Well-formed formula

    1. \(\forall q \exists p \forall x,y \bigl[ x,y > 1 \rightarrow xy \ne p \bigr]\).
    2. \(\forall a,b,c,n\bigl[ (a,b,c>0 \land n>2) \rightarrow a^n+b^n \ne c^n \bigr]\). (Fermat's Last Theorem and Andrew Wiles)
    3. \(\forall q\exists p \forall x,y \bigl[ p > q \land (x,y>1 \rightarrow (xy \ne p \land xy \ne p+2)) \bigr]\). (Twin Prime Conjecture and Yitang Zhang)
  • A formula with no free variables is called a sentence.

Truth and Interpretations (Models)

Truth is defined by the semantics.

  • An interpretation \(\mathcal{U}\) consists of

    1. A non-empty set \(U\) to represent the universe,
    2. For every constant \(c\), assign an element \(c^{\mathcal{U}} \in U\),
    3. For every \(m\)-ary predicate \(R\), assign an \(m\)-ary relation \(R^{\mathcal{U}} \subseteq U^m\),
    4. For every \(n\)-ary \(f\), assign an \(n\)-ary function \(f^{\mathcal{U}} : U^n \to U\).
  • If \(\varphi\) is a sentence, it is either true or false in a given model \(\mathcal{U}\).

  • We call \(\varphi\) a tautology if it is true in all models.

    \(\exists y \forall x \bigl[ x = R(y) \bigr] \rightarrow \forall x,y \bigl[ x = y\bigr]\).

Proofs and Deduction Calculus

Proofs are syntactic deductions.

There are many proof systems for first-order logic, including Hilbert-style deductive systems, natural deduction, the sequent calculus, etc.

  • Axioms

    1. \(A \lor \neg A\) is a tautology for all sentences \(A\).

      Or we write

      \[\begin{prooftree} \AXC{} \UIC{$A \lor \neg A$} \end{prooftree} \]
    2. \(\cdots\)

    There are not finitely many axioms; rather, there are finitely many axiom schemas, each standing for infinitely many axioms.

  • Deduction rules

    From \(A\) and \(A \rightarrow B\) we can deduce \(B\) (modus ponens).

    \[\begin{prooftree} \AXC{$A$} \AXC{$A \to B$} \BIC{$B$} \end{prooftree} \]

    From \(P(c)\) we can deduce \(\forall x\, P(x)\) (universal generalization), provided the constant \(c\) is arbitrary, i.e., it does not appear in any assumption.

    \[\begin{prooftree} \AXC{$P(c)$} \UIC{$\forall x\, P(x)$} \end{prooftree} \]
  • Example: Hilbert System

    \[\begin{align*} \mathbf{Ax}_1 \quad & A \to (B \to A),\\ \mathbf{Ax}_2 \quad & (A \to (B \to C)) \to ((A \to B) \to (A \to C)),\\ \mathbf{Ax}_3 \quad & (\neg A \to \neg B) \to (B \to A). \end{align*} \]
  • Proof and computation

    A proof system is an algorithm \(V\) that satisfies

    1. Effectiveness. For every statement \(x\) and proof string \(w\), \(V(x, w)\) halts with output either \(0\) or \(1\).
    2. Soundness. If \(x\) is not true, then for every \(w\), \(V(x, w) = 0\).

    Completeness. True statements can be proved: one of \(S\) and \(\neg S\) can always be proved.

Gödel's Completeness Theorem

  • Is every tautology provable?

    Yes!

    Gödel's PhD thesis

  • A fundamental theorem (which you may have seen previously)

Typical Use of First-Order Logic

  • In mathematics, we care more about the truth of a fixed model.

    • Start with some axioms that are true in the model/interpretation.

    • Derive true statements from the axioms.

  • Example: Arithmetic of \(\natural\) (Peano axioms, first-order version)

    Constants: \(0\)

    Functions: \(\mathrm{S}(x)\), \(\mathrm{Plus}(x,y)\), \(\mathrm{Times}(x,y)\).

    Axioms:

    1. \(\forall x\, (\mathrm{S} x \ne 0)\),
    2. \(\forall x, y\, (\mathrm{S} x = \mathrm{S} y \to x = y)\),
    3. \(\cdots\)
  • Set theory (Zermelo-Fraenkel, ZFC)

    1. Axiom of Extensionality: If \(X\) and \(Y\) have the same elements, then \(X=Y\)

      \(\forall u\, [u \in X \leftrightarrow u \in Y ] \to X = Y\).

    2. \(\cdots\)

    C is for the infamous axiom of choice.

    AC is independent of ZF (P. Cohen).

    ZFC: Standard set theory can be used to model and prove almost anything in mathematics.

  • Principia Mathematica

    Bertrand Russell and Alfred North Whitehead

    《数学原理》 (its Chinese title)

    Doable but tedious! It took them hundreds of pages to prove \(1 + 1 = 2\).

    Can computers help?!

Gödel's Incompleteness Theorems

Any mathematical proof system that is sufficiently expressive and has computable axioms cannot be both complete and consistent.

Consistent: Cannot prove both \(S\) and \(\neg S\).

A Proof-Searching Approach to Attack the Halting Problem

  • Consider the Halting problem \(\HALT_{\varepsilon} = \{ \tuple{M} \mid M \text{ halts on the empty input}\}\).

  • Idea: Search for proofs that the machine \(M\) on \(w\) halts or loops.

    Lemma: If \(M\) halts on input \(w\), there is a mathematical (ZFC) proof of this fact. The proof simply writes down and verifies the configuration history:

    1. The configuration at time step \(1\) is the starting configuration \(q_0 \alpha_0 \alpha_1 \cdots\);
    2. The configuration at each time step \(t\) follows from the previous configuration \(\alpha^{(t-1)}\) and the transition function of \(M\);
    3. The configuration at the final step \(T\) is in a halting state.
  • Idea: Combine the proof-search approach with diagonal argument

  • Define Turing machine \(D\)

    \(D\) = “On any input:

    1. Obtain own description \(\desc{D}\) by recursion theorem.
    2. For all \(k = 1, 2, \ldots\) and all strings \(P\) of length \(k\):
      1. (Halt Proof Check). Check if \(P\) is a valid ZFC proof for the statement ‘\(D\) halts’. If yes, then \(D\) enters a loop-forever state.
      2. (Loop Proof Check). Check if \(P\) is a valid ZFC proof for the statement ‘\(D\) loops’. If yes, halt.”
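A sketch of \(D\) in Python (illustrative only), assuming a hypothetical proof checker is_valid_zfc_proof(statement, proof) and an encoding of candidate proofs as binary strings; neither is a real library, and both names are made up for this sketch.

from itertools import count, product

ALPHABET = "01"    # assumption: candidate proofs are encoded as binary strings

def D(is_valid_zfc_proof, own_description):
    """Search all candidate proofs by length; do the opposite of whatever is proved."""
    halts_stmt = f"machine {own_description} halts on the empty input"
    loops_stmt = f"machine {own_description} loops on the empty input"
    for k in count(1):
        for letters in product(ALPHABET, repeat=k):
            P = "".join(letters)
            if is_valid_zfc_proof(halts_stmt, P):
                while True:        # a proof of 'D halts' was found: loop forever
                    pass
            if is_valid_zfc_proof(loops_stmt, P):
                return             # a proof of 'D loops' was found: halt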

    If ZFC proves ‘\(D\) loops’, then the machine \(D\) will find this proof and \(D\) actually halts. By the above lemma, ZFC also proves ‘\(D\) halts’, which is a contradiction to consistency.

    Note: We used the assumption that ZFC is consistent twice. In addition to its appearance in concluding the contradiction, it is also used to argue that \(D\) actually halts. If ZFC is not consistent, there may be a proof of ‘\(D\) halts’ that appears earlier than the proof of ‘\(D\) loops’, in which case \(D\) enters the loop-forever state instead.

    If ZFC proves ‘\(D\) halts’, then the machine \(D\) will eventually enter the loop-forever state, say after \(T\) steps! In that case, the statement ‘\(D\) loops’ can be proved in ZFC by writing down the configuration history for \(T+1\) steps (after which \(D\) stays in the loop-forever state). Again a contradiction to consistency.

  • ZFC cannot be both complete and consistent.

    Either ‘\(D\) loops’ or ‘\(D\) halts’ is true, yet neither is provable, assuming ZFC is consistent!

    We are done proving Gödel's First Incompleteness Theorem.

  • Remark: Adding this true statement as a new axiom does not solve the problem!

  • Remark: Peano arithmetic (PA) over the natural numbers is also powerful enough (to encode and reason about TMs)!

    Gödel numbering: assigns a unique natural number to each symbol and well-formed formula of some formal language.

    ‘Statement \(g\) has no proof’ has Gödel number \(g\)!

Gödel's Second Incompleteness Theorem

  • Which statement is true, ‘\(D\) halts’ or ‘\(D\) loops’?

    ‘\(D\) halts’ cannot be true. If it were true, there would be a ZFC proof of it by the lemma, and then, by the definition of \(D\), \(D\) would loop. Contradiction.

  • Wait! Can we write down the above argument that ‘\(D\) loops’?

    Is it a proof for ‘\(D\) loops’?

  • No! What we can prove from the above argument is only that ‘ZFC is consistent’ implies ‘\(D\) loops’, because the argument, and in particular the analysis of \(D\)'s behavior, relied on the assumption that ZFC is consistent.

    ZFC proves ‘ZFC consistent \(\to\) \(D\) loops’.

    But ZFC cannot prove ‘\(D\) loops’ as we have already proved.

  • So ZFC cannot prove its own consistency!

    Assume ZFC is consistent. Then ‘ZFC consistent’ is a true statement that it cannot prove.

    Like the Halting problem for undecidability, it is a concrete/interesting statement for the incompleteness theorem!

Lecture 5: Time Complexity

Time Complexity

We will embark on the study of computational complexity.

Computational models (deterministic, nondeterministic, randomized), resources (time, space), and problems.

Time since the Big Bang: \(13.8\) billion years or \(4.36 \times 10^{17}\) seconds.

Palindrome Example

  • Consider the palindrome problem \(\PAL = \{w \mid w^R = w\}\).

  • Examples:

    Dammit I'm mad!

    Dr. Awkward & Olson in Oslo, a palindromic novel numbering 31,594 words (or approximately 104,000 letters) by Lawrence Levine

    《西江月·咏梅》 (“The Moon over the West River: Ode to the Plum Blossom”, a palindromic ci poem)

    〔宋〕苏轼 (Su Shi, Song dynasty)

    马趁香微路远,沙笼月淡烟斜。 渡波清彻映妍华。 倒绿枝寒凤挂。

    挂凤寒枝绿倒,华妍映彻清波。 渡斜烟淡月笼沙。 远路微香趁马。

  • You have seen this problem/language in the FLA course.

    It cannot be solved by a finite automaton, but it can be solved with the full notion of an algorithm or a programming language (Python/C).

  • What is the time complexity of the palindrome problem?

Measuring the Time Complexity Using TMs

Consider the \(\PAL\) Turing machine \(P\) = “On input \(x\):

  1. Read the first letter, remember what it is, replace it with blank;
  2. Move to the end of the input and check if the last letter matches the first; Reject if it does not match; Replace it with blank otherwise;
  3. If the string is empty and has not found a mismatch, accept;
  4. Otherwise, move to the beginning of the string, and repeat from step 1.”
  • It is natural to measure the time complexity of a TM by the number of steps it takes on an input.

  • It depends on the length of input \(x\), and we usually reserve \(n\) for this.

    The running time of a TM is a function of the input length.

  • If \(x\) is really a palindrome, we need (basically)

    \((n+1) + n + \cdots + 1 = \frac{1}{2}n^2 + \frac{3}{2}n + 1\).

  • If \(x\) is not a palindrome, it depends on the input and may stop much earlier.

    The running time of an algorithm \(\mathcal{A}\) is a function \(T_{\mathcal{A}} : \natural \to \natural\) defined as

    \[\begin{equation*} T_{\mathcal{A}}(n) = \max_{\substack{\text{ instances } x \text{ of}\\ \text{length } n}} \text{steps}(\mathcal{A}, x). \end{equation*} \]

    So it is a worst-case definition.

  • We may omit the subscript \(\mathcal{A}\) when the algorithm under investigation is clear from the context and write \(T(n) = \frac{1}{2}n^2 + \frac{3}{2}n + 1\) as the running time of the \(\PAL\) Turing machine \(P\).

  • Why worst-case?

    1. Other variants such as average-case complexity and smoothed analysis are seriously considered.

      Smoothed analysis of algorithms by Spielman and Teng, 2002.

    2. But the worst-case complexity is the most important, and probably the first thing you need to understand when analyzing an algorithm.

      It gives a guarantee for all possible instances.

      It is sometimes hard to define what the distribution of random inputs should be.

Big-O Notations

  • We'd like to focus on the big picture of the running time.

    How does it scale as a function of \(n\)?

    In other types of computation models, it is not that clear what a step means.

  • So we need the right level of abstraction and hide unimportant details.

    For the \(T(n) = \frac{1}{2}n^2 + \frac{3}{2}n + 1\) case, the main message of the running time is that it is a quadratic function, proportional to \(n^2\).

  • \(O(\cdot)\), \(\Omega(\cdot)\), and \(\Theta(\cdot)\):

    1. \(T(n)\) is \(O(n^2)\) if \(T(n)\) is at most proportional to \(n^2\) as \(n\) grows;
    2. \(T(n)\) is \(\Omega(n^2)\) if \(T(n)\) is at least proportional to \(n^2\) as \(n\) grows;
    3. \(T(n)\) is \(\Theta(n^2)\) if \(T(n)\) is proportional to \(n^2\) as \(n\) grows.
  • Examples

    \(\frac{1}{2}n^2 + \frac{3}{2}n + 1\) is \(\Theta(n^2)\), \(O(n^2)\), \(O(n^3)\), \(\Omega(n^2)\), \(\Omega(n)\) and is not \(O(n)\) or \(\Omega(2^n)\).

  • Formal definition

    Let \(f\) and \(g\) be functions \(f, g: \natural \to \natural\). Say that \(f(n) = O(g(n))\) if there are positive integers \(c\) and \(n_0\) such that for every integer \(n \ge n_0\), \(f(n) \le c g(n)\).

    Let \(f\) and \(g\) be functions \(f, g: \natural \to \natural\). Say that \(f(n) = \Omega(g(n))\) if there are a positive real number \(c\) and a positive integer \(n_0\) such that for every integer \(n \ge n_0\), \(f(n) \ge c g(n)\).

    We say that \(f(n) = \Theta(g(n))\) if \(f(n)\) is \(O(g(n))\) and \(\Omega(g(n))\).

  • Example functions, each of which is big-\(O\) of the next:

    \(O(1), O(\log^* n), O(\log\log n), O(\log n), O(\sqrt{n}), O \bigl( \frac{n}{\log(n)} \bigr)\),

    \(O(n), O(n \log n), O(n^{1.5}), O(n^3), O(n^{\log n})\),

    \(O(2^n), O(n!), O(2^{2^n})\).

Recurrence Relation

\(T(n) = a T(n/b) + f(n)\)

Examples:

  1. Binary search, \(T(n) = T(n/2) + O(1)\), \(T(n) = O(\log n)\)
  2. Merge sort, \(T(n) = 2T(n/2) + O(n)\), \(T(n) = O(n \log n)\).

See Master Theorem.
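As a concrete instance of the second recurrence, here is a standard merge sort sketch (not part of the notes) whose structure matches \(T(n) = 2T(n/2) + O(n)\): two recursive calls on halves plus a linear-time merge.

def merge_sort(a):
    """Sort list a: two recursive calls on halves (2T(n/2)) plus a linear merge (O(n))."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):     # the O(n) merge step
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))           # [1, 2, 5, 5, 6, 9]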

Intrinsic Time Complexity

  • Given a problem we can ask for its intrinsic time complexity.

    The running time of the fastest algorithm up to \(\Theta(\cdot)\) and fixing the model of computation.

  • Claim: Any algorithm solving \(\PAL\) uses at least \(n-1\) steps.

    Suppose algorithm \(\mathcal{A}\) solves it in \(n-2\) steps. Then for input \(x = 0^n\) there are two positions \(i\) and \(j\) that \(\mathcal{A}\) never reads; at least one of them, say \(i\), is not the middle position. Flipping \(x[i]\) turns \(x\) into a non-palindrome, yet \(\mathcal{A}\)'s run, and hence its answer, stays the same, a contradiction.

Time Complexity Class

Define the time complexity class \(\TIME(t(n))\) as the collection of decision problems solvable by a TM with running time \(O(t(n))\).

Time Complexity for Programs

def TwoFingerPal(x):
    """Two-pointer check: compare symbols from the two ends, moving inward."""
    lo = 0
    hi = len(x) - 1
    while lo < hi:
        if x[lo] != x[hi]:
            return False
        lo = lo + 1
        hi = hi - 1
    return True

def OneLinePal(x):
    """Compare the string with its reverse."""
    return x == x[::-1]

What is the time complexity for this algorithm?

Is it \(\Theta(n)\) or …?

  • Time complexity is model dependent.

    A two-tape TM could do it in \(O(n)\) steps.

    For single-tape TM, it requires \(\Omega(n^2)\) steps.

  • Two subtle questions for considering the time complexity of the python code:

    1. How do we measure the cost of accessing x[lo] and x[hi]?

    2. How do we measure the cost of \(+1\) and \(-1\) (when the numbers could be as large as \(n\))?

  1. RAM Model

    • One limitation of TMs is that we can only access one location of tape where the head is.

    • Actual computers often provide Random Access Memory (RAM) which can be thought of as a large array (memory).

      The computational model that models access to such a memory is the RAM machine.

    • The additional ability of a RAM machine is that the machine can read/write an indexed entry of the memory in one step.

      Usually, the memory of a RAM machine is an array of unbounded size where each cell can store a single word.

      In addition to the memory array, a RAM machine also contains a constant number of registers \(r_0, \ldots, r_{k-1}\), each of which contains a single word.

      1. Data movement: Move data between a cell of memory and a register.

        For example, in one step, a RAM machine can load into register \(r_j\) from the location of the memory specified by \(r_i\).

      2. Computation: Performed on the registers.

    • It is closer to real machines.

      Pros: Reasonably realistic and relatively simple.

      Cons: It's not easy to define precisely.

    • We write \(\TIME_{\text{RAM}}(t(n))\) for class of problems decidable in time \(O(t(n))\) on a RAM machine.

    • It is equivalent to the TM model up to polynomial overhead.

Class \(\P\)

  • The class of problems that can be solved by deterministic TMs in polynomial time.

    \(\P = \bigcup_{c>0} \TIME(n^c)\).

    Adopting polynomial worst-case performance as our criterion of efficiency results in an elegant and useful theory that says something meaningful about practical computation, and would be impossible without this simplification.

  • There are efficient computations that are not polynomial, and polynomial computations that are not efficient in practice!

    \(n^{100}\) vs. \(n^{\log n}\).

  • It captures the notion of efficient computation in most of the cases.

    Usually, the exponent can be reduced to a small number.

    Extended Church-Turing Thesis

  • \(\P\) is stable: All “reasonable” computational models are equivalent if we only care about the distinction between polynomial and exponential.

    Polynomial bounds are closed under addition and multiplication.

  • Exponential time

    \(\EXP = \bigcup_{c>0} \TIME(2^{n^c})\).

Time Hierarchy Theorem

If you have more time, you can solve more problems.

Theorem Statement

For any time-constructible function \(t: \natural \to \natural\), there is a language \(A\) that is decidable in \(O(t(n))\) time but not decidable in time \(o(t(n) / \log t(n))\).

In other words \(\TIME\bigl( o(t(n) / \log t(n))\bigr) \subsetneq \TIME(t(n))\).

Remarks

  • Remark 1: Why time-constructible (nice)?

    A function \(t(n) \ge n \log(n)\) is called time-constructible if there is a TM taking \(1^n\) as input and computes the binary representation of \(t(n)\) in time \(O(t(n))\).

    Examples: \(n \log n\), \(n \sqrt{n}\), and \(2^n\) are time-constructible. \(\sqrt{n}\) or \(n\) is not.

    Space-constructible functions can be defined similarly.

    Examples: \(\log n\), \(n^2\) are space-constructible. \(\log\log n\) is not. Yet, there is a space constructible function in \(O(\log\log n)\).

  • Remark 2: The logarithmic gap!

    This comes from the overhead from the \(t\)-step simulation on the single-tape TMs. If this simulation were improved, we could improve the theorem accordingly.

  • Remark 3: In Barak's textbook, the difference is said to be arbitrary; this subtle difference comes from the fact that the running time there is defined for a variant of the RAM model.

Proof Outline

  • The following \(O(t(n))\) time algorithm \(D\) decides a language \(A\) that is not decidable in \(o(t(n)/ \log t(n))\) time.

    \(D\) = “On input \(w\):

    1. Compute the length \(n\) of \(w\).
    2. Compute \(t(n)\) using time-constructibility and store the value \(t(n)/ \log t(n)\) in a binary counter.
    3. Parse \(w\) as \(\desc{M}10^*\) for some TM \(M\); if \(w\) is not of this form, reject.
    4. Simulate \(M\) on \(w\). Decrement this counter before each step. If the counter ever hits \(0\), reject.
    5. If \(M\) accepts, then reject. If \(M\) rejects, then accept.”
  • Note: Why do we use inputs of the form \(\desc{M}10^*\)? This is a padding technique that lets every machine \(M\) appear as inputs of arbitrarily large length, so that we can choose the constant \(n_0\) in the asymptotic comparison.

    First of all, we claim that \(D\) runs in time \(O(t(n))\).

    1. This is true for Step 1 because \(t(n) \ge n \log n\).
    2. For Step 2, this follows from the time-constructibility of \(t\).
    3. For Step 3, Easy.
    4. For Step 4, we can simulate the machine using the multi-track technique. We use three tracks: the first track for \(M\)'s tape, the second track for keeping the state, head symbol, and transition function of \(M\) close to the head, and the third track for holding the counter, which has length \(O(\log(t(n)/ \log t(n))) = O(\log t(n))\). Each simulated step costs \(O(\log t(n))\), and the number of steps is \(t(n) / \log t(n)\). So the total time is still \(O(t(n))\).

    Next, we prove that the language defined by the machine \(D\) above does not have any algorithm of running time \(o(t(n)/ \log t(n))\). Assume to the contrary that TM \(M\) decides it in time \(g(n) = o(t(n) / \log t(n))\).

    Say \(D\)'s simulation of \(M\) takes at most \(d\, g(n)\) counted steps for some constant \(d\). Choose \(n \ge n_0\) large enough that \(d\, g(n) \le t(n)/\log t(n)\), so the counter does not run out.

    Then \(D\), on the input \(\desc{M}10^{k}\) of this length \(n\), completes the simulation and outputs differently from \(M\), contradicting the assumption that \(M\) decides \(D\)'s language.

Corollaries

  • An easy corollary is that for all constants \(c_1 > c_0 \ge 1\), \(\TIME(n^{c_0}) \subsetneq \TIME(n^{c_1})\).

    And, similarly, \(\P \subsetneq \EXP\).

The Gap Theorem

  • There is an increasing computable function \(g: \natural \to \natural\) such that

    \[\begin{equation*} \TIME(g(n)) = \TIME(2^{g(n)}). \end{equation*} \]
  • It means that the time-constructible condition is necessary for the time hierarchy theorem.

  • Proof Outline

    Consider \(T_m(n)\), the maximum time machine \(m\) takes on input of size \(n\). Note that \(T_m(n)\) could be undefined if \(m\) does not halt on some input of length \(n\).

    Define \(g(0) = 1\). For \(n \ge 1\), define \(g(n) = j\), where \(j\) is the smallest number such that \(j > g(n-1)\) and, for all \(m < n\), \(T_{m}(n)\) is not in the interval \([j, 2^j]\).

    One can prove that \(g\) is a computable function and that, for every \(m\), at least one of the following holds:

    1. \(T_m(n) \le g(n)\) for all but finitely many \(n\);
    2. \(T_m(n) > 2^{g(n)}\) for infinitely many \(n\).

    This means that no machine has time complexity between \(g(n)\) and \(2^{g(n)}\).

Lecture 6: Problems in P and NP

Computational Problems

“An algorithm is a finite answer to an infinite number of questions.” — Stephen Kleene

  • Computational problem: A family of problem instances with increasing size.

  • Decision problems and search problems

Graph Problems

  • Graph \(G = (V, E)\).

    Reserve \(n\) to denote the number of vertices \(\abs{V}\).

    Directed edge \((u,v)\) or undirected edge \(\{u,v\}\) which can also be denoted as \(u \to v\) or \(u \sim v\).

  • Graphs can be represented either in the adjacency list or adjacency matrix representation.

    Input size is measured by \(n\) (and sometimes also \(m = \abs{E}\)).

  • Graphs are ubiquitous in CS.

    • Roads and route planning
    • Web
    • Social network
    • Correlations in data
  • We focus on undirected graphs for simplicity.

  1. Shortest Path and Reachability

    Given \(G = (V, E)\) and \(s,t \in V\), the \(\SHORTPATH\) problem computes the length of the shortest path from \(s\) to \(t\). The simpler \(\REACHABILITY\) problem asks whether such a path exists.

    • There can be exponentially many paths from \(s\) to \(t\).

      We have an efficient way to find the shortest one by breadth-first search (BFS).

    • More generally, Dijkstra's algorithm can find the shortest path in weighted graphs, where we assign a weight \(w:E \to [0, \infty)\) to each edge. The complexity of Dijkstra's algorithm is \(\Theta(\abs{E} + \abs{V} \log \abs{V})\) (when using the Fibonacci-heap min-priority queue of Fredman and Tarjan).
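A short sketch (illustrative; the adjacency-dict representation is just one choice) of the BFS approach to \(\SHORTPATH\) on unweighted graphs:

from collections import deque

def bfs_shortest_path(G, s, t):
    """G maps each vertex to an iterable of neighbors; return dist(s, t) or None."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            return dist[u]
        for v in G[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None    # t is unreachable from s

G = {"s": ["a", "b"], "a": ["t"], "b": ["a"], "t": []}
print(bfs_shortest_path(G, "s", "t"))    # 2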

  2. Longest Path

    The \(\LONGPATH\) problem finds the length of the longest simple path between a given pair of vertices \(s\) and \(t\) in a given graph \(G\).

    • Simple path: a path with no repeating vertices.

      Again there may be exponentially many paths we need to consider.

      For this problem, however, we do not know of any significant improvement over the brute force enumeration algorithm.

    • The best-known algorithms for the longest path problem take \(O(c^n)\) time for some constant \(c > 1\). (The best record is \(c \approx 1.65\)).

    • It is a generalization of the famous Hamiltonian path problem \(\HAMPATH\), which asks whether there is a simple path that visits all \(n\) vertices.

    • Hamiltonian cycle problem \(\HAMCYCLE\)

    • Traveling salesman problem \(\TSP\)

      Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?

  3. Min-Cut

    • Definitions

      Given a graph \(G = (V,E)\), a cut of \(G\) is a subset \(S \subseteq V\) such that \(S\) is neither empty nor \(V\).

      The edges cut by \(S\) are those edges that have one vertex in \(S\) and the other in \(V \setminus S\).

      For two vertices \(s,t \in V\), an \(s,t\)-cut is a cut such that \(s \in S\) and \(t \in V \setminus S\).

      Computing the minimum \(s,t\)-cut (\(\MINCUT\) problem) is useful in many applications since minimum cuts often correspond to bottlenecks.

    • We can obtain a polynomial-time algorithm for computing the min-cut using the Max-Flow Min-Cut Theorem.

    • The Ford-Fulkerson algorithm computes a flow using incremental improvements.

      Finding the improvement is a \(\REACHABILITY\) problem.

    • Computing flows in polynomial time also follows by using a much more general tool known as linear programming (LP).

      \[\begin{equation*} \begin{split} \text{Maximize:}\, & c^T x\\ \text{Subject to:}\, & A x \le b, x \ge 0\\ \end{split} \end{equation*} \]
    • LP formulation of the \(\MAXFLOW\) problem

      A flow on a graph \(G\) of \(m\) edges can be modeled as a real vector \(x \in \real^m\) where for every edge \(e = (u, v)\), \(x_e\) corresponds to the flow on \(e\) from \(u\) to \(v\). For all \(u, v\in V\) and \(e'=(v, u)\), \(x_{e'} = - x_e\). So, for each edge \(e\) we assign variable \(x_e \in [-1,1]\).

      The (flow-balancing) constraints, where \(\pi_1(e)\) denotes the first endpoint (tail) of edge \(e\), are

      \[\begin{equation*} \sum_{e: \pi_1(e) = v} x_e = 0\quad \forall v\in V\setminus \{s,t\} \end{equation*} \]

      The objective function we want to maximize is \(\sum_{e: \pi_1(e) = s} x_e\).

    • Luckily, there are polynomial-time algorithms for solving linear programming, and hence we can solve \(\MAXFLOW\) and also \(\MINCUT\) in polynomial time.

    • Note: In a recent breakthrough, an almost-linear-time algorithm for the \(\MAXFLOW\) problem was discovered. See Maximum Flow and Minimum-Cost Flow in Almost-Linear Time by Li Chen, Rasmus Kyng, Yang P. Liu, Richard Peng, Maximilian Probst Gutenberg, and Sushant Sachdeva.

  4. Max-Cut

    • \(\MAXCUT\) arises in VLSI design, and has some surprising relation to analyzing the Ising model in statistical physics.

      The Ising model on graph \(G = (V, E)\) has the Hamiltonian

      \[\begin{equation*} H(x) = \sum_{ u \sim v} J_{uv} x_u x_v, \end{equation*} \]

      which represents the energy of the system when the variables \(x\) take value in \(\{\pm 1\}^{V}\).

      William Rowan Hamilton, Hamiltonian path, Hamiltonian mechanics, Quaternions

    • Finding ground energy of the Hamiltonian corresponds to computing the \(\MAXCUT\) of \(G\) with weights \(\{J_{uv}\}\).

      It is easy to show \(H(x) = J - 2\, \mathrm{CUT}(x)\), where \(J = \sum_{u \sim v} J_{uv}\) and \(\mathrm{CUT}(x)\) is the total weight of the edges cut by the partition defined by \(x\).

    • It is one of the favorite problems for some theorists because of its connection to approximation algorithms and the Unique Games Conjecture (UGC).

      Goemans and Williamson's algorithm has approximation ratio \(\alpha \approx 0.878\).

      If UGC is true, this is the best possible approximation ratio for \(\MAXCUT\) where approximation ratio is the size of the solution cut divided by the optimal max-cut size.

    • It is easy to design an approximation algorithm with ratio \(\alpha = 0.5\), which we will see later.

Logic Problems

  1. SAT

    • Given a conjunctive normal form (CNF) logic formula, determine whether it is satisfiable or not.

    • Example of a \(\SAT\) input, in the DIMACS CNF format, for \((x_1 \lor x_2 \lor \neg x_4) \land (x_3 \lor \neg x_4) \land (x_1 \lor x_2 \lor x_3)\).

      c This line is a comment.
      p cnf 4 3
      1 2 -4 0
      3 -4 0
      1 2 3 0
      

      The input length of a \(\THREESAT\) instance with \(n\) variables and \(m\) clauses is \(O(m \log n)\).

    • Applications

      • Industrial optimization
      • Manufacturing planning
      • Circuit synthesis
      • Software verification
      • Air-traffic control
      • Automated proofs
    • \(k\)-\(\SAT\) problem

    • \(\TWOSAT\) is easy.

      The reason is that we can think of a clause \(\ell_i \lor \ell_j\), where \(\ell_i, \ell_j\) are literals, as the implication \(\neg \ell_i \rightarrow \ell_j\) (and, symmetrically, \(\neg \ell_j \rightarrow \ell_i\)). The problem then reduces to a simple reachability problem on a directed graph with \(2n\) vertices; a small code sketch appears at the end of this subsection.

    • \(\THREESAT\) is hard to solve. The best-known algorithm runs in time roughly \({1.3}^n\), by Paturi, Pudlak, Saks and Zane.

    • Note: SAT solvers are intensively studied and commercial software can sometimes solve instances of up to 1M variables.

      \(\SAT\) solvers were used to find the proof of the Boolean Pythagorean Triples Theorem (Heule, Kullmann, and Marek, 2016).

      1. The set \(\{1, 2, \ldots, 7824\}\) can be partitioned into two parts so that both do not contain \(a,b,c\) such that \(a^2 + b^2 = c^2\).
      2. The same claim for \(\{1, 2, \ldots, 7825\}\) is false.

      200 terabyte proof later compressed to 68 gigabytes!

      Watch Terence Tao talk on machine-assisted proofs.
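Returning to the \(\TWOSAT\) bullet above, here is a small sketch (deliberately inefficient, for clarity) of the implication-graph test: a clause \((a \lor b)\) contributes edges \(\neg a \to b\) and \(\neg b \to a\), and the formula is unsatisfiable exactly when some \(x\) and \(\neg x\) can reach each other.

from collections import defaultdict

def two_sat(n, clauses):
    """clauses: pairs of nonzero ints; literal +i means x_i and -i means ¬x_i."""
    g = defaultdict(list)
    for a, b in clauses:
        g[-a].append(b)
        g[-b].append(a)

    def reaches(u, v):                      # simple DFS reachability
        seen, stack = {u}, [u]
        while stack:
            w = stack.pop()
            if w == v:
                return True
            for x in g[w]:
                if x not in seen:
                    seen.add(x)
                    stack.append(x)
        return False

    return all(not (reaches(i, -i) and reaches(-i, i)) for i in range(1, n + 1))

print(two_sat(2, [(1, 2), (-1, 2), (-2, 1), (-1, -2)]))   # False: unsatisfiable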

  2. TQBF

    • The \(\SAT\) problem can be written as \(\exists x_1, x_2, \ldots, x_n\, \phi(x_1, x_2, \ldots, x_n)\).

    • A generalization of it is the \(\TQBF\) (Totally Quantified Boolean Formula) problem where there are alternating \(\exists\) and \(\forall\) quantifiers before \(\phi\).

      \(\exists x_1 \forall x_2 \exists x_3 \cdots Q x_n\, \phi(x_1, x_2, \ldots, x_n)\) where \(Q\) is either \(\exists\) or \(\forall\).

    • It is a harder problem than \(\SAT\).

Mathematics Problems

  • Prime, Multiplication, and Factoring

    \(\PRIME\) is in \(\P\)! Agrawal, Kayal, and Saxena 2002.

  • Linear equation vs. quadratic equation

  • Determinant vs. permanent

    For square matrix \(A\), the determinant is defined as

    \[\begin{equation*} \det(A) = \sum_{\pi \in S_n} \sign(\pi) \prod_{i \in [n]} A_{i,\pi(i)}. \end{equation*} \]

    Even though the above expression is a summation of \(n!\) terms, there is an efficient algorithm that computes \(\det(A)\) given \(A\) as input (e.g., Gaussian elimination).

    The permanent is defined similarly, but without the sign function. For a square matrix \(A\), the permanent is defined as

    \[\begin{equation*} \perm(A) = \sum_{\pi \in S_n} \prod_{i \in [n]} A_{i,\pi(i)}. \end{equation*} \]

    The permanent of a matrix is a natural quantity, and has been studied in several contexts including combinatorics and graph theory. It also arises in physics where it can be used to describe the quantum state of multiple Boson particles.

    No efficient algorithms are known!

    In Geometric Complexity Theory, proving that computing the permanent cannot be efficiently reduced to computing determinants is considered to be a major milestone for the program.

Class NP

  • Many problems are hard.

    • \(\HAMPATH\)
    • \(\MAXCUT\)
    • \(\THREESAT\)
    • \(\cdots\)
  • Why have we been unsuccessful in finding polynomial-time algorithms for these problems?

    We don't know.

  • We would like some understanding on the reason for this behavior.

    But whether this “cliff” between the easy and hard problem is a real phenomenon or a reflection of our ignorance is still an open question.

  • One remarkable discovery concerning this question shows that the complexities of many problems are linked.

    This discovery is known as \(\NP\)-completeness theory.

Efficient Verifiability

  • The \(\HAMPATH\) problem.

    Recall that a Hamiltonian path between \(s\) and \(t\) is a path from \(s\) to \(t\) that goes through every vertex of the graph exactly once.

  • The \(\HAMPATH\) problem has a feature called polynomial verifiability which is important for understanding its complexity.

    Even though we do not know how to solve the \(\HAMPATH\) problem efficiently, it is (obviously) easy to verify whether a given path is Hamiltonian!

    How about the complement \(\overline{\HAMPATH}\) of \(\HAMPATH\)? That is, can you find a certificate that a graph does not have a Hamiltonian path?

  • The \(\SAT\) problem has a similar situation.

    Easy to verify, hard to solve.

  • Another polynomial-verifiable problem is the complement of \(\PRIME\) called \(\COMPOSITE\), deciding whether there are numbers \(p,q > 1\) such that \(n = pq\) for the given number \(n\).

    How about \(\PRIME\)? Can you certify that a number is prime without resorting to AKS02? (Homework)

  • Proof system and computation (efficient version)

    A verifier for problem \(A\) is an algorithm \(V\), where \(A = \{x \mid V \text{ accepts } \pair{x}{w} \text{ for some string } w\}\).

    Problem \(A\) is polynomial-time verifiable if it has a polynomial-time verifier, whose time complexity is measured in terms of the length of input \(x\).

  • The string \(w\) is the witness or proof that \(x\) is in \(A\).

    The proof \(w\) has length at most polynomial in \(n = \abs{x}\), as otherwise the polynomial-time verifier would not have time to read the whole proof.

  • \(\NP\) is a notion of easiness.

    Informally, \(\NP\) is the class of problems that have polynomial-time verifiers.
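To make the definition concrete, here is a sketch of a polynomial-time verifier for \(\HAMPATH\) (the graph is given as an adjacency dictionary; the witness \(w\) is the claimed path):

def verify_hampath(G, s, t, path):
    """G maps each vertex to a set of out-neighbors; accept iff path is a Hamiltonian s-t path."""
    n = len(G)
    return (len(path) == n and len(set(path)) == n        # every vertex exactly once
            and path[0] == s and path[-1] == t
            and all(v in G[u] for u, v in zip(path, path[1:])))   # consecutive edges exist

G = {1: {2}, 2: {3}, 3: {1, 4}, 4: set()}
print(verify_hampath(G, 1, 4, [1, 2, 3, 4]))   # True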

Nondeterministic Polynomial-Time

  • Why is it called \(\NP\)?

    \(\NP\) stands for nondeterministic polynomial time and is derived from an alternative characterization of the class \(\NP\) using nondeterministic TMs.

    The nondeterministic time class \(\NTIME(t(n))\) is the collection of problems that can be decided by an \(O(t(n))\) time nondeterministic Turing machine.

  • Example: an NTM \(H\) for \(\HAMPATH\).

    \(H\) = “On input \(\tuple{G, s, t}\), where \(G\) is a directed graph and \(s,t\) are nodes in \(G\):

    1. Nondeterministically write down a list of numbers \(u_1, u_2, \ldots, u_n\), where \(n\) is the number of vertices in \(G\).
    2. Reject if there are repetitions in the list.
    3. Reject if the conditions \(s = u_1\) and \(t = u_n\) do not hold.
    4. Reject if there is an \(i\) between \(1\) and \(n − 1\) such that \((u_i, u_{i+1})\) is not an edge of \(G\).
    5. Accept.”
  • Equivalence

    A problem is in \(\NP\) if and only if it is decided by some nondeterministic polynomial-time Turing machine. That is,

    \[\begin{equation*} \NP = \bigcup_{c \ge 0} \NTIME (n^c). \end{equation*} \]

Problems in NP

  • \(\NP\) is a large class of important problems we care about.

    Reducibility Among Combinatorial Problems, by Richard Karp 1972, considered 21 natural combinatorial problems in \(\NP\).

    Computers and Intractability: A Guide to the Theory of NP-Completeness, a textbook by Michael Garey and David S. Johnson published in 1979, listed more than 300 NP-hard computational problems.

  • Example problems in \(\NP\) but not known in \(\P\):

    \(\LONGPATH\), \(\HAMPATH\), \(\THREESAT\), \(\MAXCUT\), \(\TSP\), \(\FACTOR\)

  • Problems we haven't mentioned

    1. The \(\VC\) problem

    2. The \(\CLIQUE\) problem

      A \(k\)-clique of graph \(G\) is a complete subgraph of \(G\) with \(k\) vertices.

      \[\begin{equation*} \CLIQUE = \{ (G, k) \mid G \text{ has a } k\text{-clique} \}. \end{equation*} \]
    3. The \(k\)-\(\COL\) problem

      A graph \(G\) is \(k\)-colorable if it is possible to color the vertices of \(G\) with \(k\) colors so that the two endpoints of every edge receive different colors.

    4. Graph isomorphism problem \(\GI\)

P, NP, coNP

  • Relation with \(\P\):

    • \(\P \subseteq \NP \subseteq \EXP\)

    • \(\P\) vs. \(\NP\) problem

    Is verifying the proof much easier than coming up with the proof?

    It is one of the seven Millennium Prize Problems.

    Most researchers believe \(\P \ne \NP\).

  • Class \(\coNP\)

    Problems where no-instances have certificates. That is, if \(x \not\in A\), there is a witness \(y\) of this fact.

    A decision problem \(A\) is in \(\coNP\) if there is a polynomial-time verifier \(V\) such that \(x \in A\) if and only if \(\forall y, V(x, y) = 1\).

    \(\coNP = \{A \mid \overline{A} \in \NP\}\).

    Example: Given a CNF formula \(\varphi\), is it a tautology?

  • \(\NP\) vs \(\coNP\).

    Draw the picture.

    The problem \(\FACTOR\) asking whether \(n\) has a factor less than \(k\) is in \(\NP \cap \coNP\).

    To show it is in \(\coNP\), it suffices to present the prime number factorization of \(n\).

Lecture 7: NP Completeness

NP-Completeness and Cook-Levin Theorem

One of the most beautiful observations of computational complexity theory is that some problems in \(\NP\) are the hardest among all problems in \(\NP\). If you can solve one of them in polynomial time, you can solve all problems in \(\NP\) in polynomial time.

Importance

  1. On the theoretical side, to resolve the \(\P\) vs. \(\NP\) conjecture, it now suffices to focus on one particular problem.

  2. On the practical side, prevent wasting time searching for a polynomial time algorithm to solve a particular problem known to be \(\NPC\).

NP-Completeness

A problem \(B\) is \(\NPC\) if two conditions hold

  1. \(B\) is in \(\NP\).
  2. Every \(A \in \NP\) is polynomial-time reducible to \(B\).

We say \(B\) is \(\NP\)-hard if only the second condition is required to hold (the containment \(B \in \NP\) may fail or is simply not checked).

  • Polynomial-time reducibility

    Problem \(A\) is polynomial-time (mapping) reducible to problem \(B\), written \(A \le_P B\), if there is a polynomial-time computable function \(f: \Sigma^* \to \Sigma^*\) such that \(\forall x\, [x \in A \leftrightarrow f(x) \in B]\).

    The function \(f\) is called the polynomial-time reduction of \(A\) to \(B\).

  • It is called polynomial-time many–one reduction (or Karp reduction) in some other textbooks.

    If \(A \le_P B\) and \(B \in \P\), then \(A \in \P\).

Cook-Levin Theorem

\(\SAT\) is \(\NP\)-complete.

We already know from the previous lecture that \(\SAT \in \NP\). So it suffices to prove that every \(A \in \NP\) reduces to \(\SAT\).

Let \(A\) be a problem in \(\NP\). We present a polynomial-time reduction from \(A\) to \(\SAT\) as follows.

By definition, there is a verifier \(V\) for \(A\) such that

  1. \(V\) is a TM that takes input \(\tuple{x,y}\) and runs in time \(\poly(\abs{x})\), and
  2. There exists \(y\) such that \(V\) accepts \(\tuple{x,y}\) if and only if \(x \in A\).

We consider the sequence of configurations of machine \(V\) on input \(\tuple{x,y}\).

Suppose an upper bound of the running time of machine \(V\) is \(\ell = n^c\) for large \(n\). We can truncate the configurations and represent them as strings in \(\Gamma^\ell\) where \(\Gamma = (Q \times \Sigma) \cup \Sigma\).

For simplicity, assume that the size of \(\Gamma\) is \(2^k\) for some integer \(k\), and a configuration for each time step is encoded as a binary string in \(\{0,1\}^{\ell k}\).

We construct a CNF formula \(\varphi_x\) as follows. Similar to the proof that \(\lambda\)-calculus is Turing-complete, the computation of \(\gamma^{(t+1)}_{j}\) from \(\gamma^{(t)}_{j-1}, \gamma^{(t)}_j, \gamma^{(t)}_{j+1}\) is given by a fixed function \(F_V\). The input of \(F_V\) is \(\{0,1\}^{3k}\), and the output is \(\{0,1\}^{k}\).

Every Boolean function \(f:\{0,1\}^k \to \{0,1\}\) can be written as a CNF formula of size \(O(k 2^k)\) by excluding all possible inputs \(x\) such that \(f(x) = 0\). So the condition that \(\gamma^{(t+1)}_{j} = F_V \bigl( \gamma^{(t)}_{j-1}, \gamma^{(t)}_j, \gamma^{(t)}_{j+1} \bigr)\) can be expressed as \(k\) such \(F_V\) formulas \(\varphi^{(t)}_j = \bigwedge_{i=1}^k \varphi^{(t)}_{j,i}\).

Finally, the CNF formula we use consists of three parts:

  1. A CNF formula \(\varphi_{\text{init}}\) that checks that the input is \(x\).
  2. The \(F_V\) formulas \(\varphi^{(t)}_{j}\) checking the propagation of the computation is correct.
  3. A CNF formula \(\varphi_{\text{final}}\) that checks that the configuration \(\gamma^{(\ell)}\) is accepting.

It has the form \(\varphi_x = \varphi_{\text{init}} \land \Bigl( \bigwedge_{t,j} \varphi^{(t)}_{j} \Bigr) \land \varphi_{\text{final}}\).

The variables in the formula are all the bits representing the configurations.

It is easy to see that \(x \in A\) if and only if \(V\) accepts \(x\) for some \(y\), if and only if \(\varphi_x \in \SAT\).
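The clause-level encoding used above, writing an arbitrary function \(f:\{0,1\}^k \to \{0,1\}\) as a CNF by excluding every falsifying assignment, is easy to sketch in code; the literal convention \(+i/-i\) for \(x_i/\neg x_i\) is just for illustration.

from itertools import product

def to_cnf(f, k):
    """Return a list of clauses (literal +i means x_i, -i means ¬x_i) equivalent to f."""
    cnf = []
    for a in product([0, 1], repeat=k):
        if f(*a) == 0:
            # this clause is falsified exactly by the assignment a
            cnf.append([-(i + 1) if bit else (i + 1) for i, bit in enumerate(a)])
    return cnf

maj = lambda x, y, z: int(x + y + z >= 2)      # example: majority of three bits
print(to_cnf(maj, 3))
# [[1, 2, 3], [1, 2, -3], [1, -2, 3], [-1, 2, 3]]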

Variants of Satisfiability

  • Many variants of \(\SAT\) are \(\NPC\).

    \(\THREESAT\) is \(\NPC\).

    Easy.

    \(\problem{MAX2SAT}\) and \(\NAESAT\) are \(\NPC\).

    \(\NAESAT\) (not-all-equal SAT) is a SAT problem with the additional requirement that no clause is allowed to have all literals assigned True.

    See Schaefer's dichotomy theorem.

NP-Complete Reductions

Now that we know \(\THREESAT\) is \(\NPC\), we can prove that more problems are \(\NPC\) by reducing \(\THREESAT\) to them.

Vertex Cover

\(\VC\) is \(\NPC\).

We prove that \(\THREESAT \le_{P} \VC\).

  • Local reductions

    Play with small graphs.

    Consider the vertex cover of an edge graph and a triangle graph.

    We need to encode the True/False values of the variables in the CNF formula, i.e., the nondeterministic choices of the values of the variables.

  • Construction

    Given a \(\THREESAT\) formula \(\varphi = \bigwedge_{j=1}^m C_j\) of \(n\) variables, where \(C_j = \ell_{j,1} \lor \ell_{j,2} \lor \ell_{j,3}\), we want to construct an instance of the \(\VC\) problem.

    For each variable \(x_i\), we introduce two variable vertices labeled by \(x_i\) and \(\neg x_i\) and connect them with an edge.

    For each clause \(C_j\), we introduce three clause vertices labeled by \(\ell_{j,1}\), \(\ell_{j,2}\), and \(\ell_{j,3}\), forming a triangle.

    Finally, we connect each clause vertex \(\ell_{j,r}\) to the corresponding variable vertex. For example, if \(\ell_{j,r} = \neg x_i\), \(\ell_{j,r}\) is connected to variable vertex \(\neg x_i\).

    The resulting graph \(G\) has \(3m + 2n\) vertices, and we ask if there is a vertex cover of size \(2m + n\).

  • Proof

    1. If \(\varphi\) is satisfiable, then \(G\) has a vertex cover of size \(2m + n\).

    2. If \(G\) has a vertex cover of size \(2m + n\), there is a satisfying assignment for \(\varphi\).
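The construction above, written as code (a sketch; vertices are tagged tuples, and clauses are triples of nonzero ints with \(+i/-i\) for \(x_i/\neg x_i\)):

def threesat_to_vc(clauses, n):
    """Return (edges, k): the graph has a vertex cover of size k iff the 3SAT instance is satisfiable."""
    edges = []
    for i in range(1, n + 1):                          # variable gadget: edge (x_i, ¬x_i)
        edges.append((("v", i), ("v", -i)))
    for j, (a, b, c) in enumerate(clauses):            # clause gadget: a triangle
        ca, cb, cc = ("c", j, 1), ("c", j, 2), ("c", j, 3)
        edges += [(ca, cb), (cb, cc), (cc, ca)]
        for lit, node in ((a, ca), (b, cb), (c, cc)):  # connect to the matching variable vertex
            edges.append((node, ("v", lit)))
    return edges, 2 * len(clauses) + n

edges, k = threesat_to_vc([(1, 2, -4), (3, -4, 1), (1, 2, 3)], 4)
print(len(edges), k)    # 22 10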

Independent Set and Clique

  • An independent set of a graph \(G\) is a set of vertices forming a subgraph of \(G\) with no edges.

    It is easy to prove that \(\VC \le_P \INDSET\) as \(S\) is a vertex cover if and only if \(V\setminus S\) is an independent set.

    The reduction is therefore simply \(\tuple{G, k}\) to \(\tuple{G, n -k}\) where \(n\) is the number of vertices in \(G\).

  • \(S\) is an independent set of \(G\) if and only if \(S\) is a clique of complement graph \(G^c\).

    \(\INDSET \le_P \CLIQUE\).

3-Coloring

\(\THREECOL\) is \(\NPC\).

Consider the OR gadget. Details left as homework.

Hamiltonian Cycle

\(\HAMCYCLE\) is \(\NPC\).

  • It is easy to see that \(\HAMCYCLE\) is in \(\NP\).

  • We prove the theorem by reducing \(\VC\) to \(\HAMCYCLE\).

  • Let \(\tuple{G, k}\) be an instance of \(\VC\). We will define a graph \(H\) such that \(G\) has a vertex cover of size \(k\) if and only if \(H\) has a Hamiltonian cycle.

  • Construction of \(H\)

    Play with the Hamiltonian cycle problem on simple graphs.

    If there is a vertex \(v\) of degree two, the cycle must visit \(v\) after visiting one of its neighbors.

    For a vertex of degree larger than two, we need non-deterministic choices of the outgoing edge.

    Consider the following gadget graph, there are three ways to go through all the vertices in the gadget.

    Place a copy of the gadget on every edge \(e\) in \(G\). We call this copy \(H_e\), with vertices \(a\) and \(b\) forming the left side and \(c\) and \(d\) the right side. If \(e = \{u, v\}\), the left side is close to \(u\) and corresponds to vertex \(u \in V(G)\). Similarly, the right side corresponds to vertex \(v\) in \(G\).

    Use the first way of traversing (I) the gadget \(H_e\) to represent choosing both vertices on the edge in the vertex cover problem. Use the second and third (II and III) to represent choosing one of the vertices in the vertex cover.

    Connect the sides of the gadgets \(H_e\) that correspond to the same vertex \(u\) in \(G\) in an arbitrary order. Each vertex in \(G\) now corresponds to several linked gadgets with two open edges around the vertex.

    Give an example.

    Introduce \(k\) vertices \(v_1, \ldots, v_k\) in \(H\) that each connects to all the open edges.

    This is the construction of \(H\), which is obviously in polynomial time.

  • Completeness

    If \(G\) has a vertex cover of size \(k\), we show that \(H\) has a Hamiltonian cycle.

    Start with \(v_1\), go to the first set of gadgets corresponding to a marked vertex in the cover, and traverse the gadgets depending on whether the other vertex in the edge is marked or not. If the other vertex is marked, use traversal (I). Otherwise, use (II) or (III). Go to \(v_2\) and the second set of gadgets of another marked vertex, and so on.

    Go back to \(v_1\) in the end.

  • Soundness

    If \(H\) has a Hamiltonian cycle, there is a vertex cover of size \(k\) for \(G\).

    All Hamiltonian cycles go through \(v_1, \ldots, v_k\) exactly once.

    Between the visits to \(v_j\) and \(v_{j+1}\), the cycle can go around at most one vertex in \(G\).

    This gives us a set of at most \(k\) vertices of \(G\), which we claim to be a valid vertex cover of \(G\): otherwise, if there were an uncovered edge \(e\) in \(G\), the vertices in \(H_e\) would not be visited by the cycle.

  • It is easy to reduce \(\HAMPATH\) to \(\HAMCYCLE\) and vice versa.

Max Cut

\(\MAXCUT\) is \(\NPC\).

  • \(\MAXCUT \in \NP\) is easy.

  • We reduce \(\NAESAT\) to \(\MAXCUT\).

    Choosing the starting point could simplify the reduction.

    It is an art, with no general recipe.

  • Reduction

    We use the fact that a triangle has a max-cut of size \(2\).

    Let \(C = \bigwedge_{j=1}^m C_j\) be an instance of \(\NAESAT\) with variables \(x_1, x_2, \ldots, x_n\).

    The graph \(G\) has \(2n\) vertices labeled by \(x_i\) and \(\neg x_i\).

    For each clause \(C_j = \ell_{j,1} \lor \ell_{j,2} \lor \ell_{j,3}\), form a triangle on the vertices corresponding to \(\ell_{j,1}\), \(\ell_{j,2}\), and \(\ell_{j,3}\).

    We want to use the cut to define the true assignment of variables \(x_i\), so it is ideal that \(x_i\) and \(\neg x_i\) are on different sides of the max-cut. To do so, we add \(n_i\) copies of the edges \((x_i, \neg x_i)\) where \(n_i\) is the number of times \(x_i\) or \(\neg x_i\) occurs in the instance \(C\).

    Set the threshold \(k = 5m\).

    Give an example.

    A cut of size \(5m\) must take \(3m\) edges from the variable edges between \(x_i\) and \(\neg x_i\) (there are \(\sum_i n_i = 3m\) of them), and the remaining \(2m\) must come from the triangles. So every triangle is cut with exactly two edges, meaning that the literals in each clause are neither all True nor all False.

  • Completeness and soundness

    If \(C\) is NAE-satisfiable, define the cut by putting the literals assigned True on one side and those assigned False on the other.

    If \(G\) has a cut of size at least \(k\), then the \(\NAESAT\) problem has a satisfying assignment.

Summary

  • An \(\NP\) reduction consists of four parts:

    1. Description of the reduction \(R\).
    2. Efficiency of the reduction.
    3. Completeness: if we are reducing \(A\) to \(B\) and \(x \in A\), then \(R(x) \in B\).
    4. Soundness: If \(x \not\in A\), then \(R(x) \not\in B\).

NP Intermediate

If \(\P \ne \NP\), there are problems of intermediate hardness between \(\P\) and \(\NP\)-complete.

Ladner's Theorem

There are two different proofs for this theorem. The one we will present is due to Russell Impagliazzo.

  • Idea: Apply padding to make the \(\SAT\) problem easier.

    For computable function \(Z: \natural \to \natural\) consider the padded \(\SAT\) problem \(\SAT_Z\), where

    \[\begin{equation*} \SAT_Z = \bigl\{ \varphi 1 0^{n^{Z(n)}} \mid \varphi \in \SAT \text{ and } \abs{\varphi} = n \bigr\}. \end{equation*} \]

    \(Z\) needs to be a slowly growing function so that we can argue three things:

    1. \(\SAT_Z\) is in \(\NP\).
    2. \(\SAT_Z\) is not in \(\P\) unless \(\P = \NP\).
    3. \(\SAT_Z\) is not \(\NPC\).
  • Definition of \(Z\)

    Enumerate all TMs \(M_1, M_2, \ldots\). Without loss of generality, assume that \(M_i\) has running time at most \(n^i\) and each machine has infinitely many encodings appearing in this sequence.

    Define \(Z(1) = 1\). Suppose \(Z(n-1) = i\). If \(M_i\) correctly decides membership in \(\SAT_Z\) for all \(x\) of length at most \(\log n\), define \(Z(n) = i\). Otherwise, define \(Z(n) = \min \{i+1, \log\log n\}\).

    Note that we require \(Z(n)\) to grow slower than \(\log\log n\) so that the time complexity of \(M_i\) is at most polynomial in \(n\).

  • Proofs

    Claim 1. Function \(Z(n)\) and \(n^{Z(n)}\) are computable in time polynomial in \(n\). Hence, \(\SAT_Z\) is in \(\NP\).

    Note that \(Z(n)\) is recursively defined. Solve the recurrence relation.

    Claim 2. \(Z\) is bounded if and only if \(\SAT_Z \in \P\).

    If \(\SAT_Z \in \P\), then there will be a fixed TM \(M_i\) solving the problem. This would imply that \(Z(n)\) is constant for large \(n\) by the definition of \(Z\).

    Consider the case when \(Z\) is constant, say \(Z(n)=i\), for large \(n\). In this situation, we claim that \(M_i\) solves the problem \(\SAT_Z\) in time \(n^i\). Otherwise, if there is an input \(x\), which \(M_i\) does not solve correctly, then for \(n > \max \bigl\{ 2^{\abs{x}}, 2^{2^{i+1}} \bigr\}\), we would have \(Z(n) \ne i\), a contradiction.

    We conclude that \(Z\) is unbounded and hence \(\SAT_Z\) is not in \(\P\): otherwise the padding would be polynomial and \(\SAT_Z\) would still be \(\NPC\), which together with \(\SAT_Z \in \P\) gives \(\P = \NP\), contradicting our assumption that \(\P \ne \NP\).

    Claim 3. \(\SAT_Z\) is not \(\NPC\). Suppose it is, then there is a polynomial time reduction \(R\) such that \(\phi \in \SAT \leftrightarrow R(\phi) \in \SAT_Z\).

    Since \(Z\) is unbounded, the padding is super-polynomial, so the reduction maps a \(\SAT\) instance \(\phi\) to a \(\SAT\) instance \(\varphi\) of asymptotically smaller length! Applying the reduction repeatedly shrinks the instance down to constant size, which would put \(\SAT\) in \(\P\), again contradicting \(\P \ne \NP\).

Lecture 8: What If P = NP?

“The evidence in favor of [P \(\ne\) NP] and [ its algebraic counterpart ] is so overwhelming, and the consequences of their failure are so grotesque, that their status may perhaps be compared to that of physical laws rather than that of ordinary mathematical conjectures.” — Volker Strassen, laudation for Leslie Valiant, 1986.

What if P = NP

  • It is said that the \(\P\) vs. \(\NP\) problem is the great open question of computer science.

    But why is it so important?

  • What if \(\P = \NP\) but the best algorithm for \(\THREESAT\) takes \(n^{1000}\) time?

    This is a valid possibility, and a few leading TCS figures, like D. Knuth, took the possibility \(\P = \NP\) more seriously than others.

  • We will consider two extreme scenarios:

    1. \(\THREESAT\) is super easy: it has an \(O(n)\) or \(O(n^2)\) time algorithm with reasonable constant hiding in the big-O notation (say \(\le 10^6\).)
    2. \(\THREESAT\) is very hard: it is exponentially hard and cannot be solved faster than \(2^{\eps n}\) for some constant \(\eps > 0\) not too small (say \(\ge 10^{−6}\)).
  • The fastest known algorithm for \(\THREESAT\) has a time complexity of \(2^{0.38n}\) for formulas of \(n\) variables.

  • What if \(\P = \NP\) and the algorithm is practical?

Search-To-Decision Reductions

  • The ability to answer ‘yes/no’ questions of \(\NP\) problems leads to the ability to search for the solutions to the problem.

  • \(\THREESAT\) example

    Suppose algorithm \(D\) decides whether a formula \(\varphi(x_1, x_2, \ldots, x_n)\) is satisfiable or not.

    Consider algorithm \(S\) = “On input \(\varphi\)

    1. For \(i = 1, 2, \cdots, n\) do:
      1. Use \(D\) to test whether \(\varphi(a_1, \ldots, a_{i-1}, 0, x_{i+1}, \ldots, x_n)\) is satisfiable or not.
      2. If so, set \(a_i = 0\). Otherwise, set \(a_i = 1\).
    2. Check if \(a = (a_1, a_2, \ldots, a_n)\) is a correct answer. Output \(a\) if it is.”
  • The general case is similar.

  • If \(\P = \NP\), then for every decision problem \(A \in \NP\) with verifier \(V\), there is polynomial-time algorithm \(\text{Search}_V\) for its corresponding search problem \(A_{\text{search}}\), whose task is to find a witness \(y\) such that \(V(x,y) = 1\) given \(x\) as input.

    For example, you can find the cut as a solution to the \(\MAXCUT\) problem, the Hamiltonian path of a graph etc., all in polynomial time.
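A sketch of the search-to-decision loop for \(\THREESAT\) described above; decide is any decision oracle for satisfiability under a partial assignment (here instantiated, purely for demonstration, by brute force, but with \(\P = \NP\) it would be a fast algorithm).

from itertools import product

def search(clauses, n, decide):
    """Return a satisfying assignment {i: 0/1} or None, using only yes/no queries."""
    a = {}
    if not decide(clauses, a):
        return None
    for i in range(1, n + 1):
        a[i] = 0
        if not decide(clauses, a):   # not satisfiable with x_i = 0 under the current prefix
            a[i] = 1                 # ... so some satisfying assignment has x_i = 1
    return a

def brute_decide(clauses, partial):  # stand-in oracle, exponential time
    n = max(abs(l) for c in clauses for l in c)
    free = [i for i in range(1, n + 1) if i not in partial]
    for bits in product([0, 1], repeat=len(free)):
        a = {**partial, **dict(zip(free, bits))}
        if all(any((a[abs(l)] == 1) == (l > 0) for l in c) for c in clauses):
            return True
    return False

phi = [(1, 2, -3), (-1, 2, 3), (-2, -3, 1)]
print(search(phi, 3, brute_decide))  # {1: 0, 2: 0, 3: 0}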

Optimization

  • The above trick does not guarantee the solutions to the \(\NP\) problems are optimal.

    For example, how do we find the longest path, instead of finding a path whose length is above a certain threshold \(k\), in a given graph?

  • In general, you can use binary search to solve all \(\NP\) problems with an optimization flavor like the \(\LONGPATH\) problem.

  • Linear programming vs. integer programming

    We have mentioned that LP problems have polynomial-time algorithms.

    Yet, if one requires that the variables take \(0,1\) values instead of real values, the problem is \(\NPC\).

    So if \(\P = \NP\), we would have polynomial-time algorithms for integer programming problems too. Such algorithms would have many important applications in machine learning, engineering, and operations research.

  • If \(\P = \NP\), one can efficiently search through an exponential space and find not just the “local” mountain but the tallest peak.

  1. Supervised Learning

    In supervised learning, one is given sample-label pairs \((x_i, y_i)\) for \(i=1, 2, \ldots, m\), \(x_i \in \{0,1\}^n\) and \(y_i \in \{0,1\}\).

    The goal is to find a hypothesis \(h:\{0,1\}^n \to \{0,1\}\) such that when we have a new input \(x\), we may use \(y = h(x)\) to predict its label.

    The idea in supervised learning is to follow Occam's razor: the simplest hypothesis that explains the data is likely to be correct.

    One popular approach to model this is to pick some fairly simple function \(H:\{0,1\}^{k+n} \to \{0,1\}\) and think of the first \(k\) bits of \(H\)'s input as parameters.

    For example, we can think of the first \(k\) inputs of \(H\) as the weights and connections of some neural network that will be applied to the remaining \(n\) inputs.

    In a natural approach called empirical risk minimization, one tries to find parameters \(\theta = (\theta_1, \ldots, \theta_k) \in \{0,1\}^k\) that minimize the number of errors made by \(x \mapsto H(\theta, x)\). That is, the problem is to minimize over the parameter space of \(\theta\) for the loss function

    \[\begin{equation*} \text{LOSS}(\theta) = \sum_{i=1}^m \abs{H(\theta, x_i) - y_i}. \end{equation*} \]

    Hence if \(\P = \NP\), we can solve the optimization for supervised learning in great generality.

    Training could then be done as easily as inference.
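    As a toy illustration (not part of the lecture), here is a Python sketch of empirical risk minimization by exhaustive search over the \(2^k\) parameter strings; the hypothesis class `H` (a parity of selected input bits) is an ad hoc choice.

    from itertools import product

    def H(theta, x):
        # Hypothetical hypothesis class: theta selects which input bits are XORed.
        return sum(t & xi for t, xi in zip(theta, x)) % 2

    def erm(samples, k):
        # Exhaustive empirical risk minimization over all 2^k parameter strings.
        # (The point of P = NP would be doing this search in polynomial time.)
        best_theta, best_loss = None, None
        for theta in product([0, 1], repeat=k):
            loss = sum(abs(H(theta, x) - y) for x, y in samples)
            if best_loss is None or loss < best_loss:
                best_theta, best_loss = theta, loss
        return best_theta, best_loss

    samples = [((1, 0, 1), 0), ((1, 1, 0), 1), ((0, 1, 1), 1), ((0, 0, 0), 0)]
    print(erm(samples, k=3))   # a zero-loss parameter setting exists here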

  2. Implications to Cryptography

    If \(\P = \NP\), most protocols in cryptography would become insecure!

    This is similar to the supervised learning problem: seeing enough encrypted data would allow one to recover the secret key.

Find Proofs for Mathematical Problems

  • Recall that a proof system is defined by a TM \(V\) that checks whether \(w\) is a proof of \(x\).

  • The proof checking procedure is usually very efficient, finishing the checking in \(O(n)\) or at most \(O(n^2)\) time.

    The existence of an \(n\)-page proof for a statement \(x\) implies that there exists a string \(w\) of length \(O(n)\) encoding the proof.

  • Thus, if \(\P = \NP\), then despite Gödel’s incompleteness theorems, we can still automate mathematics since a computer can find proofs for us.

  • Consider the \(\SHORTPROOF\) problem whose task is to answer yes if statement \(x\) has a proof \(w\) of length at most \(m\) when given \(\tuple{x, 1^m}\) as input.

  • Frankly speaking, if the shortest proof for some statement requires a terabyte, then human mathematicians will never find this proof either.

    For this reason, Gödel felt that whether \(\SHORTPROOF\) has a polynomial time algorithm is of great interest. He wrote in a letter to John von Neumann in 1956 (note this is before the concept of \(\NP\) or even “polynomial time” was formally defined):

    “One can obviously easily construct a Turing machine, which for every formula \(F\) in first order predicate logic and every natural number \(n\), allows one to decide if there is a proof of \(F\) of length \(n\) (length = number of symbols). Let \(\psi(F, n)\) be the number of steps the machine requires for this and let \(\varphi(n) = \max_F \psi(F,n)\). The question is how fast \(\varphi(n)\) grows for an optimal machine. One can show that \(\varphi(n) \ge k \cdot n\) [for some constant \(k > 0\)]. If there really were a machine with \(\varphi(n) \sim k \cdot n\) (or even \(\sim k n^2\)), this would have consequences of the greatest importance. Namely, it would obviously mean that in spite of the undecidability of the Entscheidungsproblem, the mental work of a mathematician concerning Yes-or-No questions could be completely replaced by a machine. After all, one would simply have to choose the natural number \(n\) so large that when the machine does not deliver a result, it makes no sense to think more about the problem.” — Kurt Gödel

    For many reasonable proof systems, \(\SHORTPROOF\) is \(\NPC\), and if \(\P = \NP\), we can decide whether a mathematical statement \(x\) has a short proof in polynomial time.

    Gödel can be considered the first person to formulate the \(\P\) vs. \(\NP\) question. Unfortunately, the letter was only discovered in 1988.

    Gödel’s Lost Letter and P=NP, a blog by R. J. Lipton.

Quantifier Elimination

Previous discussions were based on solving problems in \(\NP\) under the assumption \(\P = \NP\). Interestingly, we may even solve problems not known to be in \(\NP\)!

  • Problem \(A\) in \(\NP\) can be characterized as follows.

    Deciding if \(x \in A\) is equivalent to deciding if the following statement is true.

    \[\begin{equation*} \exists y\, [ V(x, y) = 1 ]. \end{equation*} \]
  • Let us consider more general formulas, for example, whether the following statement is true.

    \[\begin{equation*} \exists y\in \{0,1\}^{p(\abs{x})}\, \forall z \in \{0,1\}^{q(\abs{x})}\, [ V(x, y, z) = 1 ]. \end{equation*} \]

    Note there is one more layer of alternating quantification.

    More generally, we consider formulas with a constant number of alternations.

    Problems characterized by such formulas form a complexity class known as the polynomial hierarchy, \(\PH\).

  • Examples

    Consider the following problem related to the \(\problem{CIRCUIT\text{-}MINIMIZATION}\) problem.

    Given a classical circuit \(C\), decide if there is an equivalent circuit \(C'\) that can be described by a string of length at most \(s\).

    \[\begin{equation*} \exists C'\in \{0,1\}^s\, \forall x\in \{0,1\}^n\, [C(x) = C'(x)]. \end{equation*} \]

    Other examples include modeling the winning strategy within a constant number of moves.

  • It turns out that if \(\P = \NP\) we can solve such problems efficiently as well.

    We will only consider the simple case of two quantifiers, and the general proof is a simple induction.

    To decide if

    \[\begin{equation*} \exists y\, \forall z\, [ V(x, y, z) = 1 ], \end{equation*} \]

    we consider problems \(\varphi_{x, y}\) for fixed \(y\), deciding if

    \[\begin{equation*} \forall z\, [ V(x, y, z) = 1 ]. \end{equation*} \]

    The negation \(\overline{\varphi_{x, y}}\) of \(\varphi_{x, y}\) is

    \[\begin{equation*} \exists z\, [ V(x, y, z) = 0 ], \end{equation*} \]

    which is a problem in \(\NP\); therefore, there is a polynomial-time algorithm \(S\) that solves it given \(\tuple{x,y}\) as input. That is, for all \(x,y\), \(S\) outputs \(1\) if \(\exists z\, [ V(x, y, z) = 0 ]\) (equivalently, \(\varphi_{x, y}\) is false), and \(S\) outputs \(0\) if \(\varphi_{x, y}\) is true.

    Hence, the original problem is equivalent to

    \[\begin{equation*} \exists y\, [ S(x,y) = 0 ], \end{equation*} \]

    which is again a statement in \(\NP\), and by the assumption that \(\P = \NP\), it is solvable in \(\P\).

Approximate Counting

The complexity of counting solutions and \(\SharpP\).

  • Suppose that \(C\) is a circuit of size polynomial in \(n\) with inputs \(x = x_1, \ldots, x_n\).

  • How many \(x\) satisfy \(C(x) = 1\)?

    Given the description of the circuit \(C\), what is the size of the set \(S = \{ x \mid C(x) = 1\}\)?

  • This is a problem known to be complete for the complexity class \(\SharpP\), and computing the exact answer is at least as hard as \(\NP\).

  • Stockmeyer's counting theorem

    Even if \(\P = \NP\), we would not know how to compute \(\abs{S}\) exactly, but Stockmeyer's counting theorem shows how to approximate it in that case.

Implications

  • We can solve some problems (a huge number of them) faster!

    Moreover, it is a shift of paradigm of how we think about computation.

  • A fast algorithm for \(\NP\) problems would bring about a new type of understanding.

    Compared with problems for which we do know efficient algorithms, the problems we cannot solve efficiently usually lack a “closed-form formula” or other, more constructive descriptions.

    Such new understanding would be very fruitful regardless of its computational utility.

“If P=NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it's found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffett.” — Scott Aaronson

Relativization and Limits of Diagonalization

So far, the diagonalization method has been very successful in showing many results in this course. Can we prove \(\P \ne \NP\) using the diagonalization method?

Relativization is a barrier for this approach!

Relativization

  • Computation assisted by the ability to solve certain problems for free.

  • Consider a TM with the extra ability to solve \(\SAT\) problems in a single step.

    We do not ask how this is accomplished but think of this extra ability as a black box.

  • The black box is often called an oracle.

    A polynomial-time machine with a \(\SAT\) oracle can solve any \(\NP\) problem, regardless of whether \(\P\) equals \(\NP\).

    Such a machine is said to be computing relative to the \(\SAT\) problem.

  • In general, the oracle can correspond to any problem, not just \(\SAT\).

    An oracle for problem \(A\) is a device that can answer membership questions for \(A\). An oracle TM \(M^A\) is a modified TM with the extra capability of querying the oracle \(A\). That is, \(M^A\) has an oracle tape, and in one step, \(M\) is informed whether the string on the oracle tape is in \(A\) or not. Define \(\P^A\) to be the class of problems solvable by a polynomial-time oracle Turing machine using oracle \(A\). Define \(\NP^A\) similarly, using nondeterministic polynomial-time oracle machines.

  • Example

    \[\begin{align*} \NP & \subseteq \P^{\SAT},\\ \coNP & \subseteq \P^{\SAT}. \end{align*} \]

Oracle Separation of \(\P\) and \(\NP\)

There is an oracle \(A\) such that \(\P^A \ne \NP^A\).

  • Key idea: use oracles to hide structures.

  • We need to construct (or program) an oracle \(A\) so that the claim is true.

    That is, we need a problem \(L_A\) such that \(L_A \in \NP^A\) and \(L_A \not\in \P^A\).

Consider \(L_A = \{ x \mid \exists w \in A \text{ such that } \abs{x} = \abs{w} \}\).

It is easy to see \(L_A \in \NP^A\): to verify \(x \in L_A\), the machine nondeterministically guesses \(w\) of length \(\abs{x}\) and checks that \(w\) is in \(A\) using the oracle.

We will program \(A\) so that \(L_A \not\in \P^A\).

Let \(M_1, M_2, \ldots\) be a list of all polynomial-time oracle TMs. For simplicity, we assume \(M_i\) runs in time \(n^i\).

Our construction of \(A\) ensures that for all \(i\), \(M_i^A\) does not decide the problem \(L_A\) correctly.

Suppose \(A\) is partially specified and we have achieved our goal for all \(M_j\) with \(j<i\). We now work with \(M_i\).

Choose \(n\) so that

  1. \(2^n > n^i\).
  2. \(n\) is larger than the length of all strings queried before.

We show how to extend the definition of \(A\) so that \(M_i^A\) answers the question “\(1^n \in L_A\)?” incorrectly. Run \(M_i\) on input \(1^n\); when it queries \(A\), if the queried string is already defined (queried before), \(A\) replies accordingly, and if the queried string is new, \(A\) reports ‘no’.

After \(M_i^A\) finishes, if \(M_i^A\) accepts, we program \(A\) so that all strings of length \(n\) are assigned ‘no’. If \(M_i^A\) rejects, we program \(A\) so that one of the strings of length \(n\) not queried by \(M_i\) is assigned ‘yes’. This is possible as \(M_i^A\) queries at most \(n^i < 2^n\) strings.

This ensures that \(M_i^A\) answers \(1^n\) incorrectly. So \(L_A \not\in \P^A\) for the oracle \(A\) constructed above.

Limits of Diagonalization

  • Oracle \(B\)

    There is an oracle \(B\) such that \(\P^B = \NP^B\).

    This direction is actually easier once we have all the tools ready, which we will discuss later in the course.

    Note: We can choose \(B\) to be \(\TQBF\).

    What does it say when putting both results together?

  • In diagonalization, one TM simulates another and then behaves differently.

    The simulation can be modified easily if both machines are oracle TMs.

    So any statement provable for TMs using only diagonalization will generalize to oracle TMs.

    In particular, if there is a diagonalization proof for \(\P \ne \NP\), then the same proof will show \(\P^B \ne \NP^B\), which is a contradiction.

Lecture 9: Space Complexity

Space Complexity

Space complexity considers the complexity of computational problems in terms of the amount of space or memory they require.

Space complexity shares many of the features of time complexity and serves as a further way of classifying problems according to their computational hardness.

SPACE and NSPACE

  • For functions \(f(n) \ge n\), define

    \[\begin{equation*} \begin{split} \SPACE(f(n)) = & \{ A \mid A \text{ is solvable by a TM using }\\ & \quad O(f(n)) \text{ space} \},\\ \NSPACE(f(n)) = & \{ A \mid A \text{ is solvable by an NTM using }\\ & \quad O(f(n)) \text{ space} \}. \end{split} \end{equation*} \]
  • Example: \(\SAT\) can be solved by a linear space algorithm.

    TM \(M\) = “On input \(\desc{\varphi}\), where \(\varphi\) is a Boolean formula:

    1. For each truth assignment to the variables \(x_1, x_2, \ldots , x_n\) of \(\varphi\):
      1. Evaluate \(\varphi\) on that truth assignment.
      2. Accept if evaluated to True.
    2. Reject.”
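    A direct Python rendering of machine \(M\) (a sketch, not from the lecture): each candidate assignment reuses the same working memory, so the space used is linear in the number of variables (plus the formula itself).

    from itertools import product

    def sat_linear_space(clauses, n):
        # Try every truth assignment, reusing the same O(n) working memory each time.
        for assignment in product([False, True], repeat=n):
            if all(any(assignment[abs(l) - 1] == (l > 0) for l in clause)
                   for clause in clauses):
                return True   # accept
        return False          # reject

    # (x1 or x2) and (not x1 or not x2): satisfiable
    print(sat_linear_space([[1, 2], [-1, -2]], 2))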
  • Time vs. space

    1. \(\TIME(t(n)) \subseteq \SPACE(t(n))\),
    2. \(\SPACE(t(n)) \subseteq \TIME(2^{O(t(n))})\).

    Define the complexity class

    \[\begin{equation*} \PSPACE = \bigcup_{k \ge 0} \SPACE(n^k) \end{equation*} \]

    So \(\P \subseteq \PSPACE\).

    \(\NP \subseteq \PSPACE\).

    Prove that \(\THREESAT\) is in \(\PSPACE\).

    \(\coNP \subseteq \PSPACE\).

    Because \(\class{coPSPACE} = \class{PSPACE}\).

Polynomial Space

PSPACE Completeness

  • Definition

    1. \(B\) is in \(\PSPACE\).
    2. Every \(A\) in \(\PSPACE\) reduces to \(B\) in polynomial time.

    Reductions should be as weak as possible.

  • The \(\TQBF\) problem

    \(\phi_1 = \exists x \forall y\, \bigl[ (\neg x \lor y) \land (x \lor \neg y) \bigr]\).

    \(\phi_1 \not\in \TQBF\)

    \(\SAT\) is a special case of \(\TQBF\).

    The number of alternations depends on the number of variables.

  • Containment in \(\PSPACE\)

    \(\TQBF\) is in \(\PSPACE\).

    Consider TM \(S\) = “On input \(\desc{\varphi}\):
    1. If \(\varphi\) has no quantifiers, then \(\varphi\) has no variables. It is either True or False. Output accordingly.
    2. If \(\varphi = \exists x\, \psi\), then evaluate \(\psi\) with \(x\) equals True and False recursively. Accept if either case accepts. Reject otherwise.
    3. If \(\varphi = \forall x\, \psi\), then evaluate \(\psi\) with \(x\) equals True and False recursively. Accept if both accept. Reject otherwise.”

    The space complexity of \(S\): Each recursive level uses constant extra space.

    So \(\TQBF \in \SPACE(n) \subseteq \PSPACE\).
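    A small Python sketch of machine \(S\) (illustrative only): the quantifier prefix is given as a list and the quantifier-free part as a function of a full assignment.

    def eval_tqbf(prefix, matrix, assignment=()):
        # prefix: list of ('E', var) or ('A', var); matrix: function of an assignment dict.
        if not prefix:
            return matrix(dict(assignment))
        quant, var = prefix[0]
        results = (eval_tqbf(prefix[1:], matrix, assignment + ((var, b),))
                   for b in (False, True))
        return any(results) if quant == 'E' else all(results)

    # phi_1 = exists x forall y [(not x or y) and (x or not y)]
    phi1 = lambda a: (not a['x'] or a['y']) and (a['x'] or not a['y'])
    print(eval_tqbf([('E', 'x'), ('A', 'y')], phi1))   # False: phi_1 is not in TQBF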

  • Completeness

    \(\TQBF\) is \(\PSPACE\)-complete

    Let \(A\) be a problem in \(\PSPACE\) decided by a TM using space at most \(n^k\). We will give a polynomial time reduction \(f\) mapping \(A\) to \(\TQBF\).

    Plan: Design a formula \(\varphi_{M,w}\) to express the condition that \(M\) accepts \(w\). Use the tableau of the configurations of \(M\) on \(w\).

    The tableau has \(n^k\) columns and at most \(d^{n^k}\) rows where \(d\) is the alphabet size.

    If we use Cook-Levin directly, the formula would be too big!

    We will do this recursively. For configurations \(c_i\) and \(c_j\) and a number \(b\), construct a formula \(\varphi_{c_i, c_j, b}\) expressing that \(c_i\) can reach \(c_j\) in at most \(b\) steps, written \(c_i \stackrel{b}{\to} c_j\). So we have

    \[\begin{equation*} \varphi_{c_i,c_j,b} = \exists c_m\, \bigl[ \varphi_{c_i, c_m, b/2} \land \varphi_{c_m, c_j, b/2} \bigr ]. \end{equation*} \]

    Do this recursively until the value \(b\) is \(1\), at which point we can use the method like in Cook-Levin. Finally, choose the formula \(\varphi_{M, w} = \varphi_{c_\text{init}, c_\text{final}, t}\) where \(t = d^{n^k}\).

    There is an issue, however, as each recursion step doubles the number of subformulas.

    The abbreviation trick!

    We replace the \(\land\) part with

    \[\begin{equation*} \forall(c_g, c_h) \in \{(c_i, c_m), (c_m, c_j)\}\, \bigl[ \varphi_{c_g, c_h, b/2} \bigr]. \end{equation*} \]

    Note that \(\forall (x \in S)\, [\phi]\) is equivalent to \(\forall x\, [x \in S \rightarrow \phi]\), and so each recursion increases the length of the formula by \(O(n^k)\).

    The number of recursion levels is \(\log d^{n^k} = O(n^k)\).

  • \(\TQBF\) remains \(\PSPACE\)-complete even if the inner formula without any quantifiers is a 3-CNF.

    Run the \(\THREESAT\) \(\NP\)-hardness reduction on the quantifier-free inner formula, which can be thought of as a verifier checking whether the input is a correct witness.

Savitch's Theorem

(Savitch) For any function \(f: \natural \to \natural\) where \(f(n) \ge n\),

\[\begin{equation*} \NSPACE(f(n)) \subseteq \SPACE(f^2(n)). \end{equation*} \]

As a corollary, \(\PSPACE = \NPSPACE\).

  • We have shown \(\NP \subseteq \PSPACE\), so nondeterminism can somehow be simulated space-efficiently.

  • A naive approach: trying all the branches of the NTM's computation.

    However, a branch that uses \(f(n)\) space may run for \(2^{O(f(n))}\) steps and each step may be a nondeterministic choice.

    Exploring the branches sequentially requires remembering all the choices made on a particular branch in order to find the next branch.

    Therefore, this naive approach may use \(2^{O(f(n))}\) space.

  • We can use the same recursion trick as the proof that \(\TQBF\) is complete for \(\PSPACE\).

  • Yieldability problem

    Given two configurations \(c_1\) and \(c_2\) of the NTM \(N\) and a number \(t\), decide whether \(N\) can go from \(c_1\) to \(c_2\) within \(t\) steps; each configuration can be written down in \(O(f(n))\) space.

    Define TM \(Y\) = “On input \(c_1\), \(c_2\) and \(t\):

    1. If \(t = 1\), check whether \(c_1 = c_2\) or whether \(c_1\) goes to \(c_2\) in one step according to the rules of \(N\). If so, accept. Reject otherwise.
    2. If \(t > 1\), then for each configuration \(c_m\) of \(N\) using space \(O(f(n))\):
      1. Run \(Y\) on input \(\tuple{c_1, c_m, t/2}\).
      2. Run \(Y\) on input \(\tuple{c_m, c_2, t/2}\).
      3. Accept if both runs accept.
    3. If we haven't accepted yet, reject.”

    This algorithm needs space to store the recursion stack.

    Each level of the recursion uses \(O(f(n))\) space to store a configuration.

    The depth of the recursion is \(\log t\), where \(t\) is the maximum time that the nondeterministic machine \(N\) may use on any branch.

    We have \(t = 2^{O(f(n))}\), so \(\log t = O(f(n))\). Hence \(Y\) is a deterministic TM of space complexity \(O(f^2(n))\).
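    A Python sketch of the yieldability recursion on an abstract configuration graph; `step(c1, c2)` stands in for “\(c_1\) yields \(c_2\) in one step of \(N\)” and `configs` enumerates all configurations, both hypothetical stand-ins for this illustration.

    def yieldable(c1, c2, t, configs, step):
        # Can c1 reach c2 within t steps?  The recursion depth is log t, and each
        # level stores only the middle configuration c_m it is currently trying.
        if t <= 1:
            return c1 == c2 or step(c1, c2)
        half = (t + 1) // 2
        return any(yieldable(c1, cm, half, configs, step) and
                   yieldable(cm, c2, t - half, configs, step)
                   for cm in configs)

    # Toy configuration graph: a path 0 -> 1 -> 2 -> 3.
    step = lambda a, b: b == a + 1
    print(yieldable(0, 3, 4, range(4), step))   # True
    print(yieldable(0, 3, 2, range(4), step))   # False: three steps are needed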

  • Application: Construction of the oracle \(B\) for \(\P^B = \NP^B\)

    Take \(B = \TQBF\).

    \(\NP^{\TQBF} \subseteq \NPSPACE \subseteq \PSPACE \subseteq \P^{\TQBF}\).

    The first containment holds because each \(\TQBF\) oracle call can be answered directly in polynomial space.

    The second containment is Savitch's theorem, and the last holds because \(\TQBF\) is \(\PSPACE\)-complete.

Games and Space Complexity

Connections between games and complexity

  • Formula game

    Given an instance of \(\TQBF\)

    \[\begin{equation*} \varphi = \exists x_1 \forall x_2 \exists x_3 \cdots (\forall/\exists) x_n\, [ (\cdots) \land \cdots \land (\cdots)] \end{equation*} \]

    We consider a two-player game called the formula game with players \(\exists\) and \(\forall\)

    The \(\exists\) Player assigns values to the \(\exists\) quantified variables and the \(\forall\) Player assigns values to the \(\forall\) variables.

    They choose the values in the order of the quantifiers in the formula.

    \(\exists\) Player wins if the formula evaluates to True.

    Define problem

    \[\begin{equation*} \begin{split} & \FG = \{ \desc{\varphi} \mid \text{Player $\exists$ has}\\ & \quad \text{a winning strategy for the formula game of } \varphi\}. \end{split} \end{equation*} \]
  • \(\FG\) is \(\PSPACE\)-complete

    Player \(\exists\) has a winning strategy in the formula game for \(\varphi\) if and only if \(\varphi\) is true.

    That is \(\FG = \TQBF\).

    Player \(\exists\) has a winning strategy in the game if she can choose some assignment to \(x_1\) so that for any assignment to \(x_2\) chosen by Player \(\forall\), there is a choice for \(x_3\) and so on, such that the formula evaluates to True.

    This is equivalent to the condition that the formula is true by the \(\exists\) and \(\forall\) semantics in logic.

  • The word-chain game (接龙游戏), or the Generalized Geography game.

    Given a directed graph \(G\), two players take turns choosing vertices. Each player must choose an out-neighbor of the vertex just chosen by the opponent and cannot choose any vertex that has been chosen before. A player who cannot make a move loses.

    \[\begin{equation*} \begin{split} & \GG = \{ \tuple{G, a} \mid \text{ Player 1 has}\\ & \quad \text{a winning strategy when starting from vertex } a\}. \end{split} \end{equation*} \]
  • Complexity of \(\GG\)

    The \(\GG\) problem is \(\PSPACE\)-complete.

    1. The \(\GG\) problem is in \(\PSPACE\).

    2. \(\GG\) is \(\PSPACE\)-hard.

    We reduce \(\FG\) to it.

    First, in the left part, we have diamonds to force the players to take turns. We cannot add outgoing edges to them.

    On the right-hand side, we have a node \(c\) at the top with the clauses \(c_j\) as its children.

    We link each clause to the True or False side of the variable diamonds on the left, according to the literals it contains.

    The node \(c\) must be chosen by the \(\exists\) player (add extra nodes if necessary).

    Player \(\forall\) then tries to choose a clause \(c_j\) that is not satisfied.

    Player \(\exists\) shows why \(c_j\) is actually satisfied.

  • Complexity of Chess and Go

    \(\PSPACE\)-hard for \(n \times n\) boards. Sometimes the games are even harder.

Logarithmic Space

When working with logarithmic space, we must change the machine model, as the work space is too small even to hold the input.

L and NL

Consider TMs with two tapes: the input tape is read-only and does not count toward the space used; only the read/write work tape counts.

Define complexity classes \(\L = \SPACE(\log n)\) and \(\NL = \NSPACE(\log n)\).

  • Examples

    1. \(\PAL\) is in \(\L\).

    2. Define \(\PATH\) to be the reachability problem for directed graphs. \(\PATH\) is in \(\NL\).

      The machine can nondeterministically guess the next vertex in the path.

      You will also need to keep a counter of the number of steps taken, so the machine can reject after \(n\) steps and avoid looping; the counter needs only \(O(\log n)\) space.

  • \(\L \subseteq \NL\).

Basic Facts

\(\L \subseteq \P\).

Savitch's theorem works for \(\NL\).

\(\NL \subseteq \SPACE(\log^2 n)\).

\(\NL \subseteq \P\).

Take a nondeterministic machine \(M\) that decides \(A\) in space \(O(\log n)\).

Build the configuration graph \(G_{M,w}\): its nodes are all configurations of \(M\) on \(w\), with an edge \(c_i \to c_j\) whenever \(c_i\) yields \(c_j\) in one step.

Claim. TM \(M\) accepts \(w\) if and only if the configuration graph \(G_{M, w}\) has a path from \(c_{\text{init}}\) to \(c_{\text{acc}}\).

Polynomial-time algorithm \(T\) for \(A\) = “On input \(w\),

  1. Compute \(G_{M,w}\).
  2. Accept if there is a path from \(c_{\text{init}}\) to \(c_{\text{acc}}\).”

NL Completeness

  • Definition

We cannot use polynomial-time reductions here!

We need to define log-space reducibility.

A log-space transducer is a TM with three tapes:

  1. A read-only input of size \(n\),
  2. A read/write work tape of size \(O(\log n)\),
  3. A write-only output tape.

Problem \(A\) is log-space reducible to \(B\), written \(A \le_L B\), if \(A\) is mapping reducible to \(B\) by a log space transducer.

  • Reductions in log space.

    If \(A \le_L B\) and \(B \in \L\) then \(A \in \L\).

  • The trivial approach of computing all of \(f(w)\) on input \(w\) and then running the decider for \(B\) does not work.

    It would produce a string that is too long!

  • The idea is that we don't have to store the whole \(f(w)\) and the decider only needs the symbols in \(f(w)\) one at a time.

    Compute the symbols in \(f(w)\) only at the time we need them.

    Each time, we start the transducer again from the beginning and ignore all output except the symbol at the desired position.

    This is OK, as we are considering space complexity!
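    A small Python sketch of this recomputation trick (names are illustrative): instead of storing \(f(w)\), the transducer is rerun from the beginning each time the decider for \(B\) needs the next symbol.

    def nth_output_symbol(transducer, w, i):
        # Rerun the transducer from scratch and keep only the i-th output symbol.
        for pos, symbol in enumerate(transducer(w)):
            if pos == i:
                return symbol
        return None   # past the end of f(w)

    def duplicate(w):
        # A toy transducer computing f(w) = ww, emitted one symbol at a time.
        for _ in range(2):
            for symbol in w:
                yield symbol

    # The decider reads f(w) symbol by symbol, never storing the whole string.
    w, i = "abc", 0
    while True:
        s = nth_output_symbol(duplicate, w, i)
        if s is None:
            break
        print(i, s)
        i += 1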

PATH is NL-Complete

  • We already know \(\PATH \in \NL\)

  • \(\NL\) reduces to \(\PATH\)

    The claim is not so surprising as we have already seen the configuration graph for \(\NL\) problems.

    Let \(f(w) = \tuple{G, s, t}\), where \(G\) is the configuration graph \(G_{M,w}\) and \(s = c_{\text{init}}\) and \(t = c_{\text{acc}}\).

    We need to show that there is a log-space transducer \(T\) computing \(f\).

    TM \(T\) = “On input \(\tuple{M, w}\):

    1. Go through all pairs of possible configurations \(c_i, c_j\) of \(M\) in some order.
    2. Output the edge \((c_i, c_j)\) if \(c_i \to c_j\) is a legal move. (Each configuration has size \(O(\log n)\), as it consists mainly of the work-tape contents of the \(\NL\) machine \(M\).)
    3. Output \(c_{\text{init}}\) and \(c_{\text{acc}}\).”

NL = coNL

(Immerman–Szelepcsényi) \(\NL = \coNL\).

Space Hierarchy Theorem

  • It is similar to the time hierarchy theorem.

    For any space-constructible function \(f: \natural \to \natural\), there is a language \(A\) that is decidable in \(O(f(n))\) space but not decidable in space \(o(f(n))\).

    \(\L \subsetneq \PSPACE\).

  • The proof is similar to that of the time hierarchy theorem.

  • \(\L \subseteq \NL \subseteq \P \subseteq \NP \subseteq \PH \subseteq \P^{\SharpP} \subseteq \PSPACE\)

    We know \(\NL \ne \PSPACE\): by Savitch's theorem \(\NL \subseteq \SPACE(\log^2 n)\), which is a proper subclass of \(\PSPACE\) by the space hierarchy theorem (a diagonalization argument).

Complexity Theory's Waterloo

Lecture A: Probability Theory for Computer Science

Basics of Probability Theory

History of Probability Theory

  • Probability theory is now a fully developed area of mathematics with a wide range of applications in many areas of research, including computer science.

  • Probability theory was invented to understand problems in gambling.

    The French nobleman and gambler Antoine Gombaud (Chevalier de Méré)

    • Getting at least one “6” in four throws of a die.

    • Getting at least one double-six in \(24\) throws of two dice.

    • Known as de Méré's problem.

  • Unfinished game problem

    Two gamblers, Alice and Bob, are playing a head-and-tail game. The rule is that anyone who first wins \(s\) times gets all the bets. At the moment when Alice wins \(a\) times and Bob wins \(b\) times, the game has to stop for some reason. How should they split the bet?

    • An example \(s=3\), \(a=2\), \(b=1\).

    • He couldn't solve it, so he asked Blaise Pascal and Pierre de Fermat.

    • This led to the early development of probability theory.

      So, probability theory is born from the study of gambling!

Sample Space and Events

  • In TCS, we are mostly concerned with probability over finite sample spaces.

  • Example: The outcome of tossing \(n\) coins as a bit string \(x \in \{0,1\}^n\).

    The sample space is \(S = \{0,1\}^n\), the set of possible outcomes in the random experiment.

    Each string \(x\) occurs with probability \(2^{-n}\).

    In general, it could be any finite set (or even infinite set).

  • An event is a subset \(A\) of the sample space \(S = \{0,1\}^n\).

  • The probability of event \(A\), denoted as \(\Pr(A)\), is the probability that the randomly chosen \(x\) will be in \(A\).

    More generally, it is

    \[\begin{equation*} \Pr(A) = \sum_{x \in A} \Pr(x). \end{equation*} \]
  • The picture you should have for probability is that we assign each possibility a non-negative real number.

    More generally, we could consider the general case where for each \(x \in S\), the probability \(\Pr(x)\) is a real number in \([0,1]\) satisfying the completeness condition

    \[\begin{equation*} \sum_{x\in S} \Pr(x) = 1. \end{equation*} \]

    The completeness condition says that we always assign \(1\) to the sample space \(S\) as an event.

Examples

  • Parity of uniform random coins

    For every \(n > 0\),

    \[\begin{equation*} \Pr_{x\sim \{0,1\}^n} \biggl( \sum_{i=1}^n x_i \text{ is even} \biggr) = \frac{1}{2}. \end{equation*} \]
  • The unfinished game as a program

    c = RandInt(2)
    if (c == 1):
        print('Alice wins.')
    else:
        c = RandInt(2)
        if (c == 1):
            print('Alice wins.')
        else:
            print('Bob wins.')
  • Split the bet based on the probability of winning.

    \[\begin{equation*} \Pr(\text{Alice wins}) = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}. \end{equation*} \]

Simple Facts

  • If \(A \subseteq B\) then \(\Pr(A) \le \Pr(B)\).

  • What is the probability of \(\Pr(A \cup B)\)?

    \[\begin{equation*} \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B). \end{equation*} \]
  • It is not always true that \(\Pr(A \cup B) = \Pr(A) + \Pr(B)\), but the following inequality is always true

    \[\begin{equation*} \Pr(A \cup B) \le \Pr(A) + \Pr(B). \end{equation*} \]
  • This inequality is known as the union bound.

    \[\begin{equation*} \Pr \biggl(\bigcup_{j=1}^m A_j \biggr) \le \sum_{j=1}^m \Pr(A_j). \end{equation*} \]

Conditional Probabilities

  • Conditioning on event \(A\) is like assuming \(A\) happens for sure!

    So you can take \(A\) as the sample space temporarily.

    \[\begin{equation*} \Pr(B|A) = \frac{\Pr(A \cap B)}{\Pr(A)}. \end{equation*} \]
  • Chain rule (in words)

    “For events \(A\) and \(B\) to occur, first \(A\) must occur (with \(\Pr(A)\)), then \(B\) must occur given \(A\) occurred.”

    \[\begin{align*} \Pr(A \cap B) & = \Pr(A) \cdot \Pr(B|A),\\ \Pr(A \cap B \cap C) & = \Pr(A) \cdot \Pr(B|A) \cdot \Pr(C | A \cap B). \end{align*} \]
  • Law of total probability

    For all events \(A, B\)

    \[\begin{equation*} \Pr(B) = \Pr(A) \Pr(B|A) + \Pr(A^c) \Pr(B|A^c). \end{equation*} \]
  • Example: Let's provide an alternative proof of the even parity probability.

    Let \(A_k\) be the event that the first \(n-1\) bits sum to \(k\), and let \(B\) be the event that all \(n\) random bits have even parity.

    Conditioned on \(A_k\), the sum of all \(n\) bits is \(k\) or \(k+1\) with equal probability, and exactly one of the two is even, so \(\Pr(B|A_k) = 1/2\) for every \(k\).

    By the law of total probability,

    \[\begin{equation*} \Pr(B) = \sum_k \Pr(A_k) \Pr(B | A_k) = \frac{1}{2} \sum_k \Pr(A_k) = \frac{1}{2}. \end{equation*} \]

Independence

Two events \(A\) and \(B\) are independent if \(\Pr(A \cap B) = \Pr(A) \Pr(B)\).

  • If \(\Pr(A) \ne 0\), this is equivalent to \(\Pr(B|A) = \Pr(B)\).

  • The events \(A\) and \(B\) in the above example are independent.

  • In the random program picture, independence usually follows from the fact that \(A\) and \(B\) depend on disjoint parts of the program.

  • Multiple independent events

    Independence of \(A_1, A_2, \ldots, A_k\) requires that for every subset \(K\) of \(\{1, 2, \ldots, k\}\),

    \[\begin{equation*} \Pr \biggl(\bigcap_{j\in K} A_j \biggr) = \prod_{j \in K} \Pr(A_j). \end{equation*} \]

Birthday Paradox

There are \(m\) students in a room. What's the probability that they all have different birthdays?

  • Simplification: we exclude Feb. 29th, so the number of days \(N = 365\).

  • What's your guess?

  • Consider the more general case of \(N\) days.

    Let \(A_j\) be the event that the \(j\)-th student's birthday differs from all previous birthdays.

  • The probability is

    \[\begin{equation*} \begin{split} & \Pr(\text{no collision})\\ =\, & \Pr(A_1) \Pr(A_2|A_1) \cdots \Pr(A_m|A_1 \cap A_2 \cap \cdots \cap A_{m-1})\\ =\, & 1 \cdot (1 - 1/N) \cdot (1 - 2/N) \cdots (1 - (m-1)/N) \end{split} \end{equation*} \]
  • You can plot the probability as a function of \(m\).

    For \(N=365\), you will need \(m\) to be around \(23\) for the probability to be close to \(1/2\).

    Number \(m\)  Probability of a collision
    1 0.0%
    5 2.7%
    10 11.7%
    20 41.1%
    23 50.7%
    30 70.6%
    40 89.1%
    50 97.0%
    60 99.4%
  • More generally, when \(m \approx 1.18 \sqrt{N}\), the probability is around \(1/2\).

    Trick: We can approximate \(1+x\) using \(e^x\) for all real \(x\) with a small absolute value.

    \[\begin{equation*} \begin{split} \Pr(\text{no collision}) & \approx \exp \Bigl(-\frac{1}{N}\sum_{j=0}^{m-1} j \Bigr)\\ & = \exp\Bigl(-\frac{m(m-1)}{2N}\Bigr). \end{split} \end{equation*} \]
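    A quick Python check (illustrative) of the exact product against the \(e^x\) approximation; the values agree with the table above.

    import math

    def birthday_no_collision(m, N=365):
        # Exact product (1 - 1/N)(1 - 2/N)...(1 - (m-1)/N).
        p = 1.0
        for j in range(m):
            p *= 1 - j / N
        return p

    for m in (10, 23, 50):
        exact = 1 - birthday_no_collision(m)              # collision probability
        approx = 1 - math.exp(-m * (m - 1) / (2 * 365))   # the e^x approximation
        print(m, round(exact, 3), round(approx, 3))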

Applications of the Birthday Paradox

  • Cryptographic hash functions

    • 1991: MD5 (Rivest)
    • 1993: SHA-0 (NSA)
    • 1995: SHA-1 (NSA)
    • 2001: SHA-2 (NSA)
  • Birthday attack

    SHA-1 outputs 160 bits, so finding a collision for it is like a birthday paradox with \(N=2^{160}\).

  • Xiaoyun Wang's work breaks SHA-1.

    Xiaoyun Wang, Yiqun Lisa Yin, Hongbo Yu, Finding Collisions in the Full SHA-1, Advances in Cryptology-CRYPTO 2005, LNCS 3621, 2005, pp. 17-36 (Best Paper Award).

Random Variables and Expectations

Random Variable

  • An event is a yes/no classification of elements in the sample space \(S\).

  • A random variable is any function \(X: S \to \real\) that maps every possible outcome to a real number.

  • Example:

    \[\begin{equation*} X: x \mapsto \sum_{i=1}^n x_i, \end{equation*} \]

    is a random variable that represents the Hamming weight of \(n\) coin flip outcomes.

  • Functions of random variables are again random variables; for example, the sum

    \[\begin{equation*} X = X_1 + X_2 + \cdots + X_k. \end{equation*} \]

Expectation and Linearity of Expectation

  • Expectation is the mean value

    \[\begin{equation*} \E(X) = \sum_{u\in S} p(u) X(u) \end{equation*} \]
  • Linearity

    For all random variables \(X_j\) for \(j=1, 2, \ldots, k\),

    \[\begin{equation*} \E\Bigl(\sum_j X_j\Bigr) = \sum_j \E(X_j). \end{equation*} \]
  • Expectation of the product of independent random variables

    If \(X_j\) are independent random variables,

    \[\begin{equation*} \E\Bigl(\prod_j X_j\Bigr) = \prod_j \E(X_j). \end{equation*} \]

Examples

  • Binomial distribution

    • Bernoulli trial

      Success with probability \(p\).

      Failure with probability \(q = 1-p\).

    • A random variable \(X\) representing the number of successes of the Bernoulli trials follows the binomial distribution, written as \(X \sim \mathrm{Binomial}(n,p)\) with parameters \(n \in \natural\) and \(p \in [0,1]\).

      \(\Pr(X = k) = \binom{n}{k}\, p^k(1-p)^{n-k}\).

    • Expectation \(\E(X) = np\).

    • Variance \(\Var(X) = npq\).

  • Geometric distribution \(X \sim \mathrm{Geometric}(p)\).

    • The number of trials to have one success.

      \(\Pr(X = k) = (1-p)^{k-1} p\)

    • Expectation \(\E(X) = \frac{1}{p}\).

    • Variance \(\Var(X) = \frac{1-p}{p^2}\).

  • Gaussian distribution

    Even though we mostly focus on discrete distributions, Gaussian distribution is so important that you need to know it!

    • Probability density function of \(X \sim N(0,1)\).

      \[\begin{equation*} \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}. \end{equation*} \]
    • Rotational invariance of multiple Gaussians.

      The distribution of the random vector \(X = (X_1, X_2, \ldots, X_k)\), where the \(X_j\) are independent standard Gaussians, is invariant under rotations.

      Independent samples of Gaussian form a multi-dimensional Gaussian.

      It is a useful fact from which several important results about Gaussian follow.

      First, we can use it to prove

      \[\begin{equation*} \int_{-\infty}^{\infty} e^{-x^2/2} \mathrm{d}x = \sqrt{2\pi}. \end{equation*} \]

      Second, it shows that the sum of independent Gaussian random variables is again Gaussian.

Coupon Collector's Problem

  • There are \(n\) different kinds of coupons.

  • How many random coupons do you need to collect until you can have all \(n\) coupons?

    Let \(X\) be the # of coupons you will need.

  • What is the expectation of \(X\)?

  • Write \(X\) as some summation

    \(X_j\) is the number of additional coupons drawn to go from \(j-1\) distinct coupons to \(j\) distinct coupons.

    \(X = X_1 + X_2 + \cdots + X_n\).

  • When you have \(i-1\) different coupons, the probability of getting a new one in the next step is \(1 - (i-1)/n\).

  • So we have \(X_i \sim \text{Geometric}(1-(i-1)/n)\) and

    \[\begin{equation*} \E(X_i) = \frac{n}{n-i+1}. \end{equation*} \]

    Therefore, by the linearity of expectation

    \[\begin{equation*} \E(X) = n (1 + 1/2 + \cdots + 1/n) = n H_n, \end{equation*} \]

    where \(H_n\) is the \(n\)-th harmonic number, approximately \(\ln n\).
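    A short Python check (illustrative) of \(\E(X) = nH_n\) against a simulation.

    import random

    def expected_coupons(n):
        # n * H_n, by the linearity of expectation.
        return n * sum(1 / i for i in range(1, n + 1))

    def simulate_coupons(n):
        # Draw uniformly random coupons until all n kinds have been seen.
        seen, draws = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            draws += 1
        return draws

    n = 50
    print(expected_coupons(n))                                    # about 224.96
    print(sum(simulate_coupons(n) for _ in range(1000)) / 1000)   # close to the expectation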

Concentration and Tail Bounds

How likely will a random variable be far from its expectation?

If we toss \(n\) random coins, the number of heads is concentrated around \(n/2\).

It can be written as \(X = X_1 + X_2 + \cdots + X_n\) where the \(X_i\)'s are independent Bernoulli trials.

Chernoff bounds are a collection of bounds for a quantitative analysis of the above concentration.

Markov Inequality

If \(X\) is a non-negative random variable, then for all \(t > 0\),

\[\begin{equation*} \Pr (X \ge t) \le \frac{\E(X)}{t}. \end{equation*} \]

Consider the indicator function. Draw the graph.

  • Example

    As we have computed that the expectation of the coupon collector problem is \(n H_n\), we know that the probability that it takes more than \(100 n H_n\) coupons is at most \(1/100\)!

  • However, it does not give nice bounds for the random coin concentration problem.

Chebyshev Inequality

  • How concentrated a random variable \(X\) is when we know the variance of \(X\).

  • Variance

    \[\begin{equation*} \Var(X) = \E\bigl( (X - \E(X))^2 \bigr). \end{equation*} \]

    For independent \(X_i\) where \(i=1, 2, \ldots, n\), \(Y = \sum_{i=1}^n X_i\), we have

    \[\begin{equation*} \Var(Y) = \sum_{i=1}^n \Var(X_i). \end{equation*} \]
  • Standard deviation

    \[\begin{equation*} \sigma(X) = \sqrt{\Var(X)}. \end{equation*} \]
  • Chebyshev inequality

    Suppose a random variable \(X\) has mean value \(\mu\). Then for any \(t>0\),

    \[\begin{equation*} \Pr(\abs{X - \mu} \ge t) \le \frac{\Var(X)}{t^2}. \end{equation*} \]

    Consider random variable \(Y = (X - \mu)^2\). Then, we have \(\E(Y) = \Var(X)\) by definition. By Markov inequality, the probability that \(Y \ge t^2\) is at most \(\E(Y) / t^2\). The proof is complete by observing that \(Y \ge t^2\) is equivalent to \(\abs{X - \mu} \ge t\).

  • Example

    Number \(k\) Percentage within \(k\) standard deviations (at least)
    \(2\) \(75\%\)
    \(3\) \(89\%\)
    \(4\) \(93.75\%\)
  • Chebyshev is ideal for problems with pairwise independence as it only relies on the variance (the second moment).

  • This can be generalized to any polynomial of \(\abs{X - \mu}\), and so by the Markov inequality,

    \[\begin{equation*} \Pr(\abs{X - \mu} \ge t) \le \frac{\E(\abs{X - \mu}^k)}{t^k}. \end{equation*} \]

    To use this bound, you will need to bound the higher moments of the random variable, which can be complicated in most cases.

Chernoff Bound

  • We could even apply the same idea to functions other than polynomials!

    \[\begin{equation*} \varphi(\lambda) = \E \bigl(e^{\lambda(X-\mu)}\bigr). \end{equation*} \]
  • This leads to a powerful tool known as the Chernoff bound.

    There are many variants of Chernoff bound that are suitable for different scenarios.

  • Uniform coin flips.

    Consider \(S_n = X_1 + X_2 + \cdots + X_n\) where each \(X_i\) is a random coin flip. We have \(\mu = \E(S_n) = n/2\).

    Shift to have zero mean. \(S_n - n/2 = \sum_j (2X_j-1)/2 = \sum_j Y_j/2\) for \(Y_j = 2X_j-1\).

    You may think of the random walk on a line.

    \[\begin{equation*} \begin{split} \Pr(S_n - \mu \ge t) & = \Pr \Bigl(\exp \Bigl(\lambda \sum_j Y_j \Bigr) \ge e^{2\lambda t}\Bigr)\\ & = \Pr \Bigl(\prod_j e^{\lambda Y_j} \ge e^{2\lambda t} \Bigr)\\ & \le \frac{\E \prod_j e^{\lambda Y_j}}{e^{2\lambda t}}\\ & = \frac{(\E e^{\lambda Y_1})^n}{e^{2\lambda t}}. \end{split} \end{equation*} \]

    Now, we need to bound

    \[\begin{equation*} \begin{split} \E e^{\lambda Y_1} & = \frac{e^{\lambda} + e^{-\lambda}}{2}\\ & = 1 + \frac{\lambda^2}{2!} + \frac{\lambda^4}{4!} + \frac{\lambda^6}{6!} + \cdots\\ & \le e^{\lambda^2/2}. \end{split} \end{equation*} \]

    Continuing the calculation, we have

    \[\begin{equation*} \begin{split} \Pr(S_n - n/2 \ge t) & \le \exp(\lambda^2 n/2 - 2 \lambda t)\\ & = \exp(-2t^2/n), \end{split} \end{equation*} \]

    by choosing the best \(\lambda = 2t/n\).

  • Other versions

    We give several variants of Chernoff bounds covering most of the use cases.

    Let \(X_1, X_2, \ldots, X_n\) be independent random variables such that \(X_i \in [a_i,b_i]\). Define \(X = \sum_{i=1}^n X_i\) and \(\mu = \E (X)\). The Hoeffding bound says that, for all \(t \ge 0\),

    \[\begin{equation*} \begin{split} \Pr (X \le \mu -t) & \le \exp \biggl(- \frac{2 t^2}{\sum_j (b_j-a_j)^2}\biggr),\\ \Pr (X \ge \mu + t) & \le \exp \biggl(- \frac{2 t^2}{\sum_j (b_j-a_j)^2}\biggr). \end{split} \end{equation*} \]

    (Multiplicative-form Chernoff Bound). Consider random variables \(X_i \in [0,1]\) and \(X = \sum_j X_j\). Then, for every \(\varepsilon > 0\),

    \[\begin{equation*} \begin{split} \Pr(X \le (1-\varepsilon) \mu) & \le \exp \biggl(-\frac{\varepsilon^2}{2}\mu\biggr),\\ \Pr(X \ge (1+\varepsilon) \mu) & \le \exp \biggl(-\frac{\varepsilon^2}{2+\varepsilon}\mu\biggr).\\ \end{split} \end{equation*} \]

    Note the extra \(\varepsilon\) in the second inequality.

Sampling Theorem

Suppose \((X_j)_{j=1}^n\) is a sequence of independent Bernoulli trials with success probability \(p\). If we draw \(n\) independent samples, we can estimate \(p\) by computing the average \(\hat{p} = \sum_j X_j / n\).

The Chernoff bound says that, for \(n \ge 3 \ln (2/\delta) / \varepsilon^2\), \(\abs{\hat{p} - p} \le \varepsilon\) holds with probability at least \(1 - \delta\).
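A small Python sketch of the sampling procedure (illustrative): the sample count follows the bound above, and `random.random() < p` plays the role of one Bernoulli trial.

    import math, random

    def estimate(p, eps, delta):
        # Number of samples suggested by the bound, then an empirical estimate of p.
        n = math.ceil(3 * math.log(2 / delta) / eps ** 2)
        samples = [1 if random.random() < p else 0 for _ in range(n)]
        return n, sum(samples) / n

    n, p_hat = estimate(p=0.3, eps=0.05, delta=0.01)
    print(n, p_hat)   # |p_hat - 0.3| <= 0.05 except with probability about 0.01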

Probabilistic Method

The Probabilistic Method, by Noga Alon and Joel H. Spencer.

  • We will discuss the first example in the book regarding Paul Erdős' bound on the Ramsey number.

    Ramsey Theorem. In any two-coloring of the edges of the complete graph on six vertices, \(K_6\), there must be a monochromatic triangle \(K_3\).

  • More generally, \(R(k,l)\) is the smallest number of vertices \(n\) such that any red/blue coloring of the edges of \(K_n\) contains either a red \(K_k\) or a blue \(K_l\).

    (Erdős 1947) If \(\binom{n}{k} \cdot 2^{1 - \binom{k}{2}} < 1\), then \(R(k, k) > n\). Thus \(R(k,k)> \floor{2^{k/2}}\) for all \(k\ge 3\).

    Consider a random coloring of the graph \(K_n\) where each edge is colored red or blue with equal probability independently.

    For each vertex subset \(R\) of size \(k\), let \(A_R\) be the event that the corresponding subgraph is monochromatic. Clearly \(\Pr(A_R) = 2^{1 - \binom{k}{2}}\).

    By the union bound the probability of at least one of the \(A_R\) events occurring is at most \(\binom{n}{k} \cdot 2^{1 - \binom{k}{2}} < 1\).

    Thus, with positive probability, no events \(A_R\) occur.

    This means there is a coloring of \(K_n\) without a monochromatic \(K_k\); that is, \(R(k,k) > n\). For \(k\ge 3\) and \(n = \floor{2^{k/2}}\), we have

    \[\begin{equation*} \binom{n}{k}2^{1 - \binom{k}{2}} < \frac{2^{1+k/2}}{k!} \cdot \frac{n^k}{2^{k^2/2}} < 1. \end{equation*} \]
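    A quick Python check (illustrative) of the Erdős bound: for \(n = \floor{2^{k/2}}\), the quantity \(\binom{n}{k}\, 2^{1 - \binom{k}{2}}\) is indeed below \(1\).

    from math import comb, floor

    for k in (3, 5, 10, 20):
        n = floor(2 ** (k / 2))
        value = comb(n, k) * 2 ** (1 - k * (k - 1) // 2)
        print(k, n, value)   # always < 1, so R(k, k) > n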

Lecture B: Randomized Computation

“Does randomness solve problems? YES, randomness solves problems. BUT the randomness (probably) doesn't have to be random to solve problems.” — Russell Impagliazzo

Randomized Algorithms

Randomized Algorithms, by Rajeev Motwani and Prabhakar Raghavan, 1995

A randomized algorithm outputs the correct value with good probability on every possible input.

Randomized Searching Algorithm

  • Problem

    Input: a string \(x = x_1 x_2 \cdots x_n\) with exactly \(n/4\) positions equal to \(1\).

    Output: One such position.

  • Algorithms

    Deterministic algorithm \(D\) = “On input \(x\): For \(i=1, 2, \ldots, n\), return \(i\) if \(x_i = 1\).”

    The worst case is when \(x = 0^{3n/4}1^{n/4}\).

    In fact, for all deterministic algorithms, the time complexity is \(\Omega(n)\)!

    Randomized algorithm \(R\) = “On input \(x\): Repeat the following

    1. Sample \(i = \mathrm{RandInt}(n)\);
    2. Return \(i\) if \(x_i = 1\).”
  • When the success probability of one trial is \(p\), the running time of the algorithm is a random variable distributed according to \(\mathrm{Geometric}(p)\), and so the expected running time is

    \[\begin{equation*} \sum_{t=1}^\infty (1-p)^{t-1} p\, t = \frac{1}{p}. \end{equation*} \]

    The expected time complexity is constant for all possible inputs of length \(n\).

Matrix Multiplication Checking

  • Problem

    Input: Matrices \(A, B, C\) of size \(n \times n\).

    Output: ‘Yes’ if and only if \(AB = C\).

  • Algorithms

    The direct approach is to compute the product \(AB\) and check if it is the same as \(C\).

    This has complexity \(n^{\omega}\), where \(\omega=3\) for the straightforward implementation and \(\omega=2.37\cdots\) is the current record.

    Freivalds' algorithm \(F\) = “On input \(\tuple{A, B, C}\):

    1. Repeat the following \(k\) times
      1. Generate a random vector \(v \in \{0,1\}^n\);
      2. Compute \(d = A(Bv) - Cv\);
      3. Return ‘No’ if \(d \ne 0^n\);
    2. Return ‘Yes’.”
  • Analysis

    In \(O(k n^2)\) time, the algorithm can verify the matrix product with a probability of failure less than \(2^{-k}\).

    If \(AB=C\), the algorithm always returns ‘Yes,’ which is correct.

    If \(AB \ne C\), then \(D = AB - C\) is a non-zero matrix; let \(D_{i,j}\) be a non-zero entry.

    The \(i\)-th entry of \(d\) can be written as

    \[\begin{equation*} d_i = \sum_k D_{i,k} v_k = D_{i,j} v_j + \sum_{k: k\ne j} D_{i,k} v_k = D_{i,j} v_j + s. \end{equation*} \]

    The probability \(d_i = 0\) can be written as

    \[\begin{equation*} \begin{split} \Pr(d_i = 0) & = \Pr(d_i = 0 | s = 0) \Pr(s = 0) +\\ &\quad \Pr(d_i = 0 | s \ne 0) \Pr(s \ne 0) \end{split} \end{equation*} \] \[\begin{align*} \Pr(d_i = 0 | s = 0) & = \Pr( v_j = 0 ) = \frac{1}{2}\\ \Pr(d_i = 0 | s \ne 0 ) & = \Pr( v_j = 1 \land D_{i,j} = -s )\\ & \le \Pr(v_j = 1) = \frac{1}{2} \end{align*} \]

    So, we have

    \[\begin{equation*} \begin{split} \Pr(d = 0^n) & \le \Pr(d_i = 0) \\ & \le \frac{1}{2}(\Pr(s = 0) + \Pr(s \ne 0)) = \frac{1}{2}. \end{split} \end{equation*} \]
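    A compact Python sketch of Freivalds' algorithm (illustrative) using plain lists as matrices; each trial costs \(O(n^2)\) time.

    import random

    def freivalds(A, B, C, k=20):
        # Check AB = C with k random 0/1 vectors.
        n = len(A)
        for _ in range(k):
            v = [random.randint(0, 1) for _ in range(n)]
            Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            ABv = [sum(A[i][j] * Bv[j] for j in range(n)) for i in range(n)]
            Cv = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
            if ABv != Cv:
                return False   # AB != C for sure
        return True            # wrong only with probability at most 2^(-k)

    A, B = [[1, 2], [3, 4]], [[0, 1], [1, 0]]
    print(freivalds(A, B, [[2, 1], [4, 3]]))   # True: this is AB
    print(freivalds(A, B, [[2, 1], [4, 4]]))   # almost surely False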

Max-Cut Approximation

  • Problem

    Input: a graph \(G = (V, E)\) of \(n\) vertices.

    Task: find a cut of the graph \(G\) that cuts as many edges as possible.

  • The \(\MAXCUT\) problem is an \(\NPC\) problem.

  • One way to attack \(\NP\)-hard problems is to consider approximate solutions.

    In the \(\MAXCUTAPP\) problem, for example, our task is to find a cut \(C\) whose size is not far from the optimal one \(C^*\).

    If \(\text{size}_C \ge \alpha\, \text{size}_{C^*}\), we say that \(C\) is an \(\alpha\)-approximation of the max-cut.

  • A deterministic greedy algorithm.

    Let \(E_i = \{\{k,i\}\in E \mid k < i\}\), \(C_0 = \{1\}\), and \(C_1 = \emptyset\). At step \(i\), we put \(i\) in \(C_0\) or \(C_1\), whichever cuts more of the edges in \(E_i\). Since the sets \(E_i\) partition \(E\), it is easy to show that this deterministic algorithm gives a \(1/2\)-approximation.

  • A randomized algorithm.

    A cut is a \(0{/}1\) assignment of the vertices.

    Algorithm. Assign random \(0{/}1\) numbers \(x_u\) to each vertex \(u \in V\).

    \[\begin{equation*} \E(\text{size}_C) = \E \sum_{\{u, v\} \in E} 1_{x_u \ne x_v} = \frac{1}{2} \abs{E} \ge \frac{1}{2} \text{size}_{C^*}. \end{equation*} \]
  • Note that the randomized algorithm outputs large cuts in expectation.

    A technique called the “method of conditional expectations” can be used to obtain an algorithm that always produces a large cut.

    For this to work, you will need to be able to efficiently compute quantities like

    \[\begin{equation*} \E(\text{size}_C(x_1, \ldots, x_i, X_{i+1}, \ldots X_{n})), \end{equation*} \]

    where \(x_1, \ldots x_i\) are choices already made and \(X_{i+1}, \ldots, X_n\) are the remaining random bits.
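    A Python sketch (illustrative) of the randomized cut, together with a greedy pass in the spirit of the method of conditional expectations: each vertex is placed on the side that cuts more of the edges to already-placed vertices.

    import random

    def random_cut(n, edges):
        # Random side for each vertex; in expectation at least half the edges are cut.
        side = [random.randint(0, 1) for _ in range(n)]
        return side, sum(1 for u, v in edges if side[u] != side[v])

    def greedy_cut(n, edges):
        # Derandomized pass: fix vertices one by one, never decreasing the
        # conditional expectation of the cut size.
        side = [None] * n
        for i in range(n):
            gain = [0, 0]
            for u, v in edges:
                j, other = (u, v) if u == i else (v, u) if v == i else (None, None)
                if j is not None and other < i:
                    gain[1 - side[other]] += 1
            side[i] = 0 if gain[0] >= gain[1] else 1
        return side, sum(1 for u, v in edges if side[u] != side[v])

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # a 4-cycle plus a chord
    print(random_cut(4, edges))
    print(greedy_cut(4, edges))   # cuts at least half of the 5 edges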

Derandomize It!

  • The above randomized algorithm uses randomness that has exponentially many possibilities.

  • If we can reduce the randomness to a polynomial number of possibilities, we can derandomize the algorithm!

  • Universal hash functions.

    Hash functions map elements in a large universe to a small range. We often consider a family of hash functions \(\mathcal{H} = \{ h : U \to R \}\).

    Universal hash functions are a family of functions with the random-like property while the size of the family is small. We can use a small seed to choose hash functions from the family.

  • Pairwise independent hash functions.

    A family \(\mathcal{H} = \{h : U \to R\}\) is said to be pairwise independent if for any two distinct elements \(x_1, x_2 \in U\), and any two (possibly equal) values \(y_1, y_2 \in R\),

    \[\begin{equation*} \Pr_{h \in \mathcal{H}} \bigl( h(x_1) = y_1 \text{ and } h(x_2) = y_2 \bigr) = \frac{1}{\abs{R}^2}. \end{equation*} \]

    It is a very powerful tool!

    It follows easily that

    \[\begin{equation*} \Pr_{h \in \mathcal{H}} (h(x_1) = y_1) = \frac{1}{\abs{R}} \end{equation*} \]

    and so the two events \(h(x_i)= y_i\) are indeed independent, justifying the name.

  • Construction of pairwise independent bits.

    For all \(x \in \{0,1\}^k\), \(h(x) = a \cdot x + b \pmod{2}\), where \(a\in \{0,1\}^k\) and \(b \in \{0,1\}\).

    The size of this family is \(2^{k+1}\).

  • Prove that this is a pairwise independent hash function family for \(R = \{0,1\}\).

  • Use it to derandomize the random algorithm for \(\MAXCUTAPP\).

    Key: The algorithm only checks pairwise information.

  • The derandomized algorithm is highly parallelizable as the value for each vertex only depends on \(a,b\), and not the value of other vertices!

  • The algorithm considers \(2n\) different cuts that are independent of the input graph. Magically, one of them has size at least \(\abs{E}/2\).

    Derandomization is a powerful idea!
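    A Python sketch of the derandomization (illustrative): the pairwise independent bits \(h(x) = a \cdot x + b \pmod 2\) define the cut, and all \(2^{k+1}\) seeds \((a, b)\) are enumerated; the vertex encoding is ad hoc.

    from itertools import product

    def h(a, b, x):
        # Pairwise independent bit: h(x) = a . x + b (mod 2).
        return (sum(ai * xi for ai, xi in zip(a, x)) + b) % 2

    def derandomized_max_cut(n, edges):
        # One of the 2^(k+1) cuts has size >= |E|/2, since the expected size is |E|/2.
        k = max(1, (n - 1).bit_length())
        encode = lambda u: tuple((u >> i) & 1 for i in range(k))
        best = (-1, None)
        for a in product([0, 1], repeat=k):
            for b in (0, 1):
                side = [h(a, b, encode(u)) for u in range(n)]
                size = sum(1 for u, v in edges if side[u] != side[v])
                best = max(best, (size, side))
        return best

    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    print(derandomized_max_cut(4, edges))   # at least 3 of the 5 edges are cut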

Primality

Primality testing works by checking a property that is satisfied by prime numbers.

  • Fermat's little theorem

    \(a^{n} \equiv a \pmod{n}\) if \(n\) is prime.

    So \(a\) is a witness for the compositeness of \(n\) if \(a^n \not\equiv a \pmod{n}\); such a witness certifies that \(n\) is composite without revealing any factor of \(n\).

  • The converse of Fermat's little theorem is false!

    Even if \(a^n \equiv a \pmod{n}\) for all \(a\), \(n\) is not necessarily prime.

    Such numbers are called Carmichael numbers.

    The smallest example is \(n = 561\).

  • Rabin-Miller algorithm

    The algorithm works by extending the test of Fermat's little theorem.

    Consider an odd prime number \(n\) and write \(n - 1 = 2^k q\) with \(q\) odd. Then either \(a^q \equiv 1 \pmod{n}\), or the sequence \((a^q, a^{2q}, \ldots, a^{2^kq}) \pmod{n}\) ends with \(1\) and the value just before the first \(1\) is \(-1 \pmod{n}\).

    The proof is simple. If \(x^2 \equiv 1 \pmod{n}\), then \(n | (x+1)(x-1)\). So, if \(n\) is prime, we have \(x \equiv \pm 1 \pmod{n}\).

  • Analysis

    If \(n\) is prime, the answer is ‘yes’ for all \(a\).

    If \(n\) is composite, Rabin proved that the answer is ‘yes’ for at most \(1/4\) fraction of \(a\).
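    A standard Python sketch of the Rabin-Miller test as described above; the built-in `pow(a, q, n)` performs modular exponentiation.

    import random

    def is_probable_prime(n, k=20):
        # Write n - 1 = 2^r * q with q odd, then test k random bases a.
        if n < 2:
            return False
        if n % 2 == 0:
            return n == 2
        r, q = 0, n - 1
        while q % 2 == 0:
            r, q = r + 1, q // 2
        for _ in range(k):
            a = random.randrange(2, n - 1) if n > 4 else 2
            x = pow(a, q, n)
            if x == 1 or x == n - 1:
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False   # a witnesses that n is composite
        return True            # composite only with probability at most 4^(-k)

    print(is_probable_prime(561), is_probable_prime(2 ** 61 - 1))   # False True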

  • In 2002, Agrawal, Kayal, and Saxena ``derandomized’’ a randomized polynomial testing algorithm, showing primality testing is also in \(\P\).

    Yet, randomized algorithms are preferred for primality testing in practice as they are much more efficient.

BPP

  • What is the mathematical model for randomized computations?

  • Is it more powerful than deterministic computation?

Definition

Intuitively, randomized polynomial-time algorithms are polynomial-time algorithms that may occasionally call the \(\mathrm{RandInt}(\cdot)\) API.

A probabilistic Turing machine is a type of nondeterministic Turing machine in which each nondeterministic step is called a coin-flip step and has two legal next moves. We assign probability \(2^{-k}\) to each branch \(b\) of the machine's computation, where \(k\) is the number of coin-flip steps that occur along \(b\).

The probability of the machine accepting the input is defined as

\[\begin{equation*} \Pr(M \text{ accepts } w) = \sum_{b:b \text{ is accepting}} \Pr(b). \end{equation*} \]

The error probability of a machine for deciding a language \(A\) is \(\eps\) if

  1. \(\Pr(M \text{ accepts }w) \ge 1 - \eps\) if \(w\in A\), and
  2. \(\Pr(M \text{ accepts }w) \le \eps\) if \(w\not\in A\).

Note: Error probability \(\eps\) can be a function of the input size \(n\).

\(\BPP\) is the class of languages decided by probabilistic polynomial-time Turing machines with an error probability of at most \(1/3\).

The Definition of BPP and NP

Like \(\NP\), \(\BPP\) has an alternative characterization.

A decision problem \(A\) is in \(\BPP\) if and only if there is a polynomial-time verifier \(V\) such that for all \(x\), \(x \in A\) if and only if

\[\begin{equation*} \Pr_{r} \bigl(V(x, r) = 1 \bigr) \ge \frac{2}{3}. \end{equation*} \]

Error Reduction

We can reduce the error probability of randomized algorithms!

Any decision problem \(A \in \BPP\) has, for every polynomial \(p\), a polynomial-time randomized algorithm whose error probability is at most \(2^{-p(n)}\), where \(n\) is the input size.

Apply the Chernoff bound or the Sampling Theorem!

Circuits for BPP Problems

  • TMs vs. circuits

    For a finite function \(g:\{0,1\}^n \to \{0,1\}\) and a number \(s\), \(g \in \SIZE_n(s)\) if there is a circuit of at most \(s\) NAND gates computing \(g\).

  • To compare TMs and circuits, we must extend the class \(\SIZE_n(s)\) from finite functions to functions with unbounded input lengths.

    Consider \(F : \{0,1\}^* \to \{0,1\}\) and \(T : \natural \to \natural\). For every \(n\), define the restriction of \(F\) to input size \(n\) as

    \[\begin{equation*} F_{\restriction n} (x) = F(x) \text{ for } x\in \{0,1\}^n. \end{equation*} \]
  • We say that \(F\) is non-uniformly computable in at most \(T(n)\) size denoted \(F \in \SIZE(T)\) if there is a sequence \(C_0, C_1, C_2, \ldots\) of NAND circuits such that:

    1. For every \(n\), \(C_n\) computes function \(F_{\restriction n}\).
    2. For every sufficiently large \(n\), \(C_n\) has at most \(T(n)\) gates.

    The non-uniform analog of \(\P\) is the class \(\Ppoly\) defined

    \[\begin{equation*} \Ppoly = \bigcup_{c \in \natural} \SIZE(n^c). \end{equation*} \]
  • \(\P\) vs. \(\Ppoly\)

    1. \(\P\): One TM for all inputs.
    2. \(\Ppoly\): For every input size \(n\), there is a different circuit \(C_n\).

    \(\P \subsetneq \Ppoly\); in fact, \(\Ppoly\) even contains undecidable problems (for example, a unary encoding of the halting problem).

  • We use error reduction to make sure that the error is at most \(0.1 \cdot 2^{-n}\).

    Suppose \(A \in \BPP\). There is a verifier \(V(x,y)\) such that for all \(x\)

    \[\begin{equation*} \Pr_{y} \bigl( V(x,y) \ne A(x) \bigr) \le 0.1 \cdot 2^{-\abs{x}}. \end{equation*} \]

    By the union bound, for each input length \(n\) there is a single string \(y^*\) such that \(V(x, y^*) = A(x)\) for all \(x\) of length \(n\).

    Therefore, hardwiring \(y^*\), there is a circuit of at most \(\poly(n)\) gates for problem \(A\) on inputs of length \(n\).

  • This proves that \(\BPP \subseteq \Ppoly\).

  • Yet, it does not prove \(\BPP = \P\) as we don't know how to compute \(y^*\) efficiently.

P = BPP if P = NP

  • Sipser–Gács Theorem

    Two major conjectures in complexity theory:

    1. \(\P = \NP\) is false.
    2. \(\P = \BPP\) is true.

    We will prove \(\P = \NP\) implies \(\P = \BPP\), so at least one of them is true!

  • Why is it non-trivial?

    \(\P = \NP\) would enable us to guess a good solution.

    But even if \(F(x) = 0\), there may still be a random string \(r\) such that \(V(x,r) = 1\).

  • The proof follows the “quantifier elimination” technique assuming \(\P = \NP\).

    Write \(F \in \BPP\) as a formula \(\forall u \exists v P(u, v)\).

    1. We perform error reduction so that if \(F(x) = 0\), the set \(S\) of random strings making \(V\) accept is tiny; if \(F(x)=1\), the set \(S\) is very large.

    2. We consider shifts of \(S\), \(S \oplus s\).

      In the “no” case, no polynomially many shifts \(s_1, s_2, \ldots, s_k\) can cover all of \(\{0,1\}^m\).

      On the other hand, in the “yes” case, such a polynomial number of shifts \(s_j\) does exist.

    Now, notice that the covering can be written as a quantified formula: for all \(y\), there exist \(s\in S\) and \(i\), such that \(y = s \oplus s_i\).

  • The hard part is to prove the statement for the “yes” case.

    Let \(S \subseteq \{0,1\}^m\) be a set of size at least \(2^{m-1}\). Then there exist \(s_1, s_2, \ldots, s_{2m}\) such that the shifts \(S \oplus s_j\) cover all of \(\{0,1\}^m\).

    We use the probabilistic method, and choose \(s_j\)'s randomly.

    For a point \(z \in \{0,1\}^m\), Let \(\mathrm{BAD}_z\) be the event that \(z\) is not covered by the randomly chosen shifted sets.

    We want to show \(\Pr(\bigcup_z \mathrm{BAD}_z) < 1\), which follows from the union bound if \(\Pr(\mathrm{BAD}_z) < 2^{-m}\). It implies that, over all the possible choices of \(s_j\)'s, at least one choice covers the whole set \(\{0,1\}^m\).

    Let \(\mathrm{BAD}^i_z\) be the event that \(z\) is not in \(S \oplus s_i\).

    \[\begin{equation*} \Pr\bigl( \mathrm{BAD}_z \bigr ) = \Pr \biggl( \bigcap_{i=1}^{2m} \mathrm{BAD}^i_z \biggr) \le \frac{1}{2^{2m}} < \frac{1}{2^m}, \end{equation*} \]

    where the first inequality uses the independence of the events \(\mathrm{BAD}^i_z\) for different \(i\) together with \(\Pr(\mathrm{BAD}^i_z) \le 1/2\), which holds since \(\abs{S} \ge 2^{m-1}\).

Hardness versus Randomness

(Impagliazzo-Wigderson). Suppose there is a problem \(L\) in \(\class{E} = \TIME(2^{O(n)})\) and a fixed \(\delta > 0\) such that, for all sufficiently large \(n\), the restriction \(L_n\) is a Boolean function with circuit complexity at least \(2^{\delta n}\). Then \(\P = \BPP\).

Relation With P, NP, EXP

  • Simple fact: If an \(\NP\)-hard problem \(F\) is in \(\BPP\), then \(\NP\) is contained in \(\BPP\).

  • We know \(\P \ne \EXP\) by the time hierarchy theorem.

  • We know \(\BPP \subseteq \EXP\).

  • We do not know if \(\BPP\) is also different from \(\EXP\). We do not even know if \(\BPP\) is different from \(\NEXP\)!

  • Possibilities

    1. Expected: \(\P = \BPP \subsetneq \NP \subseteq \EXP\).
    2. Extreme \(\BPP\): \(\P \subsetneq \NP \subseteq \BPP = \EXP\).
    3. Extreme \(\P\): \(\P = \NP = \BPP \subsetneq \EXP\).