Class Venue and Time: CSIC 3120, 2-3:15PM Tue, Thu

The following is an approximate list of topics covered in
CMSC858L, Fall 2015.

This is an approximate summary, but gives an idea of the primary issues
covered. Please let Aravind know if there are any major omissions or
errors.

**Lecture 1, Sep. 1, 2015:**

- Admin. details, the most important being to join Piazza, to email David Harris about your group (or lack thereof), and to keep the class very participatory.
- The power of ML in practice: examples including FaceBook M, discovering natural laws, and predictive medicine.
- The future of jobs?
- Memorization vs. generalization; spam detection as a detailed example from Avrim Blum's slides (and an initial peek at the power of halfspaces).
- Three lab experiments on rats and pigeons that illustrate inductive inference, cases with useless generalization, and the power of prior hypotheses (inductive bias).
- Basic definitions, the consistency model, and simple algorithms in this model from logic and geometry.
- The weaknesses of the consistency model.
**Required Reading:**Samuel Ieong's probability notes, Schapire's scribed notes #1 and #2.**Additional Reading:**Avrim Blum's Lecture 1 from Spring 2012.

- More on the consistency model:
- Two weaknesses: the lack of generalizability, and no crisp notion of convergence to the target concept c.
- Note: a consistent hypothesis may not match the target (e.g., our simple algorithm for monotone conjunctions with one training example).
- Nontrivial results are still possible in the consistency model: decision trees vs. decision lists (PARITY vs. AND/OR -- a very significant gain in representability), and a provable algorithm for learning decision lists.

- PAC learning: definition, history, and basic motivation (more motivation to follow in the next class).
- Example with X = {0,1}^7, C = H = {monotone conjunctions}
UNION {the always-zero function}, c = conjunction of two terms, and with the
distribution D being the uniform distribution on vectors with exactly four
ones:
- err_D for some example hypotheses;
- what our simple algorithm guarantees when given one training example from D.

**Required Reading:**Schapire's scribed notes #2 and #3.**Additional Reading:**Avrim Blum's Lecture 1 from Spring 2012.**A question to ponder, thanks to student Karthik A. Sankararaman:**Can the O(n m^2) runtime of the decision-list algorithm be improved upon?

- More on the last item from the previous class: what (epsilon, delta)-guarantees does our algorithm for monotone conjunctions offer in the case considered in that example?
- The PAC definition and probability:
- Why a probabilistic model? Interpretations of probability.
- Recap independence; are independent samples a reasonable model? Could data acquisition from a single source lead to "stickiness" in the samples? What about asynchronously-arriving examples from heterogeneous sources? Finally, the notion of pragmatism in probabilistic modeling via independence assumptions.
- Refreshing our understanding of basic probability: show that the 2^n equations of independence given by the monotone conjunctions imply independence for all 3^n conjunctions. (Brief mention of the general notion of Fourier transform here for arbitrary (non-product) distributions -- a very useful tool in learning theory.)

- The union bound and sample complexity for learning intervals (our algorithm will not lead to any false positives).
- Why does the natural logarithm show up so "naturally"? A relevant quote by
Timothy Gowers
from this blog post: "I then drew some pictures for different a^x, pointing out that
some of them crossed the y-axis with a slope less than 1 and others with a
slope greater than 1 and that e^x was the one where it actually
equalled 1. He asked me
*why*the slope was exactly 1 for e^x, which was a good opportunity for me to try to explain that that was getting things the wrong way round, and that e was chosen precisely to make that work." - Brief discussion of Theorem 1 from Schapire's notes #3.
**Required Reading:**Schapire's scribed notes #3.

- Sample complexity for intervals: some subtleties in the proof, and why the union bound does not cause any major loss in the sample complexity
- Theorem 1 from Schapire's notes
#3:
- Proof and the computational lens on Occam's Razor
- What happens in the cases of all possible Boolean functions, monotone conjunctions, and decision lists?

- The VC-dimension and sample complexity for infinite hypothesis families:
- The range of possible behaviors for simple geometric families
- The VC-dimension and some examples
- The Perles-Sauer-Shelah Lemma without proof (see Schapire's notes #6 for a proof), and statement of the main result to be proved on learnability under bounded VC-dimension
- The VC-dimension of halfspaces in R^d is d+1: proof of the lower bound

**Required Reading:**Schapire's scribed notes #3 and #4.**Additional Reading:**See Gil Kalai's blog post for a short proof due to Noga Alon of the Perles-Sauer-Shelah Lemma (which in fact shows that if the number of hypotheses is large, then*many*sets are shattered), along with the story of Perles' re-discovery mentioned in class.

- Full proof for VC-dimension of halfspaces in R^d, as well as for halfspaces that pass through the origin
- Recap of the main theorems we aim to prove for PAC-learning under bounded VC-dimension
- Some basic probability, convergence results, and an introduction to large-deviation bounds (very useful tools in machine
learning):
- The expectation and its linearity; illustration using the standard "number of fixed points of a random permutation", the Poisson limit of this random variable, and the general area of Poisson approximation
- The Guassian distribution, useful bounds on the Guassian tail, the Central Limit Theorem, and why one needs finitary results such as the Chernoff bound (that we will see later)
- Why large-deviations
*heuristically*suggest that the generalization error of our algorithm should be small

**Required Reading:**Schapire's scribed notes #5 and #6.

- Conditional probability (key for the proof to come, and often misunderstood):
- Basic example, conditional probability as renormalization
- Two uses of conditional probability (both to be seen in our proof for learnability today): (i) conditioning on a high-probability event cannot change probabilities much -- and its contrapositive, and (ii) the introduction of an auxiliary random variable X -- conditioning on any value of which, one can uniformly bound the conditional probability of a desired Y.

- Proof of learnability under bounded VC-dimension
**Required Reading:**Schapire's scribed notes #5 and #6.

- Lower bound of Omega(d) for PAC-learning, slightly improving two of the constant terms from Schapire's notes #7; introducing Markov's inequality in the process.
- Situations in which consistent learning is not possible/feasible: e.g., when the "label" is probabilistic.
- The Bayesian optimal classifier and general maximal-likelihood approaches; e.g., spatial variation in search-engine queries.
- Relation between empirical and generalization errors; generalization error of algorithms that (approximately) minimize empirical error.
**Required Reading:**Schapire's scribed notes #7.

- Recap of our basic question:
- If we try to do "quite well" on our empirical data using some h from our family H of hypotheses, how well can we do if we assume the number m of training examples is large enough?
- Will not the natural "maximum likelihood" hypothesis have an empirical error of at most 1/2? No: such a h may not lie in H
- Why should H be somewhat limited? A discussion of ML in medicine: why our hypotheses need
to be
*intelligible*to practitioners (e.g., physicians)

- The main theorem on sample complexity that we will prove.
- K-L-divergence-based proof of Chernoff-Hoeffding bound in this simplified regime (sum of i.i.d. Bernoullis) due to Sanjoy Dasgupta; finishing our full proof.
- More on the K-L divergence: how do sample paths look with high probability, conditional on the (very low-likelihood) event of a large deviation?
**Required Reading:**Schapire's scribed notes #8.

- Convexity and Jensen's inequality.
- Brief discussion of sample-paths analysis and the K-L divergence; the K-L divergence as modeling the distance from the ground truth to the model.
- Proof of the Chernoff-Hoeffding bound.
- Martingales, super- and sub- martingales; proof of McDiarmid's bounded-differences tail bound.

- Jensen's inequality and the supremum.
- Motivation for Rademacher complexity.
- Warmup: concentration of R_S(F) around its mean R_m(F).
- Martingale tail bounds and the work of Bollobas, McDiarmid, Milman, and
Shamir-Spencer: the Shamir-Spencer remark that not knowing the mean easily
(but getting the tail bound easily) is the
*strength and weakness*of the method. - Proof of the main theorem for Rademacher complexity, and how it will connect to our main goal (showing that the generalization error is not much more than empirical error with high probability).
**Required Reading:**Schapire's scribed notes #9 and #10.

- How we will use the main theorem on Rademacher complexity? First consider finite hypothesis families H and then proceed to finite VC-dimension
- Proof approach for finite H:
- The basic problem: how to upper-bound E[max{X_1, X_2, ..., X_N}].
- Initial approaches: the first-, second- and higher moments.
- A more robust approach: bound by min_{t > 0} [ln(\sum_i e^{t X_i}) / ln t], and what this gives for us.
- Exercise: use the above recipe for bounding the expected max-load when throwing n balls at random into n bins: a fundamental allocation setup that is very useful to understand.

- Complete the generalization-error proof for finite hypothesis families H and for finite VC-dimension.
- Boosting: the AdaBoost algorithm and starting its analysis.
**Required Reading:**Schapire's scribed notes #10 and #11.

- Finish the proof for boosting; a key idea in Step 2 of the proof and its connection to the standard Chernoff-bound proof.
- Margin theory for boosting, and how boosting increases the margin on the training examples.
- A non-technical but vital issue for all of us: what is the future of jobs for humans,
when machines can do a great deal of what humans do today?
- A relevant article by Brynjolfsson and McAfee, on why this is not just the "lump of labor" fallacy.
- Larry Page on this issue, excerpted from
here:
"
*... Everyone I've asked -- I've asked a lot of people about this. Maybe not you guys, but most people, if I ask them, "Would you like an extra week of vacation?" They raise their hands, 100 percent of the people. "Two weeks [of vacation], or a four-day work week?" Everyone will raise their hand. Most people like working, but they'd also like to have more time with their family or to do their own interests. So that would be one way to deal with the problem, is if you had a coordinated way to just reduce the work week. And then, if you had slightly less employment, you can adjust and people will still have jobs. ...*"

- The
*Multiplicative Weights Update*method: a powerful paradigm that underlies aspects of Boosting; the weighted majority algorithm from Section 1.1 of the Arora-Hazan-Kale survey. **Required Reading:**Schapire's scribed notes #11 and #12; up to Corollary 2.2 from Arora-Hazan-Kale.

- More on the first two algorithms from Arora-Hazan-Kale, and some general context for the MWU framework.
- Our goal: showing that when Boosting reduces the margin on the training examples, it also achieves small generalization error.
- A proof of the Rademacher bound under Lipschitz-bounded compositions from Kakade-Tewari.
- A nice closure property of the Rademacher complexity: moving to the convex hull of H does not change it (proof to be shown in next class).
**Required Reading:**Schapire's scribed notes #12, proof of Rademacher bound under Lipschitz-bounded compositions from Kakade-Tewari (Lemma 1.1), and up to Corollary 2.2 from Arora-Hazan-Kale.**Additional Reading:**Gravin-Peres-Sivan's work on optimal algorithms for prediction with expert advice.

- Generalization theory for AdaBoost, from the end of Schapire #12.
- Rademacher composition and Rademacher bounds for linear predictors (Lemmas 1.1, 2.1 and 2.2) from Kakade-Tewari.
- Material on SVM up to, but including, soft margins from Schapire #13.
**Required Reading:**Schapire's scribed notes #12 and #13, as well as Lemmas 1.1, 2.1 and 2.2 from Kakade-Tewari.

- Quick review of Rademacher complexity for linear predictors.
- Online learning and the halving algorithm from Schapire #14.
- Schapire #15: learning with expert advice and connection to PAC learning, lower bound based on the VC-dimension, the weighted majority and randomized weighted majority algorithms and their analyses (this was already covered when we studied MWU).
- Discussion of prediction markets in practice.
**Required Reading:**Schapire #14 and #15.**Additional Reading:**Hanneke's recent breakthrough on optimal sample complexity for PAC-learning under bounded VC-dimension.

- Mid-term review.
- Discussion and analysis of Perceptron and Winnow from Schapire #16.
- Discussion of the practical uses of the two algorithms covered, as well as of AdaBoost.
**Required Reading:**Schapire #16.

- Recap of Winnow.
- Balanced Winnow.
- Recap of basic optimization:
- Why gradient-descent.
- Newton's method (illustrating two of the key concepts we will see later -- small step-size (η that we will use later) and first-order analysis to see what improvement we get -- albeit in the one-dimensional case); illustration for finding square-roots.

- Further material from Schapire
#17:
- Regression, quadratic loss, and connecting quadratic loss to the expected value of (h(x) - p(x))^2.
- The advantage of convexity, and what the gradient says for high-dimensional ("offline") linear regression.
- Combining regression with online learning; the Widrow-Hoff algorithm.

**Required Reading:**Schapire #17.

- Brief further discussion of first-order methods and stochastic gradient descent.
- Mid-term discussion.
- Finish Schapire #18: complete Widrow-Hoff analysis, families of online algorithms, and converting online algorithms to our usual batch-learning setting.
- How to choose parameters such as η, T, and m.
**Required Reading:**Schapire #18.

- Discussion of two mid-term problems.
- Motivations for learning in the presence of noise.
- Learning from noisy data, intro. to Statistical Query Model: from Avrim Blum's slides.
**Required Reading:**Avrim Blum's Lecture 15 Slides from Spring 2012.

- Recap of learning in the presence of (random) classification noise.
- The Statistical Query Model, and why learnability in the SQ model implies learnability under noise.
**Required Reading:**Avrim Blum's Lecture 15 Slides from Spring 2012.

- Most of Avrim Blum's class notes on characterizing SQ-learnability under the SQ dimension (until Parseval's identity).
- A gentle introduction to Fourier analysis, and the fact that the parity functions form a useful basis under the uniform distribution on the Boolean cube.
**Required Reading:**Avrim Blum's class notes: class notes on the SQ-dimension and its consequences.**Additional Reading:**Take a look at Ryan O'Donnell's wonderful book on*Analysis of Boolean Functions*.

- Finish the proof of non-learnability in the SQ model when the SQ-dimension is large, from Avrim Blum's class notes.
- Discriminative vs. generative approaches.
- Probability density estimation and maximum-likelihood formulation; motivation for "log loss" (more of this in further classes).
- Inferring the distribution given features of the data as constraints: maximum-entropy formulation.
**Required Reading:**Avrim Blum's class notes, and**up to (7)**in Schapire #19.

- Online learning with log loss.
- Bayes' algorithm, motivation, and analysis.
- Switching experts: motivation, formulations, start analyses of meta-expert algorithms.
**Required Reading:**Schapire #21 and #22.

- Bayes' algorithm recap, switching experts, simple-to-describe (but time-intensive) algorithm for
simple experts;
*students to read the improved algorithm from Schapire #22.* - Portfolio selection: high-level discussion of iterative loss framework and constant rebalanced portfolios.
*Students to read Schapire #23 in detail*.- Two-person zero-sum games, linear programming, minmax theorem, and start proof of minmax theorem via online learning.
**Required Reading:**Schapire #22, #23 and #24.**Additional Discussion and Optional Reading:**- The exploration-exploitation tradeoff, and online learning with limited feedback: Alon, Cesa-Bianchi, Dekel, and Koren.
- Beyond linear programming -- the revolution in fast iterative algorithms for convex programming: see the overview from Section 1.1 of Vishnoi's lecture notes.
- Winning strategies and strategy-stealing for some intriguing games: the case of
*Chomp*.

- Complete the proof of the LP-minmax theorem via online learning.
- A key bonus of the above proof: how it can be turned into a
*sublinear*algorithm via random sampling, for this class of linear programs. - An aside about the random-sampling proof: how Chernoff does better than the use of Pinsker's inequality [D(p||q) >= 2(p - q)^2] in the high-relative-error regime.
- How the above discussion connects to keeping the number of mistakes small in learning.
- Fairness issues: Suresh Venkat's blog post on algorithmic fairness; also, can algorithms in fact reverse bias in human decisions (see the Bertrand-Mullainathan paper for an example of such bias)?
**Required Reading:**Schapire #24.**Additional Discussion and Optional Reading:**Further iterative algorithms: the Straszak-Vishnoi work on natural algorithms for linear programming.

- The issue of datasets not well-separable linearly, and a quick discussion of the frontier of research in finding good low-dimensional manifolds that classify high-dimensional data.
- From Section 6.5 of Blum-Hopcroft-Kannan: margins, hinge loss, SVM, and how Perceptron can be used instead of convex programming to guarantee a small number of mistakes in this setting.
- Start kernels and nonlinear separators from Section 6.6 of Blum-Hopcroft-Kannan.
**Required Reading:**Sections 6.5 and 6.6 of Blum-Hopcroft-Kannan.

- Kernels: closure properties, examples, and the intuitive "neighborliness" property we aim for in kernels.
- An interesting exercise: what is the mapping φ (with infinite-dimensional range) for the Gaussian kernel?
- Stochastic gradient descent, and how Perceptron is a special case.
- Further brief discussion of whether (machine-learning) algorithms can help mitigate human bias in decision-making.
**Required Reading:**Sections 6.3.2 and 6.6 of Blum-Hopcroft-Kannan.

- Class recap.