INTRODUCTION TO MACHINE LEARNING
1 Preliminaries 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 What is Machine Learning? . . . . . . . . . . . . . . . . . 1
1.1.2 Wellsprings of Machine Learning . . . . . . . . . . . . . . 3
1.1.3 Varieties of Machine Learning . . . . . . . . . . . . . . . . 4
1.2 Learning Input-Output Functions . . . . . . . . . . . . . . . . . . 5
1.2.1 Types of Learning . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Input Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.3 Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Training Regimes . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.5 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . 9
1.3 Learning Requires Bias . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Sample Applications . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 13
2 Boolean Functions 15
2.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Boolean Algebra . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Diagrammatic Representations . . . . . . . . . . . . . . . 16
2.2 Classes of Boolean Functions . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Terms and Clauses . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 DNF Functions . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 CNF Functions . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.4 Decision Lists . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.5 Symmetric and Voting Functions . . . . . . . . . . . . . . 23
2.2.6 Linearly Separable Functions . . . . . . . . . . . . . . . . 23
2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 25
3 Using Version Spaces for Learning 27
3.1 Version Spaces and Mistake Bounds . . . . . . . . . . . . . . . . 27
3.2 Version Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Learning as Search of a Version Space . . . . . . . . . . . . . . . 32
3.4 The Candidate Elimination Method . . . . . . . . . . . . . . . . 32
3.5 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 34
4 Neural Networks 35
4.1 Threshold Logic Units . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Definitions and Geometry . . . . . . . . . . . . . . . . . . 35
4.1.2 Special Cases of Linearly Separable Functions . . . . . . . 37
4.1.3 Error-Correction Training of a TLU . . . . . . . . . . . . 38
4.1.4 Weight Space . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.5 The Widrow-Hoff Procedure . . . . . . . . . . . . . . . . . 42
4.1.6 Training a TLU on Non-Linearly-Separable Training Sets 44
4.2 Linear Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Networks of TLUs . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Motivation and Examples . . . . . . . . . . . . . . . . . . 46
4.3.2 Madalines . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.3 Piecewise Linear Machines . . . . . . . . . . . . . . . . . . 50
4.3.4 Cascade Networks . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Training Feedforward Networks by Backpropagation . . . . . . . 52
4.4.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.2 The Backpropagation Method . . . . . . . . . . . . . . . . 53
4.4.3 Computing Weight Changes in the Final Layer . . . . . . 56
4.4.4 Computing Changes to the Weights in Intermediate Layers 58
4.4.5 Variations on Backprop . . . . . . . . . . . . . . . . . . . 59
4.4.6 An Application: Steering a Van . . . . . . . . . . . . . . . 60
4.5 Synergies Between Neural Network and Knowledge-Based Methods 61
4.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 61
5 Statistical Learning 63
5.1 Using Statistical Decision Theory . . . . . . . . . . . . . . . . . . 63
5.1.1 Background and General Method . . . . . . . . . . . . . . 63
5.1.2 Gaussian (or Normal) Distributions . . . . . . . . . . . . 65
5.1.3 Conditionally Independent Binary Components . . . . . . 68
5.2 Learning Belief Networks . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Nearest-Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . 70
5.4 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 72
6 Decision Trees 73
6.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2 Supervised Learning of Univariate Decision Trees . . . . . . . . . 74
6.2.1 Selecting the Type of Test . . . . . . . . . . . . . . . . . . 75
6.2.2 Using Uncertainty Reduction to Select Tests . . . . . . . 75
6.2.3 Non-Binary Attributes . . . . . . . . . . . . . . . . . . . . 79
6.3 Networks Equivalent to Decision Trees . . . . . . . . . . . . . . . 79
6.4 Overfitting and Evaluation . . . . . . . . . . . . . . . . . . . . . 80
6.4.1 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.4.2 Validation Methods . . . . . . . . . . . . . . . . . . . . . 81
6.4.3 Avoiding Overfitting in Decision Trees . . . . . . . . . . . 82
6.4.4 Minimum-Description Length Methods . . . . . . . . . . . 83
6.4.5 Noise in Data . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.5 The Problem of Replicated Subtrees . . . . . . . . . . . . . . . . 84
6.6 The Problem of Missing Attributes . . . . . . . . . . . . . . . . . 86
6.7 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 87
7 Inductive Logic Programming 89
7.1 Notation and Definitions . . . . . . . . . . . . . . . . . . . . . . . 90
7.2 A Generic ILP Algorithm . . . . . . . . . . . . . . . . . . . . . . 91
7.3 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.4 Inducing Recursive Programs . . . . . . . . . . . . . . . . . . . . 98
7.5 Choosing Literals to Add . . . . . . . . . . . . . . . . . . . . . . 100
7.6 Relationships Between ILP and Decision Tree Induction . . . . . 101
7.7 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 104
8 Computational Learning Theory 107
8.1 Notation and Assumptions for PAC Learning Theory . . . . . . . 107
8.2 PAC Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.2.1 The Fundamental Theorem . . . . . . . . . . . . . . . . . 109
8.2.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.2.3 Some Properly PAC-Learnable Classes . . . . . . . . . . . 112
8.3 The Vapnik-Chervonenkis Dimension . . . . . . . . . . . . . . . . 113
8.3.1 Linear Dichotomies . . . . . . . . . . . . . . . . . . . . . . 113
8.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.3.3 A More General Capacity Result . . . . . . . . . . . . . . 116
8.3.4 Some Facts and Speculations About the VC Dimension . 117
8.4 VC Dimension and PAC Learning . . . . . . . . . . . . . . . . . 118
8.5 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 118
9 Unsupervised Learning 119
9.1 What is Unsupervised Learning? . . . . . . . . . . . . . . . . . . 119
9.2 Clustering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.2.1 A Method Based on Euclidean Distance . . . . . . . . . . 120
9.2.2 A Method Based on Probabilities . . . . . . . . . . . . . . 124
9.3 Hierarchical Clustering Methods . . . . . . . . . . . . . . . . . . 125
9.3.1 A Method Based on Euclidean Distance . . . . . . . . . . 125
9.3.2 A Method Based on Probabilities . . . . . . . . . . . . . . 126
9.4 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 130
10 Temporal-Difference Learning 131
10.1 Temporal Patterns and Prediction Problems . . . . . . . . . . . . 131
10.2 Supervised and Temporal-Difference Methods . . . . . . . . . . . 131
10.3 Incremental Computation of the (∆W)i . . . . . . . . . . . . . . 134
10.4 An Experiment with TD Methods . . . . . . . . . . . . . . . . . 135
10.5 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . 138
10.6 Intra-Sequence Weight Updating . . . . . . . . . . . . . . . . . . 138
10.7 An Example Application: TD-gammon . . . . . . . . . . . . . . . 140
10.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 141
11 Delayed-Reinforcement Learning 143
11.1 The General Problem . . . . . . . . . . . . . . . . . . . . . . . . 143
11.2 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.3 Temporal Discounting and Optimal Policies . . . . . . . . . . . . 145
11.4 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.5 Discussion, Limitations, and Extensions of Q-Learning . . . . . . 150
11.5.1 An Illustrative Example . . . . . . . . . . . . . . . . . . . 150
11.5.2 Using Random Actions . . . . . . . . . . . . . . . . . . . 152
11.5.3 Generalizing Over Inputs . . . . . . . . . . . . . . . . . . 153
11.5.4 Partially Observable States . . . . . . . . . . . . . . . . . 154
11.5.5 Scaling Problems . . . . . . . . . . . . . . . . . . . . . . . 154
11.6 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 155
12 Explanation-Based Learning 157
12.1 Deductive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 157
12.2 Domain Theories . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
12.3 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.4 Evaluable Predicates . . . . . . . . . . . . . . . . . . . . . . . . . 162
12.5 More General Proofs . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.6 Utility of EBL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.7 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.7.1 Macro-Operators in Planning . . . . . . . . . . . . . . . . 164
12.7.2 Learning Search Control Knowledge . . . . . . . . . . . . 167
12.8 Bibliographical and Historical Remarks . . . . . . . . . . . . . . 168