Friday, 1 October 2010
Introduction to Machine Learning
Alex Smola and S.V.N. Vishwanathan
Yahoo! Labs, Santa Clara; Departments of Statistics and Computer Science, Purdue University; and College of Engineering and Computer Science, Australian National University
Contents

Preface

1 Introduction
  1.1 A Taste of Machine Learning (1.1.1 Applications; 1.1.2 Data; 1.1.3 Problems)
  1.2 Probability Theory (1.2.1 Random Variables; 1.2.2 Distributions; 1.2.3 Mean and Variance; 1.2.4 Marginalization, Independence, Conditioning, and Bayes Rule)
  1.3 Basic Algorithms (1.3.1 Naive Bayes; 1.3.2 Nearest Neighbor Estimators; 1.3.3 A Simple Classifier; 1.3.4 Perceptron; 1.3.5 K-Means)
2 Density Estimation
  2.1 Limit Theorems (2.1.1 Fundamental Laws; 2.1.2 The Characteristic Function; 2.1.3 Tail Bounds; 2.1.4 An Example)
  2.2 Parzen Windows (2.2.1 Discrete Density Estimation; 2.2.2 Smoothing Kernel; 2.2.3 Parameter Estimation; 2.2.4 Silverman's Rule; 2.2.5 Watson-Nadaraya Estimator)
  2.3 Exponential Families (2.3.1 Basics; 2.3.2 Examples)
  2.4 Estimation (2.4.1 Maximum Likelihood Estimation; 2.4.2 Bias, Variance and Consistency; 2.4.3 A Bayesian Approach; 2.4.4 An Example)
  2.5 Sampling (2.5.1 Inverse Transformation; 2.5.2 Rejection Sampler)
3 Optimization
  3.1 Preliminaries (3.1.1 Convex Sets; 3.1.2 Convex Functions; 3.1.3 Subgradients; 3.1.4 Strongly Convex Functions; 3.1.5 Convex Functions with Lipschitz Continuous Gradient; 3.1.6 Fenchel Duality; 3.1.7 Bregman Divergence)
  3.2 Unconstrained Smooth Convex Minimization (3.2.1 Minimizing a One-Dimensional Convex Function; 3.2.2 Coordinate Descent; 3.2.3 Gradient Descent; 3.2.4 Mirror Descent; 3.2.5 Conjugate Gradient; 3.2.6 Higher Order Methods; 3.2.7 Bundle Methods)
  3.3 Constrained Optimization (3.3.1 Projection Based Methods; 3.3.2 Lagrange Duality; 3.3.3 Linear and Quadratic Programs)
  3.4 Stochastic Optimization (3.4.1 Stochastic Gradient Descent)
  3.5 Nonconvex Optimization (3.5.1 Concave-Convex Procedure)
  3.6 Some Practical Advice
4 Online Learning and Boosting
  4.1 Halving Algorithm
  4.2 Weighted Majority
5 Conditional Densities
  5.1 Logistic Regression
  5.2 Regression (5.2.1 Conditionally Normal Models; 5.2.2 Posterior Distribution; 5.2.3 Heteroscedastic Estimation)
  5.3 Multiclass Classification (5.3.1 Conditionally Multinomial Models)
  5.4 What is a CRF? (5.4.1 Linear Chain CRFs; 5.4.2 Higher Order CRFs; 5.4.3 Kernelized CRFs)
  5.5 Optimization Strategies (5.5.1 Getting Started; 5.5.2 Optimization Algorithms; 5.5.3 Handling Higher Order CRFs)
  5.6 Hidden Markov Models
  5.7 Further Reading (5.7.1 Optimization)
6 Kernels and Function Spaces
  6.1 The Basics (6.1.1 Examples)
  6.2 Kernels (6.2.1 Feature Maps; 6.2.2 The Kernel Trick; 6.2.3 Examples of Kernels)
  6.3 Algorithms (6.3.1 Kernel Perceptron; 6.3.2 Trivial Classifier; 6.3.3 Kernel Principal Component Analysis)
  6.4 Reproducing Kernel Hilbert Spaces (6.4.1 Hilbert Spaces; 6.4.2 Theoretical Properties; 6.4.3 Regularization)
  6.5 Banach Spaces (6.5.1 Properties; 6.5.2 Norms and Convex Sets)
7 Linear Models
  7.1 Support Vector Classification (7.1.1 A Regularized Risk Minimization Viewpoint; 7.1.2 An Exponential Family Interpretation; 7.1.3 Specialized Algorithms for Training SVMs)
  7.2 Extensions (7.2.1 The ν Trick; 7.2.2 Squared Hinge Loss; 7.2.3 Ramp Loss)
  7.3 Support Vector Regression (7.3.1 Incorporating General Loss Functions; 7.3.2 Incorporating the ν Trick)
  7.4 Novelty Detection
  7.5 Margins and Probability
  7.6 Beyond Binary Classification (7.6.1 Multiclass Classification; 7.6.2 Multilabel Classification; 7.6.3 Ordinal Regression and Ranking)
  7.7 Large Margin Classifiers with Structure (7.7.1 Margin; 7.7.2 Penalized Margin; 7.7.3 Nonconvex Losses)
  7.8 Applications (7.8.1 Sequence Annotation; 7.8.2 Matching; 7.8.3 Ranking; 7.8.4 Shortest Path Planning; 7.8.5 Image Annotation; 7.8.6 Contingency Table Loss)
  7.9 Optimization (7.9.1 Column Generation; 7.9.2 Bundle Methods; 7.9.3 Overrelaxation in the Dual)
  7.10 CRFs vs Structured Large Margin Models (7.10.1 Loss Function; 7.10.2 Dual Connections; 7.10.3 Optimization)

Appendix 1 Linear Algebra and Functional Analysis
Appendix 2 Conjugate Distributions
Appendix 3 Loss Functions
Bibliography
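The "Basic Algorithms" chapter listed above includes the perceptron (Section 1.3.4). As a rough flavour of that kind of material, here is a minimal sketch of the classical mistake-driven perceptron update on an invented toy dataset; it is written independently of the book and its notation, so treat it as an illustration rather than the book's own algorithm.

```python
# A minimal perceptron sketch (illustrating the classic mistake-driven update
# in general, not the book's specific presentation; the toy points are invented).
import random

def train_perceptron(data, epochs=20, lr=1.0):
    """data: list of (features, label) pairs with label in {-1, +1}."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:                      # misclassified: update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Toy usage: four linearly separable 2-D points.
points = [([2.0, 1.0], +1), ([1.5, 2.0], +1),
          ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w, b = train_perceptron(points)
print("weights:", w, "bias:", b)
```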
Monday, 1 June 2009
Natural Language Processing with Python
https://www.datascienceassn.org/sites/default/files/Natural%20Language%20Processing%20with%20Python.pdf
Friday, 31 December 2004
Natural Language Processing Lecture Notes
Natural Language Processing
Overview

NLP is a large and multidisciplinary field, so this course can only provide a very general introduction. The first lecture is designed to give an overview of the main subareas and a very brief idea of the main applications and the methodologies which have been employed. The history of NLP is briefly discussed as a way of putting this into perspective. The next six lectures describe some of the main subareas in more detail. The organisation is roughly based on increasing ‘depth’ of processing, starting with relatively surface-oriented techniques and progressing to the meaning of sentences and the meaning of utterances in context. Most lectures start by considering the subarea as a whole and then go on to describe one or more sample algorithms which tackle particular problems. The algorithms have been chosen because they are relatively straightforward to describe and because they illustrate a specific technique which has been shown to be useful; the idea is to exemplify an approach, not to give a detailed survey, which would be impossible in the time available. (Lecture 5 is a little different in that it concentrates on a data structure rather than an algorithm.) The final lecture brings the preceding material together to describe the state of the art in three sample applications.

Several themes run throughout the lectures. One is the connection to linguistics and the tension that sometimes exists between the predominant view in theoretical linguistics and the approaches adopted within NLP. A related theme is the distinction between knowledge-based and probabilistic approaches. Evaluation is discussed in the context of the different algorithms.

Because NLP is such a large area, many topics are not touched on at all in these lectures. Speech recognition and speech synthesis are almost totally ignored. Information retrieval and information extraction are the topic of a separate course given by Simone Teufel, for which this course is a prerequisite. Feedback on the handout, lists of typos, etc. would be greatly appreciated.
https://www.cl.cam.ac.uk/teaching/2002/NatLangProc/revised.pdf
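To make the knowledge-based versus probabilistic theme mentioned in the overview a little more concrete, here is a small illustrative sketch (not from the lecture notes; the word, rule, and tag counts are invented) contrasting a hand-written disambiguation rule with a simple corpus-frequency decision for the same task.

```python
# Illustrative sketch: the same part-of-speech disambiguation decision made by
# a knowledge-based (hand-written) rule and by a simple probabilistic model.
# The example word, rule, and counts below are invented for illustration only.
from collections import Counter

def rule_based_tag(prev_word, word):
    """Knowledge-based: a hand-crafted rule for tagging 'book'."""
    if word == "book" and prev_word in {"to", "will", "please"}:
        return "VERB"        # e.g. "to book a flight"
    return "NOUN"            # default reading, e.g. "read a book"

# Invented unigram tag counts standing in for an annotated corpus.
toy_tag_counts = {"book": Counter({"NOUN": 120, "VERB": 35})}

def unigram_tag(word):
    """Probabilistic: pick the tag most frequent for this word in the corpus."""
    counts = toy_tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "NOUN"

print(rule_based_tag("to", "book"))   # VERB -- the rule uses the local context
print(unigram_tag("book"))            # NOUN -- unigram counts ignore context
```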
Monday, 2 November 1998
INTRODUCTION TO MACHINE LEARNING
1 Preliminaries
  1.1 Introduction (1.1.1 What is Machine Learning?; 1.1.2 Wellsprings of Machine Learning; 1.1.3 Varieties of Machine Learning)
  1.2 Learning Input-Output Functions (1.2.1 Types of Learning; 1.2.2 Input Vectors; 1.2.3 Outputs; 1.2.4 Training Regimes; 1.2.5 Noise; 1.2.6 Performance Evaluation)
  1.3 Learning Requires Bias
  1.4 Sample Applications
  1.5 Sources
  1.6 Bibliographical and Historical Remarks
2 Boolean Functions
  2.1 Representation (2.1.1 Boolean Algebra; 2.1.2 Diagrammatic Representations)
  2.2 Classes of Boolean Functions (2.2.1 Terms and Clauses; 2.2.2 DNF Functions; 2.2.3 CNF Functions; 2.2.4 Decision Lists; 2.2.5 Symmetric and Voting Functions; 2.2.6 Linearly Separable Functions)
  2.3 Summary
  2.4 Bibliographical and Historical Remarks
3 Using Version Spaces for Learning
  3.1 Version Spaces and Mistake Bounds
  3.2 Version Graphs
  3.3 Learning as Search of a Version Space
  3.4 The Candidate Elimination Method
  3.5 Bibliographical and Historical Remarks
4 Neural Networks
  4.1 Threshold Logic Units (4.1.1 Definitions and Geometry; 4.1.2 Special Cases of Linearly Separable Functions; 4.1.3 Error-Correction Training of a TLU; 4.1.4 Weight Space; 4.1.5 The Widrow-Hoff Procedure; 4.1.6 Training a TLU on Non-Linearly-Separable Training Sets)
  4.2 Linear Machines
  4.3 Networks of TLUs (4.3.1 Motivation and Examples; 4.3.2 Madalines; 4.3.3 Piecewise Linear Machines; 4.3.4 Cascade Networks)
  4.4 Training Feedforward Networks by Backpropagation (4.4.1 Notation; 4.4.2 The Backpropagation Method; 4.4.3 Computing Weight Changes in the Final Layer; 4.4.4 Computing Changes to the Weights in Intermediate Layers; 4.4.5 Variations on Backprop; 4.4.6 An Application: Steering a Van)
  4.5 Synergies Between Neural Network and Knowledge-Based Methods
  4.6 Bibliographical and Historical Remarks
5 Statistical Learning
  5.1 Using Statistical Decision Theory (5.1.1 Background and General Method; 5.1.2 Gaussian (or Normal) Distributions; 5.1.3 Conditionally Independent Binary Components)
  5.2 Learning Belief Networks
  5.3 Nearest-Neighbor Methods
  5.4 Bibliographical and Historical Remarks
6 Decision Trees
  6.1 Definitions
  6.2 Supervised Learning of Univariate Decision Trees (6.2.1 Selecting the Type of Test; 6.2.2 Using Uncertainty Reduction to Select Tests; 6.2.3 Non-Binary Attributes)
  6.3 Networks Equivalent to Decision Trees
  6.4 Overfitting and Evaluation (6.4.1 Overfitting; 6.4.2 Validation Methods; 6.4.3 Avoiding Overfitting in Decision Trees; 6.4.4 Minimum-Description Length Methods; 6.4.5 Noise in Data)
  6.5 The Problem of Replicated Subtrees
  6.6 The Problem of Missing Attributes
  6.7 Comparisons
  6.8 Bibliographical and Historical Remarks
7 Inductive Logic Programming
  7.1 Notation and Definitions
  7.2 A Generic ILP Algorithm
  7.3 An Example
  7.4 Inducing Recursive Programs
  7.5 Choosing Literals to Add
  7.6 Relationships Between ILP and Decision Tree Induction
  7.7 Bibliographical and Historical Remarks
8 Computational Learning Theory
  8.1 Notation and Assumptions for PAC Learning Theory
  8.2 PAC Learning (8.2.1 The Fundamental Theorem; 8.2.2 Examples; 8.2.3 Some Properly PAC-Learnable Classes)
  8.3 The Vapnik-Chervonenkis Dimension (8.3.1 Linear Dichotomies; 8.3.2 Capacity; 8.3.3 A More General Capacity Result; 8.3.4 Some Facts and Speculations About the VC Dimension)
  8.4 VC Dimension and PAC Learning
  8.5 Bibliographical and Historical Remarks
9 Unsupervised Learning
  9.1 What is Unsupervised Learning?
  9.2 Clustering Methods (9.2.1 A Method Based on Euclidean Distance; 9.2.2 A Method Based on Probabilities)
  9.3 Hierarchical Clustering Methods (9.3.1 A Method Based on Euclidean Distance; 9.3.2 A Method Based on Probabilities)
  9.4 Bibliographical and Historical Remarks
10 Temporal-Difference Learning
  10.1 Temporal Patterns and Prediction Problems
  10.2 Supervised and Temporal-Difference Methods
  10.3 Incremental Computation of the (ΔW)_i
  10.4 An Experiment with TD Methods
  10.5 Theoretical Results
  10.6 Intra-Sequence Weight Updating
  10.7 An Example Application: TD-gammon
  10.8 Bibliographical and Historical Remarks
11 Delayed-Reinforcement Learning
  11.1 The General Problem
  11.2 An Example
  11.3 Temporal Discounting and Optimal Policies
  11.4 Q-Learning
  11.5 Discussion, Limitations, and Extensions of Q-Learning (11.5.1 An Illustrative Example; 11.5.2 Using Random Actions; 11.5.3 Generalizing Over Inputs; 11.5.4 Partially Observable States; 11.5.5 Scaling Problems)
  11.6 Bibliographical and Historical Remarks
12 Explanation-Based Learning
  12.1 Deductive Learning
  12.2 Domain Theories
  12.3 An Example
  12.4 Evaluable Predicates
  12.5 More General Proofs
  12.6 Utility of EBL
  12.7 Applications (12.7.1 Macro-Operators in Planning; 12.7.2 Learning Search Control Knowledge)
  12.8 Bibliographical and Historical Remarks
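Chapter 11 above covers Q-learning (Section 11.4). The following is a minimal tabular Q-learning sketch on a tiny invented "chain world"; it is meant only to illustrate the general update rule, not anything specific to the book's presentation, and all constants and the environment are made up for illustration.

```python
# A minimal tabular Q-learning sketch. The tiny deterministic chain environment
# and the hyperparameters are invented here purely for illustration.
import random
from collections import defaultdict

N_STATES, GOAL = 5, 4          # states 0..4 on a line; reward only at the goal
ACTIONS = (-1, +1)             # step left / step right

def step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

Q = defaultdict(float)         # Q[(state, action)], implicitly initialised to 0
alpha, gamma, epsilon = 0.5, 0.9, 0.3

for episode in range(200):
    s = 0
    for t in range(1000):      # cap episode length so the sketch always halts
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s',a')
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
        if done:
            break

# After training, the greedy policy should step right (towards the goal)
# from the non-terminal states.
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)})
```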
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (Third Edition)
-
Chapter 1 Language Processing and Python
Chapter 2 Accessing Text Corpora and Lexical Resources
Chapter 3 Processing Raw Text
Chapt...
-
Contents
III Data Preparation
IV Bag of Words
V Word Embeddings
VI Text Classification
VII Language Modeling
VIII Image Ca...
-
PART I Classical Approaches
1 Classical Approaches to Natural Language Processing (Robert Dale)
2 Text Preprocessing (David...)