
An Introduction to Machine Learning for Biologists

Journal Club

September 25, 2012

  1. A Brief Introduction To Machine Learning, Gunnar Rätsch (pdf)
  2. ML Bioinformatics Summer Course, Gunnar Rätsch (web)
  3. The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman (pdf)

Images and text are copied from these sources.

Machine Learning

"Aims to mimic intelligent abilities of humans by machines"

Machine Learning

What is it used for?

Examples

It's what Google does

Figure_13

Netflix Prize

Figure_2

Bioinformatics

Figure_9

Bioinformatics

Figure_10

Microarray Analysis

  1. Which samples are most similar?
  2. Which genes are most similar?
  3. Which expression variations correlate with specific diseases?
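The first two questions are clustering problems. As an illustrative sketch (not from the slides), hierarchical clustering with scipy on a small synthetic expression matrix groups the samples; clustering genes instead would simply use the transposed matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic expression matrix: 6 samples x 4 genes, two hidden groups
expr = np.vstack([rng.normal(0.0, 0.3, (3, 4)),
                  rng.normal(2.0, 0.3, (3, 4))])

# Cluster samples (rows); to cluster genes, pass expr.T instead
Z = linkage(expr, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # samples fall into two clusters matching the hidden groups
```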

Figure_1

Polyphen2

Predicts possible impact of an amino acid substitution

  • SVM, Naive Bayes (Sunyaev)

Figure_4

ArrayCGH

Predict copy number

  • Fused Lasso (ESL)

Figure_7

Splicing Code

Predict splicing

  • SVM, Graphical Model, ...

Figure_8

Protein Structure

Figure_12

Machine Learning

A Closer Look ...

Machine Learning is ...

Concerned with how to make machines learn from data

  • Observe examples that represent incomplete information about some "statistical phenomenon"

Learning Algorithms

Generate rules for making predictions

  • INPUT: Training Data
  • OUTPUT: Classifier, Probabilistic Model, Clusters ...

Learning Algorithms

Two Types

  1. Unsupervised learning
  2. Supervised learning

Unsupervised Learning

Uncover hidden regularities or detect anomalies

  • Input training data $D$
  • Learn model $P(D)$
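As a sketch of "learn $P(D)$" (synthetic data; a Gaussian mixture is one choice of model, not the only one), scikit-learn can fit a density model to unlabeled data and recover its hidden structure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Unlabeled training data D drawn from two hidden groups
D = np.concatenate([rng.normal(-2, 0.5, 200),
                    rng.normal(3, 0.5, 200)]).reshape(-1, 1)

# Learn a model of P(D): a two-component Gaussian mixture
model = GaussianMixture(n_components=2, random_state=0).fit(D)
print(sorted(model.means_.ravel()))  # close to the hidden means -2 and 3
```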

Supervised Learning

Learn function that predicts label $Y$ associated with each example $X$

  • Input training data $D$
  • Learn prediction function $Y = F_D(X)$
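A minimal sketch of learning $F_D$ (synthetic data; logistic regression stands in for any supervised learner): fit on training pairs $(X, Y)$, then predict labels for new examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Training data D = (X, Y): one feature, binary label Y = 1 when X > 0
X = rng.normal(0, 1, (100, 1))
Y = (X.ravel() > 0).astype(int)

# Learn the prediction function F_D from the training data ...
F = LogisticRegression().fit(X, Y)
# ... then apply it to previously unseen examples
print(F.predict([[-2.0], [2.0]]))  # -> [0 1]
```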

Supervised Learning

Binary $Y$ == "Classification"

Real-valued $Y$ == "Regression"

Learning Algorithms

Many varieties

Very accurate and efficient

Easy to use

Surpass humans' ability to process large quantities of complex data

An Aside About Training Data ...

Data Collection and Representation

Input $X$

"Pattern", "Features", "Predictors"

  • Invariant to irrelevant transformations
  • Sensitive to differences between examples

Output $Y$

"Response", "Target", "Class / Categorical label"

  • Represents the Truth
  • Typically difficult and/or expensive to collect

Training Data

Inputs

  • $n$ examples
  • $m$ features
  • $n \times m$ matrix

Figure_18

Outputs

  • $Y_1, \ldots, Y_n$
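The layout above, sketched with numpy (shapes only; the values are placeholders): inputs form an $n \times m$ matrix, outputs one label per example:

```python
import numpy as np

n, m = 5, 3                       # n examples, m features
rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))       # inputs: n x m matrix
Y = rng.integers(0, 2, size=n)    # outputs: Y_1, ..., Y_n

print(X.shape, Y.shape)  # (5, 3) (5,)
```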

Classification Algorithms

k - Nearest Neighbors

Prediction is a majority vote among the $k$ closest training points

  • Distance measured in $m$ dimensional features space
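A sketch with scikit-learn's `KNeighborsClassifier` on synthetic 2-dimensional data (mirroring the $k=1$ vs. $k=15$ comparison below; the data here is made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two classes in a 2-d feature space
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Prediction = majority vote among the k nearest training points;
# varying k changes the smoothness of the decision boundary
for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.predict([[0.0, 0.0], [3.0, 3.0]]))
```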

k=15 - Nearest Neighbors

  • 2-dimensional features
  • Color indicates class
  • Line depicts decision boundary

Figure_23

k=1 - Nearest Neighbors

  • Different solutions are possible with different $k$

Figure_24

Decision Trees

Partition data into a tree

  • Prediction is a majority vote of training labels within a leaf node
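The same idea as a runnable sketch (toy data, scikit-learn's `DecisionTreeClassifier`): the tree splits the feature space, and each leaf predicts the majority label of the training points inside it:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label depends on a threshold on a single feature
X = [[0.1], [0.4], [0.35], [0.8], [0.9], [0.75]]
y = [0, 0, 0, 1, 1, 1]

# Partition the data into a tree; each leaf votes with
# the majority label of the training points it contains
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[0.2], [0.85]]))  # -> [0 1]
```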

Figure_52

Support Vector Machines

  • Separating hyperplane

Figure_20

Support Vector Machines

  • SVMs can find non-linear decision boundaries efficiently
  • A linear decision boundary in the high-dimensional transformed space corresponds to a non-linear boundary in the original space
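A sketch of this point with scikit-learn's `SVC` on XOR-like data (synthetic; the kernel and parameters are illustrative choices): no single hyperplane separates the classes in the original 2-d space, but an RBF kernel finds a separating hyperplane in the implicit high-dimensional space:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-like data: not separable by any hyperplane in 2-d
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10, dtype=float)
y = np.array([0, 1, 1, 0] * 10)

# A linear SVM cannot fit this pattern ...
linear = SVC(kernel="linear", C=10.0).fit(X, y)
# ... but the RBF kernel's implicit feature map makes it
# linearly separable in the transformed space
rbf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print(linear.score(X, y), rbf.score(X, y))
```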

Figure_19

Support Vector Machines

  • Comparison of linear and non-linear classification performance

Figure_42 Figure_43

More Classification Algorithms

  • Linear Discriminant Analysis
  • Boosting
  • Neural Networks

Regression Algorithms

Learning to predict real-valued outputs

  • Linear Regression
  • Logistic Regression
  • Regression Trees
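A minimal regression sketch (synthetic data, ordinary least squares via scikit-learn): the learner recovers the real-valued relationship $y \approx 2x + 1$ from noisy examples:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Real-valued target: y = 2x + 1 plus a little noise
X = rng.uniform(0, 10, (50, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.1, 50)

reg = LinearRegression().fit(X, y)
print(round(reg.coef_[0], 2), round(reg.intercept_, 2))  # close to 2 and 1
```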

Figure_32

Feature Selection & Dimensionality Reduction

Techniques used before and/or during learning

Characterize inputs in low-dimensional space

  • False Discovery Rate
  • PCA
  • Multidimensional Scaling
  • Latent Factor Analysis
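A PCA sketch (synthetic data with one dominant direction of variation; scikit-learn's `PCA`): the first component captures nearly all the variance, so the 10 features can be summarized in a much lower-dimensional space:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 examples, 10 features, but most variance lies along one direction
base = rng.normal(0, 5, (100, 1))
X = base @ np.ones((1, 10)) + rng.normal(0, 0.1, (100, 10))

pca = PCA(n_components=2).fit(X)
# The first component explains nearly all of the variance
print(pca.explained_variance_ratio_.round(3))
```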

Figure_3

Not Covered ...

Learning theory

Bias & variance (overfitting)

Cross validation

Fin.



Published

24 September 2012