SMLHomework/CourseExample.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine learning for text classification\n",
"with scikit-learn and nltk\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Agenda\n",
"\n",
"- Model building in scikit-learn (refresher)\n",
"- Representing text as numerical data\n",
"- Reading a text-based dataset into pandas\n",
"- Vectorizing our dataset\n",
"- Building and evaluating a model\n",
"- Comparing models\n",
"- Examining a model for further insight\n",
"- Tuning the vectorizer (discussion)\n",
"- Some NLP tools to preprocess text"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# for Python 2: use print only as a function\n",
"from __future__ import print_function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model building in scikit-learn (refresher)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# load the iris dataset as an example\n",
"from sklearn.datasets import load_iris\n",
"iris = load_iris()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# store the feature matrix (X) and label vector (y)\n",
"X = iris.data\n",
"y = iris.target"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150, 4)\n",
"(150,)\n",
"['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n",
"[ 4.9 3. 1.4 0.2]\n"
]
}
],
"source": [
"# check the shapes of X and y\n",
"print(X.shape)\n",
"print(y.shape)\n",
"print(iris.feature_names)\n",
"print(X[1])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sepal length (cm)</th>\n",
" <th>sepal width (cm)</th>\n",
" <th>petal length (cm)</th>\n",
" <th>petal width (cm)</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>5.1</td>\n",
" <td>3.5</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4.9</td>\n",
" <td>3.0</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.7</td>\n",
" <td>3.2</td>\n",
" <td>1.3</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.6</td>\n",
" <td>3.1</td>\n",
" <td>1.5</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.0</td>\n",
" <td>3.6</td>\n",
" <td>1.4</td>\n",
" <td>0.2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n",
"0 5.1 3.5 1.4 0.2\n",
"1 4.9 3.0 1.4 0.2\n",
"2 4.7 3.2 1.3 0.2\n",
"3 4.6 3.1 1.5 0.2\n",
"4 5.0 3.6 1.4 0.2"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the first 5 rows of the feature matrix (including the feature names)\n",
"# to install pandas, 'pip install pandas' or 'conda install pandas'\n",
"import pandas as pd\n",
"pd.DataFrame(X, columns=iris.feature_names).head()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
" 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
" 2 2]\n"
]
}
],
"source": [
"# examine the label vector\n",
"print(y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to **build a model**, the features must be **numeric**, and every observation must have the **same features in the same order**."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
" weights='uniform')"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import the class\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# instantiate the model (with the default parameters)\n",
"knn = KNeighborsClassifier()\n",
"\n",
"# fit the model with data (occurs in-place)\n",
"knn.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1])"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# predict the response for a new observation\n",
"knn.predict([[3, 5, 4, 2]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn classical workflow:\n",
"1. import\n",
"2. instantiate\n",
"3. fit\n",
"4. predict"
]
},
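{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch, here is the same workflow again with a logistic regression classifier (the `logreg_iris` name is only used for this illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1. import the class\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# 2. instantiate the model (with the default parameters)\n",
"logreg_iris = LogisticRegression()\n",
"\n",
"# 3. fit the model with data\n",
"logreg_iris.fit(X, y)\n",
"\n",
"# 4. predict the response for a new observation\n",
"logreg_iris.predict([[3, 5, 4, 2]])"
]
},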
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Representing text as numerical data"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# example text for model training (SMS messages)\n",
"simple_train = [\"call you tonight\", \"Call me a cab\", \"please call me... PLEASE!\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
"\n",
"> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
"\n",
"We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# import and instantiate CountVectorizer (with the default parameters)\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"vect = CountVectorizer()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# learn the \"vocabulary\" of the training data (occurs in-place)\n",
"vect.fit(simple_train)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['cab', 'call', 'me', 'please', 'tonight', 'you']"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the fitted vocabulary\n",
"vect.get_feature_names()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<3x6 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 9 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform training data into a \"document-term matrix'\n",
"simple_train_dtm = vect.transform(simple_train)\n",
"simple_train_dtm"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 1, 0, 0, 1, 1],\n",
" [1, 1, 1, 0, 0, 0],\n",
" [0, 1, 1, 2, 0, 0]])"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# convert sparse matrix to a dense matrix\n",
"simple_train_dtm.toarray()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cab</th>\n",
" <th>call</th>\n",
" <th>me</th>\n",
" <th>please</th>\n",
" <th>tonight</th>\n",
" <th>you</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cab call me please tonight you\n",
"0 0 1 0 0 1 1\n",
"1 1 1 1 0 0 0\n",
"2 0 1 1 2 0 0"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the vocabulary and document-term matrix together\n",
"pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
"\n",
"> In this scheme, features and samples are defined as follows:\n",
"\n",
"> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
"> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
"\n",
"> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
"\n",
"> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or _\"Bag of n-grams\"_ representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
]
},
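{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because word order is ignored, two documents containing exactly the same tokens in a different order map to identical rows. A small sketch using the vectorizer fitted above (the `reordered` list exists only for this illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# word order is ignored: both documents produce the same count vector\n",
"reordered = [\"call me tonight\", \"tonight call me\"]\n",
"vect.transform(reordered).toarray()"
]
},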
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"scipy.sparse.csr.csr_matrix"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the type of the document-term matrix\n",
"type(simple_train_dtm)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" (0, 1)\t1\n",
" (0, 4)\t1\n",
" (0, 5)\t1\n",
" (1, 0)\t1\n",
" (1, 1)\t1\n",
" (1, 2)\t1\n",
" (2, 1)\t1\n",
" (2, 2)\t1\n",
" (2, 3)\t2\n"
]
}
],
"source": [
"# examine the sparse matrix contents\n",
"print(simple_train_dtm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
"\n",
"> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).\n",
"\n",
"> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.\n",
"\n",
"> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package."
]
},
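{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough check, the density of a document-term matrix (the fraction of entries that are non-zero) can be computed from its `nnz` attribute. For the real SMS data used later, this fraction will be far smaller than for our three toy messages:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# density = stored (non-zero) entries / total number of entries\n",
"density = simple_train_dtm.nnz / float(simple_train_dtm.shape[0] * simple_train_dtm.shape[1])\n",
"density"
]
},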
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# example text for model testing\n",
"simple_test = [\"please don't call me\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 1, 1, 1, 0, 0]])"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform testing data into a document-term matrix (using existing vocabulary)\n",
"simple_test_dtm = vect.transform(simple_test)\n",
"simple_test_dtm.toarray()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cab</th>\n",
" <th>call</th>\n",
" <th>me</th>\n",
" <th>please</th>\n",
" <th>tonight</th>\n",
" <th>you</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cab call me please tonight you\n",
"0 0 1 1 1 0 0"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the vocabulary and document-term matrix together\n",
"pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Summary:**\n",
"\n",
"- `vect.fit(train)` **learns the vocabulary** of the training data\n",
"- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data\n",
"- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)"
]
},
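{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, the fitted vocabulary is stored as a token-to-column-index mapping in the `vocabulary_` attribute (trailing underscore, as usual for attributes learned during fitting), which is why the training and testing document-term matrices share the same columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# mapping from token to column index in the document-term matrix\n",
"vect.vocabulary_"
]
},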
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading a text-based dataset into pandas"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# read file into pandas using a relative path\n",
"path = \"data/sms.tsv\"\n",
"sms = pd.read_table(path, header=None, names=[\"label\", \"message\"])"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# alternative: read file into pandas from a URL\n",
"#url = \"http://www.irisa.fr/dyliss/public/fcoste/data/pub/sms.tsv\"\n",
"#sms = pd.read_table(url, header=None, names=['label', \"message\"])"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5572, 2)"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the shape\n",
"sms.shape"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ham</td>\n",
" <td>Go until jurong point, crazy.. Available only ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ham</td>\n",
" <td>Ok lar... Joking wif u oni...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>spam</td>\n",
" <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ham</td>\n",
" <td>U dun say so early hor... U c already then say...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ham</td>\n",
" <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>spam</td>\n",
" <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>ham</td>\n",
" <td>Even my brother is not like to speak with me. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>ham</td>\n",
" <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>spam</td>\n",
" <td>WINNER!! As a valued network customer you have...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>spam</td>\n",
" <td>Had your mobile 11 months or more? U R entitle...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label message\n",
"0 ham Go until jurong point, crazy.. Available only ...\n",
"1 ham Ok lar... Joking wif u oni...\n",
"2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
"3 ham U dun say so early hor... U c already then say...\n",
"4 ham Nah I don't think he goes to usf, he lives aro...\n",
"5 spam FreeMsg Hey there darling it's been 3 week's n...\n",
"6 ham Even my brother is not like to speak with me. ...\n",
"7 ham As per your request 'Melle Melle (Oru Minnamin...\n",
"8 spam WINNER!! As a valued network customer you have...\n",
"9 spam Had your mobile 11 months or more? U R entitle..."
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the first 10 rows\n",
"sms.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ham 4825\n",
"spam 747\n",
"Name: label, dtype: int64"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the class distribution\n",
"sms.label.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# convert label to a numerical variable\n",
"sms[\"label_num\"] = sms.label.map({\"ham\":0, \"spam\":1})"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>message</th>\n",
" <th>label_num</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ham</td>\n",
" <td>Go until jurong point, crazy.. Available only ...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ham</td>\n",
" <td>Ok lar... Joking wif u oni...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>spam</td>\n",
" <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ham</td>\n",
" <td>U dun say so early hor... U c already then say...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ham</td>\n",
" <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>spam</td>\n",
" <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>ham</td>\n",
" <td>Even my brother is not like to speak with me. ...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>ham</td>\n",
" <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>spam</td>\n",
" <td>WINNER!! As a valued network customer you have...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>spam</td>\n",
" <td>Had your mobile 11 months or more? U R entitle...</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label message label_num\n",
"0 ham Go until jurong point, crazy.. Available only ... 0\n",
"1 ham Ok lar... Joking wif u oni... 0\n",
"2 spam Free entry in 2 a wkly comp to win FA Cup fina... 1\n",
"3 ham U dun say so early hor... U c already then say... 0\n",
"4 ham Nah I don't think he goes to usf, he lives aro... 0\n",
"5 spam FreeMsg Hey there darling it's been 3 week's n... 1\n",
"6 ham Even my brother is not like to speak with me. ... 0\n",
"7 ham As per your request 'Melle Melle (Oru Minnamin... 0\n",
"8 spam WINNER!! As a valued network customer you have... 1\n",
"9 spam Had your mobile 11 months or more? U R entitle... 1"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check that the conversion worked\n",
"sms.head(10)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150, 4)\n",
"(150,)\n"
]
}
],
"source": [
"# how to define X and y (from the iris data) for use with a MODEL\n",
"X = iris.data\n",
"y = iris.target\n",
"print(X.shape)\n",
"print(y.shape)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(5572,)\n",
"(5572,)\n"
]
}
],
"source": [
"# how to define X and y (from the SMS data) for use with COUNTVECTORIZER\n",
"X = sms.message\n",
"y = sms.label_num\n",
"print(X.shape)\n",
"print(y.shape)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(4179,)\n",
"(1393,)\n",
"(4179,)\n",
"(1393,)\n"
]
}
],
"source": [
"# split X and y into training and testing sets\n",
"#from sklearn.cross_validation import train_test_split\n",
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
"print(X_train.shape)\n",
"print(X_test.shape)\n",
"print(y_train.shape)\n",
"print(y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectorizing our dataset"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# instantiate the vectorizer\n",
"vect = CountVectorizer()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# learn training data vocabulary, then use it to create a document-term matrix\n",
"vect.fit(X_train)\n",
"X_train_dtm = vect.transform(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# equivalently: combine fit and transform into a single step\n",
"X_train_dtm = vect.fit_transform(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<4179x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 55209 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the document-term matrix\n",
"X_train_dtm"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1393x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 17604 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform testing data (using fitted vocabulary) into a document-term matrix\n",
"X_test_dtm = vect.transform(X_test)\n",
"X_test_dtm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building and evaluating a model\n",
"\n",
"We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
"\n",
"> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# import and instantiate a Multinomial Naive Bayes model\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"nb = MultinomialNB()"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 0 ns, sys: 4 ms, total: 4 ms\n",
"Wall time: 2.29 ms\n"
]
},
{
"data": {
"text/plain": [
"MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# train the model using X_train_dtm (timing it with an IPython \"magic command\")\n",
"%time nb.fit(X_train_dtm, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# make class predictions for X_test_dtm\n",
"y_pred_class = nb.predict(X_test_dtm)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.98851399856424982"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate accuracy of class predictions\n",
"from sklearn import metrics\n",
"metrics.accuracy_score(y_test, y_pred_class)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1203, 5],\n",
" [ 11, 174]])"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print the confusion matrix\n",
"metrics.confusion_matrix(y_test, y_pred_class)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/plain": [
"574 Waiting for your call.\n",
"3375 Also andros ice etc etc\n",
"45 No calls..messages..missed calls\n",
"3415 No pic. Please re-send.\n",
"1988 No calls..messages..missed calls\n",
"Name: message, dtype: object"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print message text for the false positives (ham incorrectly classified as spam)\n",
"X_test[(y_pred_class==1) & (y_test==0)] # using pandas' indexing by boolean array"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"3132 LookAtMe!: Thanks for your purchase of a video...\n",
"5 FreeMsg Hey there darling it's been 3 week's n...\n",
"3530 Xmas & New Years Eve tickets are now on sale f...\n",
"684 Hi I'm sue. I am 20 years old and work as a la...\n",
"1875 Would you like to see my XXX pics they are so ...\n",
"1893 CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...\n",
"4298 thesmszone.com lets you send free anonymous an...\n",
"4949 Hi this is Amy, we will be sending you a free ...\n",
"2821 INTERFLORA - “It's not too late to order Inter...\n",
"2247 Hi ya babe x u 4goten bout me?' scammers getti...\n",
"4514 Money i have won wining number 946 wot do i do...\n",
"Name: message, dtype: object"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# print message text for the false negatives (spam incorrectly classified as ham)\n",
"X_test[(y_pred_class < y_test)]"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example false negative\n",
"X_test[3132]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**predict_proba(X)**\n",
"\n",
" Return probability estimates for the test vector X.\n",
" Parameters:\t\n",
" X : array-like, shape = [n_samples, n_features]\n",
" Returns: \n",
" C : array-like, shape = [n_samples, n_classes]\n",
"\n",
" Returns the probability of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes.\n"
]
},
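{
"cell_type": "markdown",
"metadata": {},
"source": [
"The columns of the returned array follow the order of the fitted `classes_` attribute; since our labels are 0 (ham) and 1 (spam), column 1 holds the predicted probability of spam:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# class order used by predict_proba\n",
"nb.classes_"
]
},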
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1393x7456 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 17604 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test_dtm"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 9.97122551e-01, 2.87744864e-03],\n",
" [ 9.99981651e-01, 1.83488846e-05],\n",
" [ 9.97926987e-01, 2.07301295e-03],\n",
" ..., \n",
" [ 9.99998910e-01, 1.09026171e-06],\n",
" [ 1.86697467e-10, 1.00000000e+00],\n",
" [ 9.99999996e-01, 3.98279868e-09]])"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nb.predict_proba(X_test_dtm)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 2.87744864e-03, 1.83488846e-05, 2.07301295e-03, ...,\n",
" 1.09026171e-06, 1.00000000e+00, 3.98279868e-09])"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate predicted probabilities for X_test_dtm (poorly calibrated)\n",
"y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
"y_pred_prob"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.98664310005369615"
]
},
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate AUC\n",
"metrics.roc_auc_score(y_test, y_pred_prob)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comparing models\n",
"\n",
"We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):\n",
"\n",
"> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# import and instantiate a logistic regression model \n",
"from sklearn.linear_model import LogisticRegression\n",
"logreg = LogisticRegression()"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 32 ms, sys: 0 ns, total: 32 ms\n",
"Wall time: 32.4 ms\n"
]
},
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
" intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
" penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
" verbose=0, warm_start=False)"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# train the model using X_train_dtm\n",
"%time logreg.fit(X_train_dtm, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# make class predictions for X_test_dtm\n",
"y_pred_class = logreg.predict(X_test_dtm)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.01269556, 0.00347183, 0.00616517, ..., 0.03354907,\n",
" 0.99725053, 0.00157706])"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate predicted probabilities for X_test_dtm (well calibrated)\n",
"y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]\n",
"y_pred_prob"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9877961234745154"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate accuracy\n",
"metrics.accuracy_score(y_test, y_pred_class)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.99368176123143015"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate AUC\n",
"metrics.roc_auc_score(y_test, y_pred_prob)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examining a model for further insight \n",
"We will examine the our **trained Naive Bayes model** to calculate the approximate **\"spamminess\" of each token**."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"7456"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# store the vocabulary of X_train\n",
"X_train_tokens = vect.get_feature_names()\n",
"len(X_train_tokens)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']\n"
]
}
],
"source": [
"# examine the first 50 tokens\n",
"print(X_train_tokens[0:50])"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']\n"
]
}
],
"source": [
"# examine the last 50 tokens\n",
"print(X_train_tokens[-50:])"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0., 0., 0., ..., 1., 1., 1.],\n",
" [ 5., 23., 2., ..., 0., 0., 0.]])"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Naive Bayes counts the number of times each token appears in each class\n",
"# trailing underscore is scikit convention for attributes that are learned during model fitting\n",
"nb.feature_count_"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2, 7456)"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# rows represent classes, columns represent tokens\n",
"nb.feature_count_.shape"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0., 0., 0., ..., 1., 1., 1.])"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of times each token appears across all HAM messages\n",
"ham_token_count = nb.feature_count_[0, :]\n",
"ham_token_count"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 5., 23., 2., ..., 0., 0., 0.])"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# number of times each token appears across all SPAM messages\n",
"spam_token_count = nb.feature_count_[1, :]\n",
"spam_token_count"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ham</th>\n",
" <th>spam</th>\n",
" </tr>\n",
" <tr>\n",
" <th>token</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>00</th>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000</th>\n",
" <td>0.0</td>\n",
" <td>23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>008704050406</th>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0121</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>01223585236</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ham spam\n",
"token \n",
"00 0.0 5.0\n",
"000 0.0 23.0\n",
"008704050406 0.0 2.0\n",
"0121 0.0 1.0\n",
"01223585236 0.0 1.0"
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# create a DataFrame of tokens with their separate ham and spam counts\n",
"tokens = pd.DataFrame({\"token\":X_train_tokens, \"ham\":ham_token_count, \"spam\":spam_token_count}).set_index(\"token\")\n",
"tokens.head()"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ham</th>\n",
" <th>spam</th>\n",
" </tr>\n",
" <tr>\n",
" <th>token</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>very</th>\n",
" <td>64.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>nasty</th>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>villa</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>beloved</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>textoperator</th>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ham spam\n",
"token \n",
"very 64.0 2.0\n",
"nasty 1.0 1.0\n",
"villa 0.0 1.0\n",
"beloved 1.0 0.0\n",
"textoperator 0.0 2.0"
]
},
"execution_count": 82,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine 5 random DataFrame rows\n",
"tokens.sample(5, random_state=6)"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 3617., 562.])"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Naive Bayes counts the number of observations in each class\n",
"nb.class_count_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we can calculate the _\"spamminess\"_ of each token, we need to avoid **multiplying by zero** and account for the **class imbalance**."
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ham</th>\n",
" <th>spam</th>\n",
" </tr>\n",
" <tr>\n",
" <th>token</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>very</th>\n",
" <td>65.0</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>nasty</th>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>villa</th>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>beloved</th>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>textoperator</th>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ham spam\n",
"token \n",
"very 65.0 3.0\n",
"nasty 2.0 2.0\n",
"villa 1.0 2.0\n",
"beloved 2.0 1.0\n",
"textoperator 1.0 3.0"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# add 1 to ham and spam counts to avoid 0 probabilities\n",
"tokens['ham'] = tokens['ham'] + 1\n",
"tokens['spam'] = tokens['spam'] + 1\n",
"tokens.sample(5, random_state=6)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ham</th>\n",
" <th>spam</th>\n",
" </tr>\n",
" <tr>\n",
" <th>token</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>very</th>\n",
" <td>0.017971</td>\n",
" <td>0.005338</td>\n",
" </tr>\n",
" <tr>\n",
" <th>nasty</th>\n",
" <td>0.000553</td>\n",
" <td>0.003559</td>\n",
" </tr>\n",
" <tr>\n",
" <th>villa</th>\n",
" <td>0.000276</td>\n",
" <td>0.003559</td>\n",
" </tr>\n",
" <tr>\n",
" <th>beloved</th>\n",
" <td>0.000553</td>\n",
" <td>0.001779</td>\n",
" </tr>\n",
" <tr>\n",
" <th>textoperator</th>\n",
" <td>0.000276</td>\n",
" <td>0.005338</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ham spam\n",
"token \n",
"very 0.017971 0.005338\n",
"nasty 0.000553 0.003559\n",
"villa 0.000276 0.003559\n",
"beloved 0.000553 0.001779\n",
"textoperator 0.000276 0.005338"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# convert the ham and spam counts into frequencies\n",
"tokens['ham'] = tokens['ham'] / nb.class_count_[0]\n",
"tokens['spam'] = tokens['spam'] / nb.class_count_[1]\n",
"tokens.sample(5, random_state=6)"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ham</th>\n",
" <th>spam</th>\n",
" <th>spam_ratio</th>\n",
" </tr>\n",
" <tr>\n",
" <th>token</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>very</th>\n",
" <td>0.017971</td>\n",
" <td>0.005338</td>\n",
" <td>0.297044</td>\n",
" </tr>\n",
" <tr>\n",
" <th>nasty</th>\n",
" <td>0.000553</td>\n",
" <td>0.003559</td>\n",
" <td>6.435943</td>\n",
" </tr>\n",
" <tr>\n",
" <th>villa</th>\n",
" <td>0.000276</td>\n",
" <td>0.003559</td>\n",
" <td>12.871886</td>\n",
" </tr>\n",
" <tr>\n",
" <th>beloved</th>\n",
" <td>0.000553</td>\n",
" <td>0.001779</td>\n",
" <td>3.217972</td>\n",
" </tr>\n",
" <tr>\n",
" <th>textoperator</th>\n",
" <td>0.000276</td>\n",
" <td>0.005338</td>\n",
" <td>19.307829</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ham spam spam_ratio\n",
"token \n",
"very 0.017971 0.005338 0.297044\n",
"nasty 0.000553 0.003559 6.435943\n",
"villa 0.000276 0.003559 12.871886\n",
"beloved 0.000553 0.001779 3.217972\n",
"textoperator 0.000276 0.005338 19.307829"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# calculate the ratio of spam-to-ham for each token\n",
"tokens['spam_ratio'] = tokens['spam'] / tokens['ham']\n",
"tokens.sample(5, random_state=6)"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ham</th>\n",
" <th>spam</th>\n",
" <th>spam_ratio</th>\n",
" </tr>\n",
" <tr>\n",
" <th>token</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>claim</th>\n",
" <td>0.000276</td>\n",
" <td>0.158363</td>\n",
" <td>572.798932</td>\n",
" </tr>\n",
" <tr>\n",
" <th>prize</th>\n",
" <td>0.000276</td>\n",
" <td>0.135231</td>\n",
" <td>489.131673</td>\n",
" </tr>\n",
" <tr>\n",
" <th>150p</th>\n",
" <td>0.000276</td>\n",
" <td>0.087189</td>\n",
" <td>315.361210</td>\n",
" </tr>\n",
" <tr>\n",
" <th>tone</th>\n",
" <td>0.000276</td>\n",
" <td>0.085409</td>\n",
" <td>308.925267</td>\n",
" </tr>\n",
" <tr>\n",
" <th>guaranteed</th>\n",
" <td>0.000276</td>\n",
" <td>0.076512</td>\n",
" <td>276.745552</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>0.000276</td>\n",
" <td>0.069395</td>\n",
" <td>251.001779</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cs</th>\n",
" <td>0.000276</td>\n",
" <td>0.065836</td>\n",
" <td>238.129893</td>\n",
" </tr>\n",
" <tr>\n",
" <th>www</th>\n",
" <td>0.000553</td>\n",
" <td>0.129893</td>\n",
" <td>234.911922</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1000</th>\n",
" <td>0.000276</td>\n",
" <td>0.056940</td>\n",
" <td>205.950178</td>\n",
" </tr>\n",
" <tr>\n",
" <th>awarded</th>\n",
" <td>0.000276</td>\n",
" <td>0.053381</td>\n",
" <td>193.078292</td>\n",
" </tr>\n",
" <tr>\n",
" <th>150ppm</th>\n",
" <td>0.000276</td>\n",
" <td>0.051601</td>\n",
" <td>186.642349</td>\n",
" </tr>\n",
" <tr>\n",
" <th>uk</th>\n",
" <td>0.000553</td>\n",
" <td>0.099644</td>\n",
" <td>180.206406</td>\n",
" </tr>\n",
" <tr>\n",
" <th>500</th>\n",
" <td>0.000276</td>\n",
" <td>0.048043</td>\n",
" <td>173.770463</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ringtone</th>\n",
" <td>0.000276</td>\n",
" <td>0.044484</td>\n",
" <td>160.898577</td>\n",
" </tr>\n",
" <tr>\n",
" <th>000</th>\n",
" <td>0.000276</td>\n",
" <td>0.042705</td>\n",
" <td>154.462633</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mob</th>\n",
" <td>0.000276</td>\n",
" <td>0.042705</td>\n",
" <td>154.462633</td>\n",
" </tr>\n",
" <tr>\n",
" <th>co</th>\n",
" <td>0.000553</td>\n",
" <td>0.078292</td>\n",
" <td>141.590747</td>\n",
" </tr>\n",
" <tr>\n",
" <th>collection</th>\n",
" <td>0.000276</td>\n",
" <td>0.039146</td>\n",
" <td>141.590747</td>\n",
" </tr>\n",
" <tr>\n",
" <th>valid</th>\n",
" <td>0.000276</td>\n",
" <td>0.037367</td>\n",
" <td>135.154804</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2000</th>\n",
" <td>0.000276</td>\n",
" <td>0.037367</td>\n",
" <td>135.154804</td>\n",
" </tr>\n",
" <tr>\n",
" <th>800</th>\n",
" <td>0.000276</td>\n",
" <td>0.037367</td>\n",
" <td>135.154804</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10p</th>\n",
" <td>0.000276</td>\n",
" <td>0.037367</td>\n",
" <td>135.154804</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8007</th>\n",
" <td>0.000276</td>\n",
" <td>0.035587</td>\n",
" <td>128.718861</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>0.000553</td>\n",
" <td>0.067616</td>\n",
" <td>122.282918</td>\n",
" </tr>\n",
" <tr>\n",
" <th>weekly</th>\n",
" <td>0.000276</td>\n",
" <td>0.033808</td>\n",
" <td>122.282918</td>\n",
" </tr>\n",
" <tr>\n",
" <th>tones</th>\n",
" <td>0.000276</td>\n",
" <td>0.032028</td>\n",
" <td>115.846975</td>\n",
" </tr>\n",
" <tr>\n",
" <th>land</th>\n",
" <td>0.000276</td>\n",
" <td>0.032028</td>\n",
" <td>115.846975</td>\n",
" </tr>\n",
" <tr>\n",
" <th>http</th>\n",
" <td>0.000276</td>\n",
" <td>0.032028</td>\n",
" <td>115.846975</td>\n",
" </tr>\n",
" <tr>\n",
" <th>national</th>\n",
" <td>0.000276</td>\n",
" <td>0.030249</td>\n",
" <td>109.411032</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5000</th>\n",
" <td>0.000276</td>\n",
" <td>0.030249</td>\n",
" <td>109.411032</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>went</th>\n",
" <td>0.012718</td>\n",
" <td>0.001779</td>\n",
" <td>0.139912</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ll</th>\n",
" <td>0.052530</td>\n",
" <td>0.007117</td>\n",
" <td>0.135494</td>\n",
" </tr>\n",
" <tr>\n",
" <th>told</th>\n",
" <td>0.013824</td>\n",
" <td>0.001779</td>\n",
" <td>0.128719</td>\n",
" </tr>\n",
" <tr>\n",
" <th>feel</th>\n",
" <td>0.013824</td>\n",
" <td>0.001779</td>\n",
" <td>0.128719</td>\n",
" </tr>\n",
" <tr>\n",
" <th>gud</th>\n",
" <td>0.014100</td>\n",
" <td>0.001779</td>\n",
" <td>0.126195</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cos</th>\n",
" <td>0.014929</td>\n",
" <td>0.001779</td>\n",
" <td>0.119184</td>\n",
" </tr>\n",
" <tr>\n",
" <th>but</th>\n",
" <td>0.090683</td>\n",
" <td>0.010676</td>\n",
" <td>0.117731</td>\n",
" </tr>\n",
" <tr>\n",
" <th>amp</th>\n",
" <td>0.015206</td>\n",
" <td>0.001779</td>\n",
" <td>0.117017</td>\n",
" </tr>\n",
" <tr>\n",
" <th>something</th>\n",
" <td>0.015206</td>\n",
" <td>0.001779</td>\n",
" <td>0.117017</td>\n",
" </tr>\n",
" <tr>\n",
" <th>sure</th>\n",
" <td>0.015206</td>\n",
" <td>0.001779</td>\n",
" <td>0.117017</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ok</th>\n",
" <td>0.061100</td>\n",
" <td>0.007117</td>\n",
" <td>0.116488</td>\n",
" </tr>\n",
" <tr>\n",
" <th>said</th>\n",
" <td>0.016312</td>\n",
" <td>0.001779</td>\n",
" <td>0.109084</td>\n",
" </tr>\n",
" <tr>\n",
" <th>morning</th>\n",
" <td>0.016865</td>\n",
" <td>0.001779</td>\n",
" <td>0.105507</td>\n",
" </tr>\n",
" <tr>\n",
" <th>yeah</th>\n",
" <td>0.017694</td>\n",
" <td>0.001779</td>\n",
" <td>0.100562</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lol</th>\n",
" <td>0.017694</td>\n",
" <td>0.001779</td>\n",
" <td>0.100562</td>\n",
" </tr>\n",
" <tr>\n",
" <th>anything</th>\n",
" <td>0.017971</td>\n",
" <td>0.001779</td>\n",
" <td>0.099015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>my</th>\n",
" <td>0.150401</td>\n",
" <td>0.014235</td>\n",
" <td>0.094646</td>\n",
" </tr>\n",
" <tr>\n",
" <th>doing</th>\n",
" <td>0.019077</td>\n",
" <td>0.001779</td>\n",
" <td>0.093275</td>\n",
" </tr>\n",
" <tr>\n",
" <th>way</th>\n",
" <td>0.019630</td>\n",
" <td>0.001779</td>\n",
" <td>0.090647</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ask</th>\n",
" <td>0.019630</td>\n",
" <td>0.001779</td>\n",
" <td>0.090647</td>\n",
" </tr>\n",
" <tr>\n",
" <th>already</th>\n",
" <td>0.019630</td>\n",
" <td>0.001779</td>\n",
" <td>0.090647</td>\n",
" </tr>\n",
" <tr>\n",
" <th>too</th>\n",
" <td>0.021841</td>\n",
" <td>0.001779</td>\n",
" <td>0.081468</td>\n",
" </tr>\n",
" <tr>\n",
" <th>come</th>\n",
" <td>0.048936</td>\n",
" <td>0.003559</td>\n",
" <td>0.072723</td>\n",
" </tr>\n",
" <tr>\n",
" <th>later</th>\n",
" <td>0.030688</td>\n",
" <td>0.001779</td>\n",
" <td>0.057981</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lor</th>\n",
" <td>0.032900</td>\n",
" <td>0.001779</td>\n",
" <td>0.054084</td>\n",
" </tr>\n",
" <tr>\n",
" <th>da</th>\n",
" <td>0.032900</td>\n",
" <td>0.001779</td>\n",
" <td>0.054084</td>\n",
" </tr>\n",
" <tr>\n",
" <th>she</th>\n",
" <td>0.035665</td>\n",
" <td>0.001779</td>\n",
" <td>0.049891</td>\n",
" </tr>\n",
" <tr>\n",
" <th>he</th>\n",
" <td>0.047000</td>\n",
" <td>0.001779</td>\n",
" <td>0.037858</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lt</th>\n",
" <td>0.064142</td>\n",
" <td>0.001779</td>\n",
" <td>0.027741</td>\n",
" </tr>\n",
" <tr>\n",
" <th>gt</th>\n",
" <td>0.064971</td>\n",
" <td>0.001779</td>\n",
" <td>0.027387</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>7456 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" ham spam spam_ratio\n",
"token \n",
"claim 0.000276 0.158363 572.798932\n",
"prize 0.000276 0.135231 489.131673\n",
"150p 0.000276 0.087189 315.361210\n",
"tone 0.000276 0.085409 308.925267\n",
"guaranteed 0.000276 0.076512 276.745552\n",
"18 0.000276 0.069395 251.001779\n",
"cs 0.000276 0.065836 238.129893\n",
"www 0.000553 0.129893 234.911922\n",
"1000 0.000276 0.056940 205.950178\n",
"awarded 0.000276 0.053381 193.078292\n",
"150ppm 0.000276 0.051601 186.642349\n",
"uk 0.000553 0.099644 180.206406\n",
"500 0.000276 0.048043 173.770463\n",
"ringtone 0.000276 0.044484 160.898577\n",
"000 0.000276 0.042705 154.462633\n",
"mob 0.000276 0.042705 154.462633\n",
"co 0.000553 0.078292 141.590747\n",
"collection 0.000276 0.039146 141.590747\n",
"valid 0.000276 0.037367 135.154804\n",
"2000 0.000276 0.037367 135.154804\n",
"800 0.000276 0.037367 135.154804\n",
"10p 0.000276 0.037367 135.154804\n",
"8007 0.000276 0.035587 128.718861\n",
"16 0.000553 0.067616 122.282918\n",
"weekly 0.000276 0.033808 122.282918\n",
"tones 0.000276 0.032028 115.846975\n",
"land 0.000276 0.032028 115.846975\n",
"http 0.000276 0.032028 115.846975\n",
"national 0.000276 0.030249 109.411032\n",
"5000 0.000276 0.030249 109.411032\n",
"... ... ... ...\n",
"went 0.012718 0.001779 0.139912\n",
"ll 0.052530 0.007117 0.135494\n",
"told 0.013824 0.001779 0.128719\n",
"feel 0.013824 0.001779 0.128719\n",
"gud 0.014100 0.001779 0.126195\n",
"cos 0.014929 0.001779 0.119184\n",
"but 0.090683 0.010676 0.117731\n",
"amp 0.015206 0.001779 0.117017\n",
"something 0.015206 0.001779 0.117017\n",
"sure 0.015206 0.001779 0.117017\n",
"ok 0.061100 0.007117 0.116488\n",
"said 0.016312 0.001779 0.109084\n",
"morning 0.016865 0.001779 0.105507\n",
"yeah 0.017694 0.001779 0.100562\n",
"lol 0.017694 0.001779 0.100562\n",
"anything 0.017971 0.001779 0.099015\n",
"my 0.150401 0.014235 0.094646\n",
"doing 0.019077 0.001779 0.093275\n",
"way 0.019630 0.001779 0.090647\n",
"ask 0.019630 0.001779 0.090647\n",
"already 0.019630 0.001779 0.090647\n",
"too 0.021841 0.001779 0.081468\n",
"come 0.048936 0.003559 0.072723\n",
"later 0.030688 0.001779 0.057981\n",
"lor 0.032900 0.001779 0.054084\n",
"da 0.032900 0.001779 0.054084\n",
"she 0.035665 0.001779 0.049891\n",
"he 0.047000 0.001779 0.037858\n",
"lt 0.064142 0.001779 0.027741\n",
"gt 0.064971 0.001779 0.027387\n",
"\n",
"[7456 rows x 3 columns]"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# examine the DataFrame sorted by spam_ratio\n",
"# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier\n",
"tokens.sort_values('spam_ratio', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"83.667259786476862"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# look up the spam_ratio for a given token\n",
"tokens.loc[\"dating\", \"spam_ratio\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tuning the vectorizer (discussion)\n",
"\n",
"Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# show default parameters for CountVectorizer\n",
"vect"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:\n",
"\n",
"- **stop_words:** string {'english'}, list, or None (default)\n",
" - If 'english', a built-in stop word list for English is used.\n",
" - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.\n",
" - If None, no stop words will be used."
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# remove English stop words\n",
"vect_sw = CountVectorizer(stop_words='english')"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(4179, 7204)\n",
"(4179, 7456)\n"
]
}
],
"source": [
"vect_sw.fit(X_train)\n",
"X_train_dtm_sw = vect_sw.transform(X_train)\n",
"print(X_train_dtm_sw.shape) \n",
"print(X_train_dtm.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **ngram_range:** tuple (min_n, max_n), default=(1, 1)\n",
" - The lower and upper boundary of the range of n-values for different n-grams to be extracted.\n",
" - All values of n such that min_n <= n <= max_n will be used."
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# include 1-grams and 2-grams\n",
"vect_ngram = CountVectorizer(ngram_range=(1, 2))"
]
},
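  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, you can fit this vectorizer on the training messages and compare the vocabulary size with the unigram-only version. A minimal sketch (it reuses the `X_train` split and `X_train_dtm` from earlier; the exact counts depend on your data):\n",
    "\n",
    "```\n",
    "# fit the 1-gram + 2-gram vectorizer and build the document-term matrix\n",
    "X_train_dtm_ngram = vect_ngram.fit_transform(X_train)\n",
    "\n",
    "# the 1-gram + 2-gram vocabulary is much larger than the unigram-only one\n",
    "print(X_train_dtm_ngram.shape)\n",
    "print(X_train_dtm.shape)\n",
    "```"
   ]
  },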
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **max_df:** float in range [0.0, 1.0] or int, default=1.0\n",
" - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).\n",
" - If float, the parameter represents a proportion of documents.\n",
" - If integer, the parameter represents an absolute count."
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# ignore terms that appear in more than 50% of the documents\n",
"vect_maxdf = CountVectorizer(max_df=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- **min_df:** float in range [0.0, 1.0] or int, default=1\n",
" - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called _\"cut-off\"_ in the literature.)\n",
" - If float, the parameter represents a proportion of documents.\n",
" - If integer, the parameter represents an absolute count."
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# only keep terms that appear in at least 2 documents\n",
"vect_mindf = CountVectorizer(min_df=2)"
]
},
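  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what these thresholds actually do, you can fit both vectorizers and compare the resulting vocabulary sizes with the default one. A minimal sketch (reusing `X_train` and `X_train_dtm`; the exact counts depend on your data):\n",
    "\n",
    "```\n",
    "# drop corpus-specific stop words (terms in more than 50% of documents)\n",
    "print(vect_maxdf.fit_transform(X_train).shape)\n",
    "\n",
    "# drop rare terms (terms appearing in only one document)\n",
    "print(vect_mindf.fit_transform(X_train).shape)\n",
    "\n",
    "# default vocabulary size, for comparison\n",
    "print(X_train_dtm.shape)\n",
    "```"
   ]
  },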
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Guidelines for tuning CountVectorizer:**\n",
"\n",
"- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.\n",
"- **Experiment**, and let the data tell you the best approach!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TF-IDF weighting\n",
"\n",
"From the [scikit-learn documentation](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html):\n",
"\n",
"> Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.\n",
"\n",
"> To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.\n",
"\n",
"> Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.\n",
"\n",
"> This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”."
]
},
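  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concretely, for a term $t$ with raw count $\\mathrm{tf}(t, d)$ in document $d$, $n$ training documents and document frequency $\\mathrm{df}(t)$, scikit-learn's `TfidfTransformer` with its default settings (`smooth_idf=True`, `norm='l2'`) computes, roughly:\n",
    "\n",
    "$$\\mathrm{idf}(t) = \\ln\\frac{1 + n}{1 + \\mathrm{df}(t)} + 1, \\qquad \\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\cdot \\mathrm{idf}(t)$$\n",
    "\n",
    "and then rescales each document vector to unit Euclidean norm, so that long and short documents become comparable."
   ]
  },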
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(4179, 7456)"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Computing TF only\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"tf_transformer = TfidfTransformer(use_idf=False)\n",
"X_train_tf = tf_transformer.fit_transform(X_train_dtm)\n",
"X_train_tf.shape"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(4179, 7456)"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"### Computing TF-IDF\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"tfidf_transformer = TfidfTransformer()\n",
"X_train_tfidf = tfidf_transformer.fit_transform(X_train_dtm)\n",
"X_train_tfidf.shape"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"nb_tfidf = MultinomialNB().fit(X_train_tfidf, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1393, 7456)\n"
]
}
],
"source": [
"X_test_tfidf = tfidf_transformer.transform(X_test_dtm)\n",
"print(X_test_tfidf.shape)\n",
"y_pred_class_tfidf = nb_tfidf.predict(X_test_tfidf)"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.96482412060301503"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import metrics\n",
"metrics.accuracy_score(y_test, y_pred_class_tfidf)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1208, 0],\n",
" [ 49, 136]])"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metrics.confusion_matrix(y_test, y_pred_class_tfidf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building a pipeline\n",
"\n",
"> In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"text_clf = Pipeline([('vect', CountVectorizer()),\n",
" ('tfidf', TfidfTransformer()),\n",
" ('nb', MultinomialNB()),\n",
" ])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> The names vect, tfidf and clf (classifier) are arbitrary. We shall see their use in the section on grid search, below. We can now train the model with a single command:"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
" strip...linear_tf=False, use_idf=True)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_clf.fit(X_train, y_train) "
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pred = text_clf.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1208, 0],\n",
" [ 49, 136]])"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metrics.confusion_matrix(y_test, pred)"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" ham 0.96 1.00 0.98 1208\n",
" spam 1.00 0.74 0.85 185\n",
"\n",
"avg / total 0.97 0.96 0.96 1393\n",
"\n"
]
}
],
"source": [
"print(metrics.classification_report(y_test, pred,\n",
" target_names=[\"ham\",\"spam\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__How does pipeline chain the estimators?__ \n",
"By successive 'fit' and 'transform' (except the last estimator that is only fitted) \n",
"From [scikit-learn documentation](http://scikit-learn.org/stable/modules/pipeline.html):\n",
"> Calling fit on the pipeline is the same as calling fit on each estimator in turn, transform the input and pass it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline."
]
},
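  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In other words, fitting and predicting with `text_clf` is roughly equivalent to chaining the three steps by hand. A minimal sketch of that equivalence (reusing `X_train`, `y_train` and `X_test`):\n",
    "\n",
    "```\n",
    "# what Pipeline.fit does, step by step\n",
    "vect_step = CountVectorizer()\n",
    "tfidf_step = TfidfTransformer()\n",
    "nb_step = MultinomialNB()\n",
    "\n",
    "dtm = vect_step.fit_transform(X_train)      # fit + transform\n",
    "tfidf = tfidf_step.fit_transform(dtm)       # fit + transform\n",
    "nb_step.fit(tfidf, y_train)                 # last step: fit only\n",
    "\n",
    "# what Pipeline.predict does: transform with each step, then predict\n",
    "pred_manual = nb_step.predict(tfidf_step.transform(vect_step.transform(X_test)))\n",
    "```"
   ]
  },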
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parameter tuning using grid search"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Parameters of the estimators in the pipeline can be accessed using the < estimator \\> \\_\\_ < parameter \\> syntax:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'memory': None,\n",
" 'nb': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),\n",
" 'nb__alpha': 1.0,\n",
" 'nb__class_prior': None,\n",
" 'nb__fit_prior': True,\n",
" 'steps': [('vect',\n",
" CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)),\n",
" ('tfidf',\n",
" TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),\n",
" ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],\n",
" 'tfidf': TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True),\n",
" 'tfidf__norm': 'l2',\n",
" 'tfidf__smooth_idf': True,\n",
" 'tfidf__sublinear_tf': False,\n",
" 'tfidf__use_idf': True,\n",
" 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None),\n",
" 'vect__analyzer': 'word',\n",
" 'vect__binary': False,\n",
" 'vect__decode_error': 'strict',\n",
" 'vect__dtype': numpy.int64,\n",
" 'vect__encoding': 'utf-8',\n",
" 'vect__input': 'content',\n",
" 'vect__lowercase': True,\n",
" 'vect__max_df': 1.0,\n",
" 'vect__max_features': None,\n",
" 'vect__min_df': 1,\n",
" 'vect__ngram_range': (1, 1),\n",
" 'vect__preprocessor': None,\n",
" 'vect__stop_words': None,\n",
" 'vect__strip_accents': None,\n",
" 'vect__token_pattern': '(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" 'vect__tokenizer': None,\n",
" 'vect__vocabulary': None}"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_clf.get_params()"
]
},
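  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same `<estimator>__<parameter>` names can also be used to change a parameter of the pipeline directly. A minimal illustration (the values here are arbitrary):\n",
    "\n",
    "```\n",
    "# set the smoothing of the Naive Bayes step and the n-gram range of the vectorizer\n",
    "text_clf.set_params(nb__alpha=0.1, vect__ngram_range=(1, 2))\n",
    "```"
   ]
  },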
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"parameters = {'vect__ngram_range': [(1, 1), (1, 2)],\n",
" 'tfidf__use_idf': (True, False),\n",
" 'nb__alpha': (1e-2, 1e-3),\n",
"}\n",
"gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"gs_clf = gs_clf.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result of calling fit on a GridSearchCV object is a classifier that we can use to predict:"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, ..., 0, 1, 0])"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gs_clf.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.98588178990189046"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gs_clf.best_score_ "
]
},
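  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `best_score_` is the mean cross-validated score of the best parameter combination, computed on the training set only. To check how the tuned pipeline generalizes, you can also score it on the held-out test set. A minimal sketch (the exact numbers will differ from `best_score_`):\n",
    "\n",
    "```\n",
    "# accuracy and confusion matrix of the refitted best estimator on the test set\n",
    "y_pred_gs = gs_clf.predict(X_test)\n",
    "print(metrics.accuracy_score(y_test, y_pred_gs))\n",
    "print(metrics.confusion_matrix(y_test, y_pred_gs))\n",
    "```"
   ]
  },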
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"nb__alpha: 0.01\n",
"tfidf__use_idf: True\n",
"vect__ngram_range: (1, 2)\n"
]
}
],
"source": [
"for param_name in sorted(parameters.keys()):\n",
" print(\"%s: %r\" % (param_name, gs_clf.best_params_[param_name]))\n"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Pipeline(memory=None,\n",
" steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
" ngram_range=(1, 2), preprocessor=None, stop_words=None,\n",
" strip...inear_tf=False, use_idf=True)), ('nb', MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True))])"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gs_clf.best_estimator_"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gs_clf.best_estimator_.get_params()[\"nb\"]"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'mean_fit_time': array([ 0.14699594, 0.34212303, 0.10872531, 0.32723244, 0.10028116,\n",
" 0.33288169, 0.09984549, 0.25633995]),\n",
" 'mean_score_time': array([ 0.05115279, 0.1017216 , 0.04051574, 0.10789394, 0.05797712,\n",
" 0.07368461, 0.04061079, 0.06274891]),\n",
" 'mean_test_score': array([ 0.98396746, 0.98588179, 0.98324958, 0.9856425 , 0.98181383,\n",
" 0.98468533, 0.98301029, 0.98468533]),\n",
" 'mean_train_score': array([ 0.99892319, 1. , 0.99760708, 1. , 0.99928217,\n",
" 1. , 0.99856421, 1. ]),\n",
" 'param_nb__alpha': masked_array(data = [0.01 0.01 0.01 0.01 0.001 0.001 0.001 0.001],\n",
" mask = [False False False False False False False False],\n",
" fill_value = ?),\n",
" 'param_tfidf__use_idf': masked_array(data = [True True False False True True False False],\n",
" mask = [False False False False False False False False],\n",
" fill_value = ?),\n",
" 'param_vect__ngram_range': masked_array(data = [(1, 1) (1, 2) (1, 1) (1, 2) (1, 1) (1, 2) (1, 1) (1, 2)],\n",
" mask = [False False False False False False False False],\n",
" fill_value = ?),\n",
" 'params': [{'nb__alpha': 0.01,\n",
" 'tfidf__use_idf': True,\n",
" 'vect__ngram_range': (1, 1)},\n",
" {'nb__alpha': 0.01, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)},\n",
" {'nb__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)},\n",
" {'nb__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)},\n",
" {'nb__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 1)},\n",
" {'nb__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)},\n",
" {'nb__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)},\n",
" {'nb__alpha': 0.001, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}],\n",
" 'rank_test_score': array([5, 1, 6, 2, 8, 3, 7, 3], dtype=int32),\n",
" 'split0_test_score': array([ 0.98278336, 0.98350072, 0.98278336, 0.98278336, 0.97991392,\n",
" 0.98278336, 0.98134864, 0.98134864]),\n",
" 'split0_train_score': array([ 0.9989228 , 1. , 0.99712747, 1. , 0.99928187,\n",
" 1. , 0.99820467, 1. ]),\n",
" 'split1_test_score': array([ 0.98707825, 0.988514 , 0.98636037, 0.98994975, 0.98636037,\n",
" 0.98779612, 0.98636037, 0.988514 ]),\n",
" 'split1_train_score': array([ 0.99892319, 1. , 0.99856425, 1. , 0.99964106,\n",
" 1. , 0.99892319, 1. ]),\n",
" 'split2_test_score': array([ 0.98204023, 0.98563218, 0.98060345, 0.9841954 , 0.97916667,\n",
" 0.98347701, 0.98132184, 0.9841954 ]),\n",
" 'split2_train_score': array([ 0.99892357, 1. , 0.99712953, 1. , 0.99892357,\n",
" 1. , 0.99856476, 1. ]),\n",
" 'std_fit_time': array([ 0.03751322, 0.02801085, 0.01648364, 0.04499868, 0.00108468,\n",
" 0.030485 , 0.00274948, 0.02129873]),\n",
" 'std_score_time': array([ 0.00876487, 0.02648927, 0.0003166 , 0.03298006, 0.02181901,\n",
" 0.0019149 , 0.00068138, 0.01333043]),\n",
" 'std_test_score': array([ 0.00222048, 0.00205462, 0.00237287, 0.00309976, 0.00322933,\n",
" 0.00221782, 0.00236889, 0.00294619]),\n",
" 'std_train_score': array([ 3.15582903e-07, 0.00000000e+00, 6.76819823e-04,\n",
" 0.00000000e+00, 2.92913620e-04, 0.00000000e+00,\n",
" 2.93334624e-04, 0.00000000e+00])}"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gs_clf.cv_results_"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" mean_fit_time mean_score_time mean_test_score mean_train_score \\\n",
"0 0.146996 0.051153 0.983967 0.998923 \n",
"1 0.342123 0.101722 0.985882 1.000000 \n",
"2 0.108725 0.040516 0.983250 0.997607 \n",
"3 0.327232 0.107894 0.985642 1.000000 \n",
"4 0.100281 0.057977 0.981814 0.999282 \n",
"5 0.332882 0.073685 0.984685 1.000000 \n",
"6 0.099845 0.040611 0.983010 0.998564 \n",
"7 0.256340 0.062749 0.984685 1.000000 \n",
"\n",
" param_nb__alpha param_tfidf__use_idf param_vect__ngram_range \\\n",
"0 0.01 True (1, 1) \n",
"1 0.01 True (1, 2) \n",
"2 0.01 False (1, 1) \n",
"3 0.01 False (1, 2) \n",
"4 0.001 True (1, 1) \n",
"5 0.001 True (1, 2) \n",
"6 0.001 False (1, 1) \n",
"7 0.001 False (1, 2) \n",
"\n",
" params rank_test_score \\\n",
"0 {'nb__alpha': 0.01, 'tfidf__use_idf': True, 'v... 5 \n",
"1 {'nb__alpha': 0.01, 'tfidf__use_idf': True, 'v... 1 \n",
"2 {'nb__alpha': 0.01, 'tfidf__use_idf': False, '... 6 \n",
"3 {'nb__alpha': 0.01, 'tfidf__use_idf': False, '... 2 \n",
"4 {'nb__alpha': 0.001, 'tfidf__use_idf': True, '... 8 \n",
"5 {'nb__alpha': 0.001, 'tfidf__use_idf': True, '... 3 \n",
"6 {'nb__alpha': 0.001, 'tfidf__use_idf': False, ... 7 \n",
"7 {'nb__alpha': 0.001, 'tfidf__use_idf': False, ... 3 \n",
"\n",
" split0_test_score split0_train_score split1_test_score \\\n",
"0 0.982783 0.998923 0.987078 \n",
"1 0.983501 1.000000 0.988514 \n",
"2 0.982783 0.997127 0.986360 \n",
"3 0.982783 1.000000 0.989950 \n",
"4 0.979914 0.999282 0.986360 \n",
"5 0.982783 1.000000 0.987796 \n",
"6 0.981349 0.998205 0.986360 \n",
"7 0.981349 1.000000 0.988514 \n",
"\n",
" split1_train_score split2_test_score split2_train_score std_fit_time \\\n",
"0 0.998923 0.982040 0.998924 0.037513 \n",
"1 1.000000 0.985632 1.000000 0.028011 \n",
"2 0.998564 0.980603 0.997130 0.016484 \n",
"3 1.000000 0.984195 1.000000 0.044999 \n",
"4 0.999641 0.979167 0.998924 0.001085 \n",
"5 1.000000 0.983477 1.000000 0.030485 \n",
"6 0.998923 0.981322 0.998565 0.002749 \n",
"7 1.000000 0.984195 1.000000 0.021299 \n",
"\n",
" std_score_time std_test_score std_train_score \n",
"0 0.008765 0.002220 3.155829e-07 \n",
"1 0.026489 0.002055 0.000000e+00 \n",
"2 0.000317 0.002373 6.768198e-04 \n",
"3 0.032980 0.003100 0.000000e+00 \n",
"4 0.021819 0.003229 2.929136e-04 \n",
"5 0.001915 0.002218 0.000000e+00 \n",
"6 0.000681 0.002369 2.933346e-04 \n",
"7 0.013330 0.002946 0.000000e+00 \n"
]
}
],
"source": [
"import pandas as pd\n",
"df = pd.DataFrame(gs_clf.cv_results_)\n",
"print(df)"
]
},
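  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The full `cv_results_` table is wide; a common trick is to keep only the columns you care about and sort by rank. A minimal sketch:\n",
    "\n",
    "```\n",
    "# compact view of the grid search results, best combination first\n",
    "cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']\n",
    "print(df[cols].sort_values('rank_test_score'))\n",
    "```"
   ]
  },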
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Some NLP tools to preprocess text"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package stopwords to\n",
"[nltk_data] /home/toshuumilia/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n",
"[nltk_data] Downloading package punkt to\n",
"[nltk_data] /home/toshuumilia/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n",
"[nltk_data] Downloading package wordnet to\n",
"[nltk_data] /home/toshuumilia/nltk_data...\n",
"[nltk_data] Package wordnet is already up-to-date!\n",
"[nltk_data] Downloading package averaged_perceptron_tagger to\n",
"[nltk_data] /home/toshuumilia/nltk_data...\n",
"[nltk_data] Package averaged_perceptron_tagger is already up-to-\n",
"[nltk_data] date!\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# download ressources\n",
"import nltk\n",
"\n",
"# download all popular ressources\n",
"#nltk.download('popular')\n",
"\n",
"# download required ressources for this notebook\n",
"nltk.download('stopwords')\n",
"nltk.download('punkt')\n",
"nltk.download('wordnet')\n",
"nltk.download('averaged_perceptron_tagger')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stop words"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"{'a',\n",
" 'about',\n",
" 'above',\n",
" 'after',\n",
" 'again',\n",
" 'against',\n",
" 'ain',\n",
" 'all',\n",
" 'am',\n",
" 'an',\n",
" 'and',\n",
" 'any',\n",
" 'are',\n",
" 'aren',\n",
" 'as',\n",
" 'at',\n",
" 'be',\n",
" 'because',\n",
" 'been',\n",
" 'before',\n",
" 'being',\n",
" 'below',\n",
" 'between',\n",
" 'both',\n",
" 'but',\n",
" 'by',\n",
" 'can',\n",
" 'couldn',\n",
" 'd',\n",
" 'did',\n",
" 'didn',\n",
" 'do',\n",
" 'does',\n",
" 'doesn',\n",
" 'doing',\n",
" 'don',\n",
" 'down',\n",
" 'during',\n",
" 'each',\n",
" 'few',\n",
" 'for',\n",
" 'from',\n",
" 'further',\n",
" 'had',\n",
" 'hadn',\n",
" 'has',\n",
" 'hasn',\n",
" 'have',\n",
" 'haven',\n",
" 'having',\n",
" 'he',\n",
" 'her',\n",
" 'here',\n",
" 'hers',\n",
" 'herself',\n",
" 'him',\n",
" 'himself',\n",
" 'his',\n",
" 'how',\n",
" 'i',\n",
" 'if',\n",
" 'in',\n",
" 'into',\n",
" 'is',\n",
" 'isn',\n",
" 'it',\n",
" 'its',\n",
" 'itself',\n",
" 'just',\n",
" 'll',\n",
" 'm',\n",
" 'ma',\n",
" 'me',\n",
" 'mightn',\n",
" 'more',\n",
" 'most',\n",
" 'mustn',\n",
" 'my',\n",
" 'myself',\n",
" 'needn',\n",
" 'no',\n",
" 'nor',\n",
" 'not',\n",
" 'now',\n",
" 'o',\n",
" 'of',\n",
" 'off',\n",
" 'on',\n",
" 'once',\n",
" 'only',\n",
" 'or',\n",
" 'other',\n",
" 'our',\n",
" 'ours',\n",
" 'ourselves',\n",
" 'out',\n",
" 'over',\n",
" 'own',\n",
" 're',\n",
" 's',\n",
" 'same',\n",
" 'shan',\n",
" 'she',\n",
" 'should',\n",
" 'shouldn',\n",
" 'so',\n",
" 'some',\n",
" 'such',\n",
" 't',\n",
" 'than',\n",
" 'that',\n",
" 'the',\n",
" 'their',\n",
" 'theirs',\n",
" 'them',\n",
" 'themselves',\n",
" 'then',\n",
" 'there',\n",
" 'these',\n",
" 'they',\n",
" 'this',\n",
" 'those',\n",
" 'through',\n",
" 'to',\n",
" 'too',\n",
" 'under',\n",
" 'until',\n",
" 'up',\n",
" 've',\n",
" 'very',\n",
" 'was',\n",
" 'wasn',\n",
" 'we',\n",
" 'were',\n",
" 'weren',\n",
" 'what',\n",
" 'when',\n",
" 'where',\n",
" 'which',\n",
" 'while',\n",
" 'who',\n",
" 'whom',\n",
" 'why',\n",
" 'will',\n",
" 'with',\n",
" 'won',\n",
" 'wouldn',\n",
" 'y',\n",
" 'you',\n",
" 'your',\n",
" 'yours',\n",
" 'yourself',\n",
" 'yourselves'}"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.corpus import stopwords\n",
"set(stopwords.words('english'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenizing words and sentences"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Token sentences:\n",
" ['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', \"You shouldn't eat cardboard.\"]\n",
"Token words:\n",
" ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', \"n't\", 'eat', 'cardboard', '.']\n"
]
}
],
"source": [
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"\n",
"example_text = \"Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard.\"\n",
"\n",
"print(\"Token sentences:\\n\",sent_tokenize(example_text))\n",
"print(\"Token words:\\n\",word_tokenize(example_text))"
]
},
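  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Combining the two previous tools, stop words are usually removed after tokenization. A minimal sketch on `example_text` (the comparison is done on the lowercased token, since the NLTK stop word list is lowercase):\n",
    "\n",
    "```\n",
    "stop_words = set(stopwords.words('english'))\n",
    "\n",
    "# keep only the tokens that are not in the English stop word list\n",
    "filtered_words = [w for w in word_tokenize(example_text)\n",
    "                  if w.lower() not in stop_words]\n",
    "print(filtered_words)\n",
    "```"
   ]
  },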
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"# Unsupervised learning of tokenizer: PunktSentenceTokenizer\n",
"from nltk.corpus import state_union\n",
"from nltk.tokenize import PunktSentenceTokenizer\n",
"nltk.download(\"state_union\")\n",
"train_text = state_union.raw(\"2005-GWBush.txt\")\n",
"test_text = state_union.raw(\"2006-GWBush.txt\")\n",
"custom_sent_tokenizer = PunktSentenceTokenizer(train_text)\n",
"tokenized = custom_sent_tokenizer.tokenize(test_text)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stemming"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"words: ['car', 'cars', 'churches', 'going', 'gone', 'goes', 'went', 'geese']\n",
"stems: ['car', 'car', 'church', 'go', 'gone', 'goe', 'went', 'gees']\n"
]
}
],
"source": [
"from nltk.stem import PorterStemmer\n",
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"\n",
"example_words = [\"car\", \"cars\", \"churches\", \"going\", \"gone\", \"goes\", \"went\", \"geese\"]\n",
"print(\"words:\", example_words)\n",
"\n",
"ps = PorterStemmer()\n",
"print(\"stems:\",[ps.stem(w) for w in example_words])"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"new_text = \"It is important to by very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once.\""
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['It', 'is', 'import', 'to', 'by', 'veri', 'pythonli', 'while', 'you', 'are', 'python', 'with', 'python', '.', 'all', 'python', 'have', 'python', 'poorli', 'at', 'least', 'onc', '.']\n"
]
}
],
"source": [
"words = word_tokenize(new_text)\n",
"print([ps.stem(w) for w in words])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### how to add stemming support to CountVectorizer (sklearn)"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"import nltk.stem\n",
"\n",
"stemmer = nltk.stem.SnowballStemmer('english')\n",
"class StemmedCountVectorizer(CountVectorizer):\n",
" def build_analyzer(self):\n",
" analyzer = super(StemmedCountVectorizer, self).build_analyzer()\n",
" return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])\n",
"\n",
"vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer=\"word\", stop_words='french')"
]
},
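  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The stemmed vectorizer is a drop-in replacement for `CountVectorizer`. A minimal sketch of using it on the training messages (the vocabulary will be smaller, both because of `min_df=3` and because inflected forms collapse onto the same stem):\n",
    "\n",
    "```\n",
    "# build a stemmed document-term matrix and compare its vocabulary size\n",
    "X_train_dtm_stem = vectorizer_s.fit_transform(X_train)\n",
    "print(X_train_dtm_stem.shape)\n",
    "print(X_train_dtm.shape)\n",
    "```"
   ]
  },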
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lemmatizing"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Words:\n",
"['car', 'cars', 'churches', 'going', 'gone', 'goes', 'went', 'geese']\n",
"Lemmatized as noun (default):\n",
"['car', 'car', 'church', 'going', 'gone', 'go', 'went', 'goose']\n",
"Lemmatized as verb:\n",
"['car', 'cars', 'church', 'go', 'go', 'go', 'go', 'geese']\n"
]
}
],
"source": [
"from nltk.stem import WordNetLemmatizer\n",
"#nltk.download(\"wordnet\")\n",
"print(\"Words:\")\n",
"print([w for w in example_words])\n",
"\n",
"lemmatizer = WordNetLemmatizer()\n",
"print(\"Lemmatized as noun (default):\")\n",
"print([lemmatizer.lemmatize(w) for w in example_words])\n",
" \n",
"print(\"Lemmatized as verb:\")\n",
"print([lemmatizer.lemmatize(w,'v') for w in example_words])\n",
"\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### POS (Part of speech) tagging"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['They', 'refuse', 'to', 'permit', 'us', 'to', 'obtain', 'the', 'refuse', 'permit']\n",
"[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]\n"
]
}
],
"source": [
"# We need to tokenize first the text\n",
"text = word_tokenize(\"They refuse to permit us to obtain the refuse permit\")\n",
"print(text)\n",
"# Then we can perform POS\n",
"print(nltk.pos_tag(text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"\n",
"[Penn Treebank POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):\n",
"\n",
"CC:\tcoordinating conjunction \n",
"CD:\tcardinal digit \n",
"DT:\tdeterminer \n",
"EX:\texistential there (like: _\"there is\"_ ... think of it like _\"there exists\"_) \n",
"FW:\tforeign word \n",
"IN:\tpreposition/subordinating conjunction \n",
"JJ:\tadjective\t_\"big\"_ \n",
"JJR:\tadjective, comparative _\"bigger\"_ \n",
"JJS:\tadjective, superlative _\"biggest\"_ \n",
"LS:\tlist marker\t1) \n",
"MD:\tmodal\tcould, will \n",
"NN:\tnoun, singular _\"desk\"_ \n",
"NNS:\tnoun plural\t_\"desks\"_ \n",
"NNP:\tproper noun, singular _\"Harrison\"_ \n",
"NNPS:\tproper noun, plural\t_\"Americans\"_ \n",
"PDT:\tpredeterminer _\"all the kids\"_ \n",
"POS:\tpossessive ending _\"parent's\"_ \n",
"PRP:\tpersonal pronoun\t_\"I, he, she\"_ \n",
"PRP\\$:\tpossessive pronoun\t_\"my, his, hers\"_ \n",
"RB:\tadverb\t_\"very, silently,\"_ \n",
"RBR:\tadverb, comparative\t_\"better\"_ \n",
"RBS:\tadverb, superlative\t_\"best\"_ \n",
"RP:\tparticle\t_\"give up\"_ \n",
"TO:\tto\t_\"go 'to' the store.\"_ \n",
"UH:\tinterjection\t_\"errrrrrrrm\"_ \n",
"VB:\tverb, base form\t_\"take\"_ \n",
"VBD:\tverb, past tense\t_\"took\"_ \n",
"VBG:\tverb, gerund/present participle\t_\"taking\"_ \n",
"VBN:\tverb, past participle\t_\"taken\"_ \n",
"VBP:\tverb, sing. present, non-3d\t_\"take\"_ \n",
"VBZ:\tverb, 3rd person sing. present\t_\"takes\"_ \n",
"WDT:\twh-determiner\t_\"which\"_ \n",
"WP:\twh-pronoun\t_\"who, what\"_ \n",
"WP\\$:\tpossessive wh-pronoun\t_\"whose\"_ \n",
"WRB:\twh-abverb\t_\"where, when\"_ "
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package state_union to\n",
"[nltk_data] /home/toshuumilia/nltk_data...\n",
"[nltk_data] Package state_union is already up-to-date!\n"
]
}
],
"source": [
"from nltk.corpus import state_union\n",
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"nltk.download(\"state_union\")\n",
"su_text = state_union.raw(\"2006-GWBush.txt\")\n",
"sentence_tokens = sent_tokenize(su_text)"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\\n \\nJanuary 31, 2006\\n\\nTHE PRESIDENT: Thank you all. Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream. Tonight we are comforted by the hope of a glad reunion with the hus\""
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"su_text[:500]"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.',\n",
" 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.',\n",
" '(Applause.)',\n",
" 'President George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan. 31, 2006.']"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sentence_tokens[1:5]"
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]\n",
"[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]\n",
"[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]\n",
"[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan.', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('.', '.')]\n"
]
}
],
"source": [
"# Look at POS for 5 first sentences of State of the Union address from 2006 from past President George W. Bush.\n",
"try:\n",
" for i in sentence_tokens[1:5]:\n",
" words = nltk.word_tokenize(i)\n",
" tagged = nltk.pos_tag(words)\n",
" print(tagged)\n",
"except Exception as e:\n",
" print(str(e))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Penn TreeBank tags to WordNet pos"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'n': 'noun', 'a': 'adj', 'r': 'adv', 'v': 'verb'}\n"
]
}
],
"source": [
"from nltk.corpus import wordnet\n",
"print(wordnet._FILEMAP)"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# a simple converter \n",
"def get_wordnet_pos(treebank_tag):\n",
" \"\"\"\n",
" return WORDNET POS compliance to WORDNET lemmatization (a,n,r,v) \n",
" \"\"\"\n",
" if treebank_tag.startswith('J'):\n",
" return wordnet.ADJ\n",
" elif treebank_tag.startswith('V'):\n",
" return wordnet.VERB\n",
" elif treebank_tag.startswith('N'):\n",
" return wordnet.NOUN\n",
" elif treebank_tag.startswith('R'):\n",
" return wordnet.ADV\n",
" else:\n",
" # As default pos in lemmatization is Noun\n",
" return wordnet.NOUN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lemmatization with context"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('member', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('member', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corp', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guest', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizen', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lose', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('call', 'VBD'), ('America', 'NNP'), ('to', 'TO'), ('it', 'PRP$'), ('founding', 'NN'), ('ideal', 'NNS'), ('and', 'CC'), ('carry', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]\n"
]
}
],
"source": [
"from nltk.stem import WordNetLemmatizer\n",
"lemmatizer = WordNetLemmatizer()\n",
"try:\n",
" for i in sentence_tokens[1:2]:\n",
" words = nltk.word_tokenize(i)\n",
" tagged = nltk.pos_tag(words) \n",
" lemma_pos_token = [(lemmatizer.lemmatize(w, pos=get_wordnet_pos(pos_tag)),pos_tag) for (w, pos_tag) in tagged] \n",
" print(lemma_pos_token)\n",
"except Exception as e:\n",
" print(str(e))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deepenings"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"- Sequences can be handled by SVM without transformation into fixed-sized vectors features by using the kernel trick and \"string kernels\"...\n",
"\n",
"- __New trend__: vector representation of words (word2vec, ...) \n",
"Not yet in scikit-learn, but you can look at http://zablo.net/blog/post/twitter-sentiment-analysis-python-scikit-word2vec-nltk-xgboost for an example using word2vec with scikit-learn..."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}