"branch of computer science dealing with the simulation of intelligent behavior in computers"
“[...] study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of success at some goal.”
“[...] gives computers the ability to learn without being explicitly programmed.”
“Machine learning is a particular approach to artificial intelligence.”
Data is at the heart of the AI revolution. The generation of data is powered by:
Given a full dataset, it is strategic, if not mandatory, to split your data into train, validation, and test sets, as sketched below.
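For instance, a minimal sketch with scikit-learn (the data and the 60/20/20 proportions here are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({'x': range(100), 'y': range(100)})  # hypothetical data

# Hold out the test set first, then carve a validation set out of the rest
train, test = train_test_split(data, test_size=0.2, random_state=0)
train, valid = train_test_split(train, test_size=0.25, random_state=0)  # 0.25 * 80 = 20
print(len(train), len(valid), len(test))  # 60 20 20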
Continuous valued features
Features with discrete values
Remove features that are too sparse (e.g., only ~1% of samples have values)
Replace missing values using different strategies (a quick sketch follows).
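For example, pandas makes the common strategies one-liners (a toy series, not the dataset used below):

import numpy as np
import pandas as pd

toy = pd.Series([130.0, np.nan, 150.0, 140.0, np.nan])  # hypothetical feature

toy.fillna(toy.mean())     # mean imputation
toy.fillna(toy.median())   # median imputation, robust to outliers
toy.fillna(toy.mode()[0])  # mode imputation, useful for discrete features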
Title: Auto-Mpg Data
Sources: (a) Origin: This dataset was taken from the StatLib library which is
maintained at Carnegie Mellon University. The dataset was
used in the 1983 American Statistical Association Exposition.
(c) Date: July 7, 1993
Past Usage:
Relevant Information:
This dataset is a slightly modified version of the dataset provided in the StatLib library. In line with the use by Ross Quinlan (1993) in predicting the attribute "mpg", 8 of the original instances were removed because they had unknown values for the "mpg" attribute. The original dataset is available in the file "auto-mpg.data-original".
"The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes." (Quinlan, 1993)
Number of Instances: 398
Number of Attributes: 9 including the class attribute
Attribute Information:
    1. mpg: continuous
    2. cylinders: multi-valued discrete
    3. displacement: continuous
    4. horsepower: continuous
    5. weight: continuous
    6. acceleration: continuous
    7. model year: multi-valued discrete
    8. origin: multi-valued discrete
    9. car name: string (unique for each instance)
Missing Attribute Values: horsepower has 6 missing values
https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names
import pandas as pd
import requests
import re
Let's download some data to work with.
# Download data from the UCI repository (as decoded text)
raw_data = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data').text
column_names = [
'mpg',
'cylinders',
'displacement',
'horsepower',
'weight',
'acceleration',
'model_year',
'origin',
'car_name',
]
# Parse data: 8 whitespace-separated fields followed by the quoted car name
auto_mpg_dataset = pd.DataFrame(
re.findall(
'(.*) (.*) (.*) (.*) (.*) (.*) (.*) (.*) "(.*)"\n',
'\n'.join([' '.join(s.split()) for s in raw_data.split('\n')])
),
columns=column_names
)
# Blank out the '?' placeholders so pd.to_numeric treats them as missing (NaN)
auto_mpg_dataset.horsepower = auto_mpg_dataset.horsepower.str.replace('?', '', regex=False)
# Convert string columns to integer/real values where possible
for col in column_names:
    try:
        auto_mpg_dataset[col] = pd.to_numeric(auto_mpg_dataset[col])
    except ValueError:
        # Leave non-numeric columns (e.g., car_name) as strings
        pass
auto_mpg_dataset.dtypes
mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight          float64
acceleration    float64
model_year        int64
origin            int64
car_name         object
dtype: object
print("Dataset shape: {0} samples and {1} features.".format(*auto_mpg_dataset.shape))
Dataset shape: 398 samples and 9 features.
Let's check for the existence of missing data.
def show_features_missing(df):
    print('Features with missing data:')
    for col in df.columns[df.isnull().sum() > 0]:
        print('\t- {}'.format(col))
show_features_missing(auto_mpg_dataset)
Features with missing data:
	- horsepower
First, let's look at how the data is distributed so we can make a reasonable decision about which imputation strategy to use.
%matplotlib inline
ax = auto_mpg_dataset.horsepower.hist()
The data is skewed! It's better to use the median, which reduces the impact of outliers.
auto_mpg_dataset['horsepower'] = auto_mpg_dataset.horsepower.fillna(
auto_mpg_dataset.horsepower.median()
)
auto_mpg_dataset.head()
|   | mpg  | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name                  |
|---|------|-----------|--------------|------------|--------|--------------|------------|--------|---------------------------|
| 0 | 18.0 | 8         | 307.0        | 130.0      | 3504.0 | 12.0         | 70         | 1      | chevrolet chevelle malibu |
| 1 | 15.0 | 8         | 350.0        | 165.0      | 3693.0 | 11.5         | 70         | 1      | buick skylark 320         |
| 2 | 18.0 | 8         | 318.0        | 150.0      | 3436.0 | 11.0         | 70         | 1      | plymouth satellite        |
| 3 | 16.0 | 8         | 304.0        | 150.0      | 3433.0 | 12.0         | 70         | 1      | amc rebel sst             |
| 4 | 17.0 | 8         | 302.0        | 140.0      | 3449.0 | 10.5         | 70         | 1      | ford torino               |
# Rename target variable
auto_mpg_dataset['target'] = auto_mpg_dataset.mpg
auto_mpg_dataset.drop('mpg', axis=1, inplace=True)
from sklearn.feature_extraction.text import CountVectorizer
# Use this as a quick way of building the vocabulary. :)
cvec = CountVectorizer()
cvec = cvec.fit(auto_mpg_dataset.car_name.values)
# Get proper column names from the tokenized words
word_columns = [i for i, j in sorted(cvec.vocabulary_.items(), key=lambda x: x[1])]
# Make a dataframe of the dense representation of the tokens corresponding to the car names
encoded_word_df = pd.DataFrame(
cvec.transform(auto_mpg_dataset.car_name.values).toarray(),
columns=word_columns
)
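To see what this encoding does, here is a toy illustration on a hypothetical two-name corpus: each distinct word becomes a column, and each row counts that word's occurrences.

demo = CountVectorizer().fit(['chevrolet chevelle malibu', 'ford torino'])
print(sorted(demo.vocabulary_.items(), key=lambda x: x[1]))
# [('chevelle', 0), ('chevrolet', 1), ('ford', 2), ('malibu', 3), ('torino', 4)]
print(demo.transform(['ford torino']).toarray())  # [[0 0 1 0 1]]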
# Setup the final dataset
final_df = pd.concat([auto_mpg_dataset, encoded_word_df], axis=1)
print('Shape of our final dataset: {}'.format(final_df.shape))
Shape of our final dataset: (398, 309)
test_df = final_df.sample(frac=0.3, replace=False, random_state=1029)
train_df = final_df[~final_df.index.isin(test_df.index)]
print('Training dataset samples: {}'.format(train_df.shape[0]))
print('Testing dataset samples: {}'.format(test_df.shape[0]))
Training dataset samples: 279
Testing dataset samples: 119
# Sanity check. Make sure samples belong to distinct data set.
test_df.index.intersection(train_df.index)
Int64Index([], dtype='int64')
train_df.head()
|   | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name                  | target | 10 | ... | volkswagen | volvo | vw | wagon | woody | x1 | xe | yorker | zephyr | zx |
|---|-----------|--------------|------------|--------|--------------|------------|--------|---------------------------|--------|----|-----|------------|-------|----|-------|-------|----|----|--------|--------|----|
| 0 | 8         | 307.0        | 130.0      | 3504.0 | 12.0         | 70         | 1      | chevrolet chevelle malibu | 18.0   | 0  | ... | 0          | 0     | 0  | 0     | 0     | 0  | 0  | 0      | 0      | 0  |
| 1 | 8         | 350.0        | 165.0      | 3693.0 | 11.5         | 70         | 1      | buick skylark 320         | 15.0   | 0  | ... | 0          | 0     | 0  | 0     | 0     | 0  | 0  | 0      | 0      | 0  |
| 4 | 8         | 302.0        | 140.0      | 3449.0 | 10.5         | 70         | 1      | ford torino               | 17.0   | 0  | ... | 0          | 0     | 0  | 0     | 0     | 0  | 0  | 0      | 0      | 0  |
| 5 | 8         | 429.0        | 198.0      | 4341.0 | 10.0         | 70         | 1      | ford galaxie 500          | 15.0   | 0  | ... | 0          | 0     | 0  | 0     | 0     | 0  | 0  | 0      | 0      | 0  |
| 6 | 8         | 454.0        | 220.0      | 4354.0 | 9.0          | 70         | 1      | chevrolet impala          | 14.0   | 0  | ... | 0          | 0     | 0  | 0     | 0     | 0  | 0  | 0      | 0      | 0  |
5 rows × 309 columns
Machine learning models can be classified into two main categories (a minimal code sketch of the difference follows the examples below):
Supervised Machine learning
Example:
Dataset with features:
- length of study before exams
- number of submitted projects
Target:
- Grade for the subject
Unsupervised Machine learning
Example:
Dataset:
- Wikipedia posts
- Music data
- Unannotated images
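In scikit-learn terms, the difference boils down to whether fit receives labels. A toy sketch with hypothetical numbers:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[10, 2], [8, 1], [20, 5], [25, 6]])  # features, e.g., hours of study / projects
y = np.array([75, 70, 90, 95])                     # targets, e.g., grades

LinearRegression().fit(X, y)   # supervised: sees features AND targets
KMeans(n_clusters=2).fit(X)    # unsupervised: sees features only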
If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake.
Yann LeCun
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
def make_input(df, drop_features=(), is_train=True, use_scaler=True):
    y = df.target
    X = df.drop(sorted(drop_features) + ['target'], axis=1)
    if use_scaler:
        if is_train:
            X = sc.fit_transform(X)
        else:
            # Make sure train was transformed first
            X = sc.transform(X)
    return X, y
What are LinearRegression, RandomForest, and StandardScaler?
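Briefly: LinearRegression fits a weighted sum of the features, RandomForestRegressor averages many decision trees trained on bootstrap samples of the data, and StandardScaler standardizes each feature to zero mean and unit variance. A minimal sketch (toy data) of what StandardScaler computes:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
sc_demo = StandardScaler().fit(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # z = (x - mean) / std, per column
assert np.allclose(sc_demo.transform(X), manual)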
lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, random_state=1029)
X_train, y_train = make_input(
train_df, drop_features=['car_name'],
is_train=True, use_scaler=True
)
X_test, y_test = make_input(
test_df, drop_features=['car_name'],
is_train=False, use_scaler=True
)
lr.fit(X_train, y_train)
rf.fit(X_train, y_train)
print('Random forest train score: {}'.format(rf.score(X_train, y_train)))
print('Linear regression train score: {}\n'.format(lr.score(X_train, y_train)))
print('Random forest test score: {}'.format(rf.score(X_test, y_test)))
print('Linear regression test score: {}'.format(lr.score(X_test, y_test)))
Random forest train score: 0.979879110645
Linear regression train score: 0.958145174559

Random forest test score: 0.913236187664
Linear regression test score: -1.20366339692e+27
# Drop columns corresponding to the transformed word features
train_no_words_df = train_df.drop(word_columns, axis=1)
test_no_words_df = test_df.drop(word_columns, axis=1)
X_train_no_words_df, y_train_no_words_df = make_input(
train_no_words_df, drop_features=['car_name'], is_train=True,
use_scaler=True
)
X_test_no_words_df, y_test_no_words_df = make_input(
test_no_words_df, drop_features=['car_name'], is_train=False,
use_scaler=True
)
lr.fit(X_train_no_words_df, y_train_no_words_df)
rf.fit(X_train_no_words_df, y_train_no_words_df)
print('Random forest train score: {}'.format(
rf.score(X_train_no_words_df, y_train_no_words_df)
))
print('Linear regression train score: {}\n'.format(
lr.score(X_train_no_words_df, y_train_no_words_df)
))
print('Random forest test score: {}'.format(
    rf.score(X_test_no_words_df, y_test_no_words_df)
))
print('Linear regression test score: {}'.format(
    lr.score(X_test_no_words_df, y_test_no_words_df)
))
Random forest train score: 0.979350053643
Linear regression train score: 0.822266524699

Random forest test score: 0.910432542959
Linear regression test score: 0.813695262061
Compared to earlier results, the score of the random forest model dropped a bit. In contrast, the linear regression model dramatically improved.
Possible reasons:
Random forests can sift through the relatively high-dimensional data and pick out the relevant features; removing the word features may have discarded some of that information.
Linear regression may perceive the high dimensionality introduced by the word features as additional noise, which hurt its performance (the sketch below probes this).
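One quick way to probe the noise hypothesis (an aside, assuming the X_train/X_test arrays from the word-feature run are still in memory) is to refit the linear model with L2 regularization, which shrinks the weights assigned to noisy columns:

from sklearn.linear_model import Ridge

# Ridge = linear regression + L2 penalty on the coefficients
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print('Ridge test score: {}'.format(ridge.score(X_test, y_test)))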
from sklearn.decomposition import PCA
words_pca = PCA(n_components=2)
latent_car_name_features = words_pca.fit_transform(final_df[word_columns])
What is PCA (Principal Components Analysis)?
Let's plot the first two principal components to check for regularity in the features.
pd.DataFrame(latent_car_name_features, columns=['pc1', 'pc2']).plot.scatter('pc1', 'pc2')
<matplotlib.axes._subplots.AxesSubplot at 0x7f3a856c8590>
This shows that there is inherent regularity in the word features, i.e., they carry additional information.
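As an additional sanity check (not shown in the original run), PCA reports how much of the variance each of the two components captures:

print('Explained variance ratio: {}'.format(words_pca.explained_variance_ratio_))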
ad_feature_names = requests.get(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/ad.names'
).text
# Feature names are the tokens before ':' on each line of the .names file
ad_feature_names = [i.split(':')[0] for i in ad_feature_names.split('\n')[1:] if ':' in i]
ad_data = requests.get(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/ad.data'
).text
ad_data = pd.DataFrame(
[[j.strip() for j in i.strip().split(',')] for i in ad_data.split('\n') if i.strip()],
columns=ad_feature_names+['target']
)
# Blank out '?' placeholders so pd.to_numeric treats those cells as missing (NaN)
ad_data = ad_data.replace('?', '')
# Convert string columns to numeric where possible
for col in ad_feature_names:
    try:
        ad_data[col] = pd.to_numeric(ad_data[col])
    except ValueError:
        # Leave genuinely non-numeric columns as strings
        pass
show_features_missing(ad_data)
Features with missing data:
	- height
	- width
	- aratio
	- local
Let's impute the missing data using marginal models, i.e., models trained to predict each incomplete feature from the fully observed ones.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, f1_score
numerical_marginal_model = RandomForestRegressor(
n_estimators=10,
random_state=1029
)
categorical_marginal_model = RandomForestClassifier(
n_estimators=10,
random_state=1029
)
marginal_data = ad_data.dropna()
with_null_features = ['height', 'width', 'aratio', 'local']
for c in with_null_features:
    # 'local' is binary, so it gets the classifier; the other three are numeric
    if c != 'local':
        numerical_marginal_model.fit(
            marginal_data.drop(with_null_features + ['target'], axis=1),
            marginal_data[c]
        )
        ad_data.loc[ad_data[c].isnull(), c] = numerical_marginal_model.predict(
            ad_data[ad_data[c].isnull()].drop(
                with_null_features + ['target'],
                axis=1
            )
        )
    else:
        categorical_marginal_model.fit(
            marginal_data.drop(with_null_features + ['target'], axis=1),
            marginal_data[c]
        )
        ad_data.loc[ad_data[c].isnull(), c] = categorical_marginal_model.predict(
            ad_data[ad_data[c].isnull()].drop(
                with_null_features + ['target'],
                axis=1
            )
        )
A better strategy would be to perform an out-of-fold prediction for the missing data and average the results across folds to reduce variance. The method above can still work well, though.
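A hedged sketch of that out-of-fold idea, using a hypothetical helper that is not part of the notebook: train one imputation model per fold on the complete rows and average the fold-models' predictions for the missing rows.

import numpy as np
from sklearn.model_selection import KFold

def oof_impute(complete_df, missing_df, feature, drop_cols, model, n_splits=5):
    # Average the predictions of n_splits fold-models for the missing rows
    X_full = complete_df.drop(drop_cols, axis=1)
    y_full = complete_df[feature]
    X_miss = missing_df.drop(drop_cols, axis=1)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=1029)
    preds = [
        model.fit(X_full.iloc[idx], y_full.iloc[idx]).predict(X_miss)
        for idx, _ in folds.split(X_full)
    ]
    return np.mean(preds, axis=0)

For example, height could then be filled with oof_impute(marginal_data, ad_data[ad_data.height.isnull()], 'height', with_null_features + ['target'], numerical_marginal_model).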
# Sanity check. Make sure there are no more features with missing data.
show_features_missing(ad_data)
Features with missing data:
# Convert the target variable into a binary encoded form
ad_data.target = 1 * (ad_data.target == 'ad.')
# Check distribution of target classes
ax = ad_data.target.hist()
print('Proportion of positive signal (ad.): {:.2f}%'.format(100 * ad_data.target.mean()))
Proportion of positive signal (ad.): 14.00%
# Perform simple random sampling in generating the test and train sets.
# Take note of this step. We'll discuss later a better way of doing this given the context of the data.
test_df = ad_data.sample(frac=0.3, replace=False, random_state=1029)
train_df = ad_data[~ad_data.index.isin(test_df.index)]
train_df = train_df.drop('local', axis=1)
test_df = test_df.drop('local', axis=1)
print('Training dataset samples: {}'.format(train_df.shape[0]))
print('Testing dataset samples: {}'.format(test_df.shape[0]))
Training dataset samples: 2295
Testing dataset samples: 984
X_train, y_train = make_input(train_df, is_train=True, use_scaler=False)
X_test, y_test = make_input(test_df, is_train=False, use_scaler=False)
rf = RandomForestClassifier(n_estimators=100, random_state=1029)
lr = LogisticRegression()
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)
print('Random forest train score: {}'.format(rf.score(X_train, y_train)))
print('Logistic regression train score: {}\n'.format(lr.score(X_train, y_train)))
print('Random forest test score: {}'.format(rf.score(X_test, y_test)))
print('Logistic regression test score: {}\n'.format(lr.score(X_test, y_test)))
print(classification_report(y_test, rf.predict(X_test)))
print('ROC-AUC score: {}'.format(
roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
))
Random forest train score: 0.99825708061
Logistic regression train score: 0.980392156863

Random forest test score: 0.984756097561
Logistic regression test score: 0.979674796748

             precision    recall  f1-score   support

          0       0.99      1.00      0.99       848
          1       0.98      0.91      0.94       136

avg / total       0.98      0.98      0.98       984

ROC-AUC score: 0.975665059656
Cross-validation is a technique in which the training data is divided into subsets (folds); each fold takes a turn as the validation set while the model is trained on the remaining folds.
This is especially useful when the dataset is small.
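To make the mechanics concrete, a toy sketch (separate from the ad data): with 3 folds, every sample lands in the validation fold exactly once.

import numpy as np
from sklearn.model_selection import KFold

for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=3).split(np.arange(6))):
    print(fold, train_idx, val_idx)
# 0 [2 3 4 5] [0 1]
# 1 [0 1 4 5] [2 3]
# 2 [0 1 2 3] [4 5]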
from sklearn.model_selection import cross_val_predict
out_of_fold_predictions = cross_val_predict(
rf, X_train, y_train, cv=5,
method='predict_proba'
)
print(classification_report(y_train, out_of_fold_predictions[:, 1] > 0.5))
print('ROC-AUC score: {}'.format(roc_auc_score(y_train, out_of_fold_predictions[:, 1])))
             precision    recall  f1-score   support

          0       0.97      0.99      0.98      1972
          1       0.93      0.80      0.86       323

avg / total       0.96      0.96      0.96      2295

ROC-AUC score: 0.965255056864
rf = rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
print('ROC-AUC score: {}'.format(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])))
             precision    recall  f1-score   support

          0       0.99      1.00      0.99       848
          1       0.98      0.91      0.94       136

avg / total       0.98      0.98      0.98       984

ROC-AUC score: 0.975665059656
We need to apply a stratified sampling technique when generating our train and test datasets, so that the roughly 14% positive class is represented proportionally in both splits.
from sklearn.model_selection import StratifiedKFold, train_test_split
train_df, test_df, y_train, y_test = train_test_split(
ad_data, ad_data.target, test_size=0.3,
random_state=1029, stratify=ad_data.target
)
print('Training dataset samples: {}'.format(train_df.shape[0]))
print('Testing dataset samples: {}'.format(test_df.shape[0]))
Training dataset samples: 2295
Testing dataset samples: 984
Now execute the same model-fitting routine as before.
X_train, y_train = make_input(train_df, is_train=True, use_scaler=False)
X_test, y_test = make_input(test_df, is_train=False, use_scaler=False)
rf = RandomForestClassifier(n_estimators=100, random_state=1029)
lr = LogisticRegression()
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)
print('Random forest train score: {}'.format(rf.score(X_train, y_train)))
print('Logistic regression train score: {}\n'.format(lr.score(X_train, y_train)))
print('Random forest test score: {}'.format(rf.score(X_test, y_test)))
print('Logistic regression test score: {}\n'.format(lr.score(X_test, y_test)))
print(classification_report(y_test, rf.predict(X_test)))
print('ROC-AUC score: {}'.format(
roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
))
Random forest train score: 0.999564270153
Logistic regression train score: 0.982135076253

Random forest test score: 0.977642276423
Logistic regression test score: 0.968495934959

             precision    recall  f1-score   support

          0       0.98      0.99      0.99       846
          1       0.96      0.88      0.92       138

avg / total       0.98      0.98      0.98       984

ROC-AUC score: 0.984093945935
Previous results:
Random forest train score: 0.99825708061
Logistic regression train score: 0.980392156863
Random forest test score: 0.984756097561
Logistic regression test score: 0.979674796748
precision recall f1-score support
0 0.99 1.00 0.99 848
1 0.98 0.91 0.94 136
avg / total 0.98 0.98 0.98 984
ROC-AUC score: 0.975665059656
out_of_fold_predictions = cross_val_predict(
rf, X_train, y_train, cv=5,
method='predict_proba'
)
print(classification_report(y_train, out_of_fold_predictions[:, 1] > 0.5))
print('ROC-AUC score: {}'.format(roc_auc_score(y_train, out_of_fold_predictions[:, 1])))
             precision    recall  f1-score   support

          0       0.98      0.99      0.99      1974
          1       0.96      0.88      0.92       321

avg / total       0.98      0.98      0.98      2295

ROC-AUC score: 0.978942766873
Previous result:
precision recall f1-score support
0 0.97 0.99 0.98 1972
1 0.93 0.80 0.86 323
avg / total 0.96 0.96 0.96 2295
ROC-AUC score: 0.965255056864
rf = rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
print('ROC-AUC score: {}'.format(roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])))
             precision    recall  f1-score   support

          0       0.98      0.99      0.99       846
          1       0.96      0.88      0.92       138

avg / total       0.98      0.98      0.98       984

ROC-AUC score: 0.984093945935
Previous result:
precision recall f1-score support
0 0.99 1.00 0.99 848
1 0.98 0.91 0.94 136
avg / total 0.98 0.98 0.98 984
ROC-AUC score: 0.975665059656
By splitting the data in a more informed way, we achieved better out-of-fold validation scores and a better test ROC-AUC!
We've learned how to handle datasets, apply machine learning algorithms, and use strategies like cross-validation to help build robust predictive models.
However, recent breakthroughs in AI were achieved by using Deep Learning. What is it?
"is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, partially supervised or unsupervised."
Deep learning is often described by analogy to the activity of neurons in the brain. Beyond that analogy, it is worthwhile to think about deep learning within the framework of learning itself.
One of the most promising theories attempting to describe deep learning is based on the concept of the "information bottleneck".
A very rough visualization of Representation learning.
Visualization of Word embedding.
Visualization of Feature learning.
Here's the architecture of the word2vec model:
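The figure is not reproduced here, but the CBOW flavor used below (sg=0) is small enough to sketch in numpy; V, D, and the word ids are hypothetical:

import numpy as np

V, D = 1000, 30                       # vocabulary size, embedding dimension
W_in = np.random.randn(V, D) * 0.01   # input embeddings (the vectors we keep)
W_out = np.random.randn(D, V) * 0.01  # output projection

context_ids = np.array([3, 17, 42, 8])         # ids of the surrounding words
h = W_in[context_ids].mean(axis=0)             # average the context embeddings
scores = h @ W_out                             # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
# Training adjusts W_in and W_out so probs peaks at the true center word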
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
Scrape summary of Game of Thrones from SparkNotes
%%time
all_sentences = []
for i in range(1, 16):
    print('Fetching page {}...'.format(i))
    url = 'http://www.sparknotes.com/lit/a-game-of-thrones/section{}.rhtml'.format(i)
    got_summary = requests.get(url).content
    b = BeautifulSoup(got_summary, 'html5lib')
    text_data = b.find(attrs={'class': 'studyGuideText'}).text
    # Lowercase, keep alphabetic tokens longer than one character that are not
    # stop words; skip sentences of five words or fewer
    sentences = [
        [j for j in re.findall('[a-z]+', i.lower()) if j not in ENGLISH_STOP_WORDS and len(j) > 1]
        for i in text_data.split('.') if i.strip() and len(i.split()) > 5
    ]
    all_sentences.extend(sentences)
Fetching page 1...
Fetching page 2...
Fetching page 3...
Fetching page 4...
Fetching page 5...
Fetching page 6...
Fetching page 7...
Fetching page 8...
Fetching page 9...
Fetching page 10...
Fetching page 11...
Fetching page 12...
Fetching page 13...
Fetching page 14...
Fetching page 15...
CPU times: user 12 s, sys: 46.7 ms, total: 12.1 s
Wall time: 1min 41s
Train word2vec model.
import numpy as np
from gensim.models.word2vec import Word2Vec
# size: embedding dimension; window: context size; sg=0 selects CBOW;
# negative: number of negative samples; iter: training epochs
# (in gensim >= 4.0, size= and iter= are named vector_size= and epochs=)
wv = Word2Vec(sentences=all_sentences, size=30, min_count=3, window=5, negative=10, sg=0, iter=100)
vocab = list(wv.wv.vocab.keys())  # in gensim >= 4.0: wv.wv.index_to_key
vecs = wv.wv[vocab]
# Normalize each word vector to unit length (row-wise)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
pd.DataFrame(vecs).to_csv('got_w2v.vec.tsv', header=False, index=False, sep='\t')
from sklearn.cluster import KMeans, DBSCAN
km = KMeans(n_clusters=5)
with open('got_w2v.metadata.kmeans.tsv', 'w') as f:
    f.write('words\tcluster\n')
    for i, j in zip(vocab, km.fit_predict(vecs)):
        f.write('{}\t{}\n'.format(i, j))
# Scan eps values, keeping the one that leaves the fewest points unclustered
# (label -1) while keeping cluster 0 below 30 members
max_eps = 0
min_none = np.inf
for i in np.arange(0.001, 0.5, 0.001):
    db = DBSCAN(eps=i, min_samples=5, metric='cosine')
    labels = db.fit_predict(vecs)  # compute once per eps instead of three times
    if (min_none > (labels == -1).sum()) and ((labels == 0).sum() < 30):
        max_eps = i
        min_none = (labels == -1).sum()
db = DBSCAN(eps=max_eps, min_samples=5, metric='cosine')
with open('got_w2v.metadata.dbscan.tsv', 'w') as f:
    f.write('words\tcluster\n')
    for i, j in zip(vocab, db.fit_predict(vecs)):
        f.write('{}\t{}\n'.format(i, j))
The word2vec model learned that "khal" is conceptually related to "khalasar", "drogo", and "horse", even though we never explicitly told it that these words are related!
wv.most_similar(positive=['daenerys', 'drogo'], negative=['queen'], topn=1)
[(u'khal', 0.7924862504005432)]
The model learned the analogy: "queen" is to "daenerys" as "khal" is to "drogo"!