I ran across pandas while I was exploring TensorFlow and thought it would work pretty well. It already has a lot of the data reading and manipulation functionality that I would otherwise have needed to write myself.

import csv
import tensorflow as tf
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

To start with, pandas makes reading the Iris CSV file trivial. I also like the nice tabular output in IPython.

ipd = pd.read_csv("iris.csv")
ipd.head()
   Sepal Length  Sepal Width  Petal Length  Petal Width  Species
0           5.1          3.5           1.4          0.2   setosa
1           4.9          3.0           1.4          0.2   setosa
2           4.7          3.2           1.3          0.2   setosa
3           4.6          3.1           1.5          0.2   setosa
4           5.0          3.6           1.4          0.2   setosa

Plotting the measurements against one another shows clusters for the different species, which will hopefully allow us to classify them.

# Sepal measurements, one series per species (markers only, no connecting lines)
plt.figure()
for key, val in ipd.groupby('Species'):
    plt.plot(val['Sepal Length'], val['Sepal Width'], label=key, linestyle='', marker='.')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend()

# Petal measurements on a separate figure
plt.figure()
for key, val in ipd.groupby('Species'):
    plt.plot(val['Petal Length'], val['Petal Width'], label=key, linestyle='', marker='.')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend()

I want a one-hot vector that represents the species that I can use to train the classification. I gather the species labels, and then use their index to pull out a unique row from the identity matrix. This forms the one-hot vectors.

species = list(ipd['Species'].unique())
ipd['One-hot'] = ipd['Species'].map(lambda x: np.eye(len(species))[species.index(x)] )
     Sepal Length  Sepal Width  Petal Length  Petal Width     Species          One-hot
136           6.3          3.4           5.6          2.4   virginica  [0.0, 0.0, 1.0]
58            6.6          2.9           4.6          1.3  versicolor  [0.0, 1.0, 0.0]
20            5.4          3.4           1.7          0.2      setosa  [1.0, 0.0, 0.0]
69            5.6          2.5           3.9          1.1  versicolor  [0.0, 1.0, 0.0]
33            5.5          4.2           1.4          0.2      setosa  [1.0, 0.0, 0.0]
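
The identity-matrix trick can also be applied to a whole column in one indexing step instead of once per row. Here is a minimal sketch of that idea using only NumPy and a made-up list of labels. The names labels and codes are mine and are not part of the notebook.

```python
import numpy as np

# Hypothetical labels standing in for the 'Species' column.
labels = ["setosa", "versicolor", "virginica", "setosa"]

# Unique species in order of first appearance, similar to Series.unique().
species = list(dict.fromkeys(labels))

# One integer code per row, then a single indexing step into the identity matrix.
codes = np.array([species.index(name) for name in labels])
one_hot = np.eye(len(species))[codes]

print(one_hot)
```

Each row of one_hot has exactly one 1.0, in the position of that row's species in the species list.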

I need to split the data into training and test sets. Since the data is sorted by species, I need to shuffle it before splitting.

shuffled = ipd.sample(frac=1)
trainingSet = shuffled[0:len(shuffled)-50]
testSet = shuffled[len(shuffled)-50:]
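
If you want the same split on every run, sample() accepts a random_state argument. The sketch below uses a toy frame in place of the Iris data (the frame, train_part and test_part names are placeholders of mine) and checks that the two parts do not share any rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the 150-row Iris data, sorted by a label column.
frame = pd.DataFrame({"value": np.arange(150),
                      "label": np.repeat(["a", "b", "c"], 50)})

# random_state makes the shuffle repeatable between runs.
shuffled_frame = frame.sample(frac=1, random_state=0)

# Last 50 shuffled rows for testing, everything before that for training.
train_part = shuffled_frame.iloc[:-50]
test_part = shuffled_frame.iloc[-50:]

print(len(train_part), len(test_part))
print(set(train_part.index).isdisjoint(test_part.index))
```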

I used essentially the same training code from the basic MNIST tutorial.

inp = tf.placeholder(tf.float32, [None, 4])
weights = tf.Variable(tf.zeros([4, 3]))
bias = tf.Variable(tf.zeros([3]))

y = tf.nn.softmax(tf.matmul(inp, weights) + bias)

y_ = tf.placeholder(tf.float32, [None, 3])
cross_entropy = -tf.reduce_sum(y_*tf.log(y))

train_step = tf.train.AdamOptimizer(0.01).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

init = tf.initialize_all_variables()

sess = tf.Session()
sess.run(init)  # variables must be initialized before the first training step
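
To get a feel for what this graph computes, here is a NumPy sketch of the same softmax and cross-entropy arithmetic for two made-up rows. It is not how TensorFlow implements these ops, and the softmax helper and the x, targets, w and b names are my own. With zero-initialized weights, every class starts out with probability 1/3.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two made-up flowers (4 measurements each) with one-hot targets.
x = np.array([[5.1, 3.5, 1.4, 0.2],
              [6.5, 3.0, 5.5, 1.8]])
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])

# Zero-initialized parameters, matching tf.zeros in the model.
w = np.zeros((4, 3))
b = np.zeros(3)

probs = softmax(x.dot(w) + b)
loss = -np.sum(targets * np.log(probs))

print(probs)  # every entry is 1/3 before any training
print(loss)   # 2 * log(3) for these two rows
```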

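The first time I ran this, I got the error "ValueError: setting an array element with a sequence." The One-hot column stores one ndarray per row, so the pandas as_matrix() function (like .values) returns it as an ndarray of ndarrays rather than a regular two-dimensional array. TensorFlow cannot convert that into a tensor, but it can convert a list of ndarrays, hence the seemingly useless comprehension, which resolves the error.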

keys = ['Sepal Length', 'Sepal Width','Petal Length', 'Petal Width']
for i in range(1000):
    train = trainingSet.sample(50)
    sess.run(train_step, feed_dict={inp: [x for x in train[keys].values],
                                    y_: [x for x in train['One-hot'].values]})

print(sess.run(accuracy, feed_dict={inp: [x for x in testSet[keys].values],
                                    y_: [x for x in testSet['One-hot'].values]}))
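
The conversion problem can be reproduced without pandas or TensorFlow. The sketch below builds a one-dimensional object array by hand as a stand-in for the One-hot column (an assumption about what the column looks like underneath) and shows that a list of its rows converts to a regular two-dimensional array.

```python
import numpy as np

# A 1-D object array whose elements are ndarrays, roughly what a pandas
# column of one-hot vectors looks like underneath.
column = np.empty(3, dtype=object)
for i in range(3):
    column[i] = np.eye(3)[i]

print(column.dtype, column.shape)

# Converting it directly to floats is expected to fail.
try:
    column.astype(np.float32)
except (ValueError, TypeError) as err:
    print("conversion failed:", err)

# A list of the row arrays converts cleanly to a regular 2-D array.
dense = np.array([row for row in column], dtype=np.float32)
print(dense.shape)
```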


To actually use the trained model for classification, I need to evaluate y for a given input vector and find the index of the largest element in the resulting vector. tf.argmax does exactly that.

def classify(inpv):
    dim = y.get_shape().as_list()[1]
    # argmax returns a single-element vector, so get the scalar from it
    largest = sess.run(tf.argmax(y, 1), feed_dict={inp: inpv})[0]
    return np.eye(dim)[largest]

sample = shuffled.sample(1)
print("Classified as %s" % classify(sample[keys]))
Classified as [ 0.  0.  1.]
     Sepal Length  Sepal Width  Petal Length  Petal Width    Species          One-hot
116           6.5            3           5.5          1.8  virginica  [0.0, 0.0, 1.0]
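
The index found by argmax can also be mapped straight back to a species name instead of a one-hot vector. This is a small sketch with a made-up softmax output. It assumes the species list is in the same order that was used to build the one-hot vectors.

```python
import numpy as np

# Same ordering as the species list used to build the one-hot vectors.
species = ["setosa", "versicolor", "virginica"]

# Made-up softmax output for a single flower.
probs = np.array([[0.05, 0.15, 0.80]])

# argmax over axis 1 gives one index per row; take the only row.
largest = int(np.argmax(probs, axis=1)[0])
print("Classified as %s" % species[largest])
```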
