tags: ipython-notebook tensorflow machine-learning categories: ipython

# TensorFlow and Iris

I ran across pandas and thought it would work pretty well while I was exploring TensorFlow. It already has a lot of the data reading/manipulation functionality that I would otherwise have needed to write myself.

```
import csv
import tensorflow as tf
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

pandas makes reading the Iris CSV file trivial. I also like the nice tabular output in IPython.

```
ipd = pd.read_csv("iris.csv")
ipd.head()
```

| | Sepal Length | Sepal Width | Petal Length | Petal Width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |

Plotting the measurements against one another shows clusters for the different species, which will hopefully let us classify them.

```
plt.subplot(2, 1, 1)
for key, val in ipd.groupby('Species'):
    plt.plot(val['Sepal Length'], val['Sepal Width'], label=key, linestyle='', marker='.')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.subplot(2, 1, 2)
for key, val in ipd.groupby('Species'):
    plt.plot(val['Petal Length'], val['Petal Width'], label=key, linestyle='', marker='.')
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.legend(loc='best')
plt.show()
```

I want a one-hot vector representing the species that I can use to train the classifier. I gather the unique species labels, then use each label's index to pull out a row of the identity matrix. Those rows are the one-hot vectors.

```
species = list(ipd['Species'].unique())
ipd['One-hot'] = ipd['Species'].map(lambda x: np.eye(len(species))[species.index(x)] )
ipd.sample(5)
```
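As an aside, pandas also ships a built-in helper for this: `pd.get_dummies` builds one indicator column per category. A minimal sketch on a toy frame (the species values below are just stand-ins for the real data):

```python
import pandas as pd
import numpy as np

# A small frame standing in for the Iris data (hypothetical values).
df = pd.DataFrame({'Species': ['setosa', 'versicolor', 'virginica', 'setosa']})

# get_dummies builds one indicator column per species, sorted by name.
dummies = pd.get_dummies(df['Species'])

# Collapse each indicator row back into a single array-valued column,
# matching the np.eye approach above.
df['One-hot'] = list(dummies.values.astype(float))
print(df['One-hot'][0])  # [1.0, 0.0, 0.0] for setosa
```

The `np.eye` trick used above has the advantage that the class order is whatever `unique()` returns, while `get_dummies` orders columns alphabetically.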

| | Sepal Length | Sepal Width | Petal Length | Petal Width | Species | One-hot |
|---|---|---|---|---|---|---|
| 136 | 6.3 | 3.4 | 5.6 | 2.4 | virginica | [0.0, 0.0, 1.0] |
| 58 | 6.6 | 2.9 | 4.6 | 1.3 | versicolor | [0.0, 1.0, 0.0] |
| 20 | 5.4 | 3.4 | 1.7 | 0.2 | setosa | [1.0, 0.0, 0.0] |
| 69 | 5.6 | 2.5 | 3.9 | 1.1 | versicolor | [0.0, 1.0, 0.0] |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | setosa | [1.0, 0.0, 0.0] |

I need to split the data into training and test sets. Since the data is sorted by species, I need to shuffle it before splitting.

```
shuffled = ipd.sample(frac=1)
trainingSet = shuffled[0:len(shuffled)-50]
testSet = shuffled[len(shuffled)-50:]
```
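A quick sanity check that the shuffle worked: the last 50 rows of the sorted file would all be one species, but after `sample(frac=1)` the test set should contain a mix of all three. A sketch on stand-in data (the `random_state` seed and the toy frame are my additions, not from the original notebook):

```python
import pandas as pd

# A stand-in frame sorted by species, shaped like the Iris CSV.
ipd = pd.DataFrame({'Species': ['setosa'] * 50 + ['versicolor'] * 50
                    + ['virginica'] * 50})

shuffled = ipd.sample(frac=1, random_state=0)
trainingSet = shuffled[0:len(shuffled) - 50]
testSet = shuffled[len(shuffled) - 50:]

# Without the shuffle the test set would be a single species block;
# with it, all three species should appear.
print(testSet['Species'].value_counts())
```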

I used essentially the same training code from the basic MNIST tutorial.

```
inp = tf.placeholder(tf.float32, [None, 4])
weights = tf.Variable(tf.zeros([4, 3]))
bias = tf.Variable(tf.zeros([3]))
y = tf.nn.softmax(tf.matmul(inp, weights) + bias)
y_ = tf.placeholder(tf.float32, [None, 3])
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
train_step = tf.train.AdamOptimizer(0.01).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
```
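The graph above is just softmax regression: a 4-feature to 3-class linear map followed by softmax. The same computation can be sketched by hand in NumPy to make it concrete (the weight and bias values below are made up for illustration; after `tf.zeros` initialization the real weights start at zero too):

```python
import numpy as np

# One flower's measurements: sepal length/width, petal length/width.
x = np.array([5.1, 3.5, 1.4, 0.2])

# 4 features -> 3 classes; zeros mirror the tf.zeros initialization.
W = np.zeros((4, 3))
b = np.array([0.1, 0.2, 0.3])  # made-up bias values

logits = x.dot(W) + b
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)  # three class probabilities summing to 1
```

Training nudges `W` and `b` so that the probability mass lands on the correct species' index.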

The first time I ran this, I got `ValueError: setting an array element with a sequence.` The pandas `as_matrix()` function returns the data set as an `ndarray` of `ndarray`s, but the TensorFlow API needs a list, hence the seemingly useless comprehensions that resolve the error.
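The underlying problem is easy to reproduce outside TensorFlow: a pandas column whose cells are arrays comes back as a 1-D object array, not a 2-D float matrix. A small sketch (toy data, not the Iris frame):

```python
import numpy as np
import pandas as pd

# A column of ndarrays, like the 'One-hot' column above.
s = pd.Series([np.eye(3)[0], np.eye(3)[1]])

# .values yields a 1-D object array of ndarrays, which cannot be
# coerced into a float32 matrix in one step...
arr = s.values
print(arr.dtype)  # object

# ...but a plain list of the rows (or np.stack) gives a clean 2-D array.
stacked = np.stack([x for x in arr])
print(stacked.shape)  # (2, 3)
```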

```
keys = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']
for i in range(1000):
    train = trainingSet.sample(50)
    sess.run(train_step, feed_dict={inp: [x for x in train[keys].values],
                                    y_: [x for x in train['One-hot'].as_matrix()]})
print sess.run(accuracy, feed_dict={inp: [x for x in testSet[keys].values],
                                    y_: [x for x in testSet['One-hot'].values]})
```

0.98

To actually use the trained model for classification, I need to evaluate `y` for an input vector and find the index of the largest element in the result. `tf.argmax` does exactly that.

```
def classify(inpv):
    dim = y.get_shape().as_list()[1]
    # argmax returns a single-element vector, so take the scalar from it
    largest = sess.run(tf.argmax(y, 1), feed_dict={inp: inpv})[0]
    return np.eye(dim)[largest]

sample = shuffled.sample(1)
print "Classified as %s" % classify(sample[keys])
sample
```

Classified as [ 0. 0. 1.]

| | Sepal Length | Sepal Width | Petal Length | Petal Width | Species | One-hot |
|---|---|---|---|---|---|---|
| 116 | 6.5 | 3.0 | 5.5 | 1.8 | virginica | [0.0, 0.0, 1.0] |