(You might want to read Part 1 first.)
When last we left off, we'd built a model using shirt colors to predict boy-ness / girl-ness.
Our second attempt will involve the shirt images themselves (sort of). For our purposes, computer images are made up of pixels, each of whose color is determined by specifying red, green, and blue values between 0 and 255. So if you have an image with N pixels, you can think of it as a point in 3N-dimensional space, all of whose coordinates lie between 0 and 255.
And as before, we can build a linear model to classify points in space using logistic regression. The trick here is that the images have different sizes (and hence different numbers of pixels). So as a first step, we'll rescale every image to 138 pixels x 138 pixels = 19,044 pixels. (A lot of our images are this size, and the rest are mostly larger, which is why I chose it.) This will give us a representation of each t-shirt image as a point in 57,132-dimensional space. (Visualizing 57,132-dimensional space is tricky, so don't feel bad if you can't do it.)
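If you want to follow along, here's a rough sketch of the rescale-and-flatten step in plain numpy. (The nearest-neighbor resizing is my own stand-in for whatever an image library would do; the fake random "shirt" is obviously not from the dataset.)

```python
import numpy as np

def resize_nearest(img, size=138):
    """Nearest-neighbor rescale of an H x W x 3 uint8 image to size x size x 3."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols, :]

def to_point(img, size=138):
    """Flatten a rescaled image into a single 3 * size * size vector."""
    return resize_nearest(img, size).reshape(-1).astype(np.float64)

# A fake 200 x 300 "shirt" standing in for a real image:
shirt = np.random.randint(0, 256, size=(200, 300, 3), dtype=np.uint8)
point = to_point(shirt)
print(point.shape)  # (57132,)
```

That is, every shirt ends up as one 57,132-element vector, which is the representation everything below works with.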
Our dataset only contains about 1,000 shirts, which means that a 57,132-dimensional classifier would simply memorize every shirt in the training dataset rather than figure out what distinguishes the boys' shirts from the girls' shirts. This means we need to do some sort of dimensionality reduction to get our t-shirt images into a much lower-dimensional space.
Here we'll use Principal Component Analysis, which finds the direction (in 57,132-dimensional space) that accounts for the largest amount of variance in the dataset. It then projects out that direction, finds the direction accounting for the most variance in what remains, and so on, until it has as many components as you asked for.
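In scikit-learn terms (which is what I assume you'd reach for; the toy numbers below are shrunk way down from 1,000 shirts in 57,132 dimensions so the example runs in a blink), the whole thing is a couple of lines:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in data: 100 "shirts" in 300 dimensions instead of
# ~1,000 shirts in 57,132 dimensions; the real pipeline is identical.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))

pca = PCA(n_components=10)
scores = pca.fit_transform(X)    # each "shirt" as just 10 numbers
print(scores.shape)              # (100, 10)
print(pca.components_.shape)     # (10, 300) -- each row is a principal direction
```

Each row of `pca.components_` is one of those directions in the original space, which is what the "eigenshirts" below are made from.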
(As always, code is on GitHub.)
I ended up using 10 components, which gives a representation of each t-shirt as just 10 numbers, representing the projection of the (57,132-dimensional representation of the) shirt onto the first 10 principal components, each of which is itself a vector in 57,132-dimensional space. For instance, the first principal component is the 57,132-element vector
[0.0002334, 0.00029256, 0.00042805, ... , 0.00051605]
By thinking of this as a vector of 19,044 rgb triplets, and by rescaling it so that its smallest component is 0 and its largest component 255, we can convert it into an image of an eigenshirt representing the "essence" of this component. Shirts with a large value for the first component will tend to be "similar" to this eigenshirt. Shirts with a large negative value for the first component will tend to be "similar" to its color-inverted "anti-eigenshirt". [We could have just as easily picked the "anti-eigenshirt" as the eigenshirt and flipped the signs of the components.]
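The min-to-0 / max-to-255 rescaling is simple enough to sketch (the random vector here is just a stand-in for a real principal component):

```python
import numpy as np

def eigenshirt_image(component, size=138):
    """Rescale a principal-component vector so its smallest value maps to 0
    and its largest to 255, then reshape it into a size x size RGB image."""
    lo, hi = component.min(), component.max()
    scaled = (component - lo) / (hi - lo) * 255
    return scaled.reshape(size, size, 3).astype(np.uint8)

component = np.random.default_rng(1).normal(size=138 * 138 * 3)
img = eigenshirt_image(component)
anti = 255 - img  # the color-inverted "anti-eigenshirt"
print(img.shape, img.min(), img.max())  # (138, 138, 3) 0 255
```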
The below table shows, for each of the 10 principal components, the eigenshirt, the shirt with the largest component value, the shirt with the closest-to-zero value, the shirt with the largest negative component value, and the "anti-eigenshirt".
| Eigenshirt | Most Eigenshirty | Not Eigenshirty | Most Anti-Eigenshirty | Anti-Eigenshirt |
|---|---|---|---|---|
If I were to try to give qualitative descriptions of these ten components, I guess they would be something like:
- Component 0: White -> Black
- Component 1: Orange -> Blue
- Component 2: Dark sleeved / white sleeveless -> White sleeved / dark sleeveless
- Component 3: Wide dark / narrow white -> Narrow dark / wide white
- Component 4: ?
- Component 5: Green -> Purple
- Component 6: White trim / dark shirt -> Dark trim / white shirt
- Component 7: Dark long sleeve / white sleeveless -> White long sleeve / dark sleeveless
- Component 8: White shirt / dark print -> Dark shirt / white print
- Component 9: ?
The Principal Component representation of each shirt is a 10-dimensional vector representing (roughly) where it fits on each of these spectra. For instance, the monkey shirt
is represented by the vector
[ -9313, 10067, -149, -4013, -2147, 1574, -296, -954, 1729, -196]
the biggest components of which are "orange" (eigenshirt #1), "dark" (anti-eigenshirt 0), and "narrow" (anti-eigenshirt 3).
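Reading off the "biggest components" is just a sort by absolute value, with the sign telling you eigenshirt versus anti-eigenshirt:

```python
import numpy as np

# The monkey shirt's 10 principal-component values, from above.
monkey = np.array([-9313, 10067, -149, -4013, -2147, 1574, -296, -954, 1729, -196])

# Rank components by magnitude; positive means "like the eigenshirt",
# negative means "like the anti-eigenshirt".
order = np.argsort(-np.abs(monkey))
for i in order[:3]:
    side = "eigenshirt" if monkey[i] > 0 else "anti-eigenshirt"
    print(f"component {i}: {monkey[i]:+d} ({side})")
# top three: component 1 (+10067), component 0 (-9313), component 3 (-4013)
```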
If we try to reconstruct the image using just these ten components, we get
which seems to have captured orange, short sleeve, and dark graphic. You certainly can't tell it's a monkey, though.
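The reconstruction itself is just "start from the mean shirt and add back score times eigenshirt for each component". Here's a toy-sized sketch (sklearn's `inverse_transform` does the same arithmetic):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy "shirts" again: 100 points in 300 dimensions instead of the real sizes.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 300))

pca = PCA(n_components=10).fit(X)
scores = pca.transform(X)

# Mean shirt plus the weighted sum of the 10 eigenshirt directions:
reconstructed = pca.mean_ + scores @ pca.components_
assert np.allclose(reconstructed, pca.inverse_transform(scores))
print(reconstructed.shape)  # (100, 300)
```

Since we threw away all but 10 of the 57,132 dimensions, the reconstruction only keeps the broad strokes, which is exactly why the monkey disappears.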
If we try to predict "boy shirt or girl shirt" using just these 10 components, we get a model that's 93% accurate on the test set. The coefficients (multiplied by 10,000, since they're small) look like:
- Component 0: -2.71 (eigenshirt is girlish)
- Component 1: -2.56 (girlish)
- Component 2: 3.55 (boyish)
- Component 3: 0.53 (weakly boyish)
- Component 4: -0.56 (weakly girlish)
- Component 5: 5.43 (boyish)
- Component 6: -15.9 (very girlish)
- Component 7: -4.68 (girlish)
- Component 8: 2.73 (boyish)
- Component 9: -2.14 (girlish)
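The model itself is ordinary logistic regression on the 10 component values. A sketch with made-up data (the real labels, of course, come from which section of the store the shirt was in):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 10 "PCA scores" per shirt and a 0/1 boy/girl label
# that's a noisy linear function of them, so the model has something to find.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
w = rng.normal(size=10)
y = (X @ w + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # fraction correct on the held-out set
print(model.coef_[0])               # one coefficient per component; sign says boyish vs girlish
```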
As before, we can look at how the shirts are distributed as a function of the score they get from the model:
The miscategorized shirts generally have low (close to 0) scores, except for one particularly "girly" boys shirt that we'll see below.
Girliest Girl (looks like it's based on shape and colors)
Girliest Boy (shape and colors again)
Boyiest Boy (da Bears)
Boyiest Girl (same one as last time!)
This is all very interesting and hints at Platonic ideal shirts (the philosophical details of which are out of scope for this blog). And clearly it does a much better job of predicting "boy shirt or girl shirt" than our previous color-based attempt. But whereas everyone knows about colors (except for the color-blind, of course), most people are unfamiliar with "eigenshirts" and will accuse you of having made them up just in order to have something to blog about. In particular, the girl who works at Gap Kids was entirely unimpressed with this model, and said that I needed to either buy something or leave the store.
Were I really committed to this model, I'd probably do more work to make the images comparable to each other, so that not only were they the same size but the shirts were oriented as closely as possible and all had the same background color. Alas, I'm sort of principal-componented-out, and am eager to get back to writing my blog post about "the only correct way to interview engineers", the punch-line of which is that you should only ask questions that involve golf balls, piano tuners, counterfeit coins, airplanes, or treadmills.
And so we leave things until part 3, "Shirt Language Processing", which will be forthcoming at some point after I muster up the motivation to either transcribe the shirt images or find an intern to do it for me.