Why [Programming Language X] Is Unambiguously Better than [Programming Language Y]
Recently I have seen a lot of people wondering about the difference between [X] and [Y]. After all, they point out, both are [paradigm] languages that target [platform] and encourage the [style] style of programming while leaving you enough flexibility to [write shitty code].
Having written [simple program that's often asked about in phone screens] in both languages, I think I’m pretty qualified to weigh in. I like to think about it in the following way: imagine [toy problem that you might give to a 5th grader who is just learning to program]. A [Y] implementation of it might look like this:
[really poorly engineered Y code]
Whereas in [X] you could accomplish the same thing with just
[slickly written X code that shows off syntactic sugar]
It’s pretty clear that the second is easier to understand and less error-prone.
Now consider type systems. [Religious assertion about the relative merits and demerits of static and dynamic typing.] Sure, [Y] gives you [the benefit of Y's type system or lack thereof] but is this worth [the detriment of Y's type system or lack thereof]? Obviously not!
Additionally, consider build tools. While [Y] uses [tool that I have never bothered to understand], [X] uses the far superior [tool that I marginally understand]. That’s reason enough to switch!
Finally, think about the development process. [X] has the amazing [X-specific IDE that's still in pre-alpha], and it also integrates well with [text editor that's like 50 years old and whose keybindings are based on Klingon] and [IDE that everyone uses but that everyone hates]. Sure, you can use [Y] with some of these, but it’s a much more laborious and painful process.
In conclusion, while there is room for polyglotism on the [platform] platform, we would all be well served if you [Y] developers would either crawl into a hole somewhere or else switch to [X] and compete with us for the handful of [X] jobs. Wait, never mind, [Y] is awesome!
*Thinking Spreadsheet* Free On The Web
After talking about doing so forever, I’ve finally “webified” Thinking Spreadsheet. So if you ever wanted to learn everything I know about spreadsheets but were too cheap to actually buy the book, here’s your opportunity. Share it with your friends and hope that GitHub doesn’t decide I’m abusing their “free website” feature!
Constructive Mathematics in F# (and Clojure)
(Tell me what a terrible person I am on Hacker News.)
For as long as I can remember^{1} I’ve dreamed of reimplementing the entirety of mathematics from scratch. And now that I’ve finished the “Wheel of Time” series I have a little bit of extra time on my hands each day, which has allowed me to take baby steps toward my dream.
What this is
An implementation of mathematics in F# (and also in Clojure)
What this is not
An efficient implementation of mathematics in F# (or in Clojure)
You would never want to use this library to do mathematics, as it is chock-full of all sorts of non-tail-recursive function calls that will blow your stack like there’s no tomorrow. (If you don’t know what that means, just take my word that you would never want to use this library to do mathematics.) Instead, this library is an interesting way to learn about
- how to construct a mathematics from scratch
- how to implement a mathematics in F# (or Clojure)
- my bizarre obsessions
As always when I work on stuff like this, the code is on my GitHub.
This was originally just going to be in F#, and then I read a couple of blog posts about ClojureScript, which reminded me I’d been meaning to do something in Clojure, so why not implement the same stuff a second time? (This is why “in Clojure” is in parentheses everywhere, and why the F# code has all the comments.) I tried to make the F# code F#y and the Clojure code Clojurey, but I’m not sure how well I succeeded.
I won’t go into excruciating detail about either mathematical theory or F# (or Clojure), but hopefully you can understand both from the detail I do go into. I’ll also only call out a few high points of each codebase; if you want more gory details, check out GitHub.
Both sets of code have handfuls of tests written, which should give you a good sense of how both libraries operate.
Comparisons
In F#, I’ll define a discriminated union
type Comparison = LessThan | Equal | GreaterThan
In Clojure you don’t typically use “types”, but we can just use the keywords :less-than, :equal, and :greater-than.
Natural Numbers
We’ll define these recursively. A natural number is either
- “One” (which is just some thing; forget that you’re already familiar with a “one”), or
- the “Successor” of a different natural number
Anything you can make using these rules is a natural number. Anything that you can’t isn’t.
We’ll call the successor of One “Two”, and the successor of Two “Three”, keeping in mind that at this point those are just names attached to things without any meaning other than “Two is the successor of One” and “Three is the successor of Two”.
In F# we can do this with a discriminated union:
type Natural = One | SuccessorOf of Natural
let Two = SuccessorOf One
// and so on
After trying a lot of things in Clojure, I finally decided the most Clojure-y way was
(defn successor-of [n] {:predecessor n})
(def one (successor-of nil))
(def two (successor-of one))
; and so on
Although the Clojure way at first looks backward, if you think about it both ways define the “successor of One” to be the number whose “predecessor” is One.
Next we’ll want to use this recursive structure to create an arithmetic. For instance, we can easily add two natural numbers:
let rec Add (n1 : Natural) (n2 : Natural) =
    match n1 with
    // adding One to a number is the same as taking its Successor
    | One -> SuccessorOf n2
    // otherwise n1 has a predecessor; add it to the successor of n2
    // idea: n1 + n2 = (n1 - 1) + (n2 + 1)
    | SuccessorOf n1' -> Add n1' (SuccessorOf n2)
Clojure doesn’t have built-in pattern matching, so instead I did something similar using a one? function:
(defn add [n1 n2]
  (if (one? n1)
    (successor-of n2)
    (add (predecessor-of n1) (successor-of n2))))
Both make it easy to create lazy infinite sequences of all natural numbers.
let AllNaturals = Seq.unfold (fun c -> Some (c, SuccessorOf c)) One
and
(def all-naturals (iterate successor-of one))
And (blame it on the natural numbers) both run into trouble when you try to define subtraction. In F# the natural thing to do is return an Option type:
// now, we'd like to define some notion of subtraction as the inverse of addition
// so if n1 + n2 = n3, then you'd like "n3 - n2" = n1
// but this isn't always defined, for instance
//   n = One - One
// would mean One = One + n = SuccessorOf n, which plainly can never happen
// in this case we'll return None
let rec TrySubtract (n1 : Natural) (n2 : Natural) =
    match n1, n2 with
    // since n1' + One = SuccessorOf n1', then SuccessorOf n1' - One = n1'
    | SuccessorOf n1', One -> Some n1'
    // if n = (n1 + 1) - (n2 + 1), then
    //   n + n2 + 1 = n1 + 1
    //   so n + n2 = n1,
    //   or n = n1 - n2
    | SuccessorOf n1', SuccessorOf n2' -> TrySubtract n1' n2'
    | One, _ -> None // "impossible subtraction"
In Clojure there is no option type, so I just returned nil for a bad subtraction:
(defn try-subtract [n1 n2]
  (cond
    (one? n1) nil
    (one? n2) (predecessor-of n1)
    :else (try-subtract (predecessor-of n1) (predecessor-of n2))))
Integers
The failure of "subtraction" leads us to introduce the Integers, which you can (if you are so inclined) think of as equivalence classes of pairs of natural numbers, where (for instance),
(Three,Two) = (Two,One) = "the result of subtracting one from two" = "the integer corresponding to one"
In F# we can again define a custom type:
type Integer =
    | Positive of Natural.Natural
    | Zero
    | Negative of Natural.Natural
and map to equivalence classes using
let MakeInteger (plus, minus) =
    match Natural.Compare plus minus with
    | Comparison.Equal -> Zero
    | Comparison.GreaterThan -> Positive (Natural.Subtract plus minus)
    | Comparison.LessThan -> Negative (Natural.Subtract minus plus)
whereas in Clojure we just use maps:
(def zero {:sign :zero})
(defn positive [n] {:sign :positive, :n n})
(defn negative [n] {:sign :negative, :n n})
and the very similar
(defn make-integer [plus minus]
  (let [compare (natural-numbers/compare plus minus)]
    (case compare
      :equal zero
      :greater-than (positive (natural-numbers/subtract plus minus))
      :less-than (negative (natural-numbers/subtract minus plus)))))
We can easily define addition and subtraction and even multiplication, but when we get to division we run into problems again. You'd like 1 / 3 to be the number that when multiplied by three yields one. But there is no such Integer.
let rec TryDivide (i1 : Integer) (i2 : Integer) =
    match i1, i2 with
    | _, Zero -> failwithf "Division by Zero is not allowed"
    | _, Negative _ -> TryDivide (Negate i1) (Negate i2)
    | Zero, Positive _ -> Some Zero
    | Negative _, Positive _ ->
        match TryDivide (Negate i1) i2 with
        | Some i -> Some (Negate i)
        | None -> None
    | Positive _, Positive _ ->
        if LessThan i1 i2
        then None // cannot divide a smaller integer by a larger one
        else
            match TryDivide (Subtract i1 i2) i2 with
            | Some i -> Some (SuccessorOf i)
            | None -> None
and similarly
(defn try-divide [i1 i2]
  (cond
    (zero? i2) (throw (Exception. "division by zero is not allowed"))
    (negative? i2) (try-divide (negate i1) (negate i2))
    (zero? i1) zero
    (negative? i1) (let [td (try-divide (negate i1) i2)]
                     (if td (negate td)))
    :else ; both positive
    (if (less-than i1 i2)
      nil
      (let [td (try-divide (subtract i1 i2) i2)]
        (if td (successor-of td))))))
And if we're clever we can get a lazy sequence of all prime numbers:
let rec IsPrime (i : Integer) =
    match i with
    | Zero -> false
    | Negative _ -> IsPrime (Negate i)
    | Positive Natural.One -> false
    | Positive _ ->
        let isComposite =
            Range Two (AlmostSquareRoot i)
            |> Seq.exists (fun i' -> IsDivisibleBy i i')
        not isComposite

let AllPrimes =
    Natural.AllNaturals
    |> Seq.map Positive
    |> Seq.filter IsPrime
and in Clojure
(defn prime? [i]
  (cond
    (zero? i) false
    (negative? i) (prime? (negate i))
    (equal-to i one) false
    :else (not-any? #(is-divisible-by i %) (range two (almost-square-root i)))))

(def all-primes
  (->> natural-numbers/all-naturals
       (map positive)
       (filter prime?)))
Rational Numbers
Now, to solve the "division problem", we can similarly look at equivalence classes of pairs of integers, just as long as the second one isn't zero.
// motivated by the "division problem" -- given integers i1 and i2, where i2 not zero,
// we would like to define some number q = Divide i1 i2, such that EqualTo i1 (Multiply q i2)
// proceeding as above, why not define a new type of number as a *pair* (i1, i2) representing
// the "quotient" of i1 and i2. Again such a representation is not unique, as you'd want
//   (Two, One) = (Four, Two) = [the number corresponding to Two]
// when do we want (i1, i2) = (i1', i2')?
// when there is some i3 with i1 = i2 * i3, i1' = i2' * i3, or
// precisely when we have i1 * i2' = i1' * i2
// in particular, if x divides both i1 and i2, so that i1 = i1' * x, i2 = i2' * x, then
//   i1 * i2' = i1' * x * i2' = i1' * i2, so that (i1, i2) = (i1', i2')
type Rational(numerator : Integer.Integer, denominator : Integer.Integer) =
    let gcd =
        if Integer.EqualTo Integer.Zero denominator then failwithf "Cannot have a Zero denominator"
        else Integer.GCD numerator denominator
    // want denominator to be positive always
    let reSign =
        match denominator with
        | Integer.Negative _ -> Integer.Negate
        | _ -> id
    // divide by GCD to get relatively prime numerator and denominator
    let _numerator = Integer.Divide (reSign numerator) gcd
    let _denominator = Integer.Divide (reSign denominator) gcd
    member this.numerator with get () = _numerator
    member this.denominator with get () = _denominator
or
(defn rational [numerator denominator]
  (let [gcd (if (integers/equal-to integers/zero denominator)
              (throw (Exception. "cannot have a zero denominator!"))
              (integers/gcd numerator denominator))
        re-sign (if (integers/less-than denominator integers/zero)
                  integers/negate
                  (fn [i] i))]
    {:numerator (integers/divide (re-sign numerator) gcd),
     :denominator (integers/divide (re-sign denominator) gcd)}))
There is lots of extra code around the rationals, although it's hard to run into limitations as we did before. The most common limitation is that there's no rational whose square is two, but it's hard to run into that limitation without reasoning outside the system.
Real Numbers
Two common ways of constructing the real numbers from the rationals are Dedekind Cuts and equivalence classes of Cauchy Sequences. Neither is easy to implement in code.
Instead, I found a way to specify real numbers as Cauchy sequences along with specific Cauchy bounds:
// following http://en.wikipedia.org/wiki/Constructivism_(mathematics)#Example_from_real_analysis
// we'll define a Real number as a pair of functions:
//   f : Integer -> Rational
//   g : Integer -> Integer
// such that for any n, and for any i and j >= g(n), we have
//   AbsoluteValue (Subtract (f i) (f j)) <= Invert n
type IntegerToRational = Integer.Integer -> Rational.Rational
type IntegerToInteger = Integer.Integer -> Integer.Integer
type Real = IntegerToRational * IntegerToInteger

let Constantly (q : Rational.Rational) (_ : Integer.Integer) = q
let AlwaysOne (_ : Integer.Integer) = Integer.One
let FromRational (q : Rational.Rational) : Real = (Constantly q), AlwaysOne
or
(defn real [f g] {:f f, :g g})
(defn f-g [r] [(:f r) (:g r)])
(defn constantly [q] (fn [_] q))
(defn always-integer-one [_] integers/one)
(defn from-rational [q] (real (constantly q) always-integer-one))
One interesting twist here is that it is impossible to say whether two numbers are equal without reasoning outside the system. For instance, the real number FromRational Rational.Zero is equal to the real number
(Rational.FromInteger >> Rational.Invert, Rational.FromInteger >> Rational.Invert)
(which represents the sequence 1, 1/2, 1/3, 1/4, ...), but again you can only reason about that outside of code. Instead you can define CompareWithTolerance, which, given a tolerance, can tell you that one number is definitively greater than another, or that they're "approximately equal".
The ultimate test here would be to show that the real number
let SquareRootOfTwo : Real =
    let rationalTwo = Rational.FromInteger Integer.Two
    let sq x = Rational.Subtract (Rational.Multiply x x) rationalTwo
    let sq' x = Rational.Multiply x rationalTwo
    // newton's method
    let iterate _ guess = Rational.Subtract guess (Rational.Divide (sq guess) (sq' guess))
    let f = memoize iterate Rational.One
    let g (n : Integer.Integer) = n
    f, g
gives you the real number FromRational Rational.Two
when you square it. It looks like it should. Unfortunately, trying to do so will blow up the call stack, so it's not advised. Maybe someday I'll go back and try to make everything tail-recursive.
Gaussian Integers
Another drawback of the Integers is that none of them have negative squares. One way to solve this is by adding a number "i" whose square is negative one. I got kind of bored with these, so I never took them too far and never wrote any tests.
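I never wrote real code for these, but the idea is simple enough to sketch. Here is an illustrative Python version (not part of the F# or Clojure libraries, and built on ordinary machine integers rather than the constructed Integers above): a Gaussian integer is a pair (a, b) representing a + b*i, and multiplication uses i * i = -1.

```python
# Illustrative sketch only: Gaussian integers on top of plain Python ints.
class GaussianInteger:
    def __init__(self, re, im):
        self.re = re  # the "real" part a
        self.im = im  # the "imaginary" part b, so the number is a + b*i

    def __add__(self, other):
        # add componentwise: (a + bi) + (c + di) = (a + c) + (b + d)i
        return GaussianInteger(self.re + other.re, self.im + other.im)

    def __mul__(self, other):
        # (a + bi)(c + di) = (ac - bd) + (ad + bc)i, using i * i = -1
        return GaussianInteger(
            self.re * other.re - self.im * other.im,
            self.re * other.im + self.im * other.re,
        )

    def __eq__(self, other):
        return (self.re, self.im) == (other.re, other.im)

    def __repr__(self):
        return f"GaussianInteger({self.re}, {self.im})"

i = GaussianInteger(0, 1)
print(i * i)  # GaussianInteger(-1, 0) -- a number with a negative square!
```

This finally gives us a number whose square is negative one, which is exactly the drawback the Integers had.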
Complex Numbers
The obvious next step would be to add the square root of negative one "i" to the real numbers. But since they're not working so great I never did this.
Conclusion
I spent way too much time on this project, and I really need to get back to other things, like the group-couponing site I'm planning to build, so I'm ready to call this quits. Here are some things I learned:
1. Math is hard.
2. Writing the Clojure versions was more "fun". However,
3. Getting the F# versions to work was much easier, because most of my Clojure bugs would have been caught by a type checker (or were caused by using maps as types and then having them unintentionally decompose).
4. If I put this much work into useful ideas, imagine what I could accomplish!
5. Probably I shouldn't read "Wheel of Time" again.
T-Shirts, Feminism, Parenting, and Data Science, Part 2: Eigenshirts
(You might want to read Part 1 first.)
When last we left off, we’d built a model using shirt colors to predict boyness / girlness.
Our second attempt will involve the shirt images themselves (sort of). For our purposes, computer images are made up of pixels, each of whose color is determined by specifying red, green, and blue values between 0 and 255. So if you have an image with N pixels, you can think of it as a point in 3N-dimensional space, all of whose coordinates lie between 0 and 255.
And as before, we can build a linear model to classify points in space using logistic regression. The trick here is that the images have different sizes (and hence different numbers of pixels). So as a first step, we’ll rescale every image to 138 pixels x 138 pixels = 19,044 pixels. (A lot of our images are this size, and the rest are mostly larger, which is why I chose it.) This will give us a representation of each t-shirt image as a point in 57,132-dimensional space. (Visualizing 57,132-dimensional space is tricky, so don’t feel bad if you can’t do it.)
Our dataset only contains about 1,000 shirts, which means that a 57,000-dimensional classifier would learn to identify every shirt in the training dataset rather than figure out what distinguishes the boys shirts from the girls shirts. This means we need to do some sort of dimensionality reduction to get our t-shirt images into a much lower-dimensional space.
Here we’ll use Principal Component Analysis, which finds the direction (in 57,132-dimensional space) that accounts for the largest amount of variance in the dataset. It then subtracts out this direction, finds the most-variant direction of the new dataset, and so on, until it has enough components.
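The reduction step can be sketched with scikit-learn's PCA (this is an illustration, not necessarily the code in the repo, and it uses random synthetic "images" since I can't redistribute the dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for the real shirt images: 50 fake 138x138 RGB images.
# Each one flattens to a 138 * 138 * 3 = 57,132-dimensional vector.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(50, 138, 138, 3))
X = images.reshape(len(images), -1).astype(float)  # shape (50, 57132)

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)  # 10 numbers per shirt: shape (50, 10)
eigenshirts = pca.components_     # one 57,132-vector per component: (10, 57132)
```

Each row of `X_reduced` is a shirt's 10-number representation; each row of `pca.components_` is one of the "eigenshirts" discussed below.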
(As always, code is on GitHub.)
I ended up using 10 components, which gives a representation of each t-shirt as just 10 numbers, representing the projection of the (57,132-dimensional representation of the) shirt onto the first 10 principal components, each of which is itself a vector in 57,132-dimensional space. For instance, the first principal component is the 57,132-element vector
[0.0002334, 0.00029256, 0.00042805, ... , 0.00051605]
By thinking of this as a vector of 19,044 rgb triplets, and by rescaling it so that its smallest component is 0 and its largest component 255, we can convert it into an image of an eigenshirt representing the “essence” of this component. Shirts with a large value for the first component will tend to be “similar” to this eigenshirt. Shirts with a large negative value for the first component will tend to be “similar” to its color-inverted “anti-eigenshirt”. [We could have just as easily picked the "anti-eigenshirt" as the eigenshirt and flipped the signs of the components.]
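That rescaling is just a linear map of the component's values onto [0, 255]; a sketch (with a random vector standing in for a real principal component):

```python
import numpy as np

def to_eigenshirt_image(component, size=138):
    """Turn a flat (size*size*3,) component vector into a viewable RGB array,
    rescaled so its smallest value becomes 0 and its largest becomes 255."""
    lo, hi = component.min(), component.max()
    scaled = (component - lo) / (hi - lo) * 255
    return scaled.round().astype(np.uint8).reshape(size, size, 3)

# a random stand-in for an actual 57,132-element principal component
component = np.random.default_rng(0).normal(size=138 * 138 * 3)
img = to_eigenshirt_image(component)
print(img.shape, img.min(), img.max())  # (138, 138, 3) 0 255
```

The anti-eigenshirt is just `255 - img`, the color-inverted version.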
The below table shows, for each of the 10 principal components, the eigenshirt, the shirt with the largest component value, the shirt with the closest-to-zero value, the shirt with the largest negative component value, and the “anti-eigenshirt”.
Eigenshirt | Most Eigenshirty | Not Eigenshirty | Most Anti-Eigenshirty | Anti-Eigenshirt
[table of shirt images omitted]
If I were to try to give qualitative descriptions of these ten components, I guess they would be something like:
Component 0: White -> Black
Component 1: Orange -> Blue
Component 2: Dark sleeved / white sleeveless -> White sleeved / dark sleeveless
Component 3: Wide dark / narrow white -> Narrow dark / wide white
Component 4: ?
Component 5: Green -> Purple
Component 6: White trim / dark shirt -> Dark trim / white shirt
Component 7: Dark long sleeve / white sleeveless -> White long sleeve / dark sleeveless
Component 8: White shirt / dark print -> Dark shirt / white print
Component 9: ?
The Principal Component representation of each shirt is a 10-dimensional vector representing (roughly) where it fits on each of these spectra. For instance, the monkey shirt
is represented by the vector
[ 9313, 10067, 149, 4013, 2147, 1574, 296, 954, 1729, 196]
the biggest components of which are “orange” (eigenshirt #1), “dark” (antieigenshirt 0), and “narrow” (antieigenshirt 3).
If we try to reconstruct the image using just these ten components, we get
which seems to have captured orange, short sleeve, and dark graphic. You certainly can’t tell it’s a monkey, though.
Predicting
If we try to predict “boy shirt or girl shirt” using just these 10 components, we get a model that’s 93% accurate on the test set. The coefficients (multiplied by 10,000, since they’re small) look like:
Component 0: -2.71 (eigenshirt is girlish)
Component 1: -2.56 (girlish)
Component 2: 3.55 (boyish)
Component 3: 0.53 (weakly boyish)
Component 4: -0.56 (weakly girlish)
Component 5: 5.43 (boyish)
Component 6: -15.9 (very girlish)
Component 7: -4.68 (girlish)
Component 8: 2.73 (boyish)
Component 9: -2.14 (girlish)
As before, we can look at how the shirts are distributed as a function of the score they get from the model:
The miscategorized shirts generally have low (close to 0) scores, except for one particularly “girly” boys shirt that we’ll see below.
Superlatives
Girliest Girl (looks like it's based on shape and colors)
Girliest Boy (shape and colors again)
Boyiest Boy (da Bears)
Boyiest Girl (same one as last time!)
This is all very interesting and hints at Platonic ideal shirts (the philosophical details of which are out of scope for this blog). And clearly it does a much better job of predicting “boy shirt or girl shirt” than our previous color-based attempt. But whereas everyone knows about colors (except for the colorblind, of course), most people are unfamiliar with “eigenshirts” and will accuse you of having made them up just in order to have something to blog about. In particular, the girl who works at Gap Kids was entirely unimpressed with this model, and said that I needed to either buy something or leave the store.
Were I really committed to this model, I’d probably do more work to get the images comparable to each other so that not only were they the same size but the shirts were oriented as closely as possible and all had the same background color. Alas, I’m sort of principal-componented-out, and am eager to get back to writing my blog post about “the only correct way to interview engineers”, the punchline of which is that you should only ask questions that involve golf balls, piano tuners, counterfeit coins, airplanes, treadmills, or piano tuners.
And so we leave things until part 3, “Shirt Language Processing”, which will be forthcoming at some point after I muster up the motivation to either transcribe the shirt images or find an intern to do it for me.
T-Shirts, Feminism, Parenting, and Data Science, Part 1: Colors
Before I was a parent I never gave much thought to children’s clothing, other than to covet a few of the baby shirts at T-Shirt Hell. Now that I have a two-year-old daughter, I have trouble thinking of anything but children’s clothing. (Don’t tell my boss!)
What I have discovered over the last couple of years is that clothing intended for boys is fun, whereas clothing intended for girls kind of sucks. There’s nothing inherently two-year-old-boyish about dinosaurs, surfing ninjas, skateboarding globes, or “become-a-robot” solicitations, just as there’s nothing inherently two-year-old-girlish about pastel-colored balloons, or cats wearing bows, or dogs wearing bows, or ruffles. Forget about gender, I want Madeline to grow up to be a “surfing ninja” kind of kid, not a “cats wearing bows” kind of kid. An “angry skateboarding dog” kind of kid, not a “shoes with pretty ribbons” kind of kid.
Accordingly, I have taken to buying all of Madeline’s shirts in the boys section, the result of (her boyish haircut and) which is that half the time people refer to her as “he”. This doesn’t terribly bother me, especially if she ends up getting the gender wage premium that people are always yammering about on Facebook, but it makes me wonder why there’s such a stark divide between toddler boy shirts and toddler girl shirts. And, of course, it makes me wonder if the divide is so stark that I can build a model to predict it!
The Dataset
I downloaded images of every “toddler boys” and “toddler girls” t-shirt from Carter’s, Children’s Place, Crazy 8, Gap Kids, Gymboree, Old Navy, and Target. Because each one had their shirts at a different (random) website location, I decided that using an Image Downloader Chrome extension would be quicker and easier than writing a scraping script that worked with all the different sites.
I ended up with 616 images of boys shirts and 446 images of girls shirts. My lawyer has advised me against redistributing the dataset, although I might if you ask nicely.
Attempt #1: Colors
(As always, the code is on my GitHub.)
A quick glance at the shirts revealed that boys shirts tend toward boyish colors, girls shirts toward girlish colors. So a simple model could just take into account the colors in the image. I’ve never done much image processing before, so the Pillow Python library seemed like a friendly place to start. (In retrospect, a library that made at least a half-hearted attempt at documentation would probably have been friendlier.)
The PIL library has a getcolors function that returns a list of
(# of pixels, (red, green, blue))
for each rgb color in the image. This gives 256 * 256 * 256 = almost 17 million possible colors, which is probably too many, so I quantized the colors by bucketing each of red, green, and blue into either [0,85), [85,170) or [170,255]. This gives 3 * 3 * 3 = 27 possible colors.
To make things even simpler, I only cared about whether an image contained at least one pixel of a given color [bucket] or whether it contained none. This allowed me to convert each image into an array of length 27 consisting only of 0s and 1s.
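That pipeline (quantize each channel into three buckets, then record presence/absence of each of the 27 coarse colors) can be sketched as follows. This is an illustrative reimplementation, not the code from the repo:

```python
from PIL import Image

def color_features(image):
    """Return a list of 27 0/1 flags, one per quantized color bucket."""
    width, height = image.size
    present = [0] * 27
    # getcolors returns (count, color) pairs for every color in the image
    for _, (r, g, b) in image.convert("RGB").getcolors(maxcolors=width * height):
        # bucket each channel into [0, 85), [85, 170), [170, 255]
        bucket = lambda c: min(c // 85, 2)
        present[bucket(r) * 9 + bucket(g) * 3 + bucket(b)] = 1
    return present

shirt = Image.new("RGB", (10, 10), (200, 30, 30))  # a solid red stand-in "shirt"
print(sum(color_features(shirt)))  # 1 -- only one of the 27 buckets is present
```

A real shirt image would of course light up several buckets, giving the 27-element 0/1 array described above.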
Finally, I trained a logistic regression model to predict, based solely on the presence or absence of the 27 colors, whether a shirt was a boys shirt or a girls shirt. Without getting too mathematical, we end up with a weight (positive or negative) for each of the 27 colors. Then for any shirt, we add up the weights for all the colors in the shirt, and if that total is positive, we predict “boys shirt”, and if that total is negative, we predict “girls shirt”.
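That scoring rule can be sketched with scikit-learn (made-up data here; labels 1 = boys shirt, 0 = girls shirt):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data standing in for the real shirts: 200 shirts x 27 color flags.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 27))
y = (X[:, 0] == 1).astype(int)  # pretend color 0 alone determines "boyness"

model = LogisticRegression().fit(X, y)
weights = model.coef_[0]  # one weight (positive = boyish) per color bucket

# the prediction rule from the text: sum the weights of the colors present
score = X[0] @ weights + model.intercept_[0]
prediction = "boys shirt" if score > 0 else "girls shirt"
```

The sign of the summed weights (plus the intercept) is exactly the "positive means boys shirt" rule described above.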
I trained the model on 80% of the data and measured its performance on the other 20%. This (pretty stupid) model predicted correctly about 77% of the time.
Plotted below is the number of boys shirts (blue) and girls shirts (pink) in the test set by the score assigned them in the model. Without getting into gory details, a score of 0 means the model thinks it’s equally likely to be a boys shirt or a girls shirt, with more positive scores meaning more likely boys shirt and more negative scores meaning more likely girls shirt. You can see that while there’s a muddled middle, when the model is really confident (in either direction), it’s always right.
If we dig into precision and recall, we see
P(is actually girl shirt | prediction is “girl shirt”) = 75%
P(is actually boy shirt | prediction is “boy shirt”) = 77%
P(prediction is “girl shirt” | is actually girl shirt) = 63%
P(prediction is “boy shirt” | is actually boy shirt) = 86%
One way of interpreting the recall discrepancy is that it’s much more likely for girls shirts to have “boy colors” than for boys shirts to have “girl colors”, which indeed appears to be the case.
Superlatives
Given this model, we can identify
The Girliest Girls Shirt (no argument from me):
The Boyiest Girls Shirt (must be the black-and-white and lack of color?):
The Girliest Boys Shirt (I can see that if you just look at colors):
The Boyiest Boys Shirt (a slightly odd choice, but I guess those are all boyish colors?):
The Most Androgynous Shirt (this one is most likely some kind of image compression artifact, the main colors are boyish but the image also has some girlish purple pixels in it that cancel those out):
The Blandest Shirt (for sure!):
The Most Colorful Shirt (no argument with this one either!):
Scores for Colors
By looking at the coefficients of the model, we can see precisely which colors are the most “boyish” and which are the most “girlish”. The results are not wholly unexpected:
151.71, 80.68, 69.35, 49.69, 43.83, 40.99, 35.94, 30.56, 26.08, 24.06, 20.89, 20.49, 18.89, 17.67, 1.29, -17.37, -21.77, -29.95, -49.91, -56.4, -66.77, -69.52, -70.15, -82.17, -119.1, -175.2, -224.74
[table of the 27 color swatches with their scores, from most boyish (top) to most girlish (bottom); swatch images omitted]
In Conclusion
In conclusion, by looking only at which of 27 colors are present in a toddler tshirt, we can do a surprisingly good job of predicting whether it’s a boys shirt or a girls shirt. And that pretty clearly involves throwing away lots of information. What if we were to take more of the actual image into account?
Coming soon, Part 2: EIGENSHIRTS
Post-Prism Data Science Venn Diagram
In light of recent revelations, here’s an updated version of Drew Conway’s Data Science Venn Diagram:
ESPN, Race, and Presidents
Inspired by (and lifting large amounts of code from) Trey Causey’s investigation of the language that ESPN uses to discuss white and non-white quarterbacks, I similarly wondered about the language ESPN uses to discuss white and non-white Presidents. For instance, a common stereotype is that non-white Presidents assassinate their citizens using unmanned drones, while white Presidents assassinate their citizens using polonium-210. Do such stereotypes creep into sportswriting?
Toward that end, I used Scrapy to scrape all the articles from the ESPN website that matched searches for (president obama), (president bush), (president clinton), and so on. This gave me a total of 543 articles. Then, using Wikipedia, Mechanical Turk, and a proprietary deep learning model, I categorized each of these Presidents as either “white” or “nonwhite”.
Using NLTK, I tokenized each article into sentences and then identified each sentence as being about
- one or more white Presidents
- one or more non-white Presidents
- both white and non-white Presidents
- no Presidents
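A crude sketch of that categorization step (with a regex sentence splitter standing in for NLTK's tokenizer, and tiny made-up name lists):

```python
import re

# Assumed example name lists for illustration; the real categorization
# covered every President matched by the searches above.
WHITE = {"bush", "clinton"}
NONWHITE = {"obama"}

def categorize(sentence):
    """Label a sentence 'white', 'nonwhite', 'both', or 'none'."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    white, nonwhite = bool(words & WHITE), bool(words & NONWHITE)
    if white and nonwhite: return "both"
    if white: return "white"
    if nonwhite: return "nonwhite"
    return "none"

article = "President Obama filled out a bracket. President Bush threw a pitch. It rained."
sentences = re.split(r"(?<=[.!?])\s+", article)  # crude stand-in for NLTK
print([categorize(s) for s in sentences])  # ['nonwhite', 'white', 'none']
```

The real pipeline did the same thing per article, then pooled the sentence labels across all 543 articles.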
Curiously, while there were very few “non-white” Presidents, there were nonetheless about four times as many “non-white” sentences as “white” sentences. (This is itself an interesting phenomenon that’s probably worth investigating.)
I then split each sentence into words and counted how many times each word appeared in “white”, “nonwhite”, “both”, and “none” sentences. Like Trey, I followed the analysis here, similarly excluding stopwords and proper nouns, which I inferred based on capitalization patterns.
Finally, for each word I computed a “white percentage” and a “non-white percentage” by looking at how likely that word was to appear in a “white” sentence or a “non-white” sentence and adjusting for the different numbers of sentences.
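The adjustment can be sketched like this (made-up counts; the point is only the normalization by sentence totals):

```python
from collections import Counter

# Made-up counts for illustration: how often each word appeared in "white"
# vs. "nonwhite" sentences, plus assumed totals of each kind of sentence.
white_counts = Counter({"plaque": 5, "bracket": 2})
nonwhite_counts = Counter({"bracket": 32, "interview": 21})
n_white, n_nonwhite = 100, 400  # ~4x as many "nonwhite" sentences

def white_percentage(word):
    # normalize each raw count by its sentence total before comparing,
    # so the 4x imbalance in sentence counts doesn't swamp everything
    w = white_counts[word] / n_white
    nw = nonwhite_counts[word] / n_nonwhite
    return w / (w + nw) if (w + nw) else 0.0

print(white_percentage("plaque"))   # 1.0 -- appears only in "white" sentences
print(white_percentage("bracket"))  # close to 0.2 -- mostly a "nonwhite" word
```

Without the normalization, "bracket" would look even more lopsided simply because there were four times as many "nonwhite" sentences to appear in.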
After all that, here are the words that were most likely to appear in sentences about “white” Presidents:
plaque 5
severed 4
grab 4
investigation 3
worn 3
unable 3
child 3
suppose 3
block 3
living 3
holders 3
pounds 3
ticket 3
blackout 3
thrown 3
exercise 3
scene 3
televised 3
upon 3
executives 3
Clearly this reads like something out of “CSI” or possibly “CSI: Miami”. If I were to make these words into a story, it would probably be something macabre like
The President grabbed the plaque he’d secretly made from a living child‘s severed foot and worn sock. The investigation supposed a suspect weighing at least 200 pounds who could have thrown the victim down the block, not a feeble politician famous for his televised blackout when he tried to exercise but was unable to grab his toes.
In contrast, here are the words most likely to appear in sentences about “non-white” Presidents:
bracket 32
interview 21
trip 16
champions 16
fan 48 1
asked 35 1
carrier 11
celebrate 11
thinks 11
early 11
eight 11
personal 10
picks 10
appearance 10
far 9
hear 9
congratulating 9
given 9
troops 9
safety 9
fine 9
person 9
This story would have to be something uplifting like
The President promised to raise taxes on every bracket before ending the interview. As a huge water polo fan, he needed to catch a ride on an aircraft carrier for his trip to celebrate with the champions. “Sometimes I get asked,” he thinks, “whether it’s too early to eat a personal pan pizza with eight toppings. So far I always say that I hear it’s not.” His safety is a given, since he’s surrounded by troops who are always congratulating him for being a fine person with a fine appearance.
As you can see, it has a markedly different tone, but not in a way that obviously correlates with the stereotypes mentioned earlier. Whatever prejudices lurk at ESPN are exceedingly subtle.
Obviously, this is only the tip of the iceberg. The algorithm for identifying which sentences were about Presidents is pretty rudimentary, and the word-counting NLP techniques used here are pretty basic. An obvious next step would be to pull in additional data sources like Yahoo! Sports or SI.com or FOX Sports.
If you’re interested in following up, the code is all up on my GitHub, so have at it! And I’d love to hear your feedback.
Three Keys to Successful Parenting
Now that Madeline is two, it seems appropriate to declare myself a success as a parent. Which means it’s now appropriate for those of you with kids (as well as those of you thinking about having or abducting kids) to ask me, “Joel, what’s your secret?” Which means it’s now appropriate for me to say “I’m glad you asked,” and then write a blog post about it.
1. Improv
I’m sure many of you wondered why I took all those improv classes, and why I made you come watch my improvised musical where we could only use words that started with a letter suggested by the audience, and why I didn’t stop the guy in the second row from choosing ‘X’, and why my song “Xerox Xevious” sounded exactly like “Summer of ’69.”
Well, it turns out that improv is a very easy way to become a better parent. (And that all of my songs sound exactly like “Summer of ’69”.)
Before improv
“Daddy, can I have some more candy?”
“No. Go to bed.”
After improv
“Daddy, can I have some more candy?”
“Yes, and after your teeth rot and you become obese and get diabetes and have to have your foot amputated, then you should go to bed.”
Before improv
“Daddy, where do babies come from?”
“Go ask your mother.”
After improv
“Daddy, where do babies come from?”
[sits down on a plain black box, mimes that it's maybe some kind of pirate seat on some kind of pirate boat, and starts in a pirate accent] “Yarr, ye land lubbers always be asking me questions about babies … [10 minute monologue in a pirate voice about piratey things that cleverly reincorporates elements from earlier in the conversation] Arr, go ask the first mate!”
Before improv
“Daddy, I need to go to the bathroom.”
“Again? You just went!”
After improv
“Daddy, I need to go to the bathroom.”
“DING! Now in the style of Shakespeare.”
“Daddy, I need to go to the bathroom!”
“DING! Now in the style of film noir.”
“Daddy, I NEED to GO to the BATHROOM!”
“DING! Now in the style of a fetish video.”
“Daddy, I peed my pants.”
“And scene!”
2. Radical Libertarianism
Most books (with the notable exception of *Praxeological Parenting*) will tell you that moderate libertarianism is all you need to be a good parent. But there are a great many parenting problems that a belief in the night-watchman state does little to solve.
For instance, when your kid doesn’t want to go to school because it’s a brainwashing factory designed to grind young impressionable minds into submission by (among other things) forbidding them from leaving their seats or talking “out of turn” or using the restroom without first obtaining permission, the moderate libertarian answer is typically to offer them a voucher that covers the tuition to a different brainwashing factory. Your kid is unlikely to find this satisfying, for obvious reasons.
Similarly, when your kid wants to BitTorrent the Criterion Director’s Cut version of Dora the Explorer, the wishy-washy moderate libertarian “you wouldn’t download a Dora the Explorer handbag!” position on intellectual property is not going to make her particularly happy.
And what will you tell her when she asks (as all kids inevitably do) how granting a monopoly on violence could possibly be a good way to prevent monopolies and violence? Or why the dinosaurs on “Dinosaur Train” are able to peaceably resolve their various conflicts despite living approximately 66 million years before the invention of government? Or why it’s OK for the government to take pieces of paper out of daddy’s wallet just as long as they don’t take too many, while she gets punished for taking even one, and don’t try to give me any of that John Rawls “veil of ignorance” stuff, I might have bought that crap when I was an infant, but now that I’m TWO YEARS OLD the flaws in his “logic” are pretty glaringly obvious?
Whereas radical libertarianism easily sidesteps all these problems, making parenting a breeze (relatively speaking).
3. Trolling
Did you ever imagine that all those years you wasted trolling that idiot Marxist kid in LiveJournal debates would end up being useful? Because they are! Kids love being trolled! Love it! Here are a few of Madeline’s favorite trolls:
“My Hippo”
This one’s easy, you just pick up something that belongs to the kid (e.g. a stuffed hippo) and troll that it’s yours:
“Hey, my hippo.”
“No, MY hippo!”
“I’m pretty sure this is daddy’s hippo.”
“No, MY hippo!”
“Does it have your name on it?”
“MY hippo!”
“It was just lying on the floor and I homesteaded it.”
“MY hippo!”
“Have your protection agency call my protection agency and maybe we can work something out.”
“MY hippo!”
“Behind the veil of ignorance it could just as easily have been my hippo.”
“MY hippo!”
[ several hundred lines of dialogue removed due to space constraints ]
“Yeah, but what does it really mean to ‘own’ something?”
“MY hippo!”
“And scene!”
“Science Project”
Part of being a parent is helping your kids with science projects, so help them “demonstrate” something that isn’t real, like cold fusion, or quantum computing, or evolution. Chances are their teachers won’t know the difference, which makes it also work on another level.
“9/11 Trutherism”
Kids will believe just about anything, even that a third skyscraper, WTC 7, would just collapse on its own despite not even being hit by a plane. Even so, it’s not very hard to convince them that the towers were brought down on 9/11 by controlled demolition using explosives secretly planted in advance by the government in order to create an excuse to invade Iraq and Afghanistan in order to pave the way for a new American hegemony. And then they’ll repeat this on the playground, and then you’ll get called in for a parent-teacher conference at which you can reveal that you’d assumed that she’d picked these theories up from the playground, which means that if she didn’t then maybe she just came up with them on her own? And that if the official narrative is so shoddy that a 2-year-old can pick holes in it, then maybe Alex Jones is onto something!
“The Craigslist Experiment”
OK, so possibly there are some kinds of trolling kids don’t like.
Should you get a Ph.D.?
No.