Why [Programming Language X] Is Unambiguously Better than [Programming Language Y]

Recently I have seen a lot of people wondering about the difference between [X] and [Y]. After all, they point out, both are [paradigm] languages that target [platform] and encourage the [style] style of programming while leaving you enough flexibility to [write shitty code].

Having written [simple program that’s often asked about in phone screens] in both languages, I think I’m pretty qualified to weigh in. I like to think about it in the following way: imagine [toy problem that you might give to a 5th grader who is just learning to program]. A [Y] implementation of it might look like this:

[really poorly engineered Y code]

Whereas in [X] you could accomplish the same thing with just

[slickly written X code that shows off syntactic sugar]

It’s pretty clear that the second is easier to understand and less error-prone.

Now consider type systems. [Religious assertion about the relative merits and demerits of static and dynamic typing.] Sure, [Y] gives you [the benefit of Y’s type system or lack thereof] but is this worth [the detriment of Y’s type system or lack thereof]? Obviously not!

Additionally, consider build tools. While [Y] uses [tool that I have never bothered to understand], [X] uses the far superior [tool that I marginally understand]. That’s reason enough to switch!

Finally, think about the development process. [X] has the amazing [X-specific IDE that’s still in pre-alpha], and it also integrates well with [text-editor that’s like 50 years old and whose key-bindings are based on Klingon] and [IDE that everyone uses but that everyone hates]. Sure, you can use [Y] with some of these, but it’s a much more laborious and painful process.

In conclusion, while there is room for polyglotism on the [platform] platform, we would all be well served if you [Y] developers would either crawl into a hole somewhere or else switch to [X] and compete with us for the handful of [X] jobs. Wait, never mind, [Y] is awesome!


Constructive Mathematics in F# (and Clojure)

(Tell me what a terrible person I am on Hacker News.)

For as long as I can remember[1] I’ve dreamed of reimplementing the entirety of mathematics from scratch. And now that I’ve finished the “Wheel of Time” series I have a little bit of extra time on my hands each day, which has allowed me to take baby steps toward my dream.

What this is

An implementation of mathematics in F# (and also in Clojure)

What this is not

An efficient implementation of mathematics in F# (or in Clojure)

You would never want to use this library to do mathematics, as it is chock-full of all sorts of non-tail-recursive function calls that will blow your stack like there’s no tomorrow. (If you don’t know what that means, just take my word that you would never want to use this library to do mathematics.) Instead, this library is an interesting way to learn about

  • how to construct a mathematics from scratch
  • how to implement a mathematics in F# (or Clojure)
  • my bizarre obsessions

As always when I work on stuff like this, the code is on my GitHub.

This was originally just going to be in F#, and then I read a couple of blog posts about ClojureScript, which reminded me I’d been meaning to do something in Clojure, so why not implement the same stuff a second time? (This is why “in Clojure” is in parentheses everywhere, and why the F# code has all the comments.) I tried to make the F# code F#-y and the Clojure code Clojure-y, but I’m not sure how well I succeeded.

I won’t go into excruciating detail about either mathematical theory or F# (or Clojure), but hopefully you can understand both from the detail I do go into. I will also only call out a few high points of each codebase; if you want more gory details, check out GitHub.

Both sets of code have handfuls of tests written, which should give you a good sense of how both libraries operate.

Comparisons

In F#, I’ll define a discriminated union

type Comparison = LessThan | Equal | GreaterThan

In Clojure you don’t typically use “types”, but we can just use keywords :less-than and :equal and :greater-than.

Natural Numbers

We’ll define these recursively. A natural number is either

  • “One” (which is just some thing, forget that you’re already familiar with a “one”), or
  • the “Successor” of a different natural number

Anything you can make using these rules is a natural number. Anything that you can’t isn’t.

We’ll call the successor of One “Two”, and the successor of Two “Three”, keeping in mind that at this point those are just names attached to things without any meaning other than “Two is the successor of One” and “Three is the successor of Two”.

In F# we can do this with a discriminated union:

type Natural = One | SuccessorOf of Natural
let Two = SuccessorOf One
// and so on

After trying a lot of things in Clojure, I finally decided the most Clojure-ish Clojurian way was

(defn successor-of [n] {:predecessor n})
(def one (successor-of nil))
(def two (successor-of one))
; and so on

Although the Clojure way at first looks backward, if you think about it both ways define the “successor of One” to be the number whose “predecessor” is One.

Next we’ll want to use this recursive structure to create an arithmetic. For instance, we can easily add two natural numbers:

let rec Add (n1 : Natural) (n2 : Natural) =
    match n1 with
        // adding One to a number is the same as taking its Successor
    | One -> SuccessorOf n2
        // otherwise n1 has a predecessor, add it to the successor of n2
        // idea: n1 + n2 = (n1 - 1) + (n2 + 1)
    | SuccessorOf n1' -> Add n1' (SuccessorOf n2)

Clojure doesn't have built-in pattern-matching, so instead I did something similar using a one? function:

(defn add [n1 n2]
  (if (one? n1)
    (successor-of n2)
    (add (predecessor-of n1) (successor-of n2))))

Both make it easy to create lazy infinite sequences of all natural numbers.

let AllNaturals = Seq.unfold (fun c -> Some (c, SuccessorOf c)) One

and

(def all-naturals (iterate successor-of one))

And (blame it on the natural numbers) both run into trouble when you try to define subtraction. In F# the natural thing to do is return an Option type:

// now, we'd like to define some notion of subtraction as the inverse of addition
// so if n1 + n2 = n3, then you'd like "n3 - n2" = n1
// but this isn't always defined, for instance 
//  n = One - One
// would mean One = One + n = SuccessorOf n, which plainly can never happen
// in this case we'll return None
let rec TrySubtract (n1 : Natural) (n2 : Natural) =
    match n1, n2 with
        // Since n1' + One = SuccessorOf n1', then SuccessorOf n1' - One = n1'
    | SuccessorOf n1', One -> Some n1'
        // if n = (n1 + 1) - (n2 + 1), then
        //    n + n2 + 1 = n1 + 1
        // so n + n2 = n1,
        // or n = n1 - n2
    | SuccessorOf n1', SuccessorOf n2' -> TrySubtract n1' n2'
    | One, _ -> None // "Impossible subtraction"

In Clojure there is no option type, so I just returned nil for a bad subtraction:

(defn try-subtract [n1 n2]
  (cond
    (one? n1) nil
    (one? n2) (predecessor-of n1)
    :else (try-subtract (predecessor-of n1) (predecessor-of n2))))

Integers

The failure of "subtraction" leads us to introduce the Integers, which you can (if you are so inclined) think of as equivalence classes of pairs of natural numbers, where (for instance),

(Three,Two) = (Two,One) = "the result of subtracting one from two" = 
 "the integer corresponding to one"

In F# we can again define a custom type:

type Integer =
| Positive of Natural.Natural
| Zero
| Negative of Natural.Natural

and map to equivalence classes using

let MakeInteger (plus,minus) =
    match Natural.Compare plus minus with
    | Comparison.Equal -> Zero
    | Comparison.GreaterThan -> Positive (Natural.Subtract plus minus)
    | Comparison.LessThan -> Negative (Natural.Subtract minus plus)

whereas in Clojure we just use maps:

(def zero {:sign :zero})
(defn positive [n] {:sign :positive, :n n})
(defn negative [n] {:sign :negative, :n n})

and the very similar

(defn make-integer [plus minus]
  (let [compare (natural-numbers/compare plus minus)]
    (case compare
      :equal zero
      :greater-than (positive (natural-numbers/subtract plus minus))
      :less-than (negative (natural-numbers/subtract minus plus)))))

We can easily define addition and subtraction and even multiplication, but when we get to division we run into problems again. You'd like 1 / 3 to be the number that when multiplied by three yields one. But there is no such Integer.

let rec TryDivide (i1 : Integer) (i2 : Integer) =
    match i1, i2 with
    | _, Zero -> failwithf "Division by Zero is not allowed"
    | _, Negative _ -> TryDivide (Negate i1) (Negate i2)
    | Zero, Positive _ -> Some Zero
    | Negative _, Positive _ -> 
        match TryDivide (Negate i1) i2 with
        | Some i -> Some (Negate i)
        | None -> None
    | Positive _ , Positive _ ->
        if LessThan i1 i2
        then None // cannot divide a smaller integer by a larger one
        else 
            match TryDivide (Subtract i1 i2) i2 with
            | Some i -> Some (SuccessorOf i)
            | None -> None

and similarly

(defn try-divide [i1 i2]
  (cond
     (zero? i2) (throw (Exception. "division by zero is not allowed"))
     (negative? i2) (try-divide (negate i1) (negate i2))
     (zero? i1) zero
     (negative? i1) (let [td (try-divide (negate i1) i2)]
                           (if td (negate td)))
     :else ; both positive
       (if (less-than i1 i2)
         nil
         (let [td (try-divide (subtract i1 i2) i2)]
           (if td (successor-of td))))))

And if we're clever we can get a lazy sequence of all prime numbers:

let rec IsPrime (i : Integer) =
    match i with
    | Zero -> false
    | Negative _ -> IsPrime (Negate i)
    | Positive Natural.One -> false
    | Positive _ ->
        let isComposite =
            Range Two (AlmostSquareRoot i)
            |> Seq.exists (fun i' -> IsDivisibleBy i i')
        not isComposite 

let AllPrimes =
    Natural.AllNaturals
    |> Seq.map Positive
    |> Seq.filter IsPrime

and in Clojure

(defn prime? [i]
  (cond
    (zero? i) false
    (negative? i) (prime? (negate i))
    (equal-to i one) false
    :else (not-any? #(is-divisible-by i %) (range two (almost-square-root i)))))

(def all-primes
  (->> natural-numbers/all-naturals
    (map positive)
    (filter prime?)))

Rational Numbers

Now, to solve the "division problem", we can similarly look at equivalence classes of pairs of integers, just as long as the second one isn't zero.

// motivated by the "division problem" -- given integers i1 and i2, where i2 not zero,
// would like to define some number q = Divide i1 i2, such that EqualTo i1 (Multiply q i2) 

// proceeding as above, why not define a new type of number as a *pair* (i1,i2) representing
// the "quotient" of i1 and i2.  Again such a representation is not unique, as you'd want
// (Two,One) = (Four,Two) = [the number corresponding to Two]

// when do we want (i1,i2) = (i1',i2') ?  
// when there is some i3 with i1 = i2 * i3, i1' = i2' * i3, or
// precisely when we have i1 * i2' = i1' * i2

// in particular, if x divides both i1 and i2, so that i1 = i1' * x, i2 = i2' * x, then
// i1 * i2' = i1' * x * i2' = i1' * i2, so that (i1, i2) = (i1', i2')

type Rational(numerator : Integer.Integer, denominator : Integer.Integer) =
    let gcd = 
        if Integer.EqualTo Integer.Zero denominator then failwithf "Cannot have a Zero denominator"
        else Integer.GCD numerator denominator
        
    // want denominator to be positive always
    let reSign =
        match denominator with
        | Integer.Negative _ -> Integer.Negate
        | _ -> id

    // divide by GCD to get to relatively prime
    let _numerator = (Integer.Divide (reSign numerator) gcd)
    let _denominator = (Integer.Divide (reSign denominator) gcd)

    member this.numerator with get () = _numerator
    member this.denominator with get () = _denominator

or

(defn rational [numerator denominator]
  (let [gcd (if (integers/equal-to integers/zero denominator)
              (throw (Exception. "cannot have a zero denominator!"))
              (integers/gcd numerator denominator))
        re-sign (if (integers/less-than denominator integers/zero)
                  integers/negate
                  (fn [i] i))]
    {:numerator (integers/divide (re-sign numerator) gcd),
     :denominator (integers/divide (re-sign denominator) gcd)}))

There is lots of extra code around the rationals, although it's hard to run into limitations as we did before. The most common limitation is that there's no rational whose square is two, but it's hard to run into that limitation without reasoning outside the system.

Real Numbers

Two common ways of constructing the real numbers from the rationals are Dedekind Cuts and equivalence classes of Cauchy Sequences. Neither is easy to implement in code.

Instead, I found a way to specify real numbers as Cauchy sequences along with specific Cauchy bounds:

// following http://en.wikipedia.org/wiki/Constructivism_(mathematics)#Example_from_real_analysis
// we'll define a Real number as a pair of functions:
// f : Integer -> Rational
// g : Integer -> Integer
// such that for any n, and for any i and j >= g(n) we have
//  AbsoluteValue (Subtract (f i) (f j)) <= Invert n

type IntegerToRational = Integer.Integer -> Rational.Rational
type IntegerToInteger = Integer.Integer -> Integer.Integer
type Real = IntegerToRational * IntegerToInteger

let Constantly (q : Rational.Rational) (_ : Integer.Integer) = q
let AlwaysOne (_ : Integer.Integer) = Integer.One
let FromRational (q : Rational.Rational) : Real = (Constantly q), AlwaysOne

or

(defn real [f g] {:f f, :g g})
(defn f-g [r] [(:f r) (:g r)])

(defn constantly [q] (fn [_] q))
(defn always-integer-one [_] integers/one)

(defn from-rational [q] (real (constantly q) always-integer-one))

One interesting twist here is that it is impossible to say whether two numbers are equal without reasoning outside the system. For instance, the real number FromRational Rational.Zero is equal to the real number

(Rational.FromInteger >> Rational.Invert, id)

(which represents the sequence 1, 1/2 , 1/3, 1/4, ...), but again you can only reason about that outside of code. Instead you can define CompareWithTolerance which -- given a tolerance -- can tell you that one number is definitively greater than another, or that they're "approximately equal".
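
A rough sketch of what a tolerance-based comparison could look like, written in Python with the built-in Fraction standing in for the from-scratch rationals. The function names and the string results are made up for illustration, and this is an assumption about how CompareWithTolerance might work rather than the code in the repository. The idea: if f(g(n)) is within 1/n of the limit for both numbers, then a gap of more than 2/n between the two approximations settles the ordering for good.

```python
from fractions import Fraction

def from_rational(q):
    """The constant sequence q, q, q, ... with a trivial Cauchy bound."""
    return (lambda i: q, lambda n: 1)

def compare_with_tolerance(x, y, n):
    """Compare two reals given as (f, g) pairs, using approximations good to 1/n."""
    (fx, gx), (fy, gy) = x, y
    a, b = fx(gx(n)), fy(gy(n))
    slack = Fraction(2, n)  # each approximation can be off by at most 1/n
    if a - b > slack:
        return "greater-than"
    if b - a > slack:
        return "less-than"
    return "approximately-equal"

zero = from_rational(Fraction(0))
one = from_rational(Fraction(1))
# the sequence 1, 1/2, 1/3, ... with Cauchy bound g(n) = n
harmonic = (lambda i: Fraction(1, i), lambda n: n)

print(compare_with_tolerance(one, zero, 10))       # greater-than
print(compare_with_tolerance(harmonic, zero, 10))  # approximately-equal
```

No finite n ever lets this report that the harmonic sequence and zero are exactly equal, which is the point of the paragraph above.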

The ultimate test here would be to show that the real number

let SquareRootOfTwo : Real =
    let rationalTwo = Rational.FromInteger Integer.Two
    let sq x = Rational.Subtract (Rational.Multiply x x) rationalTwo
    let sq' x = Rational.Multiply x rationalTwo
    // Newton's method
    let iterate _ guess = Rational.Subtract guess (Rational.Divide (sq guess) (sq' guess))
    let f = memoize iterate Rational.One
    let g (n : Integer.Integer) = n
    f, g

gives you the real number FromRational Rational.Two when you square it. It looks like it should. Unfortunately, trying to do so will blow up the call stack, so it's not advised. Maybe someday I'll go back and try to make everything tail-recursive.
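
For a stack-friendly look at the same Newton iteration, here is a small Python sketch that uses the built-in Fraction instead of the from-scratch rationals and a plain loop instead of recursion. It is not the repository code, and it proves nothing about the limit; it just shows the first few approximations and how close their squares get to two.

```python
from fractions import Fraction

def newton_sqrt2(steps):
    """Return the Newton iterates for x*x - 2 = 0, starting from 1."""
    guess = Fraction(1)
    iterates = [guess]
    for _ in range(steps):
        # guess - f(guess) / f'(guess), with f(x) = x*x - 2 and f'(x) = 2*x
        guess = guess - (guess * guess - 2) / (2 * guess)
        iterates.append(guess)
    return iterates

for q in newton_sqrt2(4):
    print(q, float(q * q - 2))
```

The iterates begin 1, 3/2, 17/12, 577/408, and the error in the square shrinks very quickly from there.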

Gaussian Integers

Another drawback of the Integers is that none of them have negative squares. One way to solve this is by adding a number "i" whose square is negative one. I got kind of bored with these, so I never took them too far and never wrote any tests.

Complex Numbers

The obvious next step would be to add the square root of negative one "i" to the real numbers. But since they're not working so great I never did this.

Conclusion

I spent way too much time on this project, and I really need to get back to other things, like the group-couponing site I'm planning to build, so I'm ready to call this quits. Here are some things I learned:

1. Math is hard.
2. Writing the Clojure versions was more "fun". However,
3. Getting the F# versions to work was much easier, because most of my Clojure bugs would have been caught by a type checker (or were caused by using maps as types and then having them unintentionally decompose).
4. If I put this much work into useful ideas, imagine what I could accomplish!
5. Probably I shouldn't read "Wheel of Time" again.

[1] Which is approximately 1 week.

T-Shirts, Feminism, Parenting, and Data Science, Part 2: Eigenshirts

(You might want to read Part 1 first.)

When last we left off, we’d built a model using shirt colors to predict boy-ness / girl-ness.

Our second attempt will involve the shirt images themselves (sort of). For our purposes, computer images are made up of pixels, each of whose color is determined by specifying red, green, and blue values between 0 and 255. So if you have an image with N pixels, you can think of it as a point in 3N-dimensional space, all of whose coordinates lie between 0 and 255.

And as before, we can build a linear model to classify points in space using logistic regression. The trick here is that the images have different sizes (and hence different numbers of pixels). So as a first step, we’ll rescale every image to 138 pixels x 138 pixels = 19,044 pixels. (A lot of our images are this size, and the rest are mostly larger, which is why I chose it.) This will give us a representation of each t-shirt image as a point in 57,132-dimensional space. (Visualizing 57,132-dimensional space is tricky, so don’t feel bad if you can’t do it.)
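
To make the "image as a point in space" idea concrete, here is a minimal Python sketch using plain lists rather than an imaging library. The nearest-neighbour resize is an assumption (the post does not say which resampling method was used), and the function names and the tiny 2 x 2 image are made up for illustration.

```python
def resize_nearest(pixels, width, height, new_width, new_height):
    """Nearest-neighbour resize of a row-major list of (r, g, b) pixels."""
    resized = []
    for y in range(new_height):
        src_y = y * height // new_height
        for x in range(new_width):
            src_x = x * width // new_width
            resized.append(pixels[src_y * width + src_x])
    return resized

def flatten(pixels):
    """Turn N (r, g, b) pixels into one vector of 3N numbers."""
    return [channel for pixel in pixels for channel in pixel]

# a made-up 2 x 2 "shirt": red, green, blue, and white pixels
tiny = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
vector = flatten(resize_nearest(tiny, 2, 2, 138, 138))
print(len(vector))  # 57132
```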

Our dataset only contains about 1,000 shirts, which means that a 57,000-dimensional classifier would learn to identify every shirt in the training dataset rather than figure out what distinguishes the boys shirts from the girls shirts. This means we need to do some sort of dimensionality reduction to get our t-shirt images into a much lower-dimensional space.

Here we’ll use Principal Component Analysis, which finds the direction (in 57,132-dimensional space) that accounts for the largest amount of variance in the dataset. It then subtracts out this direction, finds the most-variant-direction of the new dataset, and so on, until it has enough components.
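
The "find the top direction, subtract it out, repeat" description maps fairly directly onto power iteration with deflation. The sketch below is one textbook way to do that in plain Python; it is an assumption that this matches the description, and the real code may well have called a library PCA routine instead. All function names and the toy data are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    """Subtract the per-column mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def top_direction(rows, iterations=200, seed=0):
    """Power iteration: the unit vector along which the rows vary the most."""
    rng = random.Random(seed)
    v = [rng.random() for _ in rows[0]]
    for _ in range(iterations):
        # multiply v by X^T X without ever building that matrix
        projections = [dot(row, v) for row in rows]
        v = [sum(p * row[j] for p, row in zip(projections, rows))
             for j in range(len(v))]
        norm = math.sqrt(dot(v, v))
        v = [x / norm for x in v]
    return v

def remove_direction(rows, v):
    """Subtract from each row its projection onto the unit vector v."""
    deflated = []
    for row in rows:
        p = dot(row, v)
        deflated.append([x - p * vj for x, vj in zip(row, v)])
    return deflated

def principal_components(rows, count):
    rows = center(rows)
    components = []
    for _ in range(count):
        v = top_direction(rows)
        components.append(v)
        rows = remove_direction(rows, v)
    return components

# toy 2-dimensional data lying close to the line y = x
points = [[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9]]
components = principal_components(points, 2)
print(components[0])
print(components[1])
```

On the toy data the first component comes out close to the diagonal direction (up to sign), and the second is perpendicular to it.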

(As always, code is on GitHub.)

I ended up using 10 components, which gives a representation of each t-shirt as just 10 numbers, representing the projection of the (57,132-dimensional representation of the) shirt onto the first 10 principal components, each of which is itself a vector in 57,132-dimensional space. For instance, the first principal component is the 57,132-element vector

[0.0002334, 0.00029256, 0.00042805, … , 0.00051605]

By thinking of this as a vector of 19,044 rgb triplets, and by rescaling it so that its smallest component is 0 and its largest component 255, we can convert it into an image of an eigenshirt representing the “essence” of this component. Shirts with a large value for the first component will tend to be “similar” to this eigenshirt. Shirts with a large negative value for the first component will tend to be “similar” to its color-inverted “anti-eigenshirt”. [We could have just as easily picked the “anti-eigenshirt” as the eigenshirt and flipped the signs of the components.]
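
The rescaling step can be sketched in a few lines of Python. The function names and the six-element stand-in component are made up for illustration, and rounding to the nearest integer is an assumption about how the values were turned into pixel intensities.

```python
def to_eigenshirt(component):
    """Rescale a component to 0..255 and group it into (r, g, b) triplets.

    Assumes the component is not constant (otherwise high == low).
    """
    low, high = min(component), max(component)
    scaled = [round(255 * (x - low) / (high - low)) for x in component]
    return [tuple(scaled[i:i + 3]) for i in range(0, len(scaled), 3)]

def invert(pixels):
    """Color-invert an image to get the "anti" version."""
    return [tuple(255 - c for c in pixel) for pixel in pixels]

# a made-up 6-element "component" standing in for the 57,132-element one
component = [0.0002, 0.0003, 0.0004, -0.0001, 0.0000, 0.0005]
shirt = to_eigenshirt(component)
print(shirt)
print(invert(shirt))
```

Inverting the colors gives the same picture (up to rounding) as negating the component and rescaling it, which is why either one could have been called the eigenshirt.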

The below table shows, for each of the 10 principal components, the eigenshirt, the shirt with the largest component value, the shirt with the closest-to-zero value, the shirt with the largest negative component value, and the “anti-eigenshirt”.

[Table of images, one row per component. Columns: Eigenshirt | Most Eigenshirty | Not Eigenshirty | Most Anti-Eigenshirty | Anti-Eigenshirt]

If I were to try to give qualitative descriptions of these ten components, I guess they would be something like:

Component 0: White -> Black
Component 1: Orange -> Blue
Component 2: Dark sleeved / white sleeveless -> White sleeved / dark sleeveless
Component 3: Wide dark / narrow white -> Narrow dark / wide white
Component 4: ?
Component 5: Green -> Purple
Component 6: White trim / dark shirt -> Dark trim / white shirt
Component 7: Dark long sleeve / white sleeveless -> White long sleeve / dark sleeveless
Component 8: White shirt / dark print -> Dark shirt / white print
Component 9: ?

The Principal Component representation of each shirt is a 10-dimensional vector representing (roughly) where it fits on each of these spectra. For instance, the monkey shirt

monkey_shirt

is represented by the vector

[ -9313, 10067, -149, -4013, -2147, 1574, -296, -954, 1729, -196]

the biggest components of which are “orange” (eigenshirt 1), “dark” (anti-eigenshirt 0), and “narrow” (anti-eigenshirt 3).

If we try to reconstruct the image using just these ten components, we get

monkey_shirt_reconstructed

which seems to have captured orange, short sleeve, and dark graphic. You certainly can’t tell it’s a monkey, though.
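
Projection and reconstruction are both just weighted sums. The Python sketch below assumes the components are unit-length and mutually perpendicular and that the mean image was subtracted before projecting (common for PCA, but an assumption here). The function names and the 4-dimensional toy "images" are made up.

```python
def project(image, mean, components):
    """Coordinates of an image along each (unit-length) component."""
    centered = [x - m for x, m in zip(image, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]

def reconstruct(coordinates, mean, components):
    """Rebuild an approximate image from its component coordinates."""
    image = list(mean)
    for weight, comp in zip(coordinates, components):
        image = [x + weight * c for x, c in zip(image, comp)]
    return image

# made-up 4-dimensional "images" and two orthonormal components
mean = [100.0, 100.0, 100.0, 100.0]
components = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
image = [130.0, 90.0, 120.0, 100.0]

coords = project(image, mean, components)
print(coords)                                 # [20.0, 30.0]
print(reconstruct(coords, mean, components))  # [125.0, 95.0, 125.0, 95.0]
```

The reconstruction is close to the original but not identical, because two components cannot capture everything about a four-number image, just as ten components cannot capture a monkey.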

Predicting

If we try to predict “boy shirt or girl shirt” using just these 10 components, we get a model that’s 93% accurate on the test set. The coefficients (multiplied by 10,000, since they’re small) look like:

Component 0: -2.71 (eigenshirt is girlish)
Component 1: -2.56 (girlish)
Component 2: 3.55 (boyish)
Component 3: 0.53 (weakly boyish)
Component 4: -0.56 (weakly girlish)
Component 5: 5.43 (boyish)
Component 6: -15.9 (very girlish)
Component 7: -4.68 (girlish)
Component 8: 2.73 (boyish)
Component 9: -2.14 (girlish)
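
As a rough illustration of how such a model turns ten numbers into a prediction, the Python sketch below applies the coefficients listed above to the monkey shirt's vector. The model's intercept is not given in the post, so it is set to zero here purely as an assumption, which means the printed score is illustrative and not necessarily what the real model assigned to this shirt.

```python
import math

# coefficients from the list above, undoing the x10,000 scaling
coefficients = [c / 10000 for c in
                [-2.71, -2.56, 3.55, 0.53, -0.56, 5.43, -15.9, -4.68, 2.73, -2.14]]

def score(components, intercept=0.0):
    """Linear score: positive leans "boy shirt", negative leans "girl shirt"."""
    return intercept + sum(w * x for w, x in zip(coefficients, components))

def probability_boy(components, intercept=0.0):
    """Logistic function of the score."""
    return 1 / (1 + math.exp(-score(components, intercept)))

monkey_shirt = [-9313, 10067, -149, -4013, -2147, 1574, -296, -954, 1729, -196]
print(round(score(monkey_shirt), 2))
print(round(probability_boy(monkey_shirt), 2))
```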

As before, we can look at how the shirts are distributed as a function of the score they get from the model:

shirts_by_score

The miscategorized shirts generally have low (close to 0) scores, except for one particularly “girly” boys shirt that we’ll see below.

Superlatives

Girliest Girl (looks like it is based on shape and colors)

girliest_girl

Girliest Boy (shape and colors again)

girliest_boy

Boyiest Boy (da Bears)

boyiest_boy

Boyiest Girl (same one as last time!)

boyiest_girl

This is all very interesting and hints at Platonic ideal shirts (the philosophical details of which are out of scope for this blog). And clearly it does a much better job of predicting “boy shirt or girl shirt” than our previous color-based attempt. But whereas everyone knows about colors (except for the color-blind, of course), most people are unfamiliar with “eigenshirts” and will accuse you of having made them up just in order to have something to blog about. In particular, the girl who works at Gap Kids was entirely unimpressed with this model, and said that I needed to either buy something or leave the store.

Were I really committed to this model, I’d probably do more work to get the images comparable to each other so that not only were they the same size but the shirts were oriented as closely as possible and all had the same background color. Alas, I’m sort of principal-componented-out, and am eager to get back to writing my blog post about “the only correct way to interview engineers”, the punch-line of which is that you should only ask questions that involve golf balls, piano tuners, counterfeit coins, airplanes, treadmills, or piano tuners.

And so we leave things until part 3, “Shirt Language Processing”, which will be forthcoming at some point after I muster up the motivation to either transcribe the shirt images or find an intern to do it for me.

T-Shirts, Feminism, Parenting, and Data Science, Part 1: Colors

Before I was a parent I never gave much thought to children’s clothing, other than to covet a few of the baby shirts at T-Shirt Hell. Now that I have a two-year-old daughter, I have trouble thinking of anything but children’s clothing. (Don’t tell my boss!)

What I have discovered over the last couple of years is that clothing intended for boys is fun, whereas clothing intended for girls kind of sucks. There’s nothing inherently two-year-old-boy-ish about dinosaurs, surfing ninjas, skateboarding globes, or “become-a-robot” solicitations, just as there’s nothing inherently two-year-old-girl-ish about pastel-colored balloons, or cats wearing bows, or dogs wearing bows, or ruffles. Forget about gender, I want Madeline to grow up to be a “surfing ninja” kind of kid, not a “cats wearing bows” kind of kid. An “angry skateboarding dog” kind of kid, not a “shoes with pretty ribbons” kind of kid.

Accordingly, I have taken to buying all of Madeline’s shirts in the boys section, the result of (her boy-ish haircut and) which is that half the time people refer to her as “he”. This doesn’t terribly bother me, especially if she ends up getting the gender wage premium that people are always yammering about on Facebook, but it makes me wonder why there is such a stark divide between toddler boy shirts and toddler girl shirts. And, of course, it makes me wonder if the divide is so stark that I can build a model to predict it!

The Dataset

I downloaded images of every “toddler boys” and “toddler girls” t-shirt from Carters, Children’s Place, Crazy 8, Gap Kids, Gymboree, Old Navy, and Target. Because each one had their shirts at a different (random) website location, I decided that using an Image Downloader Chrome extension would be quicker and easier than writing a scraping script that worked with all the different sites.

I ended up with 616 images of boys shirts and 446 images of girls shirts. My lawyer has advised me against redistributing the dataset, although I might if you ask nicely.

Attempt #1: Colors

(As always, the code is on my GitHub.)

A quick glance at the shirts revealed that boys shirts tend toward boy-ish colors, girls shirts toward girl-ish colors. So a simple model could just take into account the colors in the image. I’ve never done much image processing before, so the Pillow Python library seemed like a friendly place to start. (In retrospect, a library that made at least a half-hearted attempt at documentation would probably have been friendlier.)

Pillow’s Image objects have a getcolors method that returns a list of

(# of pixels, (red, green, blue))

for each rgb color in the image. This gives 256 * 256 * 256 = almost 17 million possible colors, which is probably too many, so I quantized the colors by bucketing each of red, green, and blue into either [0,85), [85,170) or [170,255]. This gives 3 * 3 * 3 = 27 possible colors.
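
The bucketing can be sketched in Python as below. The function names and the choice to number the 27 colors as 9 * red + 3 * green + blue are made up for illustration; any consistent numbering would do.

```python
def bucket(value):
    """Map a 0..255 channel value to 0, 1, or 2."""
    return min(value // 85, 2)  # 255 // 85 is 3, so clamp it into the top bucket

def color_index(red, green, blue):
    """Combine the three channel buckets into a single index from 0 to 26."""
    return 9 * bucket(red) + 3 * bucket(green) + bucket(blue)

print(color_index(0, 0, 0))        # 0
print(color_index(255, 255, 255))  # 26
print(color_index(84, 85, 170))    # 5
```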

To make things even simpler, I only cared about whether an image contained at least one pixel of a given color [bucket] or whether it contained none. This allowed me to convert each image into an array of length 27 consisting only of 0’s and 1’s.
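
Here is a minimal Python sketch of that conversion, taking getcolors-style pairs as input. The function name, the bucket numbering, and the sample colors are made up. One caveat for real Pillow use: getcolors returns None when an image has more distinct colors than its maxcolors argument (256 by default), so a photo-like image would need a larger limit.

```python
def color_presence(colors):
    """Convert getcolors-style (count, (r, g, b)) pairs into 27 zeros and ones."""
    present = [0] * 27
    for _count, (red, green, blue) in colors:
        r, g, b = (min(v // 85, 2) for v in (red, green, blue))
        present[9 * r + 3 * g + b] = 1
    return present

# made-up output for a mostly-white image with a little navy
colors = [(900, (250, 250, 250)), (95, (240, 245, 250)), (5, (10, 20, 90))]
features = color_presence(colors)
print(sum(features))  # 2
```

Note that the pixel counts are ignored entirely: five navy pixels count for as much as nine hundred white ones.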

Finally, I trained a logistic regression model to predict, based solely on the presence or absence of the 27 colors, whether a shirt was a boys shirt or a girls shirt. Without getting too mathematical, we end up with a weight (positive or negative) for each of the 27 colors. Then for any shirt, we add up the weights for all the colors in the shirt, and if that total is positive, we predict “boys shirt”, and if that total is negative, we predict “girls shirt”.
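
For the curious, a bare-bones logistic regression fits in a few lines of Python. This sketch uses simple gradient steps and a made-up dataset with three color buckets instead of 27. The post does not say which library did the real fitting, so the function names, the learning rate, and the presence of an intercept term are all assumptions.

```python
import math

def train(rows, labels, steps=2000, rate=0.5):
    """Fit one weight per feature (plus an intercept) by gradient ascent."""
    weights = [0.0] * len(rows[0])
    intercept = 0.0
    for _ in range(steps):
        for row, label in zip(rows, labels):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            error = label - 1 / (1 + math.exp(-z))  # label: 1 = boys, 0 = girls
            intercept += rate * error / len(rows)
            weights = [w + rate * error * x / len(rows)
                       for w, x in zip(weights, row)]
    return weights, intercept

def predict(weights, intercept, row):
    """Add up the weights of the colors present and look at the sign."""
    total = intercept + sum(w * x for w, x in zip(weights, row))
    return "boys shirt" if total > 0 else "girls shirt"

# made-up data: color 0 only shows up on boys shirts, color 1 only on girls shirts
rows = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
labels = [1, 1, 0, 0]
weights, intercept = train(rows, labels)
print(weights)
print([predict(weights, intercept, row) for row in rows])
```

On this toy data the first color ends up with a positive ("boyish") weight and the second with a negative ("girlish") one, while the third, which appears on both kinds of shirt, stays near zero.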

I trained the model on 80% of the data and measured its performance on the other 20%. This (pretty stupid) model predicted correctly about 77% of the time.
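
A simple way to make such a split is to shuffle and cut, as in the Python sketch below. Whether the real split was random, seeded, or balanced between boys and girls shirts is not stated, so treat the details here as assumptions.

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and cut it into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 616 boys shirts and 446 girls shirts, as in the dataset described above
shirts = [("boy", i) for i in range(616)] + [("girl", i) for i in range(446)]
train, test = train_test_split(shirts)
print(len(train), len(test))  # 849 213
```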

Plotted below is the number of boys shirts (blue) and girls shirts (pink) in the test set by the score assigned them in the model. Without getting into gory details, a score of 0 means the model thinks it’s equally likely to be a boys shirt or a girls shirt, with more positive scores meaning more likely boys shirt and more negative scores meaning more likely girls shirt. You can see that while there’s a muddled middle, when the model is really confident (in either direction), it’s always right.

shirts_by_score

If we dig into precision and recall, we see

P(is actually girl shirt | prediction is “girl shirt”) = 75%
P(is actually boy shirt | prediction is “boy shirt”) = 77%
P(prediction is “girl shirt” | is actually girl shirt) = 63%
P(prediction is “boy shirt” | is actually boy shirt) = 86%

One way of interpreting the recall discrepancy is that it’s much more likely for girls shirts to have “boy colors” than for boys shirts to have “girl colors”, which indeed appears to be the case.
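
The four numbers above are precision (the first two) and recall (the last two) for each label. A small Python sketch of the calculation, on made-up labels rather than the real test set:

```python
def precision_recall(actual, predicted, label):
    """Precision and recall for one label, given parallel lists."""
    true_pos = sum(a == label and p == label for a, p in zip(actual, predicted))
    predicted_pos = sum(p == label for p in predicted)
    actual_pos = sum(a == label for a in actual)
    return true_pos / predicted_pos, true_pos / actual_pos

# made-up labels, not the real test set
actual    = ["girl", "girl", "girl", "girl", "boy", "boy", "boy", "boy", "boy", "boy"]
predicted = ["girl", "girl", "boy",  "boy",  "boy", "boy", "boy", "boy", "boy", "girl"]

print(precision_recall(actual, predicted, "girl"))  # precision 2/3, recall 2/4
print(precision_recall(actual, predicted, "boy"))   # precision 5/7, recall 5/6
```

The toy data shows the same shape as the real results: girls shirts that get called "boy" drag down girl recall much more than the reverse mistake drags down boy recall.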

Superlatives

Given this model, we can identify

The Girliest Girls Shirt (no argument from me):

girliest_girl_shirt

The Boyiest Girls Shirt (must be the black-and-white and lack of color?):

boyiest_girl_shirt

The Girliest Boys Shirt (I can see that if you just look at colors):

girliest_boy_shirt

The Boyiest Boys Shirt (a slightly odd choice, but I guess those are all boy-ish colors?):

boyiest_boy_shirt

The Most Androgynous Shirt (this one is most likely some kind of image compression artifact: the main colors are boyish, but the image also has some girlish purple pixels in it that cancel those out):

most_androgynous

The Blandest Shirt (for sure!):

blandest

The Most Colorful Shirt (no argument with this one either!):

coloriest

Scores for Colors

By looking at the coefficients of the model, we can see precisely which colors are the most “boyish” and which are the most “girlish”. The results are not wholly unexpected:

151.71
80.68
69.35
49.69
43.83
40.99
35.94
30.56
26.08
24.06
20.89
20.49
18.89
17.67
1.29
-17.37
-21.77
-29.95
-49.91
-56.4
-66.77
-69.52
-70.15
-82.17
-119.1
-175.2
-224.74

In Conclusion

In conclusion, by looking only at which of 27 colors are present in a toddler t-shirt, we can do a surprisingly good job of predicting whether it’s a boys shirt or a girls shirt. And that pretty clearly involves throwing away lots of information. What if we were to take more of the actual image into account?

Coming soon, Part 2: EIGENSHIRTS

ESPN, Race, and Presidents

Inspired by (and lifting large amounts of code from) Trey Causey’s investigation of the language that ESPN uses to discuss white and non-white quarterbacks, I similarly wondered about the language ESPN uses to discuss white and non-white Presidents. For instance, a common stereotype is that non-white Presidents assassinate their citizens using unmanned drones, while white Presidents assassinate their citizens using polonium-210. Do such stereotypes creep into sportswriting?

Toward that end, I used Scrapy to scrape all the articles from the ESPN website that matched searches for (president obama), (president bush), (president clinton), and so on. This gave me a total of 543 articles. Then, using Wikipedia, Mechanical Turk, and a proprietary deep learning model, I categorized each of these Presidents as either “white” or “non-white”.

Using NLTK, I tokenized each article into sentences and then identified each sentence as being about

  • one or more white Presidents
  • one or more non-white Presidents
  • both white and non-white Presidents
  • no presidents
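
A rough Python sketch of this labeling step, using a crude regular-expression splitter as a stand-in for NLTK's sentence tokenizer. The function names, the name sets passed in, and the matching-by-last-name rule are all assumptions for illustration; the real sentence-identification logic lives in the repository.

```python
import re

def split_sentences(text):
    """Crude stand-in for a real sentence tokenizer: split after . ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def classify_sentence(sentence, white_names, non_white_names):
    """Label a sentence by which group(s) of Presidents it mentions."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    has_white = bool(words & white_names)
    has_non_white = bool(words & non_white_names)
    if has_white and has_non_white:
        return "both"
    if has_white:
        return "white"
    if has_non_white:
        return "non-white"
    return "none"

# made-up text; the name sets follow the searches mentioned above
text = ("President Obama filled out his bracket. "
        "President Bush threw out the first pitch. Nobody scored.")
labels = [classify_sentence(s, {"bush", "clinton"}, {"obama"})
          for s in split_sentences(text)]
print(labels)  # ['non-white', 'white', 'none']
```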

Curiously, while there were very few “non-white” Presidents, there were nonetheless about four times as many “non-white” sentences as “white” sentences. (This is itself an interesting phenomenon that’s probably worth investigating.)

I then split each sentence into words and counted how many times each word appeared in “white”, “non-white”, “both”, and “none” sentences. Like Trey, I followed the analysis here, similarly excluding stopwords and proper nouns, which I inferred based on capitalization patterns.

Finally, for each word I computed a “white percentage” and “non-white percentage” by looking at how likely that word was to appear in a “white” sentence or a “non-white” sentence and adjusting for the different numbers of sentences.
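
The exact adjustment is not spelled out here, so the Python sketch below shows just one plausible reading: turn each count into a per-sentence rate, then ask what share of the combined rate comes from each category. The function name, argument layout, and sample numbers are made up.

```python
def adjusted_percentages(word_counts, sentence_counts):
    """Share of a word's per-sentence rate that comes from each category.

    word_counts maps category -> occurrences of one word in that category;
    sentence_counts maps category -> number of sentences in that category.
    """
    rates = {cat: word_counts.get(cat, 0) / sentence_counts[cat]
             for cat in sentence_counts}
    total = sum(rates.values())
    return {cat: rate / total for cat, rate in rates.items()} if total else rates

# a made-up word seen 2 times in 100 "white" sentences and 8 times in 400 "non-white" ones
print(adjusted_percentages({"white": 2, "non-white": 8},
                           {"white": 100, "non-white": 400}))
# {'white': 0.5, 'non-white': 0.5}
```

Without the adjustment this word would look four times more "non-white" than "white", simply because there are four times as many "non-white" sentences.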

After all that, here are the words that were most likely to appear in sentences about “white” Presidents:

plaque 5
severed 4
grab 4
investigation 3
worn 3
unable 3
child 3
suppose 3
block 3
living 3
holders 3
pounds 3
ticket 3
blackout 3
thrown 3
exercise 3
scene 3
televised 3
upon 3
executives 3

Clearly this reads like something out of “CSI” or possibly “CSI: Miami”. If I were to make these words into a story, it would probably be something macabre like

The President grabbed the plaque he’d secretly made from a living child’s severed foot and worn sock. The investigation supposed a suspect weighing at least 200 pounds who could have thrown the victim down the block, not a feeble politician famous for his televised blackout when he tried to exercise but was unable to grab his toes.

In contrast, here are the words most likely to appear in sentences about “non-white” Presidents:

bracket 32
interview 21
trip 16
champions 16
fan 48 1
asked 35 1
carrier 11
celebrate 11
thinks 11
early 11
eight 11
personal 10
picks 10
appearance 10
far 9
hear 9
congratulating 9
given 9
troops 9
safety 9
fine 9
person 9

This story would have to be something uplifting like

The President promised to raise taxes on every bracket before ending the interview. As a huge water polo fan, he needed to catch a ride on an aircraft carrier for his trip to celebrate with the champions. “Sometimes I get asked,” he thinks, “whether it’s too early to eat a personal pan pizza with eight toppings. So far I always say that I hear it’s not.” His safety is a given, since he’s surrounded by troops who are always congratulating him for being a fine person with a fine appearance.

As you can see, it has a markedly different tone, but not in a way that obviously correlates with the stereotypes mentioned earlier. Whatever prejudices lurk at ESPN are exceedingly subtle.

Obviously, this is only the tip of the iceberg. The algorithm for identifying which sentences were about Presidents is pretty rudimentary, and the word-counting NLP techniques used here are pretty basic. Another obvious next step would be to pull in additional data sources like Yahoo! Sports or SI.com or FOX Sports.

If you’re interested in following up, the code is all up on my github, so have at it! And I’d love to hear your feedback.

Three Keys to Successful Parenting

Now that Madeline is two, it seems appropriate to declare myself a success as a parent. Which means it’s now appropriate for those of you with kids (as well as those of you thinking about having or abducting kids) to ask me, “Joel, what’s your secret?” Which means it’s now appropriate for me to say “I’m glad you asked,” and then write a blog post about it.

1. Improv

I’m sure many of you wondered why I took all those improv classes, and why I made you come watch my improvised musical where we could only use words that started with a letter suggested by the audience, and why I didn’t stop the guy in the second row from choosing ‘X’, and why my song “Xerox Xevious” sounded exactly like “Summer of ’69.”

Well, it turns out that improv is a very easy way to become a better parent. (And that all of my songs sound exactly like “Summer of ’69”.)

Before improv

“Daddy, can I have some more candy?”
“No. Go to bed.”

After improv

“Daddy, can I have some more candy?”
“Yes, and after your teeth rot and you become obese and get diabetes and have to have your foot amputated, then you should go to bed.”

Before improv

“Daddy, where do babies come from?”
“Go ask your mother.”

After improv

“Daddy, where do babies come from?”
[sits down on a plain black box, mimes that it’s maybe some kind of pirate seat on some kind of pirate boat, and starts in a pirate accent] “Yarr, ye land lubbers always be asking me questions about babies … [10 minute monologue in a pirate voice about pirate-y things that cleverly reincorporates elements from earlier in the conversation] Arr, go ask the first mate!”

Before improv

“Daddy, I need to go to the bathroom.”
“Again? You just went!”

After improv

“Daddy, I need to go to the bathroom.”
“DING! Now in the style of Shakespeare.”
“Daddy, I need to go to the bathroom!”
“DING! Now in the style of film noir.”
“Daddy, I NEED to GO to the BATHROOM!”
“DING! Now in the style of a fetish video.”
“Daddy, I peed my pants.”
“And scene!”

2. Radical Libertarianism

Most books (with the notable exception of *Praxeological Parenting*) will tell you that moderate libertarianism is all you need to be a good parent. But there are a great many parenting problems that a belief in the night-watchman state does little to solve.

For instance, when your kid doesn’t want to go to school because it’s a brainwashing factory designed to grind young impressionable minds into submission by (among other things) forbidding them from leaving their seats or talking “out of turn” or using the restroom without first obtaining permission, the moderate libertarian answer is typically to offer them a voucher that covers the tuition to a different brainwashing factory. Your kid is unlikely to find this satisfying, for obvious reasons.

Similarly, when your kid wants to BitTorrent the Criterion Director’s Cut version of Dora the Explorer, the wishy-washy moderate libertarian “you wouldn’t download a Dora the Explorer handbag!” position on intellectual property is not going to make her particularly happy.

And what will you tell her when she asks (as all kids inevitably do) how granting a monopoly on violence could possibly be a good way to prevent monopolies and violence? Or why the dinosaurs on “Dinosaur Train” are able to peaceably resolve their various conflicts despite living approximately 66 million years before the invention of government? Or why it’s OK for the government to take pieces of paper out of daddy’s wallet just as long as they don’t take too many, while she gets punished for taking even one, and don’t try to give me any of that John Rawls “veil of ignorance” stuff, I might have bought that crap when I was an infant, but now that I’m TWO YEARS OLD the flaws in his “logic” are pretty glaringly obvious?

Whereas radical libertarianism easily sidesteps all these problems, making parenting a breeze (relatively speaking).

3. Trolling

Did you ever imagine that all those years you wasted trolling that idiot Marxist kid on LiveJournal debate would end up being useful? Because they are! Kids love being trolled! Love it! Here are a few of Madeline’s favorite trolls:

“My Hippo”

This one’s easy, you just pick up something that belongs to the kid (e.g. a stuffed hippo) and troll that it’s yours:

“Hey, my hippo.”
“No, MY hippo!”
“I’m pretty sure this is daddy’s hippo.”
“No, MY hippo!”
“Does it have your name on it?”
“MY hippo!”
“It was just lying on the floor and I homesteaded it.”
“MY hippo!”
“Have your protection agency call my protection agency and maybe we can work something out.”
“MY hippo!”
“Behind the veil of ignorance it could just as easily have been my hippo.”
“MY hippo!”
[ several hundred lines of dialogue removed due to space constraints ]
“Yeah, but what does it really mean to ‘own’ something?”
“MY hippo!”
“And scene!”

“Science Project”

Part of being a parent is helping your kids with science projects, so help them “demonstrate” something that isn’t real, like cold fusion, or quantum computing, or evolution. Chances are their teachers won’t know the difference, which makes it also work on another level.

“9/11 Trutherism”

Kids will believe just about anything, even that that third WTC 7 skyscraper would just collapse on its own despite not even being hit by a plane. Even so, it’s not very hard to convince them that the towers were brought down on 9/11 by controlled demolition using explosives secretly planted in advance by the government in order to create an excuse to invade Iraq and Afghanistan in order to pave the way for a new American hegemony. And then they’ll repeat this on the playground, and then you’ll get called in for a parent-teacher conference at which you can reveal that you’d assumed that she’d picked up these theories on the playground, which means that if she didn’t then maybe she just came up with them on her own? And that if the official narrative is so shoddy that a 2-year-old can pick holes in it, then maybe Alex Jones is onto something!

“The Craigslist Experiment”

OK, so possibly there are some kinds of trolling kids don’t like.