Hacking Hacker News

Hacker News, if you don’t know it, is an aggregator / forum attached to Y Combinator. People submit links to news stories and blog posts, questions, examples, and so on. Other people vote them up or down, and still other people argue about them in the comments sections.

If you have unlimited time on your hands, it’s an excellent firehose for things related to hacking. If your time is more limited, it’s more challenging. People submit hundreds of stories every day, and even if you only pay attention to the ones that get enough votes to make it to the homepage, it’s still overwhelming to keep up.

What’s more, a lot of the stories are about topics that are boring, like OSX and iPads and group couponing. So for some time I’ve been thinking that what Hacker News really needs is some sort of filter for “only show me stories that Joel would find interesting”. Unfortunately, it has no such filter. So last weekend I decided I would try to build one.

Step 1 : Design

To keep things manageable, I made a couple of simplifying design decisions.

First, I was only going to take into account static features of the stories. That meant I could consider their title, their URL, and who submitted them, but not how many comments or votes they had, since those would depend on when they were scraped.

In some ways this was a severe limitation, since HN itself uses the votes to decide which stories to show people. On the other hand, the whole point of the project was that “what Joel likes” and “what the HN community likes” are completely different things.

Second, I decided that I wasn’t going to follow the links to collect data. This would make the data collection easier, but the predicting harder, since the titles aren’t always indicative of what’s behind them.

So basically I would use the story title, the URL it linked to, and the submitter’s username. My goal was just to classify the story as interesting-to-Joel or not, which meant the simplest approach was probably to use a naive Bayes classifier, so that’s what I did.

Step 2 : Acquire Computing Resources

I have an AWS account, but for various reasons I find it kind of irritating. I’d heard some good things about Rackspace Cloud Hosting, so I signed up and launched one of their low-end $10/month virtual servers with (for no particular reason) Debian 6.0.

I also installed a recent Ruby (which is these days my preferred language for building things quickly) and MongoDB, which I’d been meaning to learn for a while.

Step 3 : Collect Data

First I needed some history. A site called Hacker News Daily archives the top 10 stories each day going back a couple of years, and it was pretty simple to write a script to download them all and stick them in the database.
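
The script itself doesn’t need to be anything fancy. Here’s a rough sketch of the idea using open-uri, Nokogiri, and the Ruby mongo driver; the archive URL pattern and the CSS selector are placeholders rather than Hacker News Daily’s actual markup:

```ruby
require 'date'
require 'open-uri'
require 'nokogiri'
require 'mongo'

stories = Mongo::Client.new('mongodb://localhost:27017/hn')[:stories]

(Date.new(2010, 1, 1)..Date.today).each do |day|
  # Placeholder URL pattern and selector -- the real archive pages differ.
  html = URI.open("http://example.com/hn-daily/#{day}.html").read
  Nokogiri::HTML(html).css('.story a').each do |link|
    stories.insert_one(title: link.text, url: link['href'], day: day.to_s)
  end
end
```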

Then I needed to collect the new stories going forward. At first I tried scraping them off the Hacker News “newest” page, but they very quickly blocked my scraping (which I didn’t think was particularly excessive). Googling the problem led me to the unofficial Hacker News API, which has no objection to being scraped, so now I hit it once an hour. (Unfortunately, it seems to go down several times a day, but what can you do?)
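
The hourly fetch works the same way in spirit, except it’s JSON instead of HTML and it has to skip stories that are already in the database. Something along these lines, where the endpoint and field names are made up for illustration (the unofficial API’s real ones differ):

```ruby
require 'json'
require 'open-uri'
require 'mongo'

stories = Mongo::Client.new('mongodb://localhost:27017/hn')[:stories]

# Hypothetical endpoint and field names -- not the unofficial API's actual ones.
items = JSON.parse(URI.open('http://api.example.com/newest.json').read)['items']

items.each do |item|
  # Upsert keyed on URL so stories we've already seen aren't duplicated.
  stories.update_one({ url: item['url'] },
                     { '$setOnInsert' => { title: item['title'],
                                           url:   item['url'],
                                           user:  item['postedBy'] } },
                     upsert: true)
end
```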

Step 4 : Judging Stories

Now I’ve got an ever-growing database of stories. To build a model that classifies them, I need some training data: stories labeled interesting-to-Joel or not. So I wrote a script that pulls all the unlabeled stories from the database, shows them to me one at a time, asks whether I’d want to click on each story, and then saves that judgment back to the database.
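
The judging script is about as bare-bones as a command-line tool gets. A sketch of its shape (the liked field name and the y/n/q interface here are illustrative):

```ruby
require 'mongo'

stories = Mongo::Client.new('mongodb://localhost:27017/hn')[:stories]

# Stories that haven't been judged yet have no 'liked' field.
stories.find(liked: { '$exists' => false }).each do |story|
  puts "#{story['title']}  (#{story['url']})"
  print 'Would you click on this? [y/n/q] '
  answer = STDIN.gets.strip.downcase
  break if answer == 'q'
  stories.update_one({ _id: story['_id'] }, { '$set' => { liked: answer == 'y' } })
end
```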

At first I was judging them most-recent-first, but then I realized I was biasing my training set toward SOPA and PIPA, so I changed it to show them in random order.

Step 5 : Turning Stories into Features

The naive Bayes model constructs probabilities based on features of the stories. This means we need to turn stories into features. I didn’t spend too much time on this, but I included the following features:

* contains_{word}
* contains_{bigram}
* domain_{domain of url}
* user_{username}
* domain_contains_user (a crude measure of someone submitting his own site)
* is_pdf (generally I don’t want to click on these links)
* is_question
* title_has_dollar_amount
* title_has_number_of_years
* title_references_specific_YC_class (e.g. “(YC W12) seeks blah blah”)
* title_is_in_quotes

For the words and bigrams, I removed a short list of stopwords, and I ran them all through a Porter stemmer. The others are all pretty self-explanatory.
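
To give a flavor of it, here’s a stripped-down sketch of the featurizer covering a subset of the features above; the stopword list is abbreviated, the regexes are simplified, and the stemming step is left out:

```ruby
require 'uri'

STOPWORDS = %w[the a an and of to in for on].freeze  # abbreviated stand-in list

# Turn one story (title, url, submitter) into a list of feature names.
def features(story)
  # (The real version also runs the words through a Porter stemmer.)
  words = story[:title].downcase.scan(/[a-z0-9']+/) - STOPWORDS
  feats = words.map { |w| "contains_#{w}" }
  feats += words.each_cons(2).map { |a, b| "contains_#{a}_#{b}" }

  domain = URI.parse(story[:url]).host.to_s.sub(/\Awww\./, '')
  feats << "domain_#{domain}"
  feats << "user_#{story[:user]}"
  feats << 'domain_contains_user'    if domain.include?(story[:user].to_s)
  feats << 'is_pdf'                  if story[:url] =~ /\.pdf\z/i
  feats << 'is_question'             if story[:title].include?('?')
  feats << 'title_has_dollar_amount' if story[:title] =~ /\$\d/
  feats << 'title_references_specific_YC_class' if story[:title] =~ /YC [SW]\d\d/
  feats
end
```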

Step 6 : Training a Model

This part is surprisingly simple:

* Get all the judged stories from the database.
* Split them into a training set and a test set. (I’m using an 80/20 split.)
* Compute all the features of the stories in the training set, and for each feature count (# of occurrences in liked stories) and (# of occurrences in disliked stories).
* Throw out all features that don’t occur at least 3 times in the dataset.
* Smooth each remaining feature by adding an extra 2 likes and an extra 2 dislikes. (2 is on the large side for smoothing, but we have a pretty small dataset.)
* That’s it. We YAML-ize the feature counts and save them to a file.
* For good measure, we use the model to classify the held-out test data and plot a precision-recall curve.
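
Here’s a rough sketch of those steps (minus the precision-recall plot), reusing the features function from the previous section; the judged variable stands in for whatever comes back from the database:

```ruby
require 'yaml'

# Smoothed per-feature like/dislike counts from labeled stories.
def train_counts(judged, min_count: 3, smoothing: 2)
  counts = Hash.new { |h, k| h[k] = { 'likes' => 0, 'dislikes' => 0 } }
  judged.each do |story|
    features(story).each do |f|
      counts[f][story[:liked] ? 'likes' : 'dislikes'] += 1
    end
  end
  counts.select! { |_, c| c['likes'] + c['dislikes'] >= min_count }
  counts.each_value { |c| c['likes'] += smoothing; c['dislikes'] += smoothing }
  counts
end

# 80/20 split, then save the counts; 'test' is held out for the precision-recall check.
training, test = judged.shuffle.partition.with_index { |_, i| i < 0.8 * judged.size }
File.write('model.yaml', YAML.dump(train_counts(training)))
```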

Step 7 : Classifying the Data

A naive Bayes classifier is fast, so it only takes a few seconds to generate and save interesting-to-Joel probabilities for all the stories in the database.
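
Concretely, scoring a story just means combining the like/dislike counts of its features. Here’s a sketch of one standard, spam-filter-style way to do it, working in log space to avoid underflow and assuming equal priors for liked and disliked (the real numbers come from the smoothed counts saved in Step 6):

```ruby
require 'yaml'

counts = YAML.load_file('model.yaml')

# Naive Bayes score: probability the story is interesting, given its features.
def interestingness(story, counts)
  log_like, log_dislike = 0.0, 0.0
  features(story).each do |f|
    c = counts[f]
    next unless c                      # unseen or thrown-out features carry no signal
    total = c['likes'] + c['dislikes']
    log_like    += Math.log(c['likes'].to_f / total)
    log_dislike += Math.log(c['dislikes'].to_f / total)
  end
  1.0 / (1.0 + Math.exp(log_dislike - log_like))
end

# Rank a batch of stories, most interesting-to-Joel first:
# batch.sort_by { |s| -interestingness(s, counts) }
```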

Step 8 : Publishing the Data

This should have been the easiest step, but it caused me a surprising amount of grief. First I had to decide between

* publish every story, accompanied by its probability; or
* publish only stories that met some threshold.

In the long term I’d prefer the second, but while I’m still getting things to work, the first seems like the better choice.

My first attempt involved setting up a Twitter feed and using the Twitter Ruby gem to publish the new stories to it as I scored them. This worked, but it wasn’t a pleasant way to consume them, and anyway it quickly ran afoul of Twitter’s rate limits.

I decided a blog of batched stories would be better, and so then I spent several hours grappling with Ruby gems for WordPress, Tumblr, Blogger, Posterous, and even LiveJournal [!] without much luck. (Most of the authentication APIs were for more heavy-duty use than I cared about — I just wanted to post to a blog using a stored password.)

Finally I got Blogger to work, and after some experimenting I decided the best approach would be to post, once an hour, all the new stories since the last time I posted. Eventually I realized that I should rank the stories by interesting-to-Joel-ness, so that the ones I’d most want to read would be at the top and the ones I’d want to read least would be at the bottom.

The blog itself is at

http://joelgrus-hackernews.blogspot.com/

Step 9 : Automate

This part was pretty easy with two cron jobs. The first, once an hour, goes to the Hacker News API and retrieves all the new unknown stories (up to a limit of like 600, which should never be hit in practice). It then scores them with the last saved model and adds them to the database. (Of course, the API isn’t working half the time.)

The second, a few minutes later, takes all the new stories and posts them to the blog. The end result is a blog of hourly scored digests of new Hacker News posts.
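
The crontab for this is about as small as they come; something like the following, where the script names, paths, and timings are illustrative:

```
# m h  dom mon dow  command                  (script names and paths are illustrative)
0  *   *   *   *    cd /home/joel/hn && ruby fetch_and_score.rb >> log/fetch.log 2>&1
10 *   *   *   *    cd /home/joel/hn && ruby post_to_blog.rb    >> log/post.log  2>&1
```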

Step 10 : Improve the Model

The model can only get better with more training data, which requires me to judge whether I like stories or not. I do this occasionally when there’s nothing interesting on Facebook. Right now this is just the above command-line tool, but maybe I’ll come up with something better in the future.

Step 11 : Profit

I’m still trying to figure this one out. If you’ve got any good ideas, the code is here.

33 thoughts on “Hacking Hacker News”

  1. Espen

    Wonderful, I really enjoy these “gather data and predict outcome” posts. I did a similar effort with lotto numbers a couple of months ago. I didn’t publish the code, on the off chance that anyone would actually win anything by using it.

  2. Pingback: Aussie Startups relaunches « Kate Kendall

  3. Sitaktif

    Cool, I really like the idea!

    I think this could even be developed further by trying to give a higher score to items with “unusual” titles. I say that because I am concerned about falling into the trap of always reading articles on the same subjects. I want to be able to see items with these new “unusual” subjects (e.g. the ones that have words that are not common (yet) in the DB).

  4. Reinder

    Dude,

    Terrific story, very sound concept. Considering the money, why don’t you put it out there as a service? Determine whether there’s demand by looking at your GA for this post. Brand it only-HN, go full for all sorts of RSS, or just try to sell the algorithm behind it to someone who already owns a lot of data.

    Keep it up.

    Reinder

  5. Salvadir

    Really cool! I think you could add a like button on your template so you can easily keep training your model. Thanks for the post!

  6. jcubic

    You should create a site where users can log in and vote on posts, and show them only the relevant stories that might interest them. It would be nice if the HN API had a way to check whether a user voted on a post; in that case you would be able to train the classifier with stories voted on by users.

  7. Jamesb

    As others have said, you could always put it up as a web service: someone pays for an account, they log in, answer the questions to train it for them, and then their feed gets delivered as RSS?

    You could have varying levels of payment for how often their feed gets updated (as this is going to cost you more in cloud processing)….

    …and then you could start writing adapters for other sites too…and you pay for how many sites you want it to recommend stories from…

    …you could also ask people to classify themselves or their interests and then you may not need to get people to train the system themselves?…

    In the age of too-much-info-TL;DR, tools like this are only going to become more and more important.

  8. Vidar Hokstad

    Subtle one, David….

    But yes – any chance you’ll be posting the code? I’ve wanted to do this for a while, but never found the time (or I’m just being lazy)…

  9. Ian George

    I think this is a great idea and a good writeup.

    The only useful improvement I can think of is to write a Greasemonkey script to post anything you click through on the actual HN site back to the classifier.

  10. Newsblur

    Well there is: http://www.newsblur.com/
    Why bug around with old-school mail clients? After reading the title I hoped:
    not Bayesian, not Bayesian, not Bayesian… ohh damnit, another one..
    Sorry, this isn’t meant to be a negative post. I respect your work; I was just a little let down because you didn’t use a Markov chain of some sort with NLP APIs. heh :)

  11. adv4nced

    Great article. Only one note: you started by saying that
    “it’s still overwhelming to keep up” and you ended up with ~700 links per day … ;)

  12. Joel Post author

    In practice there are a fair number of false positives, but very few false negatives. So there are still too many stories with 80%+ probability that I don’t like, but there are barely any stories with < 20% probability that I’d want to read.

    Although there are 700 links a day, I usually scan the top 10-20 in each post and click on maybe 4-5 of them, so it’s not too bad. :)

  13. Brian Armstrong

    Interesting post Joel. This isn’t quite the same, but I wanted to mention that I created a project, http://ribbot.com, to let people create their own Hacker News style site on other topics.

  14. polyglot

    Cool project Joel, but why do you make such a big output?

    Scanning 700 links to find a few stories worth reading sounds like too much work. Are you planning to build another filter, say “a few stories that Joel finds very interesting from the bunch of stories that Joel finds interesting”?

    Just joking, great job.

  15. Joel Post author

    Well, the stories are ordered from most probable to least probable, so “scanning” really means looking at the first few. Also, at some point I’ll probably tweak it to only post links that are (say) 75% or more probable, which will cut the number down substantially.

  16. Colin Lee

    Ah, but couldn’t an enterprising young entrepreneur run this filter in reverse? One could run an analysis on the article scoring to figure out which story features get rated highest on a Bayesian filter system like this, if one takes off. I know I would. Suddenly, we’d see an abundance of stories about startup founders theorizing about ACTA.

  17. Joel Post author

    I’m not sure what “takes off” means here. This particular filter is calibrated to my tastes. If people want to cater to my tastes, they are more than welcome to, but presumably other people would train their filters differently, and so a “caters to Joel” strategy would have limited reach.

  18. John Hunter

    Very cool. It would be nice if you could include data on the users that up-voted the story. My guess is you don’t have access to that data, though. But if you did, I would think that could be a very useful test. Even things like: if these 2 or 3 people liked it, give it super big bonus points… And it could even learn about bozos that like stupid stuff, or spammers, so that big up-votes from those people count as a negative (not a positive).

    At the very beginning of using Reddit I thought they were doing that kind of stuff when displaying the home page. I thought my preferences were being stored so that it could judge what I would like (not just to bump that story up).

    It would be cool if Reddit and/or HN could at least let us say add this person to my recommended list (and then highlight all stories those people up-vote).

  19. Chuck

    Hi Joel.

    I’m very interested in doing something like this (although for a different application and in another area). And by that I mean implementing Naive Bayes for my own amusement. I’ve found your post very encouraging and interesting :)

    Thanks for sharing the code (I’m a pythonista myself, but ruby is not my enemy ;))

    Just wanted to add an observation about your featurizer.rb. I think you should modify your regex for YC classes slightly to include more cases. Lately I’ve seen a lot of people using YCS11, and since I don’t see many erroneous matches for `YC[ SW0-9][0-9]`, I thought that’d catch those few extras (also, it’s limited to summer/winter; you could add F if fall ever becomes relevant). Anyway, use it if you want :)

    – Chuck

  22. Pingback: Living in a Borg society » liftoff

Comments are closed.