Matt Howlett

Every Blog Should Have a Tagline

/now person similarity

2016-05-07

I'm intrigued by Derek Sivers' website nownownow.com. I'm intrigued because I don't see why it would take off in a big way, but at the same time I think that it's possible that it might.

Why? Because it's interesting and novel. I think successful projects often start out as interesting, novel ideas, and the path from idea to success is often not obvious from the beginning. For that reason, I think there are worse things to be doing than playing around with interesting, novel ideas.

So I've decided to have a play with nownownow.com myself (on Derek's invitation: "just make something first, and let me know."). Two ideas immediately came to mind:

  1. rather than show a long, randomly ordered list of people under a particular person's profile, it would be better to show a smaller list ordered by some measure of similarity to that person.
  2. search.

I was more drawn to the first idea, so I've started out by developing a proof of concept solution to that problem (though the ideas are somewhat related).

Method

Input Data

The front page of nownownow.com currently lists every person who has submitted their /now page and information about themselves. Underneath their photo is an answer to the question "What do you do?" and this text is usually fairly informative. For this proof of concept, similarity is calculated entirely from this text, which means all required input data can be obtained from a single request to the main page.

It would be relatively straightforward to crawl additional information about each person and I think that is probably the easiest way to improve on the results of this proof of concept.

Word Importance

Consider the following (very simple) person profile:

I write and sell software.

In order to compare this profile with other profiles, it would be very helpful to define a measure of word importance. In this example, the most important word to pick up on is 'software'. The words 'sell' and to a lesser extent 'write' are also somewhat informative. The words 'and' and 'I' should pretty much be ignored for the purpose of deciding whether two descriptions are similar or not.

Conveniently, there is a standard NLP technique that partitions words into categories that are useful for this purpose - part-of-speech (POS) tagging. A POS tagger analyses sentences and assigns to each word a part-of-speech category such as verb, adjective, noun etc.

Putting the above example through the Stanford POS tagger yields the following result:

I/PRP write/VBP and/CC sell/VBP software/NN

In this output:

  • /PRP = 'Personal Pronoun'
  • /VBP = 'Verb, non 3rd ps. sing. present'
  • /CC = 'Coordinating conjunction' and
  • /NN = 'Noun'

The POS category of the most important word in the example sentence (software) is /NN (noun). When I inspected all the nouns across all person profiles, I judged the vast majority of them to be 'important'. Other categories are related to word importance to varying degrees, and I've made a list below which summarizes this. The number after each category reflects its relative usefulness as an indicator of word importance. No scientific method was employed in coming up with this list - I just inspected the data and pulled numbers out of the air.

  • /NN (Noun) 1.0
  • /NNP (Proper singular noun) 1.0
  • /NNS (Plural noun) 0.9
  • /JJ (Adjective) 0.1
  • /VB (Verb, base form) 0.1
  • /VBP (Verb, non 3rd ps. sing. present) 0.1

The first step of the person similarity algorithm is to process each person's 'what do you do?' text with a POS tagger. The second step is to construct a collection of (word, weight) pairs for each person as follows:

  1. Words tagged with a category not contained in the above 'important' list are ignored.
  2. Words tagged with an 'important' category are added to the collection, with the appropriate weight.
  3. If a word is already in the collection it is not added a second time.
  4. It is possible for a particular word to appear more than once in a given profile and for different occurrences of this word to be tagged with a different category depending on context. This is an edge case which is not currently considered properly.
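The rules above can be sketched in a few lines of Java. This is only an illustration, not the code from the repository: the class and method names are made up, it reads text that has already been tagged in the word/TAG format shown earlier rather than calling the Stanford tagger, and it assumes words are lower-cased and that the first occurrence of a word decides its weight (one arbitrary way of handling the edge case in rule 4).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: builds a (word, weight) collection from tagged text.
class ProfileWeights {

    // Weights copied from the 'important' category list above.
    static final Map<String, Double> TAG_WEIGHTS = Map.of(
            "NN", 1.0, "NNP", 1.0, "NNS", 0.9,
            "JJ", 0.1, "VB", 0.1, "VBP", 0.1);

    // Input looks like "I/PRP write/VBP and/CC sell/VBP software/NN".
    static Map<String, Double> fromTagged(String tagged) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // not a word/TAG token
            }
            // Assumption: words are compared case-insensitively.
            String word = token.substring(0, slash).toLowerCase(Locale.ROOT);
            String tag = token.substring(slash + 1);
            Double weight = TAG_WEIGHTS.get(tag);
            if (weight == null) {
                continue; // rule 1: category is not in the 'important' list
            }
            // Rules 2 and 3: add the word once. Assumption: the first
            // occurrence wins if later occurrences carry a different tag.
            weights.putIfAbsent(word, weight);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Prints {write=0.1, sell=0.1, software=1.0}
        System.out.println(fromTagged("I/PRP write/VBP and/CC sell/VBP software/NN"));
    }
}
```

For the example profile, 'I' and 'and' are dropped, the two verbs get a small weight, and 'software' dominates, which matches the intuition described at the start of this section.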

I experimented with variations on this algorithm. In particular, I considered strategies for upweighting words that appeared multiple times in a given person's profile. However, I found that this caused an undesirable bias towards profiles with repeated words.

Word Similarity

Before starting work on this project, I assumed word similarity was going to be very important. For example, 'fish' and 'fisherman' are very closely related, 'fish' and 'boat' are somewhat related etc. I still believe incorporating word similarity into the algorithm could significantly improve results. However, I found it wasn't necessary in order to produce acceptable results, so I've omitted it from the current iteration to keep things as simple as possible.

Before reaching this conclusion though, I did spend a bit of time assessing two strategies for measuring word similarity:

WordNet / WS4J

WordNet is a well-known project out of Princeton University. They explain what it is better than I can:

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

WS4J is a Java library that defines various measures of word similarity based on the relationship between words in the WordNet database.

Unfortunately, I found all of these measures to perform relatively poorly in practice and I think they would provide only marginal, if any, benefit to the person similarity algorithm.

word2vec

word2vec is a project by Tomas Mikolov et al. from Google. It uses an unsupervised learning algorithm to associate each word with a high-dimensional vector. The relative position of words in the resulting vector space can be used to infer things about the relationship between them (here is an excellent presentation on what is possible with word2vec).

For our use case, we can simply use the dot product of two word vectors as a measure of word similarity. In practice I found this number matched what I would have intuitively guessed quite closely.
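As a rough illustration of the dot product measure, here is a small Java sketch. The class name is made up and the three-dimensional vectors are invented toy values, not real word2vec output (real vectors have hundreds of dimensions). The sketch also assumes the vectors are scaled to unit length first, in which case the dot product is the same thing as the cosine similarity.

```java
// Hypothetical helper for comparing two word vectors.
class WordVectorSimilarity {

    // Plain dot product of two vectors of equal dimension.
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Returns a copy of v scaled to unit length (a zero vector is returned unchanged).
    static double[] normalize(double[] v) {
        double norm = Math.sqrt(dot(v, v));
        double[] out = v.clone();
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy vectors, made up for illustration only.
        double[] fish = normalize(new double[] {0.9, 0.1, 0.3});
        double[] fisherman = normalize(new double[] {0.8, 0.2, 0.4});
        double[] software = normalize(new double[] {0.1, 0.9, 0.2});

        System.out.println("fish vs fisherman: " + dot(fish, fisherman));
        System.out.println("fish vs software:  " + dot(fish, software));
    }
}
```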

A downside of word2vec is that it is much more resource intensive than the WordNet-based approach.

Person Similarity

A similarity score for two people can be calculated from their respective (word, weight) collections by determining which words occur in both collections and summing together the associated weights of those words.

A list of the people most similar to a given person can be produced by computing the similarity score between that person and every other person, then sorting by this score.
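The scoring and ranking just described might look something like the following Java sketch. The names and the sample profiles are hypothetical, and one detail is my own assumption: when a shared word has a different weight in the two collections, the smaller of the two weights is added to the score, which keeps the score symmetric.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the naive person similarity ranking.
class PersonSimilarity {

    // Sums the weights of the words that occur in both collections.
    // Assumption: if the two weights differ, the smaller one is used.
    static double score(Map<String, Double> a, Map<String, Double> b) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                total += Math.min(e.getValue(), other);
            }
        }
        return total;
    }

    // Returns up to k names, most similar first. Ties keep the input order.
    static List<String> mostSimilar(String person,
                                    Map<String, Map<String, Double>> profiles,
                                    int k) {
        Map<String, Double> target = profiles.get(person);
        if (target == null) {
            throw new IllegalArgumentException("unknown person: " + person);
        }
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            if (!e.getKey().equals(person)) {
                scored.add(Map.entry(e.getKey(), score(target, e.getValue())));
            }
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> names = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            names.add(scored.get(i).getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up profiles, already reduced to (word, weight) collections.
        Map<String, Map<String, Double>> profiles = new LinkedHashMap<>();
        profiles.put("alice", Map.of("software", 1.0, "write", 0.1));
        profiles.put("bob", Map.of("software", 1.0, "sell", 0.1));
        profiles.put("carol", Map.of("music", 1.0, "write", 0.1));
        profiles.put("dave", Map.of("gardens", 0.9));

        // Prints [bob, carol]
        System.out.println(mostSimilar("alice", profiles, 2));
    }
}
```

Calling mostSimilar once per person does one comparison for every pair of people.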

Unfortunately, producing these lists for everyone with this naive algorithm is O(N^2) in the number of people, so it will not scale to a large number of people. A methodology that is able to work with a large number of people would require a clustering algorithm of some kind. Because the number of people with profiles on nownownow.com is currently relatively small, I haven't bothered with this step yet (perhaps a topic for a future blog post).

Code and Results

Java code that implements the method described in this blog post is available on GitHub.

The output from that software when run on May 7th 2016 is online here.

This work hasn't been integrated into the nownownow.com website (yet!?). For now, the software just produces a list of every person together with the top 4 most similar people to that person.

Given the simplicity of the algorithm, I'm very impressed with how well it performs.