Who Are You? A Statistical Approach to Protecting LinkedIn Logins

January 22, 2016

  • Password Wordcloud

Can you spot your LinkedIn password in the image above? Is your LinkedIn password the same as your password on another website? Have you been a victim of a phishing attempt? If so, your username and password might be in an attacker’s database without you even knowing it! Not to fear, though: LinkedIn works hard to protect member accounts against attackers who have your username and password!

As a step towards protecting member data, LinkedIn constantly runs various machine learning models to identify accounts that have been taken over and are being used for malicious activities like spamming. However, that’s only part of the larger issue. To fully protect members, we work to prevent attackers from getting into the target account in the first place. This is where the problem gets interesting! With both the username and password correct, how do we know whether it is the real user (say Alice) logging in, or a hacker (say Eve) who stole Alice’s credentials? Can we detect an attack at the time of login, after the member/attacker has clicked on the “Submit” button and is waiting for the home page to load, without affecting the user experience for genuine users?

The first step is to note that with each login we not only have Alice’s username and password, but also attributes like the IP address (and hence the geographic location) and the user agent (and the browser and operating system derived from it) of her current session (1). We also maintain aggregated information (2) about each IP, browser, and OS seen on the site and compute statistics such as how commonly we observe a given IP address, or how often we see abuse from a given browser. All of this information helps us capture Alice’s usual patterns and identify her in a unique way. Now the question is how to combine this knowledge with Alice’s past login history to determine if the current attempt is suspicious.
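As a rough illustration of the aggregated statistics described above, the following minimal sketch counts how often an IP address is seen and how often a browser is associated with abuse. The log records, field layout, and function names are all hypothetical, and the abuse flag is assumed to come from some separate labeling process.

```python
from collections import Counter

# Hypothetical login log: (ip, browser, flagged_as_abuse). All values are made up.
logins = [
    ("198.51.100.7", "Chrome", False),
    ("198.51.100.7", "Chrome", False),
    ("203.0.113.9", "Firefox", True),
    ("203.0.113.9", "Chrome", False),
]

ip_counts = Counter(ip for ip, _, _ in logins)
browser_counts = Counter(browser for _, browser, _ in logins)
browser_abuse = Counter(browser for _, browser, abusive in logins if abusive)


def ip_frequency(ip):
    """Fraction of all logged sessions that came from this IP."""
    return ip_counts[ip] / len(logins)


def browser_abuse_rate(browser):
    """Fraction of sessions from this browser that were flagged as abuse."""
    seen = browser_counts[browser]
    return browser_abuse[browser] / seen if seen else 0.0
```

A browser that never appears in the log gets an abuse rate of 0.0 here, which is a simplifying choice for the sketch rather than a recommendation.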

Working with collaborators from Ruhr-Universität Bochum and Università di Cagliari, we have devised a formula that allows us to compute the relative probabilities of the current login attempt being legitimate or an attack, given all the data associated with this login. In addition to the indicators noted above, we include the member’s login history and the probability of abuse seen from that IP or user agent (as determined by a separate internal model). Mathematically, we are trying to determine whether the following expression is true.

  • Password Image 1

In this formula, u is the member and X is the vector containing all of the data collected from the login attempt. The numerator denotes the probability of the current login attempt being an attack for member u and dataset X, while the denominator denotes the probability of the current login being legitimate.
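Written out from the description above, the test presumably compares a ratio of two conditional probabilities against a cutoff. The following is only a sketch of that reading: the threshold symbol \tau, its value, and the direction of the inequality are assumptions that the text does not state.

```latex
\frac{\Pr[\text{attack} \mid u, X]}{\Pr[\text{legitimate} \mid u, X]} > \tau
```

Under this reading, a larger ratio means the login looks more like an attack, and \tau controls how aggressively logins are flagged.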

One of the major hurdles with computing the above probability as stated is that login data per member is sparse. Specifically, most members have never been attacked; if we take this data at face value, then the above equation will always predict that a member account that has not been attacked before will never be attacked. In reality we see previously untargeted accounts being attacked all the time. To deal with this challenge, we transform this equation to include factors with statistically significant data. By applying Bayes’ theorem and making a few other assumptions, the equation can be converted to the following, more tractable form:

  • Password Image 2

All of the terms on the right-hand side except for Pr[X|u] can be computed with good confidence. The term Pr[X|u], i.e., the probability that the member logs in with this particular feature value (say, a given IP address), needs to be handled separately because per-member data is sparse. For example, a member might log in from an IP address we have never seen before in the member’s or the global history, causing this term to evaluate to zero. To get around this, we use a well-known technique called smoothing: to account for unseen events, smoothing reduces the estimated probability of known events in order to reserve some probability for unseen ones. In our case we exploit the fact that IP addresses have a neat hierarchical structure: each IP can be associated with an organization, which in turn lies in an ASN, which lies in a country. To estimate Pr[X|u], we aggregate probability estimates computed at different levels of observation (organization, ASN, or country) using two different smoothing techniques (“interpolation” and “back-off”). Smoothing in this fashion lets our model predict a higher probability of attack if the login comes from an unseen IP in an unseen country than if it comes from an unseen IP in a seen ISP. Similarly for the user agent, the model marks a login from a previously unseen OS as more suspicious than a login from a different version of a seen OS.
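To make the two smoothing ideas concrete, here is a minimal sketch of interpolation and back-off over the IP hierarchy. It is not the model described in the paper: the weights, discount, floor, field names, and sample data are all assumptions chosen for illustration, and the outputs are unnormalized scores standing in for Pr[X|u] rather than proper probabilities.

```python
LEVELS = ("ip", "org", "asn", "country")  # most specific to least specific


def level_frequencies(history, login):
    """Fraction of the member's past logins matching `login` at each level."""
    n = len(history)
    freqs = {}
    for level in LEVELS:
        matches = sum(1 for past in history if past[level] == login[level])
        freqs[level] = matches / n if n else 0.0
    return freqs


def interpolated_score(history, login, weights=(0.4, 0.3, 0.2, 0.09), floor=0.01):
    """Interpolation: weighted sum of all levels, plus a floor reserved for unseen events."""
    freqs = level_frequencies(history, login)
    return floor + sum(w * freqs[level] for w, level in zip(weights, LEVELS))


def backoff_score(history, login, discount=0.5):
    """Back-off: use the most specific level with a match, discounted per level skipped."""
    freqs = level_frequencies(history, login)
    for steps, level in enumerate(LEVELS):
        if freqs[level] > 0:
            return (discount ** steps) * freqs[level]
    # Nothing matched at any level: return a small nonzero score instead of zero.
    return discount ** len(LEVELS) / (len(history) + 1)


# Hypothetical member history (documentation-range IPs, made-up org and ASN names).
history = [
    {"ip": "198.51.100.7", "org": "ExampleNet", "asn": "AS64500", "country": "US"},
    {"ip": "198.51.100.7", "org": "ExampleNet", "asn": "AS64500", "country": "US"},
    {"ip": "198.51.100.8", "org": "ExampleNet", "asn": "AS64500", "country": "US"},
]
seen_ip = history[0]
new_ip_seen_org = {"ip": "198.51.100.99", "org": "ExampleNet", "asn": "AS64500", "country": "US"}
new_country = {"ip": "203.0.113.5", "org": "OtherNet", "asn": "AS64501", "country": "DE"}

for name, login in [("seen IP", seen_ip), ("new IP, seen org", new_ip_seen_org),
                    ("new country", new_country)]:
    print(name, round(interpolated_score(history, login), 4),
          round(backoff_score(history, login), 4))
```

With both variants, the score drops as the login moves further from anything in the member's history, and it never reaches zero. That matches the behavior described above: an unseen IP in an unseen country looks less like the member than an unseen IP in a familiar organization.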

We tested multiple variations of our model on previous attacks seen by LinkedIn, along with simulated attacks. Our model thwarts 70% more attacker logins than a model that would do a straightforward history match of current login country with countries in the member’s past login history.
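The country-match baseline mentioned above can be sketched in a few lines. The function name and inputs are hypothetical, and treating a member with no history as suspicious is an assumption of this sketch.

```python
def country_match_baseline(past_countries, login_country):
    """Return True (suspicious) if the login country never appears in the member's history."""
    return login_country not in set(past_countries)


# Hypothetical history: the member has logged in from the US and Canada before.
print(country_match_baseline(["US", "US", "CA"], "DE"))  # suspicious
print(country_match_baseline(["US", "US", "CA"], "US"))  # not suspicious
```

A rule like this gives a yes/no answer with no notion of degree, which is one reason a graded score built from several features can separate attackers from members more finely.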

A detailed description of our login scoring model can be found in our technical paper that will appear at NDSS ’16, and an overview of the work will be presented next week at Enigma 2016. This work was done in collaboration with David Freeman, Battista Biggio, Markus Duermuth, and Giorgio Giacinto. While our model has performed well in experiments and we are implementing it now, no model is perfect and there are always cases where an attacker can sneak in. Having better models in place increases the cost to attackers of infiltrating legitimate accounts on the site, thereby making it far less profitable for them. We are always researching features and techniques to make our models more robust in order to protect members.

 

Footnotes

(1): Note that while we have automated systems crunching all the data in the background, we don't have people actively monitoring individual accounts and watching where a given member logs in; that kind of info is subject to very tight access controls within the company.

(2): Aggregated information about geolocation, browser, and OS is used as a signal in our models to protect our members from being touched by malicious elements in the network.