Removing Fake Accounts from LinkedIn
November 20, 2015
Integrity of the platform is of the utmost importance to LinkedIn. When members interact with other members on LinkedIn, they expect that they are interacting with real people. Unfortunately, from time to time, malicious actors create fake accounts on LinkedIn for a variety of reasons. If left unchecked, fake accounts result in a degraded experience for legitimate members.
Some fake accounts can be detected solely from the information in the account itself. For example, if an account is registered with the name “aaaaaa bbbbbb”, then natural language processing techniques can determine that this is unlikely to be a real person’s name. However, if an attacker uses real data to generate fake accounts, then we can no longer detect them by looking at each account individually. What we have going for us is that attackers almost always generate large numbers of fake accounts using automated scripts. Accounts created by automated scripts show patterns that are unlikely to arise in accounts created by humans independently of each other. By studying clusters of accounts, we can detect fake accounts that may not be spotted by studying each account individually.
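To make the single-account case concrete, here is a minimal sketch of a check that would flag a name like “aaaaaa bbbbbb”. It is a toy heuristic based on character repetition, not the natural language processing techniques actually used at LinkedIn. The function names and the threshold are assumptions chosen for illustration.

```python
def distinct_letter_ratio(token: str) -> float:
    """Return the fraction of distinct letters in a token (low = repetitive)."""
    letters = [c for c in token.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return len(set(letters)) / len(letters)


def looks_implausible(name: str, threshold: float = 0.34) -> bool:
    """Flag a name when every token of 4+ characters is mostly one repeated letter.

    The threshold is a hypothetical value picked for this example only.
    """
    tokens = [t for t in name.split() if len(t) >= 4]
    if not tokens:
        return False  # too little text to judge
    return all(distinct_letter_ratio(t) < threshold for t in tokens)


flagged = looks_implausible("aaaaaa bbbbbb")
not_flagged = looks_implausible("David Freeman")
```

A check like this only works when the attacker is careless. Once the names come from real data, every account passes it, which is why the rest of this post looks at clusters.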
In a paper presented at AISec 2015, Danica Xiao, David Freeman, and I describe a method for detecting clusters of fake accounts using machine learning. In order to minimize the impact of fake accounts on real members, the method is designed to detect fake accounts as soon as possible after they are created. Thus, we only use information available at the time of account creation; we do not use information that would take time to accumulate, such as behavioral history. The input to the model is a cluster of accounts, and the model predicts whether the entire cluster is fake.
Because the model takes clusters as input, newly registered accounts must first be grouped into clusters using any reasonable method that aggregates accounts believed to come from the same source. Then, the model uses cluster-level features that are based on the distribution of values of individual features (such as name or email address) over the entire cluster. As an example, consider the following two clusters of names:
The left side consists of very common American male names. The right side consists of names that are very rare, but cannot be immediately dismissed as fake. In either case, if we look at one name at a time, we cannot confidently decide that the account is fake. However, if either cluster is taken as a whole, then the cluster is suspicious because a random sample of names (even a random sample within a country) should contain names along the entire frequency spectrum, and is unlikely to look like either of these clusters.
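The idea of a cluster-level feature can be sketched in a few lines. The example below is an illustration under stated assumptions, not the model from the paper. Each account is assumed to carry a `source` field to cluster on and a `name_freq` field giving how common its name is in the population. Both fields, and all of the numbers, are made up.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def group_into_clusters(accounts, key):
    """Group account records by a shared attribute (a stand-in for any clustering method)."""
    clusters = defaultdict(list)
    for account in accounts:
        clusters[account[key]].append(account)
    return dict(clusters)


def name_frequency_features(cluster):
    """Summarize the distribution of log10 name frequencies over one cluster."""
    logs = [math.log10(account["name_freq"]) for account in cluster]
    return {
        "size": len(logs),
        "min": min(logs),
        "max": max(logs),
        "mean": mean(logs),
        "std": pstdev(logs),
        "range": max(logs) - min(logs),
    }


# Made-up registrations: "source" and "name_freq" are hypothetical fields.
accounts = (
    [{"source": "A", "name_freq": f} for f in (0.012, 0.010, 0.009, 0.011)]
    + [{"source": "B", "name_freq": f} for f in (2e-6, 1e-6, 3e-6, 1.5e-6)]
    + [{"source": "C", "name_freq": f} for f in (0.010, 3e-4, 2e-5, 1e-6)]
)

features = {
    source: name_frequency_features(cluster)
    for source, cluster in group_into_clusters(accounts, "source").items()
}
```

In this toy data, clusters A (all common names) and B (all rare names) each cover a narrow band of the frequency spectrum, while cluster C spans several orders of magnitude, as a random sample of real names would be expected to. Summary statistics like these, computed per cluster, are the kind of input a classifier trained on labeled clusters could use.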
A model based on the techniques described in this paper has been deployed at LinkedIn, and catches many clusters of fake accounts every day, thus preventing legitimate members from ever seeing them. But it is only the beginning of the fight to remove fake accounts. The attackers are not standing still in response to our new model, and we too must continue to evolve. We will improve both the clustering methods and the features of the model. In addition, the automated creation of clusters of fake accounts is only one of several kinds of account creation abuse we are addressing. Our ongoing work will help ensure that our members continue to interact only with real people.