
LinkedIn — 2012 password leak + 2021 scrape

A 2012 breach exposed 117 million LinkedIn password hashes stored without salting, which were cracked and used for credential-stuffing attacks for years after the original incident.

Target: LinkedIn — 2012 password leak + 2021 scrape
Date public: 5 June 2012
Sector: Technology
Attack type: Data Breach
Threat actor: Yevgeniy Nikulin (DOJ indictment, 2016); unattributed (2021)
Severity: High
Region: Global

In June 2012, several million LinkedIn passwords turned up on a Russian internet forum. LinkedIn confirmed a breach, said about 6.5 million accounts were affected, and moved on. Four years later, the real picture emerged: 117 million accounts had been taken, and the passwords were stored in a format so weak they could be cracked by an ordinary computer in hours. LinkedIn had stored passwords as SHA-1 hashes without adding a random salt, so identical passwords produced identical stored values. With a precomputed table of common passwords, an attacker could match the hashed values to their plaintext originals almost instantly. The stolen passwords fed into credential-stuffing attacks against other websites for years: if you used the same password on LinkedIn as on your email, bank, or social media, attackers tried it there too. A Russian national named Yevgeniy Nikulin was eventually convicted for the 2012 breach and sentenced to over seven years in US federal prison.

A separate problem arrived in 2021. A vast scraping operation collected publicly visible profile data (names, job titles, email addresses, locations) on 700 million LinkedIn users. No passwords this time, but the data was precise enough to fuel highly targeted phishing campaigns against specific industries and roles.

What happened

On 6 June 2012, a user on a Russian-language password-cracking forum published a file containing 6.46 million password hashes claimed to come from LinkedIn. LinkedIn acknowledged the same day that it was investigating, confirmed that some of the hashes corresponded to member accounts, and began invalidating the affected passwords. The company characterised the intrusion as affecting approximately 6.5 million accounts.

That characterisation remained the established account for four years. In May 2016, a dataset purporting to be the full scope of the 2012 breach appeared for sale on a dark-web marketplace called The Real Deal, with an asking price of five bitcoin. The dataset contained approximately 117 million LinkedIn member records: email addresses paired with SHA-1 password hashes. LinkedIn confirmed the authenticity of the data and forced password resets for every account in the full dataset whose password had not been changed since 2012.

The password-storage mechanism LinkedIn had used — unsalted SHA-1 hashes — meant that the 117 million passwords were practically recoverable. SHA-1 is a fast hash function designed for integrity checking, not password storage. Without per-record random salting, any two users with the same password produce the same stored hash, enabling precomputed rainbow-table attacks that can recover common passwords in milliseconds. By the time the full dataset surfaced in 2016, independent researchers had already cracked the majority of the hashes.
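The effect of unsalted hashing can be shown with a minimal Python sketch. The records and wordlist are hypothetical, and the attacker's table here is a plain dictionary lookup, not a true rainbow table (which trades lookup time for storage using hash chains). The principle is the same: one precomputation serves every record.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Unsalted SHA-1 of a password, the storage format described above."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


# Hypothetical stored records: (email, unsalted SHA-1 hash).
stored = [
    ("alice@example.com", sha1_hex("linkedin")),
    ("bob@example.com", sha1_hex("linkedin")),
    ("carol@example.com", sha1_hex("correct horse battery staple")),
]

# Attacker side: hash a wordlist once, then reuse the table for every record.
wordlist = ["123456", "password", "linkedin", "qwerty"]
lookup = {sha1_hex(word): word for word in wordlist}

# Every record whose hash appears in the table is recovered by one lookup.
recovered = {email: lookup[h] for email, h in stored if h in lookup}

# Identical passwords are visible as identical hashes even before cracking.
same_hash = stored[0][1] == stored[1][1]
```

Both accounts that used "linkedin" fall to a single table entry, while the long passphrase survives only because it is not in the wordlist.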

The downstream consequence was a credential-stuffing supply chain that persisted for a decade. Many of the 117 million email-and-password pairs remained valid on other services — users who had reused their LinkedIn password on Gmail, Dropbox, banking sites and other platforms faced account takeovers across multiple services simultaneously. LinkedIn’s 2012 data became a foundational dataset in the underground credential market, cited in breaches and account-takeover operations for years afterwards.

In 2021, a separate problem emerged. In April, a dataset of approximately 500 million LinkedIn profiles scraped from the platform’s public-facing API was posted on a hacker forum. In June, a further dataset of 700 million profiles appeared. These datasets were not the product of a database compromise; they were assembled by systematically querying LinkedIn’s API with authenticated sessions to harvest publicly visible profile data. LinkedIn maintained that this constituted scraping of public data rather than a data breach, a characterisation disputed by regulators in several jurisdictions. The scraped data included names, email addresses, phone numbers, job titles, employer names, locations, and professional histories — sufficient to construct highly targeted spear-phishing and social-engineering campaigns against specific individuals.

How it worked

The 2012 intrusion method was established in detail during the prosecution of Yevgeniy Nikulin, a Russian national arrested in Prague in 2016 and extradited to the United States in 2018. Nikulin and co-conspirators gained access to the computer of a LinkedIn employee — the prosecution described the mechanism as malware installed on the employee’s personal computer that captured their LinkedIn corporate credentials. Those credentials were used to access LinkedIn’s internal network.

Inside the network, the attackers reached LinkedIn’s user database and extracted the password file. The file stored SHA-1 hashes without salting. The omission of salting meant the hashes were directly vulnerable to precomputed-table attacks; the choice of SHA-1, a hash function that is fast by design, meant that brute-force attacks would have been rapid even if a salt had been used. Six million of the hashes were immediately cracked by the forum community that received the initial leak, and the results were posted alongside their plaintext passwords to demonstrate their validity.

The gap between the 6.5 million figure LinkedIn initially acknowledged and the 117 million figure that surfaced four years later raises questions about whether the full scope was known and not disclosed, or whether the investigation at the time failed to establish it. LinkedIn’s 2016 response indicated the company believed the 2012 investigation had identified the full extent of the breach; the discrepancy suggests the original forensic investigation was incomplete.

The 2021 scraping operated through a different mechanism entirely. LinkedIn’s API, used by developers to build applications that integrate with the platform, returned profile data when queried with authenticated user credentials. Scrapers created large numbers of LinkedIn accounts and used automated tooling to systematically traverse the platform’s user graph — following connection networks, searching by company and location, enumerating alumni networks — to harvest the publicly visible fields of as many profiles as possible. The scale of the 2021 operation (700 million profiles out of approximately 740 million total LinkedIn users) suggests near-comprehensive coverage of the platform’s user base.

Timeline

  • Unknown date, 2012 — Yevgeniy Nikulin and co-conspirators access a LinkedIn employee’s credentials via malware. Intrusion into LinkedIn’s internal network and database.
  • Early June 2012 — LinkedIn password hash file exfiltrated.
  • 6 June 2012 — 6.46 million SHA-1 hashes posted on Russian forum “insidepro.com”. LinkedIn confirms breach and begins invalidating affected passwords. Company states approximately 6.5 million accounts affected.
  • 2012–2016 — Full dataset circulates on underground markets. Hashes cracked at scale; recovered credentials used in credential-stuffing campaigns against other platforms.
  • May 2016 — Full dataset of approximately 117 million records listed for sale on The Real Deal dark-web marketplace for 5 BTC (approximately $2,200 at the time). Reported by Motherboard and confirmed by LeakedSource.
  • June 2016 — LinkedIn confirms authenticity of the 2016 dataset, forces password resets for all affected accounts not changed since 2012.
  • October 2016 — Yevgeniy Nikulin arrested in Prague at the request of US authorities.
  • March 2018 — Nikulin extradited to the United States after lengthy legal proceedings in the Czech Republic.
  • July 2020 — Nikulin convicted on all counts including computer intrusion, aggravated identity theft, and wire fraud.
  • September 2020 — Nikulin sentenced to 88 months (7 years 4 months) in federal prison.
  • April 2021 — Dataset of approximately 500 million LinkedIn profiles scraped via API posted on hacker forum.
  • June 2021 — Dataset of 700 million LinkedIn profiles posted; near-comprehensive coverage of the platform’s user base. LinkedIn maintains this is scraping of public data, not a breach.
  • 2021–2022 — EU data protection regulators, including Ireland’s DPC, open investigations into the 2021 scraping incidents.

What defenders should learn

The LinkedIn breach is the canonical teaching case for password-storage architecture, and the lesson has not changed: SHA-1 is not a password-hashing function. It was designed for integrity checking, not credential storage. It is fast — which is exactly the wrong property for a password hash. A password hash function needs to be deliberately slow and computationally expensive, so that brute-force attempts cost an attacker significant time per guess. bcrypt, the standard alternative available in 2012, was designed with an adjustable cost factor precisely for this purpose. LinkedIn’s choice to use SHA-1 transformed every stolen record from a protected credential into a recoverable plaintext with modest compute.
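The difference between a fast hash and a deliberately slow one can be measured with a rough sketch. bcrypt is not in the Python standard library, so this example uses PBKDF2-HMAC-SHA256 from hashlib as a stand-in: its iteration count plays the same role as bcrypt's cost factor. The iteration count and repeat counts are illustrative assumptions, not recommendations.

```python
import hashlib
import time


def fast_hash(password: bytes) -> bytes:
    """One SHA-1 call: cheap for the defender and equally cheap for an attacker."""
    return hashlib.sha1(password).digest()


def slow_hash(password: bytes, salt: bytes, iterations: int) -> bytes:
    """PBKDF2-HMAC-SHA256; the iteration count is the adjustable cost."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations)


def seconds_per_guess(fn, repeats: int) -> float:
    """Average wall-clock time of one call to fn."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


password = b"linkedin"
salt = b"\x00" * 16  # fixed only to keep the timing comparable; real salts are random

sha1_cost = seconds_per_guess(lambda: fast_hash(password), repeats=1000)
kdf_cost = seconds_per_guess(lambda: slow_hash(password, salt, 200_000), repeats=3)

print(f"SHA-1:  {sha1_cost:.2e} s per guess")
print(f"PBKDF2: {kdf_cost:.2e} s per guess")
```

The absolute numbers depend on the machine, but the ratio is the point: raising the iteration count raises the attacker's cost per guess in direct proportion.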

The salt omission compounds the error. Without a random per-record salt, every user who chose “linkedin” as their password (and many did) stored exactly the same hash. A precomputed table of common SHA-1 hashes cracks an entire population simultaneously rather than requiring individual attacks. Salting removes that precomputation advantage: it costs almost nothing computationally and takes only a few lines of code. That LinkedIn’s implementation omitted it in 2012 is a failure of basic security hygiene, one for which a platform of its size and sophistication had no excuse.
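A per-record salt can be sketched in a few lines of Python. The function names and the ITERATIONS value are hypothetical, and PBKDF2 from the standard library stands in for whichever slow password hash a real system would choose.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative cost setting, not a recommendation


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is generated for every record."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected)


# Two users choosing the same password no longer share a stored value.
salt_a, digest_a = hash_password("linkedin")
salt_b, digest_b = hash_password("linkedin")
```

Because each record carries its own salt, a table precomputed for one record is useless against the next, and an attacker has to attack each hash separately.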

The four-year gap between the 2012 event and the 2016 full-scope disclosure is a disclosure and investigation failure with specific lessons. If an incident response investigation concludes a breach affected 6.5 million records when the true figure is 117 million, either the forensic scope was insufficient or the investigation was improperly bounded. Post-breach investigations need to account for the possibility that the attacker’s access was broader than the data found in the attacker’s possession at the moment of discovery.

The 2021 scraping incident is a different problem with different mitigations, but equally important. Publicly visible data is not harmless data. An API that returns profile information to any authenticated user can be enumerated systematically at scale if no rate-limiting, behavioural anomaly detection, or access-pattern monitoring is in place. The fact that the same data is “public” on the website does not mean unlimited programmatic bulk access should be treated as normal. Detecting and interrupting bulk-enumeration behaviour requires active monitoring of API access patterns — not just perimeter authentication.
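One way to monitor access patterns is to count how many distinct profiles each account fetches within a sliding time window. The sketch below is a minimal illustration under stated assumptions: the class name, thresholds, and identifiers are hypothetical, timestamps are assumed to arrive in non-decreasing order per account, and nothing here reflects LinkedIn's actual controls.

```python
from collections import defaultdict, deque


class EnumerationMonitor:
    """Flags accounts that fetch too many distinct profiles within a time window."""

    def __init__(self, window_seconds: float, max_distinct_profiles: int) -> None:
        self.window_seconds = window_seconds
        self.max_distinct_profiles = max_distinct_profiles
        # account_id -> deque of (timestamp, profile_id), oldest first
        self._events = defaultdict(deque)

    def record(self, account_id: str, profile_id: str, timestamp: float) -> bool:
        """Record one profile fetch; return True if the account exceeds the limit."""
        events = self._events[account_id]
        events.append((timestamp, profile_id))
        # Drop events that have aged out of the window.
        while events and events[0][0] <= timestamp - self.window_seconds:
            events.popleft()
        # Repeat views of the same profile do not count as enumeration.
        distinct = {pid for _, pid in events}
        return len(distinct) > self.max_distinct_profiles


monitor = EnumerationMonitor(window_seconds=60, max_distinct_profiles=3)
flags = [monitor.record("acct-1", f"profile-{i}", float(i)) for i in range(5)]
```

Counting distinct profiles, not raw requests, separates a user who reloads a few pages from a session walking the user graph. A production system would likely combine this with other signals, such as account age and query shape.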

The downstream credential-stuffing risk is the final lesson and the most enduring. Password reuse across services is the mechanism that converts LinkedIn’s breach into account takeovers at banks, email providers, and social platforms that were never compromised. Organisations can reduce their exposure to third-party credential leaks by implementing breached-password detection at login — checking offered passwords against known-compromised datasets and refusing or flagging them — and by encouraging multi-factor authentication, which makes a stolen password alone insufficient for account access regardless of where it was originally stolen.
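Breached-password detection at login can be sketched as a lookup against a corpus of known-compromised passwords. The corpus below is a three-entry hypothetical sample, and the function names and return values are illustrative. SHA-1 appears only as a lookup key into already-public breach data, not as a storage format for the site's own credentials. Bucketing by the first five hex characters resembles the range-style lookups that some breached-password services offer, which is an assumption about how a real deployment might be wired.

```python
import hashlib


def sha1_upper(password: str) -> str:
    """Uppercase hex SHA-1, used here only as a key into the breach corpus."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest().upper()


def build_index(compromised_passwords):
    """Group hashes by their first five hex characters: prefix -> set of suffixes."""
    index = {}
    for pw in compromised_passwords:
        digest = sha1_upper(pw)
        index.setdefault(digest[:5], set()).add(digest[5:])
    return index


def is_breached(password: str, index) -> bool:
    """True if the password's hash appears in the compromised corpus."""
    digest = sha1_upper(password)
    return digest[5:] in index.get(digest[:5], set())


# Hypothetical sample; a real deployment would load a large corpus.
INDEX = build_index(["linkedin", "123456", "password"])


def check_login_password(password: str) -> str:
    """Decide what to do with a password offered at login or signup."""
    return "flag_for_reset" if is_breached(password, INDEX) else "ok"
```

Whether a match should block the login, force a reset, or only trigger a warning is a policy decision. Either way, the check catches reused credentials before an attacker replays them.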
