Microsoft Pulls Guide Using Pirated Books for LLM Training

TL;DR

Microsoft has deleted a blog post written by a senior product manager that effectively directed developers to pirate Harry Potter books for training AI models. The post, published in November 2024, linked to a Kaggle dataset containing all seven books that was incorrectly marked as “public domain.”

What the Blog Post Said

The blog, written by senior product manager Pooja Kamath, promoted a new feature making it easier to “add generative AI features to your own applications with just a few lines of code using Azure SQL DB, LangChain, and LLMs.” As a demonstration, it suggested using the Harry Potter series as a “well-known dataset” that would “resonate with a wide audience.”

The post linked to a Kaggle dataset containing all seven books and suggested two use cases: building Q&A systems providing “context-rich answers” and generating “new AI-driven Harry Potter fan fiction” that’s “sure to delight Potterheads.”

The Backlash

The blog post resurfaced on Hacker News, where commenters pointed out that the dataset was clearly pirated. The Kaggle upload, which had around 10,000 downloads, had apparently flown under the radar of J.K. Rowling’s legal team, which is known for vigorously enforcing Harry Potter copyrights. The dataset was promptly deleted after Ars Technica contacted the uploader, a data scientist in India with no apparent links to Microsoft.

Looking Forward

The incident highlights an uncomfortable reality in AI development: copyrighted material frequently appears in training datasets, whether intentionally or not. For businesses building AI features, the episode is a reminder that dataset provenance matters, and that even major tech companies can stumble on copyright compliance in their own developer guidance.
