TL;DR
Microsoft has deleted a blog post written by a senior product manager that effectively directed developers to pirate Harry Potter books for training AI models. The post, published in November 2024, linked to a Kaggle dataset that contained all seven books and was incorrectly marked as “public domain.”
What the Blog Post Said
The post, written by senior product manager Pooja Kamath, promoted a new feature that makes it easier to “add generative AI features to your own applications with just a few lines of code using Azure SQL DB, LangChain, and LLMs.” As a demonstration, it suggested using the Harry Potter series as a “well-known dataset” that would “resonate with a wide audience.”
The post linked to a Kaggle dataset containing all seven books and suggested two use cases: building Q&A systems that provide “context-rich answers” and generating “new AI-driven Harry Potter fan fiction” that’s “sure to delight Potterheads.”
The Backlash
The blog post resurfaced on Hacker News, where commenters pointed out that the dataset was clearly pirated. The Kaggle upload, which had around 10,000 downloads, had apparently flown under the radar of J.K. Rowling’s legal team, which is known for vigorously enforcing Harry Potter copyrights. The dataset was promptly deleted after Ars Technica contacted the uploader, a data scientist in India with no apparent links to Microsoft.
Looking Forward
The incident highlights an uncomfortable reality in AI development: copyrighted material frequently appears in training datasets, whether intentionally or not. For businesses building AI features, the episode is a reminder that dataset provenance matters, and that even major tech companies can stumble on copyright compliance in their own developer guidance.