
How to Stop ChatGPT and AI Platforms from Scraping Your Website (2025 Guide)
AI platforms like ChatGPT and Google’s Gemini (formerly Bard) are transforming how people consume information. But behind the scenes, these large language models (LLMs) rely on vast amounts of web content, often scraped from websites without the owners’ knowledge or permission.
As a leading web design and development agency, we understand the importance of protecting our customers’ data and staying up to date with the latest developments in the digital world.
If you publish original news, research, or creative work, your website could already be part of an AI training dataset.
This guide explains why you should care and how to stop AI platforms from crawling, scraping, and using your website content for free.
Why AI Platforms Are Scraping Your Website
To train their language models, AI companies collect huge datasets from publicly available web pages. This includes:
- News articles
- Blog posts
- Research papers
- Product descriptions
- User-generated content
If your website is open to crawlers, your work may already be part of their datasets.
What Are the Risks of AI Scraping?
1. Loss of Traffic
AI tools can summarise your content and answer user queries without ever sending traffic to your site. This undermines your SEO and advertising revenue.
2. Unauthorised Use of Content
Your articles or research may appear in AI responses without attribution, licensing, or compensation.
3. Intellectual Property Concerns
Proprietary data could end up in AI models, making it harder to enforce ownership.
4. Competition From Your Own Work
AI-generated content trained on your material can compete against you in search results.
How to Stop AI Platforms from Scraping Your Website
The good news: most major AI companies allow you to opt out. You can block their crawlers with a few technical steps.
1. Block AI Crawlers with robots.txt
Add these lines to your site’s robots.txt file:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
Place this file at https://yourdomain.com/robots.txt. Avoid adding a blanket User-agent: * with Disallow: /, as that would also shut out search engine crawlers such as Googlebot and undo the SEO you are trying to protect. Note that Google-Extended is a control token rather than a separate crawler: disallowing it tells Google not to use your content to train Gemini, without affecting normal Google Search indexing.
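Once the file is live, you can confirm the rules behave as intended with Python’s built-in robotparser module. A minimal sketch, treating https://yourdomain.com as a placeholder for your own domain:

from urllib import robotparser

# Placeholder domain: replace with your own site
rp = robotparser.RobotFileParser()
rp.set_url("https://yourdomain.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

# can_fetch() returns False for any user agent your rules disallow
for agent in ("GPTBot", "ChatGPT-User", "Google-Extended", "Googlebot"):
    status = "allowed" if rp.can_fetch(agent, "https://yourdomain.com/") else "blocked"
    print(f"{agent}: {status}")

Googlebot is included as a sanity check: it should report “allowed”, confirming that regular search indexing is unaffected.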
2. Use Meta Tags to Restrict Indexing
Add these meta tags to your site’s <head> section. Keep in mind that noai and noimageai are not part of any formal standard and not every crawler honours per-bot meta tags, so treat these as a supplement to robots.txt and server-level blocking rather than a replacement:
<meta name="robots" content="noai, noimageai">
<meta name="GPTBot" content="noindex, nofollow">
3. Block AI Bots at Server Level
For Apache (.htaccess):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChatGPT-User [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC]
RewriteRule .* - [F,L]
For Nginx (inside the relevant server block):
if ($http_user_agent ~* (GPTBot|ChatGPT-User|Google-Extended)) {
return 403;
}
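To verify the server-level rules are working, you can send a test request that spoofs an AI crawler’s user agent and compare the response codes. A minimal sketch using Python’s standard library; the URL is a placeholder for a page on your own site:

import urllib.request
import urllib.error

URL = "https://yourdomain.com/"  # placeholder: replace with a page on your own site

for agent in ("GPTBot", "Mozilla/5.0 (ordinary browser)"):
    req = urllib.request.Request(URL, headers={"User-Agent": agent})
    try:
        with urllib.request.urlopen(req) as resp:
            print(f"{agent!r}: HTTP {resp.status}")  # a normal browser string should get 200
    except urllib.error.HTTPError as err:
        print(f"{agent!r}: HTTP {err.code}")  # blocked AI user agents should get 403

If both requests come back with 200, the rules are not being applied; check that mod_rewrite is enabled for Apache, or that the Nginx snippet sits inside the correct server block.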
4. Update Your Terms of Use
Add a clause to your site’s Terms:
“The content on this website may not be copied, scraped, or used to train AI models or large language models (LLMs) without prior written consent.”
Who Is Already Blocking AI Bots?
Major organisations are leading the way:
- The New York Times blocked GPTBot in August 2023.
- Reuters, CNN, and Le Monde have followed suit.
- Elsevier prohibits AI training on its academic content.
- The BBC has blocked OpenAI’s GPTBot web crawler from accessing its websites since October 2023.
Why You Should Act Now
AI tools are here to stay. But letting them use your work without consent or compensation isn’t sustainable.
By acting today, you can:
- Protect your intellectual property
- Retain control over your website content
- Safeguard your SEO and traffic
Next Steps for Website Owners
- Audit your site for valuable content.
- Add a robots.txt file blocking AI crawlers.
- Set meta tags and server-level blocks.
- Update your Terms of Use to address AI scraping.
- Monitor your server logs for AI bot activity (see the sketch below).
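For the log-monitoring step, a short script can show how often known AI crawlers are hitting your site. A minimal sketch; the log path and user-agent list are assumptions you should adapt to your own server:

import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # assumption: point this at your server's access log
# Google-Extended is a robots.txt token, not a crawler, so it never appears in logs
AI_AGENTS = ("GPTBot", "ChatGPT-User")

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        for agent in AI_AGENTS:
            if re.search(agent, line, re.IGNORECASE):
                hits[agent] += 1

for agent, count in hits.most_common():
    print(f"{agent}: {count} requests")

A steady stream of hits after you have added the blocks above suggests a crawler is ignoring robots.txt and should be blocked at the server or firewall level instead.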
Don’t Wait for AI Companies to Play Fair
AI companies move fast. Protect your website and your business now, before your content becomes just another dataset.