Robots.txt vs Meta Robots vs X-Robots-Tag: Why "Block AND Noindex" Doesn't Work

A page blocked in robots.txt AND tagged noindex is a contradiction — the robots.txt block prevents crawlers from ever seeing the noindex tag. Here's how robots.txt (crawl-level), meta robots (page-level), and X-Robots-Tag (HTTP header, any file type) actually relate, why "block AND noindex" doesn't work as intended, and the correct configuration for actually excluding pages from the index.

A page can be blocked in robots.txt, carry a noindex meta tag, and send an X-Robots-Tag header all at once. These three mechanisms don't just overlap; they can actively contradict each other and produce confusing, hard-to-diagnose indexing behavior.

Controlling how search engines crawl and index a page involves (at least) three different mechanisms, operating at different levels — robots.txt (crawl-level, before a page is even requested), the meta robots tag (page-level, requires the page to be fetched to be seen), and the X-Robots-Tag HTTP header (response-level, can apply to any file type, not just HTML). Understanding when each applies — and critically, what happens when they conflict — resolves a surprising number of "why is this page still indexed / why won't Google index this page" mysteries.

robots.txt: crawl-level, "don't fetch this"

robots.txt directives tell crawlers which URLs they're permitted to request/crawl — this is checked before a crawler makes a request to a given URL.

User-agent: *
Disallow: /private/

The critical limitation, often misunderstood: a robots.txt Disallow prevents crawling but does NOT directly prevent indexing. If search engines learn about a disallowed URL some other way (for example, links from other pages, including external sites), they can still index it. Because they couldn't crawl the page to generate a snippet, the result typically appears without a description, sometimes with a message such as "No information is available for this page."

This surprises many site owners: "I blocked this in robots.txt, why is it still showing in search results?!" The answer is that blocking crawling isn't the same as excluding from the index. A URL that can't be crawled can still have an index entry (a sparse one, lacking crawled content) if search engines know about it through other signals such as links.
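As a rough illustration of that pre-request check, the sketch below uses Python's standard-library urllib.robotparser to evaluate the rule shown above. The example.com URLs and the is_crawl_allowed helper are hypothetical names chosen for this example, and real search engine crawlers may match rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same rule shown above; example.com is a placeholder domain.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to request url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/private/report.html",
        "https://example.com/blog/post.html",
    ):
        verdict = "crawl allowed" if is_crawl_allowed(ROBOTS_TXT, url) else "crawl disallowed"
        print(url, "->", verdict)
```

A "crawl disallowed" result says nothing about index status. It only means a compliant crawler won't request the URL.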

Meta robots tag: page-level, requires crawling to be seen

<meta name="robots" content="noindex, follow">

The meta robots tag is part of the page's HTML — which means a crawler must successfully fetch the page to see this tag. This creates a direct conflict with robots.txt blocking:

The conflict scenario: if a URL is both disallowed in robots.txt and carries a noindex meta tag, the disallow stops the crawler from ever fetching the page, so the crawler never sees the tag. The noindex instruction is never received, because the mechanism that would deliver it (fetching the page) is blocked by the other mechanism (robots.txt).

The practical consequence: for a URL you genuinely want excluded from the index (not just uncrawled), noindex (via meta tag or X-Robots-Tag) only works if the page is crawlable, meaning not blocked in robots.txt. Blocking in robots.txt and expecting noindex to also apply is a contradictory configuration, because the block prevents the noindex from ever being read.

The correct approach for "exclude from index entirely": allow crawling (don't block in robots.txt) and include noindex. Crawlers can then fetch the page, see the noindex instruction, and act on it by removing the URL from the index or never adding it. This is the opposite of the "block AND noindex" pattern, and it is the configuration that actually produces the intended outcome.
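The conflict can be checked mechanically. The following is a minimal sketch, not a crawler: it assumes you already have the robots.txt text and the page HTML in hand, and the names MetaRobotsParser and noindex_status are invented for this example. It uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def noindex_status(robots_txt: str, url: str, html: str, user_agent: str = "*") -> str:
    """Classify how robots.txt and a meta robots noindex interact for one URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch(user_agent, url)

    parser = MetaRobotsParser()
    parser.feed(html)
    has_noindex = "noindex" in parser.directives

    if has_noindex and not crawlable:
        return "contradiction: noindex is present but the URL is disallowed, so it is never seen"
    if has_noindex:
        return "ok: the page is crawlable, so noindex can be processed"
    if not crawlable:
        return "blocked: not crawled, but may still be indexed via links"
    return "indexable: crawlable with no noindex"


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    blocking_rules = "User-agent: *\nDisallow: /private/\n"
    print(noindex_status(blocking_rules, "https://example.com/private/a.html", page))
```

Only the first branch is the contradictory setup. The second branch is the configuration recommended above.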

X-Robots-Tag: HTTP header, applies to any content type

X-Robots-Tag: noindex, nofollow

This is an HTTP response header — set by the server, as part of the HTTP response, regardless of the content type being served. This makes it usable for content where a meta tag (which is HTML-specific) isn't applicable — PDFs, images, and other non-HTML file types can carry X-Robots-Tag headers, providing noindex/nofollow/other directives for file types that can't contain <meta> tags.

Same crawlability requirement as meta robots: the crawler must fetch the resource and receive the response, headers included, before it can see the directive. The same robots.txt conflict therefore applies: the X-Robots-Tag header on a robots.txt-disallowed URL is never seen, for the same reason a meta tag on a disallowed HTML page is never seen.

Common use case: keeping PDF files, downloadable documents, or other non-HTML assets out of search results. Sending X-Robots-Tag: noindex in the server response for these files achieves this where a meta tag cannot, since a PDF has no HTML <head> section to hold one.
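In practice this header is usually set in web server configuration. Purely to show the idea in code, here is a hedged sketch of a WSGI middleware that appends the header for PDF paths. The function names add_x_robots_tag and demo_app are hypothetical, and demo_app stands in for whatever application actually serves your files.

```python
def add_x_robots_tag(app, extensions=(".pdf",), value="noindex"):
    """Wrap a WSGI app so responses for matching paths carry X-Robots-Tag."""
    suffixes = tuple(ext.lower() for ext in extensions)

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "").lower()

        def patched_start_response(status, headers, exc_info=None):
            if path.endswith(suffixes):
                # Replace any existing value so the header is sent exactly once.
                headers = [h for h in headers if h[0].lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", value))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for the application that serves files."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"file bytes"]


if __name__ == "__main__":
    app = add_x_robots_tag(demo_app)
    app(
        {"PATH_INFO": "/downloads/report.pdf"},
        lambda status, headers, exc_info=None: print(status, headers),
    )
```

The header only helps if the PDF URLs remain crawlable, for the reason given above.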

Summary: which mechanism for which goal

Goal	Correct mechanism	Common mistake
Prevent crawling of a URL pattern entirely (crawler shouldn't request these URLs)	robots.txt Disallow	Using `noindex` for this — `noindex` doesn't prevent crawling, it prevents indexing of crawled content
Exclude a specific, crawlable page from the index	`noindex` meta tag (or X-Robots-Tag) — page must NOT be robots.txt-blocked	Blocking in robots.txt and adding `noindex` — the `noindex` is never seen
Exclude non-HTML files (PDFs, images) from the index	`X-Robots-Tag: noindex` HTTP header	Attempting to add `<meta>` tags to non-HTML files (not possible)
Prevent search engines from following links on a page (but the page itself can be indexed)	`noindex` is NOT involved — use `nofollow` within the meta robots / X-Robots-Tag (e.g., `content="index, nofollow"`), or `rel="nofollow"` on specific links	Conflating page-level "don't follow links from this page" with link-level `rel="nofollow"` on individual links — these are related but operate at different granularities

Diagnosing "why is this page (still) indexed" with this framework

Step 1: is the page blocked in robots.txt? If yes — any noindex directive (meta tag or header) on this page is irrelevant, since crawlers can't see it. The first fix, if exclusion-from-index is the goal, is to remove the robots.txt block (allowing crawling), so that step 2's noindex can actually be processed.

Step 2: does the (now-crawlable) page have noindex? If the robots.txt block has been removed (or never existed) and a noindex is correctly in place, search engines should process it on their next crawl of the URL and drop the page from the index. That re-crawl isn't instantaneous: a previously indexed page typically stays in the index until the URL is crawled again, so removal takes meaningfully longer for infrequently crawled pages than for frequently crawled ones.

Step 3: are there external links to this URL that might make search engines aware of it by means other than crawling your site? This matters mainly for the scenario described earlier, where a robots.txt-blocked URL is still indexed without a snippet. If that is what you're seeing and the goal is full exclusion, including the snippet-less entry, then the robots.txt block combined with external links is what produces the partial indexing. The resolution is again the "correct approach" above: remove the robots.txt block and add noindex, so that the explicit noindex excludes the URL fully. Robots.txt blocking alone doesn't guarantee exclusion when external link signals exist.
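The three steps can be sketched as a small decision function. This is an illustrative simplification: the PageFacts fields and the diagnose function are hypothetical, and the facts themselves (especially whether a re-crawl has happened) have to be gathered by you, for example from your robots.txt, the page source, and your search console data.

```python
from dataclasses import dataclass


@dataclass
class PageFacts:
    """Hypothetical inputs to the diagnosis; gather these yourself."""
    blocked_by_robots_txt: bool
    has_noindex: bool  # via meta robots tag or X-Robots-Tag header
    has_external_links: bool
    recrawled_since_noindex: bool = False


def diagnose(facts: PageFacts) -> str:
    """Walk the three diagnostic steps and return a suggested next action."""
    if facts.blocked_by_robots_txt:
        advice = "Step 1: remove the robots.txt block so noindex can be read"
        if not facts.has_noindex:
            advice += ", then add noindex"
        if facts.has_external_links:
            advice += " (Step 3: external links explain a snippet-less listing)"
        return advice
    if not facts.has_noindex:
        return "Step 2: add noindex (meta tag or X-Robots-Tag header)"
    if not facts.recrawled_since_noindex:
        return "Step 2: configuration is correct; wait for the next crawl"
    return "Configuration is correct and the noindex has been processed"


if __name__ == "__main__":
    print(diagnose(PageFacts(blocked_by_robots_txt=True, has_noindex=True, has_external_links=True)))
```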

How to use the Meta Tag Generator on sadiqbd.com

Generate correct meta robots tags for pages requiring noindex, nofollow, or combinations — the core function
Before relying on a generated noindex tag: verify the target URL is not blocked in robots.txt — using the robots.txt checker/generator tool to confirm the URL pattern isn't disallowed, which would prevent the meta tag from ever being seen
For non-HTML resources (PDFs, downloadable files) requiring index-exclusion: recognize that meta tags aren't applicable — X-Robots-Tag HTTP headers (a server-configuration task, distinct from the meta-tag-generator's HTML output) are the appropriate mechanism instead

Frequently Asked Questions

If I want to both prevent crawling of a directory AND ensure nothing from it is ever indexed (even via external links), what should I do? This is a genuinely harder case than either "block crawling" or "noindex" alone cleanly solves, given the conflict described above: noindex requires crawlability, while robots.txt-blocked-but-linked URLs can still appear sparsely indexed. There are three approaches. First, ensure no links point to these URLs in the first place; this is feasible for links you create yourself but hard to control for external sites. Second, use a crawlable noindex rather than robots.txt blocking, accepting that the pages will be crawled (consuming some crawl budget) but will be fully excluded from the index once processed. Third, for genuinely sensitive content that shouldn't be reachable by crawlers under any circumstances, use authentication-based access control (requiring login) rather than relying on robots directives at all. Robots directives are voluntary: compliant crawlers respect them, but they are not an enforced access control mechanism.

Does the order of directives within content="noindex, nofollow" matter? No — the set of directives present is what matters; content="nofollow, noindex" and content="noindex, nofollow" are equivalent. The comma-separated values represent a set of independent instructions, not a sequence where order carries meaning.
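One way to picture this is to treat the content value as a set. The helper below is a hypothetical sketch of that idea, not how any particular search engine parses the attribute.

```python
def parse_robots_directives(content: str) -> frozenset:
    """Split a meta robots content value into an order-independent set."""
    return frozenset(
        part.strip().lower() for part in content.split(",") if part.strip()
    )


if __name__ == "__main__":
    first = parse_robots_directives("noindex, nofollow")
    second = parse_robots_directives("nofollow, noindex")
    print(first == second)
```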

Is the Meta Tag Generator free? Yes — completely free, no sign-up required.

Try the Meta Tag Generator free at sadiqbd.com — generate correct meta robots tags, title tags, and meta descriptions for any page.