URL Structure: Query Parsing Ambiguities, IDN Punycode & Open Redirects

A URL has five components, and bugs come from confusing which part you're encoding. This article covers query string parsing differences between frameworks, IDN Punycode for international domains, URL normalisation for comparison, open redirect vulnerabilities, and relative URL resolution edge cases.


The Uniform Resource Locator is one of the web's most ubiquitous primitives, and also one of the most quietly misunderstood. Most developers can construct a basic URL. Fewer can reliably handle edge cases: query strings that contain special characters, path segments that include slashes, internationalized domain names, or URLs that need to survive being embedded in other URLs.

The anatomy of a URL

https://user:pass@www.example.com:8080/path/to/page?key=value&key2=value2#fragment
|---|   |-------| |-------------| |--||-----------| |-------------------| |------|
scheme  userinfo       host       port    path              query         fragment

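Scheme: the protocol, such as https, http, ftp, ws, wss, mailto, data, or tel. The scheme itself is terminated by : alone; the // that follows in https:// introduces the authority, which is why schemes without an authority, like mailto:, have only the colon.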

Authority: [userinfo@]host[:port]. Optional userinfo is rarely used in modern URLs (embedding credentials in URLs is generally a security concern). The host can be a domain name, IPv4 address, or IPv6 address (IPv6 in brackets: [::1]).

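Path: the resource location within the authority. When an authority is present, the path is either empty or starts with /. Path segments are separated by /. The empty path and the root path / are different strings, though for http and https they identify the same resource and servers treat them identically.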

Query: optional, follows ?. By convention it holds key-value pairs separated by &, with keys and values percent-encoded.

Fragment: optional, follows #. Identifies a portion of the resource. Processed client-side only — the fragment is never sent to the server.
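To make the component boundaries concrete, here is a minimal sketch using Python's standard urllib.parse. The example URL and its user:pass credentials are illustrative placeholders, not anything a real site should expose.

```python
from urllib.parse import urlsplit

# Illustrative URL; the credentials and host are placeholders.
parts = urlsplit(
    "https://user:pass@www.example.com:8080/path/to/page?key=value&key2=value2#fragment"
)

print(parts.scheme)    # https
print(parts.username)  # user
print(parts.password)  # pass
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /path/to/page
print(parts.query)     # key=value&key2=value2
print(parts.fragment)  # fragment

# IPv6 hosts are written in brackets; .hostname returns the address without them.
ipv6 = urlsplit("http://[::1]:8000/")
print(ipv6.hostname, ipv6.port)  # ::1 8000
```

Reading components from a parser rather than splitting on : or / by hand avoids the classic mistakes around userinfo and bracketed IPv6 hosts.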

URL parsing ambiguities

Query string parsing is not standardised

The query string format (key=value&key2=value2) is convention, not specification. RFC 3986 only defines that the query component follows ?. The parsing of key-value pairs from the query is up to the application.

Different frameworks parse the same query string differently:

Array notation:

?colors[]=red&colors[]=blue    (PHP, some frameworks)
?colors[0]=red&colors[1]=blue  (some frameworks)
?colors=red&colors=blue         (multiple values for same key)

PHP's parse_str() creates an array for [] notation
JavaScript's URLSearchParams.getAll('colors') returns all values for repeated keys
Some frameworks expect comma-separated: ?colors=red,blue

If a URL with array-notation parameters is consumed by a framework that doesn't understand that notation, colors[0] is treated as a literal key including the brackets.
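As a rough illustration of the point above, Python's standard parse_qs understands repeated keys but has no notion of bracket or comma notation, so those arrive as literal keys and values:

```python
from urllib.parse import parse_qs

repeated = parse_qs("colors=red&colors=blue")
bracketed = parse_qs("colors[]=red&colors[]=blue")
indexed = parse_qs("colors[0]=red&colors[1]=blue")
comma = parse_qs("colors=red,blue")

print(repeated)   # {'colors': ['red', 'blue']}
print(bracketed)  # {'colors[]': ['red', 'blue']}        brackets are part of the key
print(indexed)    # {'colors[0]': ['red'], 'colors[1]': ['blue']}
print(comma)      # {'colors': ['red,blue']}             one value, comma included
```

A client and server that disagree on the notation will not fail loudly; the server simply sees keys it does not recognise.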

Plus sign as space: In HTML form submissions (application/x-www-form-urlencoded), + represents a space in query strings. In the rest of a URL, + is a literal plus sign. A query string copied from a form submission and run through a generic percent-decoder, which leaves + untouched, comes out with literal + characters where the spaces were. The reverse also happens: a form decoder turns an unescaped literal + into a space.

The safe practice: always use %20 for spaces, even in query strings, and encode a literal plus sign as %2B.
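The difference between the two conventions shows up directly in Python's urllib.parse, which offers both a generic pair (quote/unquote) and a form-encoding pair (quote_plus/unquote_plus). A small sketch with a made-up value:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

value = "a b+c"  # hypothetical value containing a space and a literal plus

generic = quote(value, safe="")   # 'a%20b%2Bc'
form = quote_plus(value)          # 'a+b%2Bc'

# Decoding form output with the generic decoder loses the space.
mismatch = unquote(form)          # 'a+b+c'
matched = unquote_plus(form)      # 'a b+c'

# The %20 / %2B spelling decodes the same way under both conventions.
both = (unquote(generic), unquote_plus(generic))  # ('a b+c', 'a b+c')
print(generic, form, mismatch, matched, both)
```

Because %20 and %2B survive either decoder, they are the spelling to prefer when you do not control the consumer.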

Internationalized Domain Names (IDN) and Punycode

Domain names were originally restricted to ASCII. Internationalized Domain Names (IDN) allow non-ASCII characters in domain labels, encoded using Punycode.

How it works: münchen.de (with ü) → xn--mnchen-3ya.de (Punycode representation)

The xn-- prefix indicates a Punycode-encoded label. The full ACE (ASCII Compatible Encoding) domain is used for DNS resolution; browsers display the Unicode form in the address bar for labels in trusted scripts.

The homoglyph/IDN homograph attack: аррle.com (Cyrillic а and р) vs apple.com — visually identical in many fonts. Modern browsers now display the Punycode form for mixed-script domains to mitigate this.

Encoding for use in applications: DNS lookups, TLS certificates, and the HTTP Host header all use the Punycode (ACE) form, so that is the form to treat as canonical. Applications that accept user-input URLs should normalise IDN domains to Punycode before storage, comparison, or DNS resolution.
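A minimal sketch of that conversion, using Python's built-in idna codec. Note the assumption: the built-in codec follows the older IDNA 2003 rules, so production code may prefer a library that implements IDNA 2008; the lookalike domain below is the homograph example from above, written with escapes so the Cyrillic letters are visible in source.

```python
unicode_host = "münchen.de"

# Unicode form -> ASCII Compatible Encoding (what DNS actually sees)
ace_host = unicode_host.encode("idna").decode("ascii")
print(ace_host)  # xn--mnchen-3ya.de

# ACE form -> Unicode form (what a browser may display)
round_trip = ace_host.encode("ascii").decode("idna")
print(round_trip)  # münchen.de

# Cyrillic a (U+0430) and er (U+0440) followed by Latin "le"
lookalike = "\u0430\u0440\u0440le.com"
lookalike_ace = lookalike.encode("idna").decode("ascii")

# After conversion the two hosts are plainly different ASCII strings.
print(lookalike_ace == "apple.com")  # False
```

Comparing and storing the ACE form means a homograph can never compare equal to the domain it imitates.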

URL normalisation

The "same" URL can be expressed many ways:

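HTTP://www.Example.COM/path and http://www.example.com/path are the same URL (scheme and host are case-insensitive), but http://www.example.com/path and https://www.example.com/path are different (the scheme, and with it the default port, differs)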
https://www.example.com/path/ and https://www.example.com/path may or may not be the same resource
https://www.example.com/path?b=2&a=1 and https://www.example.com/path?a=1&b=2 are functionally equivalent for most servers (query parameter order doesn't affect server processing) but are different strings

Normalisation steps for URL comparison:

Lowercase the scheme and host (they're case-insensitive)
Remove default port (port 80 for HTTP, 443 for HTTPS)
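Decode percent-encoded unreserved characters (letters, digits, -, ., _, ~), e.g. %41 → A; leave encoded reserved characters such as %2F alone, because decoding them changes the URL's meaning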
Resolve dot segments in the path (/a/b/../c → /a/c)
Normalise percent-encoding to uppercase hex (%2f → %2F)

URL deduplication (for crawlers, caches, SEO tools) requires all these steps to avoid treating the same page as different resources.

Relative vs. absolute URLs

Relative URLs are resolved against a base URL. The resolution rules are specified in RFC 3986 and have subtle behaviours:

Absolute path (starts with /): replaces the path of the base URL, keeping the scheme and authority. base: https://example.com/a/b + /c/d → https://example.com/c/d

Relative path (no leading /): relative to the current path's directory. base: https://example.com/a/b + c/d → https://example.com/a/c/d (Note: /a/b has directory /a/, so c/d resolves to /a/c/d)

Protocol-relative (starts with //): uses the base URL's scheme. base: https://example.com + //cdn.example.com/script.js → https://cdn.example.com/script.js

Getting relative URL resolution wrong causes broken links, incorrectly resolved assets, and open redirect vulnerabilities.
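The three cases above, plus two that tend to surprise people, can be checked with Python's urljoin. The other.example host is a placeholder for any third-party origin.

```python
from urllib.parse import urljoin

base = "https://example.com/a/b"

absolute_path = urljoin(base, "/c/d")   # https://example.com/c/d
relative_path = urljoin(base, "c/d")    # https://example.com/a/c/d
parent = urljoin(base, "../x")          # https://example.com/x
protocol_relative = urljoin(
    "https://example.com", "//cdn.example.com/script.js"
)                                       # https://cdn.example.com/script.js

# Joining onto a trusted base does NOT keep the result on that host.
other_origin = urljoin(base, "https://other.example/z")  # https://other.example/z

for result in (absolute_path, relative_path, parent, protocol_relative, other_origin):
    print(result)
```

The last case is the link to open redirects: resolving user input against your own base URL is not a validation step.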

Open redirect vulnerabilities

A URL can be used as a redirect target:

https://example.com/login?next=https://evil.example.com

If the server doesn't validate that the next parameter is a safe destination (on the same domain, or a whitelisted URL), an attacker can use the legitimate domain's redirect functionality to send users to a malicious site — with the legitimate domain's address visible in the initial URL.

Safe redirect validation:

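from urllib.parse import urlparse

def is_safe_redirect(url, allowed_host):
    # Browsers read "\" as "/" and drop tabs/newlines, so reject both outright
    if "\\" in url or any(ord(ch) < 0x20 for ch in url):
        return False
    parsed = urlparse(url)
    if parsed.scheme or parsed.netloc:
        # Absolute URL: only http(s) on the allowed host
        return parsed.scheme in ("http", "https") and parsed.netloc == allowed_host
    # Relative URL: one leading slash only ("///x" reaches host x in browsers)
    return url.startswith("/") and not url.startswith("//")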
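Any redirect validator is worth testing against inputs that a parser and a browser read differently. The sketch below only shows how Python's urlparse splits a few such strings (the evil.example.com host is a placeholder); several come back with an empty netloc even though a browser would leave the site or run script, so a host comparison needs to be paired with checks on the scheme and on leading slashes and backslashes.

```python
from urllib.parse import urlparse

candidates = [
    "/account",                 # ordinary same-site path
    "//evil.example.com/x",     # protocol-relative: different host
    "https:evil.example.com",   # scheme but no "//"
    "javascript:alert(1)",      # non-http scheme
    "/\\evil.example.com",      # browsers treat "\" like "/" in http(s) URLs
]

# Map each candidate to the (scheme, netloc) pair that urlparse reports.
seen = {url: (urlparse(url).scheme, urlparse(url).netloc) for url in candidates}

for url, (scheme, netloc) in seen.items():
    print(f"{url!r:28} scheme={scheme!r:13} netloc={netloc!r}")
```

Only the second candidate has a non-empty netloc, which is why the empty-netloc cases deserve explicit tests.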

How to use the URL Encoder/Decoder on sadiqbd.com

Encode: paste any text (a query parameter value, a path segment) and get the percent-encoded version
Decode: paste a percent-encoded URL component and get the original text
Full URL vs. component mode: choose component encoding (encodes ?, &, /, =) vs. URL encoding (preserves structural characters)
Debugging: paste a URL with unexpected encoding to see what characters were encoded and whether they match your intent

Frequently Asked Questions

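What's the maximum length of a URL? No universal limit. The HTTP specification sets no maximum, though RFC 9110 recommends that senders and recipients support URLs of at least 8,000 octets. In practice the ceiling is usually set by servers and intermediaries rather than browsers: current browsers accept URLs far longer than that, while old Internet Explorer versions stopped at 2,083 characters. Apache's default limit (LimitRequestLine) is 8,190 bytes. Nginx requires the request line to fit in a single large_client_header_buffers buffer, which is 8 KB by default. Many load balancers and proxies have their own limits. For data that would produce very long URLs, use the request body (POST) rather than the query string.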

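Should I encode the entire URL or just components? Always encode individual components (path segments, query parameter keys and values). Never encode a full URL that you intend to use as a URL: encoding the structural characters (/, ?, &, =, ://) would destroy the URL structure. The one exception is a whole URL carried as a value inside another URL, such as a redirect target in a query parameter; there the inner URL is itself a component and must be encoded in full.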
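A short sketch of component-wise encoding in Python. The path segment, parameter names, and values are made up for illustration; quote_via=quote is used so that spaces become %20 rather than +.

```python
from urllib.parse import parse_qs, quote, urlencode

# A single path segment that happens to contain "/" and a space.
segment = quote("reports/2024 Q1", safe="")

# Each key and value is encoded on its own, including a URL used as a value.
query = urlencode(
    {"q": "fish & chips", "next": "https://example.com/?a=1"},
    quote_via=quote,
)

url = f"https://example.com/files/{segment}?{query}"
print(url)

# The embedded URL survives a round trip through the query string.
recovered = parse_qs(query)["next"][0]
print(recovered)  # https://example.com/?a=1
```

The structural characters of the outer URL are written literally, while every character supplied as data goes through quote or urlencode.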

Is the URL Encoder free? Yes — completely free, no sign-up required.

URLs are fundamental infrastructure — every web request depends on them being constructed, parsed, and resolved correctly. The edge cases (query parsing differences, IDN encoding, relative URL resolution) are where subtle bugs hide.

Try the URL Encoder/Decoder free at sadiqbd.com — encode or decode any URL component, with support for full URL and component encoding modes.