Manual:PAGENAMEE encoding/ja

From Linux Web Expert

Revision as of 17:47, 18 April 2024 by imported>FuzzyBot (Updating to match new version of source page)

MediaWiki page name encoding is a complicated topic. The MediaWiki magic words PAGENAME, PAGENAMEE, and urlencode have distinct implementations, each with its own peculiarities.

A MediaWiki page name can have a leading space but not a trailing space. The ASCII characters that are not allowed in MediaWiki page names are the three types of brackets, sharp sign, underscore and vertical bar, and all control characters (including tabs and newlines).

# < > [ ] _ { | }
The underscore is not really disallowed, but is treated like a space without distinction in MediaWiki page names, so "A_B" and "A B" are referencing exactly the same page name (pages will be created, searched, and displayed (with their title) using spaces, never using underscores).

This article shall refer to these as the "not-allowed pagename characters". For clarity, we will present other ASCII 7-bit values for characters as the URL-style encoding of percent-hex-hex form known as percent-encoding.
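The rules above can be sketched in Python. This is only an illustration of the description in this section, not MediaWiki's own validation code; the names FORBIDDEN, forbidden_characters and same_page are hypothetical.

```python
# Characters the text lists as not allowed in page names. The underscore is
# left out on purpose: it is accepted and treated as a space.
FORBIDDEN = set("#<>[]{|}")


def forbidden_characters(title: str) -> list[str]:
    """Return the characters of `title` that the rules above rule out."""
    return [
        c for c in title
        if c in FORBIDDEN or ord(c) < 0x20 or ord(c) == 0x7F  # control characters
    ]


def same_page(a: str, b: str) -> bool:
    """Underscore and space are interchangeable in page names."""
    return a.replace("_", " ") == b.replace("_", " ")


print(forbidden_characters("A|B#C"))
print(same_page("A_B", "A B"))
```

The sketch covers only the ASCII rules stated above and ignores everything else that may make a title invalid.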

PAGENAME

Some allowed characters returned by {{PAGENAME}} are HTML-style encoded:

  • " (double quote %22) is converted to &#34; (34 is the decimal value of hexadecimal 22); in standard HTML/XML style it could also be converted to &quot;.
  • & (ampersand %26) is converted to &#38; (38 is the decimal value of hexadecimal 26); in standard HTML/XML style it could also be converted to &amp;
  • ' (single quote %27) is converted to &#39; (39 is the decimal value of hexadecimal 27); in standard HTML/XML style it could also be converted to &apos;
This HTML/XML encoding is standard, although the standard requires escaping single and double quotes only in a few contexts. The standard would also require encoding the less-than (<) and greater-than (>) signs, but these two characters are forbidden in MediaWiki pagenames because of the syntax of the MediaWiki code used to compose pages.

The same HTML-encoding is used also with:

  • {{FULLPAGENAME}}
  • {{BASEPAGENAME}}
  • {{SUBPAGENAME}}
  • {{SUBJECTPAGENAME}}
  • {{TALKPAGENAME}}

We will refer to the double quote, ampersand, and single quote as the "three special pagename characters".
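As a rough illustration, the HTML-style encoding described for {{PAGENAME}} can be imitated in Python. The function name pagename_html_encode is hypothetical, and the sketch assumes that only the three characters listed above are affected.

```python
import html

# Numeric character references used for the three special pagename characters.
SPECIAL = {'"': "&#34;", "&": "&#38;", "'": "&#39;"}


def pagename_html_encode(title: str) -> str:
    """Replace the three special characters with numeric HTML entities."""
    return "".join(SPECIAL.get(c, c) for c in title)


encoded = pagename_html_encode('Tom & "Jerry"')
print(encoded)
# An HTML parser turns the entities back into the plain characters.
print(html.unescape(encoded))
```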

PAGENAMEE

{{PAGENAMEE}} converts spaces to underscore and percent-encodes a set of characters:

  • It converts the 11 following ASCII characters (all allowed in pagenames):
    " % & ' + = ? \ ^ ` ~
    to (using the hexadecimal representation of the ASCII encoding)
    %22 %25 %26 %27 %2B %3D %3F %5C %5E %60 %7E
    It also converts all non-ASCII (Unicode) characters to %nn triplets, with nn in hexadecimal: one triplet for each octet of the UTF-8 sequence that encodes the character's Unicode code point. The first triplet is between %C2 and %F4 and is followed by one to three triplets between %80 and %BF. In the worst case a single Unicode character generates 12 characters; most Latin, Cyrillic, and Greek characters are encoded as 6 characters, while sinograms and Korean Hangul need 9.
  • It converts the ASCII space ( ) into an underscore (_).
  • It does not convert ASCII alphanumerics and the 13 following ASCII punctuations and symbols (all allowed in pagenames):
    ! $ ( ) * , - . / : ; @ _

The same encoding is used also with:

  • {{FULLPAGENAMEE}}
  • {{BASEPAGENAMEE}}
  • {{SUBPAGENAMEE}}
  • {{SUBJECTPAGENAMEE}}
  • {{TALKPAGENAMEE}}
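The conversion described in this section can be sketched as follows. This is a minimal model built only from the character lists above, assuming the input is already a valid page name; pagenamee_encode and KEPT_PUNCTUATION are hypothetical names, not part of MediaWiki.

```python
# The 13 ASCII punctuation characters the text says are left unconverted.
KEPT_PUNCTUATION = "!$()*,-./:;@_"


def pagenamee_encode(title: str) -> str:
    """Space becomes underscore; everything outside the kept set becomes %XX."""
    out = []
    for byte in title.replace(" ", "_").encode("utf-8"):
        ch = chr(byte)
        if byte < 0x80 and (ch.isalnum() or ch in KEPT_PUNCTUATION):
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 octet, as described above.
            out.append(f"%{byte:02X}")
    return "".join(out)


print(pagenamee_encode("A = B"))
print(pagenamee_encode("東京"))
```

Each of the two Kanji in the second call occupies three UTF-8 octets, so each expands to nine characters.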

When preparing a pagename for embedding in the "searchpart" of a URL (see RFC 1738 and/or RFC 3986), it might have to be percent-encoded and have all space characters converted to %20 or to a plus sign (+). We will call the result "searchpart-encoded".

This avoids the problematic coding of the three special pagename characters by encoding, for instance, ampersand (&) as %26, but the typical searchpart-encoding of space is the plus sign (or sometimes as %20).

If no MediaWiki string manipulation extensions exist, then {{PAGENAMEE}} might only be useful for constructing a URL back into one's own wiki, to other wikis, or to other sites whose pages use the same names with underscores. There is no standard here: the encoding presented above was defined by MediaWiki itself for its own local use. Do not assume that other sites perform the same conversion; most of them just use plain UTF-8 in their own local URLs to represent non-ASCII characters and standard URL-encoding for the "unsafe" ASCII characters.

urlencode

The {{urlencode:data|style}} function percent-encodes many more characters than PAGENAMEE. Since MediaWiki 1.17 it uses the "QUERY" style by default.

It can convert any valid input string from its native UTF-8 encoding.
This function also converts the 9 characters that are forbidden in pagenames and listed at the top of this page.
It converts the "three special characters" differently from {{PAGENAME}}, using %nn hexadecimal triplets instead of HTML entities.
It preserves the distinction between space and underscore (a distinction lost only in MediaWiki pagenames).
The result conforms to the RFC 1738 URL encoding standard, using only letters, digits, "safe" characters, and the two characters % (followed by two hexadecimal digits) and + (to encode spaces).
This result is fully and easily reversible, but MediaWiki does not natively provide a urldecode function to do it.
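The reversibility mentioned above can be illustrated outside MediaWiki. The sketch below uses Python's quote_plus as a stand-in for the QUERY style; this is an assumption, since the two are not identical (Python leaves the tilde unencoded, for example).

```python
from urllib.parse import quote_plus, unquote_plus

original = "A_B C&D=東京"

# Percent-encode everything except letters, digits and a few safe characters;
# spaces become plus signs.
encoded = quote_plus(original, safe="")
# Decoding restores the exact input, including the space/underscore distinction.
decoded = unquote_plus(encoded)

print(encoded)
print(decoded == original)
```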

It can also be used to let people editing the wiki source work with the multilingual characters they are accustomed to, rather than deal with the more opaque percent-encoded characters. When considering urlencode to construct an external link URL, especially within a template, there are two design styles that might be appropriate. Which one to choose is a matter of the trade-off between generality and ease of use.

  • For maximum generality, there is no simple combination of PAGENAME and other default wiki magic words to provide a general solution and to handle names that include all possible characters in pagenames. The not-allowed pagename characters and the three special pagename characters both present issues. If a desired name uses any of those characters, then the actual pagename would have to be different. The most general design for a template would be a template with two parameters: a URL-style searchpart-encoded parameter for the URL link and an HTML-style parameter for the link label. The URL-style parameter would be added to a search or lookup URL and the HTML-style parameter would be used to label the link. For instance, a template called OrgName that looks up an organization by name with the unusual 10-character organization name of a%23b> {c} would call the template as {{OrgName|a%2523b%3E+%7Bc%7D|a%23b&#62; &#123;c&#125;}}. Variations on this might use %20 instead of + in the URL-style parameter for space.
  • Another (but unrelated) escaping used in HTML or XML is to use &gt; instead of &#62; for the greater-than character in the HTML-style parameter, or just the plain characters when they work. To be rigorous, one might argue that having two mandatory arguments is the best style for long-term stability in case the page is moved or translated to some other wiki where the naming style of pages is different, such as one where a different alphabet is used for naming pages.
  • The urlencode parser function can be used to create a template that is easy to use but not perfectly general. The urlencode function (in the query style) converts almost all characters (including percent and plus) to %nn hexadecimal sequences, except alphanumerics and three of the RFC 1738 URL "safe" characters: - _ . (dash, underscore, period). It converts blank to plus and encodes all non-ASCII characters as %nn hexadecimal sequences.
  • Embedding the code fragment {{urlencode:{{{userparam|{{PAGENAME}}}}}}} in a template to create an external link URL can be useful (i.e. treating simple pagenames as data). A pagename containing any of the "three special" pagename characters (which PAGENAME and similar functions return HTML-encoded, because their result is intended for display on an HTML page) might be a problem. For example, for a pagename with an ampersand, the HTML-style ampersand (&amp;) would be converted by {{urlencode:pagename}} into the URL query style %26amp%3B, which most remote web sites would not handle successfully. For names with the problematic characters, one could simply not use the template and instead provide a direct link in the wiki source, or add appropriate templates or extensions to the wiki to support string manipulation.
  • A compromise between these two styles is a variation on the above code fragment, such as {{{userparam|{{urlencode:{{PAGENAME}}}}}}}, where userparam is optional but, when explicitly supplied, has to be searchpart-encoded.
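The double-encoding problem described in the list above is easy to reproduce outside MediaWiki. In this hedged sketch, Python's quote_plus stands in for the QUERY style of urlencode, and searchpart_from_displayed_name is a hypothetical helper showing what a string-manipulation extension would need to do: undo the HTML encoding before URL-encoding.

```python
import html
from urllib.parse import quote_plus


def searchpart_from_displayed_name(displayed: str) -> str:
    """Decode HTML entities first, then percent-encode for a URL searchpart."""
    return quote_plus(html.unescape(displayed), safe="")


displayed = "Tom &amp; Jerry"  # an HTML-encoded name, as in the example above

# Encoding the HTML-encoded text directly percent-encodes the entity itself.
naive = quote_plus(displayed, safe="")
fixed = searchpart_from_displayed_name(displayed)

print(naive)
print(fixed)
```

The naive result contains %26amp%3B, which a remote site would read as the literal text "&amp;" rather than as an ampersand.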

Note that there is no MediaWiki parser function that can decode the HTML-encoding performed by PAGENAME. Likewise, there is no function to decode the special encoding performed by PAGENAMEE or found in URL paths to wiki pages. Parser functions like #ifeq or #switch work because they compare their inputs after HTML-decoding them, but they never URL-decode their parameters.

  • To safely compare the result of {{PAGENAME}} with a specified static name that contains one of the three special characters and has not been HTML-encoded (for example, the value of a parameter given to a transcluded template in a wiki page), first convert that value to the same special encoding performed by PAGENAME by passing it as the parameter of the {{PAGENAME:...}} parser function.
  • You can do the same to compare pagenames against the value of {{PAGENAMEE}}, or you can use the urlencode function with the supplementary style parameter, as in {{urlencode:pagename|WIKI}}.

Web browser URL and wiki web server HTTP interface

The URL you type or paste into your web browser's address bar is similar to, but not exactly the same as, PAGENAMEE.

  • To type a pagename as a URL in your web browser so that it goes directly to the page, the following two characters must be URL-style encoded while being typed: % ? as %25 %3F. A typical example is a pagename that ends in a question mark, for which wiki editors create a redirect without the question mark so that it works anyway. If you type a space in the middle of a URL, your browser will convert it to %20 before sending it to any sort of web server. The same applies to the double-quote character ", which is converted to %22. Depending on your browser, it may also encode some of the "unsafe" characters such as %&'`. See RFC 1738 for details, but note that this behavior is browser-dependent. Compared to browsers that support only http, browsers that support other schemes such as ftp tend to convert more of these characters.
  • How a URL with percent-encoding is displayed in a web browser's address box depends on whether the wiki web server has used URL redirection. The characters of the PAGENAMEE character set will be converted only if they are adjacent to a space. For instance, if you type a URL ending in A_=_B or A=B into your web browser, it will send that URL directly and you will get to the wiki page if it exists. If you enter a URL ending in A = B (with spaces around the equals sign), your web browser encodes the spaces as %20 and thus sends A%20=%20B to the wiki web server. The wiki web server then converts the string to A_%3D_B and sends that back to your web browser via URL redirection. This is why, on a slow Internet link, you might see the spaces in a pagename change first to %20 and then to an underscore: your browser does the first conversion and the wiki web server does the second. You can try to see the real URL by copying the URL in the browser and pasting it as text into a simple text editor, but even this technique may produce browser-dependent results.
  • While not specific to the wiki web server, for wide characters the browser performs a partial urldecode action on the real URL. This urldecoding is essential for the usability of wide characters in URLs. As an example, for an otherwise simple URL ending in a UTF-8 string percent-encoded as %E6%9D%B1%E4%BA%AC, your browser will usually urldecode that part and display it as 東京 (Unicode U+6771 U+4EAC), which are the two Kanji characters for Tokyo. This can apply to both 7-bit and wide characters but is browser-dependent. For instance, if you visit the eight-character pagename A!*-. ~A as http://en.wikipedia.org/wiki/A%21%2A%2D%2E%5F%7EA, you may find that your web browser then displays a URL in which none, some, or all of the percent-encoded characters have been urldecoded, and that a cut-and-paste of the browser URL into simple text includes none, some, or all of this urldecoding. How much of this urldecoding occurs during cut-and-paste is browser-dependent.
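The urldecoding that a browser may apply for display can be reproduced with Python's standard library. This is only a sketch of the decoding step itself; as noted above, what a given browser actually shows is browser-dependent.

```python
from urllib.parse import unquote

# Percent-encoded UTF-8 octets decode back to the two Kanji for Tokyo.
tokyo = unquote("%E6%9D%B1%E4%BA%AC")
code_points = [f"U+{ord(c):04X}" for c in tokyo]

# The same decoding applied to the 7-bit example from the text.
ascii_example = unquote("A%21%2A%2D%2E%5F%7EA")

print(tokyo, code_points)
print(ascii_example)
```

In the second result the underscore appears where the displayed pagename has a space, because the URL form of a page name uses underscores.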

Encodings compared

The following table shows the effect of the various supported encodings over the full set of printable ASCII characters (plus SPACE) and on the two first printable Unicode characters after ASCII. Tabulations and other whitespace controls are discussed more completely in the section below about whitespaces, but the table shows some contextual "effects" occurring with possibly dropped spaces and some other characters.

Each entry below gives an encoding, followed on the next line by its result for the input characters.

Input characters:
    09AZ- az   . . .  . ! " # $ % & ' ( ) * + , . .. ... / : ; < = > ? @ [ \ ] ^ _ ` { | } ~   ¡
{{PAGENAME:...}}:
    09AZ- Az . . . . ! &quot; $ % &amp; &apos; ( ) * + , ... / ; = ? @ \ ^ ` ~ ¡
{{PAGENAMEE:...}}:
    09AZ- Az ._. ._. ! %22 $ %25 %26 %27 ( ) * %2B , ... / ; %3D %3F @ %5C %5E %60 ~ %C2%A1
{{urlencode:...|WIKI}}:
    09AZ- az ._. .__. ! %22 %23 $ %25 %26 %27 ( ) %2B , . .. ... / %3C %3D %3E %3F @ %5B %5C %5D %5E _ %60 %7B %7C %7D ~ %C2%A0 %C2%A1
{{urlencode:...|PATH}}:
    09AZ- az .%20. .%20%20. %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C . .. ... %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D ~ %C2%A0 %C2%A1
{{urlencode:...|QUERY}}:
    09AZ- az .+. .++. %21 %22 %23 %24 %25 %26 %27 %28 %29 %2A %2B %2C . .. ... %2F %3A %3B %3C %3D %3E %3F %40 %5B %5C %5D %5E _ %60 %7B %7C %7D %7E %C2%A0 %C2%A1
{{anchorencode:...}}:
    09AZ- az ._. ._. ! " $ % & ' ( ) + , . .. ... / < = > ? @ [ \ ] ^ ` { | } ~ ¡

The behaviour of the {{anchorencode:...}} parser function depends on the $wgFragmentMode setting.

With the various encodings proposed in MediaWiki, it is notable that the only characters that are never transformed (or removed) are the 10 decimal digits, the minus-hyphen (-) and the uppercase Basic Latin letters (A-Z : an initial lowercase letter may be transformed by capitalisation in most wikis except those that preserve the case distinction in page names).

Note also that namespace names and interwiki prefixes are not case-sensitive, and if they are recognized at the beginning of a title, they may be replaced by a completely unrelated term, possibly in another language and/or script! So be careful with everything that comes before a colon (:), as the behavior is specific to each wiki and its own local set of recognized namespace names (or synonyms) and interwiki prefixes. These local prefixes do not affect what urlencode and anchorencode return, which is independent of each wiki's local naming rules.

The PATH and QUERY styles of urlencode are almost identical and differ only in how they treat the tilde and the space; the older WIKI style is listed last:

  • The PATH style preserves the tilde (~) and encodes spaces with percent-hex notation (%20). It is not used in web URLs, only in local file pathnames for use with Unix-like shells (without needing extra escaping or quotation marks). It is not used for URLs using the file: URI scheme.
  • The QUERY style encodes the tilde (~) and encodes spaces with a plus sign (+). It is used in paths of web URLs for better compatibility with common web usages. Since MediaWiki version 1.17, it is the default encoding style used by urlencode if you don't specify any style, or if you give an unknown or empty style.
  • The WIKI style was the default style in MediaWiki before version 1.17. It unfortunately dropped some leading characters (the asterisk (*), colon (:), or semicolon (;)), causing problems when it was used to pass values in query parameters to a remote web API. Do not use this style except for compatibility with old templates that depend on this old behavior to detect these three characters.
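The difference between the PATH and QUERY styles can be approximated with Python's standard library. This is a sketch under the assumption that the two styles behave as shown in the table above; urlencode_path and urlencode_query are hypothetical names, and the WIKI style is not modelled.

```python
from urllib.parse import quote, quote_plus


def urlencode_path(text: str) -> str:
    """PATH-like: space becomes %20 and the tilde is preserved."""
    return quote(text, safe="")


def urlencode_query(text: str) -> str:
    """QUERY-like: space becomes + and the tilde is encoded as %7E."""
    # Python's quote_plus leaves the tilde alone, so it is encoded explicitly.
    return quote_plus(text, safe="").replace("~", "%7E")


sample = "a b~c"
print(urlencode_path(sample))
print(urlencode_query(sample))
```

Both helpers leave letters, digits, dash, underscore and period untouched, which matches the corresponding rows of the table.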

Capitalisation of page names

Lowercase letters (a-z) are preserved, except at the initial position where they may be converted to uppercase with PAGENAME and PAGENAMEE on wikis that have not disabled this capitalisation.

You can see an example of capitalisation in the table above.

Whitespaces in page names and anchors (section headings)

  • Leading and trailing spaces are dropped, and runs of spaces in the middle are collapsed into a single space in page names and anchors (but not in URL-encoding):
    {{PAGENAME: A  B }} → A B
    {{PAGENAMEE: A  B }} → A_B
    {{urlencode: A  B |WIKI}} → A__B
    {{urlencode: A  B |PATH}} → A%20%20B
    {{urlencode: A  B |QUERY}} → A++B
    {{anchorencode: A  B }} → A_B
  • Tabulations and newlines are not accepted in page names, but are preserved in URL-encoding and anchors (in the examples below, A and B are separated by a tab character):
    {{PAGENAME:A B}} → (an empty string, meaning an invalid pagename)
    {{PAGENAMEE:A B}} → (an empty string, meaning an invalid pagename)
    {{urlencode:A B|WIKI}} → A%09B
    {{urlencode:A B|PATH}} → A%09B
    {{urlencode:A B|QUERY}} → A%09B
    {{anchorencode:A B}} → A_B
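The whitespace handling for page names listed above can be sketched as a small normalisation function. This models only the behaviour shown in the examples (it is not MediaWiki's title parser), and normalize_pagename is a hypothetical name.

```python
import re


def normalize_pagename(raw: str) -> str:
    """Trim the ends and collapse runs of spaces or underscores to one space."""
    if re.search(r"[\t\r\n]", raw):
        return ""  # tabs and newlines make the name invalid (empty result)
    return re.sub(r"[ _]+", " ", raw).strip(" ")


name = normalize_pagename(" A  B ")
print(repr(name))                    # the PAGENAME-like form
print(repr(name.replace(" ", "_")))  # the PAGENAMEE-like form for this name
print(repr(normalize_pagename("A\tB")))
```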

Colons in page names

The colon (:) is treated specially in page names when it is the first character of the trimmed name (where it links to a description page instead of showing the content of that page when the page is in one of the special namespaces like "File", "Image", or "Int"). But PAGENAME drops this leading colon, along with any spaces immediately after it:

  • {{PAGENAME: : Example }} → Example
    {{NAMESPACE: : Example }} → (an empty string, meaning the main namespace of the wiki)

Otherwise, if the non-empty text before the first colon matches a known local namespace, then this namespace and the colon are dropped, along with any spaces immediately after that colon (the dropped namespace is trimmed and returned by {{NAMESPACE:...}}):

  • {{PAGENAME: File : Example }} → Example
    {{NAMESPACE: File : Example }} → File

Otherwise, if the non-empty text before the first colon matches a known interwiki prefix, then this prefix and that colon are dropped, along with any spaces immediately after that colon, but an empty namespace is returned:

  • {{PAGENAME: mw:Example }} → Example
    {{NAMESPACE: mw:Example }} → (an empty string)
  • {{PAGENAME: mw: Example }} → Example
    {{NAMESPACE: mw: Example }} → (an empty string)
  • {{PAGENAME: w:fr:Example }} → W:fr:Example
    {{NAMESPACE: w:fr:Example }} → (an empty string)
  • {{PAGENAME: w: fr: Example }} → W: fr: Example
    {{NAMESPACE: w: fr: Example }} → (an empty string)
  • {{PAGENAME: m: w: fr: Example }} → M: w: fr: Example
    {{NAMESPACE: m: w: fr: Example }} → (an empty string)
  • {{PAGENAME: m : w : fr : Example }} → M : w : fr : Example
    {{NAMESPACE: m : w : fr : Example }} → (an empty string)

Otherwise, the colons are kept, even if the text before the first colon could be a valid interwiki prefix (containing only letters without case distinction, digits, hyphens and dashes, spaces, or underscores; not restricted to ASCII):

  • {{PAGENAME: Unknown : Example}} → Unknown : Example
    {{NAMESPACE: Unknown : Example}} → (an empty string)

The same rules are applied by {{PAGENAMEE:...}} and {{NAMESPACEE:...}} before they encode their return value.

Colons (and their surrounding spaces as long as they are not leading or trailing spaces) are left intact by {{FULLPAGENAME:...}}.

All colons are left intact by URL-encoding. But most (not all) colons are preserved by anchor-encoding.
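The namespace rules of this section can be summarised in a short sketch. It is a simplified model, not MediaWiki's title parser: LOCAL_NAMESPACES is a hypothetical stand-in for a wiki's configured namespaces, and interwiki prefixes and first-letter capitalisation are deliberately left out.

```python
# Hypothetical set of local namespaces, keyed case-insensitively.
LOCAL_NAMESPACES = {"file": "File", "talk": "Talk"}


def split_title(raw: str) -> tuple[str, str]:
    """Return (namespace, pagename) following the rules described above."""
    text = raw.strip()
    if text.startswith(":"):
        # A leading colon is dropped, along with the spaces after it.
        return "", text[1:].strip()
    prefix, colon, rest = text.partition(":")
    canonical = LOCAL_NAMESPACES.get(prefix.strip().lower())
    if colon and canonical is not None:
        # Known namespace: drop it and the colon from the page name.
        return canonical, rest.strip()
    # Unknown prefix: the colon is kept as part of the page name.
    return "", text


print(split_title(" : Example "))
print(split_title(" File : Example "))
print(split_title(" Unknown : Example"))
```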

Colons (:) in anchors

Anchor-encoding is a bit more tricky: most colons are kept, except when they are in leading position (even though a section heading like this one could start with a colon).

So for the title of this section, you get

  • {{anchorencode: : Colons (:) in anchors}}
_Colons_(:)_in_anchors

Note that the colon is unexpectedly converted by inserting a newline before it, as if the parameter were the content of a wiki source page (causing an indented block to be rendered)! The result does not match the identifier that MediaWiki generated for this section heading.

Pipes (|) in anchors

A more critical bug/limitation is observed when the leading character is a pipe (|), because it is treated as a parameter separator of {{anchorencode:...}} (despite the fact that it takes only a single parameter with no extra option):

  • {{anchorencode: | Pipes (|) in anchors}} - an empty string, so everything after the first pipe in the section heading is discarded!

A common work-around uses the common utility template {{!}}, so that the verbatim pipe returned by this template is not interpreted as a parameter separator:

  • {{anchorencode: {{!}} Pipes ({{!}}) in anchors}} → |_Pipes_(|)_in_anchors

This works because template expansion is deferred. The parser first reads the parser function name in {{anchorencode:...}} up to the colon and then (needlessly) splits the rest on pipes. Templates inside the parameter names or values, between the colon and the closing double brace, are expanded only when the parser function itself queries those parameters, so the expansion cannot change the number or order of the parameters.
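The order of operations described above (split on pipes first, expand templates afterwards) can be mimicked with a toy parser. This is a deliberately naive sketch: parse_call is a hypothetical function, it only understands the {{!}} template, and it would mis-split nested templates that contain pipes of their own.

```python
def parse_call(source: str) -> tuple[str, list[str]]:
    """Split a parser function call into its name and expanded parameters."""
    inner = source.strip()[2:-2]          # drop the outer {{ and }}
    name, _, rest = inner.partition(":")  # the name ends at the first colon
    raw_params = rest.split("|")          # step 1: split on verbatim pipes
    # Step 2: expand {{!}} only after the parameters have been separated.
    params = [p.replace("{{!}}", "|").strip() for p in raw_params]
    return name.strip(), params


print(parse_call("{{anchorencode: | Pipes (|) in anchors}}"))
print(parse_call("{{anchorencode: {{!}} Pipes ({{!}}) in anchors}}"))
```

In the first call the first parameter is empty, which corresponds to the empty result reported above; in the second call the whole heading stays in a single parameter.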

The same work-around may be used if you need to pass any of the following:

  • two verbatim opening braces ({{) by using {{(}}{{(}} or {{((}}
  • two verbatim closing braces (}}) by using {{)}}{{)}} or {{))}}


Semicolons (;), asterisks (*), or sharp signs (#) in anchors

The same bug does not occur when the first non-blank character (or any further character) of a section heading is a semicolon (;), an asterisk (*), or even a sharp sign (#), so these characters are preserved along with the rest of the string:

  • {{anchorencode: ; Semicolons (;) in anchors}} → ;_Semicolons_(;)_in_anchors
  • {{anchorencode: * Asterisks (*) in anchors}} → *_Asterisks_(*)_in_anchors
  • {{anchorencode: # Sharp signs (#) in anchors}} → #_Sharp_signs_(#)_in_anchors

Full stops and slashes in page names

Note that page names are parsed from left to right into (possibly empty) segments (called "title parts") separated by slashes (/). In some cases, segments containing only a single dot (full stop, .) or two dots (..) cause the rest of the string to be transformed. See Help:Extension:ParserFunctions for details.

Otherwise these dots are left intact by {{urlencode:...}} and {{anchorencode:...}}, but slashes may be converted.

Also, the sequence of two successive slashes (//) may not be accepted in page names, depending on the configuration of the wiki. Usually this is an indicator that the name is a URL, when it is preceded by a valid URI scheme (or by no URI scheme at all, in which case a default http: or https: URI scheme is used, depending on the user's preference). A URI scheme should then contain a colon (:), but MediaWiki currently recognizes only URI schemes where the colon is final, in a restricted list; otherwise.

For example, on this wiki:

"{{PAGENAME|//www.mediawiki.org/}}" → "//www.mediawiki.org/"

On Wikimedia sites, such as MediaWiki.org, the double slashes are recognized as URIs, and most valid URIs are disallowed as page names (if a URI scheme is present, it could be recognized as a namespace if it has been configured; otherwise the page name falls into the main namespace of the wiki):

  • Creating a link to these URI-like page names uses:
    • [[page name in double brackets|with optional displayed text]]
  • But links to the effective target of the URI use either of the following:
    • [URL-without-spaces-in-single-brackets with optional displayed text]
    • URI-without-spaces (also displayed verbatim, but the link is activated only conditionally, as it is subject to the restrictions on recognized URI schemes).

So on this wiki on mediawiki.org, the following code unexpectedly creates a direct link to the external URL, surrounded by verbatim single brackets:

[[//www.mediawiki.org/|www.mediawiki.org]] → [[1]]

URIs are not recognized by URL-encoding and anchor-encoding (this means that valid full URLs cannot be safely created with urlencode!).