Manual:Parser

From Linux Web Expert

This is an overview of the design of the MediaWiki parser.

Design principles

The MediaWiki parser is not really a parser in the strict sense of the word. It does not recognise a grammar; rather, it translates wikitext to HTML. It was called a parser for want of a better word, and even before the term was introduced as a class name, it was generally understood what was meant by "the MediaWiki Parser".

Performance is its primary goal, taking precedence over readability of the code and the simplicity of the markup language it defines. As such, changes which improve the performance of the parser will be warmly received.

Since the parser operates on potentially malicious user input up to 2MB in size, it is essential that it has a worst case execution time proportional to the input size, rather than proportional to the square of the input size.

The parser targets a low-memory environment, assuming a few hundred MB of RAM, and thus it uses markup as intermediate state where possible instead of generating inefficient PHP data structures.

Security is also a critical goal -- user input cannot be allowed to reach the HTML output unvalidated, unless this is specifically configured for the wiki. Remote images, and other markup which causes the client to send a request to an arbitrary remote server, are not allowed by default, for privacy reasons.

History

Lee Daniel Crocker wrote the initial version of MediaWiki in 2002. His wikitext parser was originally inside the OutputPage class, with the main entry point being OutputPage::addWikiText(). The basic structure was similar to the current parser. It stripped out non-markup sections such as <nowiki>, replacing them with temporary strip markers. Then it ran a security pass (removeHTMLtags), then a series of transformation passes, and then finally put the strip markers back in.

The transformation passes used plain regex replacement where possible, and tokenization based on explode() or preg_split() for more complex operations. The complete implementation was about 700 lines.

Many of the passes still exist with their original names, although almost all of them have been rewritten.

In 2004, Tim Starling split the parser out to Parser.php, and introduced ParserOptions and ParserOutput. He also introduced templates and template arguments. Significant work was contributed by Brion Vibber, Gabriel Wicke, Jens Frank, Wil Mahan and others.

In 2008, for MediaWiki 1.12, Tim merged the strip and replaceVariables passes into a new preprocessor, which was based on building an in-memory parse tree, and then walking the tree to produce expanded wikitext.

In 2011, the Parsoid project began. Parsoid is an independent wikitext parser in JavaScript, introduced to support VisualEditor. It includes an HTML-based DOM model and a serializer which generates wikitext from a (possibly user-edited) DOM. At this point, harmonization with Parsoid became a development goal for the MediaWiki parser.

For some time, it was unclear whether the MediaWiki parser would continue to exist in the long term, or whether it would be deprecated in favour of Parsoid. Current thinking is that at least the preprocessor component of the MediaWiki parser will be retained. Parsoid lacks a complete preprocessor implementation, and relies on remote calls to MediaWiki to provide this functionality.

Entry points

The main public entry points which start a parse operation are:

parse()
Generates a ParserOutput object, which includes the HTML content area and structured data defining changes to the HTML outside the content area, such as JavaScript modules and navigation links.
preSaveTransform()
Wikitext to wikitext transformation, called before saving a page.
getSection(), replaceSection()
Section identification and extraction to support section editing.
preprocess()
Wikitext to wikitext transformation with template expansion, roughly equivalent to the first stage of HTML parsing. This is used by Parsoid to remotely expand templates. Message transformation also uses this function.
startExternalParse()
This sets up the parser state so that an external caller can directly call the individual passes.

Input

The input to the parser is:

  • Wikitext
  • A ParserOptions object
  • A Title object and revision ID

There are also some dependencies on global state and configuration, notably the content language.

ParserOptions has many options, which collectively represent:

  • User preferences which affect the parser output. This was originally the main application for ParserOptions, which is why it takes a User object as a constructor parameter. It is important that the caching system is aware of such user options, so that users with different options have cached HTML stored in different keys. This is handled via ParserOptions::optionsHash().
  • Caller-dependent options. For example, Tidy and limit reporting are only enabled when parsing the main content area of an article. Different options are set for normal page views, previews and old revision views.
  • Test injection data. For example, there is setCurrentRevisionCallback() and setTemplateCallback() which can be used to override certain database calls.
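
The cache-key idea behind the first bullet can be sketched as follows. This is a language-neutral illustration written in Python, not MediaWiki code: the option names, the `options_hash()` and `cache_key()` helpers and the key layout are all hypothetical.

```python
import hashlib

# Hypothetical option names: only these are assumed to change the rendered HTML.
OPTIONS_AFFECTING_OUTPUT = ("user_lang", "thumb_size", "date_format")


def options_hash(options: dict) -> str:
    """Hash only the options that change the rendered HTML."""
    relevant = [f"{name}={options.get(name)}" for name in OPTIONS_AFFECTING_OUTPUT]
    return hashlib.sha1("!".join(relevant).encode("utf-8")).hexdigest()[:12]


def cache_key(page_id: int, options: dict) -> str:
    """Build a cache key that differs whenever the hashed options differ."""
    return f"pcache:{page_id}:{options_hash(options)}"


alice = {"user_lang": "en", "thumb_size": 220, "date_format": "dmy"}
bob = {"user_lang": "de", "thumb_size": 220, "date_format": "dmy"}
# "limit_report" is a made-up option that is not part of the hash.
carol = dict(alice, limit_report=False)

print(cache_key(42, alice) != cache_key(42, bob))    # languages differ, so keys differ
print(cache_key(42, alice) == cache_key(42, carol))  # unhashed option, same key
```

The second comparison shows why options outside the hash need care: two callers with different settings would read and write the same cache entry.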

During a parse operation, the ParserOptions object and the title and revision context are available via the relevant accessors. The input text is not stored in a member variable; it is available only via formal parameters.

Output

Some entry points only return text, but there is always a ParserOutput object available which can be fetched with Parser::getOutput().

The ParserOutput object contains:

  • The "text" HTML fragment, set shortly before parse() returns.
  • Extensive metadata about "links", which is used by LinksUpdate to update SQL caches of link information. This includes category membership, image usage, interlanguage and interwiki links, and extensible "page properties". In addition to being used to update database index tables, category and interlanguage links also affect the page display.
  • Various properties which affect the page display outside the content area. This includes JavaScript modules, to be loaded via ResourceLoader, the page title, for <h1> and <title> elements, indicators, categories and language links.

ParserOutput is a serializable object. It is stored into the ParserCache, often on page save, and retrieved on page view.

The current OutputPage object represents the output from the current request. It is vital that no parser extension directly modifies OutputPage, since such modifications will not be reproduced when the ParserOutput object is retrieved from the cache. Similarly, it is not possible to hook into the skin and to use a class static property set during parse to affect the skin output.

Instead, extensions wishing to modify the page outside the content HTML can use ParserOutput::setExtensionData() to store serializable data which they will need when the page is displayed. Then ParserOutput::addOutputHook() can be used to set a hook which will be called when the ParserOutput is retrieved and added to the current OutputPage.
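
The pattern of storing serializable data at parse time and replaying a hook at display time might look roughly like this. It is a toy model in Python under stated assumptions: `CachedOutput`, `HOOKS`, `add_banner()` and `add_to_page()` are hypothetical names, and JSON stands in for whatever serialization the cache actually uses.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CachedOutput:
    """Stand-in for a serializable parser output (not the real ParserOutput)."""
    text: str = ""
    extension_data: dict = field(default_factory=dict)
    output_hooks: list = field(default_factory=list)  # hook names, not callables

    def to_cache(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_cache(cls, blob: str) -> "CachedOutput":
        return cls(**json.loads(blob))


def add_banner(page: dict, data: dict) -> None:
    page["banners"].append(data["banner_text"])


# Hypothetical hook registry, looked up by name at display time.
HOOKS = {"add_banner": add_banner}


def add_to_page(page: dict, output: CachedOutput) -> None:
    """Replay the stored hooks each time the output is shown, cached or not."""
    page["body"] = output.text
    for name in output.output_hooks:
        HOOKS[name](page, output.extension_data)


# Parse time: record data and a hook name instead of touching the page directly.
parsed = CachedOutput(text="<p>Hello</p>")
parsed.extension_data["banner_text"] = "Draft"
parsed.output_hooks.append("add_banner")
blob = parsed.to_cache()

# A later request: the output comes from the cache, and the hook still runs.
page = {"body": "", "banners": []}
add_to_page(page, CachedOutput.from_cache(blob))
print(page)
```

Because only plain data and a hook name are stored, the effect on the page survives a round trip through the cache, which a direct modification of the page object at parse time would not.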

State

The Parser object is both a long-lived configuration object and a parse state object.

The configuration aspect of the Parser object is initialised when clearState() calls Parser::firstCallInit(). This sets up extensions and core built-ins, and builds regexes and hashtables. It is quite slow (~10ms) so multiple calls should be avoided if possible.

The parse state aspect of the Parser object is initialised by the entry point, which sets several variables, and calls clearState(), which clears local caches and accumulators.

It is difficult to run more than one parse operation at a time. Attempting to re-enter Parser::parse() from a parser hook will lead to destruction of the previous parse state and corruption of the output. In theory one can set the $clearState parameter to parse() to false to prevent the clearState() call and allow re-entry, but in practice this is almost never done and probably doesn't work.

In practice, there are two options for recursive re-entry:

Cloning the Parser object
This is often done and will probably work. As long as all extensions cooperate, it provides an independent state which allows a second parse operation to be started immediately via an entry point such as Parser::parse(). However, note that PHP's clone operator is a shallow copy. This means that if any parse state is stored in object references, that parse state will be shared with the clone, and modifications to the clone will affect the original object. The core tries to work around this by breaking object references in __clone(). Extensions that store state in object references attached to the Parser object should hook ParserCloned and manually break such references.
Using the recursive entry points
These allow text to be parsed in the same state as the currently executing parse operation, without clearing the current state. Notably:
  • recursiveTagParse(): This returns "half-parsed" HTML, with strip markers still included, suitable for return from a tag or function hook.
  • recursiveTagParseFully(): This returns fully parsed HTML, suitable for direct output to the user, for example via ParserOutput::setExtensionData().
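
The shallow-copy pitfall mentioned for cloning is not specific to PHP. The sketch below reproduces it in Python, where `copy.copy()` is also shallow; `LinkBatch`, `NaiveParser` and `ToyParser` are hypothetical classes, and `__copy__` plays the role that the text assigns to `__clone()`.

```python
import copy


class LinkBatch:
    """Hypothetical piece of parse state that is held by reference."""

    def __init__(self):
        self.pending = []


class NaiveParser:
    def __init__(self):
        self.batch = LinkBatch()


class ToyParser:
    def __init__(self):
        self.batch = LinkBatch()

    def __copy__(self):
        clone = type(self).__new__(type(self))
        clone.__dict__.update(self.__dict__)   # shallow copy of every attribute
        clone.batch = LinkBatch()              # then break the shared reference
        return clone


naive = NaiveParser()
naive_clone = copy.copy(naive)          # shallow: both objects share one LinkBatch
naive_clone.batch.pending.append("Foo")

safe = ToyParser()
safe_clone = copy.copy(safe)            # __copy__ gives the clone its own batch
safe_clone.batch.pending.append("Foo")

print(naive.batch.pending)  # the original was modified through the clone
print(safe.batch.pending)   # the original is untouched
```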

Caching

The parser has several kinds of cache, each of which imposes restrictions on how pieces of input or output change.

ParserCache

Main page: Manual:Parser cache

The parser cache is external to the parser. It caches the ParserOutput object associated with the main page view for a given article. As discussed above in the sections on input and output, it is important to avoid caching data intended for one user that won't make sense for another user. This means using the ParserOptions and ParserOutput objects correctly, instead of bypassing them with global state.

External callers can take advantage of the parser cache by calling WikiPage::getParserOutput(). However, callers must take care to use a ParserOptions object which matches the one that would be used for a page view with the same value of ParserOptions::optionsHash(). Usually this means it should be generated by WikiPage::makeParserOptions(). There are many parser options which are not represented in the options hash -- setting those options before calling WikiPage::getParserOutput() will cause the parser cache to be polluted, see T110269.

Caches internal to the parser

Because the MediaWiki parser can be heavily customised by its callers, and because its output depends on changeable wiki data, few things are stored in a shared cache. The exception to this is the preprocessor DOM cache. All other caches managed by the parser are process-local and are cleared at the start of each parse operation.

Preprocessor DOM shared cache

This is a shared cache of the preprocessor DOM. The preprocessor grammar depends very sparingly on configuration and hooks, and thus can be cached between requests. The preprocessor grammar does depend on the set of XML-style extension tags currently registered, so this should not be changed by callers unless the risk of cache pollution is very low (e.g. unit tests).

Preprocessor DOM local cache

There is also a cache of the preprocessor DOM which is local to the parse operation.

Empty-argument expansion cache

This is a cache of the result of expanding a template with no arguments, for example {{foo}}, as opposed to {{foo|bar}}. It is local to the PPFrame, that is, template invocations from different templates are allowed to return different results. Also, parser functions may use PPFrame::setVolatile() to disable this cache in the frame in which they are invoked. Because of this cache, the number of times a parser function hook is called may differ from the number of times it appears to be invoked in the wikitext.

Warning: Our plans are to strengthen this cache. New code should not use PPFrame::setVolatile() or otherwise depend on the same template invocation being able to return different wikitext depending on context. Parsoid assumes that the result of top-level-frame template expansion on given wikitext always remains the same, for a given parse operation.
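
For illustration only, the behaviour of a per-frame cache of argument-less expansions can be modelled as below. This is a toy in Python, not the PPFrame API: the `Frame` class, its `set_volatile()` method and the `counter` template are hypothetical, and the volatile switch is shown only to explain the existing behaviour.

```python
class Frame:
    """Toy stand-in for a preprocessor frame."""

    def __init__(self, templates):
        self.templates = templates      # name -> callable returning wikitext
        self.cache = {}                 # results of argument-less expansions
        self.volatile = False

    def set_volatile(self, flag=True):
        self.volatile = flag

    def expand(self, name, args=()):
        cacheable = not args and not self.volatile
        if cacheable and name in self.cache:
            return self.cache[name]
        result = self.templates[name](*args)
        if cacheable:
            self.cache[name] = result
        return result


calls = []


def counter():
    """A template whose output changes on every call."""
    calls.append(1)
    return f"call {len(calls)}"


frame = Frame({"counter": counter})
first = frame.expand("counter")
second = frame.expand("counter")            # served from the cache

volatile_frame = Frame({"counter": counter})
volatile_frame.set_volatile()
third = volatile_frame.expand("counter")
fourth = volatile_frame.expand("counter")   # the callable runs again

print(first, second, third, fourth, len(calls))
```

Four expansions trigger only three calls, which is the sense in which the cache changes how often a hook runs.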

Current revision cache

This is a cache of the Revision object associated with a given title, local to the current parse operation.

VarCache

"Variables" in parser terminology are syntax constructs which give information about the environment. They are invoked like templates with no arguments, and usually have names spelt in upper case, for example {{PAGENAME}} and {{CURRENTDAY}}. There is a cache of these variables local to the parse operation, so for example it is not possible for {{CURRENTTIME}} to return different values even if the parse operation lasts for more than one second.

Parse time

There is a special cache of the current time, which ensures that time-related variables agree with each other. The caller can customise the time seen by the parser with ParserOptions::setTimestamp().
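
A minimal sketch of the idea, assuming a hypothetical `ParseClock` class: the clock is read at most once per parse, and a caller may supply the time instead. The variable names are real wikitext variables, but the formatting table is an illustration rather than MediaWiki's implementation.

```python
from datetime import datetime, timezone


class ParseClock:
    """Fix the time once per parse so that every time variable agrees."""

    FORMATS = {"CURRENTYEAR": "%Y", "CURRENTDAY2": "%d", "CURRENTTIME": "%H:%M"}

    def __init__(self, timestamp=None):
        self._timestamp = timestamp     # caller-supplied time, if any

    def now(self):
        if self._timestamp is None:
            self._timestamp = datetime.now(timezone.utc)   # read the clock once
        return self._timestamp

    def variable(self, name):
        return self.now().strftime(self.FORMATS[name])


# One second before midnight on New Year's Eve: all variables see the same instant.
clock = ParseClock(datetime(2020, 12, 31, 23, 59, 59, tzinfo=timezone.utc))
print(clock.variable("CURRENTYEAR"), clock.variable("CURRENTDAY2"), clock.variable("CURRENTTIME"))
```

Without the fixed timestamp, a parse that straddles midnight could report a day from one year and a year from the next.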

External modules may implement caching

The MediaWiki parser has many features which query the current state of the wiki. The parser calls into the relevant MediaWiki module, which may implement its own caching. There are global caches, most often in memcached, and in-process caches which are stored for the lifetime of the request, usually without invalidation. For example:

LinkCache
This holds a process-local cache of page existence, which is used for link colouring.
Title
There is a process-local cache of page metadata, managed as a static singleton attached to the Title class.
FileRepo
There are both shared and process-local caches of file/image metadata, used for rendering of image links.
MessageCache
The {{int:}} parser function invokes MediaWiki's localisation system, which has many layers of caching.

Batching

It is usually more efficient to query the database (or other storage backend) in batches, rather than sending a request for each item of data as it is seen in the wikitext stream. However, batching can be difficult to implement. There is currently only one wikitext feature which makes use of the technique: link colouring.

Links to articles which exist are coloured blue, and links to non-existent pages are coloured red. There is also a little-known feature (which may some day be retired) which allows users to specify a "stub threshold", and then links are coloured depending on the size of the target article.

Wiki pages can sometimes have a very large number of links. We originally tried to store link metadata for the whole parse operation, to be resolved in a single batch query, but storing so many links in an array caused out-of-memory errors. So the current technique is to replace each link as it is seen with a placeholder, with metadata stored in an array. Then when there are 1000 links in the array, LinkHolderArray::replaceLinkHolders() is triggered, which does the relevant query for the current batch and replaces the placeholders with actual link HTML.
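
The placeholder-and-flush technique can be sketched as follows. This is a simplified Python model, not LinkHolderArray itself: the `LinkHolders` class, the placeholder format, the `lookup` callable and the tiny batch size are assumptions made for the example.

```python
import html
import re

BATCH_SIZE = 3      # the text describes batches of 1000; kept small for the demo
MARKER_RE = re.compile(r"<!--LINK (\d+)-->")    # hypothetical placeholder format


class LinkHolders:
    def __init__(self, lookup):
        self.lookup = lookup    # callable: list of titles -> set of existing titles
        self.pending = {}       # placeholder number -> title
        self.counter = 0
        self.queries = 0

    def hold(self, title):
        """Remember the title and return a placeholder for the link."""
        self.counter += 1
        self.pending[self.counter] = title
        return f"<!--LINK {self.counter}-->"

    def replace(self, text):
        """Resolve the current batch with one query and fill in the link HTML."""
        if not self.pending:
            return text
        existing = self.lookup(list(self.pending.values()))
        self.queries += 1

        def link(match):
            title = self.pending[int(match.group(1))]
            css = "" if title in existing else ' class="new"'
            return f"<a{css}>{html.escape(title)}</a>"

        text = MARKER_RE.sub(link, text)
        self.pending.clear()
        return text


def render(titles, holders):
    text = ""
    for title in titles:
        text += holders.hold(title) + " "
        if len(holders.pending) >= BATCH_SIZE:
            text = holders.replace(text)    # flush when the batch is full
    return holders.replace(text).strip()    # flush whatever remains


EXISTING = {"Main Page", "Help"}
holders = LinkHolders(lambda titles: EXISTING.intersection(titles))
result = render(["Main Page", "Nope", "Help", "Other"], holders)
print(result)
print(holders.queries)
```

Four links are resolved with two lookups, and no more than `BATCH_SIZE` titles are held in memory at once. The sketch rescans the whole text on every flush, which a production implementation would want to avoid.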

The internal link pass (replaceInternalLinks) is triggered recursively by various callers, so the current batch is stored by the Parser.

Passes

A parse operation transforms wikitext to HTML progressively, always following the order of passes given below. This order describes the data flow rather than the strict execution order, since there are numerous special cases where recursive parsing jumps ahead or restarts from a previous position. When new unprocessed input is discovered, it is processed from the start. And when output needs to be generated for feeding into a later pass, or for direct output to the user via a ParserOutput accessor, recursive parsing can proceed to the end of the pass list before resuming with the main text.

Preprocessor DOM generation

First, the input text is processed by the preprocessor with an integrated scanner and parser. This generates a special XML-like DOM. This DOM structure is not related to HTML.

There are two implementations of the preprocessor: Preprocessor_DOM stores its intermediate DOM using PHP's DOM extension, and Preprocessor_Hash stores its intermediate DOM in a tree of PHP objects. By default, Preprocessor_DOM is used under php.net PHP and Preprocessor_Hash is used under HHVM. These defaults are optimal for CPU performance.

The Preprocessor_DOM implementation has lower memory usage, although this memory usage is not properly accounted for by the runtime, and so will not cause a PHP out-of-memory error. This can cause swapping, or the kernel oom-killer may be invoked.

The Preprocessor_Hash implementation is a tree of typed node objects with children. Its nodes have no attributes; these are emulated with specially named elements.

The grammar recognised by the preprocessor, and the structure of the DOM thus generated, is discussed in the article about the Preprocessor ABNF.

Preprocessor expansion

Visiting DOM nodes and generating the wikitext they represent is termed expansion, because (short) template invocations are replaced with (long) template contents.
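
Expansion as a tree walk can be sketched as below. This is a deliberately small Python model: the node representation, the `TEMPLATES` table, the depth limit and the handling of unknown templates are all assumptions for the example and do not reflect the real preprocessor's data structures or error output.

```python
# A node is either plain text (str) or a ("template", name) tuple.
TEMPLATES = {
    "greeting": ["Hello, ", ("template", "name"), "!"],
    "name": ["world"],
}
MAX_DEPTH = 40   # assumed guard against templates that include themselves


def expand(nodes, depth=0):
    """Walk the tree and return the expanded wikitext."""
    if depth > MAX_DEPTH:
        return "[expansion depth exceeded]"
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
            continue
        _, name = node
        if name in TEMPLATES:
            parts.append(expand(TEMPLATES[name], depth + 1))
        else:
            parts.append("{{" + name + "}}")   # unknown: left as written in this sketch
    return "".join(parts)


tree = ["Intro: ", ("template", "greeting"), " ", ("template", "missing")]
print(expand(tree))
```

The short invocation `{{greeting}}` is replaced by the longer contents of the template, including the nested `{{name}}`, which is the sense in which the walk "expands" the text.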

The result of expansion is similar to wikitext, except that it has the following kinds of placeholder:

  • Extension strip markers. These markers represent half-parsed HTML generated by extension tags.
  • Heading markers. These markers represent the start of editable sections, where section edit links will later be conditionally placed. The preprocessor needs to mark headings which originated from the source, since headings may be generated by parser function expansion which do not correspond to any source location. Headings generated by templates can also generate editable sections.

Both of these kinds of placeholder use the special string Parser::MARKER_PREFIX. It is not possible for the preprocessor to use HTML comments as markers, since in the output markup of this pass, HTML comments are treated as user input, and will be stripped by Sanitizer::removeHTMLtags().
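
The marker technique itself can be sketched in a few lines. This Python example protects `<nowiki>` content from a later pass; the marker strings are simplified stand-ins rather than the real value of Parser::MARKER_PREFIX, and `strip()` and `unstrip()` are hypothetical helpers.

```python
import re

MARKER_PREFIX = "\x7fUNIQ-"     # assumed value; it must not occur in user input
MARKER_SUFFIX = "-QINU\x7f"
MARKER_RE = re.compile(re.escape(MARKER_PREFIX) + r"(\d+)" + re.escape(MARKER_SUFFIX))


def strip(text, store):
    """Replace each <nowiki> section with a marker and stash its content."""
    def stash(match):
        store.append(match.group(1))
        return f"{MARKER_PREFIX}{len(store) - 1}{MARKER_SUFFIX}"
    return re.sub(r"<nowiki>(.*?)</nowiki>", stash, text, flags=re.S)


def unstrip(text, store):
    """Put the stashed content back once the other passes are done."""
    return MARKER_RE.sub(lambda m: store[int(m.group(1))], text)


store = []
stripped = strip("''a'' <nowiki>''b''</nowiki>", store)
# A later pass (here, a toy italics pass) never sees the protected content.
transformed = re.sub(r"''(.*?)''", r"<i>\1</i>", stripped)
print(unstrip(transformed, store))
```

Only the unprotected `''a''` becomes italic; the content stashed behind the marker comes back untouched.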

Sanitizer::removeHTMLtags

This is the security pass, which escapes HTML tags and attributes which are not on a whitelist. It sanitizes CSS attributes to remove any possible scripts or remote loading.

This pass removes HTML comments. This is mostly for convenience, to allow future passes to replace text without having to recognise and skip comments.

From this point on, user input cannot be allowed to be passed directly to the output without validation and escaping. And from this point on, HTML comments can be used as markers.
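
The escape-by-default approach of a whitelist pass can be illustrated with the toy below. It is far simpler than Sanitizer::removeHTMLtags(): the whitelist is made up, attributes are dropped rather than validated, and existing character entities are not handled.

```python
import html
import re

ALLOWED_TAGS = {"b", "i", "p", "span"}      # assumed whitelist for the example
COMMENT_RE = re.compile(r"<!--.*?-->", re.S)
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def sanitize(text):
    """Drop comments, keep whitelisted tags without attributes, escape the rest."""
    text = COMMENT_RE.sub("", text)
    out, pos = [], 0
    for match in TAG_RE.finditer(text):
        # Text between tags is always escaped.
        out.append(html.escape(text[pos:match.start()], quote=False))
        slash, name, _attrs = match.groups()
        if name.lower() in ALLOWED_TAGS:
            out.append(f"<{slash}{name.lower()}>")   # attributes dropped in this toy
        else:
            out.append(html.escape(match.group(0), quote=False))
        pos = match.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


print(sanitize('<b onclick="x()">hi</b><!-- note --><script>alert(1)</script>'))
```

Everything that is not explicitly recognised as an allowed tag is escaped, so a malformed or unknown construct degrades to visible text rather than active markup.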

Markup transformation passes

Following are the remaining passes called by Parser::internalParse():

doTableStuff
This is a simple line-based parser which converts wikitext-style table markup to HTML table markup. It makes no effort to check the input for HTML-style table tags.
doDoubleUnderscore
This pass notes the presence of double-underscore items like __NOGALLERY__ and __NOTOC__ and removes them from the text. Parsoid refers to these as "behaviour switches". They are recognised with a MagicWordArray.
doHeadings
replaceInternalLinks
doAllQuotes
replaceExternalLinks
doMagicLinks
formatHeadings
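
As an illustration of the simplest of these passes, the behaviour described for doDoubleUnderscore (note the switch, then remove it from the text) can be sketched as follows. The Python function, the fixed list of switches and the lower-case result names are assumptions; the real pass uses a MagicWordArray and supports localised names.

```python
import re

# Hypothetical subset of switches for the example.
SWITCHES = ("NOTOC", "NOGALLERY", "FORCETOC")
SWITCH_RE = re.compile(r"__(" + "|".join(SWITCHES) + r")__")


def do_double_underscore(text):
    """Return the text without switches, plus the set of switches seen."""
    found = set()

    def note(match):
        found.add(match.group(1).lower())
        return ""                       # the switch itself leaves no output

    return SWITCH_RE.sub(note, text), found


text, found = do_double_underscore("__NOTOC__Intro text __UNKNOWN__ __NOGALLERY__")
print(repr(text), sorted(found))
```

Unrecognised double-underscore words such as `__UNKNOWN__` are left in the text, while recognised ones are recorded and removed.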

internalParseHalfParsed

Guillemet
doBlockLevels
replaceLinkHolders
Language conversion
Tidy
The non-tidy cases

Limit report

Security

Extensions

Hooks

The following hooks belonging to the Parser group are available:

Called from Parser :

Called from ParserOptions :

Called from ParserOutput :

Called from Sanitizer :

Called from DataAccess (for Parsoid):

Called from ParserCache :

Called from LinkHolderArray :

Deprecated or removed: