Manual:generateSitemap.php

From Linux Web Expert

Details

generateSitemap.php is a maintenance script that generates a sitemap for a MediaWiki installation. Sitemaps are files that make it more efficient for search engine robots (such as Googlebot) to crawl a website, provided the bot supports the sitemap protocol.

By default, the script generates a sitemap index file and one gzip-compressed sitemap for each namespace that has content. See the Options section for the options that can be passed to the script.

You may need to set up a cron job to update the sitemap automatically.
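
As a rough sketch of such a cron setup, the helper below prints a crontab entry that regenerates the sitemap every night at 03:00. The function name sitemap_cron_line, the schedule, and the /var/www/wiki path are assumptions made for illustration. The maintenance/run.php entry point applies to MediaWiki 1.40 and later; older versions would call maintenance/generateSitemap.php instead.

```shell
#!/bin/sh
# Hypothetical helper (not part of MediaWiki): print a crontab entry that
# regenerates the sitemap nightly at 03:00.
#   $1 = MediaWiki installation directory ($IP), assumed to be the document root
#   $2 = value for --server, e.g. https://www.example.com
sitemap_cron_line() {
    wiki_root=$1
    server=$2
    printf '0 3 * * * cd %s && php maintenance/run.php generateSitemap --fspath=%s/sitemap/ --urlpath=/sitemap/ --server=%s --skip-redirects >/dev/null\n' \
        "$wiki_root" "$wiki_root" "$server"
}

# Print the entry for an assumed install path; review it, then add it with "crontab -e".
sitemap_cron_line /var/www/wiki https://www.example.com
```

Printing the entry rather than installing it keeps the sketch side-effect free, so the line can be checked before it goes into a crontab.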

For generic instructions on using MediaWiki's maintenance scripts, see Manual:Maintenance scripts.

Options

--help

Displays the available options for generateSitemap.php.

--fspath=<path>

The file system path to save to, e.g. sitemap/. Note that this directory must be accessible from the web.

--identifier=<identifier>

The site identifier to use for the wiki. Defaults to $wgDBname.

--urlpath=<prefix>

The domain-relative URL that points to --fspath, e.g. /sitemap/
If specified, it is prefixed to the filenames in the sitemap index. This is needed because some search engines, such as Google, require absolute URLs in sitemaps. You should also specify --server; the two values will often be similar, but keeping them separate lets the script accommodate even unusual setups.
Before MediaWiki 1.32.0, this parameter had to contain the protocol and host name. Since MediaWiki 1.32.0, its value is appended to the value of --server, so it must contain neither the protocol nor the host name. This breaking change was not announced in the RELEASE-NOTES.

--server=<server>

The protocol and host name to use in URLs, e.g. https://en.wikipedia.org
This is sometimes necessary because server name detection may fail in command-line scripts, in which case the server appears only as "localhost" or "my.servername" in the XML files.
If $wgCanonicalServer is set, its value overrides this option in all sitemap files except the sitemap index file.

--compress=[yes|no]

Whether or not to compress the sitemap files. The default setting is yes.

--skip-redirects

If this option is given, redirects are skipped and not listed in the sitemap. This is recommended, since Google can complain about redirects. By default, however, redirects are not skipped.
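
Because the meaning of --urlpath changed in MediaWiki 1.32.0, a wrapper script that has to support several versions must choose the right form. The following is a minimal sketch of that rule. The function name urlpath_for_version and its argument order are made up for illustration, and it assumes a plain major.minor[.patch] version string.

```shell
#!/bin/sh
# Hypothetical helper: print the value to pass as --urlpath.
#   $1 = MediaWiki version (e.g. 1.31.0 or 1.40)
#   $2 = protocol and host name, as passed to --server
#   $3 = domain-relative path, e.g. /sitemap/
urlpath_for_version() {
    version=$1
    server=$2
    path=$3
    major=${version%%.*}      # text before the first dot
    rest=${version#*.}        # text after the first dot
    minor=${rest%%.*}         # second version component
    if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 32 ]; }; then
        # 1.32.0 and later: --urlpath is appended to --server by the script
        printf '%s\n' "$path"
    else
        # before 1.32.0: --urlpath itself carries the protocol and host name
        printf '%s%s\n' "$server" "$path"
    fi
}

urlpath_for_version 1.31.0 https://www.example.com /sitemap/
urlpath_for_version 1.40.1 https://www.example.com /sitemap/
```

The first call prints the full URL https://www.example.com/sitemap/, and the second prints only /sitemap/.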

Example

To create a sitemap for a wiki, first create a directory in which to store the sitemap files (there will be one per namespace, plus an index file). Do this in $IP, i.e. the document root where your LocalSettings.php file is located:

mkdir sitemap

Then generate the sitemap:

MediaWiki version ≥ 1.40:
php maintenance/run.php generateSitemap --memory-limit=50M --fspath=/path/to/examplecom/sitemap/ --identifier=example.com --urlpath=/sitemap/ --server=https://www.example.com --compress=yes --skip-redirects

MediaWiki versions 1.32 – 1.39:
php maintenance/generateSitemap.php --memory-limit=50M --fspath=/path/to/examplecom/sitemap/ --identifier=example.com --urlpath=/sitemap/ --server=https://www.example.com --compress=yes --skip-redirects

MediaWiki version ≤ 1.31:
php maintenance/generateSitemap.php --memory-limit=50M --fspath=/path/to/examplecom/sitemap/ --identifier=example.com --urlpath=https://www.example.com/sitemap --server=https://www.example.com --compress=yes --skip-redirects
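
The entry point differs by version: maintenance/run.php from 1.40 onward, and maintenance/generateSitemap.php before that. A script that runs on several installations could select it as sketched below. The function name sitemap_entry_point is hypothetical, and the sketch only prints the command instead of executing it.

```shell
#!/bin/sh
# Hypothetical helper: print the script arguments that follow "php" for a
# given MediaWiki version (major.minor[.patch]).
sitemap_entry_point() {
    version=$1
    major=${version%%.*}      # text before the first dot
    rest=${version#*.}        # text after the first dot
    minor=${rest%%.*}         # second version component
    if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 40 ]; }; then
        printf '%s\n' 'maintenance/run.php generateSitemap'
    else
        printf '%s\n' 'maintenance/generateSitemap.php'
    fi
}

# Dry run: show the command for an assumed 1.39.5 installation.
echo "php $(sitemap_entry_point 1.39.5) --fspath=/path/to/examplecom/sitemap/ --urlpath=/sitemap/ --server=https://www.example.com"
```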

This will create a sitemap index stored at /path/to/examplecom/sitemap/sitemap-index-example.com.xml, which points to a compressed XML file for each namespace, e.g. /path/to/examplecom/sitemap/sitemap-example.com-NS_0-0.xml.gz for the main namespace.

This does not mean your sitemap can now be found automatically! You still need to submit the URL of the sitemap index to each search engine (e.g. Yandex or Google), i.e. https://www.example.com/sitemap/sitemap-index-example.com.xml.

Alternatively, you can make it findable by any crawler by linking to the sitemap index from your site root directory. For example, run the following from the site root:

ln -s sitemap/sitemap-index-example.com.xml sitemap.xml

Linking a top-level sitemap.xml also works if you choose to run the Wayback Machine sitemap submitter on your own site.
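
To avoid typing the index location by hand when submitting it or linking to it, both the file path and the public URL can be derived from the option values. The sketch below assumes the index file is named sitemap-index-<identifier>.xml and that --urlpath is domain-relative (MediaWiki 1.32 or later). The function names sitemap_index_file and sitemap_index_url are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, assuming the index is named sitemap-index-<identifier>.xml.

# Print the file system path of the sitemap index.
#   $1 = --fspath value, $2 = --identifier value
sitemap_index_file() {
    printf '%s/sitemap-index-%s.xml\n' "${1%/}" "$2"    # ${1%/} drops one trailing slash
}

# Print the public URL of the sitemap index.
#   $1 = --server value, $2 = --urlpath value (domain-relative), $3 = --identifier value
sitemap_index_url() {
    urlpath=${2%/}
    printf '%s%s/sitemap-index-%s.xml\n' "${1%/}" "$urlpath" "$3"
}

sitemap_index_file /path/to/examplecom/sitemap/ example.com
sitemap_index_url https://www.example.com /sitemap/ example.com
```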

Non-Latin domain names must be given in Punycode.

Related configuration parameters

  • $wgSitemapNamespaces (MediaWiki ≥ 1.13): Array of namespaces to generate a Google sitemap for, or false if one is to be generated for all namespaces. The default setting is false.
  • $wgSitemapNamespacesPriorities (MediaWiki ≥ 1.19): Custom namespace priorities for sitemaps. This should be a map of namespace IDs to priority. The default setting is false.