Manual:Grabbers

This page describes a series of grabber scripts designed to get a wiki's content without direct database access. If you don't have a database dump or access to the database and you need to move or back up a wiki, or if you want to move a wiki to another database system, the MediaWiki API provides access to most of what you need. These scripts require MediaWiki 1.39+ since Gerrit change 924121.

Appropriate access on the target wiki is required to get private or deleted data, but most scripts will just work without it. This document was originally compiled, and the scripts assembled, in order to move Uncyclopedia; because the overall goal was just to get the damn thing moved, 'pretty' was not exactly in our vocabulary when we set this up, so some of the scripts are still kind of a mess. However, many of them have since been revised and made more robust, and were used successfully to move several wikis from Wikia to a new host.

These scripts work by replicating the database with the same public identifiers (revision ID, log ID, article ID), so most of them must be run against a clean, empty database (with just the table structure) or a database that already has the same IDs as the remote wiki being replicated.

Stuff to get

If you're moving an entire wiki, these are probably what you need to get. More information on the tables can be found on Manual:Database layout; secondary tables not listed here can be rebuilt from these. Otherwise you probably know what you want.

  • Revisions: text, revision, page, page_restrictions, protected_titles, archive (most hosts will provide an xml dump of at least text, revision, and page tables, which add up to the bulk of a wiki)
  • Logs: logging
  • Interwiki: interwiki
  • Files (including deleted) and file data: image, oldimage, filearchive
  • Users: user, user_groups, user_properties, watchlist, ipblocks
  • Other stuff probably (this is just core; extensions often add other tables and stuff).

Scripts

  • PHP files should be in the code repository (GitHub).
  • Python files have been added to the repository too.
  • No Ruby is involved. So far.

PHP scripts

These are maintenance scripts that write their grab straight into the wiki's database. To "install" them, you need a working MediaWiki installation to run them against; see the notes below.

Notes:

  • They need a working LocalSettings.php with database credentials and a working MediaWiki database, so be sure you've set up the wiki first
    • You can create it quickly by running php maintenance/install.php --server="http://dummy/" --dbname=grabber --dbserver="localhost" --installdbuser=root --installdbpass=rootpassword --lang=en --pass=aaaaa --dbuser=grabber --dbpass=grabber --scriptpath=/ GrabberWiki Admin
    • Some configuration variables in LocalSettings.php that those scripts support: $wgDBtype, $wgCompressRevisions, External storage.
  • If you're importing all the contents with grabText.php, be sure to remove all rows from the page, revision, revision_actor_temp, revision_comment_temp, ip_changes, slots, content and text tables prior to running the script (see the sketch after these notes).
    • You should also remove the rows from the user and actor tables, to prevent a clash with users that have the same ID or name, such as the first user created when the wiki is installed.
  • If you're importing all the logs with grabLogs.php, be sure to remove all rows from the logging and log_search tables prior to running the script.
  • If you need to log in to the target wiki (which is sometimes required, e.g. when grabbing deleted text, or desirable due to higher API limits), on recent versions of MediaWiki you need to set up a bot password on the external wiki.
These scripts require sufficient privileges to create files in the directory you run them from: curl attempts to write a cookie file in the current directory, and failing to do so results in a failed login attempt without any further explanation of what went wrong.
If you log in to the target wiki, the cookie file that curl creates isn't deleted when the script finishes. The cookie contains session IDs that, if leaked, can be used to log in as the same user and access all of its privileges. Because the cookie persists, you don't need to provide credentials again after a successful login, but this may cause issues if you target several wikis. In case of login problems, remove the cookie file, called cookies.tmp. Remember to remove the cookie file when you're done, or at least ensure it's not accessible from the web.
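For the table clearing mentioned in the notes above, a minimal sketch assuming a MySQL/MariaDB backend, the database name grabber from the install command above, and no table prefix (adjust names to your setup):

# Before grabText.php: empty the content, user and actor tables.
mysql grabber -e 'TRUNCATE page; TRUNCATE revision; TRUNCATE revision_actor_temp; TRUNCATE revision_comment_temp; TRUNCATE ip_changes; TRUNCATE slots; TRUNCATE content; TRUNCATE text; TRUNCATE user; TRUNCATE actor;'
# Before grabLogs.php: empty the log tables.
mysql grabber -e 'TRUNCATE logging; TRUNCATE log_search;'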
For each script below: the script name, what it grabs, and notes.
grabText.php: Page content (live).
  • Gets all revisions from all pages, from all or selected namespaces. It supports resuming operation starting from a given page.
  • Has support for External Storage, compression and ContentHandler.
  • It uses the original page, revision and user ID.
  • In June 2017 it took 6h15m to grab all revisions from the main namespace of a wiki (25,800 pages and 1,083,000 revisions, with an average page length of 4KB, an average revision length of 17KB, and 18GB of uncompressed text), using a database mounted in a ramdisk, with text compression and external storage enabled on a second database on disk, on a Linode with 1GB of RAM and 512MB of swap, logging in with a bot account (which has higher rate limits to query data).
  • It caches the internal IDs of the text table for inserts, so avoid running this script in parallel (for example, with different namespaces) or you'll get primary key errors.
grabNewText.php: New content, for filling in edits made after a dump was created and imported.
  • This script must be used only after filling in the secondary tables from the initial dump import, or after using grabText.php.
  • Supports filtering changes from specific namespaces. Due to how it caches the latest revision ID, it may erroneously skip changes if you omit namespaces that have already been imported.
    • To prevent this, always use grabText.php with an --enddate parameter consistent between all runs of grabText.php and grabNewText.php (use the current timestamp in UTC). And run grabNewText.php with --startdate equal to the last --enddate. Once you import a namespace with grabText.php, it must be specified on all later runs of grabNewText.php. Example:
# Grab only main namespace
grabText.php --namespaces=0 --enddate=t1
# Update main namespace after some days (this can be repeated updating t1 and t2 accordingly)
grabNewText.php --namespaces=0 --startdate=t1 --enddate=t2
# Grab a new namespace, use end date from previous grabNewText.php
grabText.php --namespaces="14|6" --enddate=t2
# Now grabNewText.php must specify all previously imported namespaces, starting from the previous end date
grabNewText.php --namespaces="0|14|6" --startdate=t2 --enddate=t3
grabDeletedText.php: Deleted content.
  • Gets all deleted revisions, from all or selected namespaces. It supports resuming operation starting from a given page, or simply skips duplicate archive entries.
  • Has support for External Storage, compression and ContentHandler.
grabNamespaceInfo.php: Namespaces (no database tables affected).
  • Prints out a list to add to LocalSettings.php, because namespace information is not stored in the database.
grabLogs.php: Stuff that shows up on Special:Log.
  • Ability to filter by a list of log types.
  • Grabs logs from oldest to newest; can be used to resume operation if it fails, or to update a live site.
grabInterwikiMap.php: Supported interwiki links; these show up on Special:Interwiki if Extension:Interwiki is installed.
  • Can either import all interwikis or just the interlanguage links, though getting all the interwikis is generally recommended to maintain compatibility.
grabFiles.php: Files and file info, including old versions (descriptions are page content).
  • Use this for a full dump - it imports files directly (such that log entries and file descriptions from other scripts are used), and includes old revisions.
  • You should run this script as the user that normally runs php on the webserver (for example, sudo -u wwwrun php grabFiles.php ...), or fix file owner afterwards.
  • It does multiple retries in case a file fails to download.
  • It compares the SHA-1 hash of the downloaded file against the hash provided by the remote wiki's API to detect corrupted files, and retries in case of mismatch.
  • It uses the internal MediaWiki classes configured in $wgLocalFileRepo to store the files. It can theoretically support complex storage engines.
  • It has special optimizations for Wikia. Use the --wikia parameter in that case.
  • After importing, if the original wiki is an old one (Wikia, for example) you should run refreshImageMetadata.php, otherwise you may have display issues with some animated GIFs (when they are compressed as a composite animation of smaller frames); see the sketch after this entry's notes. Even if the wiki is not old, running it won't hurt anyway. See also task T173360.
  • If you're importing from Wikia, as of August 2018 their allimages generator is buggy and doesn't return all files (it skips some), so you may end up with missing files. You can run this query to find missing files that still have a file description page (the "Videos" category contains videos uploaded by Wikia, which the script skips; its name may differ in other languages, such as "Vídeos" in Spanish):
    select page_title from page
    where page_namespace = 6 and page_is_redirect = 0
      and not exists (select * from image where img_name = page_title)
      and not exists (select * from categorylinks where cl_from = page_id and cl_to = 'Videos')
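A minimal sketch of the metadata refresh mentioned above, run from the wiki's maintenance directory (--force makes the script re-examine files even if the stored metadata looks valid):

# Regenerate image metadata after importing files from an old wiki.
php refreshImageMetadata.php --force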
grabNewFiles.php: Files and file info, to update a site that has already used grabFiles.php.
  • Use this to update a live site, or when grabFiles.php (or a previous run of this script) has already been used and you want to get more recent uploads.
  • It handles new uploads, but also file renames, deletions and restores, following the logs of the wiki, so it's important to use the correct timestamp to start from.
  • All features from grabFiles.php apply here as well.
grabImages.php: Current file versions, without database info (no database tables affected).
  • If you only want to download the files from somewhere and don't care about the descriptions or old revisions, use this to download them without touching the database (and then use the importImages maintenance script that comes with core to import them into the wiki); see the sketch below.
    Otherwise use grabFiles.php, as that imports files directly as well as downloading them.
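A minimal sketch of that second step, assuming the downloaded files ended up in a local ./images directory (the actual download location depends on how you ran grabImages.php):

# Import previously downloaded files into the wiki; run from the wiki root.
php maintenance/importImages.php ./images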
grabDeletedFiles.php: Deleted files and file info.
  • Works by pointing at a known (assumes default) deleted file hashing configuration on the target wiki. If you don't know it or the files are otherwise secured, you can try to use the built-in screenscraper instead (required due to a lack of API support for actually downloading deleted files, at least as of when this was written/on many target wikis), but it's kinda stupid and may or may not actually work either.
grabUserBlocks.php: User blocks.
  • Grabs all blocks from older to newer.
grabUserGroups.php: Groups users belong to.
  • Assumes user IDs are/will be the same on the source and target wiki; not much can be done about this, because it generally runs before the accounts are actually created.
  • It can be filtered to fetch only a defined list of user groups.
  • To use this script on Wikia, use --wikia to filter out most global user groups.
grabProtectedTitles.php: Pages protected from creation.
  • The rest of the page protection information, which isn't covered by grabText.php (since it doesn't apply to existing pages).
grabAbuseFilter.php: Abuse filters.
  • Grabs current filters from Extension:AbuseFilter. Only grabs the current version of public filters; no version history and no abuse logs, because there's no API available for them.
populateUserTable.php: Users.
  • It doesn't grab anything from the remote wiki, but populates the user table using information from the revision, logging, image, oldimage, filearchive and archive tables.
  • It only creates stub users that can't log in to the wiki; they exist so the database is consistent and Special:Log doesn't display strange user names like 127.0.0.1.
  • A special extension, StubUserWikiAuth, can be used to let those users log in using the information from the remote wiki.

Python scripts

  • The tools/grabbers/ python scripts will currently populate the ipblocks, user_groups, page_restrictions, and protected_titles tables.

It's recommended that you use Python 2.7.2+. You will need to install oursql and requests.
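For example, assuming pip is available for your Python 2 interpreter:

pip install oursql requests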

You need to edit settings.py and set the site you want to import from, and your database information.

The easiest way to run everything is just $ python python_all.py, which executes all four individual scripts. You can also run each script individually if you prefer (for example, to run them concurrently).

Note: Autoblocks will not be imported, since we do not have data about which IP address is actually being blocked.

More options

  • Wiki-Export - another set of scripts to grab pages and files, in python

Extension:MigrateUserAccount

This extension adds a new special page, Special:MigrateUserAccount, where users can reclaim their account. The user enters their username and, if that user has an empty password field in the database, the extension provides a generated token for the user to add to their user page on the external wiki (either in the page content or in the edit summary). The user then presses a button to confirm they've done this, and the extension makes an API call to the external wiki to verify it. If the user edited the page with the token successfully, they are prompted to set new credentials on the current wiki.
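As a conceptual illustration of that verification step (not the extension's actual code), the check can be expressed as a query against the external wiki's action API; the wiki URL, username and TOKEN below are placeholders:

# Fetch the latest revision (edit summary and content) of the user's page on the
# external wiki and look for the token in it.
curl -s 'https://external.example.com/api.php?action=query&prop=revisions&titles=User:Example&rvprop=comment|content&rvslots=main&rvlimit=1&format=json' | grep -q 'TOKEN' && echo 'token found'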

This extension simply lets people reclaim their account, and thus any edits and contributions assigned to them. It does not import preferences, the watchlist, or other information from the external wiki.

This is more reliable than Extension:StubUserWikiAuth below since it doesn't require the user to enter Bot Password credentials from the remote wiki, which may eventually break on external farms like Fandom.

Download source code: https://github.com/weirdgloop/mediawiki-extensions-MigrateUserAccount

Extension:StubUserWikiAuth

If you used the grabbers to import all revisions and files, the tables will already be populated with the same user IDs as the original wiki. Use this extension to populate the user table with "stub" rows that contain no password information, then configure the extension as a PasswordAuthenticationProvider as described in the extension's manual. If a user attempts to log in and that user is still a stub in the user table, the extension performs the login against the remote wiki instead. If the login is successful, it creates a new hash of the password and stores it in the database; any further logins are made locally only. On Fandom, this currently works only with a remote Bot Password account, due to Fandom's custom login system.

This is recommended over Extension:MediaWikiAuth, since all user IDs would be already correct.

Download source code: https://github.com/ciencia/mediawiki-extensions-StubUserWikiAuth

Extension:MediaWikiAuth

Imports user accounts on login. Note that this requires the site you are copying from to still be active to use their authentication.

Affects the user, user_properties and watchlist tables.

  • Uses screenscraping as well as the API due to incomplete functionality.
  • Updates user IDs in most other tables to match the imported ID, though apparently not the userid in log_params for user-creation entries.

Other stuff

Not grabbers, but things to potentially worry about.

  • Configuration stuff - groups, namespaces, etc
  • Extensions
  • Extension stuff - ajaxpoll, checkuser, socialprofile, and others have their own tables and stuff
  • Secondary tables - the above grabber scripts generally just set the primary tables; secondary tables such as category, redirect, site_stats, etc can be rebuilt using other maintenance scripts included with MediaWiki, usually rebuildall.php.
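For example, a minimal pass with core maintenance scripts, run from the wiki's maintenance directory (exact script availability varies by MediaWiki version):

# Rebuild links tables, the search index and recent changes.
php rebuildall.php
# Recompute the site_stats table.
php initSiteStats.php --update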

See also