Manual:Restoring wiki code from cached HTML

From Linux Web Expert

If your wiki backups have failed, as ours did, a server failure may leave you with no option but to recreate your lost content from cached copies of pages from your site.

Where to get cached HTML for your site

  • The first place to look for cached HTML pages from your lost wiki is your browser's page cache. Open about:cache in Google Chrome/Chromium or Firefox to view the cached pages, but switch to 'Work Offline' mode first so that new pages from your server don't overwrite the cache.
  • Search engines keep caches of pages from at least the more popular websites: try Google, Bing and Yahoo.
  • The Web Archive (www.archive.org) may also have some of your pages, if you're lucky.
  • You may find there are other caches available to you if you are inside a large company or university that maintains a caching proxy.

On Google, searching for site:mywiki.example.com will list most of the cached pages for your site, but you can sometimes reach more pages by searching for specific page titles. This is a slow, manual process: save as much cached content as you can, as quickly as possible after the disaster occurs. Once you restore your wiki, the caches will start refreshing from your new server, and any content you have not yet saved may be lost.
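For the Web Archive, part of this lookup can be scripted. The sketch below assumes the Wayback Machine's public availability endpoint at https://archive.org/wayback/available and the JSON layout it is documented to return; the function names are hypothetical, and mywiki.example.com stands in for your own site. It only finds the closest snapshot URL for a page. You still have to download and save the snapshot yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine availability service (returns JSON).
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL asking for the snapshot closest to `timestamp`."""
    params = {"url": page_url}
    if timestamp:
        # YYYYMMDD, for example a date shortly before the server failure.
        params["timestamp"] = timestamp
    return WAYBACK_API + "?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the URL of the closest archived snapshot, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_snapshot(page_url, timestamp=None):
    """Query the service over the network and return a snapshot URL or None."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Calling find_snapshot("mywiki.example.com/Main_Page", "20100101") would ask for the snapshot nearest to 1 January 2010. Run it once per known page title, and pause between requests so you do not hammer the service.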

Using HTML to reconstruct your wiki

If you've managed to retrieve most of your wiki content, you can then convert it back to wiki markup with a set of scripts. Some code from 2010 that is useful for this purpose is available at: http://code.ascend4.org/ascend/trunk/tools/mediawiki/html2mediawiki/

The above code does the basic job of reconstructing headings, lists, tables, links, math, and source code listings. It also correctly handles category tags, and some specific templates. The core parts of this code use BeautifulSoup and Python's regular expressions module to search for recognized patterns.
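To give a feel for what such a conversion involves, here is a minimal sketch. It is not the ASCEND code: it swaps BeautifulSoup for Python's standard html.parser module so that it runs without extra packages, and it covers only headings, lists, links, and bold/italic text. The class and function names are hypothetical, and the rule that an internal link's target is the last path segment of its href is an assumption that will not hold for every wiki.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class WikiTextBuilder(HTMLParser):
    """Convert a small subset of rendered HTML back to wikitext."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of wikitext, joined at the end
        self.list_stack = []  # "*" for <ul>, "#" for <ol>, innermost last
        self.href = None      # target of the currently open <a>, if any
        self.link_text = []   # text collected inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.out.append("\n" + "=" * int(tag[1]) + " ")
        elif tag in ("ul", "ol"):
            self.list_stack.append("*" if tag == "ul" else "#")
        elif tag == "li":
            self.out.append("\n" + "".join(self.list_stack) + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.link_text = []
        elif tag in ("b", "strong"):
            self.out.append("'''")
        elif tag in ("i", "em"):
            self.out.append("''")
        elif tag == "p":
            self.out.append("\n\n")

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self.out.append(" " + "=" * int(tag[1]) + "\n")
        elif tag in ("ul", "ol") and self.list_stack:
            self.list_stack.pop()
            if not self.list_stack:
                self.out.append("\n")  # blank line after the outermost list
        elif tag == "a" and self.href is not None:
            text = "".join(self.link_text)
            if self.href.startswith(("http://", "https://")):
                self.out.append("[%s %s]" % (self.href, text))
            else:
                # Assumption: the page name is the last path segment.
                target = unquote(self.href.rsplit("/", 1)[-1]).replace("_", " ")
                if target == text:
                    self.out.append("[[%s]]" % text)
                else:
                    self.out.append("[[%s|%s]]" % (target, text))
            self.href = None
        elif tag in ("b", "strong"):
            self.out.append("'''")
        elif tag in ("i", "em"):
            self.out.append("''")

    def handle_data(self, data):
        if not data.strip() and "\n" in data:
            return  # line breaks between block-level tags, not content
        text = re.sub(r"\s+", " ", data)
        if self.href is not None:
            self.link_text.append(text)
        else:
            self.out.append(text)


def html_to_wikitext(document):
    """Run the builder over an HTML fragment and tidy the blank lines."""
    builder = WikiTextBuilder()
    builder.feed(document)
    builder.close()
    return re.sub(r"\n{3,}", "\n\n", "".join(builder.out)).strip()
```

Tables, math, source listings, category tags, and templates are deliberately left out here. Real MediaWiki output also wraps headings in extra spans (for example section edit links) that a production script would need to strip first.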

Every MediaWiki instance is different, though: different installed extensions and different templates mean that the above scripts will probably need careful editing before you use them to process your particular site. The code probably also contains some hard-wired references to the ASCEND wiki, which you will need to find and change.
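One way to keep such site-specific details manageable is to collect them in a single place. The sketch below is an illustration only, with hypothetical names (SiteConfig, page_title, page_name_from_href). It assumes the common MediaWiki conventions that every page's <title> ends with " - " plus the site name, and that internal links share a fixed path prefix. Check both against your own cached pages before relying on them.

```python
import re
from dataclasses import dataclass
from html import unescape
from urllib.parse import unquote


@dataclass
class SiteConfig:
    """Values that differ from wiki to wiki (all names here are hypothetical)."""
    title_suffix: str  # text the wiki appends to every <title>, e.g. " - MyWiki"
    article_path: str  # URL prefix in front of page names, e.g. "/index.php/"


def page_title(document, cfg):
    """Recover the wiki page name from the <title> element of a saved page."""
    match = re.search(r"<title>(.*?)</title>", document, re.S | re.I)
    if not match:
        return None
    title = unescape(match.group(1)).strip()
    if cfg.title_suffix and title.endswith(cfg.title_suffix):
        title = title[: -len(cfg.title_suffix)]
    return title.strip()


def page_name_from_href(href, cfg):
    """Map an internal link to a page name, or None for non-article URLs."""
    if not href.startswith(cfg.article_path):
        return None
    return unquote(href[len(cfg.article_path):]).replace("_", " ")
```

With cfg = SiteConfig(title_suffix=" - MyWiki", article_path="/index.php/"), a saved page titled "Main Page - MyWiki" maps back to the page name "Main Page". Changing those two values is then all it takes to point the same functions at a different wiki.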

Other HTML2wiki scripts have been published, but these have a slightly different aim: to translate HTML snippets for inclusion in a wiki, rather than to reconstruct a wiki from its rendered HTML.