Manual:PhpWiki conversion

From Linux Web Expert

This article describes the steps of a process that was developed for converting PhpWiki pages into MediaWiki pages. (The process was devised by m:User:KeithTyler for use in a workplace intranet.)

See "Converting content from a PHPWiki" for some other methods of converting.

The process entails:

  • Exporting
  • Control script
  • Scrubbing
  • Sed script
  • Database insert


Exporting

The export was performed via the "ZIP Snapshot" administrator function of PhpWiki. The function sends a ZIP file of all pages, in RFC822 format, to the browser for download.

The resulting ZIP file was unzipped into a source directory.

For past revisions

ZIP Snapshot only includes the current versions of files. If historic versions of pages are required, then the ZIP Dump should be used.

This process does not address past revisions. A process to migrate all past revisions of PhpWiki pages will require a separate (or smarter) database insertion process. In MediaWiki, current revisions of pages are stored in the cur table, while previous revisions are stored in the old table.

Control script

The control script consists of a for loop that calls the scrubbing and sed script on each file, and sends the output to a target results directory.

 for file in *
 do
   tail -n +14 "$file" | sed -f ../phpwikiconvert > "converted/$file"
 done
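
As written, the loop also picks up the converted directory itself if it lives inside the source directory. A minimal sketch of a more defensive variant follows. The function name convert_all and its three arguments are hypothetical, not part of the original process, and the fixed 14-line offset is kept as-is.

```shell
# convert_all SRC_DIR SED_SCRIPT OUT_DIR
# Hypothetical wrapper around the loop above: creates the output
# directory and skips anything that is not a regular file.
convert_all() {
  src=$1
  script=$2
  out=$3
  mkdir -p "$out"
  for file in "$src"/*; do
    [ -f "$file" ] || continue            # skip subdirectories
    tail -n +14 "$file" | sed -f "$script" > "$out/$(basename "$file")"
  done
}
```

It could be called as: convert_all . ../phpwikiconvert converted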

Scrubbing

The "tail -n +14" removes the RFC822 header information from the result file.

These headers contain PhpWiki metadata for the page, such as:

Date: Mon, 18 Oct 2004 16:31:28 -0700
Mime-Version: 1.0 (Produced by PhpWiki 1.3.5pre)
Content-Type: application/x-phpwiki;
  pagename=10-digit%20dialing;
  flags="";
  author=KeithTyler;
  version=2;
  lastmodified=1098142288;
  author_id=192.168.32.112;
  markup=2;
  charset=iso-8859-1
Content-Transfer-Encoding: binary

If this metadata needs to be maintained in the migration, a more complex conversion method will be required that retains this data and uses it in the database insertion step of the process.
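
As a starting point for retaining metadata, a single field can be pulled out of the header before it is discarded. This is only a sketch: header_field is a hypothetical helper name, and it assumes the header layout shown above (one name=value pair per indented line, terminated by a semicolon).

```shell
# header_field NAME
# Reads a raw PhpWiki export on stdin and prints the value of NAME=...
# from the RFC822 header. Hypothetical helper, not part of the original process.
header_field() {
  awk -v name="$1" '
    { sub(/\r$/, "") }                 # tolerate CRLF line endings
    /^[ \t]*$/ { exit }                # the first blank line ends the header
    {
      line = $0
      sub(/^[ \t]+/, "", line)         # continuation lines are indented
      if (index(line, name "=") == 1) {
        value = substr(line, length(name) + 2)
        sub(/;$/, "", value)           # drop the trailing separator
        gsub(/^"|"$/, "", value)       # drop surrounding quotes
        print value
        exit
      }
    }
  '
}
```

For example, "header_field lastmodified < somefile" would print the Unix timestamp, which could then be carried into the insertion step.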

Note that using tail is inelegant, but seems to work. A more robust method would extract all lines before the first blank line.

In my own experience, the number of header lines can vary, so it is essential to delete all lines up to and including the first blank line. This is something awk can do easily: replace "tail -n +14" with "awk 'i==1;/^$/{i=1}'". NOTE: Later versions of PhpWiki may insert a ^M (carriage return) at the end of each line, including blank lines, so you may want to try adding \x0d to the awk script.
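
The blank-line approach, with carriage returns tolerated, might look like the sketch below. The name strip_header is hypothetical. Note that it also strips the trailing ^M from body lines, which is an assumption about what you want in the converted output.

```shell
# strip_header: print everything after the first blank line of stdin.
# Hypothetical replacement for "tail -n +14" in the control script.
strip_header() {
  awk '
    { sub(/\r$/, "") }          # drop a trailing carriage return, if any
    body { print; next }        # past the header: pass lines through
    /^[ \t]*$/ { body = 1 }     # the first blank line ends the header
  '
}
```

In the control script this would be used as: strip_header < "$file" | sed -f ../phpwikiconvert > "converted/$file"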

"perl -ne 'print if ($i==1); $i=1 if /^\s*$/' $file" worked at least once.

Sed script

The following sed script covers, in simple fashion:

  • Typeface markup
  • Header markup
  • WikiLink markup
  • Table markup
  • Redirect markup

For the database insertion process, we also need to escape all double-quotes.

Note that this does not perform any implicit conversion or linking of CamelCase links. Note also that any implicit conversion of CamelCase links (e.g. to insert a space) will also have to be taken into account in the database insertion step, or in the output step of the hook script.

# typeset markup
s/_\([^_]*\)_/''\1''/g  # italic -- OK
s/\*\([^\*]*\)\*/'''\1'''/g  # boldface -- OK
s!=\([^=]*\)=!<code>\1</code>!g  # fixed-width -- OK

# header markup -- OK (anchored to line start so a "!" in running text is left alone)
s/^!!!\(.*\)$/==\1==/
s/^!!\(.*\)$/===\1===/
s/^!\(.*\)$/====\1====/

# table markup (hopefully)
s!\([^|][^|]*\)|!\1||!g
s!^|!|-\n|!g  # convert row start -- OK
s!.*plugin OldStyleTable.*!\{\|!  # convert table start -- mostly OK
s!^?>$!\|\}!  # convert table end -- mostly OK

# link markup
s!\[\(.*\)|\(http.*\)]![\2 \1]!g  # url format -- OK
s!\[\(.*\)|\(.*\)\]![\2|\1]!g  # switch display and link text -- OK
s!\[\([^]]*\)\]![[\1]]!g  # double bracketize -- OK
s!\[\[\(http.*\)\]\]![\1]!g  # undo double-bracketing urls by above -- OK

# redirects
s!<?plugin RedirectTo page=\(.*\)?>!#REDIRECT [[\1]]!

# quotes
s!"!\\"!g

There are almost certainly issues that this script does not take care of. For example, it certainly will not take care of any PhpWikiPlugins (besides OldStyleTable). Of course, there are plenty of plugins that do not have apparent analogues in MediaWiki (like CalendarPlugin), so you will always need to be prepared for more human-oriented conversion and de-functionalisation work. If you've become wholly dependent on PhpWikiPlugins, you're quite likely in for a migration headache.

p2m 0.4.1 Beta - PHPWiki to MediaWiki Converter

p2m is a text converter for Wiki content which has been formatted with PHPWiki syntax. It translates the most important Wiki tags from PHPWiki to MediaWiki syntax. I wrote p2m because I had to migrate a PHPWiki to a MediaWiki and did not find a suitable converter.

PHPWiki2MediaWiki is a command line tool written in C#. It does not have a neat GUI, nor is it intended as a full-blown converter. With p2m I was able to convert about 80% of my Wiki content, which is not too bad, I guess. It does a good job, but I have to mention that it is absolutely necessary to verify the converted content, since p2m does not convert everything automatically!

p2m is freeware and available for Windows and Linux. Read more about it here.

DlStyleTables

The above sed does not address DlStyleTables. Which is good, because DlStyleTables are a lame excuse for not having more integrated table support in PhpWiki. If you never saw the point of DlStyleTables, good. But if for some reason you eventually decided to find a use for them, you'll have to convert them yourself.

If you don't convert DlStyleTables, they will look like this:

Term 1|

Definition 1

Term 2|

Definition 2

It will be challenging to find an automated process to convert these. Try perl; sed alone can't do it very easily. You will need to:

  • Convert Term lines to boldface cells
  • Convert Definition lines to cells
  • Place row dividers between each pair
  • Add table start and end markup to blocks of rows.
  • Avoid having DlStyleTables treated as OldStyleTable markup. This means your process will need to recognize the multi-line nature of DlStyleTables and recognize the line sequence as such, and not as part of an OldStyleTable. PhpWiki could do this because OldStyleTables were inside <?plugin ?> blocks, and DlStyleTables were regular markup.
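
The steps above can be sketched in awk rather than perl. This is a rough pre-pass under stated assumptions, not a complete converter: a term is taken to be any line ending in a single "|" with no other pipes, its definition is the next non-blank line, and anything inside a <?plugin ... ?> block is passed through untouched. The name dl_to_table is hypothetical.

```shell
# dl_to_table: convert DlStyleTable term/definition pairs on stdin
# into MediaWiki table markup. Hypothetical helper; run it before the sed script.
dl_to_table() {
  awk -v q="'" '
    function close_table(   i) {
      if (in_table) { print "|}"; in_table = 0 }
      for (i = 0; i < held; i++) print ""    # re-emit blank lines held back
      held = 0
    }
    { sub(/\r$/, "") }
    /<\?plugin/ { in_plugin = 1 }
    in_plugin {                              # leave plugin blocks alone
      close_table(); print
      if ($0 ~ /\?>/) in_plugin = 0
      next
    }
    want_def {                               # next non-blank line is the definition
      if ($0 ~ /^[ \t]*$/) next
      sub(/^[ \t]+/, ""); print "| " $0
      want_def = 0; next
    }
    /^[^|]+\|[ \t]*$/ {                      # term line: text followed by one pipe
      term = $0; sub(/[ \t]*\|[ \t]*$/, "", term)
      if (!in_table) { print "{|"; in_table = 1 }
      held = 0
      print "|-"; print "| " q q q term q q q
      want_def = 1; next
    }
    in_table && /^[ \t]*$/ { held++; next }  # blank line between rows
    { close_table(); print }
    END { close_table() }
  '
}
```

Because the output rows start with "|", check how they interact with the table rules in the sed script before running both over the same file.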

File renaming

It was deemed desirable to rename the files to remove escaped characters (specifically, space (%20) to underscore, and slash (%2F)).

A command line for loop performs this for spaces:

/phpwiki/converted% for file in *; do mv "$file" `echo "$file"|sed 's!%20!_!g'`; done 2>/dev/null

For slashes, we need to perform this conversion during the database insert step. Most systems will not allow slashes in filenames as slashes are directory name delimiters.
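
The one-liner above hides the errors from mv when a name needs no change. A sketch that avoids them by comparing names first is shown below. The function names clean_name and rename_converted are hypothetical, and %2F is deliberately left alone for the insert step.

```shell
# clean_name NAME: print NAME with every %20 replaced by an underscore.
clean_name() {
  printf '%s\n' "$1" | sed 's!%20!_!g'
}

# rename_converted: rename every regular file in the current directory
# whose name contains %20. Files that need no change are skipped.
rename_converted() {
  for file in *; do
    [ -f "$file" ] || continue
    new=$(clean_name "$file")
    [ "$new" = "$file" ] || mv -- "$file" "$new"
  done
}
```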

Database insert

The most efficient means determined for inserting a collection of pages into the MediaWiki DB was to use SQL INSERT statements (as part of a for loop using mysql client). The minimum set of fields needed to insert are:

  • cur_title
  • cur_text
  • cur_user_text
  • cur_timestamp

If a special user is to be credited with the migration, that user should be created first, and cur_user must then be added to the insert. In the example below, cur_user is left at the column default of 0. cur_user_text is required, however, as the page history will display incorrectly without a name to display. Likewise for cur_timestamp.

We need to make sure of the following:

  • No spaces appear in the filenames
  • Escaped slashes (%2F) are converted to slashes
  • All titles begin with a capital letter

Otherwise, MediaWiki will never line up the requested name with the title in the database. The first issue has been solved by the for-loop above; the second and third will be addressed in the insertion loop.

The shell script to do the insertion should look something like this:

for file in *; do
title=`echo $file|sed 's!%2F!/!g'|perl -n -e "print ucfirst;"`
cat <<END | mysql -u(username) -p(password) (databasename)
insert into cur
(cur_title,cur_user_text,cur_timestamp,cur_text)
values("$title","PhpWikiMigration",now()+0,"`cat $file`");
END
done

We use a quick sed to convert %2F to a slash, and a quick perl to capitalize the filename. Then we use cat with a heredoc to throw in a SQL statement with our fixed title, and the contents of the associated file.
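
If you would rather inspect the SQL before running it, the same statement can be written to a file instead of piped straight into mysql. The sketch below assumes it is run inside the converted directory, that double quotes in the text were already escaped by the sed script, and that titles contain no quotes or backslashes. The function names page_title and cur_insert_sql are hypothetical.

```shell
# page_title FILENAME: turn %2F back into a slash and capitalize the first letter.
page_title() {
  printf '%s\n' "$1" | sed 's!%2F!/!g' |
    awk '{ print toupper(substr($0, 1, 1)) substr($0, 2) }'
}

# cur_insert_sql FILENAME: print the INSERT statement for one converted file.
cur_insert_sql() {
  title=$(page_title "$1")
  printf 'INSERT INTO cur (cur_title, cur_user_text, cur_timestamp, cur_text)\n'
  printf 'VALUES ("%s", "PhpWikiMigration", NOW()+0, "%s");\n' "$title" "$(cat "$1")"
}
```

For example, "for file in *; do cur_insert_sql "$file"; done > ../import.sql" collects every statement in one file that can be reviewed and then fed to the mysql client.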

The above script works for MediaWiki prior to 1.5. Below I have rewritten the script so it works for version 1.6.5 and hopefully for other versions with the new database layout.

for file in *; do
title=`echo $file|sed 's!%2F!/!g'|perl -n -e "print ucfirst;"`
cat <<END | mysql -u <username> -p<password> <databasename>
INSERT INTO page
(page_id, page_namespace, page_title, page_counter, page_restrictions, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len)
VALUES
(NULL,0, "$title", 0,'', 0, 1, RAND(), NOW()+0, 0, LENGTH("`cat $file`"));

INSERT INTO text (old_id, old_text, old_flags)
VALUES (NULL, "`cat $file`", "utf-8");

INSERT INTO revision
(rev_id, rev_page, rev_text_id, rev_comment, rev_minor_edit, rev_user, rev_user_text, rev_timestamp)
SELECT NULL, page_id, LAST_INSERT_ID(),"PhpWikiMigration", 0, 1 ,"Admin", NOW()+0 FROM page WHERE page_title = "$title";

UPDATE page,revision
SET page.page_latest = LAST_INSERT_ID()
WHERE page.page_id = revision.rev_page && revision.rev_id = LAST_INSERT_ID();

END
done

Caveats

Further caveats beyond those already stated above.

This process does not handle conversion of PhpWiki users' pages into MediaWiki User: namespace pages. It is possible by using a trick with the PhpWiki metadata: if the title name is the same as the author name, then put the page into namespace=2. Unfortunately, among other things, this assumes that the user was the last person to edit their user page. Depending on the size of your user base, you may opt to manually move these pages.
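
The metadata trick could be sketched as follows. It reads the raw export (before the header is stripped) and assumes the header layout shown in the Scrubbing section. The function name page_namespace is hypothetical, and the pagename is compared in its escaped form, so user names containing escaped characters would need extra handling.

```shell
# page_namespace FILE: print 2 if the header's pagename equals its author,
# otherwise 0. Hypothetical helper; works on the raw export file.
page_namespace() {
  awk '
    { sub(/\r$/, "") }
    /^[ \t]*$/ { exit }                        # stop at the end of the header
    /^[ \t]+pagename=/ { p = $0; sub(/^[ \t]+pagename=/, "", p); sub(/;$/, "", p) }
    /^[ \t]+author=/   { a = $0; sub(/^[ \t]+author=/, "", a);   sub(/;$/, "", a) }
    END {
      ns = (p != "" && p == a) ? 2 : 0         # 2 is the User: namespace
      print ns
    }
  ' "$1"
}
```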

Plugin markup besides OldStyleTable and RedirectTo will remain in the converted output. This will need to be changed by hand. Since PhpWikiPlugins are essentially open-ended modules, there is really no way for the process to know exactly what your plugin is supposed to do (in our case, some of the plugins had been hacked at to add or change features, and some new plugins were created).

Even if PhpWiki provided a dump method that implicitly called the plugins and spat out their output, this could cause a mess when it comes to RedirectToPlugin and probably other convertible plugins (we'd get HTML tables instead of being able to convert OldStyleTable markup to MediaWiki table wikicode, for example).

If you were to opt to pull out your PhpWiki content in processed HTML, you'd have to invent a messy process for converting your WikiLinks back into WikiLinks.

This process does not treat underscores within words differently than underscores surrounding words. You will get errant italicization. The sed script could probably be fixed to address this.
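
One possible fix, sketched here as a standalone filter rather than a drop-in line for the script: only treat an underscore pair as italics when it is not glued to a letter, digit, or another underscore on the outside. It uses extended regular expressions (sed -E), which the original script does not, and the name italicize is hypothetical.

```shell
# italicize: convert _text_ to ''text'' only when the underscores sit at
# word boundaries, so names like some_file_name are left alone.
# The loop (label + t) re-scans the line so adjacent pairs are all converted.
italicize() {
  sed -E -e ':again' \
    -e "s/(^|[^[:alnum:]_])_([^_]+)_([^[:alnum:]_]|\$)/\1''\2''\3/" \
    -e 't again'
}
```

If used, it would take the place of the italic rule in the sed script, for example by running it as an extra stage in the control script's pipeline.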

Other things lacking from this process that could be added:

  • Conversion of ~ to <nowiki>...</nowiki>
  • Conversion of %%% to <br>
  • Conversion of <verbatim> to <pre>...</pre>

%%%, pre, and bare Wikiword support has been added to this script here: http://wikiworld.com/PhpWiki2MediaWiki
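
For reference, two of these additions are simple enough to sketch as plain substitutions. This is an illustration only, not the script from the link above. The function name extra_markup is hypothetical, and the ~ escape is left out here.

```shell
# extra_markup: convert %%% line breaks and <verbatim> blocks on stdin.
extra_markup() {
  sed -e 's!%%%!<br>!g' \
      -e 's!<verbatim>!<pre>!g' \
      -e 's!</verbatim>!</pre>!g'
}
```

The three expressions could equally be added as lines to the sed script itself.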

MediaWiki 1.10.0

I attempted a conversion from PHPWiki 1.3.13 to MediaWiki 1.10.0.

The scripts above worked well, but I had to run one more SQL query at the end in order for the pages and revisions to be linked correctly in the database:

UPDATE page,revision SET page.page_latest = revision.rev_id WHERE page.page_id=revision.rev_page;

Results: I converted 112 articles. Most of them will need manual tweaking of the markup, as the sed script didn't get everything. I also still need to write a script to insert uploaded files.

Also, I used the maintenance/importImages.php script and the maintenance/rebuildtextindex.php script. Our PHPWiki install used a .htaccess file to control access to the wiki, so here's how to import the users from the matching .htpasswd file:

#!/bin/bash

USERLIST=`cut -f 1 -d ':' /path/to/.htpasswd |perl -n -e "print ucfirst"`

for username in $USERLIST
do
cat <<END | mysql -u root mediawikidb
INSERT INTO user(user_name) values ("$username");
UPDATE user SET user_password=md5(concat(user_id,'-',md5('changeme'))) where user_name = "$username";
END
echo "Added $username and set password."
done

See also