Posts Tagged ‘Mediawiki’

Converting HTML to Mediawiki using Sed and Curl

Monday, February 18th, 2008

I just created a Bot that takes existing HTML web pages, converts them to Wiki format, and then uploads the new pages to a Mediawiki Site. Instead of writing everything in Perl, I decided to just stick with simple scripts, and use Sed and Curl. It took some time to get all of the formatting correct, but it is now working very well.
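To give a rough idea of what the conversion step can look like, here is a minimal sketch of a Sed filter that turns a small subset of HTML into Wiki markup. It is not the original script: the function name html_to_wiki and the choice of tags are assumptions made for illustration, and a real page would need many more rules.

```shell
#!/bin/sh
# html_to_wiki is a hypothetical filter, not the original script.
# It reads HTML on stdin and writes wiki markup on stdout.
# Rule order matters: inline tags and links are converted first, so the
# heading and list rules only ever see text without leftover tags.
html_to_wiki() {
  sed -E \
    -e "s#</?(b|strong)>#'''#g" \
    -e "s#</?(i|em)>#''#g" \
    -e 's#<a href="([^"]*)">([^<]*)</a>#[\1 \2]#g' \
    -e 's#<h1[^>]*>([^<]*)</h1>#= \1 =#g' \
    -e 's#<h2[^>]*>([^<]*)</h2>#== \1 ==#g' \
    -e 's#<h3[^>]*>([^<]*)</h3>#=== \1 ===#g' \
    -e 's#<li>([^<]*)</li>#* \1#g' \
    -e 's#</?(ul|p)>##g'
}

# Small demonstration on two lines of HTML.
printf '%s\n' \
  '<h2>Setup</h2>' \
  '<p>Use <b>sed</b> and <a href="http://example.com">curl</a>.</p>' |
  html_to_wiki
# Prints:
# == Setup ==
# Use '''sed''' and [http://example.com curl].
```

Sed works one line at a time, so a tag that is split across two lines, or that carries attributes these patterns do not expect, passes through unchanged.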

The Bot will log in to the Wiki Site, store the cookies, grab HTML pages, convert them to Wiki format, and then upload them to the new Wiki Site. I have managed to update hundreds of pages using the script, and the formatting looks correct on all of the pages that I have checked so far.
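The log in, cookie, and upload steps can be sketched with Curl roughly as follows. This is an assumption-heavy sketch, not the original script: it talks to the token-based api.php interface of current Mediawiki releases, which is a swapped-in technique and may not match how a 2008 install had to be driven. WIKI_API, COOKIE_JAR, BOT_USER, BOT_PASS and all function names are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: the URL, variable names and function names below are
# hypothetical placeholders, not taken from the original script.
WIKI_API="${WIKI_API:-https://wiki.example.org/api.php}"
COOKIE_JAR="${COOKIE_JAR:-./wiki-cookies.txt}"

# Pull one token out of the API's JSON reply (read from stdin), then undo
# the JSON escaping of the trailing backslash ("+\\" becomes "+\").
# usage: extract_token logintoken < reply.json
extract_token() {
  sed -n -E -e 's/.*"'"$1"'":"([^"]*)".*/\1/p' | sed -e 's/\\\\/\\/g'
}

# Every request reads and writes the same cookie jar, so the session
# created at log in is reused by the later calls.
api() {
  curl -s -b "$COOKIE_JAR" -c "$COOKIE_JAR" "$@" "$WIKI_API"
}

# Fetch a login token, then post the credentials together with it.
wiki_login() {
  token=$(api -G -d action=query -d meta=tokens -d type=login -d format=json |
    extract_token logintoken)
  api --data-urlencode "action=login" \
      --data-urlencode "lgname=$BOT_USER" \
      --data-urlencode "lgpassword=$BOT_PASS" \
      --data-urlencode "lgtoken=$token" \
      --data-urlencode "format=json"
}

# Fetch an edit (CSRF) token, then save the converted file as a page.
# usage: upload_page "Page title" page.wiki
upload_page() {
  token=$(api -G -d action=query -d meta=tokens -d format=json |
    extract_token csrftoken)
  api --data-urlencode "action=edit" \
      --data-urlencode "title=$1" \
      --data-urlencode "text@$2" \
      --data-urlencode "token=$token" \
      --data-urlencode "format=json"
}
```

With BOT_USER and BOT_PASS set, a run would call wiki_login once and then upload_page for each converted file. The token is extracted with Sed only to stay within the same two tools; a proper JSON parser would be more robust.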

It was actually a fun project, and I learned several new concepts in the process. I am going to try to expand on what I have done so far by adapting the script to more advanced applications, and also use it to make batch changes to my existing Wiki sites. It will be nice to have a way to go through and add new headers and footers to my Mediawiki pages. I have several other ideas in mind, and if anyone has a challenging idea for creating a Bot, I would be interested in hearing about it.

I think the next step will be to make the whole thing more general, since the current script is very much focused on the existing web page format. I suspect that it is relatively difficult to make a Bot that can decode any web page. HTML formatting varies so much between web sites that a fixed set of Regular Expressions (Regex) will almost always miss a few cases, leaving stray characters in the output where you do not want them.

Mediawiki 1.11 Clean URLs

Sunday, February 10th, 2008

It looks like Mediawiki 1.11 and later versions have a bug that breaks the previous fixes for clean URLs. If you try to use the new version of Mediawiki with the fixes that worked with the previous version, you will not be able to log in or log out. Whenever you click log in, an edit page comes up for index.php instead. Obviously, this is not going to work unless you disable Clean URLs.

The Solution: It turns out that the solution to this problem is to add the following to your LocalSettings.php file.

$wgUsePathInfo = false;

You will also still need to keep your previous settings, such as $wgArticlePath = "/$1"; and your previous RewriteRule. I actually use virtual hosting in Apache, which makes things a little more complicated.

It is surprising that Mediawiki does not make it easier to enable clean URLs, since Wikipedia uses them. Wikipedia actually uses the sitename/wiki/Main_Page format, but I prefer the sitename/Main_Page format.