Converting HTML to Mediawiki using Sed and Curl

Monday, February 18th, 2008

I just created a Bot that takes existing HTML web pages, converts them to Wiki format, and then uploads the new pages to a Mediawiki Site. Instead of writing everything in Perl, I decided to just stick with simple scripts, and use Sed and Curl. It took some time to get all of the formatting correct, but it is now working very well.

The Bot logs in to the Wiki Site, stores the cookies, grabs the HTML pages, converts them to Wiki format, and then uploads them to the new Wiki Site. I have managed to update hundreds of pages using the script, and the formatting looks correct on all of the pages that I have checked so far.
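To make those steps concrete, here is a minimal sketch of how such a pipeline could be laid out. It is not the original script: the function names, the cookie file path, the form field names (wpName, wpPassword, title, text), and the handful of tags handled are all assumptions for illustration. A real Mediawiki site also expects an edit token with each save, which this sketch leaves out.

```shell
#!/bin/sh
# Sketch only: names, URLs, and form fields below are illustrative
# assumptions, not taken from the original bot.

COOKIES="${COOKIES:-/tmp/wikibot-cookies.txt}"   # cookie jar reused by every request

# Log in and store the session cookies.
# usage: wiki_login LOGIN_URL USER PASSWORD
wiki_login() {
    curl -s -c "$COOKIES" -b "$COOKIES" \
         --data-urlencode "wpName=$2" \
         --data-urlencode "wpPassword=$3" \
         "$1" > /dev/null
}

# Fetch one HTML page to stdout.
# usage: fetch_page PAGE_URL
fetch_page() {
    curl -s "$1"
}

# Convert a small subset of HTML to Wiki markup (stdin to stdout).
# Each rule works on one line at a time, so a tag must open and close
# on the same line to be converted.
html_to_wiki() {
    sed -e 's|<h1>\(.*\)</h1>|= \1 =|g' \
        -e 's|<h2>\(.*\)</h2>|== \1 ==|g' \
        -e "s|</*b>|'''|g" \
        -e "s|</*i>|''|g" \
        -e 's|<a href="\([^"]*\)">\([^<]*\)</a>|[\1 \2]|g' \
        -e 's|</*p>||g'
}

# Post the converted text using the stored cookies.
# usage: upload_page SUBMIT_URL TITLE WIKITEXT_FILE
# (hypothetical form fields; a real site also needs an edit token)
upload_page() {
    curl -s -b "$COOKIES" \
         --data-urlencode "title=$2" \
         --data-urlencode "text@$3" \
         "$1" > /dev/null
}
```

With placeholder URLs, one page would go through the pipeline roughly like this: call wiki_login once, then run `fetch_page "$url" | html_to_wiki > page.wiki` and `upload_page "$submit_url" "Page title" page.wiki` for each page.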

It was actually a fun project, and I learned several new concepts in the process. I am going to try to expand on what I have done so far by adapting the script to more advanced applications, and also use it to make batch changes to my existing Wiki sites. It will be nice to have a way to go through and add new headers and footers to my Mediawiki pages. I have several other ideas in mind, and if anyone has a challenging idea for creating a Bot, I would be interested in hearing about it.

I think the next step will be to make the whole thing more general, since the current script is tailored closely to the format of the existing web pages. I suspect that it is relatively difficult to make a Bot that can decode any web page. HTML formatting varies so much between web sites that it is hard to avoid stray characters from the Regular Expression (Regex) substitutions showing up in the output when you do not want them.
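A small illustration of why this is hard, using a hypothetical rule (not one from my script) that turns bold tags into Wiki bold. It works for the markup it was written for, and silently leaves the HTML in place as soon as the page uses a different style.

```shell
#!/bin/sh
# Hypothetical single-purpose rule: <b>text</b> becomes '''text'''.
bold_to_wiki() {
    sed "s|<b>\([^<]*\)</b>|'''\1'''|g"
}

# Matches: lowercase tag, no attributes, opened and closed on one line.
printf '<b>same line</b>\n' | bold_to_wiki

# Not converted: uppercase tag with an attribute.
printf '<B class="x">other style</B>\n' | bold_to_wiki

# Not converted: sed sees one line at a time, so the pair is never matched.
printf '<b>split\nacross lines</b>\n' | bold_to_wiki
```

Only the first call produces Wiki markup. The other two pass through unchanged, so the raw tags end up in the uploaded page unless more rules are added for each variation.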