This section holds some of the software I’ve written in various languages (mostly PHP and JavaScript). All software is licensed under the GPL.

Word HTML Cleaner 1.1

While developing various websites I have needed to put large amounts of text from Word documents into a webpage, but converting it by hand would take too long, and the HTML that Word outputs is just plain awful. So I wrote this little JavaScript tool to strip all the junk tags and attributes from Word HTML, and to convert its plain-text lists into proper HTML lists.
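To give a rough idea of the tag and attribute stripping, here is a minimal sketch. It is not the actual Word HTML Cleaner source: the function name `cleanWordHtml` and the whitelists `ALLOWED_TAGS` and `ALLOWED_ATTRS` are assumptions made up for illustration, and the sketch leaves out the list conversion entirely.

```javascript
// Hypothetical whitelists; the real script's lists may differ.
const ALLOWED_TAGS = new Set(["p", "b", "i", "a", "ul", "ol", "li", "table", "tr", "td"]);
const ALLOWED_ATTRS = new Set(["href", "colspan", "rowspan"]);

function cleanWordHtml(html) {
  return html
    // Drop HTML comments, which is where Word hides its conditional blocks.
    .replace(/<!--[\s\S]*?-->/g, "")
    // Visit every opening or closing tag, including namespaced ones like <o:p>.
    .replace(/<(\/?)([a-z][a-z0-9]*(?::[a-z0-9]+)?)\b([^>]*)>/gi, function (match, slash, name, rest) {
      const tag = name.toLowerCase();
      // Unknown tags are removed, but the text between them is kept.
      if (!ALLOWED_TAGS.has(tag)) return "";
      if (slash) return "</" + tag + ">";

      // Keep only whitelisted attributes, matched case-insensitively.
      const attrPattern = /([a-z:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/gi;
      let attrs = "";
      let m;
      while ((m = attrPattern.exec(rest)) !== null) {
        const attr = m[1].toLowerCase();
        if (ALLOWED_ATTRS.has(attr)) {
          const value = (m[2] || m[3] || m[4] || "").replace(/"/g, "&quot;");
          attrs += " " + attr + '="' + value + '"';
        }
      }
      return "<" + tag + attrs + ">";
    });
}

console.log(cleanWordHtml('<P CLASS=MsoNormal STYLE="margin:0">Hello <B>world</B><o:p></o:p></P>'));
```

Because tag and attribute names are lowercased before they are checked against the whitelists, uppercase markup is normalised rather than thrown away, and empty elements such as `<td></td>` pass through untouched.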

XHR Chat

This is a demonstration chat script I wrote when experimenting with XMLHttpRequest. It never refreshes the page, keeps an unlimited post log, and checks for new posts once a second. It will degrade gracefully if the browser doesn’t support XMLHttpRequest, so long as it still has JavaScript enabled.
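The once-a-second polling can be sketched roughly as follows. This is an illustration under assumptions, not the chat script itself: the names `xhrTransport` and `createChatPoller`, the `chat.php` URL, the `since` query parameter, and the JSON response shape (an array of `{id, text}` posts) are all made up for the example.

```javascript
// Builds a function that asks the server for posts newer than `lastId`.
// The URL layout and the JSON response format are assumptions.
function xhrTransport(url) {
  return function (lastId, callback) {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", url + "?since=" + encodeURIComponent(lastId), true);
    xhr.onreadystatechange = function () {
      if (xhr.readyState === 4 && xhr.status === 200) {
        callback(JSON.parse(xhr.responseText));
      }
    };
    xhr.send(null);
  };
}

// Remembers the newest post id seen so far, so each poll only delivers new posts.
function createChatPoller(transport, onPost) {
  let lastId = 0;
  function poll() {
    transport(lastId, function (posts) {
      posts.forEach(function (post) {
        if (post.id > lastId) {
          lastId = post.id;
          onPost(post);
        }
      });
    });
  }
  return {
    poll: poll,
    start: function (intervalMs) {
      return setInterval(poll, intervalMs);
    }
  };
}

// Only start polling when the browser actually provides XMLHttpRequest.
if (typeof XMLHttpRequest !== "undefined" && typeof document !== "undefined") {
  const poller = createChatPoller(xhrTransport("chat.php"), function (post) {
    const line = document.createElement("div");
    line.textContent = post.text;
    document.body.appendChild(line);
  });
  poller.start(1000); // check for new posts once a second
}
```

Keeping the transport separate from the poller means a browser without XMLHttpRequest can be handed a different transport while the rest of the logic stays the same.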

15 Responses »

  1. Patrick - November 20th, 2006 at 2:40 pm

    Man, thank you thank you thank you!

    the Word Cleaner rocks my world!

  2. Greg - January 11th, 2007 at 2:33 pm

    I love the Word HTML Cleaner! It’s an absolutely wonderful script you’ve written. It’s exactly what I’ve been looking for… for many months in fact!

    I have only one suggestion, is it possible for the script to retain all empty tags?? The script seems to remove any and all empty cells and so all subsequent cells shift over in place of the missing cell(s).

    If this is something that could be worked into the script, or if you have any suggestions on how to implement it, I’d be forever grateful!!

    Thanks again for having written this!!

  3. Connor - January 12th, 2007 at 2:13 pm

    Glad you find it so useful Greg!

    The problem you mentioned has now been fixed.

  4. Greg - February 2nd, 2007 at 12:26 pm

    Thank you for applying the fix so quickly! Much appreciated. So far so good! Though, if I could make one more suggestion: can the script be set up so that it’s case insensitive? As of now, if the HTML code has any uppercase tags or attributes, all of the tags are removed. Now, I’m all for clean code, don’t get me wrong, but that’s just too clean 🙂 Anyway, although I’ve got a script that will convert all tags to lowercase, all of the attributes are left in uppercase and will be removed if not specified in the arrays (as uppercase). If case insensitivity could be integrated into your current script, that would be terrific!! Thank you

  5. Connor - March 25th, 2007 at 8:39 pm

    It took me a really long time, but I’ve implemented case insensitivity now. It will convert all tags and attribute names to lower case, but it won’t remove them just because they’re upper case.

  6. Matthew - May 26th, 2007 at 4:48 pm

    Just wanted to let you know that I use the html cleaner for documents on my website. Thanks!

  7. Rekcor - June 18th, 2007 at 8:04 am

    Thank you for your script, it is great!

    I had a small problem however: not all of Word’s special characters are converted, because they are in fact just normal letters, but in Word’s Symbol font. E.g. a for alpha.

    These can be replaced using regular expressions.

  8. Andrys - December 3rd, 2007 at 5:22 pm

    I’ve spent a long time looking for a WORD html cleaner and all of them, including Tidy, gave me a universe of woes, including bloated code and strange interpretations, making me just clean it up manually instead whenever I received a huge zillion-worded WORD (Microsoft tries to live up to that name) html doc to post.

    Yours did exactly what I wanted RIGHT AWAY. I thought something must be wrong, but nope, you’d really cleaned all that gunk while leaving the meat intact just as laid out. MANY thanks !

    (Tidy refused to even try, by the way.)

  9. Greg - March 4th, 2008 at 11:06 am

    Connor, this code is proving to be very helpful time and time again. I thank you for development of this script and providing it through your website. In the version I’m currently using I’ve included some additional coding that allows for an interface with various options. In my desire to go a little deeper with these options I was wondering at what point in the script could a function be inserted which would allow me to manipulate the various arrays and variables. I’ve tried inserting one at various points in the script, but I wasn’t able to position it at the correct location. I hope what I’m requesting makes sense. If you have any suggestions, I’d greatly appreciate them! Thank you!

  10. Tudor - June 18th, 2008 at 3:30 am


    Very useful script. I wonder if you could offer a version without the conversion of plain text lists to html lists. I tried to clean some texts and everything works amazingly well except that all the numbered titles begin with “1.” instead of their original numbers.

    Anyway, great stuff! Thanks!

  11. Tudor - June 19th, 2008 at 3:56 pm

    The list problem is solved: I have deleted three lines from the code in the source page and use the modified page off-line. Once again, a piece of (commented!) code that saves a lot of production time. Better than anything I have found in several hours of search, be it commercial or not. Congratulations, and thank you!

  12. Andrys - January 5th, 2009 at 8:16 pm

    I had reason to use it again, on an impossible WORD doc, mid-June 2008 and January 2009. Last wrote a note above in Dec ’07.


    Just wanted to thank you again.

  13. Peter - January 19th, 2011 at 3:07 am

    Thank you! Seems to work pretty well with Word 2010 too.

  14. Alex - March 1st, 2013 at 3:25 am

    Absolutely brilliant script! Thanks for providing it. The only problem I’m having is that bullet lists do not seem to be converting to • tags, unlike numbered lists, which are converting correctly to 1. tags. I’ve had a look through the code and it looks like you intended bullets to format as tags. Unfortunately I’m a long way from expert so can’t resolve it. If you have a moment and are able to offer some suggestions for resolving this I’d be really grateful. Thanks!

  15. Alex - March 1st, 2013 at 3:30 am

    Sorry, my posting above formatted itself as I was using HTML to point out an issue. I’ll try again, substituting < with ( and > with ). My query was in relation to bullet lists, which don’t seem to be formatting into unordered lists (unlike numbered lists, which are working perfectly). It looks like you intended bullet lists to work and format as (ul)(li)(/li)(/ul) lists, but unfortunately I’m no expert and can’t see why it isn’t working. Numbered lists, however, work a charm and do format as (ol)(li)(/li)(/ol).

    If you had a chance it would be seriously helpful if you were able to suggest a fix or point out where I might be going wrong. Thanks for the brilliant script!
