The quest for tidy HTML


If you create content for a blog, web site, CMS (content management system), or most anywhere else these days, you’re creating HTML documents. And if you’re creating that HTML in a simple text editor like Notepad (or even something a little fancier like TextMate or Sublime), you’re probably only including the HTML tags that you really need. Good for you!

But for the other 99% of us, who use tools like Microsoft Word (or iWork/Pages) for our writing, we have to deal with the messy reality that word processing software often puts extra stuff in the HTML markup that we don’t want. And that “extra stuff” can cause all sorts of rendering and formatting headaches when people view our content in web browsers, or when the content is served up by popular CMSs such as WordPress, Drupal, Joomla and others.

The Problem

To illustrate the problem, I tried typing this very simple document into Word:


When I saved that document as an HTML page, it was 812 lines of HTML! That’s because Word saves all of the information needed to allow editing of the document in Word later, as well as things like default style information and so on. Word does include an option for saving “filtered” HTML, and that reduces the output to 74 lines of HTML, but that’s still a bit crazy considering my document is just two simple lines of text. All of that extra HTML markup can (and does) cause problems in how the text gets rendered in CMSs and web browsers.

As a result of this problem, an entire cottage industry has sprung up around creating tools to clean up the complicated HTML generated by the “export to HTML” option in various word processing programs. For example, Jeff Atwood created a nice little utility for cleaning up Word’s HTML back in 2006, and you can find links to many others in that blog post and its comments. There are also tools to help clean up the HTML on the import side of things, such as the DocImport API module for Drupal, or the “paste from Word” option in the WordPress editor.


Those tools are helpful, but some of them are buggy, some only work in certain situations, and you often end up manually hacking the HTML anyway, to make sure your prose is properly formatted (or even readable) in its final published form.
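To give a feel for what these cleanup tools are doing, here is a minimal sketch in Python that uses only the standard library. It is not how Jeff Atwood’s utility, the Drupal module, or the WordPress editor actually work; it is just one simple whitelist approach. The names WordHtmlCleaner, clean_word_html, KEEP_TAGS, SKIP_CONTENT and KEEP_ATTRS are made up for this illustration, and the list of tags worth keeping is an assumption you would adjust for your own content.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tags worth keeping in a simple post (an assumption).
KEEP_TAGS = {"p", "b", "strong", "i", "em", "u", "a", "ul", "ol", "li",
             "h1", "h2", "h3", "h4", "blockquote", "br"}
# Elements whose entire contents are thrown away (style and metadata blocks).
SKIP_CONTENT = {"head", "style", "script", "xml", "title"}
# Attributes allowed through; everything else (class, style, lang, ...) is dropped.
KEEP_ATTRS = {"a": {"href"}}


class WordHtmlCleaner(HTMLParser):
    """Keeps whitelisted tags and text, drops everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP_TAGS:
            return  # e.g. <span>, <o:p>, <div>: drop the tag, keep its text
        allowed = KEEP_ATTRS.get(tag, set())
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or ""))
            for name, value in attrs if name in allowed
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in KEEP_TAGS or tag == "br":
            return
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    # Comments, including Word's conditional comments, are dropped because
    # HTMLParser.handle_comment does nothing unless it is overridden.


def clean_word_html(html):
    """Return a stripped-down version of Word-style HTML."""
    cleaner = WordHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()  # flush any buffered text
    return "".join(cleaner.out)
```

With this sketch, passing a Word-style paragraph such as `<p class=MsoNormal style="margin:0"><span lang=EN-US>Hello</span></p>` to clean_word_html should give back just `<p>Hello</p>`. A whitelist is a deliberately blunt instrument: it throws away anything it does not recognize, which is usually what you want for Word output, but it will also discard tables and images unless you add them to the list.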

The Solution

My favorite solution to this problem is Windows Live Writer (WLW), which is available as a free download that installs in a matter of seconds if you have a fast internet connection. (Only available for Windows – sorry, Mac fans.)

WLW is smart about removing HTML cruft, especially the cruft generated by Word. You just copy and paste the content from Word into the WLW editor, and then you can click on the Source tab in WLW and copy that clean, tidy HTML to where you need it.

As an example of how well this works, my test document above goes from 812 lines of messy HTML to just these two lines of HTML when I paste it into WLW:


If you need to tidy up the HTML from Word, this is a great trick to use!

Word also has a “blog post” document type that works well for posting directly from Word to a blog. But that doesn’t solve the messy-markup problem for times when you need to generate clean HTML for non-blog destinations such as a static web page. In my simple test, the blog post document type generates HTML about four times smaller than what you get when saving a standard Word document as HTML, but the WLW trick above results in HTML that is about 400 times smaller (812 lines down to 2).

