In this article we will first discuss the case for and against using Word as your HTML editor. Then we will see how to save a Word file as smaller, more compact HTML. Finally, we will see how to do this through code and create a batch process for converting many Word files to HTML.

The case for and against Word as an HTML editor

Microsoft has given us the ability to save a Word file as HTML in many recent editions of Office. It’s a very easy process, and many people create HTML pages this way because:

  • They are already familiar with Word and its formatting features.
  • Word comes installed on their computer, and they do not want to purchase additional HTML authoring software.
  • They have numerous files in Word format that they want on a website in HTML. Simply exporting them to HTML is the fastest way.

Unfortunately, there is a downside to this method: Word does a terrible job of creating compact, cross-browser HTML source code. If this is important to you, then you should probably stay away from using Word as your HTML editor in the first place. That said, it is still possible to clean up the generated code quite a bit, first through Word itself and then through other tools or custom regular expressions.
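
As a rough illustration of the regular-expression route, the sketch below strips a few kinds of Word-specific markup from an HTML string. The function name CleanWordHtml and the three patterns are assumptions chosen for illustration. They are not an exhaustive cleanup, and the markup Word produces varies between versions.

```vbscript
Option Explicit

' Hypothetical helper: removes some Word-specific markup from an HTML string.
Function CleanWordHtml(ByVal html)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True

    ' Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    re.Pattern = "<!--\[if[\s\S]*?<!\[endif\]-->"
    html = re.Replace(html, "")

    ' Office namespace tags such as <o:p> and </o:p>
    re.Pattern = "</?o:[^>]*>"
    html = re.Replace(html, "")

    ' class attributes whose value starts with Mso (for example class=MsoNormal)
    re.Pattern = "\s+class=""?Mso[^\s"">]*""?"
    html = re.Replace(html, "")

    CleanWordHtml = html
End Function

' Prints: <p>Hello</p>
WScript.Echo CleanWordHtml("<p class=MsoNormal>Hello<o:p></o:p></p>")
```

Patterns like these are best tried on a copy of the exported file first, since an overly broad expression can remove markup you meant to keep.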

Saving as HTML from Word

Start by opening an existing Word file on your system, or by creating a new one and typing in some text and pictures. Then click on File > Save as Web Page…

Word will then display the Save As dialog box.

We can see that Word took the filename of the DOC file (for any new files it creates a filename based on the title of the document) and is prompting us to save it with the extension .htm. This is clearly shown by the select box labeled Save as type which has Web Page (*.htm; *.html) already selected. We can now perform the normal save operations, like choosing the name and location of the HTML file. However, Word has a save option called Filtered HTML which greatly reduces the HTML code produced.

It’s important to understand the difference between the two options. When Word saves a file as HTML, it still wants to be able to open it again in Word and keep the same formatting as when you created it. It does this by leaving a lot of proprietary Word code inside the generated HTML file. If, however, we simply want to export our content to the smallest HTML file possible, without needing to re-open it in Word, we can choose the Filtered HTML option. This produces smaller files, less HTML code and, even more important, source code that is more cross-browser compatible. When you select this option and click on Save, you will get a popup alerting you to this fact.

Click on Yes to finish the process. Something else worth noting happens here on save. Suppose you have some images embedded inside your Word file. These images could be GIFs, JPGs, BMPs, PNGs, etc. When you insert an image in Word, the image file is actually embedded inside the file and is saved along with it. When we save the file as HTML, Word exports all these images to a folder that it creates in the same location as the exported HTML file, and then generates links to them inside the HTML code. The exported images are handled like so:

  • They are resized to match the width and height they were given inside Word.
  • They are converted to GIFs and JPGs.
  • Their names stay the same.
  • The name of the folder that they are stored under is the name of the HTML file that is created, plus the suffix “_files”. For example, if the filename is “My company.htm”, then the images will be under the folder “My company_files”.
  • The link inside the HTML file to the images is relative. For example, <img src="My company_files/house.gif">.
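
The folder naming rule above can be sketched in a few lines of VBScript. ImagesFolderFor is a hypothetical helper name and the sample path is only an assumption for illustration. The function works on the path string alone, so the file does not need to exist.

```vbscript
Option Explicit

' Hypothetical helper: given the path of an exported HTML file, returns the
' path of its companion image folder (base name plus "_files").
Function ImagesFolderFor(htmlPath)
    Dim fso
    Set fso = CreateObject("Scripting.FileSystemObject")
    ImagesFolderFor = fso.BuildPath(fso.GetParentFolderName(htmlPath), _
                                    fso.GetBaseName(htmlPath) & "_files")
End Function

' Example path, assumed for illustration only.
' Prints: C:\Inetpub\wwwroot\word\My company_files
WScript.Echo ImagesFolderFor("C:\Inetpub\wwwroot\word\My company.htm")
```

This is handy when moving exported pages around, because the HTML file and its image folder have to travel together for the relative links to keep working.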

Exporting to HTML through code

Let us assume that we have a bunch of Word files sitting inside a directory, and they all need to be converted to HTML files. We could open each one and follow the procedure above, but that can take a long time, depending on how many of them there are. Instead, we can use a little Windows Script Host (WSH) scripting to do this for us. The idea is the same: create an instance of the Word application, loop through the folder, open each DOC file that we find, export it as Filtered HTML, close the file, move on to the next, and finally close the Word application object. Let’s first look at the code needed to do this with WSH VBScript, and then we will break it down.


Option Explicit

'declare all variables
Dim objWord
Dim oDoc
Dim objFso
Dim colFiles
Dim curFile
Dim curFileName
Dim folderToScanExists
Dim folderToSaveExists
Dim objFolderToScan

'set some of the variables
folderToScanExists = False
folderToSaveExists = False
Const wdSaveFormat = 10 'for Filtered HTML output

'********************************************
'change the following to fit your system
Const folderToScan = "C:\Word\documentation\"
Const folderToSave = "C:\Inetpub\wwwroot\word\"
'********************************************

'Use FSO to see if the folders to read from
'and write to both exist.
'If they do, then set both flags to TRUE,
'and proceed with the function
Set objFso = CreateObject("Scripting.FileSystemObject")
If objFso.FolderExists(folderToScan) Then
 folderToScanExists = True
Else
 MsgBox "Folder to scan from does not exist!", 48, "File System Error"
End If
If objFso.FolderExists(folderToSave) Then
 folderToSaveExists = True
Else
 MsgBox "Folder to copy to does not exist!", 48, "File System Error"
End If

If (folderToScanExists And folderToSaveExists) Then
    'get your folder to scan
    Set objFolderToScan = objFso.GetFolder(folderToScan)
    'put all the files under it in a collection
    Set colFiles = objFolderToScan.Files
    'create an instance of Word
    Set objWord = CreateObject("Word.Application")
    If objWord Is Nothing Then
        MsgBox "Couldn't start Word.", 48, "Application Start Error"
    Else
        'for each file
        For Each curFile in colFiles
            'only if the file is of type DOC (compared case-insensitively)
            If (LCase(objFso.GetExtensionName(curFile)) = "doc") Then
                'get the filename without extension
                curFileName = curFile.Name
                curFileName = Mid(curFileName, 1, InStrRev(curFileName, ".") - 1)
                'open the file inside Word
                objWord.Documents.Open objFso.GetAbsolutePathName(curFile)
                'do all this in the background
                objWord.Visible = False
                'get the document we just opened and save it as Filtered HTML
                Set oDoc = objWord.ActiveDocument
                oDoc.SaveAs folderToSave & curFileName & ".htm", wdSaveFormat
                oDoc.Close
                Set oDoc = Nothing
            End If
        Next
        'close Word (only reached if Word actually started)
        objWord.Quit
    End If
    'set all objects and collections to nothing
    Set objWord = Nothing
    Set colFiles = Nothing
    Set objFolderToScan = Nothing
End If

Set objFso = Nothing

Save the code above as a .vbs file (for example, createdoc.vbs) somewhere on your system. Before you use it, you must change the two constants folderToScan and folderToSave. These tell the script which folder to look in for Word files and which folder to save the HTML files to. Once you have edited both, double-click the .vbs file to run it.

The code scans through the folder defined in folderToScan. It first creates an instance of the File System Object and does a simple check to see if both folders exist. It then maps to the folder to scan and puts all the files under it in a collection. Next, it creates an instance of the Word application and loops through the files in the collection. Each Word file that it finds is opened and saved as Filtered HTML. If you now look inside the output folder, folderToSave, you will see the newly created HTML files with their corresponding directories of images.
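
The extension test in the loop only matches the doc extension. One possible variation, along the lines of the DOCX suggestion in the comments below, is to move the test into a small function. IsWordFile is a hypothetical name, and skipping files that start with "~$" (the temporary owner files Word creates while a document is open) is an extra assumption about what you want to exclude.

```vbscript
Option Explicit

' Hypothetical helper: True for .doc and .docx files, ignoring case, and
' False for Word's temporary "~$" owner files.
' Expects a bare file name (for example curFile.Name), not a full path.
Function IsWordFile(fileName)
    Dim fso, ext
    Set fso = CreateObject("Scripting.FileSystemObject")
    ext = LCase(fso.GetExtensionName(fileName))
    IsWordFile = (ext = "doc" Or ext = "docx") And (Left(fileName, 2) <> "~$")
End Function

WScript.Echo IsWordFile("Report.DOCX")     ' a Word file
WScript.Echo IsWordFile("~$Report.docx")   ' an owner file, so not converted
WScript.Echo IsWordFile("notes.txt")       ' not a Word file
```

Inside the loop, the condition would then read If IsWordFile(curFile.Name) Then, with the rest of the script unchanged.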

The constant wdSaveFormat is the number passed to SaveAs to tell Word which file format to write. Setting it to 10 creates Filtered HTML files. For regular HTML output use the number 8. This will produce bigger HTML files but will maintain the Word formatting.
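
To make the two numbers easier to read in the script, they can be given names. The sketch below is one way to do it. The names wdFormatHTML and wdFormatFilteredHTML follow Word's own WdSaveFormat enumeration, while useFilteredHtml and saveFormat are hypothetical names introduced here.

```vbscript
Option Explicit

' Values from Word's WdSaveFormat enumeration. VBScript cannot see Word's
' named constants, so they are declared by hand.
Const wdFormatHTML = 8            ' regular HTML, keeps the Word-specific markup
Const wdFormatFilteredHTML = 10   ' Filtered HTML, smaller output

' Hypothetical switch: set to False to keep the Word formatting information.
Const useFilteredHtml = True

Dim saveFormat
If useFilteredHtml Then
    saveFormat = wdFormatFilteredHTML
Else
    saveFormat = wdFormatHTML
End If

' Prints: SaveAs format: 10
WScript.Echo "SaveAs format: " & saveFormat
```

In the conversion script, saveFormat would take the place of wdSaveFormat as the second argument to oDoc.SaveAs.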

14 Responses to Exporting Word files to HTML

  • Marco Antonio Pivetta

    Great script! Very professional and well documented.
    Although, I found one bug on line 64. Just added a backslash after folderToSave. The way it was, the last directory of the folder name becomes part of the name of the file.

    64. oDoc.SaveAs folderToSave & "\" & curFileName & ".htm", wdSaveFormat

    It became perfect for me when I changed line 54 to allow docx files also.

    54. If (objFso.GetExtensionName(curFile) = "doc") Or (objFso.GetExtensionName(curFile) = "docx") Then

    Thank you. It just solved a great problem for me. I had tried to do this before, but my skills were not enough.

    I missed rss to add this site on my favorite list.

    • Evagoras Charalambous

      @Marco Antonio:

      Doesn’t adding the slash or not depend on how you define your folder in line 22? In my case, the variable “folderToSave” was defined as “C:\Inetpub\wwwroot\word\” (notice the last slash), so I don’t need to add another one on line 64 like you suggest. Double check your variable and how you define your folder and let me know if it still doesn’t work for you.

      Great addition of the DOCX document type! I only had DOC files to parse out when I wrote this, but it’s certainly nice to have in there.

      Here’s the link to the RSS:
      http://www.evagoras.com/feed/

  • Yahya Abdal-Aziz

    Hi, Evagoras!

    This is a great article, as it clearly explains the problems and demonstrates a solution, showing every step and what to expect on-screen. Thank you for the effort you took to write so well. :-)

    Could you tell me, please, which versions of Word can output Filtered HTML?

    Regards,
    Yahya

    Wheelers Hill,
    Victoria,
    Australia

    • Evagoras Charalambous

      @Yahya Abdal-Aziz

      I think the version I first tried this trick on was Word 2007. I assume every other version since then will have kept that feature, but I am not sure how far back this goes. Perhaps the 2007 is the first to offer this. Maybe a Word pro who reads this can verify?

  • Taher Dawoodi

    Hello Evagoras,
    I was having a problem with the images disappearing when i converted my word doc to html. They are showing up now when i used the Filtered settings. Not sure why that was happening. I am new to this, could you please tell me a good html editor that i can use instead of word ?

    Thanks,
    T

    • Evagoras Charalambous

      I am glad you got it working in the end. There are so many free and paid good HTML editors that I wouldn’t know where to begin! You could try Dreamweaver by Adobe, Expression by Microsoft, iWeb from Apple, or just Kompozer which is free. A quick search in Google for “good html editors” will reveal a plethora of choices for you to pick. My HTML editor of choice is Eclipse.

    • Carsten Beck

      Hi Taher,

      i also experience the problem of missing images. more precisely, the images are replaced with empty gif counterparts of the same dimensions as the originals. The conversion works when launching the script from the cli but fails when called programmatically.

      Have you (or anyone else reading this) gathered more information on the issue or do you have any idea how to get rid of the behaviour?

      Thx in advance and best regards, Carsten

  • Chris Barlow

    Many thanks Evagoras,

    Great code, exactly what I needed and so well explained.

    Chris

  • xenomorf

    Thank you very much Sir!

    Just what i need at the moment :D

    • xenomorf

      somehow, When it finish converting, A lot of ‘looks-like-junk-file’ is created with a filename start with ~$.

      • Evagoras Charalambous

        I have never run into that situation before. What version of Word files are you trying to convert? Do they contain anything else other than text and images?

  • Inara Hawley

    Just wanted to say a quick thanks for the very clear explanation.:)

  • Mark Barnes

    Thank you – just what I needed. Thanks to your well documented code, I was very easily able to modify the script to reverse the process (i.e. convert from HTML to .docx). Thanks again.

  • Micah Brewer

    This script has been very helpful on several occasions.

    Is there any way to get this script to traverse down through subfolders in the FolderToScan?

    C:\Word\documentation\
    C:\Word\documentation\SubFolder1
    C:\Word\documentation\SubFolder2
    C:\Word\documentation\SubFolder3
    C:\Word\documentation\SubFolder4

    I have doc and docx files scattered through a huge file folder structure and don’t want to have to update the FolderToScan definition for all of those. What about naming multiple FoldersToScan? I’m okay with the FolderToSave being a single folder for outputting the filtered HTML.

    Thanks…
