In this article we will first discuss the case for and against using Word as your HTML editor. Then we will see how to properly save a Word file to smaller, more compact HTML files. Third and last, we will see how to do this through code, and create a batch process for converting numerous Word files to HTML.
The case for and against Word as an HTML editor
Word has been able to save a document as HTML for several versions of Office now. It’s a very easy process, and many people create HTML pages this way because:
- They are already familiar with Word and its formatting features.
- Word comes installed on their computer, and they do not want to purchase additional HTML authoring software.
- They have numerous files in Word format that they want on a website in HTML. Simply exporting them to HTML is the fastest way.
Unfortunately, there is a downside to this method: Word does a terrible job of creating compact, cross-browser HTML source code. If this is important to you, then you should probably stay away from using Word as your HTML editor in the first place. However, having said this, it is still possible to clean up the generated code quite a bit, first through Word itself and second through other tools or custom Regular Expressions.
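As a rough illustration of the regular-expression cleanup mentioned above, here is a minimal Python sketch. The function name clean_word_html and the patterns it uses are assumptions chosen for illustration; they target a few constructs commonly seen in Word's output (conditional comments, o: tags, Mso class names, mso- style declarations) and are not an exhaustive or authoritative filter.

```python
import re


def clean_word_html(html: str) -> str:
    """Strip a few common Word-specific constructs from exported HTML."""
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Office namespace tags such as <o:p> and </o:p> (any enclosed text is kept)
    html = re.sub(r"</?o:[^>]*>", "", html, flags=re.IGNORECASE)
    # class attributes that refer to Word's Mso* styles (quoted or unquoted)
    html = re.sub(r'''\s+class=("Mso[^"]*"|'Mso[^']*'|Mso[^\s>]*)''', "", html,
                  flags=re.IGNORECASE)
    # mso-* declarations; note this is applied to the whole text, not only
    # to style attributes, so it is deliberately crude
    html = re.sub(r'''\s*mso-[a-z-]+\s*:[^;"']*;?''', "", html,
                  flags=re.IGNORECASE)
    # style attributes left empty by the previous step
    html = re.sub(r'''\s+style=("\s*"|'\s*')''', "", html, flags=re.IGNORECASE)
    return html


if __name__ == "__main__":
    sample = ("<p class=MsoNormal style='mso-margin-top-alt:auto;color:red'>"
              "Hello<o:p></o:p></p>")
    print(clean_word_html(sample))
```

Regular expressions do not understand HTML structure, so a pass like this is best run on a copy of the file and checked in a browser afterwards.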
Saving as HTML from Word
Start by opening an existing Word file on your system, or by creating a new one and adding some text and pictures. Then click on File > Save as Web Page…
Word then displays the Save As dialog box.
We can see that Word took the filename of the DOC file (for any new files it creates a filename based on the title of the document) and is prompting us to save it with the extension .htm. This is shown by the select box labeled Save as type, which has Web Page (*.htm; *.html) already selected. We can now perform the normal save operations, like choosing the name and location of the HTML file. However, the same select box also offers a Filtered HTML option (it appears in the list as Web Page, Filtered), which greatly reduces the amount of HTML code produced.
It’s important to understand the difference between the two options. When Word saves a file as HTML, it still wants to be able to open it again in Word and keep the same formatting as when you created it. It does this by leaving a lot of proprietary Word code inside the generated HTML file. If, however, we simply want to export our content to the smallest HTML file possible, without needing to re-open it in Word, we can choose the Filtered HTML option. This produces smaller files, less HTML code and, even more important, source code that is more compatible across browsers. When you select this option and click on Save, you will get a popup warning you that the Word-specific information will be removed.
Click on Yes to finish the process. Something else worth noting happens here on save. Suppose you have some images embedded inside your Word file. These images could be GIFs, JPGs, BMPs, PNGs, etc. When you insert an image in Word, the image file is actually embedded inside the file and is saved along with it. When we save the file as HTML, Word exports all these images to a folder that it creates in the same location as the exported HTML file, and then generates links to them inside the HTML code. The exported images are handled like so:
- They are scaled down or up to match the width and height they were given inside Word.
- They are converted to GIFs and JPGs.
- Their names stay the same.
- The name of the folder that they are stored under is the name of the HTML file that is created (without its extension), plus the suffix “_files”. For example, if the filename is “My company.htm”, then the images will be under the folder “My company_files”.
- The link inside the HTML file to the images is relative. For example, <img src="My company_files/house.gif">.
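The naming rules in the list above can be expressed in a few lines of Python. This is only a sketch of the convention as described here; the helper names image_folder_name and image_src are hypothetical, and the file names are the ones from the example above.

```python
from pathlib import PureWindowsPath


def image_folder_name(html_filename: str) -> str:
    """Return the folder name used for images: the HTML file's base name plus "_files"."""
    return PureWindowsPath(html_filename).stem + "_files"


def image_src(html_filename: str, image_name: str) -> str:
    """Return the relative link to an exported image, as it would appear in the src attribute."""
    return f"{image_folder_name(html_filename)}/{image_name}"


if __name__ == "__main__":
    print(image_folder_name("My company.htm"))
    print(image_src("My company.htm", "house.gif"))
```

Because the link is relative, the HTML file and its “_files” folder have to be moved or uploaded together.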
Exporting to HTML through code
Let us assume that we have a bunch of Word files sitting inside a directory, and they all need to be converted to HTML files. We could open each one and follow the procedure above, but that can take a long time, depending on how many files there are. Instead, we can use a little Windows Script Host (WSH) scripting to do this for us. The idea is the same: create an instance of the Word application, loop through the folder, open each DOC file that we find, export it as Filtered HTML, close the file, move on to the next, and finally close the Word application object. Let’s first look at the code needed to do this with WSH VBScript, and then we will break it down.
Option Explicit

'declare all variables
Dim objWord
Dim oDoc
Dim objFso
Dim colFiles
Dim curFile
Dim curFileName
Dim folderToScanExists
Dim folderToSaveExists
Dim objFolderToScan

'set some of the variables
folderToScanExists = False
folderToSaveExists = False
Const wdSaveFormat = 10 'for Filtered HTML output

'********************************************
'change the following to fit your system
Const folderToScan = "C:\Word\documentation\"
Const folderToSave = "C:\Inetpub\wwwroot\word\"
'********************************************

'Use FSO to see if the folders to read from
'and write to both exist.
'If they do, then set both flags to True,
'and proceed with the conversion
Set objFso = CreateObject("Scripting.FileSystemObject")

If objFso.FolderExists(folderToScan) Then
    folderToScanExists = True
Else
    MsgBox "Folder to scan from does not exist!", 48, "File System Error"
End If

If objFso.FolderExists(folderToSave) Then
    folderToSaveExists = True
Else
    MsgBox "Folder to copy to does not exist!", 48, "File System Error"
End If

If (folderToScanExists And folderToSaveExists) Then
    'get your folder to scan
    Set objFolderToScan = objFso.GetFolder(folderToScan)
    'put all the files under it in a collection
    Set colFiles = objFolderToScan.Files

    'create an instance of Word; CreateObject raises an error
    'on failure, so trap it and test the result afterwards
    Set objWord = Nothing
    On Error Resume Next
    Set objWord = CreateObject("Word.Application")
    On Error GoTo 0

    If objWord Is Nothing Then
        MsgBox "Couldn't start Word.", 48, "Application Start Error"
    Else
        'do all this in the background
        objWord.Visible = False

        'for each file
        For Each curFile In colFiles
            'only if the file is of type DOC (any letter case)
            If LCase(objFso.GetExtensionName(curFile.Path)) = "doc" Then
                'get the filename without extension
                curFileName = objFso.GetBaseName(curFile.Path)
                'open the file inside Word
                Set oDoc = objWord.Documents.Open(curFile.Path)
                'save the open document as Filtered HTML
                oDoc.SaveAs folderToSave & curFileName & ".htm", wdSaveFormat
                oDoc.Close
                Set oDoc = Nothing
            End If
        Next

        'close Word
        objWord.Quit
    End If

    'set all objects and collections to nothing
    Set objWord = Nothing
    Set colFiles = Nothing
    Set objFolderToScan = Nothing
End If

Set objFso = Nothing
Save the code above as a .vbs file (for example, createdoc.vbs) somewhere on your system. Before you use it, you must change the two constants folderToScan and folderToSave. They define which folder to look in for Word files and which folder to save the HTML files to. Both paths must end with a backslash, because the script appends the filename directly to folderToSave. Once you have edited these two, double-click the .vbs file to run it.
The code scans through the folder defined in folderToScan. It first creates an instance of the File System Object and checks that both folders exist. It then maps to the folder to scan and puts all the files under it in a collection. Next, it creates an instance of the Word application and loops through the files in the collection. Each Word file that it finds is opened and saved as Filtered HTML. If you now look inside the output folder, folderToSave, you will see the newly created HTML files with their corresponding directories of images.
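The folder-scanning part of this logic can be tried out without Word installed. The Python sketch below only works out which files would be converted and where each result would be saved; it does not open Word or write any HTML. The function name plan_conversions is hypothetical, and the sketch lower-cases the extension so that files ending in .DOC are picked up as well.

```python
from pathlib import Path


def plan_conversions(folder_to_scan, folder_to_save):
    """Return (source .doc path, target .htm path) pairs for a batch run."""
    scan_dir = Path(folder_to_scan)
    save_dir = Path(folder_to_save)
    # Same precondition as the script: both folders must already exist
    if not scan_dir.is_dir():
        raise FileNotFoundError(f"Folder to scan from does not exist: {scan_dir}")
    if not save_dir.is_dir():
        raise FileNotFoundError(f"Folder to copy to does not exist: {save_dir}")

    pairs = []
    for entry in sorted(scan_dir.iterdir()):
        # Only plain files whose extension is "doc", in any letter case
        if entry.is_file() and entry.suffix.lower() == ".doc":
            # Same base name, saved with the .htm extension in the target folder
            pairs.append((entry, save_dir / (entry.stem + ".htm")))
    return pairs


if __name__ == "__main__":
    # Scans the current directory as a harmless demonstration
    for source, target in plan_conversions(".", "."):
        print(source, "->", target)
```

Printing the planned pairs before a real run is a cheap way to confirm that the two folder constants point where you expect.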
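The constant wdSaveFormat holds the file format number that is passed to SaveAs. Setting it to 10 (wdFormatFilteredHTML in Word's WdSaveFormat enumeration) creates Filtered HTML files. For regular HTML output use the number 8 (wdFormatHTML). This will produce bigger HTML files but will maintain the Word formatting. The script uses the literal numbers because a plain .vbs file does not have access to Word's named constants.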