PageToScreen

Monday, December 16, 2002

Word Documents to screen as HTML

I’ve been considering this issue lately for a couple of reasons. I manage the web site for a large faculty in a University and find that I am often asked to transfer documents created for paper to the web. Most of these documents have been created in Microsoft Word and are most often designed for paper, though I use the term ‘design’ very loosely!

These documents are usually created by office staff who do not have any training in graphic design. I apologise for this grand generalisation, but it is often true. They are most often laid out for the A4 page format, which is the standard and can be photocopied in most offices. The vertical (portrait) format is used and, when reproduced in quantity, the document is usually 'comb bound'. There are some decisions to be made when transferring these documents to the web:
  • Should the pages be laid out in a different format?
  • Should the pages be delivered as separate pages or should they scroll?
  • What happens to the header and footer or footnotes on these pages?

This article considers delivering the text as text within a web browser. An alternative approach is to output the document as PDF; this will be discussed in a later posting to ‘pagetoscreen.net’.

Page Layout

Paper documents are most often created for the vertical (A4) format. The computer screen is horizontal. The pages could be laid out for a horizontal page format, but this might involve a lot of hand crafting to get the pagination right. Usually, I leave the text in 'portrait' mode and expect the browser to be scrolled to reach all of the text. Of course, if the document consists of lots of pages then a navigation system is essential. Perhaps next and previous page links at the bottom of each page are appropriate, or even a hyperlinked list of contents at the top of each page. Some texts (like this one) are simply presented as a single long scrolling page. This has an important advantage: the text can be printed easily.

To a certain extent, changes in layout between the paper version and the web version are optional; it depends on how much work you want to do. When a Word document includes footnotes, these could be gathered together in one place as endnotes rather than at the bottom of each page, and ideally each endnote reference should be hyperlinked to the source bibliographic reference.

Output to HTML

Microsoft Word can generate web pages, but earlier versions are less successful than later ones such as Office 2000. Here are some issues that often crop up when converting to HTML:
  • If you use fonts other than 'web safe' fonts, such as Times, Helvetica, Arial and Verdana (more about Verdana later), Microsoft Word will reference every font used in your document within its complex stylesheet; however, users of the web may not have those fonts on their computers.
  • The HTML that MS Word creates is very complex and introduces some features not always supported by browser software.
  • Web designers dislike the code generated by Microsoft Word because it is difficult to incorporate into design templates or existing web pages.

There are several techniques for creating HTML from Word documents:

'save as' HTML

Using 'save as HTML' offers some control, depending on the version and platform. Although Word does create a web page usable in Internet Explorer, the stylesheet created can be very complex. Word does maintain the 'look' of the document, but there are some incompatibilities when opening the web pages in browsers other than Microsoft Internet Explorer. Netscape doesn't always render the page correctly.
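To give a feel for the kind of Word-specific markup that makes these pages hard to reuse, here is a minimal Python sketch that filters some of it out of an HTML string. It is an illustration only, not how any particular clean-up tool works: the function name `strip_word_markup` and the sample fragment are made up for this example, and the patterns cover only a few common cases (conditional comments, namespaced Office tags such as `<o:p>`, `Mso...` class names and inline styles).

```python
import re


def strip_word_markup(html: str) -> str:
    """Remove a few kinds of Word-specific markup (hypothetical helper)."""
    # Conditional comments, e.g. <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.DOTALL)
    # Namespaced Office tags, e.g. <o:p> and </o:p>; any text inside is kept
    html = re.sub(r"</?\w+:\w+[^>]*>", "", html)
    # Class attributes such as class=MsoNormal or class="MsoNormal"
    html = re.sub(r'\s+class=("?)Mso\w*\1', "", html)
    # Quoted inline style attributes. This also discards any formatting
    # that was expressed only through the style attribute.
    html = re.sub(r"\s+style=(\"[^\"]*\"|'[^']*')", "", html)
    return html


# A hand-written fragment in the style of Word's output (illustrative only)
sample = (
    "<!--[if gte mso 9]><xml><w:View>Normal</w:View></xml><![endif]-->"
    '<p class=MsoNormal style="margin:0cm"><b>Hello</b><o:p></o:p></p>'
)
print(strip_word_markup(sample))
```

Because the last step throws away every inline style, any formatting that Word stored only in those styles is lost as well, so a filter like this needs checking against each document.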

[Figure: The Web Options dialogue in Tools, when saving as HTML]

The complex code that MS Word uses can be removed, although sometimes with disastrous results! Tools are available, usually in the form of plug-ins for 'WYSIWYG' editors like Macromedia Dreamweaver or Adobe GoLive. These are usually only successful after a degree of experimentation. Some tools are also available as a service online. One such service is currently available at: http://www.textism.com

copy / paste technique

A simple approach is to use the copy and paste technique: copy from Word, paste into a web editor. With Word open, choose Edit > Select All, then Edit > Copy. Then, in your HTML editor, simply paste the contents of the clipboard into the open window. This technique is NOT appropriate for complete documents because ALL of the formatting is lost. Bold, italics, headings etc. are all lost. So I only recommend this approach when you wish to copy and paste smallish chunks of text from Word.

Word to HTML conversion applications

There are some standalone applications that will convert pages created in MS Word into HTML. One such tool is 'FileMerlin' at http://www.acii.com; they also have an online service on a pay-per-file conversion basis. Another is 'Myrmidon' at http://www.terrymorse.com. This software runs ONLY on Macs, pre-OS X. A review of issues about converting Word to HTML can be found at philip.greenspun.com

Design considerations

Of course there are design issues in delivering text on screen that will be considered in more detail in future articles. The first point to make is that often the documents are NOT designed well for paper, and so repurposing them for screen reduces the quality of communication still further.

Text width

One of the most frequent design blunders is to use the FULL width of the A4 paper when laying out the text. Book and magazine designers are aware that line lengths of about 65 characters are optimum, but up to 80 characters is acceptable. When the eye reaches the end of a long line of text it is then hard to find the beginning of the next. This can cause tiredness, a fact that I can vouch for, having read far too many student dissertations for my own good!

Images

Microsoft Word is NOT a page layout application, although it has significantly improved over the years in giving the author some 'WYSIWYG' capabilities. Images can be used in Word, but there is little control over positioning or the flow of text around the graphics.

Fonts

If the ultimate goal is to deliver on paper only, then ANY font can be used. Be careful though. Stick to variations of one or possibly two fonts. Use one font for the body text and another for the headings or special items. Do not use more than two fonts in one document. If there is a hierarchy to your document (headings, subheadings etc.), then use different sizes and weights to indicate this, but don't overdo it, or readers could be confused about the levels in your hierarchy.

Documents that will be converted to web pages should only use web safe fonts. These are: Arial, Arial Black, Courier New, Comic Sans, Georgia, Impact, Times New Roman, Trebuchet, Verdana. All of these fonts will be available on a computer that has had Internet Explorer installed. So even with Netscape as the preferred browser it is very likely that these fonts will be available, since all PCs and Macs come with MS Internet Explorer already installed. One more thing about fonts: don't make any exceptional changes to the display of the font, such as character spacing or scaling, because these will not be converted correctly to HTML.

Columns

One way to reduce the line length but still use most of the paper is to divide the page into columns. This is a typical style of magazine and newspaper layout but isn't usually used in book design. If the text is to be transferred to a web page then it would HAVE to appear in one column (unless the text is very short). We have become used to scrolling down the web page to read and indeed, with the use of the 'scroll' mouse, this is even easier. But what we don't want to do is to scroll down and then scroll back up to read down the next column.

Text alignment

In a paper document it is possible to achieve 'justified' text; that is, text that appears to be aligned both left and right. Microsoft Word will accomplish this by spacing the words and applying hyphenation to break words between lines. Page layout programs like Quark and InDesign give the designer more control over these features. On the web it is possible to create justified text with style sheets, but only some browsers support this. The technique is very crude and can lead to poor results. Do not be tempted to use 'centred' text apart from within buttons or titles. Text that is centred over more than 2 lines is VERY difficult to read.
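The text width guideline mentioned earlier (about 65 characters per line, 80 at most) is easy to check on plain text. The following Python sketch uses the standard `textwrap` module; the function names `rewrap` and `long_lines` and the two constants are my own labels for this illustration, and the sample sentence is invented.

```python
import textwrap

OPTIMUM_WIDTH = 65   # characters per line regarded as optimum
MAXIMUM_WIDTH = 80   # longest line length regarded as acceptable


def rewrap(text: str, width: int = OPTIMUM_WIDTH) -> str:
    """Re-flow a paragraph so that no line exceeds `width` characters."""
    # Collapse existing line breaks and runs of spaces before wrapping
    return textwrap.fill(" ".join(text.split()), width=width)


def long_lines(text: str, limit: int = MAXIMUM_WIDTH) -> list[str]:
    """Return the lines that are longer than `limit` characters."""
    return [line for line in text.splitlines() if len(line) > limit]


paragraph = (
    "When the eye reaches the end of a long line of text it is then hard "
    "to find the beginning of the next, and this can cause tiredness."
)
print(rewrap(paragraph))
print(len(long_lines(paragraph)), "line(s) over the acceptable length")
```

This only measures characters in plain text. In a browser the rendered line length also depends on the window width and the font, so the check is a rough guide rather than a guarantee.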

Posted by Chris Jennings on 16 Dec around 4pm

Further Information:

I went to all these places for information:
  • Netmechanic was consulted
  • Webstyle Guide was consulted
  • Philip Greenspun’s site is really useful
  • I’m adding Dean Allen’s excellent TEXTISM site to my permanent links

I did it my way:
  • A PDF ebook version of this text is available (440k).
  • This idea map shows you how I thought this through (50k). Made with ‘Inspiration’.

And even Further Information:

Useful tips

Styles

Always use styles in the Word document, as this will make it easier to modify for delivery on a screen. A change to the body text, for example to a web safe font, can then be made globally over the whole document.
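Before changing fonts globally, it can help to know which of the fonts in a document fall outside the web safe list given in this article. Here is a small Python sketch of that check. The function name `non_web_safe` and the example font list are hypothetical; in practice you would type in the font names you find in the Word document yourself.

```python
# The web safe fonts as listed in this article
WEB_SAFE_FONTS = {
    "Arial", "Arial Black", "Courier New", "Comic Sans", "Georgia",
    "Impact", "Times New Roman", "Trebuchet", "Verdana",
}


def non_web_safe(fonts_used: list[str]) -> list[str]:
    """Return the fonts (sorted, without duplicates) not on the web safe list."""
    safe = {name.lower() for name in WEB_SAFE_FONTS}
    # Compare case-insensitively and ignore surrounding spaces
    return sorted({f.strip() for f in fonts_used if f.strip().lower() not in safe})


# Hypothetical list of fonts found in a Word document
fonts_in_document = ["Verdana", "Garamond", "arial", "Palatino", "Garamond"]
print(non_web_safe(fonts_in_document))
```

Any font that this reports is a candidate for replacing, through its style, with one of the web safe fonts.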

Creative Commons License
This work is licensed under a Creative Commons Attribution License.
