Monday, October 04, 2010

Using Word to Format a Manuscript for the Web

My apologies, but what you see below hasn't been proof-read. I tried to type it up before work, and now I'm running late for work! I will fix things later today; sorry for the rough state of this.


This week's "meaty" post deals with how to use some of Microsoft Word's more advanced search and replace features. They're useful to understand because you can use them to take (for example) a manuscript and add HTML tags so you can post an excerpt of it on the web. You can also use these features just to quickly hunt for subtle formatting errors, like a tab inserted on a blank line.

Unfortunately, this post describes how to do these things for Word 2003 or earlier. Me and the new Word are frenemies, not friends. That said, a lot of what I'll talk about can be done in the new Word if you're familiar with its interface.

Show Non-visible Characters

First, turn on Word's formatting symbols. This inserts a non-printing symbol into your document for "invisible" items like carriage returns, tabs, and spaces. To do this, look on your icons for the pilcrow symbol, which looks like this: ¶, i.e. a backward "p" thingy.

Click that. Ooh, now your manuscript looks ugly, doesn't it? Don't worry; all those new symbols do not print out.

In principle, you can go look for invisible formatting errors now, but a search-and-replace would be easier.

How to Search for Non-Visible Characters

Begin the search-and-replace the usual way: Click "Edit", then "Replace...".

Let's say you want to find a tab followed by a paragraph break (or "carriage return", which is what you get when you hit the "Enter" or "Return" key on your computer to start a new line).

In the "Search" box, type the following:

^t^p

What does that mean? The ^t tells Word to search for a tab. The ^p tells it to search for a paragraph break. Now type your correction into the "Replace" box. You presumably want to get rid of the tab, so enter:

^p

Click "Find Next" or "Replace All" as usual to complete your search-and-replace.
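If you'd rather script this cleanup on a plain-text copy of your manuscript, here is a rough Python equivalent of the same replace. It's only a sketch: it assumes the text uses "\n" line endings, and the function name is made up for this example.

```python
def strip_tab_before_break(text: str) -> str:
    """Remove a tab that sits immediately before a line break (Word's ^t^p -> ^p)."""
    return text.replace("\t\n", "\n")


sample = "First paragraph.\n\t\n\tSecond paragraph.\n"
cleaned = strip_tab_before_break(sample)
print(repr(cleaned))
```

The tab at the start of the second paragraph survives, because it isn't followed directly by a line break.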

A list of useful characters:

^p paragraph break
^l a new-line (these are different than carriage returns and can mess up your formatting in subtle ways, so it's good to know they exist)
^t tab

^- an optional hyphen
^~ a non-breaking hyphen

^+ an em-dash
^= an en-dash

^m a manual page break
^s a non-breaking space
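
For the curious, the sketch below pairs each of those codes with what I believe is its nearest Unicode character and counts how often each one turns up in a piece of text. Treat the table as an assumption rather than Word's own definition: whether Word's plain-text export keeps every one of these characters depends on the encoding you pick when saving. The names WORD_CODES and count_special are invented for this example.

```python
# Approximate Unicode equivalents of Word's search codes (assumed, not Word's own table).
WORD_CODES = {
    "^p": "\n",      # paragraph break (a line feed in a plain-text copy)
    "^l": "\x0b",    # manual line break (vertical tab)
    "^t": "\t",      # tab
    "^-": "\u00ad",  # optional (soft) hyphen
    "^~": "\u2011",  # non-breaking hyphen
    "^+": "\u2014",  # em dash
    "^=": "\u2013",  # en dash
    "^m": "\x0c",    # manual page break (form feed)
    "^s": "\u00a0",  # non-breaking space
}


def count_special(text: str) -> dict:
    """Count how often each special character occurs, keyed by Word's code."""
    return {code: text.count(ch) for code, ch in WORD_CODES.items() if ch in text}


print(count_special("One\u00a0two \u2014 three\tfour\n"))
```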

How about searching for extra spaces? You can just type those into the search-and-replace boxes and go hunting. I recommend doing it that way if you're looking for a double space, or want to insert a double space after a period, but it's possible you have a more problematic situation, like a manuscript where the "tabs" were created by typing in a variable number of spaces.

For a situation like that, I recommend using wildcards and regular expressions.

Using Wildcards and Regular Expressions (a.k.a. RegEx) to Replace Extra Spaces

First, let me say I don't recommend using wildcard expressions unless you absolutely need to and have done some research to understand what they are and how they work. You can find more information about them on the web by googling appropriate terms, like "ms word search wildcards" and "ms word search regular expressions". This website is a decent place to start, but is still a bit cryptic.

Never mind that, however; we're going to use just an eensy weensy little bit of this stuff to find multiple spaces.

Open your search-and-replace dialogue. See the little box at the bottom that says "More"? Click that, then click the ticky-box beside "Use Wildcards".

Now, in the "Search" box, type:

•{3,}

where • is a space (you might not see a dot in the box; just make sure you typed the space.)

The {3,} means "hunt for 3 or more of the character I typed before this". You can change it to {2,100} and it will search for anywhere from 2 to 100 of the previous character.

Add your ^t symbol to the "Replace" box and complete your search-and-replace as usual to change multiple spaces into a single tab.
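
The same idea carries over almost unchanged to regular expressions outside Word. Here is a small Python sketch of it, assuming you're working on a plain-text copy; the function name and the sample sentence are mine.

```python
import re


def spaces_to_tab(text: str, minimum: int = 3) -> str:
    """Replace each run of `minimum` or more spaces with a single tab."""
    return re.sub(r" {%d,}" % minimum, "\t", text)


fixed = spaces_to_tab("     Indented with five spaces. Two  spaces stay.")
print(repr(fixed))
```

Runs shorter than the minimum (like the double space in the sample) are left alone, just as with {3,} in Word.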

Adding HTML tags to the Document

First, I'll note that we're using Word to do this only because its search-and-replace feature is so lovely. When you're finished adding your HTML tags, etc., you need to save your file in .txt format, which is to say as a plain text file. Word adds aaaaall kinds of formatting to documents that will utterly bork your HTML. Plain text is what you want at the end of the day.

Before you start, click "Tools" on the menu bar, then "Autocorrect...". Now, turn off almost all the options on the "Autocorrect" and "AutoFormat as You Type" tabs. This will save you heartache later.

A really useful trick I found when working in my Word document is to highlight all the things I want to add tags to in another colour so I can spot them easily.

Finding BOLD, ITALIC and CENTRED Text in Word

Open a find dialogue by clicking "Edit" > "Find..."

To search for text in italics, click into the "Find What" box BUT DO NOT TYPE ANYTHING. Instead, hold down the "Ctrl" key and type "i" on your keyboard.

You should see a bit of text appear below the box saying that you're looking for italic text. (You would press Ctrl+i twice more to turn it off.)

Now click the ticky-box for "Highlight all items found in:" and make sure "Main Document" is selected.

When you click "Find All", Word will select all the italics items in the document.

This doesn't add colour, however. To do that, click the highlighter icon and choose a colour.

A list of these useful "hot-keys":

Ctrl+i for italics text (twice more to turn it off)
Ctrl+b for bold text (twice more to turn it off)
Ctrl+e for centred text (once more to turn it off)

Now we're ready to add our HTML tags.

I recommend you add your italics (e.g. <i> and </i>), bold and centering tags before you do your paragraph (<p> and </p>) tags.

To add <i> and </i> tags around all italics text, first open a search-and-replace dialogue.

In the "Search" box, type no text but press Ctrl+i to select italic text.

In the "Replace" box, type <i>^&</i>

You (hopefully) recognize the HTML italics tags. The ^& symbol indicates to Word that whatever text it finds that fits the search criteria (i.e. text in italics) should be left as is between the new <i> and </i> tags.

To add similar tags to bold and centred text, repeat the process by changing the search criteria from italics to bold or centred, and then changing your HTML tags in the "Replace" box.
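
To make the ^& idea concrete, here is a toy Python model of it. Word keeps track of which stretches of text are italic or bold; the Run class below is a hypothetical stand-in for that bookkeeping, not anything Word exposes, and I'm assuming <b> for the bold tag.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text with uniform formatting (hypothetical stand-in for Word's data)."""
    text: str
    italic: bool = False
    bold: bool = False


def runs_to_html(runs):
    """Wrap italic runs in <i>...</i> and bold runs in <b>...</b>, like <i>^&</i>."""
    parts = []
    for run in runs:
        piece = run.text  # the found text itself, which is what ^& stands for
        if run.italic:
            piece = f"<i>{piece}</i>"
        if run.bold:
            piece = f"<b>{piece}</b>"
        parts.append(piece)
    return "".join(parts)


print(runs_to_html([Run("She was "), Run("very", italic=True), Run(" late.")]))
```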

Once you've finished with the formatting, you want to surround your paragraphs with paragraph tags (<p> and </p>). This is a bit trickier. The way I do it is the following:

I add the <p> tags first. In the "Search" box, I type ^t because I have tabs at the start of all my paragraphs.

In the "Replace" box, I type <p>. I then complete my search-and-replace as usual.

Next, I do the </p> tags. In the "Search" box, I type ^p to register the ends of my paragraphs.

In the "Replace" box, I type </p>^p where the ^p at the end is there for legibility only. It keeps my paragraphs from running together on the page and won't affect the final HTML document at all.

But WAIT! It would be a very bad thing to just hit "Replace All" at this point, because I use paragraph breaks to add vertical white space to my document. Having a bunch of </p> tags with no <p> tags would make very broken HTML.

So I click into my document and select just my paragraphs, then apply the </p> tags to them only, and not areas of white space.

Helpful hint: You probably know how to highlight a few paragraphs with your mouse, but what you probably don't know is that if you highlight one section of text, then press and hold down the Ctrl key, you can highlight other sections too without highlighting the bits in the middle (which would be areas with white space.) This allows you to highlight just blocks of text and skip over the vertical white spaces around scene breaks or chapter headings.
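
If selecting blocks by hand sounds tedious, the whole paragraph-tagging step can be sketched in a few lines of Python. This is not the method described above but a one-pass variation on it, and it rests on two assumptions: every real paragraph starts with a tab, and each paragraph sits on a single line of a plain-text copy.

```python
def tag_paragraphs(text: str) -> str:
    """Wrap tab-indented lines in <p>...</p>; leave blank spacer lines alone."""
    out = []
    for line in text.split("\n"):
        if line.startswith("\t"):
            out.append("<p>" + line[1:] + "</p>")
        else:
            out.append(line)  # blank lines, headings, scene breaks
    return "\n".join(out)


sample = "\tFirst paragraph.\n\n\tSecond paragraph.\n"
tagged = tag_paragraphs(sample)
print(tagged)
```

Because only lines that begin with a tab get tags, the blank lines used for vertical white space never pick up a stray </p>.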

Replacing "Curly" Quotes, etc., With HTML Entities

Word uses some nice-looking characters, like curving quotation marks or apostrophes, that don't necessarily translate over to HTML well. These characters are the reason why when someone emails you text they cut-and-pasted into their email program from Word, there are sometimes odd characters sprinkled throughout it. The email program couldn't understand what the curly quotes (etc.) were.

Thus, it's a good idea to strip out these characters and replace them with the corresponding HTML entity.

An HTML entity is a code that translates into a symbol. For example, if I type "& # 60; p & # 62;" without the quotes and with all the spaces removed into an HTML document (or even into Blogger when I'm typing this post), the result when I look at that document on a web browser is the following collection of characters: <p>. The "& # 60;" gives me a < symbol and the "& # 62;" gives me a > symbol.
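
You can check this for yourself with Python's standard html module. The snippet is only an illustration of how the numbers relate to the characters; the variable names are mine.

```python
import html

# A numeric entity is "&#" + the character's code number + ";".
encoded = f"&#{ord('<')};p&#{ord('>')};"
print(encoded)

# Decoding the entities gives back the characters a browser would display.
decoded = html.unescape(encoded)
print(decoded)
```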

Do a simple search-and-replace and switch in the following codes for their corresponding characters. I've left off the &# symbols so the numbers show up. Just remember to sandwich all these numbers between &# and ;

8220 Left-hand-side curly double quote
8221 Right-hand-side curly double quote
8217 Apostrophe (right-hand-side single quote)
8216 Left-hand-side single quote
8212 Em dash
8230 Ellipsis

This website provides a more complete list of codes.
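
If you have a plain-text copy handy, the six swaps listed above can also be done in one go with a short script. A minimal Python sketch, with ENTITIES and to_entities as made-up names:

```python
# Word's "smart" punctuation mapped to numeric HTML entities.
ENTITIES = {
    "\u201c": "&#8220;",  # left double quote
    "\u201d": "&#8221;",  # right double quote
    "\u2019": "&#8217;",  # apostrophe / right single quote
    "\u2018": "&#8216;",  # left single quote
    "\u2014": "&#8212;",  # em dash
    "\u2026": "&#8230;",  # ellipsis
}


def to_entities(text: str) -> str:
    """Swap each listed character for its numeric entity."""
    return text.translate(str.maketrans(ENTITIES))


converted = to_entities("\u201cDon\u2019t\u2026\u201d")
print(converted)
```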

Convert to Plain Text

When you've completed all your preparatory work in Word, you want to save your file as plain text. Click "File", then "Save As...", then change the "Save as type:" box at the bottom to the "Plain Text (*.txt)" option and save the file.


Open up the file in a text editor like MS Notepad (DO NOT use MS Wordpad.) To make your file an HTML document that can be read on a web browser, add the following code to the very top of the file:


<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<title>YOUR TITLE HERE</title>
</head>
<body>


and the following code to the very bottom of your file:

</body>
</html>


Save it, and you should now be able to look at it on a web browser by opening the browser, then clicking "File" > "Open File...".
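
If you end up doing this for more than one excerpt, pasting the top and bottom by hand gets old. Here is a hedged Python sketch of the wrapping step. It uses a stripped-down skeleton (no XML declaration or DOCTYPE, for brevity), and the function name is invented for this example.

```python
import html


def wrap_as_html(title: str, body: str) -> str:
    """Put already-tagged body text inside a minimal page skeleton."""
    return (
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )


page = wrap_as_html("YOUR TITLE HERE", "<p>First paragraph.</p>")
print(page)
```

If you write the result to a file, save it as UTF-8 so it matches the charset named in the meta tag.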

Now a big fat caveat: It might look great, but there might still be many, many errors in your document. Web browsers are very forgiving of HTML, in that they will quietly ignore or patch up code they recognize as bad.

If you're happy with how your document looks, great, don't worry about the hidden errors. However, if the reason you converted it to HTML was so you could (for example) ePublish it on the Kindle, then what you see may not be good enough.

There are many websites that will validate your code, but you have to actually put your manuscript on the internet so they can access it. Obviously, only post your manuscript temporarily and take it down as soon as you're done checking it.

I like this validator (type your manuscript's URL into the box at the bottom of the page), but it is a fussy one. It will insist that you get every error out.

Don't panic, however, if it gives you several thousand errors at first. What's happening is that there might be one error (say a missing or an extra HTML tag) that makes everything after it into an error also. If you fix the first typo, often a bunch of other ones appear to just evaporate.
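
To see that cascade in miniature before you post anything online, here is a rough Python sketch that checks whether <p>, <i> and <b> tags are balanced. It is no substitute for a real validator, it only watches those three tags, and the class and function names are mine.

```python
from html.parser import HTMLParser


class TagChecker(HTMLParser):
    """Report <p>, <i> and <b> tags that are closed out of order or never closed."""
    WATCHED = {"p", "i", "b"}

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags as (name, line number)
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WATCHED:
            self.stack.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag not in self.WATCHED:
            return
        line = self.getpos()[0]
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"line {line}: unexpected </{tag}>")


def check(text):
    """Return a list of complaints; an empty list means the watched tags balance."""
    checker = TagChecker()
    checker.feed(text)
    checker.close()
    leftovers = [f"line {line}: <{tag}> never closed" for tag, line in checker.stack]
    return checker.problems + leftovers


for complaint in check("<p>Fine.</p>\n<p>Missing <i>close.</p>\n"):
    print(complaint)
```

In the sample, a single missing </i> produces several complaints at once, which is the same snowballing the validator shows you.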

Author website: J. J. DeBenedictis
