Completing the WordML spec
The WordML spec will probably never be fully completed.
But here is an attempt to extend the documentation where Microsoft
is silent.
The rtf spec became a gigantic mess because it was incomplete
in places, vague in other places, and wrong (i.e. didn’t match
what Word did) in others. Because of that, every programmer writing
code to read/write rtf files made different reasonable assumptions
and we were left with a not terribly portable spec.
I am hoping that in the case of WordML we can get most
developers to make the same assumptions. This will at worst leave us
with two WordML specs in practice, the common assumptions that all
non-Microsoft developers use and the assumptions used by developers
in the Word group. So, as you make assumptions of your own, please email
me what you did and I will add it to the spec, as long as your
assumptions do not conflict with an assumption already listed.
Step 1 - Buy the book Office
2003 XML. This is a fantastic introduction to WordML and
SpreadsheetML.
Microsoft has submitted the xml specs to ECMA - details
here.
Questions
- For rFonts - What's h-ansi vs cs? I ask because a WordML document with Hebrew defines h-ansi but not cs - I expected cs to be the one defined.
- How do you get the format of the bullet for a list style? It appears to use the w:tmpl tag but there is no info about what the values mean.
- What is the relationship between the w:ind defined in a list and the w:ind in a para? It appears to use
both to determine the left indent but I'm not comfortable enough with my guess yet to post it here.
- To store an image in a WordML document, the .chm
file says you use <pict> and it has a <binData> subnode.
It has several other subnodes listed but they are all irrelevant
for a bitmap image. WordProcessingML.doc and any xml file saved by
Word both show a <v:shapetype> and <v:shape> subnode.
But neither explains any of the attributes or subnodes of those two
nodes. The vml schema lists all of the attributes and sub-nodes,
but has no description with any of them. To store a bitmap in a WordML
file - what attributes and sub-nodes need to be set and with what
values? If I just guess on this based on saving a bunch of files,
there are so many undocumented elements here I'm guaranteed to guess
wrong on some. (details below on how I think an image should be written.)
- <instrText> - Where can I find a list of what
fields Word uses this for, and how to set all attributes and sub
nodes in each case?
- For <w:fldSimple w:instr= where can I find a list
of all values Word 2003 supports? And for each value, a list of which
child nodes it requires and which it uses when they are optionally present?
- When I create a hyperlink in Word, it saves it in
WordML using <fldChar>/<instrText>/<fldChar> ... <fldChar>.
Why does it do this instead of using hlink (which is what the docs
show)?
- Reading the WordProcessingML.doc and the OfficeXML.chm
files, it looks like a horizontal merge of cells is supposed to
be created using the <hmerge> node. But when Word creates the
xml file, it uses gridSpan. When should you use one vs the other?
(I assume as it has both, there is a reason for this and there are
situations where using one is required and other situations where
the other is required.)
- WordML - <w:tblLook w:val="00BF"/> - The docs for
WordML do not specify values for 0x1F. Yet Word 2003 writes this
node as <w:tblLook w:val="00BF"/> for a standard
table. So... What is the meaning of these 5 bits?
- According to the spec a table style can have in its <w:tcPr> node
a <w:hmerge> node. What does this mean when it says every
cell in the table is defined to be part of a horizontally merged
section? I understand its use under a <w:tc> node. It's allowing
it in the style that I don't understand.
Assumptions
- It is allowed to insert tags of any kind anywhere
in the document enclosing parts of the Word document. For example,
smart tags wrap the <w:r>...</w:r> they tag
inside <st1:...> nodes. What seems to work correctly is to ignore
any tags from namespaces other than o/w/v but continue to process
nodes within them that you do recognize. I do skip over nodes that
are in the o/w/v namespace that I don't recognize.
- It appears the <w:p>/<w:pPr>/<w:rPr> is totally
ignored. If there is no <w:rPr> in the <w:r> it uses
the values from the style. If there is a <w:rPr> in the <w:r>,
it overrides the style - not the beginning of paragraph stuff.
- The wx: elements appear to be duplications of w: or o: elements
and exist to make it easier for a program other than Word (Internet
Explorer?) to make the document appear identically. As they are redundant
information, and may be wrong in places (wx:bdrwidth is specified
as points but Word appears to write it in twips) my approach is never
write a wx: element or attribute and never use one when reading the
document.
- If you want just a single header or footer, as opposed
to one for odd pages and one for even pages – create just an
odd header. As Word does it this way and the spec is silent on the
issue, my approach is it is an error to create just an even header.
For a header on even pages only, create a blank odd header.
- For <w:font> do not write/read the usb-0…csb-1 attributes.
They are not needed and there is no guarantee that the values for
the font on your system will be the same on another person’s
system.
- Apparently there is no way to set the default language
for a document (unlike rtf). You can set it on a paragraph by paragraph
basis.
- When I write "\u2003\u2002\u2009" (em,
1/2em, 1/6 em) to a text node in a WordML file, I get (using Courier
so it's fixed width);
- em - just slightly under 2 chars wide space.
- 1/2 em - 1 char wide space.
- 1/6 em - a box (the unknown char symbol) 1 char
wide.
- It looks the same with Arial & Times New
Roman. The box for 1/6 em is definitely wrong and the other spacing
is not what you normally get. No idea why but this is how Word
does it.
- The units for <w:pBrdr><w:top w:space='#'> are
listed as 1/8 of a point. However Word appears to interpret them
as 1 point. It does accept and handle real numbers so you can have
<w:pBrdr><w:top w:space='3.5'>.
- Word creates <w:lvlText w:val="%1.%2.%3."/> where
the formatting appears to be: put the level 1 number at %1, the level
2 number at %2, etc. Word only allows 9 levels so %10 should be considered
illegal.
- The character 0x2011 (a non-breaking hyphen) shows
up as the unknown glyph box. If you use ToggleCharacterCode
(Alt+X) twice on the "unknown glyph" box, it displays
and works properly. So use <w:noBreakHyphen/> instead.
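The first assumption above (treat foreign-namespace tags as transparent wrappers, but skip unrecognized o/w/v nodes) can be sketched roughly as follows. This is only an illustration of that rule, not how Word itself reads the file. The KNOWN set is a deliberately tiny, hypothetical list of supported elements, and w:unknownThing in the sample is an invented element standing in for anything the reader does not recognize.

```python
import xml.etree.ElementTree as ET

# Namespace URIs behind the w/v/o prefixes in Word 2003 XML.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
V = "urn:schemas-microsoft-com:vml"
O = "urn:schemas-microsoft-com:office:office"
KNOWN_NS = {W, V, O}

# Hypothetical: the only elements this toy reader understands.
KNOWN = {(W, "body"), (W, "p"), (W, "r"), (W, "t")}

def split_tag(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def collect_text(elem, out):
    uri, local = split_tag(elem.tag)
    if uri in KNOWN_NS and (uri, local) not in KNOWN:
        return  # unrecognized o/w/v node: skip it and everything inside it
    if (uri, local) == (W, "t"):
        out.append(elem.text or "")
    # Foreign-namespace wrappers (e.g. smart tags) fall through to here,
    # so the recognized nodes inside them are still processed.
    for child in elem:
        collect_text(child, out)

def extract_text(xml_string):
    out = []
    collect_text(ET.fromstring(xml_string), out)
    return "".join(out)

SAMPLE = (
    f'<w:body xmlns:w="{W}" '
    'xmlns:st1="urn:schemas-microsoft-com:office:smarttags">'
    "<w:p><st1:place><w:r><w:t>Boulder</w:t></w:r></st1:place>"
    "<w:r><w:unknownThing><w:t>hidden</w:t></w:unknownThing></w:r>"
    "</w:p></w:body>"
)

print(extract_text(SAMPLE))
```

The text wrapped in the smart tag is kept, while the text under the unrecognized w: element is dropped along with its parent.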
Tables
- A table is like an html table - there is a column for any cell break
in any row. This can require that other rows have a w:gridSpan
attribute. The w:tblGrid/w:gridCol widths are always in twips and always exist (although Word
will read a file if they don't). These w:gridCol values are the
sole determiner of where cells break. Always write them, and I think
it's safe to assume they always exist. If any app ever writes a table where
the table values (everything from gridBefore to gridSpan)
do not work based on the concept that it is the same cells in each
row - my guess is it will not work with almost any WordML parser.
So please get this right.
- Each row then has an optional number of cells it skips before starting,
then its cells, then an optional number of cells it does not have
at the end. What do the following in <w:trPr> mean? Here is
what Word 2003 appears to be doing:
- gridBefore/gridAfter - this is the number of cells that are not
displayed at the beginning/end of this row - basically skipped cells.
- wBefore/wAfter - this is a twips count that matches gridBefore/gridAfter.
My vote is only read gridBefore/After but write both.
- w:tc/w:tcPr/w:tcW appears to be totally ignored. w:tblGrid/w:gridCol
appears to be the sole determiner of cell width. It usually matches
so don't read this, but do write it.
- tblCellSpacing - not sure exactly what this does.
- Ok, first the <w:jc.../> element in the tblPr settings
is for the positioning of the table as a whole and has no effect
on the text in the table (makes sense). And in a <pPr> it sets the
para alignment. As to the <trPr> use of <jc> - I cannot find anything
it does.
- If w:tbl/w:tblGrid/w:gridCol is set, that is the width for each
column. It cannot be overruled. If it is not set, the columns are
sized in autofit mode. The w:tbl/w:tr/w:tc/w:tcPr/w:tcW values
are ignored in every sample document I tried.
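The "same cells in each row" idea above can be checked mechanically: for every row, gridBefore plus the gridSpan of each cell plus gridAfter should equal the number of w:gridCol entries. Here is a minimal sketch of such a check, assuming a missing gridSpan means 1 and a missing gridBefore/gridAfter means 0 (my assumption, the docs do not say). The helper names and the sample table are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"

def q(local):
    """Qualified name in ElementTree's '{uri}local' form."""
    return f"{{{W}}}{local}"

def int_val(parent, path, default):
    """Read w:val of the node at path as an int, or return default."""
    node = parent.find(path)
    return int(node.get(q("val"))) if node is not None else default

def row_grid_counts(tbl):
    """Return (number of gridCol entries, grid columns used by each row)."""
    n_cols = len(tbl.findall(f"{q('tblGrid')}/{q('gridCol')}"))
    counts = []
    for tr in tbl.findall(q("tr")):
        total = int_val(tr, f"{q('trPr')}/{q('gridBefore')}", 0)
        total += int_val(tr, f"{q('trPr')}/{q('gridAfter')}", 0)
        for tc in tr.findall(q("tc")):
            total += int_val(tc, f"{q('tcPr')}/{q('gridSpan')}", 1)
        counts.append(total)
    return n_cols, counts

SAMPLE = f"""<w:tbl xmlns:w="{W}">
  <w:tblGrid>
    <w:gridCol w:w="1440"/><w:gridCol w:w="1440"/><w:gridCol w:w="1440"/>
  </w:tblGrid>
  <w:tr><w:tc/><w:tc/><w:tc/></w:tr>
  <w:tr>
    <w:trPr><w:gridBefore w:val="1"/></w:trPr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr></w:tc>
  </w:tr>
</w:tbl>"""

n_cols, counts = row_grid_counts(ET.fromstring(SAMPLE))
bad_rows = [i for i, used in enumerate(counts) if used != n_cols]
print(n_cols, counts, bad_rows)
```

A writer could run a check like this before saving and refuse to emit a table where bad_rows is not empty.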
Images
This is quite a bit so I have it here in its
own section. What I have here works – but I had to make so many
assumptions that almost certainly some of them do not match Microsoft’s.
- wmz/emz files covered in my blog.
- Cropping adds the following: “<v:imagedata src="wordml://01000001.gif" o:title="network" croptop="19661f" cropleft="19661f"/>” I
have no idea what the units (f) are for this. Note: The WordML docs
say this value should be from 0.0 - 1.0 and it's clearly not that.
- I have no idea what all of the possibilities are for style=. But
here are some I have guessed:
- width: the displayed width of the image after cropping and scaling.
Does not include padding. Write as '12.34pt'
- height: the displayed height of the image after cropping and
scaling. Does not include padding. Write as '12.34pt'
- z-index: If < 0 then it is under text. If > 0 it is over
text. Write as -1
- position:absolute The picture is not inline with the text but
is positioned absolutely on the page.
- margin-left: The distance from the reference point (see below)
to the left of the picture. Write as '12.34pt'
- margin-top: The distance from the reference point (see below)
to the top of the picture. Write as '12.34pt'
- mso-position-horizontal-relative: margin | page | text | char
- For position:absolute, where to measure from.
- mso-position-vertical-relative: margin | page | para | line
- For position:absolute, where to measure from.
- mso-wrap-distance-top: the padding at the top of the image.
Write as '12.34pt'
- mso-wrap-distance-left: the padding at the left of the image.
Write as '12.34pt'
- mso-wrap-distance-right: the padding at the right of the image.
Write as '12.34pt'
- mso-wrap-distance-bottom: the padding at the bottom of the
image. Write as '12.34pt'
- <w10:wrap type='square' | 'topAndBottom'/> - For wrapping
around a positioned image.
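On the crop values: one plausible reading, which nothing in the Word docs cited here confirms, is that a trailing f marks a VML fixed-point fraction in units of 1/65536. Under that assumption 19661f is about 0.3, which would agree with the documented 0.0 - 1.0 range after all. A small sketch of the conversion in both directions (function names are hypothetical):

```python
def vml_fraction(value):
    """Convert a crop value such as '19661f' or '0.3' to a float.

    Assumption: a trailing 'f' means the number is in 1/65536ths.
    """
    value = value.strip()
    if value.endswith("f"):
        return int(value[:-1]) / 65536
    return float(value)

def to_vml_fraction(fraction):
    """Write a 0.0 - 1.0 fraction back in the 'f' form."""
    return f"{round(fraction * 65536)}f"

print(vml_fraction("19661f"))
print(to_vml_fraction(0.3))
```

If this reading is right, cropping 30% off the top and left of an image would be written exactly as in the sample above.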
GIF:
<w:pict>
<w:binData w:name="wordml://01000006.gif">R0lGODlhEAAQALMAAAAAAIAAAACAAICAAAAAgIAAgACAgICAgMDAwP8AAAD/AP//AAAA//8A/wD/
/////yH5BAEAAA0ALAAAAAAQABAAAARaMJxJZ7u4ncf7Axm2ASRAIGCofQ7DnGk4ui+qrkSeV5WG
/L+NhwM6lEijo03YGTkMhsMSoah8oFFb4wgYVbSalBQIjBkvRm63e1gM2ENOGuBGUnnnbUxNakQA
ADs=
</w:binData>
<v:shape id="_1" type="#_x0000_t75" style="width:12pt;height:12pt">
<v:imagedata
src="wordml://01000006.gif" o:title="networ6"/>
</v:shape>
</w:pict>
binData is the base64 encoding of the gif image; don’t change
type="#_x0000_t75"; the wordml://name.gif in w:name and src must match; style="width:12pt;height:12pt" gives
the size in the doc.
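Putting those rules together, a writer for an inline bitmap might look roughly like this. It is a sketch built only from the sample above: the function name and parameters are hypothetical, the w/v/o prefixes are assumed to be declared on the document root, and anything beyond width and height in style is left out.

```python
import base64
from xml.sax.saxutils import quoteattr

def build_pict(image_bytes, name, width_pt, height_pt, title="", shape_id="_1"):
    """Return a <w:pict> fragment for an inline bitmap.

    name is the file-like name after wordml://, e.g. '01000006.gif'.
    The same URI is written to w:name and src so that they match.
    """
    uri = f"wordml://{name}"
    # encodebytes wraps the base64 text onto multiple lines.
    data = base64.encodebytes(image_bytes).decode("ascii")
    style = f"width:{width_pt:g}pt;height:{height_pt:g}pt"
    return (
        "<w:pict>\n"
        f"<w:binData w:name={quoteattr(uri)}>{data}</w:binData>\n"
        f'<v:shape id={quoteattr(shape_id)} type="#_x0000_t75" '
        f"style={quoteattr(style)}>\n"
        f"<v:imagedata src={quoteattr(uri)} o:title={quoteattr(title)}/>\n"
        "</v:shape>\n"
        "</w:pict>"
    )

# Only the 6-byte GIF signature is used here to keep the example short.
fragment = build_pict(b"GIF89a", "01000006.gif", 12, 12, title="network")
print(fragment)
```

The base64 text for a real GIF starts with R0lGODlh, the same prefix as the sample data above, because that is the encoding of the GIF89a signature.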
PNG:
<w:pict>
<w:binData w:name="wordml://03000001.png">iVBORw0KGgoAAAANSUhEUgAAABAAAAAQBAMAAADt3eJSAAAAB3RJTUUH1QEDFRoyw+VrogAAAAlw
SFlzAAALEgAACxIB0t1+/AAAACdQTFRF/wD/AAAA//8AgIAAgICAwMDAAP8A////AICAAP//AACA
AAD/gAAA6ZmItQAAAAF0Uk5TAEDm2GYAAABoSURBVHjaNcvBDYAgDIXhugEvwUTikUVsUsIEbOAO
rOAgemEFrw5mC9rTn/R79IjIRnq51up6ANeIcH+x/tHa2X0qZXgG1M/O5jkcyVHaJS8Wk768aBCb
Lz3Ug0mitzkTItQLE8E83Atm2Rvlc68eJAAAAABJRU5ErkJggk==
</w:binData>
<v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:12pt;height:12pt">
<v:imagedata
src="wordml://03000001.png" o:title="network"/>
</v:shape>
</w:pict>
Same type, different extension in the wordml:// name. Also, the .net
base64 encoder ends this data with ...BJRU5ErkJggg== where Word wrote
...BJRU5ErkJggk== - don't know what kind of bug (if any) this is!
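A possible explanation for the differing tails, offered as my own reading rather than anything documented: when base64 data ends in ==, the last data character carries only 2 meaningful bits, and its low 4 bits are padding that a decoder discards. The characters g and k differ only in those discarded bits, so both tails decode to the same final byte. The sketch below checks this with Python's standard base64 module, which is lenient about nonzero padding bits.

```python
import base64

ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def last_byte(quantum):
    """Decode the single byte carried by a final 'xy==' quantum by hand."""
    first, second = ALPHABET.index(quantum[0]), ALPHABET.index(quantum[1])
    return ((first << 2) | (second >> 4)) & 0xFF

def unused_bits(quantum):
    """Low 4 bits of the second character, which the decoder ignores."""
    return ALPHABET.index(quantum[1]) & 0b1111

word_tail = "gk=="    # last four characters of the data Word wrote
dotnet_tail = "gg=="  # last four characters from the .net encoder

print(hex(last_byte(word_tail)), hex(last_byte(dotnet_tail)))
print(unused_bits(word_tail), unused_bits(dotnet_tail))
print(base64.b64decode(word_tail) == base64.b64decode(dotnet_tail))
```

Both forms yield the byte 0x82, so the image data is the same either way. The only difference is that Word leaves the unused padding bits nonzero, which a strict decoder might reject.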
JPG:
<w:pict>
<w:binData w:name="wordml://02000001.jpg">/9j/4AAQSkZJRgABAQEASABIAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEP
ERETFhwXExQaFRERGCEYGh0dHx8fExciJCIeJBweHx7/2wBDAQUFBQcGBw4ICA4eFBEUHh4eHh4e
Hh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh7/wAARCAAQABADASIA
AhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQA
AAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3
ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWm
p6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEA
AwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSEx
BhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElK
U1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3
uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwDOl8Se
I5Pinonh7whceF7nRRcadamxiXR3ldBBbrcReXIv2gOJBcbiW9MbQuTmWVgtj4cikufFHhrVJ7SC
BbprHXbe9lZ2eOEPtjdnIMkijcR/ECcVf8P+PvElv4w0Yy+P7m90lIY7ldPvJPItLNoZX2x3mId9
um1I3QsCSqlvmyoONq914c0/wTc6ZpXiTwlfzw29qkUVt4jnu7q5+zSRSRxIjWUSszGJU4YYDZAb
AU/R4DN6uXVIqmoreLaTfNyzabd9b3XRdtD6HKswx3D+NSptNTbUnq9FJrRWu9fuTXof/9k=
</w:binData>
<v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:12pt;height:12pt">
<v:imagedata
src="wordml://02000001.jpg" o:title="network"/>
</v:shape>
</w:pict>
SpreadsheetML
- For formatted text you create a node as: <ss:Data
ss:Type="String" xmlns="http://www.w3.org/TR/REC-html40">hi <U>there</U> everyone</ss:Data>.
This appears to only support <B>, <FONT html:Face='' x:Family=''
html:Size='' html:Color=''>, <I>, <Sub>, <Sup> (not <super>),
and <U>.
- SpreadsheetML has even more problems with em spaces, hidden hyphens,
etc. than WordML.