Parsing and modifying Microsoft Word docx XML in PHP

I’m implementing Microsoft Word templates in PHP. That is, taking an existing Word document, modifying some particulars on the fly, and spitting out the finished product. Think automatically generated legal forms, pre-populated with case names, docket numbers, assigned judges, that sort of thing.

The easiest way I’ve found to do that is to make liberal use of custom properties:

Those custom properties can then be inserted into the document using the DOCPROPERTY field:


 

Modifying these on the fly should, in theory, be easy. Microsoft’s “docx” file format, in use since Office 2007 (Windows) and 2008 (Mac), is a ZIP archive containing multiple XML files, which is easy enough to process with PHP’s ZipArchive class:

// Store the supplied ZIP file data in a temporary file:
$fp = tmpfile();
fwrite( $fp, $data );
$stream = stream_get_meta_data( $fp );
$filename = $stream['uri'];

// Open the original archive for reading:
$za = new ZipArchive();
if( $za->open( $filename ) !== true ) {
    throw new ErrorException( "Unable to open ZIP archive" );
}

// Create the new archive that receives the (possibly modified) entries.
// $database and $fields_data are assumed to be defined elsewhere in the application.
$nza = new ZipArchive();
$new_filename = __DIR__ . DIRECTORY_SEPARATOR . uniqid() . $database["extension"];
$nza->open( $new_filename, ZipArchive::CREATE );
for( $i = 0; $i < $za->numFiles; $i++ ) {
    $stat = $za->statIndex( $i );
    // print $stat['name'] . "\n"; // debugging aid
    if( array_key_exists( $stat['name'], $fields_data['container']['files'] ) ) {
        // Use Template to alter contents of ZIP entry and add modified contents to new ZIP archive
        // ...
    } else {
        // Straight copy file into new ZIP archive
        $nza->addFromString( $stat['name'], $za->getFromIndex( $i ) );
    }
}
$za->close();
$nza->close();

The custom properties are stored in the file docProps/custom.xml within the docx ZIP archive.
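
To make that concrete, here is a minimal sketch of pulling one entry out of the archive, changing it, and writing everything to a new archive, following the same copy-entry-by-entry approach as the loop above. The function name rewriteZipEntry() and its parameters are made up for illustration, and it assumes the template is a file on disk rather than data in a variable.

```php
<?php
// Hypothetical helper (not part of the original code): copies every entry of
// $source into a new archive at $target, passing the contents of $entryName
// through $transform on the way.
function rewriteZipEntry(string $source, string $target, string $entryName, callable $transform): void
{
    $in = new ZipArchive();
    if ($in->open($source) !== true) {
        throw new RuntimeException("Unable to open $source");
    }

    $out = new ZipArchive();
    if ($out->open($target, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        $in->close();
        throw new RuntimeException("Unable to create $target");
    }

    for ($i = 0; $i < $in->numFiles; $i++) {
        $name = $in->getNameIndex($i);
        $contents = $in->getFromIndex($i);
        if ($contents === false) {
            throw new RuntimeException("Unable to read entry $name");
        }
        if ($name === $entryName) {
            // Only the requested entry is modified; everything else is copied as-is.
            $contents = $transform($contents);
        }
        $out->addFromString($name, $contents);
    }

    $out->close();
    $in->close();
}
```

Called with 'docProps/custom.xml' as the entry name and a callback that returns the modified XML (for instance one that wraps the populateDocPropsCustomXML() function further down), this would produce a new docx with updated properties and every other part untouched.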

The PHP documentation for XML handling only contemplates pretty straightforward examples, though, and Microsoft’s XML is ... not that. Here’s the XML Microsoft uses to specify two custom properties (whitespace added):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
            xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Matter">
        <vt:lpwstr>30738</vt:lpwstr>
    </property>
    <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Document number">
        <vt:lpwstr>999999</vt:lpwstr>
    </property>
</Properties>

The PHP code to examine and modify those values isn’t as straightforward as I’d like, but here’s what I’ve come up with so far, with considerable help from redditor crazedizzled:

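$custom = simplexml_load_string($xml);
// Register the "vt" prefix so XPath can find the namespaced value elements:
$namespaces = $custom->getNamespaces(true);
$custom->registerXPathNamespace('vt', $namespaces['vt']);
$nodes = $custom->xpath('//vt:*');
$count = $custom->property->count();
// Assumes exactly one vt:* element per property, in document order:
for($i = 0; $i < $count; $i++) {
    print $custom->property[$i]->attributes()->name . ": " . $nodes[$i][0] . "\n";
    // Writing to index 0 changes the element's text content:
    $nodes[$i][0] .= ".foo";
}
print $custom->asXML() . "\n\n";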
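
The loop above relies on the vt:* nodes and the property elements lining up index for index. As a rough sketch of a slightly more defensive way to read the values, each property can be asked for its own children in the vt namespace via SimpleXMLElement::children(). The function name listCustomProperties() and the constant VT_NS are invented for this example, and the namespace URI is simply copied from the sample XML.

```php
<?php
// Namespace URI copied from the custom.xml sample; the constant name is made up.
const VT_NS = 'http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes';

// Hypothetical helper: returns [property name => value] for every custom property.
function listCustomProperties(string $xml): array
{
    $custom = simplexml_load_string($xml);
    if ($custom === false) {
        throw new RuntimeException('Unable to parse XML');
    }

    $result = [];
    foreach ($custom->property as $property) {
        // children() with a namespace URI yields only this property's vt:* children.
        foreach ($property->children(VT_NS) as $value) {
            $result[(string) $property['name']] = (string) $value;
        }
    }
    return $result;
}
```

Because the value is looked up from within its own property element, a property with no vt child (or more than one) no longer shifts the values of the properties that follow it.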

Here’s the complete code (more or less; this is a working scratchpad example):

$xml = <<<ENDXML
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes"><property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="Matter"><vt:lpwstr>30738</vt:lpwstr></property><property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="3" name="Document number"><vt:lpwstr>999999</vt:lpwstr></property></Properties>
ENDXML;

try {
    print "Processed XML for matter no. 1542: " .
        populateDocPropsCustomXML($xml, [
            'Matter' => 1542,
            'Document number' => 2020.14723
        ] ) . "\n\n";
} catch(Exception $exception) {
    print $exception;
}


/**
 * @param string $xml The XML data to parse and, if necessary, modify (from docProps/custom.xml)
 * @param array $keyValuePairs The substitution data, keyed with "name" attributes from property elements. E.g., 'MatterNo' => '1701'
 * @return string The XML data, as modified
 * @throws ErrorException If the XML cannot be parsed
 */
function populateDocPropsCustomXML( $xml, $keyValuePairs ) {
    // Collect parse errors instead of emitting warnings:
    libxml_use_internal_errors(true);
    $custom = simplexml_load_string($xml);
    if($custom === false) {
        $errormesg = "Unable to parse XML:\n";
        $errors = libxml_get_errors();
        foreach ($errors as $e) {
            $errormesg .= trim($e->message) . "\n"
                . "\tLine/Column: " . $e->line . "/" . $e->column . "\n";
        }
        libxml_clear_errors();

        throw new ErrorException( $errormesg );
    }

    $namespaces = $custom->getNamespaces(true);
    $custom->registerXPathNamespace('vt', $namespaces['vt']);
    $nodes = $custom->xpath('//vt:*');
    $count = $custom->property->count();

    // Assumes exactly one vt:* element per property, in document order:
    for($i = 0; $i < $count; $i++) {
        $key = (string)$custom->property[$i]->attributes()->name;
        if(array_key_exists( $key, $keyValuePairs )) {
            // Writing to index 0 replaces the element's text content:
            $nodes[$i][0] = (string)$keyValuePairs[$key];
        }
    }
    return $custom->asXML();
}
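
For comparison, here is a rough DOM-based sketch of the same kind of substitution that finds a property by its name attribute instead of by position. The function name setCustomProperty() and the cp prefix are invented for this example; the namespace URIs are copied from the sample XML, and the calls used (DOMDocument::loadXML(), DOMXPath::registerNamespace(), DOMXPath::query()) come from PHP’s standard DOM extension. Treat it as a sketch under those assumptions, not a drop-in replacement.

```php
<?php
// Hypothetical helper: sets the value of one named custom property and
// returns the modified XML as a string.
function setCustomProperty(string $xml, string $name, string $value): string
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new RuntimeException('Unable to parse XML');
    }

    $xpath = new DOMXPath($doc);
    // Both prefixes are local to these queries; the URIs come from custom.xml.
    $xpath->registerNamespace('cp', 'http://schemas.openxmlformats.org/officeDocument/2006/custom-properties');
    $xpath->registerNamespace('vt', 'http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes');

    foreach ($xpath->query('/cp:Properties/cp:property') as $property) {
        // Compare the attribute in PHP to avoid quoting problems inside XPath.
        if ($property->getAttribute('name') !== $name) {
            continue;
        }
        foreach ($xpath->query('vt:*', $property) as $node) {
            // Swap the old content for a text node, which is escaped on output.
            while ($node->firstChild) {
                $node->removeChild($node->firstChild);
            }
            $node->appendChild($doc->createTextNode($value));
        }
    }
    return $doc->saveXML();
}
```

Calling setCustomProperty($xml, 'Matter', '1542') on the sample above should change only the Matter property and leave Document number as it was.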
