Copying content from MS Office to Tridion

Almost everybody who works with Tridion does it: copying text from Word or Excel into a rich text field. Unfortunately, when you do this you end up copying much more than just the text. Styling information also finds its way into the rich text field, in the form of style attributes, class attributes, comments, unnecessary spans, and even complete sets of CSS classes. And when a component with such a rich text field is published, all this styling ends up on your site, where it is very likely to clash with your carefully designed pages.

Of course, you can work around this issue by first pasting the text into a plain text editor like Notepad, copying it again and then pasting it into Tridion. But there is a more elegant solution, and it requires just a tiny bit of work.

To test, I used the following MS Word document:

When you paste this into a regular rich text field and you click on the source tab, this is what you get:

As you can see, the text is in there somewhere, but it is buried in a heap of HTML. In fact, the HTML that Word generates from this tiny document is 36 KB and 653 lines long! What’s worse, your HTML now contains styling that will show up on your site.

Fortunately, Tridion has a nice feature called ‘Filter XSLT’ that allows you to clean up or even completely change the HTML in the rich text area. You can specify an XSLT stylesheet to transform the HTML before it is saved.
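To give an idea of the shape of such a stylesheet, here is a minimal sketch of an identity transform in XSLT 1.0. It copies every node and attribute unchanged, which is the usual starting point for a filter: more specific rules added later override it for the nodes you want to drop. This is an illustration only, not a copy of Tridion's default filter, which may contain additional rules in your version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Identity rule: copy each attribute and node as-is, then
       recurse into its attributes and children. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- More specific templates go here, before the closing tag.
       An empty template removes whatever it matches. -->
</xsl:stylesheet>
```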

I’m not going to explain in detail how this type of stylesheet works. You can read about it in the documentation.

The important thing here is that you can use it to remove undesired elements (or other nodes) from the rich text. In particular, I want to remove the following:

  • All comments
  • The elements style and colgroup
  • The attributes style, cellpadding, cellspacing, lang and xml:lang
  • The attributes border, width and height when they occur on a table or td
  • The attribute class when it is recognizable as one of the typical MS Office classes

The last one is tricky: when I tested with Office 365, all the classes I saw started with ‘Mso’ (Word) or ‘XL’ (Excel). Perhaps Office can generate other classes as well. In that case you may have to fine-tune the XSLT.

To filter out these nodes, I added the following lines to the XSLT (right before the end, so just before </xsl:stylesheet>):
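As a minimal sketch (not necessarily the exact rules from the downloadable stylesheet), the removals listed above could be written as XSLT 1.0 templates like these. The sketch assumes that the elements in the rich text are matched without a namespace prefix and that the xsl prefix is already declared on the enclosing xsl:stylesheet element. The class check is made case-insensitive with translate(), an assumption added here so that both upper- and lower-case prefixes are caught.

```xml
<!-- All comments. The explicit priority makes this rule win over a
     generic node() copy rule regardless of where it is placed. -->
<xsl:template match="comment()" priority="1"/>

<!-- The elements style and colgroup, including their content. -->
<xsl:template match="style | colgroup"/>

<!-- Presentational and language attributes, on any element. -->
<xsl:template match="@style | @cellpadding | @cellspacing | @lang | @xml:lang"/>

<!-- border, width and height, but only on table and td. -->
<xsl:template match="table/@border | table/@width | table/@height
                     | td/@border | td/@width | td/@height"/>

<!-- class attributes whose value starts with 'Mso' or 'XL',
     compared case-insensitively. -->
<xsl:template match="@class[starts-with(translate(., 'MSOXL', 'msoxl'), 'mso')
                            or starts-with(translate(., 'MSOXL', 'msoxl'), 'xl')]"/>
```

Each template is empty, so the matched node is simply not copied to the output. Note that the class rule only looks at the start of the attribute value: a value such as "intro MsoNormal" would be kept, and a non-Office class that happens to start with "xl" would be removed.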

That’s all. Now, the same text from MS Word, when pasted into a rich text field, gives the following result:

As you can see, the HTML is a lot cleaner and smaller (only 349 characters). A purist would probably say there are too many span elements now. You can easily get rid of them (although you have to be aware that this will remove spans anywhere in the rich text, not just when you are copying from MS Office). This will do the trick:
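A rule along these lines would unwrap the spans. It is a sketch under the same assumptions as before: span is matched without a namespace prefix, and the template sits inside the existing xsl:stylesheet element where the xsl prefix is declared.

```xml
<!-- Do not copy the span element itself (or its attributes);
     only process its child nodes, so the text and any nested
     markup end up directly in the parent element. -->
<xsl:template match="span">
  <xsl:apply-templates select="node()"/>
</xsl:template>
```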

This rule filters out all span elements, but keeps the content inside them.

The rich text field is now totally clean and only 286 characters long:

That’s all. Download the stylesheet and enjoy your regained freedom to copy and paste.