
While working on my customisations to Tim Geyssens' MailEngine, I was looking for an accurate way to automatically create a plain-text version of the HTML emails sent out by the site. Further reading brought Markdown to my attention, and after some hunting around with a little help from my friend Google I found a Markdown XSLT file. Using the XSLT, I could transform my HTML email to plain text with relative ease and accuracy. To do this I need a well-formed XML document, and as my pages were already valid XHTML I had no problems there.
Here is my method for doing the conversion. All it requires is that you pass it the HTML you want to convert, which must be well-formed XML:
using System.IO;
using System.Xml;
using System.Xml.Xsl;

/// <summary>
/// Converts HTML to plain text.
/// </summary>
/// <param name="html">The HTML to convert; must be well-formed XML.</param>
/// <returns>The plain-text representation of the HTML.</returns>
private static string ConvertToText(string html)
{
    // Load the source markup; LoadXml throws an XmlException if it is not well-formed
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.LoadXml(html);

    // Load the Markdown stylesheet from the site's /xslt folder
    XmlDocument xsl = new XmlDocument();
    xsl.Load(System.Web.HttpContext.Current.Server.MapPath("/xslt/Markdown.xslt"));

    // Compile the stylesheet (XslCompiledTransform replaces the obsolete XslTransform)
    XslCompiledTransform xslt = new XslCompiledTransform();
    xslt.Load(xsl);

    // Run the transform and capture the output as a string;
    // the using block closes the writer even if the transform throws
    using (StringWriter writer = new StringWriter())
    {
        xslt.Transform(xmlDoc, null, writer);
        return writer.ToString();
    }
}
Download the XSLT file I used from here:
http://symphony-cms.com/downloads/xslt/file/20573/
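To show where the converted text might end up, here is a minimal sketch of building a message that carries both versions of the body. The class and method names (`MultipartMailSketch`, `Build`) are hypothetical and not part of MailEngine; the sketch assumes `plainText` is the string returned by `ConvertToText` for the same `html`.

```csharp
using System.Net.Mail;
using System.Net.Mime;
using System.Text;

public static class MultipartMailSketch
{
    // Builds a message with a plain-text and an HTML alternative.
    // plainText is assumed to be the output of ConvertToText(html).
    public static MailMessage Build(string from, string to, string subject,
                                    string html, string plainText)
    {
        MailMessage message = new MailMessage(from, to);
        message.Subject = subject;

        // Add the plain-text view first and the HTML view last, so the
        // richer alternative comes at the end of the multipart body
        AlternateView textView = AlternateView.CreateAlternateViewFromString(
            plainText, Encoding.UTF8, MediaTypeNames.Text.Plain);
        AlternateView htmlView = AlternateView.CreateAlternateViewFromString(
            html, Encoding.UTF8, MediaTypeNames.Text.Html);

        message.AlternateViews.Add(textView);
        message.AlternateViews.Add(htmlView);
        return message;
    }
}
```

The returned message can then be handed to whatever sends the mail, for example an `SmtpClient`.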
I would love to hear from anyone who does this differently, or who can find any problems with the method I have chosen for this solution.

14 Responses
Hmmm, interesting way of doing it.
The way I usually strip HTML from some content in an XSLT macro is by doing the following:
1. Replace all BR’s with some dummy text.
2. Pass it through umbraco.library:StripHTML.
3. Convert the dummy text back into BR’s. Or to \n’s if I actually need a text only version.
It’s definitely not perfect. Especially since it doesn’t leave any whitespace around P’s and such. So I may have to take a closer look at your method.
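The three steps above can be sketched roughly as follows in C#. This is only an illustration: the regular expression that removes tags is a stand-in for umbraco.library's StripHTML and is an assumption, not that library's implementation, and the names `StripHtmlSketch`, `ToText` and the `[[BR]]` placeholder are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class StripHtmlSketch
{
    // Dummy text that marks where line breaks were; it contains no angle
    // brackets, so the tag-stripping step leaves it alone
    private const string Placeholder = "[[BR]]";

    public static string ToText(string html)
    {
        // 1. Replace <br>, <br/> and <br /> with the placeholder
        string marked = Regex.Replace(html, @"<br\s*/?>", Placeholder,
                                      RegexOptions.IgnoreCase);

        // 2. Remove all remaining tags (stand-in for StripHTML)
        string stripped = Regex.Replace(marked, @"<[^>]*>", string.Empty);

        // 3. Turn the placeholder into newlines for a text-only version
        return stripped.Replace(Placeholder, "\n");
    }
}
```

As noted, nothing here adds whitespace around paragraphs, so adjacent P elements run together in the output.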
I think the Markdown approach should provide a more accurate translation of formatting. However, I didn’t know about the StripHTML method in the Umbraco library, so I might have a quick look at that out of interest.
I implemented the C# code as recommended. It seems to work very well with one exception: When converting from XML-compliant HTML (XHTML 1.0 Strict), I get this as my first line of output in my resulting Text document:
How can I get rid of this in the post-transformed text string?
Well, I know I posted the same day but I have no idea if I will ever receive a reply.
I want to thank you for providing this truly elegant solution to the HTML-to-text problem. I was able to get it to do what I needed.
To remove the XML header I simply used .Substring to return everything in the string less the header… It probably is not the best way to solve the problem but it was okay for now. If you have a better solution, please let me know. I will check back here for any replies.
Many thanks again.
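A Substring-based workaround like this can be made a little safer by checking that the header is actually present rather than cutting a fixed number of characters. The following is a minimal sketch under that assumption; the names `XmlHeaderSketch` and `StripXmlDeclaration` are hypothetical.

```csharp
using System;

public static class XmlHeaderSketch
{
    // Returns text without a leading XML declaration, if there is one.
    public static string StripXmlDeclaration(string text)
    {
        // Only act when the output really starts with a declaration
        if (text.StartsWith("<?xml", StringComparison.Ordinal))
        {
            int end = text.IndexOf("?>", StringComparison.Ordinal);
            if (end >= 0)
            {
                // Skip past "?>" and any whitespace that follows it
                return text.Substring(end + 2).TrimStart();
            }
        }
        return text;
    }
}
```

It would be applied to the string returned by the conversion method, and it leaves the text untouched when no header is found.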
Sorry for the delayed reply, Jim. I would suggest adding the following line to your XSLT, beneath the xsl:stylesheet declaration:
Please let me know how you get on.
Thank you! I will try that–it seems like a far better solution than the one I had implemented (which simply omitted the initial characters from the final string).
I will let you know if I have any further issues.
Thanks again,
Jim
That worked well Simon. Thank you!
The only other thing I noticed is that if the HTML document has a value, then this value gets exported as the first line of the text file. I can see reasons why this might be desirable but (in my opinion) this should not appear in the text version of a web page since the Title appears in the browser title bar and not on the page rendering area.
If you agree, how would you modify the XSL to omit the tag value? I am essentially XSL illiterate.
Thank you again,
Jim
How do you mean if the HTML Document has a value? Can you provide an example?
OOPS! No wonder you didn’t understand. I stupidly used the “less than sign” and “greater than sign” in my post around a tag name, and it got lost in the HTML filter.
I was asking about the page having a (Title) tag. I have re-posted my question below and have used parentheses instead of tag brackets…
Thank you!
Jim
———————–
The only other thing I noticed is that if the HTML document has a (Title) value, then this (Title) value gets exported as the first line of the text file. I can see reasons why this might be desirable but (in my opinion) this should not appear in the text version of a web page since the Title value is designed to appear in the browser title bar and not on the page rendering area.
If you agree, how would you modify the XSL to omit the (Title) tag value? I am essentially XSL illiterate.
Thank you again,
Jim
Try altering the XSLT file to tell it you only want to markdown the body of your HTML content. In my copy around about line 101 you will find the following:
If you change it as follows:
I am also no expert with XSLT, but I am learning slowly.
Thanks again. I will try this (the suggestion makes sense–restricting to the body tag). As before I will let you know if I encounter any problems.
Regards,
Jim
Hi Simon,
I tried your suggestion (replacing “*” with “body” where indicated) but had no luck (the Title tag value is still appearing in the resulting text string). Although I was very careful to make exactly the change you suggested (down to the case, of course, of the word “body”), it is entirely possible that the problem is somehow with my implementation…
If you can think of any other ideas, I would be happy to try them. Meanwhile, I am truncating the start of the string as needed (works, but not as elegant).
Regards,
Jim
Jim, sorry for the delayed reply. Unfortunately I have also been unsuccessful in my attempts to achieve what you are after. I will continue to try when I get a chance and will post back on my success (or failure).
No problem! I will check back every so often to see if you were able to find a solution. For now, I’m just keeping the Title text in my output.
I still think your solution is superior so my request is just a refinement to an already superb approach.
Regards,
Jim
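One option that was not tried in this thread is to drop the head element, and the Title with it, from the XML document in C# before it reaches the stylesheet, so the XSLT never sees it. The sketch below is an assumption about how that could look, not a tested fix for this particular stylesheet; `HeadRemovalSketch` and `RemoveHead` are hypothetical names.

```csharp
using System.Xml;

public static class HeadRemovalSketch
{
    // Removes any head element that is a direct child of the root element.
    public static void RemoveHead(XmlDocument xmlDoc)
    {
        // local-name() matches regardless of namespace, so XHTML documents
        // that declare the XHTML namespace need no XmlNamespaceManager
        const string xpath = "/*/*[local-name()='head']";

        // Re-query after each removal instead of editing a live node list
        XmlNode head;
        while ((head = xmlDoc.SelectSingleNode(xpath)) != null)
        {
            head.ParentNode.RemoveChild(head);
        }
    }
}
```

In the conversion method it would be called between `xmlDoc.LoadXml(...)` and the call to `Transform`.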