+1 vote

Hi!

I have the following mail: EML here
When I call the method "GetBodyAsText", it returns some of the HTML as text (as expected), but completely ignores the <table> tag (which contains all the data I need to extract). Mail.dll has been flawless so far, and I really like it. I hope we can find a solution to this.

Thanks in advance.

by (460 points)
What is the first word of the ignored text?
It's really hard to tell. I've spent at least an hour comparing the HTML with the text, and I really can't understand what is happening. The HTML starts with a <table> tag, which is ignored. The parsed text seems to be a fragment of the HTML (the bottom part), but I can't point you to a specific word, since it's really messed up. If you could load the EML I provided, it should be self-explanatory.
Thank you!
What do you mean by "ignored"? HTML-to-text conversion extracts text from a table tag with no problem.
Here is the HTML of the email (GetBodyAsHtml): https://drive.google.com/open?id=0BxZwJk5vPnf0SDdXNlE1cmxiV00
And here, the text I got from the method "GetBodyAsText": https://drive.google.com/open?id=0BxZwJk5vPnf0QTNrWnpFMkRDZ2M

As you can see, there is a table in the HTML, which gets ignored during parsing. The text doesn't contain the table data.

1 Answer

+1 vote
 
Best answer

Mail.dll works properly. The first words inside the table tag are: "Proposta do site".

[Test]
public void Test()
{
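    // Load the message from the EML file attached to the question.
    IMail mail = new MailBuilder().CreateFromEmlFile("c:\\email_carsp.eml");
    // The HTML part of the email contains the table.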
    StringAssert.Contains(@"<table width=""500"" border=""0"" cellpadding=""5"" 
cellspacing=""2"" style=""font-family:Verdana,Arial,Helvetica,sans-serif;font-size:12px"">
  <tbody><tr>
    <td colspan=""3"" bgcolor=""#FFFF00"" style=""font-size:14px""><strong><span class=""il"">Proposta</span> do site", mail.Html);

    // Text extracted from the HTML part includes the table content.
    StringAssert.Contains(@"Proposta do site", mail.GetTextFromHtml());
}

The problem with this email is that its text/plain and text/html parts have different content.

As you probably know, an email can have its body represented in several formats simultaneously. The receiver may choose which one to display or process. The idea is that if, for example, the receiver doesn't understand HTML, it can simply fall back to a simpler format such as plain text.

HTML emails usually have their content represented in both text/plain and text/html. Nothing forces the sender to use equivalent content for both formats.
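
To illustrate the point, here is a minimal sketch of such a message. It uses Python's standard email package rather than Mail.dll, purely to show the MIME structure. The subject, the body strings, and the helper names build_message and parts_by_type are made up for the example.

```python
from email.message import EmailMessage


def build_message() -> EmailMessage:
    """Build a multipart/alternative message whose two bodies differ."""
    msg = EmailMessage()
    msg["Subject"] = "Example"
    # text/plain part: the sender chose to put only a short note here.
    msg.set_content("Short plain-text summary.")
    # text/html part: the actual data lives only in this table.
    msg.add_alternative(
        "<table><tr><td>Proposta do site</td></tr></table>",
        subtype="html",
    )
    return msg


def parts_by_type(msg: EmailMessage) -> dict:
    """Map each leaf part's content type to its decoded body."""
    return {
        part.get_content_type(): part.get_content().strip()
        for part in msg.walk()
        if not part.is_multipart()
    }


parts = parts_by_type(build_message())
print(parts["text/plain"])  # the note, without any table data
print(parts["text/html"])   # the table markup
```

Both parts belong to the same email, yet a reader that only looks at text/plain never sees the table data.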

Mail.dll exposes the plain-text version as IMail.Text and the HTML version as IMail.Html. There are also several helper methods that convert from one format to another:

  • IMail.Text - gets plain text version of this email message.

  • IMail.Html - gets HTML version of this email message.

  • IMail.GetTextFromHtml - extracts plain text from IMail.Html. This text may be different from what is actually stored in the email's IMail.Text property.

  • IMail.GetBodyAsText - returns body in plain text format. Uses IMail.Text, IMail.GetTextFromHtml or IMail.GetTextFromRtf.

  • IMail.GetBodyAsHtml - returns body in HTML format. If IMail.IsHtml is false, this method uses the IMail.Text property to create valid HTML.
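
The difference between the stored plain-text part and text derived from the HTML part can be sketched roughly as follows. This is again Python's standard library, not Mail.dll. TextExtractor and text_from_html are hypothetical helpers, the sample strings are made up, and the whitespace handling is deliberately naive, so it says nothing about how Mail.dll performs the conversion internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_from_html(html: str) -> str:
    """Return the visible text of an HTML fragment, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Assumed sample data: the stored plain-text part and the HTML part differ.
stored_plain = "Short plain-text summary."
html_body = (
    "<table><tbody><tr><td><strong><span>Proposta</span> do site"
    "</strong></td></tr></tbody></table>"
)

derived = text_from_html(html_body)
print(derived)                  # Proposta do site
print(derived == stored_plain)  # False
```

When the table data exists only in the HTML part, text derived from the HTML is the only plain-text view that contains it, which is why calling GetTextFromHtml explicitly helps in this case.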

by (297k points)
Thanks a ton! Method "GetTextFromHtml()" totally worked!