Support OneNote html for bold and italic #60

Open
idvorkin opened this issue Sep 16, 2017 · 10 comments
Labels
feature (A new business feature)
request (A request for a new feature or enhancement)
scheme:onenote (This issue relates to the OneNote scheme)

Comments


idvorkin commented Sep 16, 2017

OneNote encodes its HTML pages in a way that's close to what Html2Markdown supports, but OneNote HTML expresses bold and italics through inline styles, as follows:

| Property | Example | Notes |
|---|---|---|
| font-style | style="font-style:italic" | normal or italic only |
| font-weight | style="font-weight:bold" | normal or bold only |
| strike-through | style="text-decoration:line-through" | |
| text-align | style="text-align:center" | for block elements only |
| text-decoration | style="text-decoration:underline" | none or underline only |
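
To make the mapping concrete, here is a minimal sketch of how those style declarations could translate into Markdown markers. It uses only the .NET base class library; the class name `OneNoteStyles`, the method `Decorate`, and the choice of `_`, `**` and `~~` as markers are assumptions for illustration, not part of Html2Markdown. Underline and text-align have no plain Markdown equivalent, so the sketch ignores them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OneNoteStyles
{
    // Assumed mapping from a OneNote style declaration to a Markdown marker.
    private static readonly (string Declaration, string Marker)[] Markers =
    {
        ("font-style:italic", "_"),
        ("font-weight:bold", "**"),
        ("text-decoration:line-through", "~~"),
    };

    // Hypothetical helper: wraps text in the markers implied by a style attribute value.
    public static string Decorate(string text, string styleAttribute)
    {
        // Split on ';' and strip whitespace so "font-weight: bold" matches "font-weight:bold".
        var declarations = new HashSet<string>(
            styleAttribute
                .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(d => string.Concat(d.Where(c => !char.IsWhiteSpace(c))).ToLowerInvariant()));

        foreach (var (declaration, marker) in Markers)
        {
            // Each matching declaration adds one opening and one closing marker.
            if (declarations.Contains(declaration))
                text = marker + text + marker;
        }

        return text;
    }
}
```

For example, `OneNoteStyles.Decorate("hi", "font-weight:bold")` wraps the text in `**`, while a style attribute that only sets `text-align:center` leaves the text unchanged.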

I'm willing to make the changes if you tell me how you'd like this fixed.

Owner

baynezy commented Sep 17, 2017

@idvorkin - let me complete #61 first. This will make it more straightforward to implement.

Owner

baynezy commented Sep 17, 2017

@idvorkin - #61 is complete. If you want to support OneNote HTML, you will need to create a new IScheme implementation; you can extend Markdown. Let me know if that doesn't make sense, or if you need help.

baynezy added the request label Sep 17, 2017
Author

idvorkin commented Sep 17, 2017

I was thinking the OneNote HTML representation of font properties would apply to other tools generating HTML, so we should have it be something the default converter understands.

Based on that, I think we'd implement it by creating a new CustomReplacer.CustomAction, which I'd include in the Markdown._replacers list. Am I on the right track?

@idvorkin
Author

I was thinking of only implementing these font decorations when they appear in span elements (where I normally observe them). The spec says these styles can also appear in other elements, where it gets trickier to implement.

Thinking out loud, if we want to implement this for non-span elements, we can use a two-pass approach:

  1. Add a span element around the original element content.
  2. Run span replacement.

For example, imagine the following input:

  <h1 style="font-weight:bold">BLAH</h1>

Step 1: Spanify

  <h1><span style="font-weight:bold">BLAH</span></h1>

Step 2: Run span transformer.
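
The spanify pass described above could be sketched roughly as follows. This is an illustration only, assuming the HtmlAgilityPack package is referenced; the class name `Spanifier` and the method `Spanify` are hypothetical and not part of Html2Markdown. It moves the style attribute of every non-span element onto a new span wrapped around that element's content, so a later span-only pass can handle it.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class Spanifier
{
    // Hypothetical helper: wraps the content of each styled non-span element in a styled <span>.
    public static string Spanify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var styled = doc.DocumentNode.SelectNodes("//*[@style and not(self::span)]");
        if (styled == null) return html;

        // Work from the last match backwards so nested elements are wrapped before their ancestors.
        foreach (var element in Enumerable.Reverse(styled).ToList())
        {
            var span = doc.CreateElement("span");
            span.SetAttributeValue("style", element.GetAttributeValue("style", ""));
            span.InnerHtml = element.InnerHtml;

            // Replace the element's content with the new span and drop the consumed style.
            element.Attributes.Remove("style");
            element.RemoveAllChildren();
            element.AppendChild(span);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `Spanifier.Spanify` on the h1 input from the example is intended to produce the Step 1 shape, after which the span transformer can run unchanged.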

Owner

baynezy commented Sep 19, 2017

@idvorkin - Please don't modify Markdown; that class exists to support the vanilla Markdown spec. To support OneNote, create a OneNote implementation of IScheme that extends Markdown, as outlined. The parsing functions can live either in your new class or in HtmlParser.

@idvorkin
Author

As they say, weeks of coding can save hours of design :)
Happy to sync in chat/voice/video if that's fastest.

I'd love to better understand your design choice. How do you decide when an HTML representation should be part of the core converter vs a different scheme? The <strong> element you mention is an excellent example. I'd expect it to map to bold in markdown.

baynezy added the feature label Sep 23, 2017
Owner

baynezy commented Sep 23, 2017

Author

idvorkin commented Sep 24, 2017

FYI, for the transform approach I'm thinking something like this:

using System.Collections.Generic;
using HtmlAgilityPack;

// Map each OneNote inline style to an HTML element the converter already understands.
var styleToElementName = new Dictionary<string, string>()
{
	{"font-weight:bold","b"},
	{"font-style:italic","i"},
};

var onenoteHTML = @"<td style=''><span style='font-weight:bold'>Expected Bold </span></td>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(onenoteHTML);

foreach (var s2e in styleToElementName)
{
	// SelectNodes returns null, not an empty collection, when nothing matches.
	var styledElements = doc.DocumentNode.SelectNodes($"//span[@style='{s2e.Key}']");
	if (styledElements == null) continue;

	foreach (var element in styledElements)
	{
		// Rename the span and drop the style attribute it was matched on.
		element.Name = s2e.Value;
		element.Attributes.Remove("style");
	}
}

baynezy added the scheme:onenote label Sep 28, 2017
@aloneguid

Here are OneNote fixes that work for me. I assume that table cells don't contain line breaks; otherwise this needs extra processing (replacing them with a br tag):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using Html2Markdown.Replacement;
using Html2Markdown.Scheme;
using HtmlAgilityPack;

namespace OneSyncTool.Core
{
   class Html2MarkdownScheme : IScheme
   {
      private readonly Markdown _builtIn = new Markdown();
      private readonly List<IReplacer> _replacers;

      public Html2MarkdownScheme()
      {
         _replacers = new List<IReplacer>(_builtIn.Replacers());

         //OneNote block decoration
         _replacers.Add(new PatternReplacer("<div\\s+style\\s*=\\s*\"position:absolute(.+?)>", ""));
         _replacers.Add(new PatternReplacer("</div>", ""));

         //everything else
         _replacers.Add(new OneNoteHapReplacer());
      }

      public IList<IReplacer> Replacers() => _replacers;

      internal class OneNoteHapReplacer : IReplacer
      {
         public string Replace(string html)
         {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            ProcessFontStyles(doc);
            ProcessTables(doc);

            return doc.DocumentNode.OuterHtml;
         }

         private void ProcessFontStyles(HtmlDocument doc)
         {
            //SelectNodes returns null (not an empty collection) when nothing matches
            HtmlNodeCollection fontStyles = doc.DocumentNode.SelectNodes("//span[@style]");
            if (fontStyles == null) return;

            foreach (HtmlNode node in fontStyles)
            {
               string style = node.GetAttributeValue("style", null);
               if (style == null) continue;

               string[] styles = style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim()).ToArray();
               var decorations = new List<string>();
               if (styles.Contains("font-style:italic")) decorations.Add("_");
               if (styles.Contains("font-weight:bold")) decorations.Add("**");
               if (styles.Contains("text-decoration:line-through")) decorations.Add("~~");
               // there's no underline in markdown? ignore it for now

               string replacement = Decorate(node.InnerHtml, decorations);

               //swap the span for its decorated text
               node.ParentNode.ReplaceChild(doc.CreateTextNode(replacement), node);
            }
         }

         private void ProcessTables(HtmlDocument doc)
         {
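            //match tables at any depth; SelectNodes returns null when there are none
            HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
            if (tables == null) return;

            foreach(HtmlNode table in tables)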
            {
               var s = new StringBuilder();
               bool isHeader = true;

               //there are text nodes in children, they are just line breaks and safe to ignore
               foreach(HtmlNode row in table.ChildNodes.Where(n => n.Name == "tr"))
               {
                  int cellCount = 0;
                  s.Append("|");
                  foreach(HtmlNode cell in row.ChildNodes.Where(n => n.Name == "td"))
                  {
                     s.Append(cell.InnerText.Trim());
                     s.Append("|");
                     cellCount++;
                  }
                  s.AppendLine();

                  if(isHeader)
                  {
                     s.Append("|");
                     for(int i = 0; i < cellCount; i++)
                     {
                        s.Append("-|");
                     }
                     s.AppendLine();
                     isHeader = false;
                  }
               }

               table.ParentNode.ReplaceChild(doc.CreateTextNode(s.ToString()), table);
            }
         }

         private string Decorate(string text, IReadOnlyCollection<string> decorations)
         {
            foreach(string dec in decorations)
            {
               //wrap symmetrically: the same marker opens and closes the text
               text = dec + text + dec;
            }

            return text + Environment.NewLine; //append new line because it's in a span
         }
      }

      internal class PatternReplacer : IReplacer
      {
         public PatternReplacer(string pattern, string replacement)
         {
            Pattern = pattern;
            Replacement = replacement;
         }

         public string Pattern { get; }

         public string Replacement { get; }

         public string Replace(string html)
         {
            return new Regex(Pattern).Replace(html, Replacement);
         }
      }
   }
}

@aloneguid

Just to demo it, here is the original OneNote page:

image

exported to markdown:

image


3 participants