Support OneNote html for bold and italic #60
Comments
I was thinking the OneNote HTML representation of font properties would apply to other tools generating HTML, so we should have it be something the default converter understands. Based on that, I think we'd implement it by creating a new |
I was thinking of implementing these font decorations only when they appear in span elements (where I normally observe them). The spec says these styles can also appear in other elements, where implementation gets trickier. Thinking out loud: if we want to support non-span elements, we can use a two-pass approach:
For example, imagine the following input:
Step 1: Spanify
Step 2: Run span transformer. |
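The first pass of the two-pass idea above could be sketched as follows. This is only an illustration under stated assumptions: `StyleSpanifier` and `Spanify` are hypothetical names that do not exist in Html2Markdown, and "spanify" is interpreted here as moving an element's `style` attribute onto a new `span` that wraps the element's contents, so that a span-only transformer can handle it in the second pass. It uses HtmlAgilityPack, like the other snippets in this thread.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of Html2Markdown.
public static class StyleSpanifier
{
    // Pass 1: move the style of every styled non-span element onto a wrapping span.
    public static string Spanify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect first, then rewrite. Reverse document order means nested
        // elements are rewritten before their ancestors.
        var styled = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element
                        && n.Name != "span"
                        && n.HasChildNodes // skip void elements such as <br> or <img>
                        && !string.IsNullOrWhiteSpace(n.GetAttributeValue("style", "")))
            .Reverse()
            .ToList();

        foreach (var node in styled)
        {
            var span = doc.CreateElement("span");
            span.SetAttributeValue("style", node.GetAttributeValue("style", ""));
            span.InnerHtml = node.InnerHtml; // copy the element's contents into the span

            // The element keeps its tag but now contains only the styled span.
            node.Attributes.Remove("style");
            node.RemoveAllChildren();
            node.AppendChild(span);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The result can then be handed to whatever span-only transformer implements the second pass. Note that this sketch moves the whole `style` value, including non-font properties; a real implementation might move only the font-related declarations.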
@idvorkin - Please don't modify |
As they say weeks of coding can save hours of design :) I'd love to better understand your design choice. How do you decide when an HTML representation should be part of the core converter vs a different scheme? The <strong> element you mention is an excellent example. I'd expect it to map to bold in markdown. |
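FYI, for the transform approach I'm thinking something like this:
// Requires: using System.Collections.Generic; using System.Linq;
// Maps an inline style (exact match) to the element name that replaces the span.
var styleToElementName = new Dictionary<string, string>()
{
    {"font-weight:bold", "b"},
    {"font-style:italic", "i"},
};
var onenoteHTML = @"<td style=''><span style='font-weight:bold'>Expected Bold </span></td>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(onenoteHTML);
foreach (var s2e in styleToElementName)
{
    // By default SelectNodes returns null (not an empty collection) when nothing matches.
    var styledElements = doc.DocumentNode.SelectNodes($"//span[@style='{s2e.Key}']");
    if (styledElements == null) continue;
    foreach (var element in styledElements)
    {
        // Turn <span style='...'> into <b> or <i> and drop the now-redundant style.
        element.Name = s2e.Value;
        element.Attributes.Where(a => a.Name == "style" && a.Value == s2e.Key).ToList()
            .ForEach(a => element.Attributes.Remove(a));
    }
} |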
Here are the OneNote fixes that work for me. I assume that table cells don't contain line breaks; otherwise they need extra processing (replacing them with a br tag):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using Html2Markdown.Replacement;
using Html2Markdown.Scheme;
using HtmlAgilityPack;

namespace OneSyncTool.Core
{
    class Html2MarkdownScheme : IScheme
    {
        private readonly Markdown _builtIn = new Markdown();
        private readonly List<IReplacer> _replacers;

        public Html2MarkdownScheme()
        {
            _replacers = new List<IReplacer>(_builtIn.Replacers());
            //OneNote block decoration
            _replacers.Add(new PatternReplacer("<div\\s+style\\s*=\\s*\"position:absolute(.+?)>", ""));
            _replacers.Add(new PatternReplacer("</div>", ""));
            //everything else
            _replacers.Add(new OneNoteHapReplacer());
        }

        public IList<IReplacer> Replacers() => _replacers;
        internal class OneNoteHapReplacer : IReplacer
        {
            public string Replace(string html)
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                ProcessFontStyles(doc);
                ProcessTables(doc);
                return doc.DocumentNode.OuterHtml;
            }

            private void ProcessFontStyles(HtmlDocument doc)
            {
                // by default SelectNodes returns null when nothing matches
                HtmlNodeCollection fontStyles = doc.DocumentNode.SelectNodes("//span[@style]");
                if (fontStyles == null) return;
                foreach (HtmlNode node in fontStyles)
                {
                    string style = node.GetAttributeValue("style", null);
                    if (style == null) continue;
                    string[] styles = style.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim()).ToArray();
                    var decorations = new List<string>();
                    if (styles.Contains("font-style:italic")) decorations.Add("_");
                    if (styles.Contains("font-weight:bold")) decorations.Add("**");
                    // the CSS property for strikethrough is text-decoration, not font-decoration
                    if (styles.Contains("text-decoration:line-through")) decorations.Add("~~");
                    // there's no underline in markdown? ignore it for now
                    string replacement = Decorate(node.InnerHtml, decorations);
                    // insert the decorated text, not the raw inner HTML
                    node.ParentNode.ReplaceChild(doc.CreateTextNode(replacement), node);
                }
            }

            private void ProcessTables(HtmlDocument doc)
            {
                // "//table" matches tables at any depth; a bare "table" only matches direct children of the root
                HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
                if (tables == null) return; // no tables in this document
                foreach (HtmlNode table in tables)
                {
                    var s = new StringBuilder();
                    bool isHeader = true;
                    //there are text nodes in children, they are just line breaks and safe to ignore
                    foreach (HtmlNode row in table.ChildNodes.Where(n => n.Name == "tr"))
                    {
                        int cellCount = 0;
                        s.Append("|");
                        foreach (HtmlNode cell in row.ChildNodes.Where(n => n.Name == "td"))
                        {
                            s.Append(cell.InnerText.Trim());
                            s.Append("|");
                            cellCount++;
                        }
                        s.AppendLine();
                        if (isHeader)
                        {
                            //markdown needs a separator row after the first (header) row
                            s.Append("|");
                            for (int i = 0; i < cellCount; i++)
                            {
                                s.Append("-|");
                            }
                            s.AppendLine();
                            isHeader = false;
                        }
                    }
                    table.ParentNode.ReplaceChild(doc.CreateTextNode(s.ToString()), table);
                }
            }

            private string Decorate(string text, IReadOnlyCollection<string> decorations)
            {
                foreach (string dec in decorations)
                {
                    // wrap the text: opening marker, text, closing marker
                    text = dec + text + dec;
                }
                return text + Environment.NewLine; //append new line because it's in a span
            }
        }

        internal class PatternReplacer : IReplacer
        {
            public PatternReplacer(string pattern, string replacement)
            {
                Pattern = pattern;
                Replacement = replacement;
            }

            public string Pattern { get; }
            public string Replacement { get; }

            public string Replace(string html)
            {
                return new Regex(Pattern).Replace(html, Replacement);
            }
        }
    }
} |
OneNote encodes its HTML pages in a way that's close to what Html2Markdown supports, but OneNote HTML represents bold and italics as follows:
I'm willing to make the changes if you tell me how you'd like this fixed.