Commit Graph

26 Commits

Author SHA1 Message Date
Régis Hanol 501b19b6e0
FIX: server-side HtmlToMarkdown improvements (#9586)
TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown.
It also adds support for converting HTML <tables> to markdown tables.

The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove
leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements)

It was a good idea, but it was very limited and leaded to bad conversion when the html had leading whitespaces on several lines for example.
One such example can be found [here](https://meta.discourse.org/t/86782).

For various reasons, most of the whitespaces in a HTML file is ignored when the page is being displayed in a browser.
The rules that the browsers follow are the [CSS' White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules).
They can be quite complicated when you take into account RTL languages and other various tidbits but they boils down to the following:

- Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are being displaying on the same line)
- Remove any leading/trailing whitespaces inside an inline context

One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'.
We would also need to hoist <pre> elements in order to not mess with their whitespaces.
Unfortunately, this solution let some whitespaces creep around HTML tags which leads to more '.strip!' calls than I can bear.

I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts

1. remove_not_allowed!

The HtmlToMarkdown library is recursively "visiting" all the nodes in the HTML in order to convert them to Markdown.
All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed".
In order to reduce the number of nodes visited, the method 'remove_not_allowed!' will automatically delete all the nodes
that have no "visitor" (eg. a 'visit_<tag>' method) defined.

2. remove_hidden!

Similar purpose as the previous method (eg. reducing number of nodes visited), there's no point trying to convert something that is hidden.
The 'remove_hidden!' method removes any nodes that was hidden using the "hidden" HTML attribute, some CSS or with a width or height equal to 0.

3. hoist_line_breaks!

The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying.
The <br> tags are inline elements but they visually work like a block element (ie. they create a new line).
If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>".
The latter being much more easy to process than the former, so that's what this method is doing.
The "hoist_line_breaks" will hoist <br> tags out of inline tags until their parent is a block element.

4. remove_whitespaces!

The "remove_whitespaces!" is where all the whitespace removal is happening. It's broken down into 4 methods as well

- remove_whitespaces!
- is_inline?
- collapse_spaces!
- remove_trailing_space!

The 'remove_whitespace!' method is recursively walking the HTML tree (skipping <pre> tags).
If a node has any children, they will be chunked into groups of inline elements vs block elements.
For each chunks of inline elements, it will call the "collapse_space!" and "remove_trailing_space!" methods.
For each chunks of block elements, it will call "remote_whitespace!" to keep walking the HTML tree recursively.

The "is_inline?" method determines whether a node is part of a inline context.
A node is inline iif it's a text node or it's an inline tag, but not <br>, and all its children are also inline.

The "collapse_spaces!" method will collapse any kind of (white) space into a single space (" ") character, even accros tags.
For example, if we have "  Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42".

Finally, the "remove_trailing_space!" method is there to remove any trailing space that might creep in at the end of the inline chunk.

This solution is not 100% bullet-proof.
It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed.

FIX: better detection of hidden elements when converting HTML to Markdown
FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown
FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown
FIX: added support for <img> dimensions when converting from HTML to Markdown
FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown
FIX: added support for multilines emphases, strongs and strikes when converting from HTML to Markdown
FIX: added support for <acronym> when converting from HTML to Markdown
DEV: remove unused 'sanitize' gem

Wow, did you just read all that?! Congratz, here's a cookie: 🍪.
2020-04-30 12:21:25 +02:00
Dan Ungureanu 1393950dbc
FIX: Improve HTML to Markdown conversion (#9231)
This commit ensures that whitespaces are preserved in <pre>, but removed
inside text paragraphs.
2020-03-18 19:31:10 +02:00
Sam Saffron 30990006a9 DEV: enable frozen string literal on all files
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.

Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
2019-05-13 09:31:32 +08:00
Vinoth Kannan 87b53e170b FIX: skip <br> inside <p> if next character is \n 2019-04-14 14:44:54 +05:30
Gerhard Schlager 577af81e76 FIX: Font tag resulted in wrong email trimming 2018-12-18 11:40:54 +01:00
Gerhard Schlager 37461a6398 FIX: Weird mixture of line breaks resulted in wrong email trimming 2018-12-18 11:40:54 +01:00
David Taylor 9248ad1905 DEV: Enable `Style/SingleLineMethods` and `Style/Semicolon` in Rubocop (#6717) 2018-12-04 11:48:13 +08:00
Régis Hanol 26d5ae61dd FIX: handle <pre> inside <blockquote> in html_to_markdown 2018-02-26 23:28:02 +01:00
Vinoth Kannan 6b3aa81c11 FIX: Remove other whitespaces except the line intents 2017-12-09 02:36:27 +05:30
Vinoth Kannan dcc63a8ead FIX: Keep all the indenting in the text 2017-12-09 01:11:00 +05:30
Leo McArdle 0ef7a969f2 Some more HTML to Markdown fixes (#5046)
* FIX: handle spaces better within emphasis tags in html_to_markdown

* FIX: handle line breaks at beginning of emphasis tags in html_to_markdown
2017-08-14 22:13:24 +02:00
Leo McArdle 65d5cd7239 FIX: generate valid markdown from <br></b> in an email (#5022)
* FIX: generate valid markdown from <br></b> in an email

* FIX: don't generate markdown for empty <strong> or <em> tags in emails
2017-08-02 23:02:59 +02:00
Guo Xiang Tan 5012d46cbd Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
Régis Hanol a1b8a3b52b FIX: supports bare <li> when converting html to markdown 2017-05-17 15:05:11 +02:00
Robin Ward b57b635d30 FIX: Extract `div` tags within `span`s 2017-05-09 12:33:54 -04:00
Régis Hanol 768c63c103 Add 'keep_cid_imgs' option to HTML to Markdown converter to improve incoming email parsing 2017-05-03 23:01:55 +02:00
Régis Hanol e38014772b FIX: skip hidden <img> (no tracking for you) 2017-05-03 19:40:34 +02:00
Régis Hanol c8044c6956 FIX: skip hidden nodes when converting from HTML to Markdown 2017-05-03 19:34:03 +02:00
Régis Hanol bff36de130 FIX: HtmlToMarkdown should not convert empty/bad <img> tags 2017-05-03 18:29:25 +02:00
Régis Hanol c880af8120 FIX: properly trim whitespaces (including those pesky &nbsp; html entities) 2017-05-03 18:04:31 +02:00
Régis Hanol edbf12622b FIX: HtmlToMarkdown should not convert empty/bad <a> tags 2017-05-03 16:42:37 +02:00
Régis Hanol aba76bace6 add support to keep img tags when converting to html 2017-04-28 22:14:46 +02:00
Régis Hanol 51ee49aad2 FIX: properly support HTML document when converting to markdown 2017-04-28 22:02:20 +02:00
Régis Hanol b76674f640 FEATURE: convert incoming emails in HTML to markdown
- remove incoming_email_prefer_html site setting
- remove HtmlCleaner class
2017-04-26 16:49:06 +02:00
Régis Hanol e5c29a1dde eradicate debugging 'puts' 💥 2017-04-24 23:08:15 +02:00
Régis Hanol d5630d6160 HtmlToMarkdown library
Small library to transform HTML to Discourse-flavored markdown (mostly used for imports)
2017-04-24 22:01:41 +02:00