discourse

Commit Graph

Author	SHA1	Message	Date
Gerhard Schlager	b8f2cbf41c	DEV: Add `additional_allowed_tags` to `HtmlToMarkdown` Import scripts often use subclasses of `HtmlToMarkdown` and might need to allow additional tags that can be used within the custom class.	2024-06-10 16:03:30 +02:00
Martin Brennan	575bc4af73	FIX: Remove newlines from img alt & title in HTML to markdown parser (#25473) We were having a minor issue with emails with embedded images that had newlines in the alt string; for example: ``` <p class="MsoNormal"><span style="font-size:11.0pt"><img width="898" height="498" style="width:9.3541in;height:5.1875in" id="Picture_x0020_5" src="cid:image003.png@01DA4EBA.0400B610" alt="A screenshot of a computer program Description automatically generated"></span><span style="font-size:11.0pt"><o:p></o:p></span></p> ``` Once this was parsed and converted to markdown (or directly to HTML in some cases), this caused an issue in the composer and the post UI, where the markdown parser didn't know how to deal with this, making the HTML show directly instead of showing an image. The easiest way to deal with this is to just strip \n from image alt and title attrs in the HTMLToMarkdown class.	2024-01-31 10:23:09 +10:00
Gerhard Schlager	5b97f79569	DEV: Replace `starts_with?` with `start_with?` in `HtmlToMarkdown` (#24521) This allows us to use that class without loading Rails, e.g. in imports (converters).	2023-11-23 00:57:24 +01:00
David Taylor	6417173082	DEV: Apply syntax_tree formatting to `lib/*`	2023-01-09 12:10:19 +00:00
Loïc Guitaut	ca97850726	DEV: remove deprecation warnings related to Nokogiri	2022-10-25 10:57:03 +02:00
Loïc Guitaut	008b700a3f	DEV: Upgrade to Rails 7 This patch upgrades Rails to version 7.0.2.4.	2022-04-28 11:51:03 +02:00
Gerhard Schlager	962ccf0ab5	FIX: Hoisting linebreaks shouldn't fail for HTML5 elements (#14364)	2021-09-17 10:41:34 +02:00
Krzysztof Kotlarek	354c939656	FIX: remove Nokogumbo references (#13951) Specs broken after `f4720205c0`	2021-08-05 11:46:25 +10:00
Régis Hanol	cd93d1b5f7	FEATURE: new 'trim_incoming_emails' site setting (#12874) This setting allows admin to de/activate automatic trimming of incoming email. There are instances where it does wonders in trimming all the garbage content and other instances where it's so bad that it trims the most important part of the email. FIX: don't remove hidden content using the style attribute when converting HTML to Markdown. The regexp used was doing more harm than good. It was way too broad. FIX: properly elide signatures from emails sent with Front App. This is fairly safe as Front App nicely identifies signatures in the HTML part.	2021-04-28 17:08:48 +02:00
Arpit Jalan	85c4e8fd32	FEATURE: support `mark` tag (#12088) This commit adds support for `mark` tag for highlighting text content.	2021-02-15 21:47:30 +05:30
Daniel Waterworth	3b368a48d1	Revert "DEV: Add logging for stack level too deep exception in HtmlToMarkdown" We can do this in a better way by storing an IncomingEmail record. Follow-up-to: `4a9ee25c56`	2020-07-09 13:41:33 +01:00
Daniel Waterworth	4a9ee25c56	DEV: Add logging for stack level too deep exception in HtmlToMarkdown	2020-07-09 12:25:00 +01:00
Blake Erickson	a89574ccb9	FIX: Inline error when converting html to markdown Looks like some html elements like `aside` and `section` will throw an error when checking if they are inline or not. The commit simply handles ``` Job exception: undefined method `inline?' for nil:NilClass ``` and adds a test for it.	2020-06-03 15:59:19 -06:00
Régis Hanol	501b19b6e0	FIX: server-side HtmlToMarkdown improvements (#9586) TLDR; this commit vastly improves how whitespaces are handled when converting from HTML to Markdown. It also adds support for converting HTML <table> elements to markdown tables. The previous 'remove_whitespaces!' method was traversing the whole HTML tree and used a heuristic to remove leading and trailing whitespaces whenever it was appropriate (ie. mostly before and after HTML block elements). It was a good idea, but it was very limited and led to bad conversions when the HTML had leading whitespaces on several lines, for example. One such example can be found [here](https://meta.discourse.org/t/86782). For various reasons, most of the whitespace in an HTML file is ignored when the page is displayed in a browser. The rules that browsers follow are the [CSS White Space Processing Rules](https://www.w3.org/TR/css-text-3/#white-space-rules). They can be quite complicated when you take into account RTL languages and other various tidbits, but they boil down to the following: - Collapse whitespaces down to one space (0x20) inside an inline context (ie. nodes/tags that are displayed on the same line) - Remove any leading/trailing whitespaces inside an inline context One quick & dirty way of getting this 90% solved would be to do 'HTML.gsub!(/[[:space:]]+/, " ")'. We would also need to hoist <pre> elements in order to not mess with their whitespaces. Unfortunately, this solution lets some whitespaces creep around HTML tags, which leads to more '.strip!' calls than I can bear. I decided to "emulate" the browser's handling of whitespaces and came up with a solution in 4 parts 1. remove_not_allowed! The HtmlToMarkdown library recursively "visits" all the nodes in the HTML in order to convert them to Markdown. All the nodes that aren't handled by the library (eg. <script>, <style> or any non-textual HTML tags) are "swallowed". In order to reduce the number of nodes visited, the 'remove_not_allowed!' method automatically deletes all the nodes that have no "visitor" (ie. a 'visit_<tag>' method) defined. 2. remove_hidden! Similar purpose as the previous method (ie. reducing the number of nodes visited): there's no point trying to convert something that is hidden. The 'remove_hidden!' method removes any node that was hidden using the "hidden" HTML attribute, some CSS, or a width or height equal to 0. 3. hoist_line_breaks! The 'hoist_line_breaks!' method is there to handle <br> tags. I know those tiny <br> don't do much but they can be quite annoying. The <br> tags are inline elements but they visually work like a block element (ie. they create a new line). If you have the following HTML "<i>Foo<br>Bar</i>", it ends up visually similar to "<i>Foo</i><br><i>Bar</i>". The latter is much easier to process than the former, so that's what this method does. The 'hoist_line_breaks!' method will hoist <br> tags out of inline tags until their parent is a block element. 4. remove_whitespaces! The 'remove_whitespaces!' method is where all the whitespace removal happens. It's broken down into 4 methods as well - remove_whitespaces! - is_inline? - collapse_spaces! - remove_trailing_space! The 'remove_whitespaces!' method recursively walks the HTML tree (skipping <pre> tags). If a node has any children, they will be chunked into groups of inline elements vs block elements. For each chunk of inline elements, it will call the 'collapse_spaces!' and 'remove_trailing_space!' methods. For each chunk of block elements, it will call 'remove_whitespaces!' to keep walking the HTML tree recursively. The 'is_inline?' method determines whether a node is part of an inline context. A node is inline iff it's a text node or it's an inline tag, but not <br>, and all its children are also inline. The 'collapse_spaces!' method will collapse any kind of (white) space into a single space (" ") character, even across tags. For example, if we have " Foo \n<i> Bar </i>\t42", it will return "Foo <i>Bar </i>42". Finally, the 'remove_trailing_space!' method is there to remove any trailing space that might creep in at the end of the inline chunk. This solution is not 100% bullet-proof. It does not support RTL languages at all and has some caveats that I felt were not worth the work to get properly fixed. FIX: better detection of hidden elements when converting HTML to Markdown FIX: take into account the 'allowed_href_schemes' site setting when converting HTML <a> to Markdown FIX: added support for 'mailto:' scheme when converting <a> from HTML to Markdown FIX: added support for <img> dimensions when converting from HTML to Markdown FIX: added support for <dl>, <dd> and <dt> when converting from HTML to Markdown FIX: added support for multiline emphases, strongs and strikes when converting from HTML to Markdown FIX: added support for <acronym> when converting from HTML to Markdown DEV: remove unused 'sanitize' gem Wow, did you just read all that?! Congratz, here's a cookie: 🍪.	2020-04-30 12:21:25 +02:00
Dan Ungureanu	1393950dbc	FIX: Improve HTML to Markdown conversion (#9231) This commit ensures that whitespaces are preserved in <pre>, but removed inside text paragraphs.	2020-03-18 19:31:10 +02:00
Sam Saffron	30990006a9	DEV: enable frozen string literal on all files This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging	2019-05-13 09:31:32 +08:00
Vinoth Kannan	87b53e170b	FIX: skip <br> inside <p> if next character is \n	2019-04-14 14:44:54 +05:30
Gerhard Schlager	577af81e76	FIX: Font tag resulted in wrong email trimming	2018-12-18 11:40:54 +01:00
Gerhard Schlager	37461a6398	FIX: Weird mixture of line breaks resulted in wrong email trimming	2018-12-18 11:40:54 +01:00
David Taylor	9248ad1905	DEV: Enable `Style/SingleLineMethods` and `Style/Semicolon` in Rubocop (#6717)	2018-12-04 11:48:13 +08:00
Régis Hanol	26d5ae61dd	FIX: handle <pre> inside <blockquote> in html_to_markdown	2018-02-26 23:28:02 +01:00
Vinoth Kannan	6b3aa81c11	FIX: Remove other whitespaces except the line indents	2017-12-09 02:36:27 +05:30
Vinoth Kannan	dcc63a8ead	FIX: Keep all the indenting in the text	2017-12-09 01:11:00 +05:30
Leo McArdle	0ef7a969f2	Some more HTML to Markdown fixes (#5046) * FIX: handle spaces better within emphasis tags in html_to_markdown * FIX: handle line breaks at beginning of emphasis tags in html_to_markdown	2017-08-14 22:13:24 +02:00
Leo McArdle	65d5cd7239	FIX: generate valid markdown from <br></b> in an email (#5022) * FIX: generate valid markdown from <br></b> in an email * FIX: don't generate markdown for empty <strong> or <em> tags in emails	2017-08-02 23:02:59 +02:00
Guo Xiang Tan	5012d46cbd	Add rubocop to our build. (#5004)	2017-07-28 10:20:09 +09:00
Régis Hanol	a1b8a3b52b	FIX: supports bare <li> when converting html to markdown	2017-05-17 15:05:11 +02:00
Robin Ward	b57b635d30	FIX: Extract `div` tags within `span`s	2017-05-09 12:33:54 -04:00
Régis Hanol	768c63c103	Add 'keep_cid_imgs' option to HTML to Markdown converter to improve incoming email parsing	2017-05-03 23:01:55 +02:00
Régis Hanol	e38014772b	FIX: skip hidden <img> (no tracking for you)	2017-05-03 19:40:34 +02:00
Régis Hanol	c8044c6956	FIX: skip hidden nodes when converting from HTML to Markdown	2017-05-03 19:34:03 +02:00
Régis Hanol	bff36de130	FIX: HtmlToMarkdown should not convert empty/bad <img> tags	2017-05-03 18:29:25 +02:00
Régis Hanol	c880af8120	FIX: properly trim whitespaces (including those pesky &nbsp; html entities)	2017-05-03 18:04:31 +02:00
Régis Hanol	edbf12622b	FIX: HtmlToMarkdown should not convert empty/bad <a> tags	2017-05-03 16:42:37 +02:00
Régis Hanol	aba76bace6	add support to keep img tags when converting to html	2017-04-28 22:14:46 +02:00
Régis Hanol	51ee49aad2	FIX: properly support HTML document when converting to markdown	2017-04-28 22:02:20 +02:00
Régis Hanol	b76674f640	FEATURE: convert incoming emails in HTML to markdown - remove incoming_email_prefer_html site setting - remove HtmlCleaner class	2017-04-26 16:49:06 +02:00
Régis Hanol	e5c29a1dde	eradicate debugging 'puts' 💥	2017-04-24 23:08:15 +02:00
Régis Hanol	d5630d6160	HtmlToMarkdown library Small library to transform HTML to Discourse-flavored markdown (mostly used for imports)	2017-04-24 22:01:41 +02:00

39 Commits