| Title: | High-performance HTML to Markdown converter |
|---|---|
| Description: | High-performance HTML to Markdown converter. Rust bindings generated with extendr. |
| Authors: | Na'aman Hirschfeld [aut, cre] |
| Maintainer: | Na'aman Hirschfeld <[email protected]> |
| License: | MIT |
| Version: | 3.8.0 |
| Built: | 2026-06-27 13:35:49 UTC |
| Source: | https://github.com/xberg-io/html-to-markdown |
Uses internally tagged representation ("annotation_type": "bold") for JSON serialization.
AnnotationKind
Returns the default CodeBlockStyle variant.
CodeBlockStyle()
A CodeBlockStyle enum value
All parameters default to NULL, which means the Rust default is used.
Pass named arguments to override individual settings.
conversion_options( heading_style = NULL, list_indent_type = NULL, list_indent_width = NULL, bullets = NULL, strong_em_symbol = NULL, escape_asterisks = NULL, escape_underscores = NULL, escape_misc = NULL, escape_ascii = NULL, code_language = NULL, autolinks = NULL, default_title = NULL, br_in_tables = NULL, compact_tables = NULL, highlight_style = NULL, extract_metadata = NULL, whitespace_mode = NULL, strip_newlines = NULL, wrap = NULL, wrap_width = NULL, convert_as_inline = NULL, sub_symbol = NULL, sup_symbol = NULL, newline_style = NULL, code_block_style = NULL, keep_inline_images_in = NULL, preprocessing = NULL, encoding = NULL, debug = NULL, strip_tags = NULL, preserve_tags = NULL, skip_images = NULL, url_escape_style = NULL, link_style = NULL, output_format = NULL, include_document_structure = NULL, extract_images = NULL, max_image_size = NULL, capture_svg = NULL, infer_dimensions = NULL, max_depth = NULL, exclude_selectors = NULL, visitor = NULL )
| Argument | Description |
|---|---|
| heading_style | Heading style to use in Markdown output (ATX `#` or Setext underline) |
| list_indent_type | How to indent nested list items (spaces or tab) |
| list_indent_width | Number of spaces (or tabs) to use for each level of list indentation |
| bullets | Bullet character(s) to use for unordered list items (e.g. `"-"`, `"*"`) |
| strong_em_symbol | Character used for bold/italic emphasis markers (`*` or `_`) |
| escape_asterisks | Escape `*` characters in plain text to avoid unintended bold/italic |
| escape_underscores | Escape `_` characters in plain text to avoid unintended bold/italic |
| escape_misc | Escape miscellaneous Markdown metacharacters (`[]()#` etc.) in plain text |
| escape_ascii | Escape ASCII characters that have special meaning in certain Markdown dialects |
| code_language | Default language annotation for fenced code blocks that have no language hint |
| autolinks | Automatically convert bare URLs into Markdown autolinks |
| default_title | Emit a default title when no `<title>` tag is present |
| br_in_tables | Render `<br>` elements inside table cells as literal line breaks |
| compact_tables | Emit tables without column padding (compact GFM format) |
| highlight_style | Style used for `<mark>` / highlighted text (e.g. `==text==`) |
| extract_metadata | Populate `result.metadata` with `<head>` / `<meta>` extraction (title, description, Open … |
| whitespace_mode | Controls how whitespace sequences are normalised in the converted output |
| strip_newlines | Strip all newlines from the output, producing a single-line result |
| wrap | Wrap long lines at `wrap_width` characters |
| wrap_width | Maximum output line width in characters when `wrap` is true (default 80) |
| convert_as_inline | Treat the entire document as inline content (no block-level wrappers) |
| sub_symbol | Markdown notation for subscript text (e.g. `"~"`) |
| sup_symbol | Markdown notation for superscript text (e.g. `"^"`) |
| newline_style | How to encode hard line breaks (`<br>`) in Markdown |
| code_block_style | Style used for fenced code blocks (backticks or tilde) |
| keep_inline_images_in | HTML tag names whose `<img>` children are kept inline instead of block |
| preprocessing | Options for the HTML pre-processing pass applied before conversion begins |
| encoding | Expected character encoding of the input HTML (default `"utf-8"`) |
| debug | Emit debug information during conversion |
| strip_tags | HTML tag names whose content is stripped from the output entirely |
| preserve_tags | HTML tag names that are preserved verbatim in the output |
| skip_images | Skip conversion of `<img>` elements (omit images from output) |
| url_escape_style | URL encoding strategy for link and image destinations |
| link_style | Link rendering style (inline or reference) |
| output_format | Target output format (Markdown, plain text, etc.) |
| include_document_structure | Include structured document tree in result |
| extract_images | Extract inline images from data URIs and SVGs |
| max_image_size | Maximum decoded image size in bytes (default 5 MB) |
| capture_svg | Capture SVG elements as images |
| infer_dimensions | Infer image dimensions from data |
| max_depth | Maximum DOM traversal depth. |
| exclude_selectors | CSS selectors for elements to exclude entirely (element + all content) |
| visitor | (feature-gated) Optional visitor for custom traversal logic |
A named list suitable for the options argument of convert().
Use ConversionOptions::builder() to construct, or Default::default() for defaults.
ConversionOptions

| Field | Description |
|---|---|
| heading_style | Heading style to use in Markdown output (ATX `#` or Setext underline). |
| list_indent_type | How to indent nested list items (spaces or tab). |
| list_indent_width | Number of spaces (or tabs) to use for each level of list indentation. |
| bullets | Bullet character(s) to use for unordered list items (e.g. `"-"`, `"*"`). |
| strong_em_symbol | Character used for bold/italic emphasis markers (`*` or `_`). |
| escape_asterisks | Escape `*` characters in plain text to avoid unintended bold/italic. |
| escape_underscores | Escape `_` characters in plain text to avoid unintended bold/italic. |
| escape_misc | Escape miscellaneous Markdown metacharacters (`[]()#` etc.) in plain text. |
| escape_ascii | Escape ASCII characters that have special meaning in certain Markdown dialects. |
| code_language | Default language annotation for fenced code blocks that have no language hint. |
| autolinks | Automatically convert bare URLs into Markdown autolinks. |
| default_title | Emit a default title when no `<title>` tag is present. |
| br_in_tables | Render `<br>` elements inside table cells as literal line breaks. |
| compact_tables | Emit tables without column padding (compact GFM format). |
| highlight_style | Style used for `<mark>` / highlighted text (e.g. `==text==`). |
| extract_metadata | Populate `result.metadata` with `<head>` / `<meta>` extraction (title, description, Open … |
| whitespace_mode | Controls how whitespace sequences are normalised in the converted output. |
| strip_newlines | Strip all newlines from the output, producing a single-line result. |
| wrap | Wrap long lines at `wrap_width` characters. |
| wrap_width | Maximum output line width in characters when `wrap` is true (default 80). |
| convert_as_inline | Treat the entire document as inline content (no block-level wrappers). |
| sub_symbol | Markdown notation for subscript text (e.g. `"~"`). |
| sup_symbol | Markdown notation for superscript text (e.g. `"^"`). |
| newline_style | How to encode hard line breaks (`<br>`) in Markdown. |
| code_block_style | Style used for fenced code blocks (backticks or tilde). |
| keep_inline_images_in | HTML tag names whose `<img>` children are kept inline instead of block. |
| preprocessing | Options for the HTML pre-processing pass applied before conversion begins. |
| encoding | Expected character encoding of the input HTML (default `"utf-8"`). |
| debug | Emit debug information during conversion. |
| strip_tags | HTML tag names whose content is stripped from the output entirely. |
| preserve_tags | HTML tag names that are preserved verbatim in the output. |
| skip_images | Skip conversion of `<img>` elements (omit images from output). |
| url_escape_style | URL encoding strategy for link and image destinations. |
| link_style | Link rendering style (inline or reference). |
| output_format | Target output format (Markdown, plain text, etc.). |
| include_document_structure | Include structured document tree in result. |
| extract_images | Extract inline images from data URIs and SVGs. |
| max_image_size | Maximum decoded image size in bytes (default 5 MB). |
| capture_svg | Capture SVG elements as images. |
| infer_dimensions | Infer image dimensions from data. |
| max_depth | Maximum DOM traversal depth. None means unlimited. When set, subtrees beyond this depth are … |
| exclude_selectors | CSS selectors for elements to exclude entirely (element + all content). |
| visitor | Optional visitor for custom traversal logic. |
ConversionOptions
Uses Option<T> fields for selective updates. Bindings use this to construct
options from language-native types. Prefer ConversionOptionsBuilder for Rust code.
ConversionOptionsUpdate

| Field | Description |
|---|---|
| heading_style | Optional override for ConversionOptions::heading_style. |
| list_indent_type | Optional override for ConversionOptions::list_indent_type. |
| list_indent_width | Optional override for ConversionOptions::list_indent_width. |
| bullets | Optional override for ConversionOptions::bullets. |
| strong_em_symbol | Optional override for ConversionOptions::strong_em_symbol. |
| escape_asterisks | Optional override for ConversionOptions::escape_asterisks. |
| escape_underscores | Optional override for ConversionOptions::escape_underscores. |
| escape_misc | Optional override for ConversionOptions::escape_misc. |
| escape_ascii | Optional override for ConversionOptions::escape_ascii. |
| code_language | Optional override for ConversionOptions::code_language. |
| autolinks | Optional override for ConversionOptions::autolinks. |
| default_title | Optional override for ConversionOptions::default_title. |
| br_in_tables | Optional override for ConversionOptions::br_in_tables. |
| compact_tables | Optional override for ConversionOptions::compact_tables. |
| highlight_style | Optional override for ConversionOptions::highlight_style. |
| extract_metadata | Optional override for ConversionOptions::extract_metadata. |
| whitespace_mode | Optional override for ConversionOptions::whitespace_mode. |
| strip_newlines | Optional override for ConversionOptions::strip_newlines. |
| wrap | Optional override for ConversionOptions::wrap. |
| wrap_width | Optional override for ConversionOptions::wrap_width. |
| convert_as_inline | Optional override for ConversionOptions::convert_as_inline. |
| sub_symbol | Optional override for ConversionOptions::sub_symbol. |
| sup_symbol | Optional override for ConversionOptions::sup_symbol. |
| newline_style | Optional override for ConversionOptions::newline_style. |
| code_block_style | Optional override for ConversionOptions::code_block_style. |
| keep_inline_images_in | Optional override for ConversionOptions::keep_inline_images_in. |
| preprocessing | Optional override for ConversionOptions::preprocessing. |
| encoding | Optional override for ConversionOptions::encoding. |
| debug | Optional override for ConversionOptions::debug. |
| strip_tags | Optional override for ConversionOptions::strip_tags. |
| preserve_tags | Optional override for ConversionOptions::preserve_tags. |
| skip_images | Optional override for ConversionOptions::skip_images. |
| url_escape_style | Optional override for ConversionOptions::url_escape_style. |
| link_style | Optional override for ConversionOptions::link_style. |
| output_format | Optional override for ConversionOptions::output_format. |
| include_document_structure | Optional override for ConversionOptions::include_document_structure. |
| extract_images | Optional override for ConversionOptions::extract_images. |
| max_image_size | Optional override for ConversionOptions::max_image_size. |
| capture_svg | Optional override for ConversionOptions::capture_svg. |
| infer_dimensions | Optional override for ConversionOptions::infer_dimensions. |
| max_depth | Optional override for ConversionOptions::max_depth. |
| exclude_selectors | Optional override for ConversionOptions::exclude_selectors. |
| visitor | Optional override for ConversionOptions::visitor. |
ConversionResult with content, metadata, images, and warnings.
convert(html, options = ConversionOptions$default())

| Argument | Description |
|---|---|
| html | The HTML string to convert. |
| options | Optional conversion options. Defaults to `ConversionOptions$default()`. |
ConversionResult object (list with class attribute).
Returns an error if HTML parsing fails or if the input contains invalid UTF-8.
`<head>` and top-level elements
Contains all metadata typically used by search engines, social media platforms, and browsers for document indexing and presentation.
DocumentMetadata

| Field | Description |
|---|---|
| title | Document title from `<title>` tag |
| description | Document description from `<meta name="description">` tag |
| keywords | Document keywords from `<meta name="keywords">` tag, split on commas |
| author | Document author from `<meta name="author">` tag |
| canonical_url | Canonical URL from `<link rel="canonical">` tag |
| base_href | Base URL from `<base href="">` tag for resolving relative URLs |
| language | Document language from `lang` attribute |
| text_direction | Document text direction from `dir` attribute |
| open_graph | Open Graph metadata (`og:*` properties) for social media. Keys like "title", "description", "image", … |
| twitter_card | Twitter Card metadata (`twitter:*` properties). Keys like "card", "site", "creator", "title", … |
| meta_tags | Additional meta tags not covered by specific fields. Keys are meta name/property attributes, values … |
A single cell in a table grid
GridCell

| Field | Description |
|---|---|
| content | The text content of the cell. |
| row | 0-indexed row position. |
| col | 0-indexed column position. |
| row_span | Number of rows this cell spans (default 1). |
| col_span | Number of columns this cell spans (default 1). |
| is_header | Whether this is a header cell (`<th>`). |
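To make the row, column, and span fields concrete, the sketch below expands a list of cells into a rectangular occupancy grid. It is only an illustration in plain Rust: the `GridCell` struct here is a hypothetical mirror of the documented fields (the integer widths are assumed), and `cell`, `extent`, and `occupancy` are helper names invented for this example, not functions exported by the package.

```rust
// Hypothetical mirror of the documented GridCell fields; integer widths are assumed.
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct GridCell {
    content: String,
    row: u32,
    col: u32,
    row_span: u32,
    col_span: u32,
    is_header: bool,
}

// Illustrative helper: builds a cell with the documented default spans of 1.
fn cell(content: &str, row: u32, col: u32, is_header: bool) -> GridCell {
    GridCell { content: content.to_string(), row, col, row_span: 1, col_span: 1, is_header }
}

// Grid size as (rows, cols): the furthest row and column reached by any cell.
fn extent(cells: &[GridCell]) -> (usize, usize) {
    let rows = cells.iter().map(|c| (c.row + c.row_span) as usize).max().unwrap_or(0);
    let cols = cells.iter().map(|c| (c.col + c.col_span) as usize).max().unwrap_or(0);
    (rows, cols)
}

// For every grid slot, the index of the cell that covers it (None if uncovered).
fn occupancy(cells: &[GridCell]) -> Vec<Vec<Option<usize>>> {
    let (rows, cols) = extent(cells);
    let mut grid = vec![vec![None; cols]; rows];
    for (i, c) in cells.iter().enumerate() {
        for r in c.row..c.row + c.row_span {
            for k in c.col..c.col + c.col_span {
                grid[r as usize][k as usize] = Some(i);
            }
        }
    }
    grid
}

fn main() {
    let mut wide = cell("Name", 0, 0, true);
    wide.col_span = 2; // header cell spanning both columns
    let cells = vec![wide, cell("Ada", 1, 0, false), cell("Lovelace", 1, 1, false)];
    for row in occupancy(&cells) {
        let labels: Vec<&str> = row
            .iter()
            .map(|slot| slot.map_or("", |i| cells[i].content.as_str()))
            .collect();
        println!("{:?}", labels);
    }
}
```

A spanning cell appears once in the cell list but covers several slots of the grid, which is why the row and column positions alone are not enough to reconstruct the layout.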
Captures heading elements (h1-h6) with their text content, identifiers, and position in the document structure.
HeaderMetadata

| Field | Description |
|---|---|
| level | Header level: 1 (h1) through 6 (h6) |
| text | Normalized text content of the header |
| id | HTML `id` attribute if present |
| depth | Document tree depth at the header element |
| html_offset | Byte offset in original HTML document |
Returns the default HeadingStyle variant.
HeadingStyle()
A HeadingStyle enum value
Returns the default HighlightStyle variant.
HighlightStyle()
A HighlightStyle enum value
Captures <img> elements and inline <svg> elements with metadata
for image analysis and optimization.
ImageMetadata

| Field | Description |
|---|---|
| src | Image source (URL, data URI, or SVG content identifier) |
| alt | Alternative text from `alt` attribute (for accessibility) |
| title | Title attribute (often shown as tooltip) |
| dimensions | Image dimensions as (width, height) if available |
| image_type | Image type classification |
| attributes | Additional HTML attributes |
Returns the default ImageType variant.
ImageType()
An ImageType enum value
Represents <a> elements with parsed href values, text content, and link type classification.
LinkMetadata

| Field | Description |
|---|---|
| href | The href URL value |
| text | Link text content (normalized, concatenated if mixed with elements) |
| title | Optional title attribute (often shown as tooltip) |
| link_type | Link type classification |
| rel | Rel attribute values (e.g., "nofollow", "stylesheet", "canonical") |
| attributes | Additional HTML attributes |
Returns the default LinkStyle variant.
LinkStyle()
A LinkStyle enum value
Returns the default LinkType variant.
LinkType()
A LinkType enum value
Returns the default ListIndentType variant.
ListIndentType()
A ListIndentType enum value
Returns the default NewlineStyle variant.
NewlineStyle()
A NewlineStyle enum value
Uses internally tagged representation ("node_type": "heading") for JSON serialization.
NodeContent
Provides comprehensive metadata about the current node being visited, including its type, attributes, position in the DOM tree, and parent context.
NodeContext

| Field | Description |
|---|---|
| node_type | Coarse-grained node type classification |
| tag_name | Raw HTML tag name (e.g., "div", "h1", "custom-element") |
| attributes | All HTML attributes as key-value pairs |
| depth | Depth in the DOM tree (0 = root) |
| index_in_parent | Index among siblings (0-based) |
| parent_tag | Parent element's tag name (None if root) |
| is_inline | Whether this element is treated as inline vs block |
Returns the default NodeType variant.
NodeType()
A NodeType enum value
Returns the default OutputFormat variant.
OutputFormat()
An OutputFormat enum value
HTML preprocessing options for document cleanup before conversion
PreprocessingOptions

| Field | Description |
|---|---|
| enabled | Enable HTML preprocessing globally |
| preset | Preprocessing preset level (Minimal, Standard, Aggressive) |
| remove_navigation | Remove navigation elements (nav, breadcrumbs, menus, sidebars) |
| remove_forms | Remove form elements (forms, inputs, buttons, etc.) |
PreprocessingOptions
This struct uses Option<T> to represent optional fields that can be selectively updated.
Only specified fields (Some values) will override existing options; None values leave the
corresponding fields unchanged when applied via PreprocessingOptions::apply_update.
PreprocessingOptionsUpdate

| Field | Description |
|---|---|
| enabled | Optional global preprocessing enablement override |
| preset | Optional preprocessing preset level override (Minimal, Standard, Aggressive) |
| remove_navigation | Optional navigation element removal override (nav, breadcrumbs, menus, sidebars) |
| remove_forms | Optional form element removal override (forms, inputs, buttons, etc.) |
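The selective-update behaviour described above (Some values override, None values leave a field unchanged) can be sketched in a few lines of Rust. The types below are hypothetical stand-ins that reuse the documented field names; they are not the crate's definitions, and the starting values in `main` are arbitrary illustrations, not the library defaults.

```rust
// Hypothetical stand-ins for the documented types; the real definitions may differ.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Preset {
    Minimal,
    Standard,
    Aggressive,
}

#[derive(Debug, Clone, PartialEq)]
struct Options {
    enabled: bool,
    preset: Preset,
    remove_navigation: bool,
    remove_forms: bool,
}

// Every field is optional so a caller can name only what should change.
#[derive(Debug, Default)]
struct OptionsUpdate {
    enabled: Option<bool>,
    preset: Option<Preset>,
    remove_navigation: Option<bool>,
    remove_forms: Option<bool>,
}

impl Options {
    // Overwrite only the fields that carry Some(..); None leaves a field untouched.
    fn apply_update(&mut self, update: OptionsUpdate) {
        if let Some(v) = update.enabled {
            self.enabled = v;
        }
        if let Some(v) = update.preset {
            self.preset = v;
        }
        if let Some(v) = update.remove_navigation {
            self.remove_navigation = v;
        }
        if let Some(v) = update.remove_forms {
            self.remove_forms = v;
        }
    }
}

fn main() {
    // Arbitrary starting values chosen for the example.
    let mut opts = Options {
        enabled: true,
        preset: Preset::Standard,
        remove_navigation: true,
        remove_forms: false,
    };
    // Only preset and remove_forms are named; the other fields stay as they were.
    opts.apply_update(OptionsUpdate {
        preset: Some(Preset::Aggressive),
        remove_forms: Some(true),
        ..Default::default()
    });
    println!("{:?}", opts);
}
```

This is the same idea as passing NULL for an argument on the R side: an absent value means "keep whatever is already set".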
Returns the default PreprocessingPreset variant.
PreprocessingPreset()
A PreprocessingPreset enum value
Warnings indicate that conversion completed but some content may have been handled differently than expected — for example, an image that could not be extracted, a truncated input, or malformed HTML that was repaired with best-effort parsing.
ProcessingWarning
Conversion always succeeds (returns ConversionResult) even when warnings are present. Callers should inspect warnings and decide how to handle them based on their tolerance for partial results:

- Logging pipelines: emit each warning at WARN level and continue.
- Strict pipelines: treat any warning as a hard error by checking result.warnings.is_empty() before using the output.

See WarningKind for the full taxonomy of warning categories.

| Field | Description |
|---|---|
| message | Human-readable warning message. |
| kind | The category of warning. |
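The two handling strategies for warnings can be sketched as follows. This is a minimal illustration, not the package's API: `ConversionResult` and `ProcessingWarning` are simplified hypothetical structs (only `content` and `warnings` are modelled, and `kind` is reduced to a string because the WarningKind variants are not listed here), and `use_lenient` and `use_strict` are names made up for the example.

```rust
// Simplified, hypothetical stand-ins for the documented result types.
#[derive(Debug, Clone)]
struct ProcessingWarning {
    message: String,
    kind: String, // assumption: the real field is a WarningKind enum
}

#[derive(Debug, Clone)]
struct ConversionResult {
    content: String,
    warnings: Vec<ProcessingWarning>,
}

// Logging pipeline: report every warning, then use the output anyway.
fn use_lenient(result: &ConversionResult) -> &str {
    for w in &result.warnings {
        eprintln!("WARN [{}] {}", w.kind, w.message);
    }
    result.content.as_str()
}

// Strict pipeline: any warning is turned into a hard error.
fn use_strict(result: &ConversionResult) -> Result<&str, String> {
    if result.warnings.is_empty() {
        Ok(result.content.as_str())
    } else {
        Err(format!("conversion produced {} warning(s)", result.warnings.len()))
    }
}

fn main() {
    let result = ConversionResult {
        content: "# Title".to_string(),
        warnings: vec![ProcessingWarning {
            message: "image could not be extracted".to_string(),
            kind: "example".to_string(),
        }],
    };
    println!("lenient output: {}", use_lenient(&result));
    match use_strict(&result) {
        Ok(markdown) => println!("strict output: {}", markdown),
        Err(reason) => println!("strict pipeline rejected the result: {}", reason),
    }
}
```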
Represents machine-readable structured data found in the document. JSON-LD blocks are collected as raw JSON strings for flexibility.
StructuredData

| Field | Description |
|---|---|
| data_type | Type of structured data (JSON-LD, Microdata, RDFa) |
| raw_json | Raw JSON string (for JSON-LD) or serialized representation |
| schema_type | Schema type if detectable (e.g., "Article", "Event", "Product") |
Returns the default StructuredDataType variant.
StructuredDataType()
A StructuredDataType enum value
A top-level extracted table with both structured data and markdown representation
TableData

| Field | Description |
|---|---|
| grid | The structured table grid. |
| markdown | The markdown rendering of this table. |
Unlike DocumentNode, which captures block-level structure (headings, paragraphs, etc.),
a TextAnnotation describes inline-level markup — bold, italic, links, code spans, and
similar — that spans a contiguous run of bytes inside DocumentNode::content's text field.
TextAnnotation
Byte offsets (start..end) are into the UTF-8 encoded text of the parent node. The range
follows Rust slice conventions: start is inclusive and end is exclusive, so the annotated
text is text[start as usize..end as usize].
Multiple annotations on the same node can overlap (e.g. bold-italic text), and they are stored in the order they are encountered during DOM traversal.
See AnnotationKind for the full list of supported annotation types.
| Field | Description |
|---|---|
| start | Start byte offset (inclusive) into the parent node's text. |
| end | End byte offset (exclusive) into the parent node's text. |
| kind | The type of annotation. |
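Because the offsets count bytes rather than characters, non-ASCII text shifts them. The sketch below shows one way to resolve an annotation against its parent text without panicking on a bad range. The `Annotation` struct is a hypothetical stand-in (the `u32` offsets and the string `kind` are assumptions), and `annotated_text` is a helper written for this example only.

```rust
// Hypothetical stand-in for the documented TextAnnotation fields.
#[allow(dead_code)]
struct Annotation {
    start: u32,         // inclusive byte offset (integer width assumed)
    end: u32,           // exclusive byte offset
    kind: &'static str, // assumption: the real field is an AnnotationKind value
}

// Returns the annotated slice, or None when the range is out of bounds
// or does not fall on UTF-8 character boundaries.
fn annotated_text<'a>(text: &'a str, ann: &Annotation) -> Option<&'a str> {
    text.get(ann.start as usize..ann.end as usize)
}

fn main() {
    let text = "naïve bold text";
    // 'ï' takes two bytes, so "naïve " is 7 bytes long and "bold" sits at 7..11.
    let bold = Annotation { start: 7, end: 11, kind: "bold" };
    // Annotations may overlap: this one covers "bold text" (bytes 7..16).
    let italic = Annotation { start: 7, end: 16, kind: "italic" };
    for ann in [&bold, &italic] {
        println!("{}: {:?}", ann.kind, annotated_text(text, ann));
    }
}
```

Indexing with `text[start..end]` works the same way for valid ranges but panics on an invalid one; `str::get` is used here so that a malformed offset shows up as `None` instead.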
Returns the default TextDirection variant.
TextDirection()
A TextDirection enum value
Returns the default UrlEscapeStyle variant.
UrlEscapeStyle()
A UrlEscapeStyle enum value
Allows visitors to control the conversion flow by either proceeding with default behavior, providing custom output, skipping elements, preserving HTML, or signaling errors.
VisitResult

| Variant | Description |
|---|---|
| Continue | Continue with default conversion behavior |
| Custom | Replace default output with custom markdown |
| Skip | Skip this element entirely (don't output anything) |
| PreserveHtml | Preserve original HTML (don't convert to markdown) |
| Error | Stop conversion with an error |
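One way to picture how a converter might act on each variant is a single `match`. The enum below is a hypothetical mirror of the documented variant names; the payloads (`String` for Custom and Error) are assumptions, and `resolve` is an illustrative function, not part of the package.

```rust
// Hypothetical mirror of the documented variants; payload types are assumed.
#[allow(dead_code)]
#[derive(Debug)]
enum VisitResult {
    Continue,
    Custom(String),
    Skip,
    PreserveHtml,
    Error(String),
}

// Decide what text an element contributes, given the visitor's answer.
fn resolve(
    result: VisitResult,
    default_markdown: &str,
    original_html: &str,
) -> Result<String, String> {
    match result {
        VisitResult::Continue => Ok(default_markdown.to_string()),
        VisitResult::Custom(markdown) => Ok(markdown),
        VisitResult::Skip => Ok(String::new()),
        VisitResult::PreserveHtml => Ok(original_html.to_string()),
        VisitResult::Error(message) => Err(message),
    }
}

fn main() {
    let html = "<b>hi</b>";
    let default_md = "**hi**";
    println!("{:?}", resolve(VisitResult::Continue, default_md, html));
    println!("{:?}", resolve(VisitResult::Custom("__hi__".to_string()), default_md, html));
    println!("{:?}", resolve(VisitResult::PreserveHtml, default_md, html));
}
```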
Returns the default WarningKind variant.
WarningKind()
A WarningKind enum value
Returns the default WhitespaceMode variant.
WhitespaceMode()
A WhitespaceMode enum value