Improving LLMs with Microsoft's MarkItDown

Updated: Apr 10, 2025

By: Joseph Horace

#MarkItDown

#MarkItDown Python API

#MarkItDown command line

#MarkItDown Docker

#MarkItDown Azure Document Intelligence

#MarkItDown features

#MarkItDown installation

#MarkItDown usage

#document conversion

#Markdown

Introduction

Why Is Structured Data Critical for LLM Training?

Large Language Models (LLMs) such as GPT, PaLM, or Claude are trained on large volumes of textual data. This enables them to learn patterns in language, context, and semantics. The quality of this training data directly affects the quality of the model.

Structure plays a big role in data quality. For example, headings indicate a shift in topic, lists signal hierarchy, and bold/italics signal importance. Well-structured content makes it easier for LLMs to ingest the data.

What Are the Challenges with Real-World Document Formats?

While documents (e.g., PDF, Word files, and PPT) are rich in information, they are difficult to work with. They are optimized for humans rather than LLMs. They pose challenges such as:

Inconsistent formatting
Layout noise
Semantic ambiguity
Mixed media

What Is MarkItDown and How Does It Help?

MarkItDown is a document conversion tool developed by Microsoft. It translates rich documents into Markdown. It is built in Python and supports multiple input formats, including but not limited to .pdf, .docx, .pptx, .xlsx, .md, and .html.

It differs from many other tools by preserving structure. Rather than emitting only raw text, it retains headings, lists, and other formatting, which keeps the original intent of the source material intact.

Challenges with Raw Document Formats

What does “unstructured data” mean in the context of documents?

Unstructured data, in the context of LLMs, refers to data lacking clear semantics and hierarchical cues. Such data is difficult for AI to interpret, group, and find relationships within.

Why is unstructured data problematic for LLMs?

LLMs learn from the patterns in how language is presented. If the data lacks structure, it creates several issues:

Token confusion: Headings and subheadings might appear to LLMs as normal flat text.
Loss of context: Lists, quotes, code, and sections are difficult to distinguish.
Increased noise: Stray elements such as page headers can dilute useful training cues and even change the apparent meaning of the surrounding text.
Ambiguity: Without structure, the LLM can't distinguish between classes of data — e.g., metadata, decorations, and narrative content.

What are common sources of unstructured data in real-world scenarios?

Unstructured data often comes from:

PDFs with complex layouts or embedded fonts
Word documents using inconsistent styles
PowerPoint decks with sparse or heavily visual slides
Scanned documents with OCR inaccuracies
HTML pages cluttered with layout tags, ads, or scripting

What specific types of “noise” or challenges exist in formats like PDF, DOCX, etc.?

These formats come with a range of issues:

Layout noise: Headers, footers, multi-column text, and watermark overlays
Formatting inconsistencies: For example, bold might be used where it shouldn't
Mixed content: Images, charts, and footnotes might interrupt the text flow
Parsing limitations: Most tools extract text in the order it is stored or laid out on the page rather than in logical reading order, so the result can come out scrambled
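
To make the layout-noise problem concrete, here is a minimal sketch, assuming the text has already been extracted page by page. It drops lines that recur on nearly every page, which are usually running headers or footers. The function name strip_repeated_lines, the min_fraction threshold, and the sample pages are hypothetical choices for illustration and are not part of MarkItDown.

```python
import math
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.8):
    """Drop lines that appear on most pages (likely running headers or footers).

    `pages` is a list of page texts. `min_fraction` is an illustrative
    threshold: a line is removed if it occurs on at least that share of pages.
    """
    # Count on how many pages each distinct non-empty line occurs.
    page_counts = Counter()
    for page in pages:
        page_counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    # Require at least two pages so a single-page document is left untouched.
    threshold = max(2, math.ceil(len(pages) * min_fraction))
    repeated = {ln for ln, n in page_counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines() if ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Annual Report\nRevenue grew in Q1.\nPage 1",
    "ACME Annual Report\nCosts fell in Q2.\nPage 2",
    "ACME Annual Report\nOutlook is stable.\nPage 3",
]
for page in strip_repeated_lines(pages):
    print(page)
    print("---")
```

Note that the page numbers survive because each one is unique. Catching those would need a pattern-based rule, which is one reason hand-written cleanup scripts tend to grow quickly.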

Why are existing tools insufficient for handling these issues at scale?

Many tools already exist for converting files to Markdown. However, many of them do not preserve semantics. They may return a raw string that lacks clear markup indicating structure.

Some tools do not support multiple file types in a single workflow, and others might require manual corrections as part of post-processing. The output often needs scripts to clean and structure the data after processing.

MarkItDown: Features and Usage

What is MarkItDown, and what document formats does it support?

MarkItDown is a document conversion tool developed by Microsoft to transform documents of various formats into clean, structured Markdown.

It supports a wide range of input types, including PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR), audio (EXIF metadata and speech transcription), HTML, text-based formats (CSV, JSON, XML), ZIP files (iterating over their contents), YouTube URLs, and EPUBs.

What are the core features of MarkItDown?

Structure Preservation: Unlike its predecessors, MarkItDown preserves the semantic structure of documents, such as headings, lists, and formatting. This structure helps LLMs interpret the content.
Clean Output: MarkItDown's output is free of excessive formatting that might hinder LLM performance.
Semantic Integrity: MarkItDown ensures that the intent of the original document is preserved.

How does it preserve document structure and formatting?

MarkItDown works by analyzing the document and translating its structure into Markdown syntax. It identifies elements like headings, subheadings, lists, and paragraphs, preserving their hierarchy and meaning. Additionally, any inline formatting (like bold or italics) is also converted to its Markdown equivalent.
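
The mapping from document elements to Markdown markers can be illustrated with a toy converter. This sketch uses only Python's standard html.parser module and handles a handful of tags. It is an illustration of the idea under simplifying assumptions, not MarkItDown's actual implementation, and the names TinyMarkdown and html_to_markdown are made up for this example.

```python
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Toy converter that maps a few HTML tags to Markdown markers."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}
    BLOCKS = ("p", "li")

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self.current = ""  # text of the block currently being built

    def flush(self):
        """Finish the current block and start a new one."""
        if self.current.strip():
            self.lines.append(self.current.strip())
        self.current = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.flush()
            self.current = self.HEADINGS[tag]  # heading level becomes #, ##, ###
        elif tag == "li":
            self.flush()
            self.current = "- "                # list item becomes a dash bullet
        elif tag == "p":
            self.flush()
        elif tag in self.INLINE:
            self.current += self.INLINE[tag]   # opening emphasis marker

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.current += self.INLINE[tag]   # closing emphasis marker
        elif tag in self.HEADINGS or tag in self.BLOCKS:
            self.flush()

    def handle_data(self, data):
        self.current += data


def html_to_markdown(html):
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.lines)


sample = (
    "<h1>Report</h1>"
    "<p>Sales were <strong>strong</strong>.</p>"
    "<ul><li>Q1</li><li>Q2</li></ul>"
)
print(html_to_markdown(sample))
```

The heading, the bold phrase, and the two list items all keep their roles in the output, which is the property the paragraph above describes. A real converter also has to deal with nesting, tables, links, and malformed input.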

What makes its output clean and suitable for LLM ingestion?

MarkItDown's output is free from excessive visual elements such as complex layouts and non-textual content, which are irrelevant to LLMs. This makes the documents more easily ingestible by models. It uses simple symbols to represent structure (e.g., # for headings, - for lists), reducing the number of tokens required to represent the same information compared to rich-text formats.
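
As a rough illustration of the token-count claim, the sketch below compares the same content written as HTML and as Markdown. It assumes a crude stand-in for a tokenizer (words plus individual punctuation marks); real LLM tokenizers split text differently, so the numbers are indicative only. The function name rough_token_count and both sample strings are hypothetical.

```python
import re


def rough_token_count(text):
    """Crude proxy for a tokenizer: count words and single punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


# The same heading and two-item list in both formats.
html = "<h1>Report</h1><ul><li>Q1 revenue</li><li>Q2 revenue</li></ul>"
markdown = "# Report\n- Q1 revenue\n- Q2 revenue"

print("HTML:    ", rough_token_count(html))
print("Markdown:", rough_token_count(markdown))
```

Most of the difference comes from paired opening and closing tags, which Markdown replaces with a single leading marker per element.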

MarkItDown handles a wide variety of input formats, including:

- PDFs: Preserving text, headings, and tables from complex layouts.
- Word files (.docx): Maintaining paragraphs, lists, and other semantic structures.
- PowerPoint presentations (.pptx): Handling slides and bullet points.
- Excel spreadsheets (.xlsx): Extracting data from structured tables.
- Markdown files (.md) and HTML: Converting already structured content into more usable Markdown.

What are the benefits of MarkItDown in the context of LLMs?

Token Efficiency: Markdown uses simple notations like # for headings, reducing the number of tokens spent on formatting.
Structure Clarity: MarkItDown’s clear, easy-to-interpret structure helps LLMs better understand the content’s hierarchy and meaning.
Ease of Parsing: MarkItDown’s output is easier for LLMs to parse compared to other complex formats.
Consistency: MarkItDown produces documents with a predictable format, making it easier for LLMs to learn patterns.
Minimal Noise: Unlike rich formats, Markdown eliminates extraneous elements, reducing noise and allowing LLMs to focus on the content.
Scalability: Markdown’s plain-text nature allows it to scale easily for large datasets, which is essential for training LLMs. Markdown files are straightforward to process and integrate into data pipelines.
Compatibility: Most AI tools and machine learning libraries support Markdown, making it an accessible format for feeding data into various LLMs.

How is MarkItDown installed and used?

In this section, we cover the practical usage of MarkItDown across different environments:

Installation of MarkItDown

You can install MarkItDown either with Python's package manager (pip) or by cloning the GitHub repository. The [all] extra installs the optional dependencies for every supported format:

pip install 'markitdown[all]'

Alternatively, clone the repo directly:

git clone https://github.com/microsoft/markitdown
cd markitdown
pip install -e 'packages/markitdown[all]'

Usage via Command Line

Once installed, you can convert files using the command line with simple syntax such as:

markitdown input.docx -o output.md

This command converts the specified document to Markdown format and saves it as output.md.

Usage via Python API

MarkItDown can also be used programmatically via its Python API. Here's an example:

from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
result = md.convert("testfile.xlsx")
print(result.text_content)

Usage via Docker

MarkItDown can be run inside a Docker container for isolated and reproducible environments. Here's how to build and run it:

docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md

Usage with Azure Document Intelligence

Azure AI Document Intelligence (formerly Form Recognizer) is a cloud-based service that extracts structured data—text, tables, key-value pairs—from various document formats using prebuilt and custom models.

MarkItDown can integrate with Azure Document Intelligence to enhance its document parsing capabilities:

from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)

This allows you to leverage Azure’s layout understanding before converting to Markdown, making the output more reliable for downstream LLM tasks.
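
In practice you may not want to hard-code the endpoint. The sketch below reads it from an environment variable and falls back to plain local conversion when the variable is unset. The variable name DOCINTEL_ENDPOINT and both helper functions are assumptions made for this example. Only the docintel_endpoint argument comes from the snippet above.

```python
import os


def markitdown_kwargs(env=os.environ):
    """Build keyword arguments for MarkItDown() from environment settings.

    DOCINTEL_ENDPOINT is a hypothetical variable name chosen for this sketch.
    """
    kwargs = {}
    endpoint = env.get("DOCINTEL_ENDPOINT", "").strip()
    if endpoint:
        # Same argument as in the example above.
        kwargs["docintel_endpoint"] = endpoint
    return kwargs


def convert_file(path):
    """Convert one file, using Document Intelligence only if it is configured."""
    # Imported lazily so markitdown_kwargs works without the package installed.
    from markitdown import MarkItDown

    md = MarkItDown(**markitdown_kwargs())
    return md.convert(path).text_content


# With no endpoint configured, MarkItDown is created with default settings.
print(markitdown_kwargs({}))
```

Keeping the endpoint out of the source makes it possible to run the same script offline and in the cloud without code changes.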

Limitations and Considerations

While MarkItDown performs well compared to alternatives, it is not 100% accurate. It may produce suboptimal results when handling complex or poorly structured documents.

May Require Post-Processing for Edge Cases: In some cases, the output may require additional post-processing to clean up formatting or correct minor issues, especially in documents with unconventional layouts.
Potential Size/Format Limitations: Extremely large files or those using uncommon formats may be unsupported or could lead to performance issues during conversion.
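
As an example of the kind of post-processing mentioned above, here is a minimal cleanup sketch that works on any Markdown string. It assumes the only problems are trailing spaces and runs of blank lines. The function name tidy_markdown is hypothetical, and real edge cases usually need rules specific to the source documents.

```python
import re


def tidy_markdown(text):
    """Light cleanup: trim trailing whitespace and collapse runs of blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Three or more consecutive newlines become two, i.e. a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    # End the document with exactly one newline.
    return cleaned.strip() + "\n"


raw = "# Title   \n\n\n\nSome text.  \n\n\n- item\n"
print(tidy_markdown(raw))
```

One caveat: two trailing spaces mean a hard line break in Markdown, so this cleanup is only safe when those breaks are not needed.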

Conclusion

High-quality, structured data is the foundation of effective LLM training. While rich documents such as PDFs contain valuable information for humans, this information can be useless to LLMs if not presented properly. MarkItDown bridges this gap by converting such documents into a format (Markdown) that LLMs can efficiently use.

As LLM applications grow, tools like MarkItDown will become essential for turning raw content into high-quality, machine-readable data.

Frequently Asked Questions

What is the purpose of Markdown?

Markdown is a lightweight markup language designed to format plain text using simple, readable syntax. Its primary purpose is to make content easy to write, read, and convert into rich formats like HTML. For LLMs, Markdown offers structure without the overhead of complex formatting, making it ideal for efficient parsing and training.

Why not use PDFs or Word documents directly for LLM training?

PDFs and Word documents are designed for visual presentation, not machine understanding. They often include layout noise, inconsistent styling, and embedded elements that make it hard for models to extract clean, structured data. Markdown strips away this noise, preserving only the semantically meaningful parts.

Can MarkItDown handle scanned or image-based documents?

Yes, MarkItDown supports image files and leverages OCR (Optical Character Recognition) to extract text content from scanned documents or images, making them usable in text-based pipelines.

Does MarkItDown require internet access or external services?

MarkItDown can run entirely offline. However, certain integrations—like using Azure Document Intelligence—require internet access and cloud service credentials.

How does MarkItDown compare to other document converters?

Unlike many tools that focus on extracting plain text, MarkItDown emphasizes structure preservation, semantic integrity, and clean Markdown output, making it highly suitable for LLM ingestion pipelines without heavy post-processing.

What document formats does MarkItDown support?

MarkItDown supports a wide range of document formats, including: PDF, PowerPoint, Word, Excel, Images (EXIF metadata and OCR support), Audio (EXIF metadata and speech transcription), HTML, CSV, JSON, XML, ZIP files (recursively processes contents), YouTube URLs, EPUBs. MarkItDown can also handle niche sources like URLs and compressed archives containing multiple document types.

Is MarkItDown compatible with cloud environments like Azure or AWS?

Yes, MarkItDown can be integrated into cloud environments. For instance, it works with Azure Document Intelligence (formerly Form Recognizer) to extract structured data from documents before converting them into Markdown format.

Can MarkItDown handle large files?

MarkItDown can handle a range of file sizes, but for extremely large files, performance might be affected. It’s recommended to break down large documents into smaller chunks for better efficiency. Additionally, files with very complex formats may require additional optimization for best results.

Does MarkItDown support multiple languages?

Yes, MarkItDown supports multiple languages, especially in text-based formats (e.g., CSV, JSON). For image-based content or OCR, the quality of extraction may vary based on the language and the OCR engine's capabilities.

What happens if MarkItDown cannot process a document?

If MarkItDown encounters an unsupported file type or a complex document layout, it may produce incomplete or incorrect output. In such cases, manual post-processing or fallback to other conversion methods might be required.

How does MarkItDown improve token efficiency for LLMs?

MarkItDown uses minimalistic syntax to preserve structure—such as `#` for headings and `-` for lists—thus reducing the number of tokens required for formatting. This makes the resulting Markdown documents more token-efficient and better suited for LLM training.

Can MarkItDown be automated or integrated into a pipeline?

Yes, MarkItDown can be easily automated through command-line interfaces (CLI) or the Python API, making it suitable for integration into larger data processing or machine learning pipelines.
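
A minimal sketch of such a pipeline step is shown below, assuming a folder of source files that should be mirrored into a folder of Markdown files. The function convert_tree, its parameters, and the suffix list are hypothetical. The converter is passed in as a plain callable so the loop can be tried without MarkItDown installed, and markitdown_convert shows how the Python API from earlier in the article would plug in.

```python
from pathlib import Path

DEFAULT_SUFFIXES = (".pdf", ".docx", ".pptx", ".xlsx", ".html")


def convert_tree(src_dir, dst_dir, convert, suffixes=DEFAULT_SUFFIXES):
    """Convert every matching file under src_dir and write Markdown to dst_dir.

    `convert` is any callable that takes a file path (str) and returns
    Markdown text. Returns (written, failures), where `failures` maps each
    failed source path to its error message.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    written, failures = [], {}
    for path in sorted(src_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Keep the original extension in the name (report.pdf -> report.pdf.md)
        # so report.pdf and report.docx do not overwrite each other.
        relative = path.relative_to(src_dir)
        target = dst_dir / relative.with_name(relative.name + ".md")
        try:
            text = convert(str(path))
        except Exception as exc:  # record the failure and keep going
            failures[str(path)] = str(exc)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written, failures


def markitdown_convert(path):
    """Adapter around the MarkItDown Python API shown earlier."""
    # Imported lazily so convert_tree can be used without the package.
    from markitdown import MarkItDown

    return MarkItDown().convert(path).text_content
```

A call such as convert_tree("docs", "markdown", markitdown_convert) would then process a whole directory, and the returned failures dictionary gives a list of files to retry or handle manually.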

About the Author

Joseph Horace

Horace is a dedicated software developer with a deep passion for technology and problem-solving. With years of experience in developing robust and scalable applications, Horace specializes in building user-friendly solutions using cutting-edge technologies. His expertise spans across multiple areas of software development, with a focus on delivering high-quality code and seamless user experiences. Horace believes in continuous learning and enjoys sharing insights with the community through contributions and collaborations. When not coding, he enjoys exploring new technologies and staying updated on industry trends.