Basic Utils
  • Improving LLMs with Microsoft's MarkItDown

    Introduction

    Why Is Structured Data Critical for LLM Training?

    Large Language Models (LLMs) such as GPT, PaLM, or Claude are trained on large volumes of textual data. This enables them to learn patterns in language, context, and semantics. The quality of this training data directly affects the quality of the model.

    Structure plays a big role in data quality. For example, headings indicate a shift in topic, lists signal hierarchy, and bold/italics signal importance. Well-structured content makes it easier for LLMs to ingest the data.

    What Are the Challenges with Real-World Document Formats?

    While documents (e.g., PDFs, Word files, and PowerPoint decks) are rich in information, they are difficult to work with because they are optimized for human readers rather than LLMs. They pose challenges such as:

    • Inconsistent formatting
    • Layout noise
    • Semantic ambiguity
    • Mixed media

    What Is MarkItDown and How Does It Help?

    MarkItDown is a document conversion tool developed by Microsoft. It translates rich documents into Markdown. It is built in Python and supports multiple input formats, including but not limited to .pdf, .docx, .pptx, .xlsx, .md, and .html.

    It differs from plain text extractors by preserving structure. Rather than emitting a flat string, it retains headings, lists, and other formatting, so the original intent and organization of the source material survive the conversion.

    Challenges with Raw Document Formats

    What does “unstructured data” mean in the context of documents?

    Unstructured data, in the context of LLMs, refers to data lacking clear semantics and hierarchical cues. Such data is difficult for AI to interpret, group, and find relationships within.

    Why is unstructured data problematic for LLMs?

    LLMs learn from the patterns in how language is presented. If the data lacks structure, it creates several issues:

    • Token confusion: Headings and subheadings might appear to LLMs as normal flat text.
    • Loss of context: Lists, quotes, code, and sections are difficult to distinguish.
    • Increased noise: Irrelevant text dilutes useful training cues such as headers, which can completely change the meaning of what follows.
    • Ambiguity: Without structure, the LLM can't distinguish between classes of data, e.g., metadata, decorations, and narrative content.
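
To make the token-confusion point concrete, here is a minimal sketch. The sample text and the `extract_outline` helper are illustrative assumptions and are not part of MarkItDown. It shows that a heading hierarchy can be recovered from Markdown with a simple rule, while the same content as flat text offers nothing to anchor on.

```python
import re

# ATX-style Markdown heading: 1-6 '#' characters, whitespace, then the title.
# Simplification: '#' lines inside fenced code blocks are not excluded.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def extract_outline(text: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for every Markdown heading in text."""
    outline = []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


# Hypothetical sample: the same content with and without structural markup.
structured = "# Report\n\n## Findings\n\n- Latency dropped\n\n## Next Steps\n"
flat = "Report\n\nFindings\n\nLatency dropped\n\nNext Steps\n"

print(extract_outline(structured))  # three headings with their levels
print(extract_outline(flat))        # nothing recoverable
```

In the flat version, "Findings" is indistinguishable from a one-word sentence, which is the ambiguity described in the list above.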

    What are common sources of unstructured data in real-world scenarios?

    Unstructured data often comes from:

    • PDFs with complex layouts or embedded fonts
    • Word documents using inconsistent styles
    • PowerPoint decks with sparse or heavily visual slides
    • Scanned documents with OCR inaccuracies
    • HTML pages cluttered with layout tags, ads, or scripting

    What specific types of “noise” or challenges exist in formats like PDF, DOCX, etc.?

    These formats come with a range of issues:

    • Layout noise: Headers, footers, multi-column text, and watermark overlays
    • Formatting inconsistencies: For example, bold might be used where it shouldn't
    • Mixed content: Images, charts, and footnotes might interrupt the text flow
    • Parsing limitations: Many tools extract text in the order it is stored or laid out on the page rather than in logical reading order, which can scramble the result
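
As an illustration of what dealing with layout noise can involve, the sketch below drops lines that repeat verbatim across most pages, which is a common heuristic for running headers and footers. It is not how MarkItDown works internally. The `strip_repeated_lines` helper, the 0.6 threshold, and the sample pages are all assumptions made for this example.

```python
from collections import Counter


def strip_repeated_lines(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop lines that recur on at least min_share of the pages.

    Running headers, footers, and watermarks tend to repeat verbatim on
    most pages, while body text does not.
    """
    # With fewer than two pages there is nothing to compare against.
    if len(pages) < 2:
        return [list(page) for page in pages]

    # Count each distinct non-empty line once per page.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page if line.strip()})

    threshold = min_share * len(pages)
    noise = {line for line, n in counts.items() if n >= threshold}
    return [[line for line in page if line.strip() not in noise] for page in pages]


# Hypothetical three-page extraction with a repeated header and footer.
pages = [
    ["ACME Confidential", "Quarterly revenue grew.", "Page footer"],
    ["ACME Confidential", "Costs were flat.", "Page footer"],
    ["ACME Confidential", "Outlook is stable.", "Page footer"],
]
cleaned = strip_repeated_lines(pages)
print(cleaned)
```

A heuristic like this can also remove legitimate repeated content (for example, a recurring table caption), so the threshold would need tuning per corpus.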

    Why are existing tools insufficient for handling these issues at scale?

    Many tools already exist for converting files to Markdown. However, many of them lack semantic preservation: they may return a raw string with no markup indicating structure.

    Some tools do not support multiple file types in a single workflow, and others might require manual corrections as part of post-processing. The output often needs scripts to clean and structure the data after processing.

    MarkItDown: Features and Usage

    What is MarkItDown, and what document formats does it support?

    MarkItDown is a document conversion tool developed by Microsoft to transform documents of various formats into clean, structured Markdown.

    It supports a wide range of input types, including PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR), audio (EXIF metadata and speech transcription), HTML, text-based formats (CSV, JSON, XML), ZIP files (iterating over their contents), YouTube URLs, and EPUBs.

    It can also handle more niche sources such as URLs and ZIP archives containing documents.

    What are the core features of MarkItDown?

    • Structure Preservation: Unlike many earlier tools, MarkItDown preserves the semantic structure of documents, such as headings, lists, and formatting. This structure is important for LLMs when interpreting the content.
    • Clean Output: MarkItDown's output is free of excessive formatting that might hinder LLM performance.
    • Semantic Integrity: MarkItDown aims to preserve the intent of the original document.

    How does it preserve document structure and formatting?

    MarkItDown works by analyzing the document and translating its structure into Markdown syntax. It identifies elements like headings, subheadings, lists, and paragraphs, preserving their hierarchy and meaning. Additionally, any inline formatting (like bold or italics) is also converted to its Markdown equivalent.
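
The mapping described above can be pictured with a toy renderer. This is a sketch of the general idea only, not MarkItDown's implementation: the block dictionaries, the `to_markdown` function, and the `bold` helper are hypothetical names invented for this example.

```python
def bold(text: str) -> str:
    """Markdown equivalent of bold inline formatting."""
    return f"**{text}**"


def to_markdown(blocks: list[dict]) -> str:
    """Render a list of hypothetical structural blocks as Markdown."""
    lines = []
    for block in blocks:
        kind = block["kind"]
        if kind == "heading":
            # Heading depth maps to the number of '#' characters.
            lines.append("#" * block["level"] + " " + block["text"])
        elif kind == "list":
            lines.extend("- " + item for item in block["items"])
        elif kind == "paragraph":
            lines.append(block["text"])
        else:
            raise ValueError(f"unknown block kind: {kind}")
        lines.append("")  # a blank line separates blocks
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical document structure, as a parser might report it.
blocks = [
    {"kind": "heading", "level": 1, "text": "Release Notes"},
    {"kind": "paragraph", "text": "This release is " + bold("stable") + "."},
    {"kind": "list", "items": ["Faster parsing", "Fewer dependencies"]},
]
rendered = to_markdown(blocks)
print(rendered)
```

The hard part in a real converter is producing the block list from a PDF or DOCX in the first place; once the structure is known, emitting Markdown is the simple step shown here.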

    What makes its output clean and suitable for LLM ingestion?

    MarkItDown's output is free from excessive visual elements such as complex layouts and non-textual content, which are irrelevant to LLMs. This makes the documents more easily ingestible by models. It uses simple symbols to represent structure (e.g., # for headings, - for lists), reducing the number of tokens required to represent the same information compared to rich-text formats.
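
A rough way to see the size difference is to count words and punctuation marks in the same content expressed as HTML and as Markdown. The count below is a crude proxy and an assumption of this sketch: real LLM tokenizers split text differently, so the absolute numbers are not meaningful, only the direction of the gap. The sample strings are invented for the example.

```python
import re


def rough_token_count(text: str) -> int:
    """Count words and individual punctuation marks as a crude token proxy."""
    return len(re.findall(r"\w+|[^\w\s]", text))


# The same heading and two-item list in two notations (hypothetical sample).
html = "<h1>Setup</h1><ul><li>Install</li><li>Configure</li></ul>"
markdown = "# Setup\n\n- Install\n- Configure\n"

print("HTML:", rough_token_count(html))
print("Markdown:", rough_token_count(markdown))
```

To measure actual token counts, the tokenizer of the target model would have to be used instead of this regular expression.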

    MarkItDown handles a wide variety of input formats, including:

      • PDFs: Preserving text, headings, and tables from complex layouts.
      • Word files (.docx): Maintaining paragraphs, lists, and other semantic structures.
      • PowerPoint presentations (.pptx): Handling slides and bullet points.
      • Excel spreadsheets (.xlsx): Extracting data from structured tables.
      • Markdown files (.md) and HTML: Converting already structured content into more usable Markdown.

    What are the benefits of MarkItDown in the context of LLMs?

    • Token Efficiency: MarkItDown uses simple notations like # for headings, reducing the number of tokens required for formatting.
    • Structure Clarity: MarkItDown’s clear, easy-to-interpret structure helps LLMs better understand the content’s hierarchy and meaning.
    • Ease of Parsing: MarkItDown’s output is easier for LLMs to parse compared to other complex formats.
    • Consistency: MarkItDown produces documents with a predictable format, making it easier for LLMs to learn patterns.
    • Minimal Noise: Unlike rich formats, Markdown eliminates extraneous elements, reducing noise and allowing LLMs to focus on the content.
    • Scalability: Markdown’s plain-text nature allows it to scale easily for large datasets, which is essential for training LLMs. Markdown files are straightforward to process and integrate into data pipelines.
    • Compatibility: Most AI tools and machine learning libraries support Markdown, making it an accessible format for feeding data into various LLMs.
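
One practical consequence of the parsing and scalability points above is that converted Markdown can be split into sections with very little code, for instance before feeding it to a pipeline that works on chunks. The following is a minimal sketch under stated assumptions: `split_by_heading` is a hypothetical helper, and it treats any line starting with `#` plus a space as a heading, without accounting for fenced code blocks.

```python
import re

# A line that starts with 1-6 '#' characters followed by whitespace and text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into chunks, each starting at a heading line.

    Text that appears before the first heading becomes its own chunk.
    """
    chunks, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


# Hypothetical converted document.
document = "Preamble.\n\n# Intro\nText A.\n\n## Details\nText B.\n"
chunks = split_by_heading(document)
for chunk in chunks:
    print(repr(chunk))
```

Doing the same with a raw PDF text dump would require guessing where sections begin.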

    How is MarkItDown installed and used?

    In this section, we cover the practical usage of MarkItDown across different environments:

    Installation of MarkItDown

    You can install MarkItDown either by cloning the GitHub repository or using Python's package manager (pip):

    pip install markitdown
    

    Alternatively, clone the repo directly:

    git clone https://github.com/microsoft/markitdown
    cd markitdown
    pip install -e 'packages/markitdown[all]'
    

    Usage via Command Line

    Once installed, you can convert files using the command line with simple syntax such as:

    markitdown input.docx -o output.md
    

    This command converts the specified document to Markdown format and saves it as output.md.
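
If the CLI needs to be driven from a script, one option is to shell out to it. The sketch below assumes the `markitdown` executable is on the PATH and is invoked as `markitdown <input> -o <output>`. The `build_command` and `convert_with_cli` helpers are hypothetical names for this example.

```python
import subprocess
from pathlib import Path


def build_command(source: Path, target: Path) -> list[str]:
    """Assemble the CLI invocation as an argument list (no shell quoting needed)."""
    return ["markitdown", str(source), "-o", str(target)]


def convert_with_cli(source: Path, target: Path) -> None:
    """Run the markitdown CLI and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(source, target), check=True)


# Only the command is printed here; call convert_with_cli(...) to actually run it.
print(build_command(Path("input.docx"), Path("output.md")))
```

Passing the arguments as a list rather than a single string avoids problems with spaces in file names.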

    Usage via Python API

    MarkItDown can also be used programmatically via its Python API. Here's an example:

    from markitdown import MarkItDown
    md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
    result = md.convert("testfile.xlsx")
    print(result.text_content) 
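
Building on the API call above, a batch conversion over a folder might look like the following sketch. The `convert_directory` helper, the `SUPPORTED` set, and the folder layout are assumptions made for this example. The converter is injectable so that the loop can be exercised without MarkItDown installed; when no converter is passed, it falls back to `MarkItDown().convert(...).text_content` as shown in the example above.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional

# File extensions this sketch chooses to process (an assumption, not a MarkItDown list).
SUPPORTED = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def convert_directory(
    src: Path, dst: Path, convert: Optional[Callable[[Path], str]] = None
) -> list[Path]:
    """Convert every supported file under src to a .md file under dst."""
    if convert is None:
        # Imported lazily so the loop can be tested without markitdown installed.
        from markitdown import MarkItDown

        md = MarkItDown(enable_plugins=False)
        convert = lambda path: md.convert(str(path)).text_content

    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED:
            continue
        try:
            text = convert(path)
        except Exception as exc:  # one bad file should not stop the batch
            print(f"skipped {path}: {exc}")
            continue
        out = dst / path.relative_to(src).with_suffix(".md")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(text, encoding="utf-8")
        written.append(out)
    return written


# Demonstration with a stand-in converter instead of the real library.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in", Path(tmp) / "out"
    src.mkdir()
    (src / "notes.html").write_text("<h1>Hi</h1>", encoding="utf-8")
    (src / "skip.bin").write_bytes(b"\x00")
    written = convert_directory(src, dst, convert=lambda p: f"# {p.stem}\n")
    names = [p.name for p in written]
    content = written[0].read_text(encoding="utf-8")
    print(names)
```

Note that two inputs with the same stem in the same folder (for example `a.pdf` and `a.docx`) would write to the same output file in this sketch.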
    

    Usage via Docker

    MarkItDown can be run inside a Docker container for isolated and reproducible environments. Here's how to build and run it:

    docker build -t markitdown:latest .
    docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
    

    Azure Document Intelligence

    Azure AI Document Intelligence (formerly Form Recognizer) is a cloud-based service that extracts structured data—text, tables, key-value pairs—from various document formats using prebuilt and custom models.

    MarkItDown can integrate with Azure Document Intelligence to enhance its document parsing capabilities:

    from markitdown import MarkItDown
    md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
    result = md.convert("test.pdf")
    print(result.text_content)
    

    This allows you to leverage Azure’s layout understanding before converting to Markdown, making the output more reliable for downstream LLM tasks.
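
In practice the endpoint is often kept out of source code. The sketch below reads it from an environment variable and falls back to a plain local converter when it is not set. `DOCINTEL_ENDPOINT` is a hypothetical variable name chosen for this example, as are the two helper functions. Only the `docintel_endpoint` constructor argument comes from the snippet above.

```python
import os
from typing import Mapping


def markitdown_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build MarkItDown constructor arguments from environment variables.

    DOCINTEL_ENDPOINT is a hypothetical variable name used by this sketch.
    """
    endpoint = env.get("DOCINTEL_ENDPOINT", "").strip()
    return {"docintel_endpoint": endpoint} if endpoint else {}


def make_converter(env: Mapping[str, str] = os.environ):
    """Create a converter; uses Document Intelligence only if an endpoint is set."""
    from markitdown import MarkItDown  # lazy import; requires markitdown installed

    return MarkItDown(**markitdown_kwargs(env))


print(markitdown_kwargs({}))
print(markitdown_kwargs({"DOCINTEL_ENDPOINT": "https://example.invalid/"}))
```

Authentication for the Azure service is outside the scope of this sketch and depends on how the environment is configured.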

    Limitations and Considerations

    • Not 100% Accurate: While MarkItDown performs well compared to alternatives, it may produce suboptimal results when handling complex or poorly structured documents.
    • May Require Post-Processing for Edge Cases: In some cases, the output may require additional post-processing to clean up formatting or correct minor issues, especially in documents with unconventional layouts.
    • Potential Size/Format Limitations: Extremely large files or those using uncommon formats may be unsupported or could lead to performance issues during conversion.
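
The kind of post-processing mentioned above is often just whitespace cleanup. Here is a minimal, deliberately conservative sketch. `tidy_markdown` is a hypothetical helper, and it assumes that trailing spaces carry no meaning, which is not always true: two trailing spaces are a hard line break in Markdown, and whitespace inside code blocks may matter.

```python
import re


def tidy_markdown(text: str) -> str:
    """Apply conservative whitespace cleanup to converted Markdown."""
    # Remove trailing whitespace on every line.
    lines = [line.rstrip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Collapse runs of two or more blank lines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    # Normalize to exactly one trailing newline.
    return cleaned.strip() + "\n"


# Hypothetical raw conversion output with stray spaces and blank lines.
raw = "# Title  \n\n\n\nText   \n\n\n- a\n"
print(repr(tidy_markdown(raw)))
```

Anything beyond this, such as repairing broken tables or merging split paragraphs, tends to be document-specific and is better handled case by case.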

    Conclusion

    High-quality, structured data is the foundation of effective LLM training. While rich documents such as PDFs contain valuable information for humans, this information can be useless to LLMs if not presented properly. MarkItDown bridges this gap by converting such documents into a format (Markdown) that LLMs can efficiently use.

    As LLM applications grow, tools like MarkItDown will become increasingly important for turning raw content into high-quality, machine-readable data.

    Frequently Asked Questions

    What is Markdown, and why is it useful for LLMs?

    Markdown is a lightweight markup language designed to format plain text using simple, readable syntax. Its primary purpose is to make content easy to write, read, and convert into rich formats like HTML. For LLMs, Markdown offers structure without the overhead of complex formatting, making it ideal for efficient parsing and training.

    Why are PDFs and Word documents difficult for LLMs to work with?

    PDFs and Word documents are designed for visual presentation, not machine understanding. They often include layout noise, inconsistent styling, and embedded elements that make it hard for models to extract clean, structured data. Markdown strips away this noise, preserving only the semantically meaningful parts.

    Can MarkItDown extract text from images or scanned documents?

    Yes, MarkItDown supports image files and leverages OCR (Optical Character Recognition) to extract text content from scanned documents or images, making them usable in text-based pipelines.

    Does MarkItDown require an internet connection?

    MarkItDown can run entirely offline. However, certain integrations, such as Azure Document Intelligence, require internet access and cloud service credentials.

    How does MarkItDown differ from other document conversion tools?

    Unlike many tools that focus on extracting plain text, MarkItDown emphasizes structure preservation, semantic integrity, and clean Markdown output, making it highly suitable for LLM ingestion pipelines without heavy post-processing.

    Which document formats does MarkItDown support?

    MarkItDown supports a wide range of document formats, including PDF, PowerPoint, Word, Excel, images (EXIF metadata and OCR support), audio (EXIF metadata and speech transcription), HTML, CSV, JSON, XML, ZIP files (recursively processes contents), YouTube URLs, and EPUBs. MarkItDown can also handle niche sources like URLs and compressed archives containing multiple document types.

    Can MarkItDown be integrated into cloud environments?

    Yes, MarkItDown can be integrated into cloud environments. For instance, it works with Azure Document Intelligence (formerly Form Recognizer) to extract structured data from documents before converting them into Markdown format.

    Can MarkItDown handle very large files?

    MarkItDown can handle a range of file sizes, but for extremely large files, performance might be affected. It is recommended to break down large documents into smaller chunks for better efficiency. Additionally, files with very complex formats may require additional optimization for best results.

    Does MarkItDown support multiple languages?

    Yes, MarkItDown supports multiple languages, especially in text-based formats (e.g., CSV, JSON). For image-based content or OCR, the quality of extraction may vary based on the language and the OCR engine's capabilities.

    What happens when MarkItDown encounters an unsupported file type or a complex layout?

    If MarkItDown encounters an unsupported file type or a complex document layout, it may produce incomplete or incorrect output. In such cases, manual post-processing or fallback to other conversion methods might be required.

    How does MarkItDown improve token efficiency?

    MarkItDown uses minimal syntax to preserve structure, such as `#` for headings and `-` for lists, which reduces the number of tokens required for formatting. This makes the resulting Markdown documents more token-efficient and better suited for LLM training.

    Can MarkItDown be automated?

    Yes, MarkItDown can be easily automated through its command-line interface (CLI) or the Python API, making it suitable for integration into larger data processing or machine learning pipelines.

    About the Author

    Joseph Horace

    Horace is a dedicated software developer with a deep passion for technology and problem-solving. With years of experience in developing robust and scalable applications, Horace specializes in building user-friendly solutions using cutting-edge technologies. His expertise spans across multiple areas of software development, with a focus on delivering high-quality code and seamless user experiences. Horace believes in continuous learning and enjoys sharing insights with the community through contributions and collaborations. When not coding, he enjoys exploring new technologies and staying updated on industry trends.

    ©2024, basicutils.com