Multimodal Inputs

This chapter explains how MASFactory handles multimodal fields, attachments, history references, and the current caveats around provider adapters.

1) Field declarations

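Multimodal support is driven by FieldSpec / FieldModality. You can declare fields in either mapping form or the lightweight string form. The example below uses the string form, where a modality prefix is followed by a colon and the field description: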

python

pull_keys = {
    "receipt_image": "IMAGE:Receipt image",
    "invoice_pdf": "PDF:Invoice PDF",
}

Supported modalities:

- TEXT
- IMAGE
- PDF
- ANY

The lightweight prefix is case-insensitive, so pdf:Invoice PDF is valid too.
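To make the lightweight string form concrete, here is a minimal sketch of how such a string could be split into a modality and a description. It is not MASFactory's parser: `Modality`, `ParsedField`, and `parse_field` are hypothetical names, and the fallback to TEXT for strings without a recognized prefix is an assumption based on plain entries such as "Question" in the minimal example.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    # Hypothetical stand-in for FieldModality; member names mirror the list above.
    TEXT = "TEXT"
    IMAGE = "IMAGE"
    PDF = "PDF"
    ANY = "ANY"


@dataclass(frozen=True)
class ParsedField:
    modality: Modality
    description: str


def parse_field(spec: str) -> ParsedField:
    """Split 'PREFIX:description' into a modality and a description."""
    prefix, sep, rest = spec.partition(":")
    name = prefix.strip().upper()  # upper-casing makes the prefix case-insensitive
    if sep and name in Modality.__members__:
        return ParsedField(Modality[name], rest.strip())
    # Assumption: no recognized prefix means a plain text field, and the
    # whole string (including any colon) is the description.
    return ParsedField(Modality.TEXT, spec.strip())


invoice = parse_field("pdf:Invoice PDF")
question = parse_field("Question")
```

In this sketch a description that happens to contain a colon, such as "Note: check total", is left intact because "Note" is not a modality name.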

2) Media asset objects

MASFactory normalizes attachments into provider-agnostic asset types:

- ImageAsset
- PdfAsset

Common constructors:

python

from masfactory import ImageAsset, PdfAsset

image = ImageAsset.from_path("./receipt.png")
pdf = PdfAsset.from_path("./invoice.pdf")

Assets can also be created from bytes, base64, URLs, and provider file IDs when the selected model adapter supports those sources.
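This chapter does not list the constructor names for the bytes and base64 sources, so the sketch below deliberately avoids the MASFactory API. It only shows, with the standard library, the kind of payload such a constructor would need: a MIME type plus base64 text. `encode_attachment` and the dictionary keys are hypothetical.

```python
import base64
import mimetypes


def encode_attachment(data: bytes, filename: str) -> dict:
    """Return a provider-neutral payload: MIME type plus base64 text."""
    # Guess the MIME type from the file extension only; the bytes are not inspected.
    mime_type, _ = mimetypes.guess_type(filename)
    return {
        "mime_type": mime_type or "application/octet-stream",
        "base64": base64.b64encode(data).decode("ascii"),
    }


payload = encode_attachment(b"%PDF-1.7 example bytes", "invoice.pdf")
```

Whether a given adapter accepts a base64 payload, a URL, or a provider file ID still depends on the selected model adapter, as noted above.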

3) What the Agent does with attachments

When an input field contains media assets:

- Agent.observe() validates the field against its declared modality.
- The current turn receives attachment tags such as [receipt_image_1 Receipt image].
- New attachments are sent as MediaMessageBlocks alongside text instructions.
- The formatted user prompt references the attachment by tag.

This keeps prompt text readable while still preserving structured media blocks for capable adapters.
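The tag substitution described above can be pictured with a small sketch. This is an illustration rather than the Agent's actual code: `make_tag` and `render_prompt` are hypothetical helpers, raw bytes stand in for a media asset, and the media blocks are plain dictionaries instead of MediaMessageBlocks.

```python
def make_tag(field: str, index: int, description: str) -> str:
    # Mirrors the tag shape shown above: [receipt_image_1 Receipt image]
    return f"[{field}_{index} {description}]"


def render_prompt(template: str, values: dict, descriptions: dict) -> tuple[str, list]:
    """Replace media values with tags; return the prompt text and media blocks."""
    rendered = {}
    media_blocks = []
    for field, value in values.items():
        if isinstance(value, bytes):  # assumption: bytes stand in for a media asset
            tag = make_tag(field, 1, descriptions[field])
            media_blocks.append({"tag": tag, "data": value})
            rendered[field] = tag  # the prompt text carries only the tag
        else:
            rendered[field] = value
    return template.format(**rendered), media_blocks


prompt, blocks = render_prompt(
    "Please inspect {receipt_image} and answer: {question}",
    {"question": "What is the total amount?", "receipt_image": b"\x89PNG..."},
    {"question": "Question", "receipt_image": "Receipt image"},
)
```

The prompt text stays short and readable, while the binary payload travels separately in `blocks`.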

4) `reuse_attachment_tags`

reuse_attachment_tags only controls how attachments in the current turn are deduplicated.

If True:

- repeated identical attachments within the same turn reuse the first tag
- if the attached history provider returns rich historical media blocks for the same asset, the Agent may reuse that historical tag instead of resending the attachment

If False:

- the current turn always emits fresh tags and fresh media blocks

This flag does not configure how history is stored or merged.
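A rough sketch of the within-turn behavior follows. It is not the Agent's implementation: `assign_tags` is a hypothetical helper, identity is decided by a SHA-256 digest of the raw bytes (an assumption, since the text only says "identical"), and reuse of historical tags is left out.

```python
import hashlib


def assign_tags(attachments, reuse_attachment_tags=True):
    """attachments: list of (field_name, raw_bytes). Returns (tags, media_blocks)."""
    first_tag_by_digest = {}
    counters = {}
    tags, media_blocks = [], []
    for field, data in attachments:
        digest = hashlib.sha256(data).hexdigest()
        if reuse_attachment_tags and digest in first_tag_by_digest:
            # Same content already seen this turn: point at the first tag
            # and do not emit another media block.
            tags.append(first_tag_by_digest[digest])
            continue
        counters[field] = counters.get(field, 0) + 1
        tag = f"{field}_{counters[field]}"
        first_tag_by_digest.setdefault(digest, tag)
        tags.append(tag)
        media_blocks.append((tag, data))
    return tags, media_blocks


turn = [("img", b"a"), ("img", b"a"), ("img", b"b")]
reused_tags, reused_blocks = assign_tags(turn, reuse_attachment_tags=True)
fresh_tags, fresh_blocks = assign_tags(turn, reuse_attachment_tags=False)
```

With reuse enabled, the duplicate is referenced by its first tag and only two media blocks are produced; with reuse disabled, all three attachments get their own tag and block.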

5) History behavior

History policy belongs to the concrete history implementation.

For built-in HistoryMemory:

- only one HistoryProvider-backed memory may be attached to an Agent
- merge_historical_media=True rewrites repeated historical attachments into indexed tag references
- merge_historical_media=False keeps raw historical media blocks intact

That means the richness of history returned to the Agent is decided by the history provider itself, and the Agent adapts to what it receives.
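The difference between the two merge_historical_media settings can be sketched as follows. This is an illustration under stated assumptions, not HistoryMemory's code: `merge_history`, the dictionary block shape, the `asset_id` key, and the `history_media_N` tag names are all hypothetical, and the sketch assumes the first occurrence of an asset stays a media block while later repeats become tag references.

```python
def merge_history(blocks, merge_historical_media=True):
    """blocks: list of {'type': 'text', 'text': ...} or {'type': 'media', 'asset_id': ...}."""
    if not merge_historical_media:
        return list(blocks)  # raw historical media blocks are kept intact
    tag_by_asset = {}
    merged = []
    for block in blocks:
        if block["type"] != "media":
            merged.append(block)
            continue
        asset_id = block["asset_id"]
        if asset_id in tag_by_asset:
            # Repeat of an asset already in history: replace it with a tag reference.
            merged.append({"type": "text", "text": f"[{tag_by_asset[asset_id]}]"})
        else:
            tag_by_asset[asset_id] = f"history_media_{len(tag_by_asset) + 1}"
            merged.append({**block, "tag": tag_by_asset[asset_id]})
    return merged


history = [
    {"type": "media", "asset_id": "A"},
    {"type": "text", "text": "What is the total?"},
    {"type": "media", "asset_id": "A"},
    {"type": "media", "asset_id": "B"},
]
merged = merge_history(history, merge_historical_media=True)
raw = merge_history(history, merge_historical_media=False)
```

Either way, the Agent only sees what the history provider returns, which is why the tag reuse described in section 4 depends on the provider returning rich media blocks.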

6) Skill media

Skills may declare static media in SKILL.md frontmatter:

---
name: receipt-skill
media:
  - type: image
    path: guide.png
    mime_type: image/png
---
Always compare the receipt against the guide image.

These skill media assets:

- stay with the skill text on the system side
- are treated as static directive attachments
- are not deduplicated against chat history
- are not controlled by reuse_attachment_tags

7) Provider caveats

Current built-in adapter behavior:

- OpenAIModel supports multimodal user inputs and PDF inputs through the Responses API
- LegacyOpenAIModel supports image input but not PDF input
- AnthropicModel and GeminiModel support multimodal user inputs
- AnthropicModel and GeminiModel currently reject system-side media with a clear error

Skill media is therefore only usable with adapters that can carry system-side media content, which currently excludes AnthropicModel and GeminiModel.
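If you want to catch these limits before making a model call, a small preflight check is one option. The sketch below encodes only the restrictions stated in the list above; `preflight` and the two sets are hypothetical and not part of MASFactory, and an empty result means only that this sketch knows of no restriction, not that the adapter is guaranteed to accept the input.

```python
# Restrictions taken from the list above. Adapters missing from a set are
# simply not flagged by this sketch.
NO_PDF_INPUT = {"LegacyOpenAIModel"}
REJECTS_SYSTEM_MEDIA = {"AnthropicModel", "GeminiModel"}


def preflight(adapter_name: str, uses_pdf: bool, uses_skill_media: bool) -> list[str]:
    """Return human-readable problems for a planned adapter and input combination."""
    problems = []
    if uses_pdf and adapter_name in NO_PDF_INPUT:
        problems.append(f"{adapter_name} does not support PDF input")
    if uses_skill_media and adapter_name in REJECTS_SYSTEM_MEDIA:
        problems.append(f"{adapter_name} rejects system-side (skill) media")
    return problems


legacy_issues = preflight("LegacyOpenAIModel", uses_pdf=True, uses_skill_media=False)
anthropic_issues = preflight("AnthropicModel", uses_pdf=False, uses_skill_media=True)
```

Because the list describes current behavior, a table like this would need to be updated whenever the built-in adapters change.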

8) Minimal example

python

from masfactory import Agent, ImageAsset, OpenAIModel

agent = Agent(
    name="receipt_agent",
    model=OpenAIModel(model_name="gpt-4.1", api_key="..."),
    instructions="You are a careful receipt reviewer.",
    prompt_template="Please inspect {receipt_image} and answer: {question}",
    pull_keys={
        "question": "Question",
        "receipt_image": "IMAGE:Receipt image",
    },
    push_keys={"answer": "Answer"},
)

result = agent.step(
    {
        "question": "What is the total amount?",
        "receipt_image": ImageAsset.from_path("./receipt.png"),
    }
)
