Multimodal Inputs
This chapter explains how MASFactory handles multimodal fields, attachments, history references, and the current caveats around provider adapters.
1) Field declarations
Multimodal support is driven by FieldSpec / FieldModality. You can declare fields in either mapping form or the lightweight string form:
pull_keys = {
"receipt_image": "IMAGE:Receipt image",
"invoice_pdf": "PDF:Invoice PDF",
}Supported modalities:
TEXTIMAGEPDFANY
The lightweight prefix is case-insensitive, so pdf:Invoice PDF is valid too.
2) Media asset objects
MASFactory normalizes attachments into provider-agnostic asset types:
ImageAssetPdfAsset
Common constructors:
from masfactory import ImageAsset, PdfAsset
image = ImageAsset.from_path("./receipt.png")
pdf = PdfAsset.from_path("./invoice.pdf")Assets can also be created from bytes, base64, URLs, and provider file IDs when the selected model adapter supports those sources.
3) What the Agent does with attachments
When an input field contains media assets:
Agent.observe()validates the field against its declared modality.- The current turn receives attachment tags such as
[receipt_image_1 Receipt image]. - New attachments are sent as
MediaMessageBlocks alongside text instructions. - The formatted user prompt references the attachment by tag.
This keeps prompt text readable while still preserving structured media blocks for capable adapters.
4) reuse_attachment_tags
reuse_attachment_tags only controls current-turn deduplication.
If True:
- repeated identical attachments within the same turn reuse the first tag
- if the attached history provider returns rich historical media blocks for the same asset, the Agent may reuse that historical tag instead of resending the attachment
If False:
- the current turn always emits fresh tags and fresh media blocks
This flag does not configure how history is stored or merged.
5) History behavior
History policy belongs to the concrete history implementation.
For built-in HistoryMemory:
- only one
HistoryProvider-backed memory may be attached to anAgent merge_historical_media=Truerewrites repeated historical attachments into indexed tag referencesmerge_historical_media=Falsekeeps raw historical media blocks intact
That means the richness of history returned to the Agent is decided by the history provider itself, and the Agent adapts to what it receives.
6) Skill media
Skills may declare static media in SKILL.md frontmatter:
---
name: receipt-skill
media:
- type: image
path: guide.png
mime_type: image/png
---
Always compare the receipt against the guide image.These skill media assets:
- stay with the skill text on the system side
- are treated as static directive attachments
- are not deduplicated against chat history
- are not controlled by
reuse_attachment_tags
7) Provider caveats
Current built-in adapter behavior:
OpenAIModelsupports multimodal user inputs and PDF inputs through the Responses APILegacyOpenAIModelsupports image input but not PDF inputAnthropicModelandGeminiModelsupport multimodal user inputsAnthropicModelandGeminiModelcurrently reject system-side media with a clear error
So skill media is only usable with adapters that can carry system-side media content.
8) Minimal example
from masfactory import Agent, ImageAsset, OpenAIModel
agent = Agent(
name="receipt_agent",
model=OpenAIModel(model_name="gpt-4.1", api_key="..."),
instructions="You are a careful receipt reviewer.",
prompt_template="Please inspect {receipt_image} and answer: {question}",
pull_keys={
"question": "Question",
"receipt_image": "IMAGE:Receipt image",
},
push_keys={"answer": "Answer"},
)
result = agent.step(
{
"question": "What is the total amount?",
"receipt_image": ImageAsset.from_path("./receipt.png"),
}
)