The agentic AI future is upon us, and it poses age-old tradeoffs between security and productivity with higher stakes than ever.
In early 2026, the open-source Clawdbot agent gained massive traction for its ability to act independently on the user’s device while running locally for privacy. The appetite for such a powerful autonomous assistant was clear: the project gained over 85,000 GitHub stars in a single week. But many researchers, including our own, noted security gaps such as exposed gateways, plaintext credential storage and excessive permissions.
Both the risk and the productivity of AI agents stem from their privilege: the access granted to them to act on our behalf. It is almost certain that future intrusions will target AI systems.
We predict these attacks will fall into two pathways: targeting the open-source AI ecosystem and targeting an organization’s internal AI agents. Methodologies for securing these resources are nascent and emerging practically in real time, but in this blog, we’ll share what we know so far.
The Risks of Open Source AI Ecosystems
Open source AI systems are new and fast-evolving, and that immaturity carries risk. There is no standardized signing or integrity checking for models, and the high trust placed in popular repositories means attacks spread widely and rapidly, often before they are detected.
Yet open source is unavoidable when implementing AI. The open source AI ecosystem forms the backbone of the world’s current AI infrastructure. Every major LLM deployment, from Grok to ChatGPT, runs on an open source foundation, while proprietary layers handle business-specific execution.
While AI agents hold the potential to act as force multipliers within the business, they hold the same potential for threat actors. A single corrupted model, connector or dependency in the AI supply chain can be used across many teams and workflows, pushing hostile behavior everywhere at once.
Hidden Threats Inside AI Models: Model File Attacks
In a model file attack, attackers upload malicious AI model files to trusted open source repositories. These files look legitimate, sometimes with official branding, but contain hidden executable code. When a developer loads the model, the malicious payload is executed automatically. Common model file attacks can steal AWS credentials from metadata services, download remote access trojans and exfiltrate data to attacker servers. After that, the model usually functions normally, so users don’t notice the breach.
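Many model file attacks abuse serialization formats such as Python’s pickle, which can execute arbitrary code the moment a model is loaded. As a rough illustration of how scanners catch this before loading, the sketch below walks a pickle stream’s opcodes and flags imports of risky modules. The module list is our own illustrative choice, not an exhaustive one:

```python
import pickletools

# Modules whose appearance in a pickle stream suggests code execution
# on load (illustrative, not exhaustive).
SUSPICIOUS = {"os", "posix", "nt", "subprocess", "builtins", "runpy", "socket"}

def scan_pickle_bytes(data: bytes) -> list[str]:
    """Flag GLOBAL/STACK_GLOBAL opcodes that import risky modules."""
    findings = []
    recent = []  # recent string constants; heuristic to resolve STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            recent.append(arg)
            recent = recent[-2:]
        elif opcode.name == "GLOBAL":
            module = arg.split(" ")[0]
            if module.split(".")[0] in SUSPICIOUS:
                findings.append(arg)
        elif opcode.name == "STACK_GLOBAL" and len(recent) == 2:
            module, name = recent
            if module.split(".")[0] in SUSPICIOUS:
                findings.append(f"{module}.{name}")
    return findings
```

Because this only inspects the byte stream, it never executes the payload; production scanners apply the same idea with far more thorough opcode and format coverage.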
When Trusted AI Infrastructure Turns Against You: Rug Pull Attacks
In a rug pull attack, an attacker compromises the Model Context Protocol (MCP) server that an AI agent connects to, so that it performs malicious actions. MCP servers give AI agents tools and capabilities. Many of the most useful MCP servers are simply open source code projects maintained by untrusted third parties. If such a repository is compromised, an attacker can modify the MCP server to perform malicious actions after an LLM has been integrated with it, for example copying data and sending it to an outside source. End users who simply keep their tools up to date are exposed to rug pull attacks without ever being aware.
The alternative is to use remote MCP servers whose code is maintained by trusted organizations. Many popular platforms, such as GitHub, maintain their own remote MCP servers. These servers can be connected to and are generally trusted to the extent that an organization trusts the MCP provider. This does not prevent agents from performing malicious actions with the tools they are given via the remote MCP server; it simply reduces the risk of an MCP rug pull attack.
What Leaders Should Do Now
- We predict model file attacks will persist for the foreseeable future, and defending against them is the first step of any AI agent security strategy. Teams must scan model files with tools that can parse machine learning formats, and load models in isolated containers, virtual machines or browser sandboxes until verified clean.
- Remote MCP servers are generally safer if you trust the organization running them. Local MCP servers downloaded from GitHub are essentially code you don’t control. If your organization must use an open source local MCP server, run manual and automated static code analysis on the code to confirm it is safe, and repeat that analysis any time the MCP server is updated.
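One lightweight way to operationalize that review cycle is to pin each vetted local MCP server to the exact hash recorded during its last security review, and refuse to run anything that has drifted. A minimal sketch, with a hypothetical allowlist and file path:

```python
import hashlib
from pathlib import Path

# Hypothetical allowlist: path of a vetted local MCP server archive mapped
# to the SHA-256 recorded at the time of its last security review.
APPROVED_HASHES = {
    "mcp-servers/filesystem-server.tar.gz":
        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def is_approved(archive_path: str) -> bool:
    """Return True only if the archive still matches its reviewed hash."""
    expected = APPROVED_HASHES.get(archive_path)
    if expected is None:
        return False  # never reviewed: refuse to run it
    digest = hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()
    return digest == expected  # changed upstream: re-review before trusting
```

An update from GitHub then fails closed: the new code cannot run until it passes review and its hash is re-recorded.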
The Risks of Compromised AI Agents
If an AI agent is like a supercharged employee, a compromised AI agent is like a supercharged insider threat. Delegating authority to agents gives them access and privileges that would normally require human action. They can send fraudulent messages, alter approvals and permissions, exfiltrate data, approve incorrect financial actions and more.
Because agents are trusted internally, suspicious behavior is likely to go unnoticed until something breaks.
For predictive models used in business intelligence, manipulation will influence business decisions in ways that may go unnoticed until financial or regulatory harm surfaces. Language model exploitation will likely center on data extraction tactics. A compromised agent will enable multi-step fraud and data harvesting at the speed of an automated system acting as an internal user.
Malicious use of agents may not even be the largest threat surface. Because of their nondeterministic behavior, it will not be uncommon for trusted users to unintentionally trigger harmful actions through an organization’s agents.
What Leaders Should Do Now
- Implement soft defenses such as guardrails to protect against prompt injection attacks as a first step. Prompt injection guardrails are a soft defense because while they can detect and block the majority of prompt injections and jailbreaks, it is currently impossible to deterministically block all prompt injections or jailbreaks. The fundamental architecture of LLMs means it’s impossible to perfectly separate the data and control planes (i.e., system prompts versus user instructions).
- Implement hard defenses, such as paring the permissions and tools an agent can use down to the absolute necessities. This is the only deterministic way to prevent agents from performing malicious actions. For example, an agent doing meeting prep by reading your emails needs a read_email() tool, but it definitely does not need a write_email() tool. Whitelisting is also a strong defense against indirect prompt injection. If an internal agent is meant to help employees answer workplace questions, whitelisting only the organization’s domains prevents the agent from ingesting untrusted third-party data. If the agent were instead given unrestricted web access, it could ingest potentially malicious text.
- Do not rely on security instructions in the agent’s system prompt. System prompts should be treated as public information, since organizations cannot deterministically prevent every prompt injection that might leak them, and LLMs do not follow their prompt instructions 100% of the time. Many developers have dealt with unintentionally deleted data despite explicitly stating, “Do not delete the database.”
- Detailed logging of agent actions is a must. Agentic identity is currently a difficult problem to solve. Agents generally need to perform actions using the user’s permissions. OAuth2 is a secure standard for delegating permissions, but it has blind spots. Computer Use agents (agents that control a computer and browser like a human) are one such blind spot. Logging and log analysis are the best ways to proactively monitor agent actions with provenance.
- Choose only one brand of AI ecosystem. Deciding on only one ecosystem, such as Claude, OpenAI or Gemini for example, can make it easier to institute organization-wide security rules around their tooling, including rules preventing coding agents from performing certain tool calls or being able to read from untrusted third-party data sources.
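The tool-paring advice above can be enforced deterministically in code, outside the model, so no prompt injection can widen what an agent is allowed to do. A minimal sketch using the hypothetical read_email()/write_email() tools from the example:

```python
class ToolRegistry:
    """Expose only explicitly allowlisted tools to the agent."""

    def __init__(self, allowed):
        self._allowed = set(allowed)
        self._tools = {}

    def register(self, name, fn):
        self._tools[name] = fn

    def call(self, name, *args, **kwargs):
        # Enforced outside the model: no prompt can widen this set.
        if name not in self._allowed:
            raise PermissionError(f"tool '{name}' is not allowlisted")
        return self._tools[name](*args, **kwargs)

# Hypothetical tools for a meeting-prep agent (names from the text above).
def read_email(inbox):
    return list(inbox)

def write_email(to, body):
    raise RuntimeError("should never be reachable")

registry = ToolRegistry(allowed=["read_email"])
registry.register("read_email", read_email)
registry.register("write_email", write_email)  # available, but never granted
```

Because the allowlist lives in application code rather than the prompt, a manipulated model can ask for write_email() all it wants; the call site refuses deterministically.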
The Strategic Tradeoff Every Enterprise Must Decide
The immense efficiency gains promised by AI agents will raise the risk tolerance of the average enterprise. Organizations face a major question: What is the minimum degree of control that can be placed on agents without seriously undermining their return on investment?
Keep it simple. Identify the simplest security policies possible, implement them and revisit those policies every eight weeks. That’s how fast AI is evolving.
Strictly enforce agent access controls. The more power and permissions an agent has, the more strictly organizations must enforce access controls. Agents with read-only access to resources present a significantly lower threat surface than agents with write permissions. Even if an agent is compromised or manipulated, the boundaries set by the hard-coded permissions will drastically limit the blast radius.
Treat agents as potentially rogue employees or contractors. Our research, and the experience of others, has found that AI agents occasionally perform harmful actions simply due to their nondeterministic architecture. Apply architectural limits and ensure every AI agent action goes through checkpoints you can monitor, log and disable if necessary.
The Future of the AI Supply Chain
Centralized org-specific agents accessible via an API or URL are continuing to provide time savings, but local and customizable agents such as Claude Cowork and OpenClaw are likely to be the significant drivers of productivity in the near future.
These trends, along with the rapid pace of development, point to the growing importance of the AI supply chain. Models and agents rely on layers of external code, datasets, connectors and APIs. A single compromised link can push hostile behavior into multiple systems at once. As integration accelerates, securing AI will become a core part of modern resilience and will demand the same level of governance and validation applied to any other critical system.
At Unit 42, our elite threat researchers and responders live on the bleeding edge of AI. We’ll help you empower safe AI use and development across your organization. We can:
- Discover and evaluate how AI is already being used in your organization.
- Assess AI development infrastructure and processes, giving your organization a personalized benchmark against Unit 42’s robust AI security framework.
- Provide expert guidance to secure deployed AI apps using automated tools and expert-led threat modeling.
- Offer recommendations on proactively leveraging AI to enhance the SOC and respond to threats at machine speed.
To read more about the evolving AI threat landscape, check out the full 2026 Unit 42 Global Incident Response Report, and learn more about how Unit 42 can help you turn risk into resilience.