Transfer Pricing and Its Future: Rebuilding Compliance with AI

Sachin Yadav·16 min read

Multinational enterprises save billions in taxes with one simple trick, and tax authorities hate it. Here's why transfer pricing is the most scrutinised area in international tax, and what it actually takes, in engineering terms, to rebuild the workflow with AI.

The trick: why transfer pricing exists at all

Suppose your company operates in multiple countries. When one part of the company (say, in the US) sells something to another part (say, in India), it must set a price. That price is the transfer price.

Different countries tax differently, so groups have a structural incentive to shift profits to lower-tax jurisdictions.

A toy example. Company AB has subsidiaries A (30% tax) and B (10% tax). A manufactures ball-bearings at $50 and sells them to B at $70. B sells to the market at $90.

Entity	Cost basis	Sale price	Profit	Tax rate	Tax
A	$50	$70	$20	30%	$6
B	$70	$90	$20	10%	$2
Group total					$8

Now A sells to B at $60 instead:

Entity	Cost basis	Sale price	Profit	Tax rate	Tax
A	$50	$60	$10	30%	$3
B	$60	$90	$30	10%	$3
Group total					$6

A $10 cut in the transfer price lowered the group's tax bill by 25%, from $8 to $6, even though the group's total pre-tax profit stayed at $40. Scaled across thousands of transactions, this adds up to billions in shifted profits, which is exactly what tax authorities want to prevent.
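The arithmetic behind the two tables fits in a few lines. This is a minimal sketch using only the toy figures above; the function name `group_tax` and its parameters are illustrative, not part of any real tax engine.

```python
from decimal import Decimal

def group_tax(transfer_price,
              cost=Decimal("50"), market_price=Decimal("90"),
              rate_a=Decimal("0.30"), rate_b=Decimal("0.10")):
    """Total tax paid by A and B for a given intra-group transfer price."""
    profit_a = transfer_price - cost           # A manufactures and sells to B
    profit_b = market_price - transfer_price   # B resells to the market
    return profit_a * rate_a + profit_b * rate_b

before = group_tax(Decimal("70"))   # 20 * 0.30 + 20 * 0.10
after = group_tax(Decimal("60"))    # 10 * 0.30 + 30 * 0.10
saving_pct = (before - after) / before * 100

print(f"tax at $70: {before}, tax at $60: {after}, saving: {saving_pct}%")
```

Total pre-tax profit is $40 in both cases. Only the split between the 30% and the 10% jurisdiction changes, which is why the transfer price alone moves the tax bill.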

Profit shifting under different transfer prices

The compliance burden

To prove transactions are legitimate, multinational groups must produce hundreds of pages of documentation per jurisdiction, justifying every related-party transaction against the Arm's Length Principle. A typical filing cycle takes 6 to 8 weeks per entity and pulls in tax, finance, legal, and operations.

This is where the AI opportunity sits. Not because TP is "automatable", but because 60 to 70% of the cycle is mechanical document work performed under structured rules. That's exactly the shape of problem modern AI systems are good at, if the engineering is done seriously.

The rest of this piece walks through how TP documentation actually works, and where production-grade AI changes each step. I'll be specific about the stack.

Reference architecture: five-stage AI pipeline for TP

The Arm's Length Principle (and why related-party detection is non-trivial)

The OECD principle is simple: price it like you would with an unrelated party. If you'd charge an outsider $100, charge your subsidiary $100 too.

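Related parties are typically entities linked by ownership above a threshold (>25% is used here; the exact figure is set by local law and varies by jurisdiction), but also entities related through: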

Family relatives owning other companies
Control through contracts
Economic dependence (for example, one customer accounting for more than 50% of revenue)
Shared key personnel or resources

Identifying related parties across complex group structures (indirect ownership, fan-out trusts, contractual control) is hard manual work. This is a clean graph problem.

The AI engineering. Ingest ownership records, board minutes, and shareholder agreements into a typed entity graph (nodes = entities/people, edges = ownership/control/contractual relationships with effective dates). Use entity resolution (embedding similarity plus an LLM judge with provenance) to deduplicate variant spellings of the same legal entity across jurisdictions. Then run graph traversal queries (transitive ownership, indirect control) to surface every entity that triggers TP documentation. Weeks of corporate-structure analysis collapse to a query, with a full audit trail of which document justified each edge.
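The transitive-ownership query can be sketched as follows. It assumes one common convention: effective ownership is the sum, over every ownership path, of the product of stakes along that path. The entity names, stakes, and the 25% threshold are illustrative, and control through contracts or personnel would need separate edge types and rules.

```python
from collections import defaultdict
from fractions import Fraction

def effective_ownership(edges, root):
    """Sum, over every ownership path from root, the product of stakes on the path."""
    graph = defaultdict(list)
    for owner, owned, stake in edges:
        graph[owner].append((owned, stake))

    totals = defaultdict(Fraction)

    def walk(node, stake_so_far, seen):
        for child, stake in graph[node]:
            if child in seen:          # this sketch simply ignores circular holdings
                continue
            through = stake_so_far * stake
            totals[child] += through
            walk(child, through, seen | {child})

    walk(root, Fraction(1), frozenset({root}))
    return dict(totals)

# Hypothetical group: (owner, owned entity, direct stake)
edges = [
    ("Parent", "HoldCo", Fraction(80, 100)),
    ("HoldCo", "OpCo", Fraction(50, 100)),
    ("Parent", "OpCo", Fraction(10, 100)),
    ("Parent", "JV", Fraction(20, 100)),
]
THRESHOLD = Fraction(25, 100)  # assumption; set per jurisdiction

stakes = effective_ownership(edges, "Parent")
related = sorted(name for name, stake in stakes.items() if stake > THRESHOLD)
print(related)
```

Here OpCo is held 10% directly and 40% through HoldCo, so it crosses the threshold only once the indirect path is counted. That is the case manual reviews tend to miss.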

Related-party detection as a graph problem

The 5-step process, and the AI engineering for each

The TP documentation cycle is fragmented and time-intensive. Here's how it works, and the specific AI architecture that changes each step.

Step 1. Data Gathering

What teams do today. Collect financial statements across entities; gather contracts, invoices, and intercompany agreements; map every intercompany transaction. Then chase missing information across time zones, ERP systems, and shared drives.

Where the time goes. Most documents arrive as PDFs (some scanned), Excel models with merged cells and footnotes, PPT decks, and email chains. Pulling them into a structured form that an analyst can reason over is the long pole.

The AI engineering.

Layout-aware extraction. Plain OCR fails on multi-column financials, footnoted tables, and stamped contracts. Use a layout-aware parser (Azure Document Intelligence, AWS Textract, or open-source alternatives like Docling, Unstructured, or LlamaParse) that preserves page → block → table → cell hierarchy with bounding boxes. For hard scans, fall back to a vision-language model (Claude Sonnet, GPT-4o) prompted with the target schema.
Schema-bound structured extraction. Define Pydantic schemas per document type: ICA(parties, effective_date, scope, pricing_method, term), Invoice(supplier, buyer, line_items, currency, tax_codes), FinancialStatement(entity, period, line_items_by_taxonomy). Extract with constrained decoding (Instructor, Outlines, vendor JSON modes) so the model can only emit valid shapes. Every field carries a confidence and a source_ref pointing back to PDF, page, and bbox.
Entity resolution. The same vendor appears as "Acme Corp.", "Acme Corporation Pvt Ltd", and "ACME CORP" across systems. Resolve with a two-stage pipeline: embedding-based candidate generation plus an LLM judge with a deterministic prompt and structured rationale. Persist canonical IDs.
Reconciliation as a workflow, not a model call. Cross-entity matching (parent records show $5M outflow, sub records show $4.8M inflow: currency? timing? missing invoice?) belongs in deterministic code with LLM assistance on exceptions, not the other way around.
Confidence-routed HITL. Every extracted field has a calibrated confidence score. Fields below threshold land in a human review queue with the source snippet rendered next to the model's proposal. Reviewer corrections become training signal.
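The provenance and confidence-routing ideas above can be sketched with plain dataclasses. This uses the standard library to stay dependency-free; a Pydantic model would add validation on top. The class names, the example values, and the 0.9 threshold are assumptions for illustration, and the confidence is assumed to have been calibrated upstream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceRef:
    document: str
    page: int
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates

@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed calibrated upstream
    source: SourceRef

def route(fields, threshold=0.9):
    """Split extracted fields into auto-accepted and human-review queues."""
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    return accepted, review

fields = [
    ExtractedField("effective_date", "2024-04-01", 0.98,
                   SourceRef("ica_2024.pdf", 2, (72, 410, 310, 428))),
    ExtractedField("pricing_method", "TNMM", 0.71,
                   SourceRef("ica_2024.pdf", 7, (72, 190, 540, 236))),
]

accepted, review = route(fields)
for f in review:
    # The reviewer sees the proposal next to where it came from.
    print(f"REVIEW {f.name}={f.value!r} (conf {f.confidence:.2f}) "
          f"from {f.source.document} p.{f.source.page} bbox={f.source.bbox}")
```

Because every field carries its `SourceRef`, the review queue can render the original snippet beside the model's proposal without a second lookup.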

Output of Step 1. A queryable, typed, provenance-tagged data warehouse where every number traces to its source page.
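The reconciliation point from Step 1 (deterministic code first, LLM only on exceptions) might look like this. It is a sketch under stated assumptions: both ledgers are already converted to one currency, transactions share a reference key, and the invoice references and amounts are made up.

```python
from decimal import Decimal

def reconcile(outflows, inflows, tolerance=Decimal("0.01")):
    """Match intercompany payments by reference; return (matched, exceptions)."""
    matched, exceptions = [], []
    for ref, paid in outflows.items():
        received = inflows.get(ref)
        if received is None:
            exceptions.append((ref, "missing on receiving side", paid))
        elif abs(paid - received) <= tolerance:
            matched.append(ref)
        else:
            exceptions.append((ref, "amount mismatch", paid - received))
    for ref in sorted(inflows.keys() - outflows.keys()):
        exceptions.append((ref, "missing on paying side", inflows[ref]))
    return matched, exceptions

parent_outflows = {"INV-001": Decimal("5000000"), "INV-002": Decimal("120000")}
sub_inflows = {"INV-001": Decimal("4800000"), "INV-002": Decimal("120000"),
               "INV-003": Decimal("75000")}

matched, exceptions = reconcile(parent_outflows, sub_inflows)
for ref, reason, amount in exceptions:
    print(ref, reason, amount)
```

Only the entries in `exceptions` would go to an LLM (or a person) to explain whether the gap is currency, timing, or a missing invoice. Clean matches never touch a model.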

Step 2. FAR Analysis (Functions, Assets, Risks)

What teams do today. Analyse what each entity does, what it owns, and what risks it bears. Determine the "tested party", typically the entity with simpler, routine functions. Argue that profit should follow function, risk, and asset contribution.

Example.

Entity	Functions	Assets	Risks
PharmaCo USA (Parent)	R&D, clinical trials, patent ownership	Patents, trademarks, R&D facilities	R&D, clinical trial, regulatory
PharmaCo Ireland (Mfg)	Manufacturing, QC, warehousing	Equipment, factory, inventory	Production, quality, inventory
PharmaCo Germany (Dist.)	Sales, marketing, distribution	Customer relationships, network	Market, credit, FX

Expected profit split: USA 60 to 70%, Germany 20 to 30%, Ireland 10 to 15%.

For intangibles, OECD's DEMPE framework (Development, Enhancement, Maintenance, Protection, Exploitation) overrides legal ownership: the entity that performs the value-creating functions earns the returns, not just the entity that holds the IP on paper.

The AI engineering.

Retrieval-augmented FAR drafting. Build a retrieval index over: prior-year FAR analyses (the strongest signal for next year), the group's intercompany agreements, operational documents, transfer-pricing thought-leadership, OECD guidelines, and country-specific rulings. Use hybrid retrieval (BM25 plus dense embeddings). Pure dense retrieval underperforms on regulatory text where exact term-matching matters ("DEMPE", "TNMM", section numbers).
Function, asset, and risk classification as structured generation. For each entity, generate a typed FAR object with cited evidence per claim. A function like "performs clinical trials" must point at a specific contract clause or operational document. Unsubstantiated claims are rejected by a validator before they reach the draft.
DEMPE reasoning as an agentic pattern. This is not a single prompt. It's a multi-step reasoning task: (a) enumerate intangibles, (b) identify legal owner, (c) for each DEMPE function, retrieve evidence of who actually performs it, (d) flag mismatches between legal and economic ownership. Run this as a graph of typed tool calls, not a free-form chain-of-thought.
LLM-as-judge with a partner-authored rubric. A senior TP partner writes the rubric (10 to 20 criteria with examples of pass or fail). The judge model scores each FAR section against it, with a written rationale. Disagreements between judge and reviewer become eval set entries.
Eval harness, non-negotiable. Gold set of 50 to 100 historical FAR analyses scored by senior partners. Any model or prompt change is gated on this set. Without an eval harness, you have no way to know whether your "improvements" are regressions.
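The "unsubstantiated claims are rejected by a validator" rule is simple to make concrete. A minimal sketch follows; the `Claim` shape and the source identifiers are hypothetical, and a real validator would also check that the cited passage actually supports the claim, which this does not attempt.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    evidence: list = field(default_factory=list)  # source identifiers

def validate_far(claims, known_sources):
    """Return (claim text, reason) for claims with no evidence or unresolvable sources."""
    rejected = []
    for claim in claims:
        if not claim.evidence:
            rejected.append((claim.text, "no evidence cited"))
            continue
        unknown = [ref for ref in claim.evidence if ref not in known_sources]
        if unknown:
            rejected.append((claim.text, f"unresolved sources: {unknown}"))
    return rejected

known_sources = {"ica_2024.pdf#p4", "ops_manual.pdf#p12"}
claims = [
    Claim("Performs clinical trials", ["ica_2024.pdf#p4"]),
    Claim("Bears inventory risk"),
    Claim("Owns the distribution network", ["board_minutes.pdf#p2"]),
]

rejected = validate_far(claims, known_sources)
for text, reason in rejected:
    print(f"REJECTED: {text} ({reason})")
```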

Step 3. Comparability and Benchmarking

What teams do today. Search TP databases (TP Catalyst, RoyaltyRange, Orbis, Capitaline, Prowess) for comparable companies. Screen thousands down to 10 to 20 defensible comparables. Apply the most appropriate method: comparable uncontrolled price (CUP), resale price (RPM), cost plus (CPM), profit split (PSM), or transactional net margin (TNMM). Then calculate the arm's-length range.

The Profit Level Indicator (PLI) is the financial ratio you compare on. Pick the wrong PLI and the entire benchmark is indefensible.

PLI	Formula	When to use
Operating Margin on Sales (ROS)	Operating profit / Sales	Distributors
Cost Plus Markup	Operating profit / Total costs	Routine manufacturers, service providers
Berry Ratio	Gross profit / Operating expenses	Limited-risk distributors with low value-add
Return on Assets (ROA)	Operating profit / Assets	Capital-intensive entities
Return on Capital Employed (ROCE)	EBIT / Capital employed	Asset-heavy, long-cycle businesses
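The formulas in the table translate directly into code. The sketch below uses made-up financials chosen so that gross profit minus operating expenses equals operating profit; the dictionary keys are illustrative field names, not a standard taxonomy.

```python
def compute_plis(f):
    """Compute the five PLIs from the table for one entity's financials."""
    return {
        "ROS": f["operating_profit"] / f["sales"],
        "cost_plus_markup": f["operating_profit"] / f["total_costs"],
        "berry_ratio": f["gross_profit"] / f["operating_expenses"],
        "ROA": f["operating_profit"] / f["assets"],
        "ROCE": f["ebit"] / f["capital_employed"],
    }

# Hypothetical entity: sales 1000, cost of goods 700, operating expenses 200
financials = {
    "sales": 1000,
    "total_costs": 900,          # 700 cost of goods + 200 operating expenses
    "gross_profit": 300,         # 1000 - 700
    "operating_expenses": 200,
    "operating_profit": 100,     # 300 - 200
    "assets": 500,
    "ebit": 100,
    "capital_employed": 400,
}

plis = compute_plis(financials)
for name, value in plis.items():
    print(f"{name}: {value:.4f}")
```

The same entity shows a 10% return on sales but an 11.1% cost-plus markup and a Berry ratio of 1.5. The choice of PLI changes which comparables look in range, which is why it has to be justified rather than picked for convenience.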

The AI engineering.

Comparables search as a retrieval problem. Index the comparables database with two signals: (a) deterministic filters (industry codes, country, revenue band, independence status) and (b) dense embeddings of business descriptions for semantic similarity. Hybrid retrieval surfaces 50 to 100 candidates fast.
OECD comparability factors as a structured screen. For each candidate, an LLM judge scores it on the OECD's five comparability factors (product characteristics, FAR, contractual terms, economic circumstances, business strategies) and rejects with a written rationale per fail. Rejection rationales are the single most-audited artefact in a TP report. They must be high-quality and consistent. This is where most "AI for benchmarking" tools fail today.
LLMs do not do arithmetic on report numbers. PLI computation, interquartile ranges, median tests: all in deterministic code. The LLM proposes the method and writes the narrative; pandas computes the numbers. Mixing these is how hallucinated numbers end up in a regulator's hands.
Multi-method backtesting. Run CUP, RPM, CPM, and TNMM in parallel where data exists; surface a comparability matrix showing which method is most defensible given the comparable set you actually have. Sensitivity-test by perturbing the comparable set (drop top, drop bottom, re-run) to expose fragile conclusions before a tax authority does.
Caching with provenance. Every comparable in the final set is stored with its source DB snapshot, filter trace, rejection log of nearby alternatives, and the analyst who signed off. This is the audit defence.
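The interquartile range and the drop-top, drop-bottom sensitivity test belong in deterministic code. The sketch below uses the standard library's `statistics` module rather than pandas to stay dependency-free. The margins are invented, and the `"inclusive"` quartile method is an assumption: interpolation conventions differ between tools and jurisdictions, so use the one your jurisdiction or database prescribes.

```python
from statistics import median, quantiles

def arms_length_range(margins):
    """Return (Q1, median, Q3) of the comparable set's margins."""
    q1, _, q3 = quantiles(margins, n=4, method="inclusive")
    return q1, median(margins), q3

def sensitivity(margins, tested_margin):
    """Check whether the tested party stays in range when the set is perturbed."""
    ordered = sorted(margins)
    scenarios = {
        "full set": ordered,
        "drop top": ordered[:-1],
        "drop bottom": ordered[1:],
    }
    result = {}
    for label, subset in scenarios.items():
        q1, _, q3 = arms_length_range(subset)
        result[label] = q1 <= tested_margin <= q3
    return result

# Hypothetical operating margins (%) of nine accepted comparables
comparables = [2.1, 3.4, 3.9, 4.5, 5.0, 5.6, 6.2, 7.8, 9.5]
tested = 4.0

q1, med, q3 = arms_length_range(comparables)
flags = sensitivity(comparables, tested)
print(f"range: {q1} to {q3}, median {med}")
print(flags)
```

With these numbers the tested party sits inside the range on the full set but falls out once the weakest comparable is dropped. That is a conclusion that flips, and it needs a written defence before the report goes out.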

Comparables search as a retrieval funnel
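The first two stages of that funnel, deterministic filters followed by semantic ranking, can be sketched like this. The company names, the industry label, and the 3-dimensional vectors are toy stand-ins; real embeddings would come from whichever embedding model the pipeline uses.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def shortlist(candidates, query_vec, country, industries, top_k=50):
    """Apply hard filters first, then rank survivors by description similarity."""
    eligible = [
        c for c in candidates
        if c["country"] == country
        and c["industry"] in industries
        and c["independent"]
    ]
    ranked = sorted(eligible,
                    key=lambda c: cosine(c["embedding"], query_vec),
                    reverse=True)
    return [c["name"] for c in ranked[:top_k]]

candidates = [
    {"name": "Alpha Distributors", "country": "IN", "industry": "wholesale",
     "independent": True, "embedding": [0.9, 0.1, 0.0]},
    {"name": "Beta Trading", "country": "IN", "industry": "wholesale",
     "independent": True, "embedding": [0.2, 0.9, 0.1]},
    {"name": "Gamma Wholesale", "country": "IN", "industry": "wholesale",
     "independent": False, "embedding": [1.0, 0.0, 0.0]},   # group-owned
    {"name": "Delta GmbH", "country": "DE", "industry": "wholesale",
     "independent": True, "embedding": [1.0, 0.0, 0.0]},    # wrong country
]

names = shortlist(candidates, [1.0, 0.0, 0.0], "IN", {"wholesale"})
print(names)
```

Filtering before ranking matters for the audit trail: a candidate excluded by a hard rule (not independent, wrong jurisdiction) gets a deterministic rejection reason, and only the semantic ranking is left to the model.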

Step 4. Report Writing

What teams do today. Draft 50 to 200 pages per entity and jurisdiction, organised into the OECD three-tier structure:

Master File. Group overview: org structure, intangibles, financial activities, business descriptions.
Local File. Country-specific: all intercompany transactions, FAR, comparables, benchmarking.
Country-by-Country (CbC) report. Standardised allocation of revenue, profit, tax, and employees across jurisdictions.

Each report justifies the chosen transfer pricing method, explains edge cases, and demonstrates compliance with local rules.

The AI engineering.

Template-grounded generation. Sections are mostly fixed: executive summary, group overview, FAR, method selection, comparability analysis, conclusion. The template lives in code; the LLM fills sections with retrieved facts. Free-form generation is a red flag.
Citations are mandatory and machine-checked. Every numerical claim cites the underlying data row; every regulatory claim cites the underlying guideline. A post-generation validator walks the draft, resolves every citation, and rejects unsupported claims. Hallucinated section numbers (e.g. "OECD TPG Chapter VI Section 6.42") are caught here, not by a regulator.
Verbatim regulatory phrasing via retrieval. Where regulators expect specific language (Indian Rule 10D, UAE FTA TP guide, German GAufzV), don't paraphrase. Retrieve and quote. Paraphrasing regulatory text is how firms lose audits.
Jurisdiction rules as a policy layer. Country-specific local file requirements (India needs Form 3CEB sections, UAE needs Disclosure Form alignment, Germany needs Sec. 90(3) AO) are encoded as a structured rule set. The generator checks every required section is present and populated; missing items block the draft.
Style transfer per firm. Each firm has a voice: formal, plain, footnote-heavy. Few-shot from 3 to 5 prior reports of the firm to lock voice without retraining.
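The policy layer that blocks incomplete drafts can be as plain as a set comparison. In the sketch below the section keys are hypothetical labels for the items named above. They are not an authoritative list of what any jurisdiction requires.

```python
# Sections every local file in this sketch is expected to contain
COMMON_SECTIONS = {
    "executive_summary", "group_overview", "far",
    "method_selection", "comparability_analysis", "conclusion",
}

# Hypothetical keys for the jurisdiction-specific items mentioned in the text
JURISDICTION_EXTRAS = {
    "IN": {"form_3ceb"},
    "AE": {"disclosure_form"},
    "DE": {"sec_90_3_ao"},
}

def missing_sections(draft, jurisdiction):
    """Return required sections that are absent or empty; non-empty result blocks the draft."""
    required = COMMON_SECTIONS | JURISDICTION_EXTRAS[jurisdiction]
    return sorted(name for name in required if not draft.get(name, "").strip())

draft = {name: "drafted text" for name in COMMON_SECTIONS}
draft["conclusion"] = "   "   # present but effectively empty

blocked_by = missing_sections(draft, "IN")
if blocked_by:
    print("Draft blocked, missing:", ", ".join(blocked_by))
```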

Step 5. Final Review

What teams do today. Senior TP experts review reports, take client feedback, revise, and sign off. This is where errors are most expensive. A missed inconsistency goes to the tax authority.

The AI engineering.

LLM-as-judge with a compliance checklist. Per-jurisdiction checklists (50 to 100 items) score each report section. Outputs: pass, fail, or human-review-required, each with cited evidence from the draft.
Numerical consistency checks. The same PLI often appears in 4 to 6 places (exec summary, method section, benchmark, conclusion). A deterministic check ensures every occurrence matches exactly, down to the last reported decimal.
Citation auditing. Every regulatory reference is verified against the canonical source. Hallucinated citations are the single highest-frequency failure mode of LLM-generated legal text.
Sensitivity replay. Re-run the benchmark with perturbed comparable sets and surface conclusions that flip. Those need a written defence in the report.
Human-in-the-loop is permanent, not transitional. TP reports are signed by a person; the regulator pursues that person. The AI's job is to make the reviewer 5× faster and more confident, not to remove them.
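The numerical consistency check is a good example of something that should never be left to a model. A minimal sketch, assuming the PLI is always phrased as "operating margin of X%"; the regex and section names are illustrative, and a production version would work from structured citations rather than pattern-matching prose.

```python
import re
from decimal import Decimal

PLI_PATTERN = re.compile(r"operating margin of (\d+(?:\.\d+)?)%", re.IGNORECASE)

def pli_mentions(sections):
    """Collect every stated operating margin per report section."""
    return {
        name: [Decimal(value) for value in PLI_PATTERN.findall(text)]
        for name, text in sections.items()
    }

def inconsistent_sections(sections, expected):
    """Return sections where any stated margin differs from the computed one."""
    return sorted(
        name for name, values in pli_mentions(sections).items()
        if any(value != expected for value in values)
    )

report = {
    "executive_summary": "The tested party earned an operating margin of 4.35%.",
    "benchmark": "An operating margin of 4.35% falls within the range.",
    "conclusion": "With an operating margin of 4.53%, the pricing is arm's length.",
}

flagged = inconsistent_sections(report, Decimal("4.35"))
print(flagged)
```

The transposed digits in the conclusion are the kind of error a tired reviewer reads straight past on page 140. A deterministic check catches them every time.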

Review and governance: four parallel checks before sign-off

Where AI breaks in transfer pricing: the honest part

If you've shipped AI in regulated domains, you know the failure modes. They show up in TP too:

Comparable rejection is judgment-heavy. An LLM can suggest rejections; it can't own them. Senior judgment about whether a comparable is "good enough" is shaped by prior audit experience the model doesn't have.
Novel transactions have no prior pattern. A new product line, a restructuring, a first-year UAE CT filing: retrieval gives you nothing similar. The model degrades to general reasoning, and general reasoning isn't audit-defensible.
Penalty exposure is asymmetric. A false positive (claiming a method is defensible when it isn't) has real downside: penalties can reach 100 to 300% of the tax shortfall in some jurisdictions. Calibration of confidence scores matters more than headline accuracy.
Regulators are correctly skeptical. "An AI wrote this" is not a defence. The reviewer's name on the report is the defence. Build the system around making that human's life better, not replacing them.
Data quality dominates model quality. A better model on bad ERP data produces confident bad outputs faster. Step 1 is 80% of the value.

The firms that get this right will treat TP automation as an engineering problem with a regulated boundary: data pipelines, evals, audit logs, HITL gates. Not as a prompt engineering exercise.

Data security and compliance: the part regulators and clients actually ask about

TP data is unusually sensitive. A typical engagement touches material non-public financial information, full intercompany agreements, supplier and customer lists, and personally identifiable information for directors and key employees. If any of this leaks, the downside is regulatory penalties, securities-law exposure on the client side, and the end of the engagement. Before any pilot ships, this is the conversation procurement and the client's CISO will have.

A few things to be specific about, since most "enterprise AI" pitches stay vague here.

HIPAA does not apply to TP. It is a US health-data law. Mentioning it in a TP proposal is a sign the vendor is reading from a generic template. The frameworks that do apply:

SOC 2 Type II. The table-stakes attestation any Big 4 or regulated client will ask for. Covers security, availability, processing integrity, confidentiality, and privacy. Type II is the one that matters (it audits controls over a period, not at a point in time).
ISO 27001. Preferred outside the US, especially in EU and APAC enterprise procurement.
India DPDP Act, UAE PDPL, GDPR. Apply to the personally identifiable parts of TP filings (director KYC, employee benchmarking data). Data residency rules in DPDP and many GCC jurisdictions mean the underlying compute must stay in-region.
Client-level NDAs and engagement letters. Usually stricter than statutory law, and what actually governs day-to-day handling.

Which cloud and model providers fit. For a regulated tax workload, the realistic shortlist:

Provider	Why it fits TP work	Watch-outs
AWS Bedrock (Claude, Llama, Titan)	SOC 2, ISO 27001, in-region inference (ap-south-1 Mumbai, me-central-1 UAE), contractual no-training, KMS + VPC endpoints	Model selection is region-dependent. Verify which models are live in your target region before designing.
Azure OpenAI (GPT and o-series)	SOC 2, ISO 27001, FedRAMP, in-region (Central India, South India, UAE North), enterprise commitment of no training, private endpoints	Quota and capacity allocation can be tight in newer regions. Confirm before quoting timelines.
Google Vertex AI (Gemini)	Same posture; strong on data residency and confidential computing	India and UAE regional coverage is less mature than AWS or Azure for the latest models.
Anthropic, OpenAI, Mistral direct APIs	Fastest access to the strongest models; zero-retention modes available on enterprise tiers	Less granular data residency control than the hyperscalers. Suitable for de-identified workloads or pilots.
Self-hosted open-source (Llama, Qwen, Mistral on vLLM in the client VPC)	Maximum control. No data leaves the client perimeter. Required for the most sensitive Big 4 and government-entity engagements.	Higher ops cost, model quality trails frontier by 6 to 12 months, MLOps maturity needed.

The honest default for TP work in India and the UAE is Bedrock or Azure OpenAI with a region-pinned deployment, contractual no-training, customer-managed KMS, and VPC-private inference. Self-hosting comes in for the most sensitive sub-workflows (raw ICA review, internal restructuring memos) and for clients with a hard "data never leaves our network" policy.

Architectural controls worth building in from day one.

Tenant isolation. Per-client vector indexes, per-client retrieval permissions, per-client object stores. No shared embedding space across clients. This is the single most-asked question in security reviews.
Customer-managed keys (KMS) on data at rest, TLS 1.3 in transit. Document stores, vector stores, and audit logs encrypted with client-controlled keys where the engagement requires it.
Private network paths. VPC endpoints for model inference and storage; no traffic over the public internet. Standard on Bedrock and Azure OpenAI, but it has to be configured.
PII detection and redaction at the egress boundary. Before any document leaves the customer VPC for an external API, run a deterministic PII scanner. Block or redact based on policy. Even with no-training clauses, the principle of minimum disclosure still applies.
Full audit log. Every prompt, retrieval call, tool invocation, model response, and reviewer action recorded with user identity and request ID. Replayable. Retained per the engagement letter.
RBAC and least privilege. Analysts see the engagement they're staffed on, not other clients. Senior reviewers get a wider scope. Cross-tenant access is logged and reviewed.
Eval and prompt governance. Prompt changes go through code review and an eval-set gate before they touch a production engagement. Treat prompts as code.
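The egress-boundary PII scanner from the list above can start as a set of deterministic patterns. The sketch below covers only two, email addresses and the Indian PAN format (five letters, four digits, one letter). It is deliberately not exhaustive: names, addresses, and other identifiers need additional rules or a dedicated detection service.

```python
import re

# Illustrative patterns only; a real policy would carry many more
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
}

def redact(text):
    """Replace each matched identifier with a label; return text and match counts."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, found = pattern.subn(f"[{label}]", text)
        counts[label] = found
    return text, counts

original = "Director R. Mehta (PAN ABCDE1234F) can be reached at r.mehta@example.com."
clean, counts = redact(original)
print(clean)
print(counts)
```

The counts are worth writing to the audit log alongside the request ID, so a security review can show what was stripped before each external call. Note that the director's name passes through untouched here, which is exactly why pattern matching is a floor and not the whole control.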

Verify the no-training clause is contractual, not marketing. Both AWS Bedrock and Azure OpenAI have explicit "your data is not used to train models" terms in their enterprise agreements. Get this in the MSA, not the marketing page. For direct-API providers, this is usually a paid-tier feature (Anthropic and OpenAI both call this "zero data retention" or similar). If it is not in writing, assume the opposite.

The headline summary: for TP work, the boring infrastructure decisions (region pinning, KMS, VPC private inference, tenant isolation, audit logs) are what makes the system credible to procurement. The interesting AI work only gets to ship after those are in place.

Closing

Transfer pricing documentation takes weeks and gets harder every year. The mechanical 60 to 70% (extraction, search, drafting, consistency checking) is unambiguously a job for the AI stack as it exists in 2026. The remaining 30 to 40% (judgment under jurisdictional ambiguity, audit defence, client advisory) is where TP professionals should be spending their time.

The firms that build (or partner to build) the right pipeline will deliver faster, more consistent, and more defensible compliance. The ones that bolt a chatbot onto SharePoint will not.

If you work in transfer pricing, at a Big 4, a next-tier firm, or in-house, I'm building this at Polemos Labs, and we're running free 2-week audit pilots with three TP practices in India and the UAE this quarter.

hello@polemos.in