The team

Two founders who built evaluation infrastructure inside Microsoft.

We’ve spent the last two years building the evaluation and observability infrastructure that decides whether production Microsoft Copilot products are safe to ship. Novarch turns that discipline into a runtime layer your customers can put in front of their own agents: pinned models, cited rules, structured evidence, and an operator in the loop on borderline cases.

Co-founder · Product

Sid Vemuri

Product manager on the Microsoft Fabric Consumer AI experience.

Previously the PM lead for evaluation on Microsoft Power BI Copilot. That work meant defining the metrics, building the test frameworks, and checking whether the quality signals the team relied on actually matched what users experienced. Those evaluations informed architecture decisions and led to measurable quality improvements in a product millions of analysts use.

Background: MS in Machine Learning, Georgia Institute of Technology. Lead author on a CogSci 2024 paper studying how AI models capture human-like concepts.

Now: Microsoft · Fabric Consumer AI
Prior: Microsoft · Power BI Copilot Evals (PM lead)
Education: Georgia Tech · MS, Machine Learning
Research: Lead author, CogSci 2024

Co-founder · Engineering

Sandra Ho

Applied AI engineer on Microsoft’s Security and AI Research team.

Builds the observability and evaluation harnesses for Microsoft Security Copilot, and runs evaluations against Security Copilot and frontier models across a wide range of security tasks. The practical question her work answers: under realistic adversary conditions, how does a given model actually behave inside a detection-engineering workflow?

Co-author of CTI-REALM, an open-source benchmark that evaluates AI on realistic attack emulations, real telemetry, and the full detection-engineering workflow. Microsoft’s EVP of Security, Igor Tsyganskiy, cited the benchmark publicly when announcing Microsoft’s Project Glasswing collaboration with Anthropic.

Now: Microsoft · Security and AI Research
Ships: Eval & observability for Security Copilot + frontier-model security evals
Open source: CTI-REALM · co-author
Education: Carnegie Mellon University

Why this team

Eval discipline came first and the product came after.

The load-bearing decisions in Novarch all came from the same place: years spent inside Microsoft asking whether an AI system is good enough to ship into a workflow that matters. That question is the source of one-call-per-action, a pinned model SHA, structured output with a cited rule and signals, a database-rendered audit document, and an operator on borderline cases.

Runtime enforcement is downstream of that question. Novarch is the version of the answer your customers can run on agents you didn’t train.

Talk to the founders building this.